Linear Regression with Spotify Data

Project Objectives

In this project, you will build a linear regression model to predict the popularity of a song from features such as duration, loudness, and more. You’ll clean and prepare a Spotify dataset and explore patterns in the songs. Along the way, you’ll practice checking the model’s assumptions, dealing with multicollinearity, handling outliers, and performing feature selection. Finally, you’ll build the model and interpret its results.

Learning Objectives
  • Practice cleaning and preparing Spotify data for analysis.

  • Explore the target distribution and develop summary statistics for song popularity.

  • Identify and handle multicollinearity and outliers to improve model quality.

  • Build and interpret a linear regression model, checking its assumptions.

  • Understand the assumptions of linear regression and why they are necessary.

  • Perform feature selection to find a balance between fit and number of features in the model.

  • Interpret the results of the model using its summary output.

Use 4 cores for this project.

Why Linear Regression?

Linear regression is a simple but powerful method for predicting a quantitative (continuous) target variable.

Even though newer models exist, linear regression remains popular because:

  • Results can be interpreted directly (coefficients show how predictors affect the target).

  • You can use a single predictor variable or multiple predictors.

  • The math behind the model is more transparent than that of black-box models.

  • You can model the relationship between a numeric response and one (or more) predictors.

We will use the Spotify Dataset (1986–2023) for this project. Our target variable, popularity, is what we aim to predict, and it is continuous, ranging from 0 to 100.

Dataset

  • /anvil/projects/tdm/data/spotify/linear_regression_popularity.csv

Learn more about the source: Spotify Dataset 1986–2023. Feature names and their descriptions are below for your reference:

track_id: Unique identifier for the track
track_name: Name of the track
popularity: Popularity score (0–100) based on Spotify plays
available_markets: Markets/countries where the track is available
disc_number: Disc number (for albums with multiple discs)
duration_ms: Track duration in milliseconds
explicit: Whether the track contains explicit content (True/False)
track_number: Position of the track within the album
href: Spotify API endpoint URL for the track
album_id: Unique identifier for the album
album_name: Name of the album
album_release_date: Release date of the album
album_type: Album type (album, single, compilation)
album_total_tracks: Total number of tracks in the album
artists_names: Names of the artists on the track
artists_ids: Unique identifiers of the artists
principal_artist_id: ID of the principal/primary artist
principal_artist_name: Name of the principal/primary artist
artist_genres: Genres associated with the principal artist
principal_artist_followers: Number of Spotify followers of the principal artist
acousticness: Confidence measure of whether the track is acoustic (0–1)
analysis_url: Spotify API URL for detailed track analysis
danceability: How suitable a track is for dancing (0–1)
energy: Intensity and activity measure of the track (0–1)
instrumentalness: Predicts whether a track contains no vocals (0–1)
key: Estimated key of the track (integer, e.g., 0=C, 1=C#/Db)
liveness: Presence of an audience in the recording (0–1)
loudness: Overall loudness of the track in decibels (dB)
mode: Modality of the track (1=major, 0=minor)
speechiness: Presence of spoken words (0–1)
tempo: Estimated tempo in beats per minute (BPM)
time_signature: Estimated overall time signature
valence: Musical positivity/happiness of the track (0–1)
year: Year the track was released
duration_min: Track duration in minutes

Simple Linear Regression

Let’s start with a simple example by predicting popularity from a single feature (e.g., duration of the song in minutes).

import matplotlib.pyplot as plt
import seaborn as sns

# Convert duration to minutes for the training data
# (X_train and y_train come from the train/test split performed later in this project)
duration_min_train = X_train["duration_ms"] / 60000

# See the relationship between duration and popularity with a scatter plot:

plt.figure(figsize=(8,5))
sns.scatterplot(x=duration_min_train, y=y_train, alpha=0.6)
sns.regplot(x=duration_min_train, y=y_train, scatter=False, color="red", ci=None)

plt.xlabel("Duration (minutes)")
plt.ylabel("Popularity")
plt.title("Popularity vs. Duration (train set)")
plt.show()

Predictor (X): Duration (minutes)

Response (Y): popularity

Figure 1. Scatter Plot with a Simple Linear Regression Line - Spotify Data

Broadly speaking, we would like to model the relationship between X and Y using the form:

$$ Y = f(X) + \epsilon $$

  • If we fit the data with a horizontal line (e.g., f(x) = c), the model would not capture the relationship well. This is an example of underfitting.

  • If we fit the data with a very wiggly curve that passes through nearly every point, the model becomes too complex. This is an example of overfitting.

So, our goal is to find a line that captures the main trend without falling into either extreme (underfitting or overfitting). The regression line should summarize the relationship between popularity (Y) and duration (X) well.
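To make the two extremes concrete, here is a minimal sketch (assuming the duration_min_train and y_train variables from the scatterplot snippet above) that contrasts a constant fit with a deliberately over-flexible polynomial fit:

import numpy as np

x = duration_min_train.to_numpy()
y_vals = y_train.to_numpy()

# Underfitting: a horizontal line f(x) = c, here the mean of y
constant_pred = np.full_like(y_vals, y_vals.mean(), dtype=float)

# A sensible middle ground: the degree-1 (linear) fit
linear_coefs = np.polyfit(x, y_vals, deg=1)

# Overfitting: a high-degree polynomial that chases individual points
# (numpy may warn that this fit is poorly conditioned)
wiggly_coefs = np.polyfit(x, y_vals, deg=15)

print("Linear fit slope and intercept:", linear_coefs)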

How Do We Define a Good Line?

We would like to use a linear function of X, writing our model with $\beta_1$ as the slope:

$$ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, $$ where $$ \epsilon_i \sim N(0, \sigma^2). $$

This shows:

  • $\beta_0$ = intercept,

  • $\beta_1$ = slope (how much $Y$ changes for a one-unit change in $X$),

  • $\epsilon_i$ = error term.

In simple linear regression, we model $Y$ as a linear function of a single predictor $X$.

A good line is defined as one that produces small errors or residuals, meaning the predicted values are close to the observed values. In other words, the best line is the one where as many points as possible lie close to the regression line.

We find the line that minimizes the sum of all squared distances from the points to the line. That is:

$$ \min_{\beta_0,\beta_1} \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2. $$

In practice, software like Python’s statsmodels solves this using calculus and linear algebra. For example, the code below estimates the coefficients for you with ordinary least squares (OLS), and then you can view the results using model.summary().

import statsmodels.api as sm

# df is a pandas DataFrame holding the duration_min and popularity columns
X = df[["duration_min"]]
y = df["popularity"]

X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
print(model.summary())

Residuals

Residuals are the errors between observed and predicted values:

Residual = Observed Popularity – Predicted Popularity ($y - \hat{y}$)

Figure 2. Residuals Example Spotify Data
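As a quick illustration, the residuals can be computed by hand or read directly from the fitted results object. This sketch assumes the model, X, and y variables from the statsmodels snippet above:

# Predicted popularity for each observation
y_hat = model.predict(X)

# Residuals: observed minus predicted
residuals = y - y_hat

# statsmodels also stores the residuals directly
print(model.resid.head())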

Interpretation of Coefficient (Simple Linear Regression)

The slope $\beta_1$:

  • Tells us how much our target, popularity, changes (on average) for each additional minute of track duration.

  • If $\beta_1 < 0$, longer songs tend to be less popular.

  • If $\beta_1 > 0$, longer songs tend to be more popular.
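A small sketch of reading the slope off the fitted results (assuming the model fitted with statsmodels above; the coefficient is indexed by the predictor’s column name, duration_min):

slope = model.params["duration_min"]

if slope < 0:
    print(f"Slope {slope:.3f}: longer songs tend to be less popular.")
else:
    print(f"Slope {slope:.3f}: longer songs tend to be more popular.")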

Assumptions

When building a linear regression model, it is important to check its assumptions. We will go deeper into what the assumptions are in question 5. If the assumptions are satisfied, we can trust the results of inference. If they are not, the results lose validity; for example, hypothesis tests may lead to misleading accept/reject decisions. In other words: if you give a linear regression model data that does not meet its assumptions, it will give you invalid results back.

Parts of these explanations have been adapted from Applied Statistics with R (Dalpiaz, book.stat420.org).

Other Important Terms

  • Slope tells us the direction/magnitude of the relationship (duration vs. popularity).

  • Residuals show the difference between actual popularity and predicted popularity.

  • $R^2$ tells us how much of the variation in popularity is explained by predictors.

  • p-value for the slope tests whether the relationship is statistically significant or could be due to chance.

  • We can expand the model by adding more features (loudness, danceability, energy, valence, etc.) for better predictions → this is called Multiple Linear Regression, which we will explain in question 2.

Reading and Preparing the Data

In this section, we will load the Spotify popularity dataset and prepare it for modeling. This includes reading the data into Python, removing unnecessary columns, and separating the prediction target from the feature variables.

Step 1: Read the Data

We begin by reading the dataset into Python using pandas. After loading the data, we print the first five rows to get an initial sense of the structure and variables included in the dataset. We will store the dataset in a DataFrame called spotify_popularity_data.

import pandas as pd

spotify_popularity_data = pd.read_csv("/anvil/projects/tdm/data/spotify/linear_regression_popularity.csv")

spotify_popularity_data.head()

Step 2: Remove Unnecessary Columns

The original dataset contains many columns that are not needed for our analysis, such as identifiers, text-based fields, and metadata related to albums and artists. To simplify the dataset and focus on numeric features relevant for modeling, we drop these columns.

For more information on the drop function, you can refer to the pandas documentation: pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html

drop_cols = [
    "Unnamed: 0", "Unnamed: 0.1", "track_id", "track_name", "available_markets", "href",
    "album_id", "album_name", "album_release_date", "album_type",
    "artists_names", "artists_ids", "principal_artist_id",
    "principal_artist_name", "artist_genres", "analysis_url", "duration_min"
]

# Drop unnecessary columns
spotify_popularity_data = spotify_popularity_data.drop(columns=drop_cols)

for col in spotify_popularity_data:
    print(col)

After dropping these columns, the dataset now includes the following variables:

popularity
disc_number
duration_ms
explicit
track_number
album_total_tracks
principal_artist_followers
acousticness
danceability
energy
instrumentalness
key
liveness
loudness
mode
speechiness
tempo
time_signature
valence
year

Step 3: Define the Target and Features

Next, we separate the dataset into a prediction target and a set of feature variables.

In this project, we are interested in predicting song popularity. Therefore, we use the popularity column as our target variable y, and all remaining columns as the feature matrix X.

# Target and features
y = spotify_popularity_data["popularity"].copy()
X = spotify_popularity_data.drop(columns=["popularity"]).copy()

# Quick check
print(X.shape)
print(y.shape)

This confirms that we have 1,500 observations and 19 feature variables, with one corresponding popularity value for each observation.

(1500, 19)
(1500,)

Splitting the Data and Understanding the Data

Multiple Linear Regression

Most of the time, a single feature is not enough to explain the behavior of the target variable or to predict it. Multiple Linear Regression (MLR) uses several predictors in the model.

$$ Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i, $$ where $$ \epsilon_i \sim N(0, \sigma^2). $$ In this notation we define: $$ i = 1, 2, \ldots, n. $$

Interpretation:

  • Each coefficient $\beta_i$ is the expected change in $Y$ for a 1-unit change in $X_i$, holding all the other predictors constant.

Why use it?

  • More predictors can help capture more of what explains popularity (e.g., duration alone might not be enough for accurate predictions, but combined with more variables it can help); a minimal sketch follows below.
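Fitting an MLR model in statsmodels looks just like the simple case, except the design matrix contains several columns. The sketch below assumes the spotify_popularity_data DataFrame prepared in the previous section, with duration_ms and loudness chosen purely for illustration:

import statsmodels.api as sm

X_two = sm.add_constant(spotify_popularity_data[["duration_ms", "loudness"]])
y_pop = spotify_popularity_data["popularity"]

mlr_model = sm.OLS(y_pop, X_two).fit()
print(mlr_model.params)  # one coefficient per predictor, plus the intercept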

Splitting the Data

Models are not trained on entire datasets. Instead, we partition the data into subsets that serve distinct roles in the model development process. The most common partitioning scheme involves two subsets:

  • Training data is what the model actually learns from. It’s used to find patterns and relationships between the features and the target.

  • Test data is completely held out until the very end. It gives us a final check to see how well the model is likely to perform on brand-new data it has never seen before.

Understanding the Subsets

In supervised learning, our dataset is split into predictors (X) and a target variable (y). We can further divide these into training and test subsets to properly evaluate model performance and prevent overfitting.

In practice, it’s recommended to use cross-validation, which provides a more reliable estimate of a model’s performance by repeatedly splitting the data into training and validation sets and averaging the results. This helps reduce the variability that can come from a single random split. However, for this project, we will only perform a single random train/test split using a fixed random seed.
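For reference, a minimal cross-validation sketch with scikit-learn might look like the following (this assumes the X and y objects defined in the previous section, with finite numeric values, and is not required for this project):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: fit on 4 folds, score on the held-out fold,
# and repeat so every fold is used for validation once
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("R² per fold:", cv_scores.round(3))
print("Mean R²:", cv_scores.mean().round(3))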

Now that the data has been cleaned and structured, we move on to splitting the dataset into training and testing sets and exploring the distribution of the response variable. These steps are essential for building and evaluating predictive models.

Step 1: Create a Train/Test Split

To evaluate model performance fairly, we split the dataset into training and testing subsets. We use an 80/20 split, where 80% of the data is used for training and 20% is reserved for testing. A fixed random_state is set to ensure reproducibility.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("X_train shape:", X_train.shape)
print("X_test  shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test  shape:", y_test.shape)

This confirms that the training set contains 1,200 observations and the test set contains 300 observations.

X_train shape: (1200, 19)
X_test  shape: (300, 19)
y_train shape: (1200,)
y_test  shape: (300,)

Step 2: Examine the Distribution of Popularity

Before fitting any models, it is helpful to understand the distribution of the target variable. We begin by visualizing the popularity values in the training set using a histogram.

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8,5))
sns.histplot(y_train, bins=30, kde=True)
plt.xlabel("Popularity")
plt.ylabel("Count")
plt.title("Distribution of Track Popularity (Training Set)")
plt.show()
Figure 3. Track Popularity

After examining the histogram, we observe that the popularity distribution appears roughly symmetric and approximately normal. There is a slight right tail near the upper bound (around 100), but no strong outliers. Most popularity values fall between roughly 45 and 100, with the mean near the upper-middle of this range.
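To back up the visual impression with numbers, a quick numeric summary of the training-set popularity values can be printed as follows:

# Count, mean, standard deviation, quartiles, and extremes of popularity
print(y_train.describe().round(2))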

Step 3: Explore the Relationship Between Popularity and Duration

Next, we explore the relationship between song duration and popularity. Since duration is measured in milliseconds, we first convert it to minutes for easier interpretation. We then create a scatterplot using the training data and overlay a fitted linear regression line.

# Convert duration to minutes for training data
duration_min_train = X_train["duration_ms"] / 60000

plt.figure(figsize=(8,5))
sns.scatterplot(x=duration_min_train, y=y_train, alpha=0.6)
sns.regplot(x=duration_min_train, y=y_train, scatter=False, ci=None)

plt.xlabel("Duration (minutes)")
plt.ylabel("Popularity")
plt.title("Popularity vs. Duration (Training Set)")
plt.show()
Figure 4. Scatter Plot with a Simple Linear Regression Line - Spotify Data

From the scatterplot, we observe a weak relationship between track duration and popularity, with popularity varying widely across different song lengths. The regression line represents the best-fitting linear trend by minimizing the sum of squared residuals, providing a simple summary of the overall direction of the relationship despite substantial variability in the data.

Checking for Multicollinearity and Influential Points

$R^2$ (R-squared)

  • $R^2$ measures how well our regression model explains the variation in $Y$.

  • It is the proportion of variability in $Y$ that can be explained by the predictors $X_1, X_2, \dots, X_p$.

  • The value is always between $0$ and $1$:

    • $R^2 = 0$ → the model explains none of the variation in $Y$.

    • $R^2 = 1$ → the model perfectly explains all the variation in $Y$.

$R^2$ can also be expressed using sums of squares from the fitted model versus the mean-only model:

$$ R^2 = \frac{SS(\text{mean}) - SS(\text{fit})}{SS(\text{mean})} $$
  • $SS(\text{fit}) =$ sum of squared errors from the regression model,

  • $SS(\text{mean}) =$ sum of squared errors from the model that only uses the mean of $Y$ (no predictors).
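A short sketch of this calculation, assuming a fitted statsmodels model and the matching X and y (for example, from the simple linear regression snippet earlier), shows that the by-hand value matches the rsquared attribute that statsmodels reports:

import numpy as np

ss_fit = np.sum((y - model.predict(X)) ** 2)   # squared errors from the regression model
ss_mean = np.sum((y - y.mean()) ** 2)          # squared errors from the mean-only model

r_squared = (ss_mean - ss_fit) / ss_mean
print(f"R-squared computed by hand: {r_squared:.3f}")
print(f"R-squared from statsmodels: {model.rsquared:.3f}")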

Checking Multicollinearity with VIF

Before fitting our model, we use Variance Inflation Factor (VIF) to check for multicollinearity. Multicollinearity occurs when predictors are highly correlated with each other, which can inflate standard errors, make coefficient estimates unstable, and reduce the reliability of our interpretations.

$$ \text{VIF}(X_i) = \frac{1}{1 - R_i^2}, $$

where $R_i^2$ is the $R^2$ from a regression of $X_i$ on all of the other predictors. You can see that an $R_i^2$ close to one signals high correlation (collinearity), and so the VIF will be large.

A VIF above 10 suggests the variable is highly collinear and may need to be removed (this is a common threshold).
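The definition can be reproduced directly for a single predictor: regress that column on the remaining ones and plug the resulting $R_i^2$ into the formula. Below is a minimal sketch assuming the X_train feature matrix from the previous section (with finite numeric values), with loudness chosen purely for illustration:

import statsmodels.api as sm

target_col = "loudness"
others = X_train.drop(columns=[target_col]).astype(float)

# R-squared from regressing the chosen predictor on all remaining predictors
aux_model = sm.OLS(X_train[target_col].astype(float), sm.add_constant(others)).fit()

vif = 1.0 / (1.0 - aux_model.rsquared)
print(f"VIF for {target_col}: {vif:.2f}")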

Influential Observations and Cook’s Distance

Some outliers only change the regression line a small amount, while others have a large effect. Observations that fall into the second category are called influential. A common measure of influence is Cook’s Distance, which is defined as:

$$ D_i = \frac{r_i^2}{p} \cdot \frac{h_i}{1 - h_i}, $$

where $r_i$ is the $i$-th standardized residual, $h_i$ is the leverage of observation $i$, and $p$ is the number of parameters in the model.

A Cook’s Distance is often considered large if:

$$ D_i > \frac{4}{n}. $$

An observation with a large Cook’s Distance is called influential.

How we use it:

  • 4/n is a simple rule of thumb for flagging unusually influential points with Cook’s Distance.

  • n = number of rows in your training data.

  • As n gets larger, 4/n gets smaller, so the bar for “unusually influential” gets stricter.

Before fitting a linear regression model, it is important to ensure that the feature set does not suffer from severe multicollinearity and that influential outliers are appropriately handled. In this section, we use Variance Inflation Factor (VIF) to diagnose multicollinearity and Cook’s Distance to identify influential observations in the training data.

Step 1: Evaluate Multicollinearity Using VIF

We begin by assessing multicollinearity among the predictor variables using the Variance Inflation Factor (VIF). VIF measures how strongly each feature is linearly related to the other features in the model. Large VIF values indicate high multicollinearity, which can inflate standard errors and destabilize coefficient estimates.

Before computing VIF, we ensure that all variables are numeric by converting boolean columns to integers. We then use a provided iterative function that repeatedly removes the variable with the highest VIF until all remaining variables have VIF values below a specified threshold. In this project, we use a commonly accepted threshold of 10.

import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Convert booleans to ints
bool_cols = X_train.select_dtypes(include=["bool"]).columns
if len(bool_cols):
    X_train[bool_cols] = X_train[bool_cols].astype(int)


def calculate_vif_iterative(X, thresh=10.0):
    X_ = X.astype(float).copy()
    while True:
        vif_df = pd.DataFrame({
            "variable": X_.columns,
            "VIF": [
                variance_inflation_factor(X_.values, i)
                for i in range(X_.shape[1])
            ]
        }).sort_values("VIF", ascending=False).reset_index(drop=True)

        max_vif = vif_df["VIF"].iloc[0]
        worst = vif_df["variable"].iloc[0]

        if (max_vif <= thresh) or (X_.shape[1] <= 1):
            return X_, vif_df.sort_values("VIF")

        print(f"Dropping '{worst}' with VIF={max_vif:.2f}")
        X_ = X_.drop(columns=[worst])

Step 2: Filter Features Based on VIF

Next, we run the iterative VIF function using a threshold of 10. The function returns two objects: a filtered version of X_train containing only variables with acceptable VIF values, and a summary table listing the remaining variables and their corresponding VIFs.

# Run iterative VIF filtering
result_vif = calculate_vif_iterative(X_train, thresh=10.0)

# Split into the filtered dataset and the VIF summary
X_train = result_vif[0]
vif_summary = result_vif[1]

During this process, several highly collinear variables are removed:

Dropping 'year' with VIF=320.29
Dropping 'time_signature' with VIF=92.09
Dropping 'disc_number' with VIF=46.33
Dropping 'energy' with VIF=19.81
Dropping 'danceability' with VIF=19.10
Dropping 'tempo' with VIF=12.17

The remaining variables and their VIF values are shown below:

Variable VIF

instrumentalness

1.243990

principal_artist_followers

1.420775

explicit

1.696639

acousticness

2.259710

speechiness

2.419409

liveness

2.551417

track_number

2.854535

mode

2.959880

key

3.133896

valence

5.012910

album_total_tracks

5.030855

loudness

6.561215

duration_ms

9.724675

All remaining features now have VIF values at or below the threshold, indicating acceptable levels of multicollinearity.

Step 3: Identify Influential Observations Using Cook’s Distance

After addressing multicollinearity, we examine the training data for influential observations using Cook’s Distance. Cook’s Distance measures how much each observation influences the fitted regression coefficients. Observations with unusually large Cook’s Distance values can disproportionately affect the model’s results.

We fit a linear regression model on the cleaned training data and compute Cook’s Distance for each observation. Any observation with Cook’s Distance greater than $4/n$, where $n$ is the number of observations, is flagged as potentially influential.

import statsmodels.api as sm

# Align and clean training data
X_train_cook = X_train.loc[X_train.index.intersection(y_train.index)]
y_train_cook = y_train.loc[X_train_cook.index]

# Keep only rows without missing or infinite values
mask = np.isfinite(X_train_cook).all(1) & np.isfinite(y_train_cook.to_numpy())
X_train_cook, y_train_cook = X_train_cook.loc[mask], y_train_cook.loc[mask]

# Fit model on the cleaned data
ols_model_cook = sm.OLS(
    y_train_cook,
    sm.add_constant(X_train_cook, has_constant="add")
).fit()

# Cook's Distance values
cooks_distance = ols_model_cook.get_influence().cooks_distance[0]
cooks_threshold = 4 / len(X_train_cook)

# Identify influential observations
cooks_outliers = X_train_cook.index[cooks_distance > cooks_threshold]

print(f"Flagged {len(cooks_outliers)} outliers (Cook's D > 4/n).")

# Remove influential observations
X_train = X_train.drop(index=cooks_outliers)
y_train = y_train.drop(index=cooks_outliers)
Flagged 61 outliers (Cook's D > 4/n).

Removing features with high multicollinearity is important because strongly correlated predictors can inflate standard errors and make it difficult to interpret which variables are truly associated with the response. Similarly, outliers identified by Cook’s Distance can exert an outsized influence on regression coefficients and distort the model’s fit. By addressing both multicollinearity and influential observations, we build a linear regression model that is more stable, reliable, and interpretable.

Feature Selection and Model Summary

Feature Selection with AIC and Forward Selection

Another important topic when building a model is feature selection. To reduce the number of features, we can use forward selection guided by the Akaike Information Criterion (AIC):

$AIC = 2k - 2\log(L)$,

where

  • $k$ is the number of parameters in the model,

  • $L$ is the likelihood of the model.

The model with the lowest AIC strikes a balance between fit and the number of parameters (features) used. By picking the model with the smallest AIC, we are choosing a model with a low $k$ (fewer features) while still ensuring it has a high log-likelihood $\log(L)$.

  • Forward selection begins with no predictors and adds them one at a time, at each step choosing the variable that leads to the greatest reduction in AIC.

  • Backward elimination begins with all predictors and removes them one at a time, dropping the variable whose removal leads to the greatest reduction in AIC.

  • Stepwise selection is a hybrid approach: it adds variables like forward selection but also considers removing variables at each step (backward pruning) if doing so reduces AIC further.

Feature selection is a very popular and important topic in machine learning. We recommend exploring additional resources to deepen your understanding. One excellent resource is An Introduction to Statistical Learning with Applications in Python (Springer textbook), which is available for free online. The section on Linear Model Selection and Regularization provides a detailed discussion of this topic.

AIC is one of several possible criteria for feature selection. While we are using AIC in this project, you could also use:

  • $R^2$: Choose features that increase the model’s explained variance.

  • p-values: Add features that are statistically significant.

  • BIC (Bayesian Information Criterion): Similar to AIC but with a stronger penalty for complexity.

Each criterion has trade-offs. AIC is popular because it balances model fit and complexity, making it a solid choice when comparing linear regression models. For consistency, we’ll use AIC throughout this project.
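Both criteria are available directly on a fitted statsmodels results object, so comparing them costs nothing extra. A quick sketch, assuming a fitted results object such as model from the earlier OLS snippet:

# AIC and BIC are stored on the fitted results object
print(f"AIC: {model.aic:.2f}")
print(f"BIC: {model.bic:.2f}")  # BIC penalizes model complexity more heavily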

Feature Selection and Model Interpretation

With a cleaned feature set and influential observations removed, we are now ready to build a linear regression model. Before fitting a final model, we perform feature selection to identify a subset of predictors that best explain variation in song popularity while balancing model complexity and goodness of fit.

Step 1: Perform Feature Selection Using Stepwise AIC

To select an appropriate set of predictors, we use stepwise selection with the Akaike Information Criterion (AIC). AIC rewards models that fit the data well but penalizes unnecessary complexity, helping to prevent overfitting.

The provided function implements a hybrid stepwise approach. At each iteration, it considers adding variables (forward selection) while also checking whether any previously added variables should be removed (backward pruning) if doing so further reduces AIC.

import numpy as np
import pandas as pd
import statsmodels.api as sm

def stepwise_aic(X_train, y_train, max_vars=None, verbose=True):
    X = X_train.select_dtypes(include=[np.number]).astype(float).copy()
    y = pd.Series(y_train).astype(float)
    mask = np.isfinite(X).all(1) & np.isfinite(y)
    X, y = X.loc[mask], y.loc[mask]

    remaining, selected = set(X.columns), []
    current_aic = np.inf

    while remaining and (max_vars is None or len(selected) < max_vars):
        aics = [(sm.OLS(y, sm.add_constant(X[selected + [c]], has_constant='add')).fit().aic, c)
                for c in remaining]
        best_aic, best_var = min(aics, key=lambda t: t[0])

        if best_aic + 1e-6 < current_aic:
            selected.append(best_var)
            remaining.remove(best_var)
            current_aic = best_aic
            if verbose:
                print(f"+ {best_var} (AIC {best_aic:.2f})")

            # Backward step
            improved = True
            while improved and len(selected) > 1:
                backs = [(sm.OLS(y, sm.add_constant(X[[v for v in selected if v != d]], has_constant='add')).fit().aic, d)
                         for d in selected]
                back_aic, drop_var = min(backs, key=lambda t: t[0])
                if back_aic + 1e-6 < current_aic:
                    selected.remove(drop_var)
                    remaining.add(drop_var)
                    current_aic = back_aic
                    if verbose:
                        print(f"- {drop_var} (AIC {back_aic:.2f})")
                else:
                    improved = False
        else:
            break

    model = sm.OLS(y, sm.add_constant(X[selected], has_constant='add')).fit()
    return selected, model

Stepwise selection is useful because it helps identify a parsimonious model that balances interpretability and predictive performance. By reducing redundancy and excluding weak predictors, feature selection often leads to more stable coefficient estimates and better generalization.

Step 2: Run Stepwise Selection on the Training Data

Next, we apply the stepwise AIC function to the training data. The function returns both the selected feature names and a fitted OLS model using those features.

results_feature_selection = stepwise_aic(X_train, y_train, max_vars=None, verbose=True)

selected_cols = results_feature_selection[0]   # list of selected features
model = results_feature_selection[1]           # fitted OLS model

print(selected_cols)

During the selection process, features are added sequentially as long as they improve the AIC:

+ principal_artist_followers (AIC 8054.83)
+ duration_ms (AIC 8001.12)
+ loudness (AIC 7961.86)
+ album_total_tracks (AIC 7932.94)
+ acousticness (AIC 7909.56)
+ explicit (AIC 7887.99)
+ mode (AIC 7873.37)
+ valence (AIC 7861.57)
+ track_number (AIC 7858.87)
+ key (AIC 7857.61)
+ speechiness (AIC 7857.16)

The final selected model includes variables related to artist popularity, audio characteristics, and track structure. This suggests that both musical features and artist-level signals contribute to explaining song popularity.

Step 3: Interpret the Final Regression Model

We now examine the summary of the fitted regression model to interpret coefficient estimates, statistical significance, and overall model performance.

print(model.summary())

The model explains approximately 29% of the variability in song popularity ($R^2 = 0.289$), which is reasonable given the complexity of popularity as an outcome. Several predictors, such as principal_artist_followers, loudness, acousticness, and explicit, have statistically significant positive relationships with popularity. Other variables, including album_total_tracks, mode, and track_number, show significant negative associations.

Not all variables are statistically significant at conventional levels; for example, key and speechiness have larger p-values, suggesting weaker evidence of association with popularity. Overall, the results indicate that while the model captures meaningful trends, there remains substantial unexplained variability, leaving room for improvement through additional features or alternative modeling approaches.

Checking Assumptions of Linear Regression Model

Linear Regression Assumptions

Often we talk about the assumptions of this model, which can be remembered by the acronym LINE.

  • Linear. The relationship between $Y$ and the predictors is linear.

  • Independent. The errors $\epsilon_i$ are independent.

  • Normal. The errors $\epsilon_i$ are normally distributed (the “error” around the line follows a normal distribution).

  • Equal Variance. At each value of $x$, the variance of $Y$ is the same, $\sigma^2$.

If you are a data science or statistics major, note that these assumptions are frequently discussed in coursework and often asked about during interviews for data science roles. We encourage you not only to memorize these assumptions but also to develop a clear understanding of their meaning and implications.

Normality Assumption - Histogram

We have several ways to assess the normality assumption. A simple check is a histogram of the residuals: if it looks roughly bell-shaped and symmetric, that supports treating the errors as approximately normal.

Figure 5. Histogram of Residuals - Normality Assumption

Normality Assumption - Q-Q plot

Another visual method for assessing the normality of errors, which is more powerful than a histogram, is a normal quantile-quantile plot, or Q-Q plot for short.

Essentially, if the points in a Q–Q plot don’t lie close to the straight line, that suggests the data are not normal. The plot puts the ordered sample values (sample quantiles) on the y-axis against the quantiles you’d expect under a normal distribution (theoretical quantiles) on the x-axis. Implementation details vary by software, but the idea is the same.

Figure 6. Q-Q Plot - Normality Assumption

Independence Assumption - Durbin–Watson Independence Test

  • The question is “Are the residuals independent?”

  • If one residual is high and the next is also high, there’s positive autocorrelation. If they tend to alternate up/down, there’s negative autocorrelation. If there’s no pattern in residuals, they’re independent.

Rule of thumb:

  • ~2 → residuals are approximately independent

  • < 2 → positive autocorrelation (closer to 0 is stronger)

  • > 2 → negative autocorrelation (closer to 4 is stronger)

Equal Variance Assumption - Residuals versus Fitted Values Plot

A Fitted vs. Residuals plot is one of the most useful diagnostics for checking the linearity and equal variance (homoscedasticity) assumptions.

What to look for:

  • Zero-centered residuals. At any fitted value, the average residual should be about 0; this supports the linearity assumption. We can add a horizontal reference line at y = 0 to make this clear.

  • Even spread (constant variance). Across all fitted values, the spread of residuals should be roughly the same.

Figure 7. Residuals vs. Fitted Values - Spotify Data

After fitting the linear regression model, it is important to evaluate whether the key assumptions of linear regression are reasonably satisfied. In this section, we examine the normality, independence, and homoscedasticity assumptions by analyzing the model residuals.

Step 1: Check Normality Using a Histogram of Residuals

We begin by examining the normality of the residuals using a histogram of standardized residuals overlaid with a standard normal density curve. Standardizing the residuals allows us to directly compare their distribution to a normal distribution with mean 0 and variance 1.

import matplotlib.pyplot as plt
import statsmodels.api as sm
import scipy.stats as stats
import numpy as np

# Residuals
resid = model.resid
z = (resid - np.mean(resid)) / np.std(resid, ddof=1)

# Histogram with normal curve
plt.hist(z, bins=30, density=True, alpha=0.6, edgecolor="black")
x = np.linspace(z.min(), z.max(), 100)
plt.plot(x, stats.norm.pdf(x, 0, 1), "r", linewidth=2)
plt.title("Histogram of Standardized Residuals with Normal Curve")
plt.xlabel("Standardized Residuals")
plt.ylabel("Density")
plt.show()
Figure 8. Histogram of Standardized Residuals with Normal Curve

The histogram appears approximately symmetric and closely follows the overlaid normal curve, suggesting that the residuals are reasonably normally distributed. While minor deviations may exist in the tails, there is no strong evidence of severe non-normality. Overall, the model appears to satisfy the normality assumption.

Step 2: Check Normality Using a Q–Q Plot

Next, we assess residual normality using a quantile–quantile (Q–Q) plot. In a Q–Q plot, normally distributed residuals should fall approximately along the 45-degree reference line.

# Q–Q Plot
sm.qqplot(z, line="45", fit=True)
plt.title("Q–Q Plot of Residuals")
plt.show()
Figure 9. Q-Q Plot of Residuals

The residuals closely follow the reference line, with only slight deviations at the extremes. This visual evidence further supports the conclusion that the normality assumption is reasonably met for this model.

Step 3: Test Independence Using the Durbin–Watson Statistic

We now test the independence assumption using the Durbin–Watson statistic, which detects autocorrelation in the residuals. Values close to 2 indicate little to no autocorrelation, while values far from 2 suggest potential dependence among residuals.

from statsmodels.stats.stattools import durbin_watson

dw_stat = durbin_watson(resid)
print("[Independence: Durbin–Watson]")
print(f"Statistic={dw_stat:.4f}")
print("H0: Residuals are independent (no autocorrelation).")

if 1.5 < dw_stat < 2.5:
    print("-> Pass (approx. independent)")
else:
    print("-> FAIL: possible autocorrelation")
print()
[Independence: Durbin–Watson]
Statistic=1.9603
H0: Residuals are independent (no autocorrelation).
-> Pass (approx. independent)

The Durbin–Watson statistic is very close to 2, indicating that the residuals are approximately independent. Independence is important because correlated residuals can lead to biased standard errors and invalid inference. Based on this test, the model satisfies the independence assumption.

Step 4: Check Linearity and Homoscedasticity Using Residuals vs. Fitted Values

Finally, we examine a residuals versus fitted values plot to assess the assumptions of linearity and constant variance (homoscedasticity). Ideally, residuals should be randomly scattered around zero with no clear patterns or changes in spread.

plt.scatter(model.fittedvalues, resid, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.title("Residuals vs. Fitted Values")
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.show()
Figure 10. Residuals vs. Fitted Values

The residuals appear to be evenly scattered around the horizontal reference line with no obvious patterns or funnel shapes. This suggests that the assumptions of linearity and homoscedasticity are reasonably satisfied, supporting the overall validity of the fitted linear regression model.
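As an optional complement to the visual check (not required for this project), a formal test such as the Breusch–Pagan test can be run on the same residuals. This sketch assumes the model and resid objects from the steps above:

from statsmodels.stats.diagnostic import het_breuschpagan

# H0: residuals have constant variance (homoscedasticity)
bp_lm, bp_lm_pval, bp_f, bp_f_pval = het_breuschpagan(resid, model.model.exog)
print(f"Breusch–Pagan p-value: {bp_lm_pval:.4f}")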

Model Evaluation and Predicting Song Popularity

In the final stage of the analysis, we evaluate the model’s performance on unseen data and use the fitted regression model to interpret feature effects and make predictions for a new song. These steps help assess how well the model generalizes and how it can be used in practice.

Step 1: Evaluate Model Performance on the Test Set

To assess how well the model generalizes to new data, we compute the coefficient of determination ($R^2$) on the test set. This measures the proportion of variability in song popularity that is explained by the model when applied to unseen observations.

from sklearn.metrics import r2_score
import statsmodels.api as sm

# Ensure test data uses the selected features
X_test = X_test[selected_cols].copy()
mask = np.isfinite(X_test).all(1)
X_test = X_test.loc[mask]
y_test = y_test.loc[X_test.index]

# Predict using the trained OLS model
y_pred = model.predict(sm.add_constant(X_test, has_constant='add'))

# R² on the test set
r2 = r2_score(y_test, y_pred)
print(f"Test R²: {r2:.3f}")
Test R²: 0.250

The test set $R^2$ of approximately 0.25 indicates that the model explains about 25% of the variability in song popularity on unseen data. While there is room for improvement, the test $R^2$ is reasonably close to the training $R^2$, suggesting that the model does not suffer from severe overfitting and generalizes moderately well.
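In addition to $R^2$, error metrics expressed in popularity points can make the test performance easier to communicate. A short sketch, assuming the y_test and y_pred objects from the snippet above:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Root mean squared error and mean absolute error, both on the popularity scale
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
print(f"Test RMSE: {rmse:.2f}")
print(f"Test MAE:  {mae:.2f}")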

Step 2: Interpret Feature Effects and Statistical Significance

Next, we examine a summary table that displays each selected variable’s coefficient, direction of effect, p-value, and statistical significance. This allows us to interpret which features are most strongly associated with popularity and whether their effects are statistically meaningful.

import pandas as pd

alpha = 0.05  # significance threshold

# Extract model components
coef = model.params.rename("coef")
pval = model.pvalues.rename("p_value")
ci = model.conf_int()

# Build summary table
coef_tbl = (
    pd.concat([coef, pval, ci], axis=1)
      .drop(index="const", errors="ignore")
      .assign(
          effect=lambda d: d["coef"].apply(lambda x: "Positive" if x > 0 else "Negative"),
          significant=lambda d: d["p_value"] < alpha
      )
      [["coef", "effect", "p_value", "significant"]]
      .sort_values("p_value")
)

coef_tbl_rounded = coef_tbl.round({"coef": 4, "p_value": 4})
print(coef_tbl_rounded)
 coef    effect  p_value  significant
principal_artist_followers  0.0000  Positive   0.0000         True
loudness                    0.6164  Positive   0.0000         True
duration_ms                -0.0000  Negative   0.0000         True
acousticness                5.4814  Positive   0.0000         True
album_total_tracks         -0.1788  Negative   0.0000         True
explicit                    2.7251  Positive   0.0000         True
mode                       -1.9647  Negative   0.0001         True
valence                    -3.5808  Negative   0.0002         True
track_number               -0.1283  Negative   0.0300         True
key                         0.1134  Positive   0.0777        False
speechiness                -4.6218  Negative   0.1195        False

Several variables, such as principal_artist_followers, loudness, acousticness, and explicit, show strong and statistically significant positive relationships with popularity. Other features, including album_total_tracks, valence, and track_number, are associated with lower popularity. Not all variables are statistically significant; for example, key and speechiness have p-values above 0.05, suggesting weaker evidence of association.

Step 3: Predict Popularity for a New Song

Finally, we use the trained regression model to predict the popularity of a new song based on its known characteristics. This demonstrates how the model can be applied to real-world scenarios.

The new song has the following features:

  • Principal artist followers: 5,000,000

  • Duration: 210,000 ms (approximately 3.5 minutes)

  • Loudness: −6.0

  • Album total tracks: 10

  • Acousticness: 0.20

  • Explicit: 0 (not explicit)

  • Mode: 1 (major key)

  • Valence: 0.40

  • Track number: 3

  • Key: 5 (F major)

  • Speechiness: 0.05

import pandas as pd
import statsmodels.api as sm

new_song = pd.DataFrame([{
    "principal_artist_followers": 5_000_000,
    "duration_ms": 210_000,
    "loudness": -6.0,
    "album_total_tracks": 10,
    "acousticness": 0.20,
    "explicit": 0,
    "mode": 1,
    "valence": 0.40,
    "track_number": 3,
    "key": 5,
    "speechiness": 0.05
}])

# Align columns with the trained model
new_song = new_song.reindex(columns=selected_cols)
new_song = new_song.fillna(X_train[selected_cols].median(numeric_only=True))

pred_pop = model.predict(sm.add_constant(new_song, has_constant="add"))[0]
print(f"Predicted popularity: {pred_pop:.1f}")
Predicted popularity: 68.9

Based on the fitted model, the predicted popularity score for this new song is approximately 69. This value reflects the combined influence of artist popularity, audio features, and track characteristics captured by the regression model.
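A point prediction can also be accompanied by an interval that reflects the model’s uncertainty. Here is a sketch using statsmodels’ get_prediction, assuming the model and new_song objects from above:

import statsmodels.api as sm

pred = model.get_prediction(sm.add_constant(new_song, has_constant="add"))

# 95% interval for a single new observation (the obs_ci columns)
interval = pred.summary_frame(alpha=0.05)
print(interval[["mean", "obs_ci_lower", "obs_ci_upper"]])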

References

Some explanations in this project have been adapted from other sources in statistics and machine learning, as listed below.

  • James, Gareth; Witten, Daniela; Hastie, Trevor; Tibshirani, Robert; Taylor, Jonathan. An Introduction to Statistical Learning: with Applications in Python. Springer Texts in Statistics, 2023.

  • Dalpiaz, David. Applied Statistics with R. Available at: book.stat420.org/

Questions

If you have any questions about this guide or how to implement this model in your own work and project at the Data Mine, please submit a ticket by emailing [email protected]. Purdue students can submit a ticket here as well.