401 Project 09 - Regression: Basics

Project Objectives

In this project, we will learn about the basics of regression. We will explore common regression techniques and how to interpret their results. We will also investigate the strengths and weaknesses of different regression techniques and how to choose the right one for a given problem.

Learning Objectives

Basics of Regression
Regression specific terminology and metrics
Popular regression techniques

Supplemental Reading and Resources

Dataset

/anvil/projects/tdm/data/boston_housing/boston.csv

Questions

Question 1 (2 points)

The most common regression technique is linear regression. Have you ever generated a trendline in Excel? If so, that is a form of linear regression! There are multiple forms of linear regression, but the most common is called simple linear regression. Other forms include multiple linear regression, and polynomial regression.

It may seem counter intuitive that polynomial regression is considered a form of linear regression. When the regression model is trained for some polynomial degree, say y= ax^2 + bx + c, the model does not know that x^2 is the square of x. It instead treats x^2 as a separate variable (z = x^2), ie. y = az + bx + c, thus a linear equation. Colinearity between z and x are an issue, which is why regularization techniques, such as lasso and ridge regression, should be used to help prevent overfitting.

Each of these forms is slightly different, but at their core, they all attempt to model the relationship between one or more independent variables and one or more dependent variable.

Model	Independent	Dependent	Description
Simple Linear Regression	1 variable	1 variable	Models the relationship between one independent variable and one dependent variable. Think of y = ax + b, where y is the dependent variable, x is the independent variable, and a and b are coefficients.
Multiple Linear Regression	2+ variables	1 variable	Models the relationship between two or more independent variables and one dependent variable. Think z = ax + by + c, where z is the dependent variable, x and y are independent variables, and a, b, and c are coefficients.
Polynomial Regression	1 variable	1 variable	Models the relationship between one or more independent variables and one dependent variable using a polynomial function. Think y = ax^2 + bx + c, where y is the dependent variable, x is the independent variable, and a, b, and c are coefficients.
Multiple Polynomial Regression	2+ variables	1 variable	Models the relationship between two or more independent variables and one dependent variable using a polynomial function. Think z = ax^2 + by^2 + cx + dy + e, where z is the dependent variable, x and y are independent variables, and a, b, c, d, and e are coefficients.
Multivariate Linear Regression	2+ variables	2+ variables	Models the relationship between two or more independent variables and two or more dependent variables. Can be linear or polynomial. Think Y = AX + B. Where Y and X are matrices, and A and B are matrices of coefficients. This allows predictions of multiple dependent variables at once.
Multivariate Polynomial Regression	2+ variables	2+ variables	Same as Multivariate Linear Regression, but for polynomials. Also generalized to Y = AX + B, however X must have all independent variables and their polynomial terms, and A and B must be much larger matrices to store these coefficients.

Model

Independent

Dependent

Description

Simple Linear Regression

1 variable

Models the relationship between one independent variable and one dependent variable. Think of y = ax + b, where y is the dependent variable, x is the independent variable, and a and b are coefficients.

Multiple Linear Regression

2+ variables

1 variable

Models the relationship between two or more independent variables and one dependent variable. Think z = ax + by + c, where z is the dependent variable, x and y are independent variables, and a, b, and c are coefficients.

Polynomial Regression

1 variable

Models the relationship between one or more independent variables and one dependent variable using a polynomial function. Think y = ax^2 + bx + c, where y is the dependent variable, x is the independent variable, and a, b, and c are coefficients.

Multiple Polynomial Regression

2+ variables

1 variable

Models the relationship between two or more independent variables and one dependent variable using a polynomial function. Think z = ax^2 + by^2 + cx + dy + e, where z is the dependent variable, x and y are independent variables, and a, b, c, d, and e are coefficients.

Multivariate Linear Regression

2+ variables

Models the relationship between two or more independent variables and two or more dependent variables. Can be linear or polynomial. Think Y = AX + B. Where Y and X are matrices, and A and B are matrices of coefficients. This allows predictions of multiple dependent variables at once.

Multivariate Polynomial Regression

2+ variables

Same as Multivariate Linear Regression, but for polynomials. Also generalized to Y = AX + B, however X must have all independent variables and their polynomial terms, and A and B must be much larger matrices to store these coefficients.

For this question, please run the following code the load our very simple dataset.

import numpy as np
import matplotlib.pyplot as plt

# Data
x = np.array([1, 2, 3, 4, 5])
y = np.array([5.5, 7, 9.5, 12.5, 13])

Using the data above, find the best fit line for the data using simple linear regression. Store the slope and y-intercept in the variables a and b respectively.

We can find lines of best fit using the np.polyfit function. Although this function is built for polynomial regression, it can be used for simple linear regression by setting the degree parameter to 1. This function returns an array of coefficients, ordered from highest degree to lowest. For simple linear regression (y = mx + b), the first coefficient is the slope (m) and the second is the y-intercept (b).

# Find the best fit line

# YOUR CODE HERE
a, b = np.polyfit(x, y, 1)

# Plot the data and the best fit line

print(f'Coefficients Found: {a}, {b}')
y_pred = a * x + b

plt.scatter(x, y)
plt.plot(x, y_pred, color='red')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

Deliverables

Coefficients found by np.polyfit with degree 1
Plot of the data and the best fit line

Question 2 (2 points)

After finding the best fit line, we should have two variables stored: y, and y_pred. Now that we have these, we can briefly discuss evaluation metrics for regression models. There are many, many metrics that can be used to evaluate regression models. We will discuss a few of the most common ones here, but we implore you to do further research on your own to learn about more metrics.

Metric	Description	Formula	Range
Mean Squared Error (MSE)	Average of the squared differences between the predicted and actual values.	$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value.	$[0, \infty)$
Root Mean Squared Error (RMSE)	Square root of the MSE.	$RMSE = \sqrt{MSE}$	$[0, \infty)$
Mean Absolute Error (MAE)	Average of the absolute differences between the predicted and actual values.	$MAE = \frac{1}{n} \sum_{i=1}^{n} \mid y_i - \hat{y}_i \mid $, where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value.	$[0, \infty)$
R-Squared	Explains the variance of dependent variables that can be explained by the independent variables.	$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$, where $SS_{res}$ is the sum of squared residuals (actual - prediction) and $SS_{tot}$ is the total sum of squares (actual - mean of actual).	$[0, 1]$

Metric

Description

Formula

Range

Mean Squared Error (MSE)

Average of the squared differences between the predicted and actual values.

$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value.

$[0, \infty)$

Root Mean Squared Error (RMSE)

Square root of the MSE.

$RMSE = \sqrt{MSE}$

$[0, \infty)$

Mean Absolute Error (MAE)

Average of the absolute differences between the predicted and actual values.

$MAE = \frac{1}{n} \sum_{i=1}^{n} \mid y_i - \hat{y}_i \mid $, where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value.

$[0, \infty)$

R-Squared

Explains the variance of dependent variables that can be explained by the independent variables.

$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$, where $SS_{res}$ is the sum of squared residuals (actual - prediction) and $SS_{tot}$ is the total sum of squares (actual - mean of actual).

$[0, 1]$

Using the variables y and y_pred from the previous question, calculate the following metrics: MSE, RMSE, MAE, and R-Squared. Write functions for get_mse, get_rmse, get_mae, and get_r_squared, that each take in actual and predicted values. Call these functions on y and y_pred and store the results in the variables mse, rmse, mae, and r_squared respectively.

# Calculate the evaluation metrics
# YOUR CODE HERE
def get_mse(y, y_pred):
    pass

def get_rmse(y, y_pred):
    pass

def get_mae(y, y_pred):
    pass

def get_r_squared(y, y_pred):
    pass

mse = get_mse(y, y_pred)
rmse = get_rmse(y, y_pred)
mae = get_mae(y, y_pred)
r_squared = get_r_squared(y, y_pred)


print(f'MSE: {mse}')
print(f'RMSE: {rmse}')
print(f'MAE: {mae}')
print(f'R-Squared: {r_squared}')

Deliverables

Output of the evaluation metrics

Question 3 (2 points)

Now that we understand some evaluation metrics, let’s see how polynomial regression compares to simple linear regression on our same dataset. We will explore a range of polynomial degrees and see how the evaluation metrics change. Firstly, let’s write a function that will take in an x value and an array of coefficients and return the predicted y value using a polynomial function.

def poly_predict(x, coeffs):
    # y_intercept is the last element in the array
    y_intercept = None # your code here

    # predicted value can start as the y-intercept
    predicted_value = y_intercept

    # The rest of the elements are the coefficients, so we can determine the degree of the polynomial
    coeffs = coeffs[:-1]
    current_degree = None # your code here

    # Now, we can iterate through the coefficients and make a sum of coefficient * (x^current_degree)
    # remember that the first element in the array is the coefficient for the highest degree, and the last element is the coefficient for the lowest degree
    for i, coeff in enumerate(coeffs):
        # your code here to increment the predicted value

        pass

    return predicted_value

Once you have created this function, please run the following code to ensure that it works properly.

assert poly_predict(2, [1, 2, 3]) == 11
assert poly_predict(4, [1, 2, 3]) == 27
assert poly_predict(3, [1, 2, 3, 4, 5]) == 179
assert poly_predict(4, [2.5, 2, 3]) == 51
print("poly_predict function is working!")

Now, we will perform the np.polyfit function for degrees ranging from 2 to 5. For each degree, we will get the coefficients, calculate the predicted values, and then calculate the evaluation metrics. Store the results in a dictionary where the key is the degree and the value is a dictionary containing the evaluation metrics.

If you correctly implement this, numpy will issue a warning that says "RankWarning: Polyfit may be poorly conditioned". We expect you to run into this and think about what it means. You can hide this message by running the code

import warnings
warnings.simplefilter("ignore", np.RankWarning)

results = dict()
for degree in range(2, 6):
    # get the coefficients
    coeffs = None # your code here

    # Calculate the predicted values
    y_pred = None # your code here

    # Calculate the evaluation metrics
    mse = get_mse(y, y_pred)
    rmse = get_rmse(y, y_pred)
    mae = get_mae(y, y_pred)
    r_squared = get_r_squared(y, y_pred)

    # Store the results in a new dictionary that is stored in the results dictionary
    # eg, results[2] = {'mse': 0.5, 'rmse': 0.7, 'mae': 0.3, 'r_squared': 0.9}
    results[degree] = None # your code here

results

Deliverables

Function poly_predict
Output of the evaluation metrics for each degree of polynomial regression
Which degree of polynomial regression performed the best? Do you think this is the best model for this data? Why or why not?

Question 4 (2 points)

In question 1, we briefly mentioned that regularization techniques are used to help prevent overfitting. Regularization techniques add term to the loss function that penalizes the model for having large coefficients. In practice, this helps make sure that the model is fitting to patterns in the data, rather than noise or outliers. The two most common regularization techniques for machine learning are LASSO (L1 Regulariziation) and Ridge (L2 Regularization).

LASSO is an acronym for Least Absolute Shrinkage and Selection Operator. Essentially, this regularization technique computes the sum of the absolute values of the coefficients and uses it as the penalty term in the loss function. This helps ensure that the magnitude of coefficients is kept small, and can often lead to some coefficients being set to zero. This essentially helps the model perform feature selection to improve generalization.

Ridge regression works in a similar matter, however it uses the sum of each coefficient squared instead of the absolute value. This also helps force the model to use smaller coefficients, but typically does not set any coefficients to zero. This typically helps reduce collinearity between features.

Now, our 5th degree polynomial from the previous question had perfect accuracy. However, looking at the data yourself, do you really believe that the data is best represented by a 5th degree polynomial? The linear regression model from question 1 is likely the best model for this data. Using the coefficients from the 5th degree polynomial, print the predicted values are for the following x values:

x_values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

Are the predicted y values reasonable for 1 through 5? what about 6 through 10?

Let’s see if we can improve this 5th degree polynomial by using Ridge regression. Ridge regression is implemented in the scikit-learn linear models module under the Ridge class. Additionally, to ensure we are using a 5th degree polynomial, we will first need the PolynomialFeatures` class from the preprocessing module. Finally, we can use scikit-learn pipelines to chain these two models together through the make_pipeline function. The code below demonstrates how to use these three classes together.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

n_degree = 5
polyfeatures = PolynomialFeatures(degree=n_degree)

alpha = 0.1 # regularization term. Higher values = more regularization, 0 = simple linear regression
ridge = Ridge(alpha=alpha)

model = make_pipeline(polyfeatures, ridge)

# we need to reshape the data to be a 2D array
x = np.array([1, 2, 3, 4, 5])
x = x.reshape(-1, 1)
y = np.array([5.5, 7, 9.5, 12.5, 13])

# fit the model
model.fit(x, y)

# predict the values
y_pred = model.predict(x)

Now that you have a fitted Ridge model, what are the coefficients (you can get them with model.named_steps['ridge'].coef_ and model.named_steps['ridge'].intercept_), and how do they compare to the previous 5th degree polynomial’s coefficients? Are these predicted values more reasonable for 1 through 5? what about 6 through 10?

Deliverables

Predicted values for x_values using the 5th degree polynomial
Are the predicted values reasonable for 1 through 5? what about 6 through 10?
Code to use Ridge regression on the 5th degree polynomial
How do the coefficients of the Ridge model compare to the 5th degree polynomial?
Are the L2 regularization predicted values more reasonable for 1 through 5? what about 6 through 10?

Question 5 (2 points)

As you see from the previous question, Ridge can help penalize large coefficients to help stop overfitting. However, it can never really fully recover when our baseline model is overfit. LASSO, on the other hand, can help us recover from overfitting by setting some coefficients to zero. Let’s see if LASSO can help us recover from our overfit 5th degree polynomial.

LASSO regression is implemented in the scikit-learn linear models module under the Lasso class. We can use the same pipeline as before, but with the Lasso class instead of the Ridge class.

The Lasso class has an additional parameter, max_iter, which is the maximum number of iterations to run the optimization. For this question, set max_iter=10000.

After you have done this, let’s see how changing the value of alpha affects our coefficients. To give an overall value to the coefficients, we will use the L1 method, which is the sum of the absolute values of the coefficients. For example, the below code will give the L1 value of the LASSO coefficients.

value = np.sum(np.abs(model.named_steps['lasso'].coef_))

For each alpha value from .1 to 1 in increments of .01, fit the LASSO model and Ridge model to the data. Calculate the L1 value of the model’s coefficients for each alpha value, and store them in the lists lasso_values and ridge_values respectively. Then, run the below code to plot the alpha values against the L1 values for both the LASSO and Ridge models.

plt.plot(np.arange(.1, 1.01, .01), lasso_values, label='LASSO')
plt.plot(np.arange(.1, 1.01, .01), ridge_values, label='Ridge')
plt.xlabel('Alpha')
plt.ylabel('L1 Value')
plt.legend()
plt.show()

Deliverables

How do the LASSO model’s coefficients compare to the 5th degree polynomial?
How do the LASSO model’s coefficients compare to the Ridge model’s coefficients?
What is the relationship between the alpha value and the L1 value for both the LASSO and Ridge models?

Question 6 (2 points)

There are many other forms of regression that can be discussed. Complex machine learning models such as Neural Networks and Support Vector Machines can be used for regression. Additionally, our previous classification models can all be adapted to solve regression problems. For example, we can have a KNN calculate the mean of the k nearest neighbors to predict a value.

As we mentioned in question 1, there are more complex multivariate regression models that are used to solve multiple dependent variable problems. These models can be linear or polynomial, and can be regularized with LASSO or Ridge.

Let’s first implement a Multivariate Linear Regression model. We will use new data before, as we will need multiple dependent variables. We will use the boston housing dataset for this question, and we will try to predict both the MEDV and CRIM columns. Please run the below code to load the dataset.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
df = pd.read_csv('/anvil/projects/tdm/data/boston_housing/boston.csv')

X = df.drop(columns=['MEDV', 'CRIM'])
y = df[['MEDV', 'CRIM']]

scaler = StandardScaler()
X = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=7)

In order to predict multiple columns at once, we can use the MultiOutputRegressor class from the scikit-learn multioutput module. This class takes in another model (linear, polynomial, ridge, etc), and fits it to each dependent variable in the dataset. The below code demonstrates how to use this class with a linear regression model.

In the background, the MultiOutputRegressor class is fitting a separate model to each dependent variable. This means that the model is not learning the relationship between the dependent variables, but rather the relationship between the independent variables and each dependent variable. This can be useful when the dependent variables are not related, but can be a limitation when they are.

from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor

model = LinearRegression()
multi_model = MultiOutputRegressor(model)

multi_model.fit(X_train, y_train)
y_pred = multi_model.predict(X_test)

# output the coefficients
print(multi_model.estimators_[0].coef_)

Given the above code for a Multivariate Linear Regression Model, can you implement a Multivariate LASSO and Ridge model? How do the coefficients of the LASSO and Ridge models compare to the Linear model? (Hint: use an alpha of 0.5)

Finally, compute the mean squared error and r-squared for each model for each dependent variable. Then, average the results to get a single value for each metric.

You can use the get_mse and get_r_squared functions from question 2 to calculate the evaluation metrics. However, since we have multiple dependent variables, you will need to calculate these metrics for each dependent variable and average them to get a single value for each metric. I recommend you split the y_test and y_pred arrays by column, e.g. cols = [y_pred[:, 0], y_pred[:, 1]], and then iterate through the columns to calculate the metrics.

Deliverables

Code to implement Multivariate LASSO and Ridge models
How do the coefficients of the LASSO and Ridge models compare to the Linear model?
Output of the evaluation metrics for each model for each dependent variable

Submitting your Work

Items to submit

firstname_lastname_project9.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.