401 Project 09  Regression: Basics
Project Objectives
In this project, we will learn about the basics of regression. We will explore common regression techniques and how to interpret their results. We will also investigate the strengths and weaknesses of different regression techniques and how to choose the right one for a given problem.
Questions
Question 1 (2 points)
The most common regression technique is linear regression. Have you ever generated a trendline in Excel? If so, that is a form of linear regression! There are multiple forms of linear regression, but the most common is called simple linear regression
. Other forms include multiple linear regression
, and polynomial regression
.
It may seem counter intuitive that 
Each of these forms is slightly different, but at their core, they all attempt to model the relationship between one or more independent variables and one or more dependent variable.
Model  Independent  Dependent  Description 

Simple Linear Regression 
1 variable 
1 variable 
Models the relationship between one independent variable and one dependent variable. Think of y = ax + b, where y is the dependent variable, x is the independent variable, and a and b are coefficients. 
Multiple Linear Regression 
2+ variables 
1 variable 
Models the relationship between two or more independent variables and one dependent variable. Think z = ax + by + c, where z is the dependent variable, x and y are independent variables, and a, b, and c are coefficients. 
Polynomial Regression 
1 variable 
1 variable 
Models the relationship between one or more independent variables and one dependent variable using a polynomial function. Think y = ax^2 + bx + c, where y is the dependent variable, x is the independent variable, and a, b, and c are coefficients. 
Multiple Polynomial Regression 
2+ variables 
1 variable 
Models the relationship between two or more independent variables and one dependent variable using a polynomial function. Think z = ax^2 + by^2 + cx + dy + e, where z is the dependent variable, x and y are independent variables, and a, b, c, d, and e are coefficients. 
Multivariate Linear Regression 
2+ variables 
2+ variables 
Models the relationship between two or more independent variables and two or more dependent variables. Can be linear or polynomial. Think Y = AX + B. Where Y and X are matrices, and A and B are matrices of coefficients. This allows predictions of multiple dependent variables at once. 
Multivariate Polynomial Regression 
2+ variables 
2+ variables 
Same as Multivariate Linear Regression, but for polynomials. Also generalized to Y = AX + B, however X must have all independent variables and their polynomial terms, and A and B must be much larger matrices to store these coefficients. 
For this question, please run the following code the load our very simple dataset.
import numpy as np
import matplotlib.pyplot as plt
# Data
x = np.array([1, 2, 3, 4, 5])
y = np.array([5.5, 7, 9.5, 12.5, 13])
Using the data above, find the best fit line for the data using simple linear regression. Store the slope and yintercept in the variables a
and b
respectively.
We can find lines of best fit using the np.polyfit function. Although this function is built for polynomial regression, it can be used for simple linear regression by setting the degree parameter to 1. This function returns an array of coefficients, ordered from highest degree to lowest. For simple linear regression (y = mx + b), the first coefficient is the slope (m) and the second is the yintercept (b). 
# Find the best fit line
# YOUR CODE HERE
a, b = np.polyfit(x, y, 1)
# Plot the data and the best fit line
print(f'Coefficients Found: {a}, {b}')
y_pred = a * x + b
plt.scatter(x, y)
plt.plot(x, y_pred, color='red')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

Coefficients found by np.polyfit with degree 1

Plot of the data and the best fit line
Question 2 (2 points)
After finding the best fit line, we should have two variables stored: y
, and y_pred
. Now that we have these, we can briefly discuss evaluation metrics for regression models. There are many, many metrics that can be used to evaluate regression models. We will discuss a few of the most common ones here, but we implore you to do further research on your own to learn about more metrics.
Metric  Description  Formula  Range 

Mean Squared Error (MSE) 
Average of the squared differences between the predicted and actual values. 
$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i  \hat{y}_i)^2$, where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value. 
$[0, \infty)$ 
Root Mean Squared Error (RMSE) 
Square root of the MSE. 
$RMSE = \sqrt{MSE}$ 
$[0, \infty)$ 
Mean Absolute Error (MAE) 
Average of the absolute differences between the predicted and actual values. 
$MAE = \frac{1}{n} \sum_{i=1}^{n} \mid y_i  \hat{y}_i \mid $, where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value. 
$[0, \infty)$ 
RSquared 
Explains the variance of dependent variables that can be explained by the independent variables. 
$R^2 = 1  \frac{SS_{res}}{SS_{tot}}$, where $SS_{res}$ is the sum of squared residuals (actual  prediction) and $SS_{tot}$ is the total sum of squares (actual  mean of actual). 
$[0, 1]$ 
Using the variables y
and y_pred
from the previous question, calculate the following metrics: MSE, RMSE, MAE, and RSquared. Write functions for get_mse
, get_rmse
, get_mae
, and get_r_squared
, that each take in actual and predicted values. Call these functions on y and y_pred and store the results in the variables mse
, rmse
, mae
, and r_squared
respectively.
# Calculate the evaluation metrics
# YOUR CODE HERE
def get_mse(y, y_pred):
pass
def get_rmse(y, y_pred):
pass
def get_mae(y, y_pred):
pass
def get_r_squared(y, y_pred):
pass
mse = get_mse(y, y_pred)
rmse = get_rmse(y, y_pred)
mae = get_mae(y, y_pred)
r_squared = get_r_squared(y, y_pred)
print(f'MSE: {mse}')
print(f'RMSE: {rmse}')
print(f'MAE: {mae}')
print(f'RSquared: {r_squared}')

Output of the evaluation metrics
Question 3 (2 points)
Now that we understand some evaluation metrics, let’s see how polynomial regression compares to simple linear regression on our same dataset. We will explore a range of polynomial degrees and see how the evaluation metrics change. Firstly, let’s write a function that will take in an x value and an array of coefficients and return the predicted y value using a polynomial function.
def poly_predict(x, coeffs):
# y_intercept is the last element in the array
y_intercept = None # your code here
# predicted value can start as the yintercept
predicted_value = y_intercept
# The rest of the elements are the coefficients, so we can determine the degree of the polynomial
coeffs = coeffs[:1]
current_degree = None # your code here
# Now, we can iterate through the coefficients and make a sum of coefficient * (x^current_degree)
# remember that the first element in the array is the coefficient for the highest degree, and the last element is the coefficient for the lowest degree
for i, coeff in enumerate(coeffs):
# your code here to increment the predicted value
pass
return predicted_value
Once you have created this function, please run the following code to ensure that it works properly.
assert poly_predict(2, [1, 2, 3]) == 11
assert poly_predict(4, [1, 2, 3]) == 27
assert poly_predict(3, [1, 2, 3, 4, 5]) == 179
assert poly_predict(4, [2.5, 2, 3]) == 51
print("poly_predict function is working!")
Now, we will perform the np.polyfit function for degrees ranging from 2 to 5. For each degree, we will get the coefficients, calculate the predicted values, and then calculate the evaluation metrics. Store the results in a dictionary where the key is the degree and the value is a dictionary containing the evaluation metrics.
If you correctly implement this, numpy will issue a warning that says "RankWarning: Polyfit may be poorly conditioned". We expect you to run into this and think about what it means. You can hide this message by running the code

results = dict()
for degree in range(2, 6):
# get the coefficients
coeffs = None # your code here
# Calculate the predicted values
y_pred = None # your code here
# Calculate the evaluation metrics
mse = get_mse(y, y_pred)
rmse = get_rmse(y, y_pred)
mae = get_mae(y, y_pred)
r_squared = get_r_squared(y, y_pred)
# Store the results in a new dictionary that is stored in the results dictionary
# eg, results[2] = {'mse': 0.5, 'rmse': 0.7, 'mae': 0.3, 'r_squared': 0.9}
results[degree] = None # your code here
results

Function poly_predict

Output of the evaluation metrics for each degree of polynomial regression

Which degree of polynomial regression performed the best? Do you think this is the best model for this data? Why or why not?
Question 4 (2 points)
In question 1, we briefly mentioned that regularization techniques are used to help prevent overfitting. Regularization techniques add term to the loss function that penalizes the model for having large coefficients. In practice, this helps make sure that the model is fitting to patterns in the data, rather than noise or outliers. The two most common regularization techniques for machine learning are LASSO (L1 Regulariziation) and Ridge (L2 Regularization).
LASSO is an acronym for Least Absolute Shrinkage and Selection Operator. Essentially, this regularization technique computes the sum of the absolute values of the coefficients and uses it as the penalty term in the loss function. This helps ensure that the magnitude of coefficients is kept small, and can often lead to some coefficients being set to zero. This essentially helps the model perform feature selection to improve generalization.
Ridge regression works in a similar matter, however it uses the sum of each coefficient squared instead of the absolute value. This also helps force the model to use smaller coefficients, but typically does not set any coefficients to zero. This typically helps reduce collinearity between features.
Now, our 5th degree polynomial from the previous question had perfect accuracy. However, looking at the data yourself, do you really believe that the data is best represented by a 5th degree polynomial? The linear regression model from question 1 is likely the best model for this data. Using the coefficients from the 5th degree polynomial, print the predicted values are for the following x values:
x_values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Are the predicted y values reasonable for 1 through 5? what about 6 through 10?
Let’s see if we can improve this 5th degree polynomial by using Ridge regression. Ridge regression is implemented in the scikitlearn linear models module under the Ridge
class. Additionally, to ensure we are using a 5th degree polynomial, we will first need the PolynomialFeatures`
class from the preprocessing module. Finally, we can use scikitlearn pipelines to chain these two models together through the make_pipeline
function. The code below demonstrates how to use these three classes together.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
n_degree = 5
polyfeatures = PolynomialFeatures(degree=n_degree)
alpha = 0.1 # regularization term. Higher values = more regularization, 0 = simple linear regression
ridge = Ridge(alpha=alpha)
model = make_pipeline(polyfeatures, ridge)
# we need to reshape the data to be a 2D array
x = np.array([1, 2, 3, 4, 5])
x = x.reshape(1, 1)
y = np.array([5.5, 7, 9.5, 12.5, 13])
# fit the model
model.fit(x, y)
# predict the values
y_pred = model.predict(x)
Now that you have a fitted Ridge model, what are the coefficients (you can get them with model.named_steps['ridge'].coef_
and model.named_steps['ridge'].intercept_
), and how do they compare to the previous 5th degree polynomial’s coefficients? Are these predicted values more reasonable for 1 through 5? what about 6 through 10?

Predicted values for x_values using the 5th degree polynomial

Are the predicted values reasonable for 1 through 5? what about 6 through 10?

Code to use Ridge regression on the 5th degree polynomial

How do the coefficients of the Ridge model compare to the 5th degree polynomial?

Are the L2 regularization predicted values more reasonable for 1 through 5? what about 6 through 10?
Question 5 (2 points)
As you see from the previous question, Ridge can help penalize large coefficients to help stop overfitting. However, it can never really fully recover when our baseline model is overfit. LASSO, on the other hand, can help us recover from overfitting by setting some coefficients to zero. Let’s see if LASSO can help us recover from our overfit 5th degree polynomial.
LASSO regression is implemented in the scikitlearn linear models module under the Lasso
class. We can use the same pipeline as before, but with the Lasso class instead of the Ridge class.
The Lasso class has an additional parameter, max_iter, which is the maximum number of iterations to run the optimization. For this question, set max_iter=10000. 
After you have done this, let’s see how changing the value of alpha
affects our coefficients. To give an overall value to the coefficients, we will use the L1 method, which is the sum of the absolute values of the coefficients. For example, the below code will give the L1 value of the LASSO coefficients.
value = np.sum(np.abs(model.named_steps['lasso'].coef_))
For each alpha value from .1 to 1 in increments of .01, fit the LASSO model and Ridge model to the data. Calculate the L1 value of the model’s coefficients for each alpha value, and store them in the lists lasso_values
and ridge_values
respectively. Then, run the below code to plot the alpha values against the L1 values for both the LASSO and Ridge models.
plt.plot(np.arange(.1, 1.01, .01), lasso_values, label='LASSO')
plt.plot(np.arange(.1, 1.01, .01), ridge_values, label='Ridge')
plt.xlabel('Alpha')
plt.ylabel('L1 Value')
plt.legend()
plt.show()

How do the LASSO model’s coefficients compare to the 5th degree polynomial?

How do the LASSO model’s coefficients compare to the Ridge model’s coefficients?

What is the relationship between the alpha value and the L1 value for both the LASSO and Ridge models?
Question 6 (2 points)
There are many other forms of regression that can be discussed. Complex machine learning models such as Neural Networks and Support Vector Machines can be used for regression. Additionally, our previous classification models can all be adapted to solve regression problems. For example, we can have a KNN calculate the mean of the k nearest neighbors to predict a value.
As we mentioned in question 1, there are more complex multivariate regression models that are used to solve multiple dependent variable problems. These models can be linear or polynomial, and can be regularized with LASSO or Ridge.
Let’s first implement a Multivariate Linear Regression model. We will use new data before, as we will need multiple dependent variables. We will use the boston housing dataset for this question, and we will try to predict both the MEDV
and CRIM
columns. Please run the below code to load the dataset.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
df = pd.read_csv('/anvil/projects/tdm/data/boston_housing/boston.csv')
X = df.drop(columns=['MEDV', 'CRIM'])
y = df[['MEDV', 'CRIM']]
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=7)
In order to predict multiple columns at once, we can use the MultiOutputRegressor
class from the scikitlearn multioutput module. This class takes in another model (linear, polynomial, ridge, etc), and fits it to each dependent variable in the dataset. The below code demonstrates how to use this class with a linear regression model.
In the background, the 
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor
model = LinearRegression()
multi_model = MultiOutputRegressor(model)
multi_model.fit(X_train, y_train)
y_pred = multi_model.predict(X_test)
# output the coefficients
print(multi_model.estimators_[0].coef_)
Given the above code for a Multivariate Linear Regression Model, can you implement a Multivariate LASSO and Ridge model? How do the coefficients of the LASSO and Ridge models compare to the Linear model? (Hint: use an alpha of 0.5)
Finally, compute the mean squared error and rsquared for each model for each dependent variable. Then, average the results to get a single value for each metric.
You can use the 

Code to implement Multivariate LASSO and Ridge models

How do the coefficients of the LASSO and Ridge models compare to the Linear model?

Output of the evaluation metrics for each model for each dependent variable
Submitting your Work

firstname_lastname_project9.ipynb
You must double check your You will not receive full credit if your 