TDM 30200: Project 08 - Feature Selection (Regularization)
Project Objectives
This project aims to provide students with an understanding of feature selection techniques through the application of regularization methods, specifically Ridge regression and LASSO. By applying these methods to the Lahman Baseball Database in R, students will explore how selecting relevant variables can improve model performance, reduce overfitting, and enhance interpretability.
For this project, we will use R instead of Python, as it offers a more streamlined and efficient approach to this kind of modeling. Rather than focusing on which programming language is used, we will emphasize why certain tools and techniques are applied. For ease of understanding, we will demonstrate the concepts using a simple linear model; however, the regularization methods covered here rest on the same principles used in many machine learning algorithms and neural network models.
Dataset
We will work with the Lahman Baseball Database, which is available in R through the Lahman package.
Introduction
One of the fundamental steps in modeling is identifying the relevant variables. While machine learning models can handle large datasets and a high number of features, strong dependencies among variables can reduce the reliability of model parameters. The difficulties become more pronounced when the number of variables (p) exceeds the number of observations (n).
For instance, in linear models, having p > n means that the system of equations is underdetermined, leading to infinitely many possible solutions. This can cause instability in parameter estimation and increase the risk of overfitting. To address this issue, regularization techniques such as Ridge regression or LASSO can be used to impose constraints on the model and improve generalization.
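To make the p > n issue concrete, here is a minimal sketch on simulated data (not the project data) showing that ordinary least squares cannot estimate all coefficients when predictors outnumber observations:
# A minimal sketch: more predictors (p = 10) than observations (n = 5)
set.seed(7)
n <- 5
p <- 10
X <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)
fit <- lm(y ~ X)
coef(fit)  # several coefficients are NA because the system is underdetermined
Regularization resolves this kind of instability by adding a penalty that makes the solution unique.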
Questions
Question 1 (2 points)
We’ve previously utilized the Batting table from the Lahman dataset, which contains batting statistics. Specifically, we examined the relationship between Homeruns (HR) and Runs Batted In (RBI), noting how HR influences RBI. However, the Batting table offers numerous other variables that can enhance our models. To explore these variables, execute the following R code first:
library(Lahman)
names(Batting)
To gain detailed information about the variables in the Batting table, use the ?Batting command in R to access the documentation.
People is another table that includes player names, dates of birth, and demographic information for baseball players. It provides more detail about the players and shares the unique playerID key, which can be used to merge the Batting and People datasets. People contains 26 variables, but many of them are expected to have a large number of missing (NA) values. For instance, since the data includes records up to 2023, the death-related columns will have many empty entries, as most recent players are still alive. Additionally, these variables are not meaningfully related to RBI, so they do not need to be considered in the modeling process.
Some variables may provide interesting insights about the data but be less important for modeling. For example, while the age at which players make their debut can be an intriguing variable, their actual age during a season might contribute more to the model. Additionally, the most common birthplaces highlight Major League Baseball’s international reach, but their contribution to the model may be limited. For curious readers, a chart of debut ages and a table of the ten most common birth countries appear below. (Reproducing the visual and table is not required for the project.)

Most common birth countries
Birth country | n
---|---
USA | 18098
Dominican Republic | 890
Venezuela | 472
Puerto Rico | 281
Canada | 266
Cuba | 237
Mexico | 147
Japan | 76
Panama | 70
Ireland | 50
R code for Debut Age:
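The original code block did not survive formatting. Below is a minimal sketch of one way debut age could be computed; the columns debut and birthDate come from the People table, but the exact original code is an assumption:
library(Lahman)
library(dplyr)
# Sketch: age at first major league appearance, in years
debut_ages <- People %>%
  mutate(debutDate = as.Date(debut),
         debutAge  = as.numeric(debutDate - birthDate) / 365.25) %>%
  filter(!is.na(debutAge))
summary(debut_ages$debutAge)
hist(debut_ages$debutAge, main = "Age at MLB Debut", xlab = "Age (years)")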
The extensive variables within the Batting and People tables offer numerous opportunities for model development. Building upon our previous linear modeling approach, we aim to explain the variation in our target variable, Runs Batted In (RBI), using a broader set of predictors. To achieve this, we will first merge the datasets using the unique identifier, playerID:
# Merge two datasets (requires dplyr for %>% and left_join)
library(dplyr)
full_data <- Batting %>%
  left_join(People, by = "playerID") %>%
  na.omit()
The People dataset contains players' birth years, a potentially valuable predictor. However, given the dataset’s wide historical range (1871-2023), directly including birth year might not be optimal. Instead, calculating players' ages and incorporating that variable would better account for age-related differences in performance.
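A minimal sketch of adding such an age column, using the same definition (age = yearID - birthYear) that appears in the Appendix 2 code:
# Add each player's age in a given season to the merged data
full_data <- full_data %>%
  mutate(age = yearID - birthYear)
head(full_data[, c("playerID", "yearID", "birthYear", "age")])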
- 1.1. Check out the column names, display the first six lines, and examine summary statistics for the Batting and People datasets.
- 1.2. Merge the Batting and People datasets using playerID as the join key.
- 1.3. Calculate the age of each player (based on birthYear and yearID) and add it as a new column named 'age' to the merged dataset.
For ease of reading, all variables, along with their codes and explanations, are listed in the Appendix at the end of this document. The tables include all variables for the Batting and People datasets.
Question 2 (2 points)
Prior to modeling, let’s examine a heatmap of the numeric variables intended for use in our model.

Data visualization offers essential clues to understanding your data before initiating the modeling process. For instance: At Bats (AB), Hits (H), and Doubles (X2B) are highly correlated because they follow a natural hierarchy in baseball statistics. AB represents a player’s batting opportunities, H is a subset of AB, counting successful hits, and X2B is a further subset, representing only doubles. Since more at-bats generally lead to more hits, and more hits increase the likelihood of doubles, these variables are inherently linked, resulting in strong correlations. In most cases, H (Hits) may be the best choice because it captures a player’s ability to reach base successfully, encompassing both singles and extra-base hits while avoiding redundancy.

The source of the picture is AP News, accessed 03/08/2025.
The following linear model uses as many variables as possible to explain the variation in the target variable, RBI (the age variable was generated in Question 1.3).
# Fit a linear model
lm_model <- lm(RBI ~ yearID + lgID + H + X3B + HR + SB + CS + BB + SO + age + height + weight + bats, data = full_data)
# Display model summary
summary(lm_model)
Although a linear model provides P-values as evidence for variable selection, shrinkage or regularization methods (e.g., LASSO, Ridge) are used instead because P-values can be misleading, especially when the sample size is large or when variables are highly correlated with one another. In such cases, a variable with a high P-value might still be relevant to the model, even though its contribution is not significant enough to pass the threshold. Moreover, when dealing with a large number of variables, you may want the model to perform parameter estimation and variable selection simultaneously rather than relying on interpreting the output. Finally, many machine learning algorithms, owing to their lack of interpretability, do not provide evidence (such as P-values) of which variables contribute to the model.
Variable selection (feature engineering) in statistical models is crucial for improving both the model’s performance and interpretability. By choosing only the most relevant variables, we can simplify the model, making it easier to understand and interpret, which is particularly important in fields where the relationships between variables are essential. Additionally, variable selection helps prevent overfitting, a common issue when too many irrelevant variables are included, which can lead to a model that fits the noise in the data rather than the underlying patterns. A more focused model with fewer predictors tends to generalize better to new data, leading to improved prediction accuracy. By selecting the right variables, we can also reduce computational costs, as fewer predictors mean less memory and processing power are required, which is especially important in large datasets.
While shrinkage methods continue to evolve and new ones (such as Elastic net) are introduced, understanding Ridge and LASSO provides a strong foundation for grasping more advanced techniques. Now, let’s recall the linear model we discussed earlier and review its mathematical representation with a single predictor.
$RBI_{i} = \beta_0 + \beta_1 HR_{i} + \epsilon_{i}$,
where:
- $\beta_0 =$ Common starting point for all players (overall intercept)
- $\beta_1 =$ Average effect of HR across all players (overall slope)
- $\epsilon_{i} =$ Error term (noise arising from the modeling)
We can replace the variable names in the model with their symbolic representations:
$y_{i} = \beta_0 + \beta_1 x_{i} + \epsilon_{i}$.
Generalizing to $p$ variables:
$y_{i} = \beta_0 + \sum_{j = 1}^{p}\beta_j x_{ij} + \epsilon_{i}$.
Since we are looking for the best estimates for the unknown $\beta$ parameters in this model, we want to minimize $\epsilon_{i}$ as much as possible. The method that provides the best estimates for these parameters is the least squares approach, which finds $\beta$ values that minimize the residual sum of squares (RSS):
$RSS = \sum_{i = 1}^{n}(y_i - \beta_0 - \sum_{j = 1}^{p}\beta_j x_{ij})^2$.
Even without prior knowledge of this method, if we were to consider the best way to predict $y_i$ based on the given model, our natural goal would be to make both sides of the equation as close as possible. The variables in the model are known, and the parameters are estimated from the data, while $\epsilon$ represents an uncontrollable term arising from inherent uncertainty. Since this term cannot be eliminated, our objective is to minimize its impact. Rather than minimizing $\epsilon$ directly, we minimize its squared value to ensure that positive and negative errors do not cancel each other out.
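To make the RSS formula concrete, here is a minimal sketch on simulated data (not the project data) that computes RSS directly from the coefficients and checks it against the residuals stored in the fitted lm object:
# A minimal sketch of computing RSS by hand on simulated data
set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
b <- coef(fit)  # b[1] = intercept, b[2] and b[3] = slopes
y_hat <- b[1] + b[2] * x1 + b[3] * x2
rss_manual  <- sum((y - y_hat)^2)
rss_builtin <- sum(residuals(fit)^2)
c(manual = rss_manual, builtin = rss_builtin)  # the two values agree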
- 2.1. Fit a linear model with RBI as the dependent (target) variable, including all the independent variables you want to incorporate from both tables. You may use the same variables included in the full model from the previous question.
- 2.2. What is model interpretability?
- 2.3. Consider a linear model where RBI is the target (dependent) variable, and HR and age are the independent variables. We can fit this model using the lm function and extract the model parameters with the following R commands:
lmmodel <- lm(RBI ~ HR + age, data = full_data)
coef(lmmodel)
Using the estimated coefficients from this model, calculate the Residual Sum of Squares (RSS) manually. Instead of using built-in residual functions in R, compute RSS by directly applying the following formula:
$RSS = \sum_{i = 1}^{n}(y_i - \beta_0 - \sum_{j = 1}^{p}\beta_j x_{ij})^2$
Question 3 (2 points)
Shrinkage/regularization methods introduce penalties when minimizing the Residual Sum of Squares (RSS), also known as the least squares loss function or objective function. These methods incorporate a penalty term into the RSS minimization process when estimating parameter values. Adjusting the magnitude of this penalty term influences the estimated parameters, helping to control model complexity and prevent overfitting.
Imagine you’re packing a suitcase for a trip. You want to bring everything you might need, but if you pack too much, your suitcase becomes heavy and difficult to carry. Shrinkage methods like Ridge and LASSO work similarly in a statistical model. Without any penalty, the model can include as many variables as possible, making it complex and potentially overfitting the data. However, by adding a penalty term (like an airline imposing a weight limit on luggage), the model is forced to prioritize important variables while reducing the impact of less significant ones. Ridge regression acts like a soft weight limit—allowing you to bring all your items but compressing them slightly to make the suitcase more manageable. LASSO, on the other hand, is stricter, forcing you to completely remove unnecessary items to meet the weight limit. This way, shrinkage methods prevent your model from becoming too complex while ensuring it still performs well.
Ridge regression adds a penalty term to the least squares objective function. This penalty term (L-2 penalty) is proportional to the squared magnitude of the coefficients, shrinking them towards zero:
$\sum_{i = 1}^{n}(y_i - \beta_0 - \sum_{j = 1}^{p}\beta_j x_{ij})^2 + \lambda \sum_{j = 1}^{p} \beta_j^2$
where $\lambda$ is a hyperparameter, also called the tuning parameter.
Before moving on to the next paragraph, I recommend taking a minute to pause and consider what a hyperparameter is, how it differs from a parameter, and how it can be found in a general model setting.

As illustrated in the figure above, we can control the magnitude of the $\beta$ coefficients by adjusting the value of the tuning parameter, $\lambda$. During the minimization process, the Residual Sum of Squares (RSS) is minimized while simultaneously penalizing the size of the coefficients. This leads to the following relationships:
- If $\lambda \rightarrow 0$, the estimated parameter values converge to those obtained from ordinary least squares.
- If $\lambda \rightarrow \infty$, the ridge regression coefficients shrink toward zero. (A small demonstration of these two limits follows below.)
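A small demonstration of these two limits on simulated data (not the project data); the variable names here are illustrative only:
library(glmnet)
set.seed(1)
x_demo <- matrix(rnorm(200 * 5), 200, 5)
y_demo <- as.numeric(x_demo %*% c(3, -2, 1, 0, 0) + rnorm(200))
# Ridge (alpha = 0) fitted at a huge and a tiny lambda
ridge_demo <- glmnet(x_demo, y_demo, alpha = 0, lambda = c(1e4, 1e-4))
coef(ridge_demo, s = 1e-4)   # close to the ordinary least squares estimates
coef(ridge_demo, s = 1e4)    # coefficients shrunk toward zero
coef(lm(y_demo ~ x_demo))    # OLS estimates for comparison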
Before using shrinkage methods like Ridge or LASSO, it is crucial to scale the predictor variables. These methods apply a penalty to the regression coefficients, and since the penalty depends on the magnitude of the coefficients, variables with larger scales can dominate the shrinkage process. For example, if one variable is measured in thousands (e.g., salary in dollars) and another in single digits (e.g., years of experience), the penalty term will affect the variable with larger numerical values, even if both have similar importance in predicting the outcome. This can lead to biased coefficient estimates and misinterpretation of variable importance. To avoid this issue, standardization (subtracting the mean and dividing by the standard deviation) ensures that all variables contribute equally to the shrinkage process.
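Note that glmnet standardizes predictors internally by default (standardize = TRUE) and reports coefficients on the original scale; the sketch below simply shows what explicit standardization with scale() does, using made-up numbers:
# Minimal sketch: explicit standardization of a numeric matrix
X_demo <- cbind(salary     = c(50000, 72000, 61000, 90000),
                experience = c(2, 7, 5, 12))
X_demo_scaled <- scale(X_demo)      # subtract column means, divide by column SDs
round(colMeans(X_demo_scaled), 10)  # approximately 0 for each column
apply(X_demo_scaled, 2, sd)         # 1 for each column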
Programming languages such as R and Python estimate the parameters for different values of $\lambda$. glmnet is one of the R libraries that can be used to run a regularization method. The glmnet function can fit several types of regression models over a grid of values for the regularization parameter, $\lambda$. Here is the example code:
# Load necessary library
library(glmnet)
# Prepare the matrix of predictors (excluding the response variable)
X <- model.matrix(RBI ~ yearID + lgID + H + X3B + HR + SB + CS + BB + SO + age + height + weight + bats, data = full_data)[, -1]
y <- full_data$RBI
# Standardize predictors
X_scaled <- scale(X)
# Replace NaNs (from zero variance) with 0
X_scaled[is.na(X_scaled)] <- 0 # lgID variable has some non-observed values in the data
# Fit ridge regression model (alpha = 0 for ridge)
ridge_model <- glmnet(X_scaled, y, alpha = 0)
coef(ridge_model)
ridge_model$lambda
The plot below illustrates how coefficient values vary with changes in log-lambda in Ridge regression.

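A coefficient-path plot of this kind can be produced directly from the fitted glmnet object; a minimal sketch using the ridge_model object created above:
# Plot ridge coefficient paths against log(lambda)
plot(ridge_model, xvar = "lambda", label = TRUE)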
- 3.1. What is the role of $\lambda$ in penalized RSS?
- 3.2. Perform Ridge regression using the same variables from Question 2.1.
- 3.3. How many different $\lambda$ values were tested in the Ridge regression in 3.2?
Question 4 (2 points)
While Ridge Regression effectively shrinks coefficients towards zero, it rarely eliminates them entirely. This means that even with Ridge, you might end up with a model containing many features, some of which may be redundant. To address this, we can use LASSO (Least Absolute Shrinkage and Selection Operator). LASSO, like Ridge, uses a penalty term to regularize the model, but it employs the L1 norm (absolute value) instead of the L2 norm (square). This crucial difference allows LASSO to perform feature selection by driving some coefficients to exactly zero, effectively removing those features from the model. This time, the penalized RSS looks as follows:
$\sum_{i = 1}^{n}(y_i - \beta_0 - \sum_{j = 1}^{p}\beta_j x_{ij})^2 + \lambda \sum_{j = 1}^{p} |\beta_j|$

Similar to ridge, we have a tuning parameter, $\lambda$, and changing its value changes the magnitude of the coefficients. The R code is exactly the same as for ridge regression; the only difference is using alpha = 1 instead of alpha = 0 in the glmnet function (a short sketch appears after the plot below).
The following plot shows how the coefficient values change with log-lambda in LASSO regression.

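A minimal sketch of the LASSO fit, mirroring the ridge code above (it reuses the X_scaled and y objects created there):
# Fit LASSO regression model (alpha = 1 for LASSO)
lasso_model <- glmnet(X_scaled, y, alpha = 1)
coef(lasso_model)
lasso_model$lambda
# Coefficient paths against log(lambda); some paths reach exactly zero
plot(lasso_model, xvar = "lambda", label = TRUE)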
- 4.1. Perform LASSO regression using the same linear model and variables from Question 2.1.
- 4.2. How many different $\lambda$ values were tested in the LASSO regression in 4.1?
Question 5 (2 points)
Up to this point, we have run models but have not yet determined the optimized model or final parameter estimates. Why is that?

So far, we have not performed any feature selection; we have only applied regularization methods across different $\lambda$ values and reported the results. To identify the final model, we must use cross-validation to evaluate each $\lambda$ on unseen test data. The optimal $\lambda$ is then selected based on model performance. Only after this step can we determine which explanatory variables contribute most effectively to explaining the variation in the dependent variable.
With the coding and cross-validation knowledge we’ve gained so far, we have the tools to implement cross-validation for regularization in our models. However, R provides the cv.glmnet function, which streamlines this process with just a few lines of code. If you feel confident in your understanding of shrinkage/regularization and cross-validation, you can use this function to perform feature selection. But if you think you need more practice with these concepts, I recommend implementing cross-validation manually before using cv.glmnet; a sketch of a manual approach appears below.
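If you take the manual route first, a minimal sketch of K-fold cross-validation over glmnet's lambda grid might look like the following (it assumes the X_scaled and y objects from Question 3; the fold count and the MSE metric are choices, not requirements):
# Manual 5-fold cross-validation over a shared lambda grid (a sketch)
library(glmnet)
set.seed(42)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(X_scaled)))
# Use one common lambda grid so errors are comparable across folds
lambda_grid <- glmnet(X_scaled, y, alpha = 1)$lambda
cv_errors <- sapply(1:k, function(fold) {
  train <- folds != fold
  fit <- glmnet(X_scaled[train, ], y[train], alpha = 1, lambda = lambda_grid)
  pred <- predict(fit, newx = X_scaled[!train, ], s = lambda_grid)
  colMeans((y[!train] - pred)^2)  # held-out MSE for each lambda
})
mean_cv_error <- rowMeans(cv_errors)
lambda_grid[which.min(mean_cv_error)]  # lambda with the lowest average CV error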
Let’s proceed with feature selection with cross-validation for LASSO:
# Perform LASSO regression using cross-validation
set.seed(42) # optional, just for reproducibility
lasso_cv <- cv.glmnet(X, y, alpha = 1) # alpha = 1 for LASSO
# Extract the best lambda value
best_lambda <- lasso_cv$lambda.min
print(paste("Best Lambda:", best_lambda))
- 5.1. Fit Ridge and LASSO regression models using their respective cross-validated optimal lambda values.
- 5.2. Extract and display the coefficients selected by the final LASSO and Ridge models.
Below is a general representation of shrinkage methods, which can inspire you to create your own regularization approaches.
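The original figure is not reproduced here. One common general form, which I assume is what the figure illustrated, writes the penalized objective with an exponent $q$ on the penalty term:
$\sum_{i = 1}^{n}(y_i - \beta_0 - \sum_{j = 1}^{p}\beta_j x_{ij})^2 + \lambda \sum_{j = 1}^{p} |\beta_j|^q$
Here $q = 2$ recovers ridge regression, $q = 1$ recovers LASSO, and a weighted combination of the two penalties gives the elastic net mentioned earlier.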
Appendix 1
Batting Data Variables
Variable Code | Explanation
---|---
playerID | Player ID code (links to People dataset)
yearID | Year
stint | Player’s stint (order of appearances within a season)
teamID | Team; a factor
lgID | League; a factor with levels AA, AL, FL, NL, PL, UA
G | Games: number of games in which a player played
AB | At Bats
R | Runs
H | Hits: times reached base because of a batted, fair ball without error by the defense
X2B | Doubles: hits on which the batter reached second base safely
X3B | Triples: hits on which the batter reached third base safely
HR | Homeruns
RBI | Runs Batted In
SB | Stolen Bases
CS | Caught Stealing
BB | Base on Balls
SO | Strikeouts
IBB | Intentional walks
HBP | Hit by pitch
SH | Sacrifice hits
SF | Sacrifice flies
GIDP | Grounded into double plays
People Data Variables
Variable Code | Explanation
---|---
playerID | A unique code assigned to each player. Links to other files.
birthYear | Year player was born
birthMonth | Month player was born
birthDay | Day player was born
birthCountry | Country where player was born
birthState | State where player was born
birthCity | City where player was born
deathYear | Year player died
deathMonth | Month player died
deathDay | Day player died
deathCountry | Country where player died
deathState | State where player died
deathCity | City where player died
nameFirst | Player’s first name
nameLast | Player’s last name
nameGiven | Player’s given name (typically first and middle)
weight | Player’s weight in pounds
height | Player’s height in inches
bats | Player’s batting hand (left (L), right (R), or both (B))
throws | Player’s throwing hand (left (L) or right (R))
debut | Date that player made first major league appearance
finalGame | Date that player made final major league appearance (blank if still active)
retroID | ID used by Retrosheet, www.retrosheet.org/
bbrefID | ID used by Baseball Reference website, www.baseball-reference.com/
birthDate | Player’s birthdate, in as.Date format
deathDate | Player’s deathdate, in as.Date format
Appendix 2
R Code for Correlation Heat Map
library(Lahman)
library(dplyr)
library(ggcorrplot)
# Merge Batting and People tables and keep the numeric variables used in the heatmap
corrdata <- Batting %>%
left_join(People, by = "playerID") %>%
mutate(age = yearID - birthYear) %>%
select(RBI, yearID, AB, H, X2B, X3B, HR, SB, CS, BB, SO, age, height, weight) %>%
na.omit() # Remove rows with NA values
# Compute correlation matrix
cor_matrix <- cor(corrdata, use = "pairwise.complete.obs")
# Plot correlation heatmap
ggcorrplot(cor_matrix, lab = TRUE, type = "lower", colors = c("blue", "white", "red"))
Submitting your Work
Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.
- firstname_lastname_project8.ipynb
You must double check your submission after uploading it. You will not receive full credit if your notebook does not run or does not display all of your work.