TDM 30200: Project 06 - Data Types, Dependency and Mixed Model

Project Objectives

Motivation: Learning the dependencies in the data and how to conduct more personalized models.

Context: We will load Lahman data in R and will practice several linear models with the data.

Scope: Explore the use of mixed models in R, understand their advantages over simpler models, and interpret key results in a real-world context.

For this project (and moving forward, when you are using R), please use the seminar-r kernel (not the seminar kernel), unless otherwise specified. When you use the seminar-r kernel with R, you do not need to use the %%R cell magic.

Learning Objectives
  • Exploratory data analysis and visualization

  • The significance of subject-specific parameters in models

  • Storytelling with your data

  • Mixed models

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset

We will work with the Lahman Baseball Database, which is available in R through the Lahman package.

In this project, we are transitioning from Python to R, and as an aspiring data scientist, it’s important to stay flexible and practice using different tools to keep your skills sharp. Switching between languages helps reinforce key concepts and prevents you from forgetting the details over time. Additionally, R offers powerful capabilities for the type of analysis we’re focusing on in this project. To access the Lahman dataset in R, you can use the following code. This isn’t calling a new function; it’s simply loading the Lahman data that comes with the package:

library(Lahman)

Introduction

"All models are wrong, but some are useful."[1]

Lahman is a comprehensive data set for sports analytics, including statistics from 1871 to 2023 with two current leagues (American and National), the four other "major" leagues (American Association, Union Association, Players League, and Federal League), and the National Association of 1871-1875.

More specifically, we will use the Batting table in Lahman, which includes batting statistics in the database. The main form of this table is rectangular or raw, which includes variables on columns and observations on rows. For example, Homeruns (HR) and Runs Batted In (RBI) are closely related to baseball statistics, but they are not identical.

Additional information for those who have no interest in baseball: The goal of baseball is for players to take turns hitting the ball and running to the four bases and scoring runs. If a player hits the ball and it goes out of bounds (into the stands), it is called a Homerun (HR). In this case, that player and any runners on the field all score directly. In other words, more than one run can be scored with one hit. An Runs Batted In (RBI) is when a player hits a ball and enables their teammates to reach home plate and score. For example, if a player hits the ball and causes a teammate on the field to run, that player earns 1 RBI. A player who hits a homerun automatically gets 1 RBI for themselves. If there are other runners on the field at the time of the HR, the number of RBIs increases because they also scored. For example, an HR with two people on the field earns 3 RBIs in total.

HR
Figure 1. Home Run

Creator of the image: Jonathan Daniel (Image Link, accessed on 30th of January 2025). Copyright: 2017 Getty Images

Let’s go back to the data set in R and remember the first steps we take when working with the data. We use the head(Batting) command to see the first six lines in the data. If you want to see some more details about Lahman data, type ?Lahman and ?Batting. Also, try the following summary(Batting) and str(Batting) for some main statistics and structure of each variable (column).

Questions

Question 1 (2 points)

What are your first insights from the data so far?

Let’s try to ask some interesting questions, such as "Have players been hitting more Homeruns (HR) over the years? Can you check it over a plot?"

When analyzing trends over time, it’s important to consider rates rather than just raw counts. The total number of homeruns (HR) in a season can be influenced by many external factors, such as the number of games played, the number of players in the league, and how often they play. So, when analyzing sports statistics, always ask yourself:

Is using raw counts appropriate, or should I adjust for playing time or other factors by using a rate?

So, start calculating the HR rate (Homeruns per at-bat(AB)) and get a more meaningful measure of how often players hit home runs, independent of season length or other influences.

Note: After you get your plot, you may notice some sharp decreases in HR rates, for example, in 1943. When you do a quick googling, you may see how World War II was effective on it. See this link for more reading.

Add one more question you find interesting from the data and write your own interpretation.

Deliverables

1.1 Explain your insights from the data

1.2. Find HR rates and show how they change over time

1.3. Write your own interpretation with one more question written by you.

Question 2 (2 points)

Remember, HR guarantees at least one RBI since the batters score for themselves, and if there are runners on base, the RBI count increases. However, not all RBIs come from HRs; a batter can also earn RBIs through other types of hits, such as singles or doubles, that allow teammates to score. While power hitters who frequently hit HRs tend to have high RBI totals, their RBI count also depends on whether there are runners on base when they come up to bat.

As a result, examining the changes in these two variables individually and together can be important in understanding the hitting power of the players and their contributions to the team. Before looking at these changes, what you will notice in your initial analysis is that since not every player can hit HRs all the time, zero will be dominant in these vectors. Let’s work with values greater than zero for these two variables.

Deliverables

2.1. Generate a subset with the non-zero values of HR and RBI variables

2.2. Use a graph that shows how RBI and HR move together

2.3. What is the correlation between these two variables? Remember: Correlation measures the strength and direction of the relationship between two variables—in this case, how home runs (HR) and runs batted in (RBI) move together. If the correlation is strong and positive, it means that players who hit more home runs also tend to drive in more runs.

Question 3 (2 points)

Consider a linear model where you aim to explain changes in the RBI with HR. How effective do you think a simple regression line is in capturing this variability?

Now, let’s take a different approach. This time, we will run a separate linear model for each player. One way to achieve this is by using the lmList function from the lme4 library as follows:

You will see numerous warnings when running the code below—that’s intentional. In the following questions, we will refine the results further.

library(lme4)
model_each <- lmList(RBI ~ HR|playerID, data = Batting)
summary(model_each)

The output of this model gives all parameter estimates for some players, while for others, it gives NaN or NA values. There may be different reasons for this. Before reading the next paragraph, pause for a while and think about what can cause it.

Think

One of the reasons can be that some players might not have enough data points to estimate both the slope (HR coefficient) and intercept. If a player has only one data point, then the regression line is undefined because there is no variation in HR to estimate a slope. If a player has exactly two data points, the regression line is perfectly determined, meaning the standard error cannot be estimated, which can result in NaN.

Additionally, if a player’s HR values do not vary at all, the model cannot compute a meaningful slope, leading to NA values for the coefficient. In cases where RBI and HR have a perfect linear relationship, the regression model may achieve a perfect fit, resulting in undefined or extremely large standard errors. Finally, missing data in either HR or RBI can cause certain players to have too few valid observations, preventing proper coefficient estimation and leading to NA or NaN values in the results.

Deliverables

3.1. Run the linear model showing the change in the RBI variable with HR. What do you think about this model explaining this variability with a simple regression line?

3.2. For the second model, which includes one regression line for each player, remove players with one or two data points before modeling for potential fixes for NA and NaN values. Then, for handling cases with zero variance in HR, filter out players where HR does not vary. (Hint: var(HR, na.rm = TRUE) > 0)

Run the second model one more time and see how the intercept and slope vary around players.

Here are some more examples:

Some More Visualization Example with Lahman Data:

Question 4 (2 points)

Now, let’s discuss the previous models a little bit. Explaining this data with a regression model is too optimistic because there is a lot of variability between players. If you examine it in more detail, you will even notice that there is variability between years and leagues. On the other hand, if we model each player, league, or season separately, we introduce a different type of modeling error known as overfitting or loss of generalizability. By fitting a separate model for each subgroup, we risk capturing noise rather than meaningful patterns, making our predictions unreliable for new or unseen data. Additionally, splitting the data this way reduces the number of observations per model, leading to unstable coefficient estimates and high variance in the results. To handle this type of repeated data, different methods will be needed. One approach is to incorporate a random effect, which captures the variability among players within the same model, alongside the regression line.

You can find the following page, built by Michael Freeman, very useful to visually understand the mixed model approach. Please spare five minutes to go over it before you move to the next step: mfviz.com/hierarchical-models/

For the ones who wants to learn more about mixed models, the following books have the downloadable pdf copy:

The mixed model approach will include both your fixed parameter and a subject-specific parameter with random effects. We have the model definition with a random intercept for each player as follows:

$RBI_{ij} = \beta_0 + \beta_1 HR_{ij} + u_{0j} + \epsilon_{ij}$

where:

  • $\beta_0 =$ Common starting point for all players (overall intercept)

  • $u_{0j} =$ Player-specific intercept deviation (random intercept)

  • $\beta_1 =$ Average effect of HR across all players (overall slope)

  • $\epsilon_{ij} =$ Error term (noise comes from modeling)

This formulation allows each player to have their own intercept in a single model while still maintaining an overall trend across all players.

Now, by using the lmer function in the lme4 library, run a mixed model for the previous model. Here is the basic syntax for this:

lmer(RBI ~ HR + (1 | playerID), data = data)

(1 | playerID): This part adds random effects to your model for each player.

We can see the population (fixef) and player-specific (ranef) effects as follows in R:

# Fixed effects (population-level effects)
fixed_effects <- fixef(mixed_model)
fixed_effects

# Random effects (player-specific effects)
random_effects <- ranef(mixed_model)$playerID
head(random_effects) # View player-specific intercept adjustments
Deliverables

4.1. Run mixed effect model (Include players with at least 5 data points) with only random intercept for the same variables used in the linear model

4.2. Use the summary function to see the model results and find the population (fixef) and player-specific (ranef) effects

When you use summary for your mixed model, it will produce a familiar output where you can examine the fixed effects. The only difference from a linear model on this output is that random effects where show the magnitude of the variance among the players and how much residual variance there is.

Question 5 (2 points)

Lahman data includes the year variable, which generates the repeats for players, and it also includes a lot of information for modeling. Also, there are a lot of rule changes in baseball. So, it may be worthwhile to add year to account for these changes, too.

Depending on the aim of the research, if you assume that year has a consistent effect across all players, you can add it as a fixed effect. The interpretation for this model can be as follows: the coefficient for the year will show the general trend in RBI over time (e.g., a positive coefficient means RBI totals tend to increase over the years).

Or you may want to add the year as a random effect, which allows each year to have a different baseline effect, capturing year-to-year variability. With this approach, the model accounts for yearly fluctuations in RBI production that aren’t explained by HRs alone.

Deliverables

5.1. Add year as a fixed effect into the mixed model previously used in question 4.1.

5.2. Add year as a random effect into the mixed model previously used in question 4.1.

More Resources - Lahman Data:

Submitting your Work

Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.

Items to submit
  • firstname_lastname_project6.ipynb

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not. Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.


1. From Wikipedia: The phrase "all models are wrong" was first attributed to George Box in a 1976 paper published in the Journal of the American Statistical Association.