TDM 30200: Project 06 - Data Types, Dependency and Mixed Model
Project Objectives
Motivation: Learning the dependencies in the data and how to conduct more personalized models.
Context: We will load Lahman
data in R and will practice several linear models with the data.
Scope: Explore the use of mixed models in R, understand their advantages over simpler models, and interpret key results in a real-world context.
For this project (and moving forward, when you are using R), please use the |
Dataset
We will work with the Lahman Baseball Database, which is available in R through the Lahman
package.
In this project, we are transitioning from Python to R, and as an aspiring data scientist, it’s important to stay flexible and
practice using different tools to keep your skills sharp. Switching between languages helps reinforce key concepts and
prevents you from forgetting the details over time. Additionally, R offers powerful capabilities for the type of analysis we’re focusing on in this project.
To access the Lahman
dataset in R, you can use the following code. This isn’t calling a new function; it’s simply loading the Lahman
data that comes with the package:
library(Lahman)
Introduction
"All models are wrong, but some are useful."[1]
Lahman
is a comprehensive data set for sports analytics, including statistics from 1871 to 2023 with two current leagues (American and National), the four other "major" leagues (American Association, Union Association, Players League, and Federal League), and the National Association of 1871-1875.
More specifically, we will use the Batting
table in Lahman
, which includes batting statistics in the database. The main form of this table is rectangular or raw, which includes variables on columns and observations on rows. For example, Homeruns (HR) and Runs Batted In (RBI) are closely related to baseball statistics, but they are not identical.
Additional information for those who have no interest in baseball: The goal of baseball is for players to take turns hitting the ball and running to the four bases and scoring runs. If a player hits the ball and it goes out of bounds (into the stands), it is called a Homerun (HR). In this case, that player and any runners on the field all score directly. In other words, more than one run can be scored with one hit. An Runs Batted In (RBI) is when a player hits a ball and enables their teammates to reach home plate and score. For example, if a player hits the ball and causes a teammate on the field to run, that player earns 1 RBI. A player who hits a homerun automatically gets 1 RBI for themselves. If there are other runners on the field at the time of the HR, the number of RBIs increases because they also scored. For example, an HR with two people on the field earns 3 RBIs in total. |

Creator of the image: Jonathan Daniel (Image Link, accessed on 30th of January 2025). Copyright: 2017 Getty Images
Let’s go back to the data set in R and remember the first steps we take when working with the data.
We use the head(Batting)
command to see the first six lines in the data.
If you want to see some more details about Lahman
data, type ?Lahman
and ?Batting
.
Also, try the following summary(Batting)
and str(Batting)
for some main statistics and structure of each variable (column).
Questions
Question 1 (2 points)
What are your first insights from the data so far?
Let’s try to ask some interesting questions, such as "Have players been hitting more Homeruns (HR) over the years? Can you check it over a plot?"
When analyzing trends over time, it’s important to consider rates rather than just raw counts. The total number of homeruns (HR) in a season can be influenced by many external factors, such as the number of games played, the number of players in the league, and how often they play. So, when analyzing sports statistics, always ask yourself: Is using raw counts appropriate, or should I adjust for playing time or other factors by using a rate? |
So, start calculating the HR rate (Homeruns per at-bat(AB)) and get a more meaningful measure of how often players hit home runs, independent of season length or other influences.
Note: After you get your plot, you may notice some sharp decreases in HR rates, for example, in 1943. When you do a quick googling, you may see how World War II was effective on it. See this link for more reading.
Add one more question you find interesting from the data and write your own interpretation.
1.1 Explain your insights from the data
1.2. Find HR rates and show how they change over time
1.3. Write your own interpretation with one more question written by you.
Question 2 (2 points)
Remember, HR guarantees at least one RBI since the batters score for themselves, and if there are runners on base, the RBI count increases. However, not all RBIs come from HRs; a batter can also earn RBIs through other types of hits, such as singles or doubles, that allow teammates to score. While power hitters who frequently hit HRs tend to have high RBI totals, their RBI count also depends on whether there are runners on base when they come up to bat.
As a result, examining the changes in these two variables individually and together can be important in understanding the hitting power of the players and their contributions to the team. Before looking at these changes, what you will notice in your initial analysis is that since not every player can hit HRs all the time, zero will be dominant in these vectors. Let’s work with values greater than zero for these two variables.
2.1. Generate a subset with the non-zero values of HR and RBI variables
2.2. Use a graph that shows how RBI and HR move together
2.3. What is the correlation between these two variables? Remember: Correlation measures the strength and direction of the relationship between two variables—in this case, how home runs (HR) and runs batted in (RBI) move together. If the correlation is strong and positive, it means that players who hit more home runs also tend to drive in more runs.
Question 3 (2 points)
Consider a linear model where you aim to explain changes in the RBI with HR. How effective do you think a simple regression line is in capturing this variability?
Now, let’s take a different approach. This time, we will run a separate linear model for each player. One way to achieve this is by using the lmList
function from the lme4
library as follows:
You will see numerous warnings when running the code below—that’s intentional. In the following questions, we will refine the results further. |
library(lme4)
model_each <- lmList(RBI ~ HR|playerID, data = Batting)
summary(model_each)
The output of this model gives all parameter estimates for some players, while for others, it gives NaN
or NA
values.
There may be different reasons for this. Before reading the next paragraph, pause for a while and think about what can cause it.

One of the reasons can be that some players might not have enough data points to estimate both the slope (HR coefficient) and intercept.
If a player has only one data point, then the regression line is undefined because there is no variation in HR to estimate a slope.
If a player has exactly two data points, the regression line is perfectly determined, meaning the standard error cannot be estimated,
which can result in NaN
.
Additionally, if a player’s HR values do not vary at all, the model cannot compute a meaningful slope, leading to NA
values for the coefficient. In cases where RBI and HR have a perfect linear relationship, the regression model may achieve a perfect fit, resulting in undefined or extremely large standard errors. Finally, missing data in either HR or RBI can cause certain players to have too few valid observations, preventing proper coefficient estimation and leading to NA
or NaN
values in the results.
3.1. Run the linear model showing the change in the RBI variable with HR. What do you think about this model explaining this variability with a simple regression line?
3.2. For the second model, which includes one regression line for each player, remove players with one or two data points before modeling for potential fixes for NA
and NaN
values. Then, for handling cases with zero variance in HR, filter out players where HR does not vary. (Hint: var(HR, na.rm = TRUE) > 0
)
Run the second model one more time and see how the intercept and slope vary around players.
Here are some more examples:
Some More Visualization Example with Lahman Data:
Question 4 (2 points)
Now, let’s discuss the previous models a little bit. Explaining this data with a regression model is too optimistic because there is a lot of variability between players. If you examine it in more detail, you will even notice that there is variability between years and leagues. On the other hand, if we model each player, league, or season separately, we introduce a different type of modeling error known as overfitting or loss of generalizability. By fitting a separate model for each subgroup, we risk capturing noise rather than meaningful patterns, making our predictions unreliable for new or unseen data. Additionally, splitting the data this way reduces the number of observations per model, leading to unstable coefficient estimates and high variance in the results. To handle this type of repeated data, different methods will be needed. One approach is to incorporate a random effect, which captures the variability among players within the same model, alongside the regression line.
You can find the following page, built by Michael Freeman, very useful to visually understand the mixed model approach. Please spare five minutes to go over it before you move to the next step: mfviz.com/hierarchical-models/
For the ones who wants to learn more about mixed models, the following books have the downloadable pdf copy:
The mixed model approach will include both your fixed parameter and a subject-specific parameter with random effects. We have the model definition with a random intercept for each player as follows:
$RBI_{ij} = \beta_0 + \beta_1 HR_{ij} + u_{0j} + \epsilon_{ij}$
where:
-
$\beta_0 =$ Common starting point for all players (overall intercept)
-
$u_{0j} =$ Player-specific intercept deviation (random intercept)
-
$\beta_1 =$ Average effect of HR across all players (overall slope)
-
$\epsilon_{ij} =$ Error term (noise comes from modeling)
This formulation allows each player to have their own intercept in a single model while still maintaining an overall trend across all players.
Now, by using the lmer
function in the lme4
library, run a mixed model for the previous model. Here is the basic syntax for this:
lmer(RBI ~ HR + (1 | playerID), data = data)
(1 | playerID)
: This part adds random effects to your model for each player.
We can see the population (fixef
) and player-specific (ranef
) effects as follows in R:
# Fixed effects (population-level effects)
fixed_effects <- fixef(mixed_model)
fixed_effects
# Random effects (player-specific effects)
random_effects <- ranef(mixed_model)$playerID
head(random_effects) # View player-specific intercept adjustments
4.1. Run mixed effect model (Include players with at least 5 data points) with only random intercept for the same variables used in the linear model
4.2. Use the summary function to see the model results and find the population (fixef
) and player-specific (ranef
) effects
When you use summary for your mixed model, it will produce a familiar output where you can examine the fixed effects. The only difference from a linear model on this output is that random effects where show the magnitude of the variance among the players and how much residual variance there is. |
Question 5 (2 points)
Lahman
data includes the year variable, which generates the repeats for players, and it also includes a lot of information for modeling. Also, there are a lot of rule changes in baseball. So, it may be worthwhile to add year to account for these changes, too.
Depending on the aim of the research, if you assume that year has a consistent effect across all players, you can add it as a fixed effect. The interpretation for this model can be as follows: the coefficient for the year will show the general trend in RBI over time (e.g., a positive coefficient means RBI totals tend to increase over the years).
Or you may want to add the year as a random effect, which allows each year to have a different baseline effect, capturing year-to-year variability. With this approach, the model accounts for yearly fluctuations in RBI production that aren’t explained by HRs alone.
5.1. Add year as a fixed effect into the mixed model previously used in question 4.1.
5.2. Add year as a random effect into the mixed model previously used in question 4.1.
More Resources - Lahman Data:
-
Home page for Lahman Data: seanlahman.com
-
Lahman explains the data: www.youtube.com/watch?v=tS_-oTbsDzs
-
Society for American Baseball Research: sabr.org/lahman-database/
-
R CRAN Page: cran.r-project.org/web/packages/Lahman/index.html
-
Manual in CRAN: cran.r-project.org/web/packages/Lahman/Lahman.pdf
Submitting your Work
Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.
-
firstname_lastname_project6.ipynb
You must double check your You will not receive full credit if your |