TDM 10100: Project 9 - Visualizations 1

Project Objectives

Motivation: You can work really hard to come up with quality results from working with data. Making your results visual is especially important. This allows you to share what you have learned with others who might not have otherwise understood what it is you have done.

Context: Plotting in R can get quite complex and stylistic, but before diving into customization, it is essential to know how to make the basic plots.

Scope: R, plots, barplots, box and whisker plots

Learning Objectives

Gain a better understanding of how to plot in R
Practice a variety of plotting methods
Expand knowledge of what is possible in plotting

Dataset

/anvil/projects/tdm/data/death_records/DeathRecords.csv

Please use 4 Cores for this project.

If AI is used in any way, such as for debugging or research, we now require that you submit a link to the entire chat history. For example, if you used ChatGPT, there is a “Share” option in the conversation sidebar. Click “Create Link” and add the shareable link as part of your citation.

The project template in the Examples Book now has a “Link to AI Chat History” section; please have this included in all your projects. If you did not use any AI tools, you may write “None”.

We allow using AI for learning purposes; however, all submitted materials (code, comments, and explanations) must be your own work and in your own words. No content or ideas should be directly applied or copied and pasted into your projects. Please refer to the-examples-book.com/projects/fall2025/syllabus#guidance-on-generative-ai. Failing to follow these guidelines is considered academic dishonesty.

Death Records

Each year, the Centers for Disease Control and Prevention (CDC) publishes the most comprehensive data on mortality in the United States through the National Vital Statistics System. The dataset contains records of all deaths nationwide between 2005 and 2015, with detailed information on causes of death and demographic characteristics of the individuals.

The dataset used for this project contains information from the year 2014 about the mortality data records released by the CDC. This dataset is thorough and contains 38 columns and 2631171 rows of death record data. The columns cover everything from date of death to personal details (age, sex, marital status, etc.) to manner of death and more. Some of the columns may not make a lot of sense and that is OK. Feel free to explore this dataset as much as you would like to gain a better understanding.

Many of the columns contain only encoded values. There are accompanying datasets that will help you understand what is actually being shown here. Some examples of these decoder files are provided in Question 1.

Some of the columns from myDF we will be working with include:

MonthOfDeath: numerical month of death
Age: age at time of death
MaritalStatus: initial representing encoded marital status
DayOfWeekOfDeath: numerical day of the week
Race: numerical encoded value representing the person’s race

An example of what the decoder files may contain is:

Code: numerical value shown in the data
Description: actual character meaning represented by the Code value.
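
As a minimal sketch of how a Code/Description decoder can be applied, the toy objects below stand in for a decoder file and an encoded column. The names toy_decoder, toy_codes, and toy_labels, and the codes and descriptions themselves, are invented for illustration and are not the real contents of any decoder file.

```r
# Toy decoder (hypothetical values, not the real decoder contents)
toy_decoder <- data.frame(
  Code = c(1, 2, 3),
  Description = c("Category A", "Category B", "Category C"),
  stringsAsFactors = FALSE
)

# Toy encoded column, as it might appear in the main data
toy_codes <- c(2, 1, 3, 2)

# match() finds the decoder row for each code;
# indexing Description by that position returns the readable label
toy_labels <- toy_decoder$Description[match(toy_codes, toy_decoder$Code)]
toy_labels
```

Looking codes up this way leaves the original data untouched, which is convenient when you only need the labels for a plot.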

Questions

Question 1 (2 points)

Read in the Death Records dataset as myDF from /anvil/projects/tdm/data/death_records/DeathRecords.csv. This dataset is the main source of the death records data, but there are a few different dataset files that go along with myDF that contain the decoder data keys for what the values in some of the columns mean.

Read in the following datasets to ensure correct interpretation of the values from the specific columns Race, MaritalStatus, and DayOfWeekOfDeath:

/anvil/projects/tdm/data/death_records/Race.csv
/anvil/projects/tdm/data/death_records/MaritalStatus.csv
/anvil/projects/tdm/data/death_records/DayOfWeekOfDeath.csv

These datasets describe what each code value in the corresponding column of myDF represents. A table of the Race column in myDF shows that it has only a handful of unique values. Use the Race.csv dataset to find out what these numerical values mean. The number of occurrences of each value also varies greatly: some values appear only a few hundred times, while others appear tens of thousands of times or more.

Make a plot of this table to help visualize this. To make a very basic plot in R, use plot(). The result is typically a scatter plot, with a dot or circle representing each value being plotted.
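
The idea of tabulating a column and then plotting the table can be sketched on toy data. The vector toy_race below is a made-up stand-in for an encoded column, not real values from myDF. Note that when plot() is given a one-dimensional table, it draws one vertical line per category rather than scattered points.

```r
# Toy codes standing in for an encoded column (hypothetical values)
toy_race <- c(1, 1, 1, 1, 2, 2, 3)

# table() counts how many times each code occurs
toy_counts <- table(toy_race)
toy_counts

# plot() on a one-dimensional table draws one vertical line per category
plot(toy_counts,
     xlab = "Code",
     ylab = "Number of records",
     main = "Record Counts by Code (toy data)")
```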

Learn about expanding your plotting abilities in base R with this helpful site here. The author creates many copies of the same plot, each time customizing it further to help with look and readability. While you might not do many customizations with this plot, this knowledge can come in handy in the later questions and future projects.

Deliverables

1.1 What did you learn about the Race column through the decoder?
1.2 Look at some of the decoder files other than those mentioned above. Given what you have learned about what the codes mean, what other data from this dataset might be interesting for you to work with?
1.3 Which race had the smallest number of entries?

Question 2 (2 points)

The Race category coded as '1' appears 2241510 times. Compared to other categories with counts such as 700, 4913, or even 13297, this value is disproportionately high, making it difficult to visualize alongside the others on the same scale.

Create a new data frame containing just the rows where the Race column is not '1'. Tabulating its Race column with table() then gives the counts for the remaining categories:

save_df <- myDF[myDF$Race != 1, ]

Use either plot() or dotchart() to visualize these counts. Which method do you find more insightful?

Think about your new visualization. How many more values do you want to exclude until this is a good visualization to interpret? The '2' race values? '3'? More? Less? Try some of these and explain your final decision.

While it is almost always a good idea to include as much data as was provided when making a comparison, such as comparing how many death records of each race there were, it is sometimes the case that removing data actually helps. If you look at your first plot from Question 1, it is not hard to see that you cannot really tell what is going on with the lower values because the '1' race has extended the upper y-limit so high that everything else is small by comparison.

When you do end up removing the '1' and other values to limit which races are shown, be sure to document this. Write a comment in your notebook along the lines of "I removed this value because ..." and briefly explain your reasoning. When you are finishing these plots, it is also good practice to state which values you excluded in the title or labels, like "Death Record Counts by Race (Excluding ...)". If these plots are just for you, this gives you a quick reference when you come back later and wonder what you plotted. If these plots are for sharing data findings, even better, because these labels help others who are not in your head to interpret the plot.
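
A minimal sketch of this documentation habit, on toy data, might look like the following. The data frame toy_df and its counts are invented for illustration; the comment and the title show where the exclusion is recorded.

```r
# Toy data: code 1 dominates the other codes (hypothetical counts)
toy_df <- data.frame(Race = c(rep(1, 50), rep(2, 6), rep(3, 3), rep(4, 1)))

# I removed code 1 because its count dwarfs every other category,
# which makes the smaller categories impossible to compare.
# drop = FALSE keeps the result a data frame even with a single column
toy_subset <- toy_df[toy_df$Race != 1, , drop = FALSE]
toy_counts <- table(toy_subset$Race)

# dotchart() expects a plain numeric vector, so convert the table
# and pass the category names as labels
dotchart(as.numeric(toy_counts),
         labels = names(toy_counts),
         xlab = "Number of records",
         main = "Record Counts by Race Code (Excluding Code 1, toy data)")
```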

Deliverables

2.1 How did you make your plot(s) unique to what you know up to this point about plotting?
2.2 How have you ensured your plot(s) are readable for others who might not have as much in-depth knowledge about what you have done here?
2.3 What race(s) did you not include in your final plot and what is your reasoning that follows this?

Instead of omitting any values, think about how you could transform or adjust the visualization to make all categories visible and interpretable.

Try using a logarithmic scale on the y-axis, or plot the proportions of each race instead of the raw counts. Let’s experiment together to see how a log scale changes the way we interpret this data:

race_counts <- table(myDF$Race)
barplot(log10(race_counts), main = "Log-scaled Death Record Counts by Race")

A log scale can be an effective way to communicate the differences between groups without removing any data.
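
The proportions approach mentioned above can be sketched in the same way. The vector toy_race is made-up data; prop.table() converts counts to shares that sum to 1. As an alternative to transforming the counts yourself, barplot() also accepts log = "y", which keeps the axis labels in the original units.

```r
# Toy codes with one dominant category (hypothetical values)
toy_race <- c(rep(1, 90), rep(2, 7), rep(3, 3))
toy_counts <- table(toy_race)

# prop.table() divides each count by the total, giving proportions
toy_props <- prop.table(toy_counts)
toy_props

barplot(toy_props,
        xlab = "Code",
        ylab = "Proportion of records",
        main = "Share of Records by Code (toy data)")

# Same counts on a logarithmic y-axis; tick labels stay in raw counts
barplot(toy_counts,
        log = "y",
        xlab = "Code",
        ylab = "Number of records (log scale)",
        main = "Record Counts by Code, Log Axis (toy data)")
```

Proportions make the imbalance explicit but do not make the small bars taller; the log axis is the option that spreads the small categories out.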

Question 3 (2 points)

The MaritalStatus column tracks what relationship stage each person was in at the time of death. Please make sure you know what the letters in this column represent before continuing. Additionally, the values in the Age column range from 1 to 999. Obviously, this 999 is an error code number, not a person who was actually that age. But let’s continue with it for now.

boxplot() is well suited for exploring the relationship between one categorical and one numerical variable. Create a boxplot() that shows how the different marital statuses compare to the ages of the people who have died:

boxplot(Age ~ MaritalStatus, data = myDF)

Box-and-whisker plots can be confusing to read, even if you are familiar with what is being shown. Check out this resource for a bit of help here.

It is easy to see where the outliers are in this plot. With the 999 value being so much higher than the other (actual) ages, the rest of the plot gets squished down so it is not very useful.

Filter out the ages of myDF where they are 999, and save this as cleanDF. With this new cleanDF, make a boxplot to show the reasonable ages and the marital status of the people in the death records.

Take a specific age range (including at least 40 years of ages within the range) from the actual ages of the people who have died, and make a boxplot to show this against the marital statuses.
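
The two filtering steps above can be sketched on a small toy data frame. The names toy_df, clean_toy, and range_toy, the ages, and the 40 to 80 range are all chosen for illustration only; your own range and data will differ.

```r
# Toy records; 999 plays the role of the error code (hypothetical values)
toy_df <- data.frame(
  Age = c(34, 67, 82, 999, 45, 90, 999, 71),
  MaritalStatus = c("S", "M", "W", "M", "D", "W", "S", "M"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose Age is not the error code
clean_toy <- toy_df[toy_df$Age != 999, ]

boxplot(Age ~ MaritalStatus, data = clean_toy,
        xlab = "Marital status code",
        ylab = "Age at death",
        main = "Age by Marital Status (999 removed, toy data)")

# Restrict further to an age range; both bounds are inclusive
range_toy <- clean_toy[clean_toy$Age >= 40 & clean_toy$Age <= 80, ]

boxplot(Age ~ MaritalStatus, data = range_toy,
        xlab = "Marital status code",
        ylab = "Age at death",
        main = "Age by Marital Status (ages 40 to 80, toy data)")
```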

Deliverables

3.1 Compare your boxplot of all of the ages (with the 999 value) vs the boxplot of the actual ages without the 999 value
3.2 Explain (to your understanding) how the boxplot of the specific age range relates to the boxplot tracking the marital status across all of the ages

Question 4 (2 points)

Make a boxplot that is very similar to that which you just made, except only for the people whose marital status is "M" (married) OR "W" (widowed). Your plot should have two "boxes", with distinct ways to easily tell the marital status "boxes" apart.

Be sure to continue removing the 999 value from the Age column here.

For this boxplot, add proper title, axis labels, and colors. Any additional customizations you want to add are welcome.
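
As one possible sketch of a two-box plot with a title, axis labels, and colors, the example below uses a toy data frame. The data, the color names, and the display names passed to names are illustrative choices, not requirements.

```r
# Toy records (hypothetical values); 999 is the error code
toy_df <- data.frame(
  Age = c(55, 62, 70, 88, 91, 79, 40, 999, 66, 84),
  MaritalStatus = c("M", "M", "M", "W", "W", "W", "S", "M", "D", "W"),
  stringsAsFactors = FALSE
)

# Keep real ages only, and only the statuses "M" or "W"
toy_mw <- toy_df[toy_df$Age != 999 &
                 toy_df$MaritalStatus %in% c("M", "W"), ]

# Boxes are drawn in alphabetical order of the codes: "M" first, then "W",
# so the colors and display names are listed in that same order
boxplot(Age ~ MaritalStatus, data = toy_mw,
        col = c("steelblue", "tomato"),
        names = c("Married", "Widowed"),
        xlab = "Marital status",
        ylab = "Age at death",
        main = "Age at Death: Married vs Widowed (toy data)")
```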

People can die at any age, but an older person is more likely than a younger person to be recorded as widowed in their death record. The same is likely true for any status besides single, depending on the age range you are looking at.

So, how do you compare the marital statuses across a certain age? You can plot it.

Filter the data so you are only working with the marital statuses Married and Widowed, and only the people who were 60. Try out a few different ages to figure out which you would like to use to make a barplot with here. It is up to you for the age, but this should still use just these two marital statuses.

Now working across all of the marital statuses, make a barplot comparing the marital status of each of the 70-year-olds in the Death Records.

How does this plot compare to a barplot of 60-year-olds across all marital statuses? What about for 80-year-olds? Does the quantity of people in each marital status category shift consistently across the different ages?
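
A barplot for a single age can be sketched by filtering to that age and tabulating the statuses. The toy data frame below is invented for illustration; swapping the 70 for another age gives the comparison plots.

```r
# Toy records (hypothetical values)
toy_df <- data.frame(
  Age = c(60, 60, 60, 70, 70, 70, 70, 80, 80, 80),
  MaritalStatus = c("M", "M", "D", "M", "W", "M", "S", "W", "W", "M"),
  stringsAsFactors = FALSE
)

# Keep only the 70-year-olds, then count each marital status
age70 <- toy_df[toy_df$Age == 70, ]
counts70 <- table(age70$MaritalStatus)
counts70

barplot(counts70,
        xlab = "Marital status code",
        ylab = "Number of records",
        main = "Marital Status of 70-Year-Olds (toy data)")
```

When comparing plots across ages, keep in mind that a status with zero records at one age will simply be missing from that age's table, so the bars may not line up between plots.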

Deliverables

4.1 Explain some of what is shown in your Married vs Widowed people (all ages) boxplot to the best of your knowledge
4.2 Barplot comparing the people who were 60 and were either Married or Widowed. Make at least one other barplot for a different age and explain what you learned from the two
4.3 At least three barplots. Use all of the marital statuses, and have one barplot for the 70-year-olds, the 80-year-olds, and the 60-year-olds each. How are the marital statuses distributed across these plots?

Question 5 (2 points)

Take a look at the DayOfWeekOfDeath column. This column contains numerical values for each day of the week, and has the number 9 to represent any unentered or error days. Sometimes it is nice to have the text names stored in place of the numbers. But we don’t know if their day-of-the-week system follows the 'Sunday-Saturday' or 'Monday-Sunday' week system by only looking at that column. However, there is a separate csv file for it in the data. So, we can find this out from the DayOfWeekOfDeath.csv decoder dataset:

day_of_week_of_death <- read.csv("/anvil/projects/tdm/data/death_records/DayOfWeekOfDeath.csv")

In R, the merge() function takes two data frames as input and joins their rows on a shared key column, which lets you attach columns from one to the other. We are going to use both datasets (DeathRecords.csv and DayOfWeekOfDeath.csv) to add a column of day-of-week names to the death records.

Since these two columns have different names depending on which dataset they’re from:

DeathRecords.csv: DayOfWeekOfDeath
DayOfWeekOfDeath.csv: Code

You should specify the names of the columns you are merging from both datasets:

my_temp <- merge(myDF, day_of_week_of_death, by.x = "DayOfWeekOfDeath", by.y = "Code", all.x = TRUE)

This helpful page shows a good base example of what a merge() function can look like: here. The columns they’re merging in the example share a name. Ours do not, so you should use by.x and by.y to specify which columns share the same values from both of the datasets you’re using.

One way to double check your work after merging is to make a table comparing your numerical DayOfWeekOfDeath column with your new column containing the day of the week names. Each row and each column of this table should contain exactly one non-zero value, which shows that every numerical code maps to exactly one day name.
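
This check can be sketched on toy data. The data frames below are invented stand-ins, the placeholder labels "Day A" through "Day C" deliberately avoid claiming which code is which real day, and the column name Description is assumed from the decoder layout described earlier.

```r
# Toy records and toy decoder (hypothetical values)
toy_records <- data.frame(DayOfWeekOfDeath = c(1, 2, 2, 3, 9, 1))
toy_decoder <- data.frame(
  Code = c(1, 2, 3, 9),
  Description = c("Day A", "Day B", "Day C", "Unknown"),
  stringsAsFactors = FALSE
)

# Join on the key columns, which have different names in each data frame;
# all.x = TRUE keeps every record even if its code has no decoder row
toy_merged <- merge(toy_records, toy_decoder,
                    by.x = "DayOfWeekOfDeath", by.y = "Code",
                    all.x = TRUE)

# Cross-tabulate code against name: one non-zero cell per row and column
check <- table(toy_merged$DayOfWeekOfDeath, toy_merged$Description)
check
```

If a row of the table had two non-zero cells, one code would be mapping to two names, which would point to a problem in the merge.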

Show the table comparing the month of death by the names of the days of the week. Go ahead and visualize this table in a barplot. Then, filter the day of the week of death to only compare the days Monday and Friday to all of the different months.
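
A two-way table and its barplot can be sketched as follows. The toy data frame and the column name DayName are assumptions made for illustration; in your merged data the name column will be whatever the decoder provided.

```r
# Toy records with a month number and a day name (hypothetical values)
toy_df <- data.frame(
  MonthOfDeath = c(1, 1, 2, 2, 2, 3, 3, 1),
  DayName = c("Monday", "Friday", "Monday", "Monday",
              "Friday", "Friday", "Tuesday", "Monday"),
  stringsAsFactors = FALSE
)

# Keep only the Monday and Friday records
toy_mf <- toy_df[toy_df$DayName %in% c("Monday", "Friday"), ]

# Rows are day names, columns are months
toy_tab <- table(toy_mf$DayName, toy_mf$MonthOfDeath)
toy_tab

# beside = TRUE draws one bar per day within each month;
# legend.text = TRUE labels the bars using the table's row names
barplot(toy_tab, beside = TRUE, legend.text = TRUE,
        xlab = "Month of death",
        ylab = "Number of records",
        main = "Monday vs Friday Records by Month (toy data)")
```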

Make sure to label this and all other visualizations in this project with a title, axis labels, and any other customizations needed to fully interpret what you are trying to show.

Deliverables

5.1 What does merge() do and how are you using it in this question?
5.2 Barplot of the table of the MaritalStatus column by the column containing the days of the week (now with name labels)
5.3 Make a plot similar to the days of the week by the counts of each marital status, but using the numbers of the months instead of the days of the week.

Submitting your Work

Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.

Items to submit

firstname_lastname_project9.ipynb

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not. Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.