TDM 10100: Project 6 — Fall 2023

Motivation: We want to have fun and get used to the function tapply.

Make sure to read about and use the template found here, and to read the important information about project submissions here.

Dataset(s)

The questions below use the following dataset(s):

  • /anvil/projects/tdm/data/olympics/athlete_events.csv

  • /anvil/projects/tdm/data/death_records/DeathRecords.csv

Questions

Question 1 (1.5 pts)

(We do not need the tapply function for Question 1)

For this question, please read the dataset

/anvil/projects/tdm/data/olympics/athlete_events.csv

into a data frame called myDF as follows:

myDF <- read.csv("/anvil/projects/tdm/data/olympics/athlete_events.csv", stringsAsFactors=TRUE)

  1. Use the table function to list all Games with occurrences in this data frame.

  2. Use the table function to list all countries participating in the Olympics during the year 1980. (The output should exclude all countries that did not have any athletes in 1980.)

  3. Use the subset function to create a new data frame containing the data for athletes who attended the Olympics more than once.

(Use the original data frame myDF as a starting point for each of these three questions. Problems 1a and 1b and 1c are independent of each other. For instance, when you solve question 1c, do not restrict yourself to the year 1980.)
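
Because the data frame is read with stringsAsFactors=TRUE, a table built from a subset of a factor column still lists every level, including those with a count of 0. The sketch below shows two possible ways to leave those out. It uses a made-up data frame; toyDF, team, and year are hypothetical names and are not columns from the Olympics data.

```r
# Toy data frame (hypothetical values, not the Olympics data)
toyDF <- data.frame(
  team = factor(c("A", "B", "A", "C", "B", "A")),
  year = c(1980, 1980, 1984, 1984, 1980, 1980)
)

# table() on a subset of a factor still reports unused levels with a count of 0
counts_all <- table(toyDF$team[toyDF$year == 1980])

# Option 1: drop the unused levels before tabulating
counts_dropped <- table(droplevels(toyDF$team[toyDF$year == 1980]))

# Option 2: tabulate first, then keep only the positive entries
counts_positive <- counts_all[counts_all > 0]

print(counts_all)
print(counts_dropped)
print(counts_positive)
```

Either option gives the same counts here; which one reads better is a matter of taste.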

For question 1c, use duplicated to identify duplicated elements. For example:

vec <- c(3, 2, 6, 5, 1, 1, 1, 6, 5, 6, 4, 3)
duplicated(vec)
FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
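
Note that duplicated marks only the second and later occurrences of a value, so the first occurrence is FALSE. One possible way to keep every row whose key appears more than once is sketched below on a made-up data frame; toyDF, id, and year are hypothetical names. In the real data an athlete may have several rows for a single Games, so which columns to check for duplicates is a choice left to you.

```r
# Toy data frame (hypothetical values): id 2 appears twice, id 4 three times
toyDF <- data.frame(
  id = c(1, 2, 2, 3, 4, 4, 4),
  year = c(1980, 1980, 1984, 1984, 1980, 1984, 1988)
)

# duplicated() flags only the repeats, so collect the ids that repeat at all
repeat_ids <- unique(toyDF$id[duplicated(toyDF$id)])

# Keep every row for those ids, including each id's first occurrence
repeatDF <- subset(toyDF, id %in% repeat_ids)
print(repeatDF)
```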

Question 2 (1.5 pts)

Use the tapply command to solve each of these questions:

  1. What is the average age of the participants from each country?

  2. What is the maximum height for each sport? For your output on this question, please sort the maximum heights in decreasing order and display the first 5 values.
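
As a reminder of the general pattern, tapply takes a vector of values, a vector of groups, and a function to apply within each group. The sketch below is a minimal illustration on a made-up data frame (toyDF, group, and value are hypothetical names). It assumes missing values should be ignored, which is why na.rm = TRUE is passed through to mean and max.

```r
# Toy data frame (hypothetical values) with one missing value
toyDF <- data.frame(
  group = c("x", "y", "x", "z", "y", "z"),
  value = c(10, 20, NA, 5, 40, 15)
)

# Mean of value within each group; extra arguments are passed on to mean()
group_means <- tapply(toyDF$value, toyDF$group, mean, na.rm = TRUE)

# Maximum of value within each group
group_max <- tapply(toyDF$value, toyDF$group, max, na.rm = TRUE)

# Sort the group maxima from largest to smallest and show the first 2
top_two <- head(sort(group_max, decreasing = TRUE), n = 2)

print(group_means)
print(top_two)
```

Without na.rm = TRUE, any group containing an NA would return NA for its mean or maximum.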

Question 3 (1 pt)

For this question, save the data from the data set

/anvil/projects/tdm/data/death_records/DeathRecords.csv

into a new data frame called myDF as follows:

myDF <- read.csv("/anvil/projects/tdm/data/death_records/DeathRecords.csv", stringsAsFactors = TRUE)

It might be helpful to get an overview of the structure of the data frame, by using the str() function:

str(myDF)

  1. How many observations (i.e., rows) are given in this data frame?

  2. Change the column MonthOfDeath from month numbers to month names.

  3. How many people died (altogether) during each month? For instance, group together all of the deaths in January, all of the deaths in February, etc., so that you can display the totals from January to December as 12 output values.

You can convert the month numbers to a factor and then rename its levels to the month names in calendar order:

month_order <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
myDF$MonthOfDeath <- factor(myDF$MonthOfDeath)
levels(myDF$MonthOfDeath) <- month_order
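
As a small illustration of turning month numbers into a factor in calendar order, the sketch below uses a made-up vector and sets the levels and labels explicitly, so that months with no observations still appear with a count of 0. The names toy_months and toy_factor are hypothetical; month.name is a constant built into R.

```r
# Toy vector of month numbers (hypothetical values); most months are absent
toy_months <- c(3, 1, 12, 3, 1, 3)

# Map 1..12 to January..December, keeping all twelve levels in calendar order
toy_factor <- factor(toy_months, levels = 1:12, labels = month.name)

# table() counts each level, including months that never occur
month_counts <- table(toy_factor)
print(month_counts)
```

Stating the levels explicitly is a design choice: the mapping from number to name then does not depend on which months happen to be present in the data.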

Question 4 (2 pts)

  1. For each race, what is the average age at the time of death? Use the race column, which has integer values, and sort your output in descending order.

  2. Now considering only data for females: for each race, what is the average age at the time of death? Now considering only data for males, we can ask the same question: for each race, what is the average age at the time of death?

If you want to see the list of race values from the CDC for this data, you can look at page 15 of this pdf file:

If you want to (this is optional!), you can use the method from question 3b to convert the integer values into the string values that describe each race.
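
One way to compute a grouped average for only part of the data is to subset first and then call tapply on the subset. The sketch below shows that pattern on a made-up data frame; toyDF, sex, grp, and age are hypothetical names and values, not the columns of the death records data. It also shows, as an alternative, that tapply accepts a list of two grouping vectors and returns a matrix.

```r
# Toy data frame (hypothetical values)
toyDF <- data.frame(
  sex = c("F", "M", "F", "M", "F", "M"),
  grp = c(1, 1, 2, 2, 1, 2),
  age = c(80, 70, 60, 50, 90, 65)
)

# Average age per group over all rows, sorted in descending order
overall_means <- sort(tapply(toyDF$age, toyDF$grp, mean), decreasing = TRUE)

# Restrict to one sex first, then average within each group
femaleDF <- subset(toyDF, sex == "F")
female_means <- tapply(femaleDF$age, femaleDF$grp, mean)

# Alternative: group by both columns at once (rows are grp, columns are sex)
both_means <- tapply(toyDF$age, list(toyDF$grp, toyDF$sex), mean)

print(overall_means)
print(female_means)
print(both_means)
```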

Question 5 (2 pts)

  1. Using the data set about the Olympic athletes, create a graph or plot that you find interesting. Write 1-2 sentences about something you found interesting about the data set; explain what you noticed in the dataset.

  2. Using the data set about the death records, create a graph or plot that you find interesting. Write 1-2 sentences about something you found interesting about the data set; explain what you noticed in the dataset.

Project 06 Assignment Checklist

  • Jupyter Lab notebook with your code and comments for the assignment

    • firstname-lastname-project06.ipynb.

  • R code and comments for the assignment

    • firstname-lastname-project06.R.

  • Submit files through Gradescope

Please double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, we recommend downloading your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.