STAT 19000: Project 4 — Fall 2021

Motivation: Control flow consists of the tools and methods that you can use to control the order in which instructions are executed. We can execute certain tasks or code only if certain requirements are met using if/else statements. In addition, we can perform operations many times using for loops. While these are important concepts to grasp, R differs from many other programming languages in that operations are usually vectorized, so there is often little to no need to write loops.

Context: We are gaining familiarity working in Jupyter Lab and writing R code. In this project we introduce and practice using control flow in R, while continuing to reinforce concepts from the previous projects.

Scope: R, data.frames, recycling, factors, if/else, for loops

Learning Objectives
  • Explain what "recycling" is in R and predict behavior of provided statements.

  • Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.

  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • List the differences between lists, vectors, factors, and data.frames, and when to use each.

  • Demonstrate a working knowledge of control flow in R: if/else statements, while loops, etc.

Make sure to read about and use the template found here, and read the important information about project submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/olympics/*.csv

Questions

Question 1

Winning an Olympic medal is an amazing achievement for athletes.

Before we take a deep dive into our data, it is always important to do some sanity checks, and understand what population our sample is representative of, particularly if we didn’t get a chance to participate in the data collection phase of the project.

Let's do a quick check on our dataset. We would expect that most athletes did not win a medal. What percentage of athletes did not get a medal in the olympics data.frame?

For simplicity, consider an "athlete" to be a row of the data.frame. Do not worry about the same athlete participating in different Olympic games or in different sports.

We are considering the combination of Sport, Event, and Games as a unique identifier for an athlete.

Relevant topics: is.na, mean, indexing, sum
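The idea behind combining is.na with mean can be sketched on a small, made-up vector. The vector name medal and its values are hypothetical, and the sketch assumes that a missing medal is recorded as NA, as the relevant topics suggest.

```r
# Hypothetical toy vector standing in for a medal column.
# Assumption: NA means that the row has no medal recorded.
medal <- c("Gold", NA, NA, "Bronze", NA)

# is.na() returns a logical vector. mean() treats TRUE as 1 and FALSE as 0,
# so the mean of the logical vector is the proportion of missing values.
prop_no_medal <- mean(is.na(medal))

# Convert the proportion to a percentage.
pct_no_medal <- 100 * prop_no_medal
pct_no_medal

# The same proportion computed with sum() and length().
sum(is.na(medal)) / length(medal)
```

The same pattern should carry over to a column of a data.frame, since a column is just a vector.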

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

A friend of yours hypothesized that there is some association between getting a medal and the athlete's age.

You want to test it out using our olympics data.frame. To do so, we will compare two new variables:

  1. (This question) An indicator of whether the athlete in that year and sport won a medal.

  2. (Next question) Age converted into age categories.

Create a new variable in your olympics data.frame called won_medal which indicates whether the athlete in that year and sport won a medal or not.

Relevant topics: is.na
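As a minimal sketch of building a logical indicator column, consider the toy data.frame below. The data.frame toy, its column names, and its values are made up for illustration, and the sketch again assumes that NA marks a row without a medal.

```r
# Hypothetical toy data.frame; the column name Medal is an assumption.
toy <- data.frame(
  Name  = c("A", "B", "C"),
  Medal = c("Silver", NA, "Gold")
)

# !is.na() is TRUE wherever a medal is recorded and FALSE wherever it is NA.
# Assigning to toy$won_medal creates the new column in one vectorized step.
toy$won_medal <- !is.na(toy$Medal)
toy
```

No loop is needed here because is.na and ! both operate on the whole column at once.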

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Now that we have our first new column, won_medal, let’s categorize the Age column. Using a for loop and if/else statements, create a new column in the olympics data.frame called age_cat. Use the guidelines below to do so.

  • "youth": less than 18 years old

  • "young adult": 18 to 25 years old

  • "adult": 26 to 35 years old

  • "middle age adult": 36 to 55 years old

  • "wise adult": greater than 55 years old

How many athletes are "young adults"?

Remember to consider the `NA`s as you are solving the problem.

Relevant topics: nrow, if/else, for loops, indexing, is.na
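The loop-and-branch pattern described above can be sketched on a short, made-up vector of ages. The vector ages is hypothetical, and the sketch assumes whole-number ages, so that "18 to 25" can be tested as "at most 25" once younger ages have been ruled out.

```r
# Hypothetical toy ages, including a missing value.
ages <- c(16, 22, NA, 30, 41, 60)

# Pre-allocate the result so the loop fills it in place.
age_cat <- character(length(ages))

for (i in seq_along(ages)) {
  if (is.na(ages[i])) {
    # Check for NA first: comparing NA inside if() would raise an error.
    age_cat[i] <- NA_character_
  } else if (ages[i] < 18) {
    age_cat[i] <- "youth"
  } else if (ages[i] <= 25) {
    age_cat[i] <- "young adult"
  } else if (ages[i] <= 35) {
    age_cat[i] <- "adult"
  } else if (ages[i] <= 55) {
    age_cat[i] <- "middle age adult"
  } else {
    age_cat[i] <- "wise adult"
  }
}

age_cat

# Count one category; na.rm = TRUE skips the NA comparisons.
n_young_adult <- sum(age_cat == "young adult", na.rm = TRUE)
n_young_adult
```

For a data.frame, the loop would typically run over seq_len(nrow(...)) and read the age from the relevant column.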

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • How many athletes are "young adults"?

Question 4

We did not need to use a for loop to solve the previous problem. Another way to solve the problem would have been to use a vectorized function called cut.

Create a variable called age_cat_cut using the cut function to solve the problem above.

To check that you are getting the same results, run the following commands.

If you used cut's labels argument in your code:

all.equal(as.character(age_cat_cut), olympics$age_cat)

If you didn’t use cut's labels argument in your code:

levels(age_cat_cut) <- c('youth', 'young adult', 'adult', 'middle age adult', 'wise adult')
all.equal(as.character(age_cat_cut), olympics$age_cat)

Note that by default cut treats the intervals as closed on the right. For example, if the breaks are c(a, b, c), the intervals will be (a, b] and (b, c].

You can use the labels argument in cut to label the categories similarly to what we did in question (3).
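The right-closed behavior and the labels argument can be illustrated on made-up values. The vector ages and the chosen breaks are assumptions for illustration. In particular, the breaks assume whole-number ages, so that "less than 18" is the same as "at most 17".

```r
# With default settings, a value equal to a break falls in the interval
# to its left: 2 lands in (0,2], not in (2,3].
small <- cut(c(1, 2, 3), breaks = c(0, 2, 3))
levels(small)
as.integer(small)

# Hypothetical toy ages, including a missing value.
ages <- c(16, 22, NA, 30, 41, 60)

# -Inf and Inf make the outer intervals open-ended.
# There is one label per interval, i.e. one fewer label than breaks.
age_cat_cut <- cut(
  ages,
  breaks = c(-Inf, 17, 25, 35, 55, Inf),
  labels = c("youth", "young adult", "adult", "middle age adult", "wise adult")
)

# cut() returns a factor; missing ages stay NA.
as.character(age_cat_cut)
table(age_cat_cut, useNA = "ifany")
```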

These past two questions do a good job of emphasizing the importance of vectorized functions. How long did it take you to run the solution to question (3) vs. question (4)? If you find yourself looping through the values of one or more columns one at a time, there is likely a better option.
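One way to compare run times is system.time, sketched here on simulated data. The simulated ages, the sample size, and the simplified two-category split are all assumptions chosen to keep the example short. Actual timings depend on the machine, so none are shown.

```r
# Simulated whole-number ages (hypothetical data, fixed seed for repeatability).
set.seed(19000)
ages <- sample(10:70, 50000, replace = TRUE)

# Time an element-by-element loop.
loop_time <- system.time({
  loop_cat <- character(length(ages))
  for (i in seq_along(ages)) {
    if (ages[i] < 18) {
      loop_cat[i] <- "youth"
    } else {
      loop_cat[i] <- "18 or older"
    }
  }
})

# Time the vectorized equivalent with cut().
cut_time <- system.time({
  cut_cat <- cut(ages, breaks = c(-Inf, 17, Inf),
                 labels = c("youth", "18 or older"))
})

# Both approaches should produce the same categories.
identical(loop_cat, as.character(cut_cat))

# Elapsed seconds for each approach.
loop_time["elapsed"]
cut_time["elapsed"]
```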

Relevant topics: cut

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Now that we have the new columns in the olympics data.frame, look at the data and write down your conclusions. Is there some association between winning a medal and the athlete's age?

There are a couple of ways you can look at the data to draw your conclusions. You can visualize the data with plotting functions like barplot and pie. Alternatively, you can use numeric summaries, like a table or a table of proportions (prop.table). Regardless of the method used, explain your findings, and feel free to get creative!

You do not need to use any special statistical test to make your conclusions. The goal of this question is to explore the data and think logically.

The argument margin may be useful if you use the prop.table function.
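The effect of margin can be sketched on a tiny, made-up data.frame. The data.frame toy and its values are hypothetical and are not meant to suggest any real result.

```r
# Hypothetical toy data: two age categories and a medal indicator.
toy <- data.frame(
  age_cat   = c("youth", "youth", "adult", "adult", "adult", "adult"),
  won_medal = c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# Two-way table of counts: rows are age categories, columns are won_medal.
counts <- table(toy$age_cat, toy$won_medal)
counts

# margin = 1 divides each count by its row total, so each row sums to 1.
# This gives the share of medal winners within each age category.
row_props <- prop.table(counts, margin = 1)
row_props

# margin = 2 divides by column totals instead, so each column sums to 1.
col_props <- prop.table(counts, margin = 2)
col_props
```

Row proportions are usually the more natural choice when the question is how the chance of a medal varies across age categories. A table like row_props can also be passed to barplot for a visual summary.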

Relevant topics: barplot, pie, indexing, table, prop.table, balloonplot

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.