STAT 19000: Project 3 — Fall 2021

Motivation: data.frames are the primary data structure you will work with when using R. It is important to understand how to insert, retrieve, and update data in a data.frame.

Context: In the previous project we got our feet wet, ran our first R code, and learned about accessing data inside vectors. In this project we will continue to reinforce what we’ve already learned and introduce a new, flexible data structure called `data.frame`s.

Scope: r, data.frames, recycling, factors

Learning Objectives
  • Explain what "recycling" is in R and predict behavior of provided statements.

  • Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.

  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • List the differences between lists, vectors, factors, and data.frames, and when to use each.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/olympics/*.csv

Questions

Question 1

Use R code to print the names of the two datasets in the /depot/datamine/data/olympics directory.

Read the larger dataset into a data.frame called olympics.

Print the first 6 rows of the olympics data.frame, and take a look at the columns. Based on that, write 1-2 sentences describing the dataset (how many rows, how many columns, the type of data, etc.) and what it holds.

Relevant topics: list.files, file.info, read.csv, head

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • 1-2 sentences explaining the dataset.

Question 2

How many unique sports are accounted for in our olympics dataset? Print a list of the sports. Is there any sport that you weren’t expecting? Why or why not?

R is a case-sensitive language. What this means is that whether or not 1 or more letters in a word are capitalized is important. For example, the following two variables are different.

vec <- c(1,2,3)
Vec <- c(3,2,1) # note the capital "V" in our variable name

print(vec) # will print: 1,2,3
print(Vec) # will print: 3,2,1

So, when you are examining a data.frame and you see a column name that starts with a capital letter, it is critical that you use the same capitalization when trying to access said column.

colnames(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
iris$"sepal.Length" # will NOT work
iris$"Sepal.length" # will NOT work
iris$"Sepal.Length" # will work

Relevant topics: unique, length

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • 1-2 sentences explaining the results.

Question 3

Create a data.frame called us_athletes that contains only information on athletes from the USA. Use the column NOC (National Olympic Committee 3-letter code). How many rows does us_athletes have?

Now, perform the same operation on the olympics data.frame, this time containing only the information on the athletes from the country of your choice. Name this new data.frame appropriately. How many rows does it have?

Now, create a data.frame called both that contains the information on the athletes from the USA and the country of your choice. How many rows does it have?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • How many rows or athletes in the us_athletes dataset?

  • How many rows or athletes in the other country’s dataset?

  • How many rows or athletes in the both dataset?

Question 4

What percentage of US athletes are women? What percentage of US athletes with gold medals are women?

Answer the same questions for your "other" country from question (3).

Relevant topics: prop.table, table, indexing

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

What is the oldest US athlete to compete based on our us_athletes data.frame? At what age, in which sport, and what year did the athlete compete in?

Answer the same questions for your "other" country from question (3) and question (4).

Make sure you using indexing to only print the athlete’s information (age, sport, year).

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • Age, sport, and olympics year that the oldest athlete competed in, for each of your countries.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.