# STAT 19000: Project 3 — Fall 2021

Motivation: The `data.frame` is the primary data structure you will work with when using R. It is important to understand how to insert, retrieve, and update data in a `data.frame`.

Context: In the previous project we got our feet wet, ran our first R code, and learned about accessing data inside vectors. In this project we will continue to reinforce what we’ve already learned and introduce a new, flexible data structure called the `data.frame`.

Scope: R, data.frames, recycling, factors

Learning Objectives
• Explain what "recycling" is in R and predict behavior of provided statements.

• Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.

• Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

• Read and write basic (csv) data.

• Explain and demonstrate: positional, named, and logical indexing.

• List the differences between lists, vectors, factors, and data.frames, and when to use each.
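
As a quick illustration of three of the objectives above (recycling, missing data, and the three kinds of indexing), here is a minimal sketch. All values and names in it are made up for the example and are not taken from the project dataset.

```r
# Recycling: the shorter vector is repeated to match the longer one.
x <- c(1, 2, 3, 4)
y <- c(10, 20)
recycled <- x + y   # same as c(1, 2, 3, 4) + c(10, 20, 10, 20)
print(recycled)

# Missing data: NA propagates through most computations unless removed.
ages <- c(21, NA, 35)
mean_with_na <- mean(ages)                   # NA, because one value is missing
mean_without_na <- mean(ages, na.rm = TRUE)  # mean of 21 and 35
print(mean_with_na)
print(mean_without_na)

# Three ways to index the same (made-up) named vector.
medals <- c(gold = 3, silver = 1, bronze = 2)
by_position <- medals[1]           # positional indexing
by_name <- medals["silver"]        # named indexing
by_logical <- medals[medals >= 2]  # logical indexing
print(by_logical)
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which is usually a sign that the vectors were not meant to be combined that way.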

Make sure to read about and use the template found here, and read the important information about project submissions here.

## Dataset(s)

The following questions will use the following dataset(s):

• `/depot/datamine/data/olympics/*.csv`

## Questions

### Question 1

Use R code to print the names of the two datasets in the `/depot/datamine/data/olympics` directory.

Read the larger dataset into a data.frame called `olympics`.

Print the first 6 rows of the `olympics` data.frame, and take a look at the columns. Based on that, write 1-2 sentences describing the dataset (how many rows, how many columns, the type of data, etc.) and what it holds.
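
The general workflow (list the files in a directory, read one into a data.frame, then inspect it) can be sketched as follows. This is only an illustration under stated assumptions: it builds two small made-up CSV files in a temporary directory, and the names `demo_dir`, `small.csv`, `large.csv`, and `demo` are hypothetical rather than part of the project data.

```r
# Build a throwaway directory with two small CSV files (made-up data).
demo_dir <- file.path(tempdir(), "demo_csvs")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:3, sport = c("Judo", "Golf", "Judo")),
          file.path(demo_dir, "small.csv"), row.names = FALSE)
write.csv(data.frame(id = 1:10, sport = rep(c("Judo", "Golf"), 5)),
          file.path(demo_dir, "large.csv"), row.names = FALSE)

# List the CSV files, then compare their sizes in bytes to find the larger one.
csv_files <- list.files(demo_dir, pattern = "\\.csv$")
sizes <- file.size(file.path(demo_dir, csv_files))
larger <- csv_files[which.max(sizes)]
print(csv_files)

# Read the larger file into a data.frame and inspect it.
demo <- read.csv(file.path(demo_dir, larger))
print(head(demo))  # first 6 rows by default
print(dim(demo))   # number of rows, then number of columns
str(demo)          # column names and types
```

The same functions should apply to the project directory by swapping in its path; `dim` and `str` give the row count, column count, and column types you need for the written description.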

Items to submit
• Code used to solve this problem.

• Output from running the code.

• 1-2 sentences explaining the dataset.

### Question 2

How many unique sports are accounted for in our `olympics` dataset? Print a list of the sports. Is there any sport that you weren’t expecting? Why or why not?

 R is a case-sensitive language: whether or not one or more letters in a name are capitalized matters. For example, the following two variables are different.

```r
vec <- c(1,2,3)
Vec <- c(3,2,1) # note the capital "V" in our variable name
print(vec) # will print: [1] 1 2 3
print(Vec) # will print: [1] 3 2 1
```

So, when you are examining a `data.frame` and you see a column name that starts with a capital letter, it is critical that you use the same capitalization when trying to access said column.

```r
colnames(iris)
```

```
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
```

```r
iris$"sepal.Length" # will NOT work
iris$"Sepal.length" # will NOT work
iris$"Sepal.Length" # will work
```

Relevant topics: unique, length
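
As a small illustration of the two relevant functions, the sketch below counts distinct values in a made-up character vector. The vector `sports` is hypothetical; in the project you would pass a column of the `olympics` data.frame instead.

```r
# A made-up vector with repeated values.
sports <- c("Judo", "Golf", "Judo", "Rowing", "Golf")

# unique() keeps one copy of each value, in order of first appearance.
unique_sports <- unique(sports)
print(unique_sports)

# length() then counts how many distinct values there are.
n_unique <- length(unique_sports)
print(n_unique)

# sort() can make a long list easier to scan.
print(sort(unique_sports))
```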

Items to submit
• Code used to solve this problem.

• Output from running the code.

• 1-2 sentences explaining the results.

### Question 3

Create a data.frame called `us_athletes` that contains only information on athletes from the USA. Use the column `NOC` (National Olympic Committee 3-letter code). How many rows does `us_athletes` have?

Now, perform the same operation on the `olympics` data.frame, this time containing only the information on the athletes from the country of your choice. Name this new data.frame appropriately. How many rows does it have?

Now, create a data.frame called `both` that contains the information on the athletes from the USA and the country of your choice. How many rows does it have?
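
Filtering rows by the value of a column is a case of logical indexing. A minimal sketch on a made-up data.frame follows; the data.frame `toy`, its `Name` column, and the variable names are assumptions for illustration, and only the `NOC` column name comes from the question itself.

```r
# A tiny made-up data.frame with an NOC column.
toy <- data.frame(
  Name = c("A", "B", "C", "D", "E"),
  NOC  = c("USA", "CAN", "USA", "FRA", "CAN"),
  stringsAsFactors = FALSE
)

# Keep rows where the logical condition is TRUE; the empty slot after
# the comma means "keep all columns".
usa_rows <- toy[toy$NOC == "USA", ]
can_rows <- toy[toy$NOC == "CAN", ]

# %in% tests membership in a set, so it selects rows from either country.
usa_or_can <- toy[toy$NOC %in% c("USA", "CAN"), ]

print(nrow(usa_rows))
print(nrow(can_rows))
print(nrow(usa_or_can))
```

One design note: if the column contains `NA`, a comparison with `==` yields `NA` for those rows and they show up as all-`NA` rows in the result, whereas `%in%` returns `FALSE` for them. Wrapping the condition in `which()` is one way to avoid that.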

Items to submit
• Code used to solve this problem.

• Output from running the code.

• How many rows (athlete records) are in the `us_athletes` dataset?

• How many rows (athlete records) are in the other country’s dataset?

• How many rows (athlete records) are in the `both` dataset?

### Question 4

What percentage of US athletes are women? What percentage of US athletes with gold medals are women?

Relevant topics: prop.table, table, indexing
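
The relevant functions combine as follows: `table` counts how often each value occurs, and `prop.table` turns those counts into proportions. The sketch below uses a made-up data.frame; the column names `Sex` and `Medal` and their values are assumptions for illustration and may not match the project dataset.

```r
# A tiny made-up data.frame; NA in Medal means "no medal".
toy <- data.frame(
  Sex   = c("F", "M", "F", "M", "M"),
  Medal = c("Gold", NA, NA, "Gold", "Silver"),
  stringsAsFactors = FALSE
)

# Counts per category, converted to proportions, then to percentages.
sex_share <- prop.table(table(toy$Sex)) * 100
print(sex_share)

# Restrict to gold medal rows first; which() drops the NA comparisons.
gold <- toy[which(toy$Medal == "Gold"), ]
gold_share <- prop.table(table(gold$Sex)) * 100
print(gold_share)
```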

Items to submit
• Code used to solve this problem.

• Output from running the code.

### Question 5

Who is the oldest US athlete to compete, based on our `us_athletes` data.frame? At what age, in which sport, and in what year did the athlete compete?

Answer the same questions for your "other" country from question (3) and question (4).
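
One way to find the row holding the largest value in a column, and then print only selected columns, is sketched below on made-up data. The data.frame `toy` and the column names `Age`, `Sport`, and `Year` are assumptions for illustration.

```r
# A tiny made-up data.frame with one missing age.
toy <- data.frame(
  Name  = c("A", "B", "C", "D"),
  Age   = c(24, NA, 41, 33),
  Sport = c("Judo", "Golf", "Sailing", "Golf"),
  Year  = c(2000, 2004, 1996, 2012),
  stringsAsFactors = FALSE
)

# which.max() returns the position of the first maximum and skips NA values.
oldest_idx <- which.max(toy$Age)

# Index by row position and by column name to keep only the needed fields.
oldest <- toy[oldest_idx, c("Age", "Sport", "Year")]
print(oldest)
```

Because `which.max` returns only the first maximum, ties are dropped; `which(toy$Age == max(toy$Age, na.rm = TRUE))` would return every row that shares the largest age.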

 Make sure you use indexing to print only the athlete’s information (age, sport, year).

Items to submit
• Code used to solve this problem.

• Output from running the code.

• Age, sport, and olympics year that the oldest athlete competed in, for each of your countries.

 Please make sure to double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended that you download your submission after submitting it to make sure that what you think you submitted is what you actually submitted.