TDM 10100: R Project 2 — 2024

Motivation: R is one of the most popular tools for data analysis. Indexing and grouping values are two of its most powerful features. (We can do a lot with just one line of R!)

Context: We will load several data frames in R and will practice indexing the data in several ways.

Scope: R, Operators, Conditionals

Learning Objectives:

Get comfortable with extracting data in R that satisfy various conditions
Learn how to use indices in R
Apply these techniques with real-world data

Make sure to read about and use the template found here, and read the important information about project submissions here.

Dataset(s)

This project will use the following dataset(s):

/anvil/projects/tdm/data/olympics/athlete_events.csv
/anvil/projects/tdm/data/election/itcont1980.txt

Questions

For this project (and moving forward, when you are using R), please use the seminar-r kernel (not the seminar kernel), unless otherwise specified. When you use the seminar-r kernel with R, you do not need to use the %%R cell magic.

Question 1 (2 pts)

Import the Olympics data from the file /anvil/projects/tdm/data/olympics/athlete_events.csv into a data frame called myDF. Make a table from the values in the column myDF$Year and then plot this table. (Your work will be similar to Project 1, Questions 3, 4, 5.) [Take a look at the resulting plot: Does the resulting plot make sense? For instance: Does it make sense that the number of athletes is increasing over time? Can you see the halt in the Olympics during the two World Wars? Do you see that the 2-year rotation between the Summer and Winter Olympics began in the 1990s?]
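
As a rough illustration of the read, tabulate, and plot pattern, here is a minimal sketch on a tiny made-up file. The names toyFile, toyDF, and yearCounts, and all of the values, are hypothetical and are not taken from athlete_events.csv.

```r
# Write a tiny made-up CSV file to a temporary location (toy data only)
toyFile <- tempfile(fileext = ".csv")
writeLines(c("Name,Year", "A,1896", "B,1900", "C,1900",
             "D,1904", "E,1904", "F,1904"), toyFile)

# Read the file into a data frame, as read.csv does for any comma-delimited file
toyDF <- read.csv(toyFile)

# Count how many rows fall in each year
yearCounts <- table(toyDF$Year)
print(yearCounts)

# Plot the counts, one vertical line per year
plot(yearCounts)
```

The same three steps apply to the real data once the path to the Olympics file is used in place of toyFile.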

Deliverables

A table showing the number of athletes participating in the Olympics during each year.
A plot showing the number of athletes participating in the Olympics during each year.
As always, be sure to document your work from Question 1 (and from all of the questions!), using some comments and insights about your work. We will stop adding this note to document your work, but please remember, we always assume that you will document every single question with your comments and your insights.

Question 2 (2 pts)

In the Olympics data:

Which value appears in the "NOC" column the most times?

Which value appears in the "Name" column the most times? Hint: If you try to view the entire table of values in the "Name" column, the table has length 134732, and it will not finish displaying. For this reason, you should only look at the head or the tail of your table, not the entire table itself.
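
One possible way to find the most frequent value without printing a whole table is to sort the table and look only at its tail. The sketch below uses a made-up vector (toyNOC is hypothetical, not the real "NOC" column).

```r
# Toy vector standing in for a column of a data frame (made-up values)
toyNOC <- c("USA", "FRA", "USA", "GBR", "USA", "FRA")

# sort() orders the counts from smallest to largest,
# so the most frequent value ends up at the tail
nocCounts <- sort(table(toyNOC))

# Look at only the last entry instead of the whole table
mostCommon <- tail(nocCounts, n = 1)
print(mostCommon)

# The value itself is stored as the name of that entry
print(names(mostCommon))
```

Using tail with a larger n (for instance, n = 5) shows the few most frequent values, which is still quick even when the full table is very long.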

Deliverables

The value that appears in the "NOC" column the most times.
The value that appears in the "Name" column the most times.

Question 3 (2 pts)

In the Olympics data:

When we examine the head of myDF, notice that the third row is from team "Denmark" while the fourth row is from team "Denmark/Sweden".

How many rows correspond exactly to team "Denmark"?

How many rows have "Denmark" in the team name ("Denmark" may or may not be the exact team name)? Hint: You can use the grep or grepl function.

Find the names of the teams that have "Denmark" in the team name but are not exactly "Denmark". Hint: There should be exactly 72 such rows.
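
The difference between an exact match and a partial match can be sketched on a small made-up vector. Here toyTeam and its values are hypothetical; "Denmark-1" is only an illustrative name.

```r
# Toy vector of team names (made-up values)
toyTeam <- c("Denmark", "Denmark/Sweden", "Sweden", "Denmark", "Denmark-1")

# Exact match: TRUE only where the whole string equals "Denmark"
exact <- toyTeam == "Denmark"

# Partial match: TRUE wherever "Denmark" appears anywhere in the string
partial <- grepl("Denmark", toyTeam)

# Summing a logical vector counts its TRUE values
print(sum(exact))
print(sum(partial))

# Names that contain "Denmark" but are not exactly "Denmark"
otherTeams <- unique(toyTeam[partial & !exact])
print(otherTeams)
```

Indexing with a logical vector keeps the elements where the condition is TRUE, and unique removes repeated names from the result.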

Deliverables

The number of rows corresponding exactly to team "Denmark".
The number of rows with "Denmark" as part of the team name.
The names of teams that have "Denmark" included but are not exactly "Denmark".

Question 4 (2 pts)

Not all data comes in a comma-delimited format, i.e., with commas in between the pieces of data. In the data set of donations from the 1980 federal election campaigns, the symbol "|" is placed between pieces of data.

C00078279|A|M11|P|80031492155|22Y||MCKENNON, K R|MIDLAND|MI|00000|||10031979|400|||||CONTRIBUTION REF TO INDIVIDUAL|3062020110011466469
C00078279|A|M11||79031415137|15||OREFFICE, P|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|1500||||||3061920110000382948
C00078279|A|M11||79031415137|15||DOWNEY, J|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|300||||||3061920110000382949
C00078279|A|M11||79031415137|15||BLAIR, E|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|1000||||||3061920110000382950
C00078287|A|Q1||79031231889|15||BLANCHARD, JOHN A|CHICAGO|IL|60685|||03201979|200||||||3061920110000383914
C00078287|A|Q1||79031231889|15||CRAMER, JOHN H|CHICAGO|IL|60685|||02281979|200||||||3061920110000383915
C00078287|A|Q1||79031231889|15||MCHUGH, KEVIN|CHICAGO|IL|60685|||03051979|200||||||3061920110000383916
C00078287|A|Q1||79031231889|15||NOHA, EDWARD J|CHICAGO|IL|60685|||03121979|300||||||3061920110000383917
C00078287|A|Q1||79031231889|15||RYCROFT, DONALD C|CHICAGO|IL|60685|||03191979|200||||||3061920110000383918
C00078287|A|Q1||79031231889|15||VANDERSLICE, WILLIAM D|CHICAGO|IL|60685|||02271979|200||||||3061920110000383919

Instead of using the read.csv function to read in the data, we can use the fread function to read in the data, and it will automatically detect what symbol is placed between the pieces of data. The fread function is not available by default, so we first load the data.table library.

This data set also does not have the names of the columns built in! So we need to specify the names of the columns.

You can use the following to read in the data and name the columns properly:

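# fread is provided by the data.table package
library(data.table)
# quote="" turns off quote handling, so quotation marks in the data are read literally
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")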
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")
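
To see what reading a headerless, pipe-delimited file involves, here is a small sketch that does not need the data.table package. It uses base R's read.table with the delimiter given explicitly through sep, rather than fread's automatic detection, so it is a stand-in for the idea and not the method used above. The two lines of data and the four column names are made up for illustration.

```r
# Two made-up lines in the same "|"-separated style (toy data only)
toyLines <- c("C001|DOE, J|MIDLAND|MI",
              "C002|ROE, A|CHICAGO|IL")

# header = FALSE because the first line is data, not column names;
# quote = "" and comment.char = "" make R read every character literally
toyDF <- read.table(text = toyLines, sep = "|", header = FALSE,
                    quote = "", comment.char = "",
                    stringsAsFactors = FALSE)

# Without a header, the columns get placeholder names V1, V2, ...
print(names(toyDF))

# Assign meaningful names, one per column
names(toyDF) <- c("CMTE_ID", "NAME", "CITY", "STATE")
print(toyDF)
```

Note that the comma inside each name stays within one field, because only "|" is treated as a separator.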

Now that you have the data read into the data frame myDF, here are two questions to get familiar with the data:

Which value appears in the "STATE" column the most times?

Which value appears in the "NAME" column the most times? Hint: As in Question 2, if you try to view the entire table of values in the "NAME" column, the table has length 217646, and it will not finish displaying. For this reason, you should only look at the head or the tail of your table, not the entire table itself.

Deliverables

The value that appears in the "STATE" column the most times.
The value that appears in the "NAME" column the most times.

Question 5 (2 pts)

In the data set about the 1980 federal election campaigns:

Use the paste command to join the "CITY" and "STATE" columns, with the goal of determining the top 5 city-and-state locations where donations were made.

Hint: As in Questions 2 and 4, if you try to view the entire table of values of city-and-state pairs, the table has length 217646, and it will not finish displaying. For this reason, you should only look at the head or the tail of your table, not the entire table itself.

Another hint: Please notice the fact that there are 11582 rows in the data set in which the "CITY" and "STATE" are both empty!
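
As a sketch of how paste combines two columns element by element, and how empty pairs show up in the counts, consider the made-up vectors below (toyCity and toyState are hypothetical, not the real columns).

```r
# Toy vectors standing in for two columns (made-up values);
# the fourth entry is empty in both
toyCity  <- c("MIDLAND", "CHICAGO", "CHICAGO", "", "MIDLAND", "CHICAGO")
toyState <- c("MI", "IL", "IL", "", "MI", "IL")

# paste joins the vectors element by element, with a space by default
cityState <- paste(toyCity, toyState)

# Count each pair and put the most frequent pairs first
locCounts <- sort(table(cityState), decreasing = TRUE)

# Look at only the top entries, not the whole table
print(head(locCounts, n = 2))

# An empty city pasted with an empty state becomes a single space " ",
# which is counted like any other value
print(locCounts[" "])
```

Because the empty pairs are counted as a value of their own, they can appear among the most frequent entries and may need to be set aside when reading off real locations.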

Deliverables

The top 5 city-and-state locations where donations were made in the 1980 federal election campaigns.

Submitting your Work

Great job, you’ve completed Project 2! This project was your first real foray into the world of R, and it is okay to feel a bit overwhelmed. R is likely a new language to you, and just like any other language, it will get much easier with time and practice. As we keep building on these fundamental concepts in the next few weeks, don’t be afraid to come back and revisit your previous work. As always, please ask any questions you have during seminar, on Piazza, or in office hours. We hope you have a great rest of your week, and we’re excited to keep learning about R with you in the next project!

Items to submit

firstname_lastname_project2.ipynb

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.