TDM 10100: R Project 12 — 2024

Motivation: Some files are too big to read into R. We practice methods for extracting the portions of a file that we need.

Context: The fread function allows us to read a portion of a file (either a small number of rows, and/or a small number of columns).

Scope: We learn how to work with enormous files in R, if we only need a portion of the file in our analysis.

Learning Objectives:
  • Learn how to apply functions to large data sets and extract information.

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following dataset(s):

  • /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales.csv (Iowa liquor sales)

  • /anvil/projects/tdm/data/election/itcont2020.txt (2020 election data)

  • /anvil/projects/tdm/data/movies_and_tv/imdb2024/basics.tsv (Internet Movie DataBase (IMDB))

Questions

Please use 2 cores in your Jupyter Lab session for this project.

Question 1 (2 pts)

First load: library(data.table) so that you have the fread function available, and also options(repr.matrix.max.cols=50) so that you can see 50 columns.

If you try to read the entire file /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales.csv then the seminar-r kernel will crash.

Instead, use fread with the option nrows=1000 (in part a) so that you can see (only) the first 1000 rows of the data frame.

  1. Which columns contain the Store Number, Store Name, Address, City, and Zip Code? (Use the option nrows=1000 for part a.)

  2. Use the select option in fread to read the entire data set, but only those 5 columns of the data. What is the dimension of the data set? (Hint: it should now be 5 columns and more than 27 million rows. Another hint: Do NOT use nrows=1000 for part b. Instead, use the select option that we learned in earlier projects.)

Deliverables
  • Which columns contain the Store Number, Store Name, Address, City, and Zip Code?

  • What is the dimension of the data set?

Question 2 (2 pts)

Now let’s find the most popular 10 stores (from all 27 million rows, and only the 5 columns in question 1) in two different ways:

  1. Use the table command to find the 10 most popular values of the "Store Number"

  2. Use the table command to find the 10 most popular values of the "Store Name"

  3. Use the table command to find the 10 most popular values of these three columns pasted together: Address, City, "Zip Code"

  4. Do your answers to parts (a) and (b) seem to agree? Do your answers to parts (a) and (c) seem to agree?

Deliverables
  • a. A table with the 10 most popular values of the "Store Number" and the number of occurrences of each.

  • b. A table with the 10 most popular values of the "Store Name" and the number of occurrences of each.

  • c. A table with the 10 most popular values of these three columns pasted together: Address, City, "Zip Code" and the number of occurrences of each.

  • d. Do your answers to parts (a) and (b) seem to agree? Do your answers to parts (a) and (c) seem to agree?

Question 3 (2 pts)

For this question, use fread to read all 27 million rows of the data set again, but this time, only read in the columns "Zip Code", "Category Name", "Sale (Dollars)".

  1. Use the tapply function to sum the values of "Sale (Dollars)" according to the "Zip Code". Find the 10 "Zip Code" values that have the largest sum of "Sale (Dollars)" altogether, and give those "Zip Code" values and each of their sums of "Sale (Dollars)".

  2. Now use the tapply function to sum the values of "Sale (Dollars)" according to the "Category Name". Find the 10 "Category Name" values that have the largest sum of "Sale (Dollars)" altogether, and give those "Category Name" values and each of their sums of "Sale (Dollars)".

Deliverables
  • Find the 10 "Zip Code" values that have the largest sum of "Sale (Dollars)" altogether, and give those "Zip Code" values and each of their sums of "Sale (Dollars)".

  • Find the 10 "Category Name" values that have the largest sum of "Sale (Dollars)" altogether, and give those "Category Name" values and each of their sums of "Sale (Dollars)".

Question 4 (2 pts)

Use fread to read only the 10th and 15th fields of this huge file: /anvil/projects/tdm/data/election/itcont2020.txt (do not worry about the warning in a pink box that appears).

Sum the amount of the donations (from the 15th field) according to the state (from the 10th field). List the top 10 states according to the sum of the donation amounts, and the sum of the donation amounts in each of these top 10 states.

Deliverables
  • List the top 10 states according to the sum of the donation amounts, and the sum of the donation amounts in each of these top 10 states.

Question 5 (2 pts)

Read the first 10 lines of /anvil/projects/tdm/data/movies_and_tv/imdb2024/basics.tsv and notice that the entries of the genres column are strings with several types of genres, separated by commas. On the genres column (for only these 10 lines), run the following:

myDF$genres

strsplit(myDF$genres, ',')

unlist(strsplit(myDF$genres, ','))

table(unlist(strsplit(myDF$genres, ',')))

Notice that, in this way, using the strsplit function, we can find out how many times each of the individual genres occur.

Now read in only the genres column of the entire file (do not worry about the warning that results). For each of the genres, list how many times it occurs. For instance, Action occurs 462531 times.

Deliverables
  • For each of the genres, list how many times it occurs.

Submitting your Work

This project enables students to select the relevant columns of a data frame for their analysis.

Items to submit
  • firstname_lastname_project12.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.