TDM 10100: R Project 12 — 2024
Motivation: Some files are too big to read into R. We practice methods for extracting the portions of a file that we need.
Context: The fread
function allows us to read a portion of a file (either a small number of rows, and/or a small number of columns).
Scope: We learn how to work with enormous files in R, if we only need a portion of the file in our analysis.
Dataset(s)
This project will use the following dataset(s):
-
/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales.csv
(Iowa liquor sales) -
/anvil/projects/tdm/data/election/itcont2020.txt
(2020 election data) -
/anvil/projects/tdm/data/movies_and_tv/imdb2024/basics.tsv
(Internet Movie DataBase (IMDB))
Questions
Please use 2 cores in your Jupyter Lab session for this project. |
Question 1 (2 pts)
First load: library(data.table)
so that you have the fread
function available, and also options(repr.matrix.max.cols=50)
so that you can see 50 columns.
If you try to read the entire file /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales.csv
then the seminar-r
kernel will crash.
Instead, use fread
with the option nrows=1000
(in part a) so that you can see (only) the first 1000 rows of the data frame.
-
Which columns contain the
Store Number
,Store Name
,Address
,City
, andZip Code
? (Use the optionnrows=1000
for part a.) -
Use the
select
option infread
to read the entire data set, but only those 5 columns of the data. What is the dimension of the data set? (Hint: it should now be 5 columns and more than 27 million rows. Another hint: Do NOT usenrows=1000
for part b. Instead, use theselect
option that we learned in earlier projects.)
-
Which columns contain the
Store Number
,Store Name
,Address
,City
, andZip Code
? -
What is the dimension of the data set?
Question 2 (2 pts)
Now let’s find the most popular 10 stores (from all 27 million rows, and only the 5 columns in question 1) in two different ways:
-
Use the
table
command to find the 10 most popular values of the"Store Number"
-
Use the
table
command to find the 10 most popular values of the"Store Name"
-
Use the
table
command to find the 10 most popular values of these three columns pasted together:Address
,City
,"Zip Code"
-
Do your answers to parts (a) and (b) seem to agree? Do your answers to parts (a) and (c) seem to agree?
-
a. A table with the 10 most popular values of the
"Store Number"
and the number of occurrences of each. -
b. A table with the 10 most popular values of the
"Store Name"
and the number of occurrences of each. -
c. A table with the 10 most popular values of these three columns pasted together:
Address
,City
,"Zip Code"
and the number of occurrences of each. -
d. Do your answers to parts (a) and (b) seem to agree? Do your answers to parts (a) and (c) seem to agree?
Question 3 (2 pts)
For this question, use fread
to read all 27 million rows of the data set again, but this time, only read in the columns "Zip Code"
, "Category Name"
, "Sale (Dollars)"
.
-
Use the tapply function to sum the values of
"Sale (Dollars)"
according to the"Zip Code"
. Find the 10"Zip Code"
values that have the largest sum of"Sale (Dollars)"
altogether, and give those"Zip Code"
values and each of their sums of"Sale (Dollars)"
. -
Now use the tapply function to sum the values of
"Sale (Dollars)"
according to the"Category Name"
. Find the 10"Category Name"
values that have the largest sum of"Sale (Dollars)"
altogether, and give those"Category Name"
values and each of their sums of"Sale (Dollars)"
.
-
Find the 10
"Zip Code"
values that have the largest sum of"Sale (Dollars)"
altogether, and give those"Zip Code"
values and each of their sums of"Sale (Dollars)"
. -
Find the 10
"Category Name"
values that have the largest sum of"Sale (Dollars)"
altogether, and give those"Category Name"
values and each of their sums of"Sale (Dollars)"
.
Question 4 (2 pts)
Use fread
to read only the 10th and 15th fields of this huge file: /anvil/projects/tdm/data/election/itcont2020.txt
(do not worry about the warning in a pink box that appears).
Sum the amount of the donations (from the 15th field) according to the state (from the 10th field). List the top 10 states according to the sum of the donation amounts, and the sum of the donation amounts in each of these top 10 states.
-
List the top 10 states according to the sum of the donation amounts, and the sum of the donation amounts in each of these top 10 states.
Question 5 (2 pts)
Read the first 10 lines of /anvil/projects/tdm/data/movies_and_tv/imdb2024/basics.tsv
and notice that the entries of the genres
column are strings with several types of genres, separated by commas. On the genres
column (for only these 10 lines), run the following:
myDF$genres
strsplit(myDF$genres, ',')
unlist(strsplit(myDF$genres, ','))
table(unlist(strsplit(myDF$genres, ',')))
Notice that, in this way, using the strsplit
function, we can find out how many times each of the individual genres
occur.
Now read in only the genres
column of the entire file (do not worry about the warning that results). For each of the genres
, list how many times it occurs. For instance, Action
occurs 462531 times.
-
For each of the
genres
, list how many times it occurs.
Submitting your Work
This project enables students to select the relevant columns of a data frame for their analysis.
-
firstname_lastname_project12.ipynb
You must double check your You will not receive full credit if your |