TDM 10100: R Project 9 — 2024
Motivation: Knowing how to merge data frames in R truly opens up a lot of functionality to us, and allows us to design a more comprehensive analysis of our data sets.
Context: The merge
function allows us to combine information from multiple data tables.
Scope: merging tables
Dataset(s)
This project will use the following dataset(s):
-
/anvil/projects/tdm/data/icecream/combined/products.csv
(ice cream products) -
/anvil/projects/tdm/data/icecream/combined/reviews.csv
(ice cream reviews) -
/anvil/projects/tdm/data/flights/subset/airports.csv
(information about airports) -
/anvil/projects/tdm/data/flights/subset/1990.csv
(flight data from 1990)
Questions
Question 1 (2 pts)
Read in the ice cream products
and reviews
files into two separate data frames, as follows:
library(data.table)
productsDF <- fread("/anvil/projects/tdm/data/icecream/combined/products.csv")
reviewsDF <- fread("/anvil/projects/tdm/data/icecream/combined/reviews.csv")
Notice that these two data frames have three columns in common, namely, brand
, key
, and ingredients
. If we try to merge these two data frames by default, R will try to match data on all three columns, BUT the ingredients
column has a different role in these two data frames. In the productsDF
, the ingredients
column has a list of the ice cream ingredients. In the ratingsDF
, the ingredients
column has values between 1 and 5, or an NA
value.
For this reason, when we merge the information from the two tables, we only want to merge the data according to the brand
and key
values, as follows:
newmergedDF <- merge(productsDF, reviewsDF, by = c("brand", "key") )
Notice that the productsDF
has 8 columns, and the reviewsDF
has 13 columns, and the newmergedDF
has 19 columns, namely, all of the columns from both of the previous two data frames, including two separate ingredients
columns, one from each data frame.
In this data frame, there are 978 rows that correspond to the name
being "Chocolate Chip Cookie Dough"
, which has brand == "bj"
and key == "16_bj"
. We can get this subset of the data as follows:
ChocolateChipCookieDoughDF <- subset(newmergedDF, (brand == "bj") & (key == "16_bj"))
Notice that this new data frame called ChocolateChipCookieDoughDF
has 978 rows.
What are the average number of stars for the 978 reviews in the data frame called ChocolateChipCookieDoughDF
?
-
Print the average number of stars for the 978 reviews in the data frame called
ChocolateChipCookieDoughDF
.
Question 2 (2 pts)
Starting with the newmergedDF
, find the average number of stars
for the reviews of ice cream with name == "Half Baked\302\256"
.
There are two characters that you will not see but they are there. They are encoded as |
-
Print the average number of
stars
for the reviews of ice cream withname == "Half Baked\302\256"
.
Question 3 (2 pts)
In Project 2, we learned about the grep
and grepl
functions. Using either of these two functions, you can limit attention to ice cream products that have "Chocolate"
in the title. (There are 49 such ice cream products.) Find the average number of stars for all 4831 of the reviews of these products that have "Chocolate"
in the product name.
-
Print the average number of stars for all 4831 of the reviews of these products that have
"Chocolate"
in the product name.
Question 4 (2 pts)
Read in the information about airports, and also the flight data from 1990, into two separate data frames, as follows:
library(data.table)
airportsDF <- fread("/anvil/projects/tdm/data/flights/subset/airports.csv")
flightdataDF <- fread("/anvil/projects/tdm/data/flights/subset/1990.csv")
It is necessary to have 2 cores in your Jupyter Lab session for Question 4 and Question 5. |
Do not worry about the warning message from the |
These two data frames do not have any columns in common! Nonetheless, the "iata"
values from the airportsDF
are the three-letter codes corresponding to airports, which are also found in the Origin
and Dest
columns in the flightdataDF
. So when we merge the information from the two tables, if we want to study where the flights depart, then we only want to merge the data according to the iata
value (from the airportsDF
) merged with the Origin
value (from the flightdataDF
), as follows:
mybigDF <- merge(airportsDF, flightdataDF, by.x = "iata", by.y = "Origin")
Using this new data frame, find the average departure delay for all flights that have Origin
airport in Indiana.
-
Print the average departure delay for all flights that have
Origin
airport in Indiana.
Submitting your Work
Now you are familiar with the method of merging data from multiple data frames.
-
firstname_lastname_project9.ipynb
You must double check your You will not receive full credit if your |