TDM 10200: Python Project 9 — Spring 2025

Motivation: It is helpful for us to be able to merge data from two or more data frames together, into one big data frame.

Context: The merge function in Pandas allows us to combine two data frames, according to the information in columns that the data frames have in common.

Scope: merging tables

Learning Objectives:
  • Learn how to merge data frames

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following datasets:

  • /anvil/projects/tdm/data/icecream/combined/products.csv (ice cream products)

  • /anvil/projects/tdm/data/icecream/combined/reviews.csv (ice cream reviews)

  • /anvil/projects/tdm/data/flights/subset/airports.csv (information about airports)

  • /anvil/projects/tdm/data/flights/subset/1990.csv (flight data from 1990)

Question 1 (2 pts)

Read in the ice cream products and reviews files into two separate data frames, as follows:

productsDF = pd.read_csv("/anvil/projects/tdm/data/icecream/combined/products.csv")
reviewsDF = pd.read_csv("/anvil/projects/tdm/data/icecream/combined/reviews.csv")

Notice that these two data frames have three columns in common, namely, brand, key, and ingredients. If we try to merge these two data frames by default, Pandas will try to match data on all three columns, BUT the ingredients column has a different role in these two data frames. In the productsDF, the ingredients column has a list of the ice cream ingredients. In the ratingsDF, the ingredients column has values between 1 and 5, or an NA value.

For this reason, when we merge the information from the two tables, we only want to merge the data according to the brand and key values, as follows:

newmergedDF = productsDF.merge(reviewsDF, on=['brand','key'])

Notice that the productsDF has 8 columns, and the reviewsDF has 13 columns, and the newmergedDF has 19 columns, namely, all of the columns from both of the previous two data frames, including two separate ingredients columns, one from each data frame.

In this data frame, there are 978 rows that correspond to the name being "Chocolate Chip Cookie Dough", which has brand == "bj" and key == "16_bj". We can get this subset of the data as follows:

ChocolateChipCookieDoughDF = newmergedDF[(newmergedDF['brand'] == "bj") & (newmergedDF['key'] == "16_bj")]

Notice that this new data frame called ChocolateChipCookieDoughDF has 978 rows.

What are the average number of stars for the 978 reviews in the data frame called ChocolateChipCookieDoughDF?

Deliverables
  • Print the average number of stars for the 978 reviews in the data frame called ChocolateChipCookieDoughDF.

Question 2 (2 pts)

Starting with the newmergedDF, find the average number of stars for the reviews of ice cream with name == "Half Baked®".

Deliverables
  • Print the average number of stars for the reviews of ice cream with name == "Half Baked®".

Question 3 (2 pts)

We can limit attention to ice cream products that have "Chocolate" in the title, by using str.contains to search for patterns, like this:

chocolateDF = newmergedDF[newmergedDF['name'].str.contains("Chocolate")]

(There are 4831 reviews of the 49 ice cream products with "Chocolate" in the product name.) Find the average number of stars for all 4831 of the reviews of these products that have "Chocolate" in the product name.

Deliverables
  • Print the average number of stars for all 4831 of the reviews of these products that have "Chocolate" in the product name.

Question 4 (2 pts)

Read in the information about airports, and also the flight data from 1990, into two separate data frames, as follows:

airportsDF = pd.read_csv("/anvil/projects/tdm/data/flights/subset/airports.csv")
flightdataDF = pd.read_csv("/anvil/projects/tdm/data/flights/subset/1990.csv")

It is necessary to have 3 cores in your Jupyter Lab session for Question 4 and Question 5.

You need to ask Pandas to let you see all of the columns in flightdataDF, which is very wide. You can do this as follows: pd.set_option('display.max_columns', 200)

The two data frames airportsDF and flightdataDF do not have any columns in common! Nonetheless, the "iata" values from the airportsDF are the three-letter codes corresponding to airports, which are also found in the Origin and Dest columns in the flightdataDF. So when we merge the information from the two tables, if we want to study where the flights depart, then we only want to merge the data according to the iata value (from the airportsDF) merged with the Origin value (from the flightdataDF), as follows:

mybigDF = airportsDF.merge(flightdataDF, left_on = "iata", right_on = "Origin")

Using this new data frame, find the average departure delay for all flights that have Origin airport in Indiana.

Deliverables
  • Print the average departure delay for all flights that have Origin airport in Indiana.

Question 5 (2 pts)

Using the same data frame from Question 4, find the average departure delay for all flights that have Origin airport in Houston, Texas.

Deliverables
  • Print the average departure delay for all flights that have Origin airport in Houston, Texas.

Submitting your Work

Please make sure that you added comments for each question, which explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template.

If you have any questions or issues regarding this project, please feel free to ask in seminar, over Piazza, or during office hours.

Prior to submitting your work, you need to put your work into the project template, and re-run all of the code in your Jupyter notebook and make sure that the results of running that code is visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.

Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points.

Items to submit
  • firstname_lastname_project8.ipynb

It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used. Please make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template.

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not.

Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.