TDM 10200: Python Project 3 — Spring 2025

Motivation: The groupby function can be used to split a data set, and then we can analyze each group of the data. This sounds complicated, but after you experiment with a few examples, this is straightforward and powerful!

Context: To use the groupby function, we need a Pandas data frame. We split the data frame according to the data in one of the columns. Then we can analyze group of the data.

Scope: using the groupby function on a Pandas data frame

Learning Objectives:
  • Analyzing data in groups

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following datasets:

  • /anvil/projects/tdm/data/election/itcont1980.txt

  • /anvil/projects/tdm/data/icecream/combined/products.csv

  • /anvil/projects/tdm/data/flights/subset/1990.csv

  • /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv

  • /anvil/projects/tdm/data/olympics/athlete_events.csv

Questions

It is OK to use 3 cores in this project, to prevent the kernel from dying when you load your data. If your kernel does die while you are working, it is OK to select Kernel from the menu in Jupyter Lab, and then Restart Kernel from the dropdown under Kernel, and re-run your analysis for that question.

Three examples of the groupby function:

Example 1 Using the 1980 election data, we can find the amount of money donated in each state.

import pandas as pd

myDF = pd.read_csv("/anvil/projects/tdm/data/election/itcont1980.txt", header=None, sep='|')

myDF.columns = ["CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID"]

We group the data according to the state where the donation was made. Then, looking at the transaction amount column, we can find the sum of the amounts of the transactions (in each state).

myDF.groupby('STATE')['TRANSACTION_AMT'].sum()

because there are many states in the output, you can ask Jupyter Lab to show you all of the rows of output (one row per state) as follows:

pd.set_option('display.max_rows', None)

Example 2 Using the ice cream products data, we can find the average rating for each brand.

import pandas as pd
myDF = pd.read_csv("/anvil/projects/tdm/data/icecream/combined/products.csv")

We split the data according to the brand of the ice cream, and then we take the average (also called the mean) of the ratings for each brand.

myDF.groupby('brand')['rating'].mean()

Example 3 Using the 1990 airport data, we can find the average departure delay for flights from each airport.

import pandas as pd
myDF = pd.read_csv("/anvil/projects/tdm/data/flights/subset/1990.csv")

When you load the head of the data using myDF.head() you will not see all of the columns, because there are too many columns. You might find it helpful to run:

pd.set_option('display.max_columns', None)

so that you can see all of the columns.

We split the data according to the Origin airport, in other words, according to where the flight departed. Then we look at the departure delay and find the average (also called the mean) departure delay for each airport.

myDF.groupby('Origin')['DepDelay'].mean()

There are so many states in the output that you will not be able to see them all. So, just like we told Jupyter Lab earlier that we did not want a maximum number of columns, we can also tell Jupyter Lab that we do not want a maximum number of rows either:

pd.set_option('display.max_rows', None)

In this way, we will be able to see the results from all of the states.

Question 1 (2 pts)

Using the 1990 airport data, find the average arrival delay for flights arriving to each airport.

The arrival delays are in the ArrDelay column, and the planes arrive at the airports in the Dest (destination) column.

Question 2 (2 pts)

In the grocery store file:

/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv

Find the sum of the amount spent (in the SPEND column) at each of the store regions (the STORE_R column).

When you read in the grocery data, you need to assign column names, as follows:

myDF = pd.read_csv("/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv")

myDF.columns = ["BASKET_NUM", "HSHD_NUM", "PURCHASE", "PRODUCT_NUM", "SPEND", "UNITS", "STORE_R", "WEEK_NUM", "YEAR"]

Question 3 (2 pts)

In the grocery store file (same file from question 2):

Find the total amount of money spent in 2016 altogether, and the total amount of money spent in 2017 altogether. (You can use the groupby to do this with just one cell.)

Question 4 (2 pts)

In the Olympics file /anvil/projects/tdm/data/olympics/athlete_events.csv

Find the average height of the athletes in each country (the country is the NOC column).

Question 5 (2 pts)

In the Olympics file (same file from question 4):

Find the average height of the athletes in each sport (the sport is the Sport column, of course!).

Submitting your Work

Please make sure that you added comments for each question, which explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template.

If you have any questions or issues regarding this project, please feel free to ask in seminar, over Piazza, or during office hours.

Prior to submitting your work, you need to put your work into the project template, and re-run all of the code in your Jupyter notebook and make sure that the results of running that code is visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.

Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points.

Items to submit
  • firstname_lastname_project3.ipynb

It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used. Please make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template.

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not.

Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.