TDM 10200: Python Project 4 — Spring 2025

Motivation: The groupby function, as we learned in the previous project, can be used to split a data set into groups so that we can analyze each group separately. It is not hard to use, but we want to provide some additional practice for students.

Context: We provide five more examples of using the groupby function on Pandas data frames, using five more data sets.

Scope: using the groupby function on a Pandas data frame

Learning Objectives:
  • Analyzing data in groups

Make sure to read about, and use, the template found here, as well as the important information about project submissions here.

Dataset(s)

This project will use the following datasets:

  • /anvil/projects/tdm/data/death_records/DeathRecords.csv

  • /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv

  • /anvil/projects/tdm/data/beer/reviews_sample.csv

  • /anvil/projects/tdm/data/election/itcont1980.txt

  • /anvil/projects/tdm/data/flights/subset/1990.csv

Questions

Questions 1, 2, and 3 simply provide more practice with the groupby function, which we learned in Project 3.

Question 1 (2 pts)

It is OK to use 3 cores in this project, to prevent the kernel from dying when you load your data. If your kernel does die while you are working, it is OK to select Kernel from the Jupyter Lab menu, choose Restart Kernel, and re-run your analysis for that question.

In the death records file:

/anvil/projects/tdm/data/death_records/DeathRecords.csv

Find the mean Age of death for each Sex.
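
One possible way to do this (just a sketch, following the same pattern from Project 3; the column names Age and Sex come directly from the question) is to group the rows by the Sex column and take the mean of the Age column within each group:

import pandas as pd

myDF = pd.read_csv("/anvil/projects/tdm/data/death_records/DeathRecords.csv")

# split the rows into one group per value of Sex, and average the Age values within each group
myDF.groupby("Sex")["Age"].mean()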

Deliverables
  • Show the mean Age of death for each Sex.

Question 2 (2 pts)

In the grocery store file:

/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv

Find the total amount of money spent on each PRODUCT_NUM. (You can simply add up the values in the SPEND column, grouped according to the PRODUCT_NUM value; you can ignore the UNITS column.) Display the top 10 PRODUCT_NUM values, according to the total amount of money spent on those products (i.e., according to the sum of the SPEND column for those 10 PRODUCT_NUM values).

When you read in the grocery data, you need to assign column names, as follows:

myDF = pd.read_csv("/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv")

myDF.columns = ["BASKET_NUM", "HSHD_NUM", "PURCHASE", "PRODUCT_NUM", "SPEND", "UNITS", "STORE_R", "WEEK_NUM", "YEAR"]
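
One possible approach (just a sketch, using the same pattern as in Project 3) is to sum the SPEND column within each PRODUCT_NUM group; sort_values() puts the totals in increasing order, so tail(10) shows the 10 largest:

# total SPEND for each PRODUCT_NUM, sorted from smallest to largest, keeping only the 10 largest totals
myDF.groupby("PRODUCT_NUM")["SPEND"].sum().sort_values().tail(10)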

Deliverables
  • Show the top 10 PRODUCT_NUM values, according to the total amount of money spent on those products (i.e., according to the sum of the SPEND column for those 10 PRODUCT_NUM values).

Question 3 (2 pts)

In this file of beer reviews:

/anvil/projects/tdm/data/beer/reviews_sample.csv

please consider the mean beer score on each date.

Find the three dates on which the mean score is a 5.

Just like in Question 2, you can use sort_values() on the results, and then tail(3), to show the three largest mean scores, each of which should be 5.
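
A sketch of this approach follows. Note that the column names date and score are assumptions here (they are not stated in the question), so please check myDF.head() to confirm the exact column names in this file:

myDF = pd.read_csv("/anvil/projects/tdm/data/beer/reviews_sample.csv")

# mean score on each date, sorted in increasing order, so the three largest means appear at the end
# (the column names "date" and "score" are assumed; verify them with myDF.head())
myDF.groupby("date")["score"].mean().sort_values().tail(3)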

Deliverables
  • Show the three dates on which the mean score is a 5.

Question 4 (2 pts)

Revisit the video at the very beginning of Project 3, Example 1, in which we found the amount of money donated in each state during the 1980 federal election cycle.

/anvil/projects/tdm/data/election/itcont1980.txt

Remember that we need to specify the column names for the 1980 election donation data:

myDF = pd.read_csv("/anvil/projects/tdm/data/election/itcont1980.txt", header=None, sep='|')

myDF.columns = ["CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID"]

This time, instead of finding the amount of money donated in each state in 1980, please find the amount of money donated in each city-and-state pair.

To accomplish this, combine the city and state together like this:

myDF.groupby(myDF['CITY'] + ', ' + myDF['STATE'], dropna = False)

Using dropna = False allows us to also see the amount of money donated from unknown city-and-state pairs.

Now find the top 20 city-and-state pairs, according to the amount of money donated, as well as the amount of money from unknown city-and-state pairs. For this reason, use tail(21), so that you get 20 results in which the city-and-state pair is known, and one result in which it is not known.
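
Putting these pieces together, one possible sketch of the full chain (following the same pattern as in Project 3, and assuming, as in that project, that the TRANSACTION_AMT column holds the donation amounts) is:

# total TRANSACTION_AMT for each city-and-state pair, keeping the NaN (unknown) group
mySums = myDF.groupby(myDF['CITY'] + ', ' + myDF['STATE'], dropna = False)['TRANSACTION_AMT'].sum()

# the totals are sorted in increasing order, so tail(21) shows the 20 largest known pairs plus the missing group
mySums.sort_values().tail(21)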

It turns out that the largest result is from missing city-and-state pairs.

We explain this further in the video.

Here are the top 6 city-and-state pairs (notice that the top result has a blank city-and-state pair, i.e., many of the donations have a blank city and state):

WASHINGTON, DC      4273606
LOS ANGELES, CA     4569952
DALLAS, TX          4747262
HOUSTON, TX         7606806
NEW YORK, NY       11344277
NaN                18092711

In your solution, you need to show the top 20 city-and-state pairs as well as the missing values.

Deliverables
  • Show the top 20 city-and-state pairs, as well as the missing values, according to the amount of money donated.

Question 5 (2 pts)

Revisit the video at the very beginning of Project 3, Example 3, in which we studied the departure delays (DepDelay) in the 1990 flight data:

/anvil/projects/tdm/data/flights/subset/1990.csv

This time, instead of finding the mean departure delays according to where the flights depart (in the Origin column), please find the mean departure delay for each Month / DayofMonth / Year triple.

To accomplish this, combine the three columns Month, DayofMonth, and Year, with '/' in between.

We want to find the worst 6 dates from 1990, according to the largest mean departure delay (DepDelay) values.
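
One possible sketch (note that the Month, DayofMonth, and Year columns are stored as integers, so astype(str) is needed before they can be joined with '/'):

myDF = pd.read_csv("/anvil/projects/tdm/data/flights/subset/1990.csv")

# build a string like "12/22/1990" for each flight
myDates = myDF['Month'].astype(str) + '/' + myDF['DayofMonth'].astype(str) + '/' + myDF['Year'].astype(str)

# mean departure delay on each date, sorted in increasing order, so the 6 worst dates appear at the end
myDF.groupby(myDates)['DepDelay'].mean().sort_values().tail(6)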

Here are the worst two dates from 1990, according to the largest mean departure delay (DepDelay) values.

12/22/1990    45.222249
12/21/1990    45.661782

In your solution, you need to show the worst 6 dates from 1990, according to the largest mean departure delay (DepDelay) values.

Deliverables
  • Show the worst 6 dates from 1990, according to the largest mean departure delay (DepDelay) values.

Submitting your Work

Please make sure that you have added comments for each question, explaining your thinking about how you solved that question. Please also make sure that your work is your own, and that any outside sources (people, internet pages, generative AI, etc.) are cited properly in the project template.

If you have any questions or issues regarding this project, please feel free to ask in seminar, over Piazza, or during office hours.

Prior to submitting your work, you need to put your work into the project template, re-run all of the code in your Jupyter notebook, and make sure that the results of running that code are visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.

Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points.

Items to submit
  • firstname_lastname_project4.ipynb

It is necessary to document your work, with comments about each solution. All of your work needs to be your own, with citations to any sources that you used. Please make sure that any outside sources (people, internet pages, generative AI, etc.) are cited properly in the project template.

You must double-check your .ipynb after submitting it to Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output, even though it may not.

Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.