TDM 10200: Python Project 4 — Spring 2025
Motivation: The groupby
function, as we learned in the previous project, can be used to split a data set, and then we can analyze each group of the data. It is not too hard to use, but we wanted to provide some additional practice for students.
Context: We provide five more examples of using the groupby
function on the Pandas data frames for five more data sets.
Scope: using the groupby
function on a Pandas data frame
Dataset(s)
This project will use the following datasets:
-
/anvil/projects/tdm/data/death_records/DeathRecords.csv
-
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
-
/anvil/projects/tdm/data/beer/reviews_sample.csv
-
/anvil/projects/tdm/data/election/itcont1980.txt
-
/anvil/projects/tdm/data/flights/subset/1990.csv
Questions
Questions 1, 2, 3 are just providing more practice for the |
Question 1 (2 pts)
It is OK to use 3 cores in this project, to prevent the kernel from dying when you load your data. If your kernel does die while you are working, it is OK to select Kernel from the menu in Jupyter Lab, and then Restart Kernel from the dropdown under Kernel, and re-run your analysis for that question. |
In the death records file:
/anvil/projects/tdm/data/death_records/DeathRecords.csv
Find the mean Age
of death for each Sex
.
-
Show the mean
Age
of death for eachSex
.
Question 2 (2 pts)
In the grocery store file:
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
Find the total amount of money spent on each PRODUCT_NUM
. (You can just add up the values in the SPEND
column, grouped according to the PRODUCT_NUM
value. You can ignore the UNITS
column.) Display the top 10 types of PRODUCT_NUM
, according to the total amount of money spent on those products (i.e., according to the sum
of the SPEND
column for those 10 PRODUCT_NUM
values).
When you read in the grocery data, you need to assign column names, as follows:
myDF = pd.read_csv("/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv")
myDF.columns = ["BASKET_NUM", "HSHD_NUM", "PURCHASE", "PRODUCT_NUM", "SPEND", "UNITS", "STORE_R", "WEEK_NUM", "YEAR"]
-
Show the top 10 types of
PRODUCT_NUM
, according to the total amount of money spent on those products (i.e., according to thesum
of theSPEND
column for those 10PRODUCT_NUM
values).
Question 3 (2 pts)
In this file of beer reviews /anvil/projects/tdm/data/beer/reviews_sample.csv
please consider the mean beer scores on each date.
Find the three dates on which the mean score is a 5.
Just like in Question 2, it is possible to use |
-
Show the three dates on which the mean
score
is a 5.
Question 4 (2 pts)
Revisit the video at the very beginning of Project 3, Example 1, in which we found the amount of money donated in each state, during the 1980 federal election cycle.
/anvil/projects/tdm/data/election/itcont1980.txt
Remember that we need to specify the column names for the 1980 election donation data:
myDF = pd.read_csv("/anvil/projects/tdm/data/election/itcont1980.txt", header=None, sep='|')
myDF.columns = ["CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID"]
This time, instead of finding the amount of money donated in each state in 1980, please find the amount of money donated in each city-and-state pair.
To accomplish this, combine the city and state together like this:
myDF.groupby(myDF['CITY'] + ', ' + myDF['STATE'], dropna = False)
Using the dropna = False
will allow us to (also) see the amount of money donated from unknown city-and-state pairs.
Now find the top 20 city-and-state pairs, according to the amount of money donated, as well as the amount of money from unknown city-and-state pairs. For this reason, we should use tail(21)
(so that we get 20 results in which the city-and-state pairs are known, and one result in which the city-and-state pairs are not known).
It turns out that the largest result is from missing city-and-state pairs.
We explain further, in the video.
Here are the top 6 city-and-state pairs (notice that the top result has a blank city-and-state pair, namely, many of the donations have a blank city-and-state):
WASHINGTON, DC 4273606
LOS ANGELES, CA 4569952
DALLAS, TX 4747262
HOUSTON, TX 7606806
NEW YORK, NY 11344277
NaN 18092711
In your solution, you need to show the top 20 city-and-state pairs as well as the missing values.
-
Show the top 20 city-and-state pairs, as well as the missing values, according to the amount of money donated.
Question 5 (2 pts)
Revisit the video at the very beginning of Project 3, Example 3, in which we studied the departure delays (DepDelay
) in the 1990 flight data:
/anvil/projects/tdm/data/flights/subset/1990.csv
This time, instead of finding the mean departure delays according to where the flights depart (in the Origin
column), please find mean departure delays on each Month / DayofMonth / Year triple
To accomplish this, combine the three columns for the Month, DayofMonth, and Year together, with '/' in between.
We want to find the worst 6 dates from 1990, according to the largest mean departure delay (DepDelay
) values.
Here are the worst two dates from 1990, according to the largest mean departure delay (DepDelay
) values.
12/22/1990 45.222249
12/21/1990 45.661782
In your solution, you need to show the worst 6 dates from 1990, according to the largest mean departure delay (DepDelay
) values.
-
Show the worst 6 dates from 1990, according to the largest mean departure delay (
DepDelay
) values.
Submitting your Work
Please make sure that you added comments for each question, which explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template.
If you have any questions or issues regarding this project, please feel free to ask in seminar, over Piazza, or during office hours.
Prior to submitting your work, you need to put your work into the project template, and re-run all of the code in your Jupyter notebook and make sure that the results of running that code is visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.
Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points.
-
firstname_lastname_project4.ipynb
It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used. Please make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template. You must double check your Please take the time to double check your work. See here for instructions on how to double check this. You will not receive full credit if your |