TDM 10200: Python Project 7 — Spring 2025
Motivation: We learn even more about working with columns of data frames in Pandas, including working with multiple columns at once.
Context: This project builds on our knowledge of indexing and grouping values, and deepens our understanding of these topics.
Scope: We will become even more familiar with indexing and grouping in Python.
Dataset(s)
This project will use the following datasets:
- /anvil/projects/tdm/data/death_records/DeathRecords.csv
- /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
- /anvil/projects/tdm/data/beer/reviews_sample.csv
- /anvil/projects/tdm/data/election/itcont1980.txt
- /anvil/projects/tdm/data/flights/subset/1990.csv
Questions
If your session crashes when you read in the data (for instance, on Question 2), you might want to try using 3 cores in your session instead of 1 core.
Question 1 (2 pts)
In the death records file:
/anvil/projects/tdm/data/death_records/DeathRecords.csv
Use the pd.cut function from Pandas to classify people at their time of death into 6 bins:
- "youth": less than or equal to 18 years old
- "young adult": older than 18 but less than or equal to 25 years old
- "adult": older than 25 but less than or equal to 35 years old
- "middle age": older than 35 but less than or equal to 55 years old
- "senior": older than 55 but less than or equal to 150 years old (or any other upper threshold that you like)
- "unknown": age of 999 (you could use, say, ages 150 to Inf for this category)
- First cut the data from the Age column into the bins described above.
- In the cut function, add labels corresponding to the 6 categories above.
- Now make a barplot from the table, showing the number of people in each of the 6 categories above (using the same matplotlib techniques that we have used in earlier projects). A sketch of this approach appears after the deliverables below.
- a. A table showing how many people are in each of the 6 categories above at the time of their death. (The labels for part a should be the default labels, i.e., like this: (-Inf,18] (18,25] (25,35] (35,55] (55,150] (150,Inf].)
- b. The same table output as in part a, but now also adding labels corresponding to the 6 categories above.
- c. A barplot that shows the number of people in each of the 6 categories above.
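Here is a minimal sketch of one way to approach Question 1. The bin edges and label names follow the 6 categories above; myDF is a hypothetical name for the data frame, and the column name Age is taken from the question.

import pandas as pd
import matplotlib.pyplot as plt

# read the death records (myDF is a hypothetical name for the data frame)
myDF = pd.read_csv("/anvil/projects/tdm/data/death_records/DeathRecords.csv")

# bin edges for the 6 categories described in the question
bins = [-float("inf"), 18, 25, 35, 55, 150, float("inf")]

# part a: cut the Age column into the 6 bins, using the default interval labels
print(pd.cut(myDF["Age"], bins=bins).value_counts().sort_index())

# part b: the same cut, but with descriptive labels for the 6 categories
labels = ["youth", "young adult", "adult", "middle age", "senior", "unknown"]
mytable = pd.cut(myDF["Age"], bins=bins, labels=labels).value_counts().sort_index()
print(mytable)

# part c: barplot of the number of people in each of the 6 categories
mytable.plot.bar()
plt.show()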
Question 2 (2 pts)
In the grocery store file:
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
Use the groupby function to sum the values from the SPEND column, according to 8 categories, namely, according to whether the YEAR is 2016 or 2017, and according to whether the STORE_R value is CENTRAL, EAST, SOUTH, or WEST. A sketch of this approach appears after the deliverable below.
- Show the sum of the values in the SPEND column according to the 8 possible pairs of YEAR and STORE_R values.
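Here is a minimal sketch of one way to approach Question 2; myDF is a hypothetical name for the data frame, and the column names YEAR, STORE_R, and SPEND are taken from the question.

import pandas as pd

# read the grocery store transactions (myDF is a hypothetical name)
myDF = pd.read_csv("/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv")

# sum the SPEND values within each of the 8 possible (YEAR, STORE_R) pairs
print(myDF.groupby(["YEAR", "STORE_R"])["SPEND"].sum())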
Question 3 (2 pts)
In this file of beer reviews: /anvil/projects/tdm/data/beer/reviews_sample.csv
Use groupby to find the mean score value for each month and year pair. Your groupby should output a table with the years and the months clearly labelled. A sketch of this approach appears after the deliverable below.
- Print a table displaying the mean score values for each month and year pair.
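Here is a minimal sketch of one way to approach Question 3; myDF is a hypothetical name for the data frame, and it assumes that the review timestamps are stored in a column named date (the score column name is taken from the question).

import pandas as pd

# read the sample of beer reviews, parsing the (assumed) date column as timestamps
myDF = pd.read_csv("/anvil/projects/tdm/data/beer/reviews_sample.csv", parse_dates=["date"])

# group by the (year, month) pairs extracted from the date column, with the
# index levels renamed so that the years and months are clearly labelled,
# and display the mean score for each pair
print(myDF.groupby([myDF["date"].dt.year.rename("year"),
                    myDF["date"].dt.month.rename("month")])["score"].mean())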
Question 4 (2 pts)
Read in the 1980 election data using:
import pandas as pd
myDF = pd.read_csv("/anvil/projects/tdm/data/election/itcont1980.txt", header=None, sep='|')
myDF.columns = ["CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID"]
In this question, we do not care about the dollar amounts of the election donations. In other words, do not pay any attention to the TRANSACTION_AMT column. Only pay attention to the number of donations. There is one donation per row in the data set.
- Make a subset called indianaDF that contains only the donations for which the STATE is IN. Using only the data in the data frame indianaDF, make a table of the number of occurrences of each value in the CITY column. Sort your results in numerical order, and print the largest 41 entries.
- Same question as part a, but this time, do not make a separate data frame (and, in particular, do not use indianaDF). Instead, in one line, limit the values of myDF to only focus on the ones with myDF['STATE'] == "IN". Now consider the elements from the CITY column. Amongst these CITY values, make a table of the number of occurrences of each CITY. Sort the results and print the largest 41 entries. (Your results from parts 4a and 4b should look the same, obtained using these two different methods.)
- Find at least one strange thing about the top 41 entries in your result. (A sketch of the code for parts a and b appears after the deliverables below.)
- a. Using the data frame indianaDF that you created, find the top 41 cities in Indiana, according to the number of donations from people in that city.
- b. Using only indexing (do not use indianaDF), again find the top 41 cities in Indiana, according to the number of donations from people in that city.
- c. Find at least one strange thing about the top 41 entries in your result.
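Here is a minimal sketch of parts a and b, assuming that myDF has already been read in with the code given in the question. Note that value_counts() sorts the counts in decreasing order, so head(41) keeps the 41 largest entries.

# part a: subset of the donations from Indiana, then count the number of
# donations (one per row) for each CITY and keep the largest 41 entries
indianaDF = myDF[myDF["STATE"] == "IN"]
print(indianaDF["CITY"].value_counts().head(41))

# part b: the same counts, this time in one line, without a separate data frame
print(myDF[myDF["STATE"] == "IN"]["CITY"].value_counts().head(41))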
Question 5 (2 pts)
Consider the 1990 flight data:
/anvil/projects/tdm/data/flights/subset/1990.csv
The DepDelay values are given in minutes. We will classify the number of flights according to how many hours the flights were delayed.
Use the pd.cut command to classify the number of flights in each of these categories:
- Flight departed early or on time, i.e., DepDelay is negative or 0.
- Flight departed more than 0 but less than or equal to 60 minutes late.
- Flight departed more than 60 but less than or equal to 120 minutes late.
- Flight departed more than 120 but less than or equal to 180 minutes late.
- Flight departed more than 180 but less than or equal to 240 minutes late.
- Flight departed more than 240 but less than or equal to 300 minutes late.
Etc., etc., and finally:
- Flight departed more than 1380 but less than or equal to 1440 minutes late.
Print the results showing the number of flights in each of these categories. Use the dropna=False option in your value_counts(), so that the number of flights without a known DepDelay is also given. A sketch of this approach appears after the deliverable below.
- Give the results described above, which classify the number of flights according to the number of hours that the flights are delayed.
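Here is a minimal sketch of one way to approach Question 5; myDF is a hypothetical name for the data frame, and the column name DepDelay is taken from the question.

import pandas as pd

# read the 1990 flight data (myDF is a hypothetical name for the data frame)
myDF = pd.read_csv("/anvil/projects/tdm/data/flights/subset/1990.csv")

# bin edges: early or on time, then one bin per hour of delay, up to 24 hours
bins = [-float("inf")] + list(range(0, 1441, 60))

# classify each flight's DepDelay into these bins and count the flights in each
# bin; dropna=False also counts the flights without a known DepDelay (any delay
# longer than 1440 minutes would fall outside the bins and be counted there too)
print(pd.cut(myDF["DepDelay"], bins=bins).value_counts(dropna=False).sort_index())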
Submitting your Work
Please make sure that you added comments for each question, which explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generative AI, etc.) are cited properly in the project template.
If you have any questions or issues regarding this project, please feel free to ask in seminar, over Piazza, or during office hours.
Prior to submitting your work, you need to put your work into the project template, re-run all of the code in your Jupyter notebook, and make sure that the results of running that code are visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.
Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points.
- firstname_lastname_project7.ipynb
It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any outside sources (people, internet pages, generative AI, etc.) that you used. Please take the time to double check your work; see here for instructions on how to double check your submission. You will not receive full credit if your submitted notebook does not show all of your work and output.