TDM 10200: Python Project 7 — Spring 2025
Motivation: We learn even more about working with columns of data frames in Pandas, including working with multiple columns at once.
Context: This project builds on our knowledge of indexing and grouping values, and deepens our understanding of these topics.
Scope: We will become even more familiar with indexing and grouping in Python.
Dataset(s)
This project will use the following datasets:
- /anvil/projects/tdm/data/death_records/DeathRecords.csv
- /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
- /anvil/projects/tdm/data/beer/reviews_sample.csv
- /anvil/projects/tdm/data/election/itcont1980.txt
- /anvil/projects/tdm/data/flights/subset/1990.csv
Questions
If your session crashes when you read in the data (for instance, on Question 2), you might want to try using 3 cores in your session instead of 1 core.
Question 1 (2 pts)
In the death records file:
/anvil/projects/tdm/data/death_records/DeathRecords.csv
Use the pd.cut function from Pandas to classify people at their time of death into 6 bins:
- "youth": less than or equal to 18 years old
- "young adult": older than 18 but less than or equal to 25 years old
- "adult": older than 25 but less than or equal to 35 years old
- "middle age": older than 35 but less than or equal to 55 years old
- "senior": older than 55 but less than or equal to 150 years old (or any other upper threshold that you like)
- "unknown": age of 999 (you could use, say, ages 150 to Inf for this category)
- First cut the data from the Age column into the bins described above.
- In the cut function, add labels corresponding to the 6 categories above.
- Now make a barplot from the table, showing the number of people in each of the 6 categories above (using the same matplotlib techniques that we have used in earlier projects). A sketch of this approach appears after the deliverables below.
- a. A table showing how many people are in each of the 6 categories above at the time of their death. (The labels for part a should be the default labels, i.e., like this: (-Inf,18] (18,25] (25,35] (35,55] (55,150] (150,Inf].)
- b. The same table output as in part a, but now also adding labels corresponding to the 6 categories above.
- c. A barplot that shows the number of people in each of the 6 categories above.
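Here is a minimal sketch of one way to approach Question 1. The bin edges and label names follow the 6 categories above; myDF is a hypothetical name for the data frame, and the column name Age is taken from the question.

import pandas as pd
import matplotlib.pyplot as plt

# read the death records (myDF is a hypothetical name for the data frame)
myDF = pd.read_csv("/anvil/projects/tdm/data/death_records/DeathRecords.csv")

# bin edges for the 6 categories described in the question
bins = [-float("inf"), 18, 25, 35, 55, 150, float("inf")]

# part a: cut the Age column into the 6 bins, using the default interval labels
print(pd.cut(myDF["Age"], bins=bins).value_counts().sort_index())

# part b: the same cut, but with descriptive labels for the 6 categories
labels = ["youth", "young adult", "adult", "middle age", "senior", "unknown"]
mytable = pd.cut(myDF["Age"], bins=bins, labels=labels).value_counts().sort_index()
print(mytable)

# part c: barplot of the number of people in each of the 6 categories
mytable.plot.bar()
plt.show()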
Question 2 (2 pts)
In the grocery store file:
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
Use the groupby function to sum the values from the SPEND column, according to 8 categories, namely, according to whether the YEAR is 2016 or 2017, and according to whether the STORE_R value is CENTRAL, EAST, SOUTH, or WEST. A sketch of this approach appears after the deliverable below.
- Show the sum of the values in the SPEND column according to the 8 possible pairs of YEAR and STORE_R values.
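Here is a minimal sketch of one way to approach Question 2; myDF is a hypothetical name for the data frame, and the column names YEAR, STORE_R, and SPEND are taken from the question.

import pandas as pd

# read the grocery store transactions (myDF is a hypothetical name)
myDF = pd.read_csv("/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv")

# sum the SPEND values within each of the 8 possible (YEAR, STORE_R) pairs
print(myDF.groupby(["YEAR", "STORE_R"])["SPEND"].sum())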
Question 3 (2 pts)
In this file of beer reviews: /anvil/projects/tdm/data/beer/reviews_sample.csv
Use groupby to find the mean score value for each month and year pair. Your groupby should output a table with the years and the months clearly labelled. A sketch of this approach appears after the deliverable below.
- Print a table displaying the mean score values for each month and year pair.
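Here is a minimal sketch of one way to approach Question 3; myDF is a hypothetical name for the data frame, and it assumes that the review timestamps are stored in a column named date (the score column name is taken from the question).

import pandas as pd

# read the sample of beer reviews, parsing the (assumed) date column as timestamps
myDF = pd.read_csv("/anvil/projects/tdm/data/beer/reviews_sample.csv", parse_dates=["date"])

# group by the (year, month) pairs extracted from the date column, with the
# index levels renamed so that the years and months are clearly labelled,
# and display the mean score for each pair
print(myDF.groupby([myDF["date"].dt.year.rename("year"),
                    myDF["date"].dt.month.rename("month")])["score"].mean())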
Question 4 (2 pts)
Read in the 1980 election data using:
import pandas as pd
myDF = pd.read_csv("/anvil/projects/tdm/data/election/itcont1980.txt", header=None, sep='|')
myDF.columns = ["CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID"]
In this question, we do not care about the dollar amounts of the election donations. In other words, do not pay any attention to the TRANSACTION_AMT column. Only pay attention to the number of donations. There is one donation per row in the data set.
- Make a subset called indianaDF that contains only the donations for which the STATE is IN. Using only the data in the data frame indianaDF, make a table of the number of occurrences of each value in the CITY column. Sort your results in numerical order, and print the largest 41 entries.
- Same question as part a, but this time, do not make a separate data frame (and, in particular, do not use indianaDF). Instead, in one line, limit the values of myDF to only focus on the ones with myDF['STATE'] == "IN". Now consider the elements from the CITY column. Amongst these CITY values, make a table of the number of occurrences of each CITY. Sort the results and print the largest 41 entries. (Your results from parts 4a and 4b should look the same, obtained using these two different methods.)
- Find at least one strange thing about the top 41 entries in your result. (A sketch of the code for parts a and b appears after the deliverables below.)
- a. Using the data frame indianaDF that you created, find the top 41 cities in Indiana, according to the number of donations from people in that city.
- b. Using only indexing (do not use indianaDF), again find the top 41 cities in Indiana, according to the number of donations from people in that city.
- c. Find at least one strange thing about the top 41 entries in your result.
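Here is a minimal sketch of parts a and b, assuming that myDF has already been read in with the code given in the question. Note that value_counts() sorts the counts in decreasing order, so head(41) keeps the 41 largest entries.

# part a: subset of the donations from Indiana, then count the number of
# donations (one per row) for each CITY and keep the largest 41 entries
indianaDF = myDF[myDF["STATE"] == "IN"]
print(indianaDF["CITY"].value_counts().head(41))

# part b: the same counts, this time in one line, without a separate data frame
print(myDF[myDF["STATE"] == "IN"]["CITY"].value_counts().head(41))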
Question 5 (2 pts)
Consider the 1990 flight data:
/anvil/projects/tdm/data/flights/subset/1990.csv
The DepDelay values are given in minutes. We will classify the number of flights according to how many hours the flights were delayed.
Use the pd.cut command to classify the number of flights in each of these categories:
- Flight departed early or on time, i.e., DepDelay is negative or 0.
- Flight departed more than 0 but less than or equal to 60 minutes late.
- Flight departed more than 60 but less than or equal to 120 minutes late.
- Flight departed more than 120 but less than or equal to 180 minutes late.
- Flight departed more than 180 but less than or equal to 240 minutes late.
- Flight departed more than 240 but less than or equal to 300 minutes late.
Etc., etc., and finally:
- Flight departed more than 1380 but less than or equal to 1440 minutes late.
Print the results showing the number of flights in each of these categories. Use the dropna=False option in your value_counts(), so that the number of flights without a known DepDelay is also given. A sketch of this approach appears after the deliverable below.
- Give the results described above, which classify the number of flights according to the number of hours that the flights are delayed.
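Here is a minimal sketch of one way to approach Question 5; myDF is a hypothetical name for the data frame, and the column name DepDelay is taken from the question.

import pandas as pd

# read the 1990 flight data (myDF is a hypothetical name for the data frame)
myDF = pd.read_csv("/anvil/projects/tdm/data/flights/subset/1990.csv")

# bin edges: early or on time, then one bin per hour of delay, up to 24 hours
bins = [-float("inf")] + list(range(0, 1441, 60))

# classify each flight's DepDelay into these bins and count the flights in each
# bin; dropna=False also counts the flights without a known DepDelay (any delay
# longer than 1440 minutes would fall outside the bins and be counted there too)
print(pd.cut(myDF["DepDelay"], bins=bins).value_counts(dropna=False).sort_index())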
Submitting your Work
Please make sure that you added comments for each question, which explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generative AI, etc.) are cited properly in the project template.
If you have any questions or issues regarding this project, please feel free to ask in seminar, over Piazza, or during office hours.
Prior to submitting your work, you need to put your work into the project template, re-run all of the code in your Jupyter notebook, and make sure that the results of running that code are visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.
Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points.
- firstname_lastname_project7.ipynb
It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any outside sources (people, internet pages, generative AI, etc.) that you used. Please take the time to double check your work; see here for instructions on how to double check your submission. You will not receive full credit if your submitted notebook does not show all of your work and output.