TDM 10200: Python Project 6 — Spring 2025

Motivation: We continue to focus on indexing in Python.

Context: It is helpful to have various ways to refer to data in Python, e.g., according to True/False values, or according to numbers, or by giving names as indices.
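As a minimal sketch of these three styles of indexing, the example below uses a small made-up pandas Series. The labels and values are hypothetical and are not taken from any of the project datasets.

```python
import pandas as pd

# Hypothetical toy data; the labels and values are made up for illustration.
scores = pd.Series([88, 92, 75, 60], index=["ann", "bob", "cat", "dan"])

# 1. Indexing by True/False values: keep the entries where the condition holds.
high = scores[scores > 80]

# 2. Indexing by numbers (positions): the first two entries.
first_two = scores.iloc[0:2]

# 3. Indexing by names: look up entries by their index labels.
by_name = scores.loc[["cat", "dan"]]

print(high)
print(first_two)
print(by_name)
```

Here `iloc` selects by position and `loc` selects by label, while a Series of True/False values inside the square brackets keeps only the matching entries.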

Scope: We will continue to get familiar with indexing in Python.

Learning Objectives:

Learn how to work with indexes in Python.

Make sure to read about and use the project template found here, and read the important information about project submissions here.

Dataset(s)

This project will use the following datasets:

/anvil/projects/tdm/data/death_records/DeathRecords.csv
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
/anvil/projects/tdm/data/beer/reviews_sample.csv
/anvil/projects/tdm/data/election/itcont1980.txt
/anvil/projects/tdm/data/flights/subset/1990.csv

Questions

Question 1 (2 pts)

In the death records file:

/anvil/projects/tdm/data/death_records/DeathRecords.csv

you can see which races the values in the Race column represent by looking at page 15 of the PDF source file: www.cdc.gov/nchs/data/dvs/Record_Layout_2014.pdf

(You are not required to look at this PDF, and you do not need to analyze the individual values in the Race column, but we thought it would be helpful for context anyway.)

Make a table of the values in the Race column and how many times each Race value occurs.
How many of the people in the data set are recorded as Filipino? (The Race value 7 represents Filipino.)
Make a plot of the table of the counts of the Age values at the time of death, for people whose Race value is 7 (Filipino) and whose Age is not 999.
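The general pattern for these steps can be sketched on a tiny made-up data frame. The rows below are hypothetical and only mimic the Race and Age columns; the real file must still be read and analyzed separately.

```python
import pandas as pd

# Hypothetical toy data that only mimics the Race and Age columns.
toy = pd.DataFrame({
    "Race": [1, 7, 7, 2, 7, 1, 7],
    "Age":  [80, 65, 999, 70, 65, 90, 72],
})

# Table of each Race value and how often it occurs.
race_counts = toy["Race"].value_counts()

# Count for one Race value, looked up by its label in the table.
n_race_7 = race_counts.loc[7]

# Boolean index combining two conditions; each condition needs parentheses.
keep = (toy["Race"] == 7) & (toy["Age"] != 999)

# Counts of the Age values for the selected rows, ordered by age.
age_counts = toy.loc[keep, "Age"].value_counts().sort_index()

print(race_counts)
print(n_race_7)
print(age_counts)
# age_counts.plot.bar() would draw this table as a bar chart (requires matplotlib).
```

Combining conditions with `&` keeps only the rows for which both conditions are True, so no intermediate data frame is needed before counting.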

Deliverables

a. A table of the values in the Race column and how many times each Race value occurs.
b. How many of the people in the data set are recorded as Filipino (Race value 7)?
c. Plot of the table of the counts of the Age values for people whose Race value is 7 (Filipino) and whose Age is not equal to 999.

Question 2 (2 pts)

Consider the grocery store file:

/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv

Let’s re-examine Project 5, Question 2, as follows: Make a series of the values from the STORE_R column that satisfy the index condition SPEND < 0, without creating a new data frame!
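One possible way to index a single column by a condition on another column, without building a second data frame, is sketched below on made-up rows. The values are hypothetical; only the column names STORE_R and SPEND come from the question.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the question.
toy = pd.DataFrame({
    "STORE_R": ["CENTRAL", "EAST", "CENTRAL", "WEST", "EAST", "CENTRAL"],
    "SPEND":   [3.50, -1.25, -4.00, 2.00, 7.10, -0.99],
})

# .loc[rows, column] returns a Series of STORE_R values where SPEND < 0.
refunds = toy.loc[toy["SPEND"] < 0, "STORE_R"]

# Count how many refunds fall under each STORE_R value.
refund_counts = refunds.value_counts()

print(refund_counts)
```

The result of `toy.loc[condition, "STORE_R"]` is a Series rather than a data frame, which is what the question asks for.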

Deliverables

Show the number of refunds for each STORE_R value, by only using indexing (not making a new data frame). (Your solutions should match the answers from Project 5, Question 2. For instance, CENTRAL stores had 2750 refunds. BUT this time, do not create a new data frame!)

Question 3 (2 pts)

Consider this file of beer reviews:

/anvil/projects/tdm/data/beer/reviews_sample.csv

When we say "most popular" we are referring to usernames who wrote the most reviews altogether.

Consider the values in the column username as follows: Please sort these values (by the number of occurrences) and show only the tail, so that you can see the most popular 10 username values, and the number of reviews that each of these 10 people wrote. Hint: The user named acurtis wrote the most reviews!
In part 3b, consider only the reviews written by the user acurtis. What is the average score of the reviews that were written by the user acurtis?
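A small sketch of the sort-and-tail idea, and of averaging over the rows for one user, is given below. The usernames and scores are made up, and the column name score is an assumption about how the review score is stored in the file.

```python
import pandas as pd

# Hypothetical toy data; "score" is an assumed column name.
toy = pd.DataFrame({
    "username": ["u1", "u2", "u1", "u3", "u1", "u2"],
    "score":    [4.0, 3.5, 4.5, 2.0, 5.0, 4.5],
})

# Count reviews per username, then sort from fewest to most reviews.
review_counts = toy["username"].value_counts().sort_values()

# The tail of the sorted counts holds the most popular usernames.
top_two = review_counts.tail(2)

# Average score over only the reviews written by one user.
u1_mean = toy.loc[toy["username"] == "u1", "score"].mean()

print(top_two)
print(u1_mean)
```

With the real data, `tail(10)` in place of `tail(2)` would show the 10 usernames with the most reviews.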

Deliverables

a. Print the most popular 10 username values, and the number of reviews that each of these 10 people wrote.
b. Find the average score of the reviews that were written by the user acurtis.

Question 4 (2 pts)

Revisit Question 4 from Project 5, about the 1980 election data. Again find the 9 NAME values for which the TRANSACTION_DT value is missing, but this time, do not create a new data frame!

(If you did not create a data frame in Question 4 of Project 5, then you can just repeat your solution here, for full credit.)
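As a hedged illustration of selecting by missing values, the sketch below uses `isna()` as the row condition on a made-up data frame. The names and dates are hypothetical; only the column names NAME and TRANSACTION_DT come from the question.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the question.
toy = pd.DataFrame({
    "NAME": ["SMITH, A", "JONES, B", "LEE, C", "DOE, D"],
    "TRANSACTION_DT": ["01151980", None, "03021980", None],
})

# isna() gives True where the date is missing; use it to index the NAME column.
missing_names = toy.loc[toy["TRANSACTION_DT"].isna(), "NAME"]

print(missing_names)
```

The selection returns a Series of NAME values, so no new data frame is created along the way.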

Deliverables

Give the 9 NAME values for which the TRANSACTION_DT value is missing, but this time, do not create a new data frame!

Question 5 (2 pts)

Consider the 1990 flight data:

/anvil/projects/tdm/data/flights/subset/1990.csv

In this question, we want to find three mean values, namely:

Find the mean of the DepDelay of all of the flights whose Origin airport is EWR.

Find the mean of the DepDelay of all of the flights whose Origin airport is JFK.

Find the mean of the DepDelay of all of the flights whose Origin airport is LGA.

It should be possible to find all three mean values with just one line of Python altogether. Hint: Consider using a groupby in your work.
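One way the groupby hint could look is sketched below on made-up flights. The delays are hypothetical; only the column names Origin and DepDelay and the three airport codes come from the question.

```python
import pandas as pd

# Hypothetical toy data; only the column names and airport codes come from the question.
toy = pd.DataFrame({
    "Origin":   ["EWR", "JFK", "LGA", "EWR", "ORD", "JFK", "LGA"],
    "DepDelay": [10.0, 5.0, 0.0, 20.0, 30.0, None, 4.0],
})

# Group by Origin, average DepDelay within each group, then index by airport name.
means = toy.groupby("Origin")["DepDelay"].mean().loc[["EWR", "JFK", "LGA"]]

print(means)
```

The grouped means form a Series indexed by airport code, so indexing by name with `.loc` picks out the three airports of interest. Missing delays are skipped by `mean()` by default.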

Deliverables

Find the mean of the DepDelay of all of the flights whose Origin airport is EWR.
Find the mean of the DepDelay of all of the flights whose Origin airport is JFK.
Find the mean of the DepDelay of all of the flights whose Origin airport is LGA.
Use just one line of Python altogether for this purpose.

Submitting your Work

Please make sure that you have added comments for each question that explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generative AI, etc.) are cited properly in the project template.

If you have any questions or issues regarding this project, please feel free to ask in seminar, over Piazza, or during office hours.

Prior to submitting your work, you need to put your work into the project template, re-run all of the code in your Jupyter notebook, and make sure that the results of running that code are visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.

Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points.

Items to submit

firstname_lastname_project6.ipynb

It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used.

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not.

Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.