TDM 10200: Python Project 11 — Spring 2025

Motivation: We continue to learn how to create functions and apply the functions to many data sets in an efficient way. We can extract information in a straightforward way.

Context: We learn how to apply functions using map and using list comprehensions.

Scope: Applying functions to data.

Learning Objectives:
  • Learn how to apply functions to large data sets and extract information.

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following datasets:

  • /anvil/projects/tdm/data/flights/subset/* (flights data)

  • /anvil/projects/tdm/data/election/itcont/* (election data)

  • /anvil/projects/tdm/data/icecream/* (ice cream data)

In this project, we walk students through these powerful techniques.

It should be enough to use 2 cores in your Jupyter Lab session for this project.

Some of the questions in this project take a few minutes to run.

Please remember to make comments (in full English sentences) about your understanding of each of the questions and solutions. This is especially important in this project, where we walk you through the solutions, but we still want you to write some explanations of your own!

Questions

Question 1 (2 pts)

For this question, only use the data corresponding to flights with Origin at the Indianapolis IND airport.

Hint: If you only want to import the 3 columns of the data that you need, recall that column 1 (the second column) is the month; column 15 (the 16th column) is the DepDelay; and column 16 (the 17th column) is the Origin. Python starts numbering from 0.

  1. Write a function called monthlydepdelays that takes a year as the input and uses groupby to return a Series of length 12 with the average DepDelay for flights starting at IND in each of the 12 months of that year.

  2. Test your function individually, one at a time, on the years 1990, 1998, and 2005. For instance, if you run monthlydepdelays(1990), the output should be something like:

1     7.282772
2     9.497027
3     6.924841
4     4.949858
5     5.471487
6     6.010835
7     4.307377
8     5.639782
9     4.455586
10    4.473725
11    3.408304
12    9.764105
Deliverables
  • Write a function called monthlydepdelays that takes a year as the input and returns a Series of length 12 with the average DepDelay for flights starting at IND in each of the 12 months of that year.

  • Show the output of monthlydepdelays(1990) and monthlydepdelays(1998) and monthlydepdelays(2005).

Question 2 (2 pts)

Consider this page, about using matplotlib to display several plots all at once:

We want to display 6 plots, which appear in 3 rows and 2 columns, namely: plot the results of monthlydepdelays for the years 1988 through 1993.

It could look as simple as this:

import matplotlib.pyplot as plt
results1988 = monthlydepdelays(1988)
results1989 = monthlydepdelays(1989)
results1990 = monthlydepdelays(1990)
results1991 = monthlydepdelays(1991)
results1992 = monthlydepdelays(1992)
results1993 = monthlydepdelays(1993)
fig, axs = plt.subplots(2, 3)
axs[0, 0].bar(results1988.index, results1988)
axs[0, 0].set_title("1988 Delays")
axs[0, 1].bar(results1989.index, results1989)
axs[0, 1].set_title("1989 Delays")
axs[0, 2].bar(results1990.index, results1990)
axs[0, 2].set_title("1990 Delays")
axs[1, 0].bar(results1991.index, results1991)
axs[1, 0].set_title("1991 Delays")
axs[1, 1].bar(results1992.index, results1992)
axs[1, 1].set_title("1992 Delays")
axs[1, 2].bar(results1993.index, results1993)
axs[1, 2].set_title("1993 Delays")

or if you want to use a loop, you could write:

import matplotlib.pyplot as plt
allmyyears = [myyear for myyear in range(1988,1994)]
myresults = [monthlydepdelays(myyear) for myyear in allmyyears]
nrow = 2
ncol = 3
fig, axs = plt.subplots(nrow, ncol)
for i, ax in enumerate(fig.axes):
    ax.bar(myresults[i].index, myresults[i])
    ax.set_title(str(allmyyears[i]) + " Delays")
Deliverables
  • Make a 3 by 2 frame of 6 plots, corresponding to the results of monthlydepdelays in the years 1988 through 1993.

Question 3 (2 pts)

For this question, only use the data corresponding to donations from the state of Indiana.

  1. Write a function called myindycities that takes a year as the input and uses groupby to make a Series of length 10, containing the top 10 cities in Indiana according to the sum of the amount of donations (in dollars) given in each city, during that year.

  2. Test your function individually, one at a time, on the years 1980, 1986, and 1992. For instance, if you run myindycities(1984), the output should be something like:

FT WAYNE         44665
TERRE HAUTE      52650
CARMEL           53200
EVANSVILLE       65250
SOUTH BEND       68387
INDPLS           76520
FORT WAYNE       80882
ELKHART          93171
MUNCIE          104260
INDIANAPOLIS    511935
Deliverables
  • Write a function called myindycities that takes a year as the input and uses groupby to make a Series of length 10, containing the top 10 cities in Indiana according to the sum of the amount of donations (in dollars) given in each city, during that year.

  • Show the output of myindycities(1980) and myindycities(1986) and myindycities(1992).

Question 4 (2 pts)

  1. Run the function myindycities on each of the even-numbered election years 1984 to 1994 as follows:

allmyyears = [myyear for myyear in range(1984,1995,2)]
myresults = [myindycities(myyear) for myyear in allmyyears]
  1. Now display 6 plots, which appear in 3 rows and 2 columns, namely: plot the results of myindycities for the even-numbered years 1984 through 1994.

If you want to use a loop, you could write:

import matplotlib.pyplot as plt
nrow = 2
ncol = 3
fig, axs = plt.subplots(nrow, ncol)
for i, ax in enumerate(fig.axes):
    ax.barh(myresults[i].index, myresults[i])
    ax.set_title(str(allmyyears[i]) + " Indy Cities")

Do not worry about the fact that the plots are overlapping.

Deliverables
  • Show the results of myindycities for each even-numbered year from 1984 to 1994.

  • Make a horizontal bar plot for each of the 6 years in part a.

Question 5 (2 pts)

  1. Find the average number of stars in each of these four files:

/anvil/projects/tdm/data/icecream/bj/reviews.csv

/anvil/projects/tdm/data/icecream/breyers/reviews.csv

/anvil/projects/tdm/data/icecream/hd/reviews.csv

/anvil/projects/tdm/data/icecream/talenti/reviews.csv

  1. Write a function myavgstars that takes a company name (e.g., either "bj" or "breyers" or "hd" or "talenti") as input, and returns the average number of stars for that company.

  2. Define a vector of length 4, with all 4 of these company names:

mycompanies = ["bj", "breyers", "hd", "talenti"]

and now use a list comprehension to run the function from part b that re-computes the values from part a, all at once, like this:

[myavgstars(x) for x in mycompanies]
Deliverables
  • Print the average number of stars for each of the 4 ice cream companies.

  • Write a function myavgstars that takes a company name (e.g., either "bj" or "breyers" or "hd" or "talenti") as input, and returns the average number of stars for that company.

  • Use a list comprehension to run the function from part b on the list mycompanies, which should give the same values as in part a.

Submitting your Work

Please make sure that you added comments for each question, which explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template.

If you have any questions or issues regarding this project, please feel free to ask in seminar, over Piazza, or during office hours.

Prior to submitting your work, you need to put your work into the project template, and re-run all of the code in your Jupyter notebook and make sure that the results of running that code is visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.

Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points.

Items to submit
  • firstname_lastname_project11.ipynb

It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used. Please make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template.

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not.

Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.