TDM 10200: Python Project 8 — Spring 2025

Motivation: It is helpful to learn how to define our own functions in Python.

Context: In this project, we learn how to design functions and use them.

Scope: We start with the basics and learn several one-line functions, with lots of examples.

Learning Objectives:
  • User-defined functions in Python.

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following datasets:

  • /anvil/projects/tdm/data/death_records/DeathRecords.csv

  • /anvil/projects/tdm/data/beer/reviews_sample.csv

  • /anvil/projects/tdm/data/election/itcont1980.txt

  • /anvil/projects/tdm/data/flights/subset/1990.csv

  • /anvil/projects/tdm/data/olympics/athlete_events.csv

Example 1:

Finding the average weight of Olympic athletes in a given country.

import pandas as pd

def avgweights (mycountry: str) -> float:
    """
    The avgweights function takes a 3-character string as input, representing the NOC code for a country, and returns the average weight of Olympic athletes in that country.

    Args:
    mycountry (str): This is a 3-character string as input, representing the NOC code for a country

    Returns:
    countryweight (float): This is the average weight of Olympic athletes in that country.  It is a decimal number, which is called a float in Python.
    """
    myDF = pd.read_csv('/anvil/projects/tdm/data/olympics/athlete_events.csv')
    countryweight = myDF[myDF['NOC'] == mycountry]['Weight'].mean()
    return countryweight
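
To see what avgweights computes without reading the full file, here is a minimal sketch on a small, made-up data frame. The rows and the helper name avgweights_from_df are hypothetical; only the column names NOC and Weight are taken from the example above.

```python
import pandas as pd

# Hypothetical toy data; only the column names match the Olympics file.
toyDF = pd.DataFrame({
    'NOC':    ['USA', 'USA', 'CHN', 'USA', 'CHN'],
    'Weight': [80.0, 70.0, 60.0, None, 64.0],
})

def avgweights_from_df(myDF: pd.DataFrame, mycountry: str) -> float:
    """Same filter-then-mean logic as avgweights, but on a data frame passed in."""
    # Keep only the rows for this country, then average the Weight column.
    return myDF[myDF['NOC'] == mycountry]['Weight'].mean()

usa_avg = avgweights_from_df(toyDF, 'USA')  # the missing weight is skipped by mean()
chn_avg = avgweights_from_df(toyDF, 'CHN')
print(usa_avg, chn_avg)
```

Passing the data frame as an argument is only a convenience for this sketch; the example above reads the file inside the function instead.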

Example 2:

Finding the percentages of school metro types in a given state.

def myschoolpercentages (mystate: str) -> pd.Series:
    """
    The myschoolpercentages function takes a string as input,
    specifying the state of the school in which we want to study the metro type,
    and returns a Pandas Series with the School Metro Types, and the percentage of schools of each type.

    Args:
    mystate (str): This is a string as input,
    specifying the state of the school in which we want to study the metro type

    Returns:
    mymetrotypes (pd.Series): This is a series of the School Metro Types, and
    the percentage of schools of each type.
    """
    myDF = pd.read_csv('/anvil/projects/tdm/data/donorschoose/Schools.csv')
    mymetrotypes = myDF[myDF['School State'] == mystate]['School Metro Type'].value_counts(normalize = True)
    return mymetrotypes

Example 3:

In the 1980 election data, finding the sum of the donations in a given state.

def mystatesum (mystate: str) -> float:
    """
    The mystatesum function takes a string as input,
    specifying the 2-character abbreviation of the state where we sum the dollar amounts of the donations,
    and returns a floating point number that specifies the total dollar amount donated in that state.

    Args:
    mystate (str): This is a string as input,
    specifying the 2-character abbreviation of the state where we sum the dollar amounts of the donations

    Returns:
    mydollars (float): This is a floating point number that specifies
    the total dollar amount donated in that state.
    """
    myDF = pd.read_csv("/anvil/projects/tdm/data/election/itcont1980.txt", header=None, sep='|')
    myDF.columns = ["CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID"]
    mydollars = myDF[myDF['STATE'] == mystate]['TRANSACTION_AMT'].sum()
    return mydollars
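
The election file has no header row and uses | as its separator, which is why the function passes header=None and sep='|' and then assigns the column names by hand. A minimal sketch of that same pattern, on a made-up three-column text (the values and the shortened column list are hypothetical):

```python
import io
import pandas as pd

# Hypothetical pipe-delimited text with no header row.
rawtext = "C001|IN|250\nC002|OH|100\nC003|IN|50\n"

# header=None tells Pandas that the first line is data, not column names.
toyDF = pd.read_csv(io.StringIO(rawtext), header=None, sep='|')

# Assign the column names ourselves, as in the example above.
toyDF.columns = ["CMTE_ID", "STATE", "TRANSACTION_AMT"]

# Sum the amounts for one state.
in_total = toyDF[toyDF['STATE'] == 'IN']['TRANSACTION_AMT'].sum()
print(in_total)
```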

Example 4:

Finding the average number of stars for a given author of reviews.

def myauthoravgstars (myauthor: str) -> float:
    """
    The myauthoravgstars function takes a string as input,
    specifying the author whose reviews we want to study,
    and returns a floating point number that specifies the average number of stars for that author's reviews.

    Args:
    myauthor (str): This is a string as input,
    specifying the author whose reviews we want to study

    Returns:
    myaveragereviews (float): This is a floating point number that specifies the
    average number of stars for that author's reviews.
    """
    myDF = pd.read_csv("/anvil/projects/tdm/data/icecream/combined/reviews.csv")
    myaveragereviews = myDF[myDF['author'] == myauthor]['stars'].mean()
    return myaveragereviews

Example 5:

Finding the percentages of values in each category (we will test this in Question 1).

def makeatable (myseries: pd.Series) -> pd.Series:
    """
    The makeatable function takes a Pandas Series as input,
    with any type of data in the Series (for instance, it might be a column of a data frame)
    and returns the percentages of the data of each type.

    Args:
    myseries (pd.Series): This is any type of Series that we want to study

    Returns:
    mypercentages (pd.Series): This is a Pandas Series with the percentages of the data of each type.
    """
    mypercentages = myseries.value_counts(normalize = True)
    return mypercentages

Questions

Question 1 (2 pts)

Consider this user-defined function, which makes a table that shows the percentages of values in each category:

def makeatable (myseries: pd.Series) -> pd.Series:
    """
    The makeatable function takes a Pandas Series as input,
    with any type of data in the Series (for instance, it might be a column of a data frame)
    and returns the percentages of the data of each type.

    Args:
    myseries (pd.Series): This is any type of Series that we want to study

    Returns:
    mypercentages (pd.Series): This is a Pandas Series with the percentages of the data of each type.
    """
    mypercentages = myseries.value_counts(normalize = True)
    return mypercentages

If we do something like this, with a column from a data frame:

makeatable(myDF['put my column name here'])

Then it is the same as getting the value_counts(normalize = True) for the series.

In other words, makeatable is a user-defined function that returns a series, with the value counts expressed as proportions of the total (multiply by 100 to get percentages).
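
As a quick check of what makeatable returns, here is a minimal sketch on a small, made-up Series (the values in toyseries are hypothetical). It suggests that value_counts(normalize = True) gives fractions that add up to 1, which can be multiplied by 100 if percentages are wanted.

```python
import pandas as pd

def makeatable(myseries: pd.Series) -> pd.Series:
    # Same body as the function defined above.
    return myseries.value_counts(normalize=True)

toyseries = pd.Series(['M', 'F', 'F', 'M', 'F'])  # hypothetical values

proportions = makeatable(toyseries)   # fraction of the rows in each category
percentages = proportions * 100       # the same table on a 0 to 100 scale
print(proportions)
print(percentages)
```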

Now consider the DeathRecords data set:

/anvil/projects/tdm/data/death_records/DeathRecords.csv

  1. Try the function makeatable on the Sex column of the DeathRecords.

  2. Also try the function makeatable on the MaritalStatus column of the DeathRecords.

Deliverables
  • Use the makeatable function to display a table of values from the Sex column of the DeathRecords.

  • Use the makeatable function to display a table of values from the MaritalStatus column of the DeathRecords.

Question 2 (2 pts)

Define a function called teenagecount as follows:

def teenagecount (myseries: pd.Series) -> int:
    """
    The teenagecount function takes a Pandas Series as input,
    with any type of data in the Series (for instance, it might be a column of a data frame)
    and returns the number of data points that are between 13 and 19 (inclusive).

    Args:
    myseries (pd.Series): This is any type of Series that we want to study

    Returns:
    mycount (int): This is an integer that shows the number of values between 13 and 19 (inclusive) in the data.
    """
    mycount = len(myseries[(myseries >= 13) & (myseries <= 19)])
    return mycount

  1. Try this function on the Age column of the DeathRecords file /anvil/projects/tdm/data/death_records/DeathRecords.csv

  2. Also try this function on the Age column of the file /anvil/projects/tdm/data/olympics/athlete_events.csv
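
Before running teenagecount on the full files, it may help to try it on a tiny Series where the answer can be counted by hand. This is a minimal sketch with made-up ages (toyages is hypothetical); it also illustrates that a missing value fails both comparisons, so it is not counted.

```python
import pandas as pd

def teenagecount(myseries: pd.Series) -> int:
    # Keep the values from 13 to 19 (both ends included) and count them.
    return len(myseries[(myseries >= 13) & (myseries <= 19)])

# Hypothetical ages, including the two boundary values and one missing value.
toyages = pd.Series([12, 13, 16, 19, 20, None, 45])

teens = teenagecount(toyages)
print(teens)
```

The parentheses around each comparison are needed because & binds more tightly than >= and <= in Python.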

Deliverables
  • Display the number of teenagers in the DeathRecords data.

  • Display the number of teenagers in the Olympics Athlete Events data.

Question 3 (2 pts)

The function max in Python can find the longest string in a series, if we use the option key=len, as described here:

Define the function:

def longesttest (mydata: pd.Series) -> str:
    """
    The longesttest function takes a Pandas Series as input,
    containing strings, and returns the longest string.

    Args:
    mydata (pd.Series): This is any type of Series that contains strings which we want to study

    Returns:
    mylongeststring (str): This is the longest of the strings in the Pandas Series
    """
    mylongeststring = max(mydata, key=len)
    return mylongeststring
  1. Use the function longesttest to find the longest review in the text column of the beer reviews data set /anvil/projects/tdm/data/beer/reviews_sample.csv

  2. Also use the function longesttest to find the longest name in the NAME column of the 1980 election data. Hint: For this part, Pandas does not recognize that the NAME column contains strings by default, so you might try this way, to force Pandas to treat the NAME column as strings: longesttest(myDF['NAME'].astype(str))
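
One way to see why the hint uses astype(str): max with key=len calls len on every entry, and an entry that is not a string (a number, for instance) has no length. The sketch below uses a made-up Series (toynames is hypothetical) that mixes strings with a number to reproduce that situation on a small scale.

```python
import pandas as pd

# Hypothetical names; the middle entry is a number, not a string.
toynames = pd.Series(['SMITH, JOHN', 12345, 'LEE, A'])

# Without conversion, len() fails on the non-string entry.
try:
    max(toynames, key=len)
    raised = False
except TypeError:
    raised = True

# After converting every entry to a string, max with key=len works.
longest = max(toynames.astype(str), key=len)
print(raised, longest)
```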

Deliverables
  • Print the longest review in the text column of the beer reviews data set /anvil/projects/tdm/data/beer/reviews_sample.csv

  • Print the longest name in the NAME column of the 1980 election data.

Question 4 (2 pts)

  1. Create your own function called mostpopulardate that finds the most popular date in a column of dates, as well as the number of times that date occurs. Hint: There are multiple methods available here for doing this: www.tutorialspoint.com/how-to-display-most-frequent-value-in-a-pandas-series

  2. Test your function mostpopulardate on the date column of the beer reviews data /anvil/projects/tdm/data/beer/reviews_sample.csv

  3. Also test your function mostpopulardate on the TRANSACTION_DT column of the 1980 election data.

Deliverables
  • a. Define your function called mostpopulardate

  • b. Use your function mostpopulardate to find the most popular date in the beer reviews data /anvil/projects/tdm/data/beer/reviews_sample.csv

  • c. Also use your function mostpopulardate to find the most popular transaction date from the 1980 election data.

Question 5 (2 pts)

Define a function called myaveragedelay that takes a 3-letter string (corresponding to an airport code) and finds the average departure delay (after removing the NA values) for flights departing from that airport, using the DepDelay column of the 1990 flight data /anvil/projects/tdm/data/flights/subset/1990.csv

Try your function on the Indianapolis "IND" flights. In other words, myaveragedelay("IND") should print 5.9697722567287785 because the flights with Origin airport "IND" have an average departure delay of about 5.97 minutes.

Try your function on the New York City "JFK" flights. In other words, myaveragedelay("JFK") should print 11.857274106360704 because the flights with Origin airport "JFK" have an average departure delay of about 11.86 minutes.
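
On the phrase "after removing the NA values": the sketch below, on a made-up Series of delays (toydelays is hypothetical), suggests that the Pandas mean method already skips missing values by default, so calling dropna first gives the same number. It is not a solution to this question, only an illustration of how missing values affect an average.

```python
import pandas as pd

# Hypothetical delays in minutes, with one missing value.
toydelays = pd.Series([10.0, None, -2.0, 4.0])

avg_default = toydelays.mean()              # missing values are skipped by default
avg_dropna = toydelays.dropna().mean()      # remove them explicitly, then average
avg_keepna = toydelays.mean(skipna=False)   # keeping them makes the result NaN

print(avg_default, avg_dropna, avg_keepna)
```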

Deliverables
  • a. Define your function called myaveragedelay

  • b. Use myaveragedelay("IND") to print the average departure delays for flights with Origin airport "IND".

  • c. Use myaveragedelay("JFK") to print the average departure delays for flights with Origin airport "JFK".

Submitting your Work

Please make sure that you added comments for each question, which explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generative AI, etc.) are cited properly in the project template.

If you have any questions or issues regarding this project, please feel free to ask in seminar, over Piazza, or during office hours.

Prior to submitting your work, you need to put your work into the project template, re-run all of the code in your Jupyter notebook, and make sure that the results of running that code are visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.

Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points.

Items to submit
  • firstname_lastname_project8.ipynb

It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used.

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not.

Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.