TDM 10200: Python Project 8 — Spring 2025
Motivation: It is helpful to learn how to define our own functions in Python.
Context: In this project, we learn how to design functions and use them.
Scope: Of course, we start with the basics and learn several one-line functions, with lots of examples.
Dataset(s)
This project will use the following datasets:
- /anvil/projects/tdm/data/death_records/DeathRecords.csv
- /anvil/projects/tdm/data/beer/reviews_sample.csv
- /anvil/projects/tdm/data/election/itcont1980.txt
- /anvil/projects/tdm/data/flights/subset/1990.csv
- /anvil/projects/tdm/data/olympics/athlete_events.csv
Example 1:
Finding the average weight of Olympic athletes in a given country.
import pandas as pd

def avgweights(mycountry: str) -> float:
    """
    The avgweights function takes a 3-character string as input, representing the NOC code for a country, and returns the average weight of Olympic athletes in that country.
    Args:
        mycountry (str): This is a 3-character string as input, representing the NOC code for a country
    Returns:
        countryweight (float): This is the average weight of Olympic athletes in that country. It is a decimal number, which is called a float in Python.
    """
    # Read the data, keep only the rows for the requested country, and average the weights
    myDF = pd.read_csv('/anvil/projects/tdm/data/olympics/athlete_events.csv')
    countryweight = myDF[myDF['NOC'] == mycountry]['Weight'].mean()
    return countryweight
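To see the filter-then-average pattern without reading the full file, here is a minimal sketch on a made-up data frame. The function name avgweights_toy, its second parameter, and all of the values are hypothetical; only the column names NOC and Weight come from the example above.

```python
import pandas as pd

# Hypothetical miniature of athlete_events.csv: only the NOC and Weight
# columns are reproduced, and every value here is made up.
toyDF = pd.DataFrame({
    'NOC': ['USA', 'USA', 'CAN', 'CAN', 'CAN'],
    'Weight': [80.0, 70.0, 60.0, None, 66.0],
})

def avgweights_toy(mycountry: str, myDF: pd.DataFrame) -> float:
    """Same filter-then-mean logic as avgweights, applied to a data frame passed in."""
    return myDF[myDF['NOC'] == mycountry]['Weight'].mean()

usa_avg = avgweights_toy('USA', toyDF)  # mean of 80.0 and 70.0
can_avg = avgweights_toy('CAN', toyDF)  # the missing weight is skipped by mean()
print(usa_avg, can_avg)
```

Passing the data frame as an argument is only a convenience for this sketch, so that the logic can be tried on a small table.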
Example 2:
Finding the percentages of school metro types in a given state.
def myschoolpercentages(mystate: str) -> pd.Series:
    """
    The myschoolpercentages function takes a string as input,
    specifying the state of the school in which we want to study the metro type,
    and returns a Pandas Series with the School Metro Types, and the percentage of schools of each type.
    Args:
        mystate (str): This is a string as input,
            specifying the state of the school in which we want to study the metro type
    Returns:
        mymetrotypes (pd.Series): This is a series of the School Metro Types, and
            the percentage of schools of each type.
    """
    myDF = pd.read_csv('/anvil/projects/tdm/data/donorschoose/Schools.csv')
    mymetrotypes = myDF[myDF['School State'] == mystate]['School Metro Type'].value_counts(normalize = True)
    return mymetrotypes
Example 3:
In the 1980 election data, finding the sum of the donations in a given state.
def mystatesum(mystate: str) -> float:
    """
    The mystatesum function takes a string as input,
    specifying the 2-character abbreviation of the state where we sum the dollar amounts of the donations,
    and returns a floating point number that specifies the total dollar amount donated in that state.
    Args:
        mystate (str): This is a string as input,
            specifying the 2-character abbreviation of the state where we sum the dollar amounts of the donations
    Returns:
        mydollars (float): This is a floating point number that specifies
            the total dollar amount donated in that state.
    """
    # The file has no header row and uses '|' as the separator, so we name the columns ourselves
    myDF = pd.read_csv("/anvil/projects/tdm/data/election/itcont1980.txt", header=None, sep='|')
    myDF.columns = ["CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID"]
    mydollars = myDF[myDF['STATE'] == mystate]['TRANSACTION_AMT'].sum()
    return mydollars
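The two read_csv options used above (header=None and sep='|') can be tried on a small scale. This is a minimal sketch that assumes a made-up, three-column stand-in for the election file; the text in rawtext and its values are hypothetical, and only the column names are borrowed from the real list.

```python
import io
import pandas as pd

# Made-up, pipe-delimited text with no header row, standing in for itcont1980.txt.
# Only three columns are used here instead of the real 21.
rawtext = "C001|IN|250\nC002|IN|100\nC003|OH|75\n"

# header=None tells Pandas that the first line is data, not column names;
# sep='|' tells Pandas that the fields are separated by the pipe character.
toyDF = pd.read_csv(io.StringIO(rawtext), header=None, sep='|')
toyDF.columns = ['CMTE_ID', 'STATE', 'TRANSACTION_AMT']

# Same filter-then-sum step as in mystatesum
in_total = toyDF[toyDF['STATE'] == 'IN']['TRANSACTION_AMT'].sum()
print(in_total)
```

Without header=None, Pandas would treat the first donation as the column names, and that row would be missing from the sum.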
Example 4:
Finding the average number of stars for a given author of reviews.
def myauthoravgstars(myauthor: str) -> float:
    """
    The myauthoravgstars function takes a string as input,
    specifying the author whose reviews we want to study,
    and returns a floating point number that specifies the average number of stars for that author's reviews.
    Args:
        myauthor (str): This is a string as input,
            specifying the author whose reviews we want to study.
    Returns:
        myaveragereviews (float): This is a floating point number that specifies the
            average number of stars for that author's reviews.
    """
    myDF = pd.read_csv("/anvil/projects/tdm/data/icecream/combined/reviews.csv")
    myaveragereviews = myDF[myDF['author'] == myauthor]['stars'].mean()
    return myaveragereviews
Example 5:
Finding the percentages of values in each category (we will test this in Question 1).
def makeatable(myseries: pd.Series) -> pd.Series:
    """
    The makeatable function takes a Pandas Series as input,
    with any type of data in the Series (for instance, it might be a column of a data frame)
    and returns the percentages of the data of each type.
    Args:
        myseries (pd.Series): This is any type of Series that we want to study.
    Returns:
        mypercentages (pd.Series): This is a Pandas Series with the percentages of the data of each type.
    """
    mypercentages = myseries.value_counts(normalize = True)
    return mypercentages
Questions
Question 1 (2 pts)
Consider this user-defined function, which makes a table that shows the percentages of values in each category:
def makeatable(myseries: pd.Series) -> pd.Series:
    """
    The makeatable function takes a Pandas Series as input,
    with any type of data in the Series (for instance, it might be a column of a data frame)
    and returns the percentages of the data of each type.
    Args:
        myseries (pd.Series): This is any type of Series that we want to study.
    Returns:
        mypercentages (pd.Series): This is a Pandas Series with the percentages of the data of each type.
    """
    mypercentages = myseries.value_counts(normalize = True)
    return mypercentages
If we do something like this, with a column from a data frame:
makeatable(myDF['put my column name here'])
then it is the same as getting the value_counts(normalize = True) for the series.
In other words, makeatable is a user-defined function that returns a series, with the value counts expressed as proportions of the total (for instance, 0.25 means 25 percent).
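To see this equivalence on a small scale, here is a minimal sketch on a made-up Series (the values in toyseries are hypothetical and are not taken from the DeathRecords data). Note that normalize=True returns fractions of the total, so we multiply by 100 if we want numbers that read as percentages.

```python
import pandas as pd

def makeatable(myseries: pd.Series) -> pd.Series:
    """Return the share of the values that fall in each category."""
    return myseries.value_counts(normalize=True)

# Hypothetical data: three 'M' values and one 'F' value
toyseries = pd.Series(['M', 'F', 'M', 'M'])

mytable = makeatable(toyseries)   # fractions of the total, which add up to 1
mypercents = mytable * 100        # the same table, scaled to percentages
print(mytable)
print(mypercents)
```

Calling makeatable(toyseries) and calling toyseries.value_counts(normalize=True) directly give the same Series.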
Now consider the DeathRecords data set:
/anvil/projects/tdm/data/death_records/DeathRecords.csv
- Try the function makeatable on the Sex column of the DeathRecords.
- Also try the function makeatable on the MaritalStatus column of the DeathRecords.

- Use the makeatable function to display a table of values from the Sex column of the DeathRecords.
- Use the makeatable function to display a table of values from the MaritalStatus column of the DeathRecords.
Question 2 (2 pts)
Define a function called teenagecount as follows:
def teenagecount(myseries: pd.Series) -> int:
    """
    The teenagecount function takes a Pandas Series as input,
    with any type of data in the Series (for instance, it might be a column of a data frame)
    and returns the number of data points that are between 13 and 19 (inclusive).
    Args:
        myseries (pd.Series): This is any type of Series that we want to study.
    Returns:
        teenagecount (int): This is an integer that shows the number of values between 13 and 19 in the data.
    """
    teenagecount = len(myseries[(myseries >= 13) & (myseries <= 19)])
    return teenagecount
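Before running teenagecount on a large file, it can help to check the boundaries on a few hand-picked values. This is a minimal sketch on a made-up Series (the ages in toyages are hypothetical); the function body is a compact copy of the one above.

```python
import pandas as pd

def teenagecount(myseries: pd.Series) -> int:
    """Count the values between 13 and 19 (inclusive)."""
    # The parentheses matter: & binds more tightly than >= and <=
    return len(myseries[(myseries >= 13) & (myseries <= 19)])

# Hypothetical ages, chosen to sit on and around the boundaries,
# plus one missing value
toyages = pd.Series([12, 13, 16, 19, 20, None])

mycount = teenagecount(toyages)  # 13, 16, and 19 are counted; 12, 20, and the missing value are not
print(mycount)
```

Comparisons with a missing value are False, so missing ages are left out of the count.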
- Try this function on the Age column of the DeathRecords file /anvil/projects/tdm/data/death_records/DeathRecords.csv
- Also try this function on the Age column of the file /anvil/projects/tdm/data/olympics/athlete_events.csv

- Display the number of teenagers in the DeathRecords data.
- Display the number of teenagers in the Olympics Athlete Events data.
Question 3 (2 pts)
The function max in Python can find the longest string in a series, if we use the option key=len.
Define the function:
def longesttest(mydata: pd.Series) -> str:
    """
    The longesttest function takes a Pandas Series as input,
    containing strings, and returns the longest string.
    Args:
        mydata (pd.Series): This is any type of Series that contains strings which we want to study.
    Returns:
        mylongeststring (str): This is the longest of the strings in the Pandas Series
    """
    # key=len makes max compare the strings by their lengths
    mylongeststring = max(mydata, key=len)
    return mylongeststring
- Use the function longesttest to find the longest review in the text column of the beer reviews data set /anvil/projects/tdm/data/beer/reviews_sample.csv
- Also use the function longesttest to find the longest name in the NAME column of the 1980 election data. Hint: On part (b), Pandas does not recognize that the NAME column contains strings by default, so you might try this way, to force Pandas to treat the NAME column as strings: longesttest(myDF['NAME'].astype(str))

- Print the longest review in the text column of the beer reviews data set /anvil/projects/tdm/data/beer/reviews_sample.csv
- Print the longest name in the NAME column of the 1980 election data.
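To illustrate why the astype(str) hint can be needed, here is a minimal sketch on a made-up Series. It assumes that the column holds at least one value that is not a string (here, a number); the names and the number are hypothetical, and the actual contents of the NAME column may differ.

```python
import pandas as pd

# Hypothetical column in which one entry is a number instead of a string
toynames = pd.Series(['SMITH, JOHN', 12345, 'DOE, JANE QUINN'], dtype=object)

# len() is not defined for a number, so max with key=len raises a TypeError
try:
    max(toynames, key=len)
    failed = False
except TypeError:
    failed = True

# After converting every entry to a string, len() works on all of them
longest = max(toynames.astype(str), key=len)
print(failed, longest)
```

If two strings tie for the longest length, max returns the first one that it encounters.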
Question 4 (2 pts)
- Create your own function called mostpopulardate that finds the most popular date in a column of dates, as well as the number of times that date occurs. Hint: There are multiple methods available here for doing this: www.tutorialspoint.com/how-to-display-most-frequent-value-in-a-pandas-series
- Test your function mostpopulardate on the date column of the beer reviews data /anvil/projects/tdm/data/beer/reviews_sample.csv
- Also test your function mostpopulardate on the TRANSACTION_DT column of the 1980 election data.

- a. Define your function called mostpopulardate
- b. Use your function mostpopulardate to find the most popular date in the beer reviews data /anvil/projects/tdm/data/beer/reviews_sample.csv
- c. Also use your function mostpopulardate to find the most popular transaction date from the 1980 election data.
Question 5 (2 pts)
Define a function called myaveragedelay that takes a 3-letter string (corresponding to an airport code) and finds the average departure delay (after removing the NA values) from the DepDelay column of the 1990 flight data /anvil/projects/tdm/data/flights/subset/1990.csv for flights departing from that airport.
Try your function on the Indianapolis "IND" flights. In other words, myaveragedelay("IND") should print 5.9697722567287785 because the flights with Origin airport "IND" have an average departure delay of about 5.97 minutes.
Try your function on the New York City "JFK" flights. In other words, myaveragedelay("JFK") should print 11.857274106360704 because the flights with Origin airport "JFK" have an average departure delay of about 11.86 minutes.

- a. Define your function called myaveragedelay
- b. Use myaveragedelay("IND") to print the average departure delay for flights with Origin airport "IND".
- c. Use myaveragedelay("JFK") to print the average departure delay for flights with Origin airport "JFK".
Submitting your Work
Please make sure that you added comments for each question, which explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generative AI, etc.) are cited properly in the project template.
If you have any questions or issues regarding this project, please feel free to ask in seminar, over Piazza, or during office hours.
Prior to submitting your work, you need to put your work into the project template, re-run all of the code in your Jupyter notebook, and make sure that the results of running that code are visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.
Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points.
- firstname_lastname_project8.ipynb
It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used. Please take the time to double check your work. See here for instructions on how to double check this. You will not receive full credit if your |