TDM 10200: Python Project 2 — Spring 2025

Motivation: We get more comfortable indexing and grouping values in Python. It is helpful to know that, when we ask for the value counts of a column in a Pandas data frame, we get the results in sorted order, without even having to sort the results ourselves!

Context: As a practice counting the number of data values using Python, it is helpful to pay attention to missing values.

Scope: Python, value counts, missing values

Learning Objectives:

Extract data in Python that meets certain conditions
Combine values from multiple columns of a Pandas data frame
Combine multiple conditions
Handle missing data

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following datasets:

/anvil/projects/tdm/data/olympics/athlete_events.csv
/anvil/projects/tdm/data/election/itcont1980.txt

Questions

Question 1 (2 pts)

Create a Pandas data frame called myDF from the file of Olympics athlete events data:

/anvil/projects/tdm/data/olympics/athlete_events.csv

Using the methods that you learned in Project 1, Questions 3, 4, 5, find the number of times that each of the years appears (in the Year column in myDF). As in Project 1, value_counts() should be helpful for this purpose.

Make a plot of the number of times that each year appears.

As a clarification, for this question, it suffices to only look at the Year column. You do not need to look at the athletes themselves.

Each line of the data frame is for one athlete in one year. So if you just want to know how many athletes compete per year, you can just look at the number of times that the Year appears in the Year column itself. (By the way, if an athlete appears multiple times, that is OK too. You only need to look at the Year column for this question.)

FYI, the results from value_counts() are called a Series. In a Series, we have values (in this case, the number of times that each year appears), and also an index (in this case, the years are the index).

Deliverables

The results of value_counts() showing the number of athletes participating in the Olympics during each year.
A plot showing the number of athletes participating in the Olympics during each year.
As always, be sure to document your work from Question 1 (and from all of the questions!), using some comments and insights about your work. We will stop adding this note to document your work, but please remember, we always assume that you will document every single question with your comments and your insights.

Please remember, we always assume that you will document every single question with your comments and your insights.

Question 2 (2 pts)

Again studying the Olympics athletes events from the same file:

/anvil/projects/tdm/data/olympics/athlete_events.csv

we can practice using the value_counts() a little more.

Which value appears in the "NOC" column the most times?

Which value appears in the "Name" column the most times?

When we use value_counts() it is helpful to know that this is called a "method" for "Series". A "method" is like a function but it is used with a particular class.

Deliverables

The value that appears in the "NOC" column the most times.
The value that appears in the "Name" column the most times.

Question 3 (2 pts)

Python starts numbering its rows from 0, which R starts numbering its rows from 1.

If you look at the head of myDF, notice that the third row (which is row 2) is from team "Denmark", while the fourth row (which is row 3) is from team "Denmark/Sweden".

How many rows correspond exactly to team "Denmark"?

How many rows have "Denmark" in the team name ("Denmark" may or may not be the exact team name)? Hint: You can use str.contains to find rows that contain the word that you are searching for in a column. Another hint: You can use the shape function to get the number of rows and columns in a Pandas data frame. The number of rows is shape[0] and the number of columns is shape[1]. You can also use len to get the number of rows in a Pandas data frame.

Find the names of the teams that have "Denmark" in the team name but are not exactly "Denmark". Hint: We use == to test whether two things are equal, and we use != to test whether two things are not equal. Another hint: There are exactly 6 such team names (which contain "Denmark" in the team name but are not exactly "Denmark"). We want you to print the 6 team names, please! (These 6 team names occur on 72 rows altogether.)

Deliverables

The number of rows corresponding exactly to team "Denmark".
The number of rows with "Denmark" as part of the team name.
The names of the 6 teams that have "Denmark" included but are not exactly "Denmark".

Question 4 (2 pts)

Not all data comes in a comma-delimited format, i.e., with commas in between the pieces of data.

The election data set stored at:

/anvil/projects/tdm/data/election/itcont1980.txt

is not a comma-delimited file. Instead, the data set of donations from the 1980 federal election campaigns is a plain text file, which happens to have the symbol "|" between pieces of data in the file.

Comma-delimited files usually contain comma-separated values and therefore usually end with the extension .csv (of course, just because a file ends in .csv it might not have comma-separated values, but it usually does!).

Text files (usually ending in .txt) can have any type of separator between the pieces of data in the file. In the election data, the symbol "|" occurs between pieces of data in the file.

We can tell Python that there is a "|" symbol between the pieces of data by writing sep='|' when we use read_csv to import the data.

The election data set looks like this:

C00078279|A|M11|P|80031492155|22Y||MCKENNON, K R|MIDLAND|MI|00000|||10031979|400|||||CONTRIBUTION REF TO INDIVIDUAL|3062020110011466469
C00078279|A|M11||79031415137|15||OREFFICE, P|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|1500||||||3061920110000382948
C00078279|A|M11||79031415137|15||DOWNEY, J|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|300||||||3061920110000382949
C00078279|A|M11||79031415137|15||BLAIR, E|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|1000||||||3061920110000382950
C00078287|A|Q1||79031231889|15||BLANCHARD, JOHN A|CHICAGO|IL|60685|||03201979|200||||||3061920110000383914
C00078287|A|Q1||79031231889|15||CRAMER, JOHN H|CHICAGO|IL|60685|||02281979|200||||||3061920110000383915
C00078287|A|Q1||79031231889|15||MCHUGH, KEVIN|CHICAGO|IL|60685|||03051979|200||||||3061920110000383916
C00078287|A|Q1||79031231889|15||NOHA, EDWARD J|CHICAGO|IL|60685|||03121979|300||||||3061920110000383917
C00078287|A|Q1||79031231889|15||RYCROFT, DONALD C|CHICAGO|IL|60685|||03191979|200||||||3061920110000383918
C00078287|A|Q1||79031231889|15||VANDERSLICE, WILLIAM D|CHICAGO|IL|60685|||02271979|200||||||3061920110000383919

The election data set does not have a header row inside the file! When we use read_csv, we need to indicate that there is no header row, by using header=None.

Because this election data set does not have the names of the columns built in, we need to specify the names of the columns.

You can use the following to read in the data and name the columns properly:

myDF = pd.read_csv("/anvil/projects/tdm/data/election/itcont1980.txt", header=None, sep='|')
myDF.columns = ["CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID"]

Now that you have the data read into the data frame myDF, here are two questions to get familiar with the data:

Which value appears in the "STATE" column the most times?

Which value appears in the "NAME" column the most times?

You should be very familiar with the value_counts() method by now! You can use value_counts() to solve both of these questions.

Deliverables

The value that appears in the "STATE" column the most times.
The value that appears in the "NAME" column the most times.

Question 5 (2 pts)

Again considering the data set about the 1980 federal election campaigns:

Use pd.concat with axis=1 to make one series that joins together (entry by entry) the "CITY" and "STATE" columns.

Now determine the top 5 city-and-state locations where donations were made.

Also determine the number of city-and-state locations that are both empty.

We can use dropna = False inside the value_counts() method to see the values that are missing.

We can also adjust the number of items in the head() of a Pandas data frame.

We did this for you in the videos with the airport data. Now we want you to try the same thing with the 1980 federal election campaign data!

Deliverables

The top 5 city-and-state locations where donations were made in the 1980 federal election campaign data.
The number of city-and-state locations in the 1980 federal election campaign data that are both empty.

Submitting your Work

Please make sure that you added comments for each question, which explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template.

If you have any questions or issues regarding this project, please feel free to ask in seminar, over Piazza, or during office hours.

Prior to submitting your work, you need to put your work into the project template, and re-run all of the code in your Jupyter notebook and make sure that the results of running that code is visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.

Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points.

Items to submit

firstname_lastname_project2.ipynb

It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used. Please make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template.

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not.

Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.