Functions with the Election Dataset
This example is from TDM 102 Project 5 Spring 2024.
These example(s) depend on the database:
-
/anvil/projects/tdm/data/election/
Source
The data about campaign donations at the federal level is downloadable from:
under the section on "Contributions by individuals"
Description of the Data
The names, addresses, and amounts of money donations by individuals is given, during each 2-year election campaign cycle.
When you first read the project, you might think, WOW!, this project is long, but it isn’t actually very long at all. Dr Ward just wanted to make sure that students understand the way to think about functions, namely: You need to first make sure that you understand what to do, and then check it a few times, and then (afterwards) write the function, and (finally) verify the function. [It is usually too hard to write the function without (first) understanding the work itself.] Dr Ward broke all of this down into steps, and the sentences look long, but don’t be scared. Dr Ward is just breaking things down into little steps for you, so that the work is easier to do. |
Functions should have simple inputs and should create helpful outputs. That way, a person can use the function to get good things done, without having to remember the details. For the election data, it is hard to remember where the files are located, and what the column names should be, etc. This question will create a useful function that only requires the user to start with 1 year as input, and it returns a dataframe as the output. That dataframe contains all of the election data for that year.
Data Processing Variable Saving
(1a) First (without a function!) start with a variable called myyear
. It can be an election year, and it helps if you try this with a few of the early 1980’s elections years, so that they data is not-too-big. For instance, try myyear
with each of the years 1980, 1984, and 1988 (one at a time, not a list). Once you set the value of myyear
, then make a variable for the path to the data from that year, for instance, /anvil/projects/tdm/data/election/itcont1984.txt
import pandas a pd
myyear = 1984 # You are able to change this yourself
file = f'/anvil/projects/tdm/data/election/itcont{myyear}.txt'
print(file)
/anvil/projects/tdm/data/election/itcont1980.txt
Data Import and Formatting
(1b) Read in the data from the year stored in myyear
. Please note that the data does not have column headers built-in, so you need to add the column headers, as in the "tip" below. Please also note that the elements in the data set has |
between the data elements, not commas (this is why the file name ends in .txt
not .csv
in this case). So you need to use the parameter delimiter='|'
in your read_csv
function.
myyear = 1984 # Change this to the year you want
file = f'/anvil/projects/tdm/data/election/itcont{myyear}.txt'
mycolumnnames=["CMTE_ID","AMNDT_IND","RPT_TP","TRANSACTION_PGI","IMAGE_NUM","TRANSACTION_TP","ENTITY_TP","NAME","CITY","STATE","ZIP_CODE","EMPLOYER","OCCUPATION","TRANSACTION_DT","TRANSACTION_AMT","OTHER_ID","TRAN_ID","FILE_NUM","MEMO_CD","MEMO_TEXT","SUB_ID"]
mydictionarytypes = {"CMTE_ID": str, "AMNDT_IND": str, "RPT_TP": str, "TRANSACTION_PGI": str, "IMAGE_NUM": str, "TRANSACTION_TP": str, "ENTITY_TP": str, "NAME": str, "CITY": str, "STATE": str, "ZIP_CODE": str, "EMPLOYER": str, "OCCUPATION": str, "TRANSACTION_DT": str, "TRANSACTION_AMT": float, "OTHER_ID": str, "TRAN_ID": str, "FILE_NUM": str, "MEMO_CD": str, "MEMO_TEXT": str, "SUB_ID": int}
myDF = pd.read_csv("/anvil/projects/tdm/data/election/itcont1980.txt", delimiter='|', names=mycolumnnames, dtype=mydictionarytypes)
Data Inspection and Validation
(1c) Look at the head of the data set, after you read it in. Does it look correct? Does everything make sense? Try this for each of the years, 1980, then 1984, then 1988, one at a time, storing each of these values into the variable myyear
and then read in the data for that year, and check the head of the data each time. This is good practice for making sure that your work is designed properly.
myDF.head()
CMTE_ID | NAME | CITY | STATE | TRANSACTION_DT | TRANSACTION_AMT |
---|---|---|---|---|---|
C00078279 |
MCKENNON, K R |
MIDLAND |
MI |
10031979 |
400.0 |
C00078279 |
OREFFICE, P |
MIDLAND |
MI |
10261979 |
1500.0 |
C00078279 |
DOWNEY, J |
MIDLAND |
MI |
10261979 |
300.0 |
C00078279 |
BLAIR, E |
MIDLAND |
MI |
10261979 |
1000.0 |
C00078287 |
BLANCHARD, JOHN A |
CHICAGO |
IL |
03201979 |
200.0 |
Data Loading Function Creation
(1d) Now that you are sure that your work is OK, make a function called read_election_year
that takes one parameter called myyear
as the input, and returns a data frame that contains the data from that election year. Make sure to document your function with a docstring to explain how it works.
import pandas as pd
def read_election_year(myyear):
# We are constructing the file path for the year
file = f'/anvil/projects/tdm/data/election/itcont{myyear}.txt'
# Define column names
mycolumnnames = [
"CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM",
"TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE",
"EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT",
"OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID"]
# Specify data types
mydictionarytypes = {
"CMTE_ID": str, "AMNDT_IND": str, "RPT_TP": str, "TRANSACTION_PGI": str,
"IMAGE_NUM": str, "TRANSACTION_TP": str, "ENTITY_TP": str, "NAME": str,
"CITY": str, "STATE": str, "ZIP_CODE": str, "EMPLOYER": str,
"OCCUPATION": str, "TRANSACTION_DT": str, "TRANSACTION_AMT": float,
"OTHER_ID": str, "TRAN_ID": str, "FILE_NUM": str, "MEMO_CD": str,
"MEMO_TEXT": str, "SUB_ID": int}
# Read the data
myDF = pd.read_csv(file, delimiter='|', names=mycolumnnames, dtype=mydictionarytypes)
return myDF
df_1984 = read_election_year(1984)
It might be helpful to have 2 cores for this project. You might be able to do it with 1 core, but it is probably easier for you with 2 cores when using Anvil. |
Unique Committee Count
(2a) First (without a function!) start with a variable called myyear
, such as 1980, and find the number of (unique) committees that appear in the CMTE_ID
column in that year. Then do the same for the year 1984, and then do this again for 1988. Print your results for each of these three years in separate cells.
df_1980 = read_election_year(1980)
unique_committees = df_1980["CMTE_ID"].nunique()
print(unique_committees)
3856
Unique Committee Count Function
(2b) Now that you have part 2a working well, put your work from question 2a into a function. Namely, create a function called committees_function
that accepts a year as input, and returns the number of (unique) committees that appear in the CMTE_ID
column in that year. Use the function designed in Question 1 to help you accomplish this work.
def committees_function(myyear):
myDF = read_election_year(myyear)
unique_committees = myDF["CMTE_ID"].nunique()
return unique_committees
number_of_committees = committees_function(1984)
print(number_of_committees)
3733
Function Testing and Validation
(2c) Test your function for each of the years 1980, 1984, and 1988. How many (unique) committees appear in each of these 3 individual years? The output from this question should show, for each year, how many (unique) committees appear in the data for each of those 3 years. The output for each of these 3 years should agree with your output from question 2a.
number_of_committees_1980 = committees_function(year_1980)
number_of_committees_1984 = committees_function(year_1984)
number_of_committees_1988 = committees_function(year_1988)
print(number_of_committees_1980)
print(number_of_committees_1984)
print(number_of_committees_1988)
3856 3733 3995
State Transaction Amount Analysis
The goal of this question is to find the top 5 states in a given year, according to the total (sum) of the values in the TRANSACTION_AMT
column.
(3a) First (without a function!) start with a variable called myyear
, such as 1980, and find the total (sum) of the values from the TRANSACTION_AMT
column for each state in the data set. You only need to print the top 5 results (i.e., the top 5 states and the total of the transaction amounts from those states) for 1980. Then do this again for 1984, and then do this again for 1988.
myyear = 1980 # Able to change the year
df_1980 = read_election_year(myyear)
df_1980['TRANSACTION_AMT'] = df_1980['TRANSACTION_AMT'].astype(float)
state_totals = df_1980.groupby('STATE')['TRANSACTION_AMT'].sum()
top_states = state_totals.sort_values(ascending=False).head(5)
print(top_states)
Top State Transactions Function
(3b) Now that you have your work from Question 3a working well, build a function called top_five_states
. This function should take 1 year as input, and should return the top 5 states and the total (sum) of the values for each of the 5 states, from the TRANSACTION_AMT
column (for that state).
def top_five_states(year):
df_year = read_election_year(year)
df_year['TRANSACTION_AMT'] = df_year['TRANSACTION_AMT'].astype(float)
state_totals = df_year.groupby('STATE')['TRANSACTION_AMT'].sum()
top_states = state_totals.sort_values(ascending=False).head(5)
return top_states
top_five_states(1980)
Top Employer Transaction Analysis
The goal of this question is to identify the top 5 employers, according to the total (sum) of the values from the TRANSACTION_AMT
column for each employer.
(4a) First find the top 5 employers in each year 1980, 1984, and 1988, and print the top 5 for each of those years. Do this before you make a function.
df = read_election_year(1980)
df['TRANSACTION_AMT']=df['TRANSACTION_AMT'].astype(float)
df = df.dropna(subset=['EMPLOYER'])
emp_total = df.groupby('EMPLOYER')['TRANSACTION_AMT'].sum()
sorted(emp_total.items(), key=lambda x: x[1], reverse=True)[:5]
Top Employers Function Creation
(4b) Once that is working, then build a function called top_employers
that returns the top 5 employers in each year 1980, 1984, and 1988. Your results from question 4b should agree with your results from question 4a.
def top_employers():
years = [1980, 1984, 1988]
results = {}
for year in years:
df = read_election_year(year)
df['TRANSACTION_AMT'] = df['TRANSACTION_AMT'].astype(float)
df = df.dropna(subset=['EMPLOYER'])
emp_total = df.groupby('EMPLOYER')['TRANSACTION_AMT'].sum()
top_5 = sorted(emp_total.items(), key=lambda x: x[1], reverse=True)[:5]
results[year] = top_5
return results
top_employers_results = top_employers()
top_employers_results
{1980: [('RETIRED', 8565354.0), ('ATTORNEY', 5689870.0), ('HOUSEWIFE', 5457943.0), ('HOMEMAKER', 4649872.0), ('HOUSE WIFE', 4549216.0)], 1984: [('SELF-EMPLOYED', 19761941.0), ('RETIRED', 9257779.0), ('HOUSEWIFE', 6313918.0), ('SELF EMPLOYED', 3869043.0), ('HOMEMAKER', 3372784.0)], 1988: [('HOUSEWIFE', 14626839.0), ('HOMEMAKER', 9834325.0), ('ATTORNEY', 9488872.0), ('RETIRED', 5865073.0), ('PHYSICIAN', 3104940.0)]}