TDM 10200: Python Project 12 — Spring 2025
Motivation: Some files are too big to read into Python all at once. We practice methods for extracting only the portions of a file that we need.
Context: We can use the "usecols" option and the "nrows" option of the pd.read_csv function. These allow us to read a portion of a file (a small number of rows and/or a small number of columns).
Scope: We learn how to work with enormous files in Python when we only need a portion of the file in our analysis.
Dataset(s)
This project will use the following datasets:
- /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales.csv (Iowa liquor sales)
- /anvil/projects/tdm/data/election/itcont2020.txt (2020 election data)
- /anvil/projects/tdm/data/movies_and_tv/imdb2024/basics.tsv (Internet Movie Database (IMDB))
It should be enough to use 6 cores in your Jupyter Lab session for this project.
Questions
Question 1 (2 pts)
First load import pandas as pd so that you have the pd.read_csv function available, and also run pd.set_option('display.max_columns', 50) so that you can see 50 columns.
If you try to read the entire file /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales.csv, then the seminar kernel will crash. Instead, use read_csv with the option nrows=1000 (in part a) so that you see (only) the first 1000 rows of the data frame. (Do not worry about the memory warning that appears in pink.)
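A minimal sketch of this setup and first read (using the path given above) might look like the following:
import pandas as pd
pd.set_option('display.max_columns', 50)
# read only the first 1000 rows, so that the kernel does not crash
myDF = pd.read_csv("/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales.csv", nrows=1000)
# inspect the column names, to see where each of the five columns sits
print(myDF.columns)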
- Which columns contain the Store Number, Store Name, Address, City, and Zip Code? (Use the option nrows=1000 for part a.)
- Use the usecols option in read_csv to read the entire data set, but only those 5 columns of the data. What is the dimension of the data set? (Hint: it should now be 5 columns and more than 27 million rows. Another hint: Do NOT use nrows=1000 for part b. Instead, use the usecols option that we learned in earlier projects; see the sketch after the deliverables list below.)
Deliverables:
- Which columns contain the Store Number, Store Name, Address, City, and Zip Code?
- What is the dimension of the data set?
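A minimal sketch for part (b), assuming the five column names appear in the file exactly as written above:
import pandas as pd
# read the entire file, but only the 5 columns that we need
myDF = pd.read_csv("/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales.csv", usecols=["Store Number", "Store Name", "Address", "City", "Zip Code"])
# the dimension of the data set: (number of rows, number of columns)
print(myDF.shape)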
Question 2 (2 pts)
In this question, you will likely use the value_counts(), sort_values(), and tail() functions together.
Now let's find the 10 most popular stores (from all 27 million rows, and only the 5 columns in question 1) in three different ways:
- Find the 10 most popular values of the "Store Number".
- Find the 10 most popular values of the "Store Name".
- Find the 10 most popular values of these three columns pasted together: Address, City, "Zip Code". Hint: This may help: (myDF['Address'].astype(str) + ' ' + myDF['City'].astype(str) + ' ' + myDF['Zip Code'].astype(str))
- Do your answers to parts (a) and (b) seem to agree? Do your answers to parts (a) and (c) seem to agree? (A sketch of one approach appears after the deliverables list below.)
Deliverables:
a. The 10 most popular values of the "Store Number" and the number of occurrences of each.
b. The 10 most popular values of the "Store Name" and the number of occurrences of each.
c. The 10 most popular values of these three columns pasted together: Address, City, "Zip Code", and the number of occurrences of each.
d. Do your answers to parts (a) and (b) seem to agree? Do your answers to parts (a) and (c) seem to agree?
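A sketch of one approach, assuming myDF is the 5-column data frame from question 1 (value_counts() tallies the occurrences; sort_values() and tail(10) keep the 10 largest counts):
# part (a): the 10 most popular Store Number values, with counts
print(myDF['Store Number'].value_counts().sort_values().tail(10))
# part (b): the 10 most popular Store Name values, with counts
print(myDF['Store Name'].value_counts().sort_values().tail(10))
# part (c): paste the three columns together, then count the pasted values
pasted = myDF['Address'].astype(str) + ' ' + myDF['City'].astype(str) + ' ' + myDF['Zip Code'].astype(str)
print(pasted.value_counts().sort_values().tail(10))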
Question 3 (2 pts)
For this question, use pd.read_csv to read all 27 million rows of the data set again, but this time, only read in the columns "Zip Code", "Category Name", "Sale (Dollars)".
When you read in the data, add this option to the read_csv so that the "Zip Code" values are stored as strings:
dtype={'Zip Code':'string', 'Category Name':'string', 'Sale (Dollars)':'float'}
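A minimal sketch of this read, assuming the column names as given:
import pandas as pd
# read all rows, but only 3 columns; store the Zip Code and Category Name values as strings
myDF = pd.read_csv("/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales.csv", usecols=["Zip Code", "Category Name", "Sale (Dollars)"], dtype={'Zip Code':'string', 'Category Name':'string', 'Sale (Dollars)':'float'})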
- Use the groupby function to sum the values of "Sale (Dollars)" according to the "Zip Code". Find the 10 "Zip Code" values that have the largest sum of "Sale (Dollars)" altogether, and give those "Zip Code" values and each of their sums of "Sale (Dollars)".
- Now use the groupby function to sum the values of "Sale (Dollars)" according to the "Category Name". Find the 10 "Category Name" values that have the largest sum of "Sale (Dollars)" altogether, and give those "Category Name" values and each of their sums of "Sale (Dollars)".
Deliverables:
- Find the 10 "Zip Code" values that have the largest sum of "Sale (Dollars)" altogether, and give those "Zip Code" values and each of their sums of "Sale (Dollars)".
- Find the 10 "Category Name" values that have the largest sum of "Sale (Dollars)" altogether, and give those "Category Name" values and each of their sums of "Sale (Dollars)". (A sketch of the groupby approach appears after this list.)
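A sketch of the groupby approach, assuming myDF is the 3-column data frame read above:
# part (a): sum of the sales within each Zip Code; the 10 largest sums
print(myDF.groupby('Zip Code')['Sale (Dollars)'].sum().sort_values().tail(10))
# part (b): sum of the sales within each Category Name; the 10 largest sums
print(myDF.groupby('Category Name')['Sale (Dollars)'].sum().sort_values().tail(10))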
Question 4 (2 pts)
Read in the 10th and 15th columns of the election data (for the state and donation amounts) for the year 2020 as follows:
myDF = pd.read_csv("/anvil/projects/tdm/data/election/itcont2020.txt", header=None, sep='|', usecols=[9,14], encoding='Windows-1252')
myDF.columns = ["STATE", "TRANSACTION_AMT"]
You can convert the transaction amounts to numeric values as follows:
import numpy as np
myDF['TRANSACTION_AMT'] = pd.to_numeric(myDF['TRANSACTION_AMT'], errors='coerce').fillna(0).astype(np.int64)
Sum the amount of the donations, grouped according to the state. List the top 10 states, according to the sum of the donation amounts, and the sum of the donation amounts in each of these top 10 states.
Deliverables:
- List the top 10 states according to the sum of the donation amounts, and the sum of the donation amounts in each of these top 10 states. (A sketch appears after this list.)
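Putting the pieces above together with a groupby, a sketch might look like this:
import pandas as pd
import numpy as np
# read the 10th and 15th columns (state and transaction amount) of the pipe-separated file
myDF = pd.read_csv("/anvil/projects/tdm/data/election/itcont2020.txt", header=None, sep='|', usecols=[9,14], encoding='Windows-1252')
myDF.columns = ["STATE", "TRANSACTION_AMT"]
# coerce the amounts to integers, treating unparseable values as 0
myDF['TRANSACTION_AMT'] = pd.to_numeric(myDF['TRANSACTION_AMT'], errors='coerce').fillna(0).astype(np.int64)
# sum the donation amounts within each state, and show the 10 largest sums
print(myDF.groupby('STATE')['TRANSACTION_AMT'].sum().sort_values().tail(10))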
Question 5 (2 pts)
Use the nrows option of read_csv with the option sep="\t" to read the first 10 lines of /anvil/projects/tdm/data/movies_and_tv/imdb2024/basics.tsv, and notice that the entries of the genres column are strings containing several genres, separated by commas. On the genres column (for only these 10 lines), run the following:
myDF['genres']
myDF['genres'].str.split(',')
myDF['genres'].str.split(',').explode()
myDF['genres'].str.split(',').explode().value_counts()
Now read in only the genres column of the entire file. For each of the genres, list how many times it occurs. For instance, Action occurs 462384 times.
Deliverables:
- For each of the genres, list how many times it occurs. (A sketch appears after this list.)
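A minimal sketch, assuming the genres column name matches the file's header:
import pandas as pd
# read only the genres column of the entire tab-separated file
myDF = pd.read_csv("/anvil/projects/tdm/data/movies_and_tv/imdb2024/basics.tsv", sep="\t", usecols=["genres"])
# split each comma-separated string, flatten the lists, and count each genre
print(myDF['genres'].str.split(',').explode().value_counts())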
Submitting your Work
Please make sure that you have added comments for each question, which explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generative AI, etc.) are cited properly in the project template.
If you have any questions or issues regarding this project, please feel free to ask in seminar, over Piazza, or during office hours.
Prior to submitting your work, you need to put your work into the project template, re-run all of the code in your Jupyter notebook, and make sure that the results of running that code are visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.
Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points.
- firstname_lastname_project12.ipynb
It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, and any outside sources (people, internet pages, generative AI, etc.) must be cited properly in the project template. Please take the time to double check your work; see the instructions on how to double check this. You will not receive full credit if your submission does not show the output of your code.