TDM 10200: Python Project 12 — Spring 2025
Motivation: Some files are too big to read into Python all at once. We practice methods for extracting only the portions of a file that we need.
Context: We can use the "usecols" option and the "nrows" option of the pd.read_csv function. These allow us to read a portion of a file (a small number of rows and/or a small number of columns).
Scope: We learn how to work with enormous files in Python when we only need a portion of the file in our analysis.
Dataset(s)
This project will use the following datasets:
- /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales.csv (Iowa liquor sales)
- /anvil/projects/tdm/data/election/itcont2020.txt (2020 election data)
- /anvil/projects/tdm/data/movies_and_tv/imdb2024/basics.tsv (Internet Movie Database (IMDB))
It should be enough to use 6 cores in your Jupyter Lab session for this project.
Questions
Question 1 (2 pts)
First load import pandas as pd so that you have the pd.read_csv function available, and also run pd.set_option('display.max_columns', 50) so that you can see 50 columns.
If you try to read the entire file /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales.csv, then the seminar kernel will crash. Instead, use read_csv with the option nrows=1000 (in part a) so that you see (only) the first 1000 rows of the data frame. (Do not worry about the memory warning that appears in pink.)
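A minimal sketch of this setup and first read (using the path given above) might look like the following:
import pandas as pd
pd.set_option('display.max_columns', 50)
# read only the first 1000 rows, so that the kernel does not crash
myDF = pd.read_csv("/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales.csv", nrows=1000)
# inspect the column names, to see where each of the five columns sits
print(myDF.columns)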
- Which columns contain the Store Number, Store Name, Address, City, and Zip Code? (Use the option nrows=1000 for part a.)
- Use the usecols option in read_csv to read the entire data set, but only those 5 columns of the data. What is the dimension of the data set? (Hint: it should now be 5 columns and more than 27 million rows. Another hint: Do NOT use nrows=1000 for part b. Instead, use the usecols option that we learned in earlier projects; see the sketch after the deliverables list below.)
Deliverables:
- Which columns contain the Store Number, Store Name, Address, City, and Zip Code?
- What is the dimension of the data set?
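A minimal sketch for part (b), assuming the five column names appear in the file exactly as written above:
import pandas as pd
# read the entire file, but only the 5 columns that we need
myDF = pd.read_csv("/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales.csv", usecols=["Store Number", "Store Name", "Address", "City", "Zip Code"])
# the dimension of the data set: (number of rows, number of columns)
print(myDF.shape)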
Question 2 (2 pts)
In this question, you will likely use the value_counts(), sort_values(), and tail() functions together.
Now let's find the 10 most popular stores (from all 27 million rows, and only the 5 columns in question 1) in three different ways:
- Find the 10 most popular values of the "Store Number".
- Find the 10 most popular values of the "Store Name".
- Find the 10 most popular values of these three columns pasted together: Address, City, "Zip Code". Hint: This may help: (myDF['Address'].astype(str) + ' ' + myDF['City'].astype(str) + ' ' + myDF['Zip Code'].astype(str))
- Do your answers to parts (a) and (b) seem to agree? Do your answers to parts (a) and (c) seem to agree? (A sketch of one approach appears after the deliverables list below.)
Deliverables:
a. The 10 most popular values of the "Store Number" and the number of occurrences of each.
b. The 10 most popular values of the "Store Name" and the number of occurrences of each.
c. The 10 most popular values of these three columns pasted together: Address, City, "Zip Code", and the number of occurrences of each.
d. Do your answers to parts (a) and (b) seem to agree? Do your answers to parts (a) and (c) seem to agree?
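A sketch of one approach, assuming myDF is the 5-column data frame from question 1 (value_counts() tallies the occurrences; sort_values() and tail(10) keep the 10 largest counts):
# part (a): the 10 most popular Store Number values, with counts
print(myDF['Store Number'].value_counts().sort_values().tail(10))
# part (b): the 10 most popular Store Name values, with counts
print(myDF['Store Name'].value_counts().sort_values().tail(10))
# part (c): paste the three columns together, then count the pasted values
pasted = myDF['Address'].astype(str) + ' ' + myDF['City'].astype(str) + ' ' + myDF['Zip Code'].astype(str)
print(pasted.value_counts().sort_values().tail(10))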
Question 3 (2 pts)
For this question, use pd.read_csv to read all 27 million rows of the data set again, but this time, only read in the columns "Zip Code", "Category Name", "Sale (Dollars)".
When you read in the data, add this option to the read_csv so that the "Zip Code" values are stored as strings:
dtype={'Zip Code':'string', 'Category Name':'string', 'Sale (Dollars)':'float'}
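A minimal sketch of this read, assuming the column names as given:
import pandas as pd
# read all rows, but only 3 columns; store the Zip Code and Category Name values as strings
myDF = pd.read_csv("/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales.csv", usecols=["Zip Code", "Category Name", "Sale (Dollars)"], dtype={'Zip Code':'string', 'Category Name':'string', 'Sale (Dollars)':'float'})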
- Use the groupby function to sum the values of "Sale (Dollars)" according to the "Zip Code". Find the 10 "Zip Code" values that have the largest sum of "Sale (Dollars)" altogether, and give those "Zip Code" values and each of their sums of "Sale (Dollars)".
- Now use the groupby function to sum the values of "Sale (Dollars)" according to the "Category Name". Find the 10 "Category Name" values that have the largest sum of "Sale (Dollars)" altogether, and give those "Category Name" values and each of their sums of "Sale (Dollars)".
Deliverables:
- Find the 10 "Zip Code" values that have the largest sum of "Sale (Dollars)" altogether, and give those "Zip Code" values and each of their sums of "Sale (Dollars)".
- Find the 10 "Category Name" values that have the largest sum of "Sale (Dollars)" altogether, and give those "Category Name" values and each of their sums of "Sale (Dollars)". (A sketch of the groupby approach appears after this list.)
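A sketch of the groupby approach, assuming myDF is the 3-column data frame read above:
# part (a): sum of the sales within each Zip Code; the 10 largest sums
print(myDF.groupby('Zip Code')['Sale (Dollars)'].sum().sort_values().tail(10))
# part (b): sum of the sales within each Category Name; the 10 largest sums
print(myDF.groupby('Category Name')['Sale (Dollars)'].sum().sort_values().tail(10))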
Question 4 (2 pts)
Read in the 10th and 15th columns of the election data (for the state and donation amounts) for the year 2020 as follows:
myDF = pd.read_csv("/anvil/projects/tdm/data/election/itcont2020.txt", header=None, sep='|', usecols=[9,14], encoding='Windows-1252')
myDF.columns = ["STATE", "TRANSACTION_AMT"]
You can convert the transaction amounts to numeric values as follows:
import numpy as np
myDF['TRANSACTION_AMT'] = pd.to_numeric(myDF['TRANSACTION_AMT'], errors='coerce').fillna(0).astype(np.int64)
Sum the amount of the donations, grouped according to the state. List the top 10 states, according to the sum of the donation amounts, and the sum of the donation amounts in each of these top 10 states.
Deliverables:
- List the top 10 states according to the sum of the donation amounts, and the sum of the donation amounts in each of these top 10 states. (A sketch appears after this list.)
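Putting the pieces above together with a groupby, a sketch might look like this:
import pandas as pd
import numpy as np
# read the 10th and 15th columns (state and transaction amount) of the pipe-separated file
myDF = pd.read_csv("/anvil/projects/tdm/data/election/itcont2020.txt", header=None, sep='|', usecols=[9,14], encoding='Windows-1252')
myDF.columns = ["STATE", "TRANSACTION_AMT"]
# coerce the amounts to integers, treating unparseable values as 0
myDF['TRANSACTION_AMT'] = pd.to_numeric(myDF['TRANSACTION_AMT'], errors='coerce').fillna(0).astype(np.int64)
# sum the donation amounts within each state, and show the 10 largest sums
print(myDF.groupby('STATE')['TRANSACTION_AMT'].sum().sort_values().tail(10))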
Question 5 (2 pts)
Use the nrows option of read_csv with the option sep="\t" to read the first 10 lines of /anvil/projects/tdm/data/movies_and_tv/imdb2024/basics.tsv, and notice that the entries of the genres column are strings containing several genres, separated by commas. On the genres column (for only these 10 lines), run the following:
myDF['genres']
myDF['genres'].str.split(',')
myDF['genres'].str.split(',').explode()
myDF['genres'].str.split(',').explode().value_counts()
Now read in only the genres column of the entire file. For each of the genres, list how many times it occurs. For instance, Action occurs 462384 times.
Deliverables:
- For each of the genres, list how many times it occurs. (A sketch appears after this list.)
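A minimal sketch, assuming the genres column name matches the file's header:
import pandas as pd
# read only the genres column of the entire tab-separated file
myDF = pd.read_csv("/anvil/projects/tdm/data/movies_and_tv/imdb2024/basics.tsv", sep="\t", usecols=["genres"])
# split each comma-separated string, flatten the lists, and count each genre
print(myDF['genres'].str.split(',').explode().value_counts())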
Submitting your Work
Please make sure that you have added comments for each question, which explain your thinking about your method of solving each question. Please also make sure that your work is your own work, and that any outside sources (people, internet pages, generative AI, etc.) are cited properly in the project template.
If you have any questions or issues regarding this project, please feel free to ask in seminar, over Piazza, or during office hours.
Prior to submitting your work, you need to put your work into the project template, re-run all of the code in your Jupyter notebook, and make sure that the results of running that code are visible in your template. Please check the detailed instructions on how to ensure that your submission is formatted correctly. To download your completed project, you can right-click on the file in the file explorer and click 'download'.
Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points.
- firstname_lastname_project12.ipynb
It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, and any outside sources (people, internet pages, generative AI, etc.) must be cited properly in the project template. Please take the time to double check your work; see the instructions on how to double check this. You will not receive full credit if your submission does not show the output of your code.