STAT 19000: Project 3 — Spring 2022

Motivation: We’ve now been introduced to a variety of core Python data structures. Along the way we’ve touched on a bit of pandas, matplotlib, and have utilized some control flow features like for loops and if statements. We will continue to touch on pandas and matplotlib, but we will take a deeper dive in this project and learn more about control flow, all while digging into the data!

Context: We just finished a project where we were able to see the power of dictionaries and sets. In this project we will take a step back and make sure we are able to really grasp control flow (if/else statements, loops, etc.) in Python.

Scope: Python, dicts, lists, if/else statements, for loops, break, continue

Learning Objectives
  • List the differences between lists & tuples and when to use each.

  • Explain what is a dict and why it is useful.

  • Demonstrate a working knowledge of control flow in python: if/else statements, while loops, for loops, etc.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt

Questions

Question 1

Let’s begin this project by taking another look at question (4) from the previous project.

Although we were able to reduce the number of comparisons down a lot (from around 15000000 squared to 40000 squared) — it is still terrible and very very slow.

To see just how slow, let’s time it!

from block_timer.timer import Timer
import pandas as pd

# read in the intruder dataset and get the unique ids
df_intruder = pd.read_csv('/depot/datamine/data/noaa/2020_sampleB.csv', names=["station_id", "date", "element_code", "value", "mflag", "qflag", "sflag", "obstime"])
intruder_ids = df_intruder["station_id"].dropna().tolist()
unique_intruder_ids = list(set(intruder_ids))

# read in the original dataset and get the unique ids
df_original = pd.read_csv('/depot/datamine/data/noaa/2020_sample.csv', names=["station_id", "date", "element_code", "value", "mflag", "qflag", "sflag", "obstime"])
original_ids = df_original["station_id"].dropna().tolist()
unique_ids = list(set(original_ids))

with Timer():
    # compare the two lists
    for i in unique_intruder_ids:
        if i not in unique_ids:
            print(i)

Yikes! That’s really not very good!

So, what is the better way? To take advantage of the set object! Specifically, read the section titled "Operating on a Set" here, and think of a better way to get this value! Test out the new method — how fast was it compared to the method above?

On Brown, mine was 958 times faster than the original method! Definitely a worthwhile trick to use!

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Unlike in R, where traditional loops are rare and typically accomplished via one of the apply functions, in Python, loops are extremely common and important to understand. In Python, any iterator can be looped over. Some common iterators are: tuples, lists, dicts, sets, pandas Series, and pandas DataFrames.

Let’s get started by reading in our dataset and taking a look.

import pandas as pd

df = pd.read_csv("/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt", sep=";")

Use the following code to extract the sales amount in dollars into a list.

sales_list = df['Sale (Dollars)'].dropna().tolist()

Write a loop that uses sales_list and sums up the total sales, and prints the average sales amount.

Of course, pandas provides a method to iterate over the Sale (Dollars) Series as well! It would start as follows.

for idx, val in df['Sale (Dollars)'].dropna().iteritems():
    # put code here for series loop

Use this method to calculate the average sales amount. Which is faster? Fill in the following skeleton code to find out.

from block_timer.timer import Timer

with Timer(title="List loop"):
    # code for list loop

with Timer(title="Series loop"):
    # code for series loop
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

You may have been surprised by the fact that iterating through the Series was slower than iterating through a list. Here is a good post explaining why it is so slow!

So why use pandas? Well, it starts to be pretty great when you can take advantage of vectorization.

Let’s do a new exercise. Instead of calculating the average sales amount, let’s calculate the z-scores of the sales amounts. Just like before, do this using 2 methods. The first is to just use for loops, the len function, and the sum function. The second is to use pandas. I’ve provided you with the pandas solution.

How do you calculate a z-score?

$\frac{x_i - \mu}{\sigma}$

Where

$\sigma = \sqrt{\sum_{i=0}^n{\frac{(x_i - \mu)^{2}}{n}}}$

$n$ is the number of elements in the list.

$x_i$ is the ith element in the list.

$\mu$ is the mean of the list.

$\sigma$ is the standard deviation of the list.

Give it a shot and fill in the code below. What do the results look like?

import pandas as pd
from block_timer.timer import Timer

# df = pd.read_csv("/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt", sep=";")
sales_list = df['Sale (Dollars)'].dropna().tolist()

with Timer(title="Loops"):

    # calculate the mean
    mean = sum(sales_list)/len(sales_list)

    # calculate the std deviation
    # you can use **2 to square a value and
    # **0.5 to square root a value

    # calculate the list of z-scores

    # print the first 5 z-scores
    print(zscores[:5])

with Timer(title="Vectorization"):
    print(((df['Sale (Dollars)'] - df['Sale (Dollars)'].mean())/df['Sale (Dollars)'].std()).iloc[0:5])
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

While it is nearly always best to try and vectorize your code when using pandas, sometimes it isn’t possible to do perfectly, or it just isn’t worth the time to do it. For this question, we don’t care about vectorization.

We want to look at Volume Sold (Gallons) by Store Number. Start by building a dict called volume_dict that maps Store Number to Volume Sold (Gallons).

Since we only care about those two columns now, let’s remove the rest.

df = df.loc[:, ('Store Number', 'Volume Sold (Gallons)')]

You can loop through the DataFrame as follows.

for idx, row in df.iterrows():
    # print(idx, row)

There, idx contains the row index, and row contains a Series object containing the row of data. You could then access either of the column using either row['Store Number'] or row['Volume Sold (Gallons)'].

Build your volume_dict.

Remember, you will need to instantiate each key in the dict to prevent `KeyError`s. Alternatively, you can use a defaultdict. A defaultdict is a dict that will automatically instantiate a new key to a particular value. You could for example do the following.

from collections import defaultdict

volume_dict = defaultdict(int)

Then, by default, all keys will be instantiated to 0.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Great! Now you have your volume_dict. Write a loop that loops through your volume_dict and prints the Store Number and Volume Sold (Gallons) for each key. If the volume sold is less than 100000 use the continue keyword to skip printing anything. If the volumn sold is greater than 149999, print "HIGH: " before the store number, if the volume sold is less than 150000 print "LOW: " before the store number.

The output should be the following.

Output
LOW: 2190.0
HIGH: 4829.0
HIGH: 2633.0
HIGH: 2512.0
LOW: 3494.0
LOW: 2625.0
HIGH: 3420.0
LOW: 3952.0
HIGH: 3385.0
LOW: 3354.0
LOW: 3814.0

The continue keyword skips the rest of the code in the loop, and progresses to the next iteration.

In Python, there is if/elif/else. Elif stands for "else if".

To iterate through a dictionary, you can use the items method.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.