TDM 10200: Project 3 — 2023

Motivation: Learning about big data. When working with large data sets, it is important to know how we can use loops to find our information a little bit at a time, without reading in all of the files at once. We will need to set our cores to 4 when we spin up our Jupyter Notebook session. This gives us more memory for processing this week’s datasets; if we do not adjust the cores, our kernel will crash every time we try to run a cell.

Scope: Python, if statements, for loops

Make sure to read about, and use, the template found here, and the important information about project submissions here.

Dataset(s)

/anvil/projects/tdm/data/flights/subset/

Questions

ONE

First, let’s go ahead and see what files are in this dataset:

ls /anvil/projects/tdm/data/flights/subset/

This allows you to see all of the files that are in this dataset.
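
If you prefer to stay in Python, here is a minimal sketch that lists the same files with pathlib (the library we import in part 3 below):

from pathlib import Path

# print the name of every file in the subset directory, in sorted order
for file in sorted(Path('/anvil/projects/tdm/data/flights/subset/').iterdir()):
    print(file.name)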

  1. Considering the files in this directory, which years of data are available?

  2. What would the path be to access only the 2003 file in this data set?

  3. Go ahead and import the library pandas as pd, and import Path from pathlib

Helpful Hints
/anvil/projects/tdm/data/flights/subset/2003.csv

import pandas as pd
from pathlib import Path
Items to submit
  • Code used to answer the question.

TWO

files = [Path(f'/anvil/projects/tdm/data/flights/subset/{year}.csv') for year in range(1987,2009)]
files

This code uses a list comprehension to create a list of file paths for the years 1987-2008. The variable files now contains Path objects pointing to all of the CSV files in this directory.

Notice that, in a range in Python, the final number in the range is not included.
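
For example, a smaller range makes this easy to see:

list(range(2005, 2009))   # [2005, 2006, 2007, 2008]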

Let’s test out the first file (from 1987) to see if we can find the column names.

  1. How many columns are there?

  2. What are the column names?

  3. Display the data from the first five rows.

Helpful Hints

# reads the first file into a DataFrame called eightyseven
eightyseven = pd.read_csv(files[0])
# looks at the column names of the DataFrame
column_names = eightyseven.columns
print(column_names)
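
Once the file is read in, a couple of standard pandas calls answer parts 1 and 3 (a sketch building on the hint above):

print(len(column_names))   # the number of columns
eightyseven.head()         # the first five rows (displayed when it is the last line of a cell)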

Items to submit
  • Code used to answer the question.

  • Result of code.

THREE

Let’s look at the column Origin.

  1. Print out the unique elements of the Origin column.

  2. Find the number of times that IND occurs in the Origin column.

Helpful Hint
eightyseven['Origin'].value_counts()['IND']
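
For part 1, the unique method returns the distinct values in a column, for example:

print(eightyseven['Origin'].unique())
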
Items to submit
  • Code used to answer the question.

  • Result of the code

FOUR

Let’s do the same thing for the 1988, 1989, and 1990 data sets, naming them eightyeight, eightynine, and ninety. (A sketch for reading in one of these years appears after the questions below.)

  1. How many times is 'IND' the Origin airport in eightyeight?

  2. How many times is 'IND' the Origin airport in eightynine?

  3. How many times is 'IND' the Origin airport in ninety?
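
Helpful Hint

Here is a sketch for 1988, assuming the files list from question TWO (files[0] is 1987, so files[1] is 1988, files[2] is 1989, and files[3] is 1990):

eightyeight = pd.read_csv(files[1])
print(eightyeight['Origin'].value_counts()['IND'])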

Items to submit
  • Code used to answer questions

  • Result of the code

FIVE

Knowing that we can find how many times the value IND shows up in Origin, we could do this for each of the years and add the counts up, to find the total number of times that the Indianapolis airport was the origin airport for flights taken during the years 1987-2008. But that is a lot of work! We can avoid this tedious manual bookkeeping by (instead) using a for loop to go through all of the subset files, keeping a running total of the number of times that IND shows up in the Origin column.

We could use this code

count = 0
for file in files:
    df = pd.read_csv(file)
    count += len(df[df['Origin'] == 'IND'])

print(count)

BUT before you go ahead and copy and paste: this code reads each entire file into memory as a DataFrame, and this will crash the kernel, even though we upped the cores to 4.

This means we have to think of another way to do it.

We can use a for loop, which lets us iterate over a sequence and repeatedly execute the same block of code. This is very helpful when you come across situations in which you need to run a specific piece of code over and over again, but you don’t want to write the same lines of code multiple times.

We need a way to go through the files line by line, reading each line but never keeping a whole file in memory. In this way, we can go through all of the data files and still keep track of how many occurrences we have of a specific value.

total_count = 0
for file in files:
    for df in pd.read_csv(file, chunksize=10000):
        for index, row in df.iterrows():
            if row['Origin'] == 'IND':
                total_count += 1

print(total_count)

You will note that the code above DOES produce the correct answer, BUT it takes a very long time to run! Is there a faster way to run this code?

Helpful Hint
origin_ind = 0
for file in files:
    # open the file and read it one line at a time,
    # so the whole file is never held in memory
    with open(file, 'r') as f:
        for line in f:
            # Origin is the 17th column (index 16) in these files
            if line.split(",")[16] == 'IND':
                origin_ind += 1
print(origin_ind)
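
Another option, in between the two approaches above, is a vectorized chunked read (a sketch; the chunk size of 10000 is arbitrary). pandas still holds only one chunk in memory at a time, but each chunk is checked all at once instead of row by row, so it runs much faster than iterrows:

total_count = 0
for file in files:
    # read each file in chunks, so a whole file is never in memory at once
    for chunk in pd.read_csv(file, chunksize=10000):
        # count the 'IND' rows in this chunk all at once (vectorized)
        total_count += (chunk['Origin'] == 'IND').sum()
print(total_count)
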
Items to submit
  • Code used to answer the question.

  • Result of code.

Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.