STAT 19000: Project 1 — Spring 2022

Motivation: Last semester each project was completed in Jupyter Lab from ondemand.brown.rcac.purdue.edu. Although our focus this semester is on the use of Python to solve data-driven problems, we still get to stay in the same environment. In fact, Jupyter Lab is Python-first. Now, instead of using the f2021-s2022-r kernel, select the f2021-s2022 kernel.

Context: In this project we will re-familiarize ourselves with Jupyter Lab and its capabilities. We will also introduce Python and begin learning some of the syntax.

Scope: Python, Jupyter Lab

Learning Objectives
  • Use Jupyter Lab to run Python code and create Markdown text.

  • Gain exposure to some Python syntax.

Make sure to read about and use the template found here, and review the important information about project submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/noaa/*.csv

Questions

Question 1

Please review our updated submission guidelines before submitting your project.

Let’s start the semester by doing some basic Python work, and compare and contrast with R.

First, let’s learn how to run R code in our regular, non-R kernel, f2021-s2022. In a cell, run the following code in order to load the extension that allows us to run R code.

%load_ext rpy2.ipython

You can see these instructions here.

Next, in order to actually run R code, we need to place the following in the first line of every cell where we want to run R code: %%R. You can think of this as declaring the code cell as an R code cell. For example, the following will successfully run in our f2021-s2022 kernel.

%%R

my_vector <- c(1,2,3,4,5)
my_vector

Great! Run the cell in your notebook to see the output.

Now, let’s perform the equivalent operations in Python!

my_vector = (1,2,3,4,5)
my_vector

The f2021-s2022 kernel is a normal kernel. Essentially, it will assume that code in a cell is Python code, unless "told" otherwise.

As you can see, the output is essentially the same. However, in Python, there are actually a few "primary" ways you could do this.

my_tuple = (1,2,3,4,5,)
my_tuple

my_list = [1,2,3,4,5,]
my_list

import numpy as np

my_array = np.array([1,2,3,4,5])
my_array

The first two options are built into Python itself, so no import is needed. The first option is a tuple, which is an ordered sequence of values. A tuple is immutable, meaning that once you create it, you cannot change it.

my_tuple = (1,2,3,4,5)
my_tuple[0] = 10 # error!
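If you would like to see the error without stopping the cell, the following is a minimal sketch. It uses try and except, which we have not covered yet, and the names error_message and new_tuple are just illustrative. It also shows one possible way to get a "changed" tuple: build a new one.

```python
my_tuple = (1, 2, 3, 4, 5)

# Assigning to a position in a tuple raises a TypeError.
try:
    my_tuple[0] = 10
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# The original tuple is untouched. To "change" it, create a new tuple
# from a one-item tuple plus a slice of the old one.
new_tuple = (10,) + my_tuple[1:]
print(my_tuple)
print(new_tuple)
```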

The second option is a list, which is also an ordered sequence of values. A list is mutable, meaning that after you create it, you can change it.

my_list = [1,2,3,4,5]
my_list[0] = 10
my_list # [10,2,3,4,5]

The third option is a numpy array, which also holds a sequence of values. A numpy array is mutable, meaning that after you create it, you can change it. In order to use numpy arrays, you must import the numpy package first. numpy is a numerical computation library that is optimized for a lot of the work we will be doing this semester. With that being said, it's best to learn the basics of Python first, as a lot can be accomplished without using numpy.

For this question, read as much as you can about tuples and lists, and run the examples we provided above.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

In general, tuples are used when you have a set of known values that you want to store and access efficiently. Lists are used when you want to do the same, but you have the need to manipulate the data within. Most often, lists will be your go-to.

In Python, a list is an object. Objects have methods. Methods are most simply defined as functions that are associated with and operate on the data (usually) within the object itself.

Here you can find a list of the list methods. For example, the append method adds an item to the end of a list.

Methods are called using dot notation. The following is an example of using the append method and dot notation to add the number 99 to the end of our list, my_list.

my_list = [1,2,3,4,5]
my_list.append(99)
my_list # [1,2,3,4,5,99]
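To get a feel for a few more list methods, here is a small sketch on a separate, made-up list called letters (not part of the project data). Each method changes the list in place, and the comments show the list after each call.

```python
letters = ["b", "d", "a"]

letters.insert(1, "c")       # ['b', 'c', 'd', 'a']
last = letters.pop()         # removes and returns 'a' -> ['b', 'c', 'd']
letters.extend(["f", "e"])   # ['b', 'c', 'd', 'f', 'e']
letters.sort()               # ['b', 'c', 'd', 'e', 'f']
letters.remove("d")          # ['b', 'c', 'e', 'f']

print(last)
print(letters)
```

Notice that none of these calls needed an assignment like letters = letters.sort(). Methods such as sort and reverse modify the list itself and return None.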

Create a list called my_list with the values 1,2,3,4,5. Then, use the list methods to change my_list to contain the following values, in order: 7,5,4,3,2,1,6. Do not manually set values using indexing — just use the list methods.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Solution
my_list = [1,2,3,4,5]
my_list.append(7)
my_list.reverse()
my_list.append(6)
my_list
[7, 5, 4, 3, 2, 1, 6]

Question 3

Great! You may have noticed (or already know) that to get the first value in a list (or tuple) we would do my_list[0]. Recall that in R, we would do my_list[1]. This is because Python has 0-based indexing instead of 1-based indexing. While at first this may be confusing, many people find it much easier to use 0-based indexing than 1-based indexing.

Use indexing to print the values 7,4,2,6 from the modified my_list in the previous question.

Use indexing to print the values in reverse order without using the reverse method.

Use indexing to print the second through fourth values in my_list (5,4,3).

The "jump" feature of Python indexing will be useful here!
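As a quick reference, slicing takes the form some_list[start:stop:jump], where start is included, stop is excluded, and any of the three can be left out. The following sketch uses a made-up list called letters so you can see the pattern before applying it to my_list.

```python
letters = ["a", "b", "c", "d", "e", "f"]

first = letters[0]           # 'a' (0-based, so index 0 is the first value)
last = letters[-1]           # 'f' (negative indexes count from the end)
middle = letters[1:4]        # ['b', 'c', 'd'] (index 4 is not included)
every_other = letters[::2]   # ['a', 'c', 'e'] (jump of 2)
backwards = letters[::-1]    # ['f', 'e', 'd', 'c', 'b', 'a'] (jump of -1)

print(first, last)
print(middle)
print(every_other)
print(backwards)
```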

Relevant topics: indexing

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Solution
my_list[::2]
my_list[::-1]
my_list[1:4]
[7, 4, 2, 6]
[6, 1, 2, 3, 4, 5, 7]
[5, 4, 3]

Question 4

Great! If you have one takeaway from the previous three questions it should be that when you see [] think lists. When you see () think tuples (or generators, but ignore this for now).
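One way to check this rule of thumb is to ask Python for the type of a value. The sketch below uses made-up variable names, and also shows a common surprise: parentheses around a single value do not make a tuple unless there is a trailing comma.

```python
square_brackets = [1, 2, 3]
parentheses = (1, 2, 3)

print(type(square_brackets).__name__)   # list
print(type(parentheses).__name__)       # tuple

# Parentheses alone just group an expression, so this is the integer 5.
not_a_tuple = (5)
# The trailing comma is what makes a one-item tuple.
one_item_tuple = (5,)

print(type(not_a_tuple).__name__)       # int
print(type(one_item_tuple).__name__)    # tuple
```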

It's not a Data Mine project without data. After we get through some basics of Python, we will be primarily working with data using the pandas and numpy libraries. With that being said, there is no reason not to do some work manually in the meantime!

Python does not have the data frame concept in its standard library like R does. This will most likely make things that would be simple to do in R much more complicated in Python. The pandas library introduces the data frame, so be patient and don’t be too frustrated when we (at first) forgo the pandas library.

Okay! Let’s get started with our NOAA weather data. The following is a very small sample of the /depot/datamine/data/noaa/2020.csv dataset.

sample
AE000041196,20200101,TMIN,168,,,S,
AE000041196,20200101,PRCP,0,D,,S,
AE000041196,20200101,TAVG,211,H,,S,
AEM00041194,20200101,PRCP,0,,,S,
AEM00041194,20200101,TAVG,217,H,,S,
AEM00041217,20200101,TAVG,205,H,,S,
AEM00041218,20200101,TMIN,148,,,S,
AEM00041218,20200101,TAVG,199,H,,S,
AFM00040938,20200101,PRCP,23,,,S,
AFM00040938,20200101,TAVG,54,H,,S,

You can read here about what the data means.

  1. 11 character station ID

  2. 8 character date in YYYYMMDD format

  3. 4 character element code (you can see the element codes here in section III)

  4. value of the data (varies based on the element code)

  5. 1 character M-flag (10 possible values, see section III here)

  6. 1 character Q-flag (14 possible values, see section III here)

  7. 1 character S-flag (30 possible values, see section III here)

  8. 4 character observation time (HHMM) (0700 = 7:00 AM) — may be blank
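To connect the eight fields above to a real row, here is a rough sketch that splits the first sample line on commas by hand. The variable names (station_id, m_flag, and so on) are our own labels for illustration, not official column names.

```python
# First line of the sample shown above.
line = "AE000041196,20200101,TMIN,168,,,S,"

# Splitting on commas gives one string per field, including empty ones.
fields = line.split(",")
print(len(fields))

# Unpack the eight fields into illustrative variable names.
station_id, date, element, value, m_flag, q_flag, s_flag, obs_time = fields

# The date is a string in YYYYMMDD format, so slicing pulls out the parts.
year, month, day = date[:4], date[4:6], date[6:]

print(station_id, element, value)
print(year, month, day)
```

Every field is read as a string, even the value 168. Blank fields, such as the observation time here, come through as empty strings.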

Since we aren’t using the pandas library, we need to use something in order to bring the data into Python. In this case, we will use the csv library, a library used for reading and writing DSV (delimiter-separated values) data.

The official documentation for this library is here.

If you read the first example in the csv.reader section here, you will find the following quick and succinct example.

import csv  # (1)

with open('eggs.csv', newline='') as csvfile:  # (2)
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')  # (3)
    for row in spamreader:  # (4)
        print(', '.join(row))  # (5)
Output
Spam, Spam, Spam, Spam, Spam, Baked Beans
Spam, Lovely Spam, Wonderful Spam

You do not need to understand everything that is happening in this example (yet). With that being said, the following is an explanation for each part.

1 We are importing the csv library. If we didn’t have this line, the program would crash when we try to call csv.reader(…) on the line marked (3).
2 We are opening the eggs.csv file. This is the file we will be reading. Here, eggs.csv is assumed to be in the same directory where we are running the code. It could just as easily be in a folder called "my_data" in the data depot, in which case we would replace eggs.csv with the absolute path to our file of interest: /depot/datamine/data/my_data/eggs.csv. In addition, we call our opened file csvfile.
3 Here, we create a csv.reader object called spamreader. This object is an iterator that yields one row at a time, much like a generator. We can loop through it to get a single row of data at a time.
4 Here, we are looping through each row of data from the spamreader object. On each pass through the loop, we save the data into a variable called row. Specifically, row is a list, where the first value is the first space-separated value in the row, the second is the second space-separated value in the row, etc.
5 Here, we call the string method join on the ", " string, which takes each value in row and puts a ", " between them. This results in the "Spam, Spam, Spam, …, Baked Beans" that we see in the output.
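If you want to try this example without an eggs.csv file, the sketch below feeds the reader from a string instead. The text in eggs_text is our assumption about what eggs.csv contains, chosen so that it produces the output shown above. The io.StringIO object simply stands in for the opened file.

```python
import csv
import io

# Assumed contents of eggs.csv: values separated by spaces, with '|' used
# as the quote character around values that contain spaces themselves.
eggs_text = (
    "Spam Spam Spam Spam Spam |Baked Beans|\n"
    "Spam |Lovely Spam| |Wonderful Spam|\n"
)

csvfile = io.StringIO(eggs_text)  # behaves like an opened text file
spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')

parsed_rows = []
for row in spamreader:
    parsed_rows.append(row)    # each row is a list of strings
    print(', '.join(row))      # join the values with ", " between them
```

The quotechar is what keeps "Baked Beans" together as one value even though it contains the delimiter.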

This code could have been written like this:

import csv

csvfile = open('eggs.csv', newline='')
spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
for row in spamreader:
    print(', '.join(row))

csvfile.close()

But we have to close the file — otherwise, it could cause issues down the road. The with statement, among other things, handles this automatically for you.
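You can check this behavior yourself using the closed attribute of a file object. The following is a small sketch that writes a throwaway file called example.csv (a made-up name) into a temporary directory, reads it back, and records whether the file is closed inside and after the with block.

```python
import csv
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a tiny CSV file to read back.
    path = os.path.join(tmp_dir, "example.csv")
    with open(path, "w", newline="") as out_file:
        out_file.write("a,b,c\n1,2,3\n")

    with open(path, newline="") as csvfile:
        rows = list(csv.reader(csvfile))
        closed_inside = csvfile.closed   # still open inside the block

    closed_after = csvfile.closed        # closed automatically afterwards

print(rows)
print(closed_inside, closed_after)
```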

One important part of learning a new language is jumping right in and trying things out! Modify the provided code to read in the 2020.csv file and print the 4th column only.

We do not want you to print out every row of data — that would be a lot and cause your notebook to crash! Instead, in the line following the print statement write break. We will learn about this later, but the break statement will stop the loop as soon as it is run. This will cause the program to just print the first line of data.

In general, we never want more than 10 or so lines — maybe 100 at the maximum. When in doubt, just print 10 lines.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Solution
import csv

with open('/depot/datamine/data/noaa/2020.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        print(row[3])
        break
168

Question 5

Below we’ve provided you with code that we would like you to fill in. Print the first 10 rows of the data.

import csv

with open('/depot/datamine/data/noaa/2020.csv') as my_file:
    reader = csv.reader(my_file)

    # TODO: create variable to store how many rows we've printed so far

    for row in reader:
        print(row)

        # TODO: increment the variable storing our count, since we've printed a row

        # TODO: if we've printed 10 rows, run the break statement
        break

You will need to indent the break statement to run it "within" the if statement you will create. Yes, we haven’t taught if statements yet, but you can do this!

If you want to try and solve this another way, Google "enumerate Python" and see if you can figure out how to do this without using the counting variable you create.
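As a starting point, here is a tiny sketch of what enumerate does on a made-up list of element codes. It pairs each value with its position, starting from 0, so you get a counter for free while looping.

```python
elements = ["TMIN", "PRCP", "TAVG"]

# Each pass through the loop gives a (position, value) pair.
for position, element in enumerate(elements):
    print(position, element)

# Collecting the pairs into a list shows the same thing all at once.
pairs = list(enumerate(elements))
print(pairs)
```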

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Solution
import csv

with open('/depot/datamine/data/noaa/2020.csv') as my_file:
    reader = csv.reader(my_file)

    for ct, row in enumerate(reader):
        print(row)

        if ct == 9:
            break
['AE000041196', '20200101', 'TMIN', '168', '', '', 'S', '']
['AE000041196', '20200101', 'PRCP', '0', 'D', '', 'S', '']
['AE000041196', '20200101', 'TAVG', '211', 'H', '', 'S', '']
['AEM00041194', '20200101', 'PRCP', '0', '', '', 'S', '']
['AEM00041194', '20200101', 'TAVG', '217', 'H', '', 'S', '']
['AEM00041217', '20200101', 'TAVG', '205', 'H', '', 'S', '']
['AEM00041218', '20200101', 'TMIN', '148', '', '', 'S', '']
['AEM00041218', '20200101', 'TAVG', '199', 'H', '', 'S', '']
['AFM00040938', '20200101', 'PRCP', '23', '', '', 'S', '']
['AFM00040938', '20200101', 'TAVG', '54', 'H', '', 'S', '']

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.