STAT 19000: Project 2 — Spring 2021

Motivation: In Python, it is important to understand some of the data types in more depth than you would in R. Many of the data types in Python will seem very familiar. A character in R is similar to a str in Python. An integer in R is an int in Python. A numeric in R is similar to a float in Python. A logical in R is similar to a bool in Python. In addition, packages like numpy and pandas introduce some very popular classes of their own. On the other hand, Python data types like `tuple`s, `list`s, `set`s, and `dict`s diverge from R a little more. It is essential to understand these basic concepts before jumping too far ahead.
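As a quick orientation, the sketch below shows the built-in types named above. All values and variable names are made up for illustration and are not part of the project dataset.

```python
# Scalar types with close R counterparts (values are invented for illustration).
name = "Honda"        # str   ~ character in R
year = 2012           # int   ~ integer in R
price = 7999.99       # float ~ numeric in R
is_clean = True       # bool  ~ logical in R

# Container types that diverge more from R.
point = (40.42, -86.91)                  # tuple: ordered, immutable
colors = ["red", "blue", "red"]          # list: ordered, mutable
unique_colors = set(colors)              # set: unordered, no duplicates
listing = {"make": name, "year": year}   # dict: key -> value mapping

# Print the type name next to each value.
for value in (name, year, price, is_clean, point, colors, unique_colors, listing):
    print(type(value).__name__, value)
```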

Context: This is the second project introducing some basic data types, and demonstrating some familiar control flow concepts, all while digging right into a dataset.

Scope: tuples, lists, if statements, opening files

Learning Objectives
  • List the differences between lists & tuples and when to use each.

  • Gain familiarity with string methods, list methods, and tuple methods.

  • Demonstrate the ability to read and write data of various formats using various packages.

Make sure to read about, and use, the template found here, and read the important information about project submissions here.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/craigslist/vehicles.csv

Questions

Question 1

Read in the dataset /class/datamine/data/craigslist/vehicles.csv into a pandas DataFrame called myDF. pandas is an integral tool for various data science tasks in Python. You can read a quick intro here. We will be slowly introducing bits and pieces of this package throughout the semester. Similarly, we will try to introduce byte-sized (ha!) portions of plotting packages to slowly build up your skills.

How big is the dataset (in MB or GB)?
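One way to read "how big" is the size of the file on disk. The sketch below assumes that reading and uses a throwaway temporary file so it runs anywhere; `size_in_mb` is a hypothetical helper name, not something provided by the course.

```python
import os
import tempfile


def size_in_mb(path):
    """Return the on-disk size of `path` in megabytes (1 MB = 1,000,000 bytes)."""
    return os.path.getsize(path) / 1_000_000


# Write a throwaway file of known size to demonstrate the helper.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 2_500_000)  # 2.5 million bytes
    tmp_path = tmp.name

demo_size = size_in_mb(tmp_path)
print(f"{demo_size:.2f} MB")

os.remove(tmp_path)  # clean up the temporary file
```

Note that the size of a file on disk generally differs from the memory a pandas DataFrame occupies once loaded; pandas reports the latter through `DataFrame.memory_usage`.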

If you didn’t do [optional question 6 in project 1](#p1-06), we would recommend taking a look.

Remember to check out a question’s relevant topics. We try very hard to link you to content and examples that will get you up and running as quickly as possible.

Items to submit
  • Python code used to solve the problem.

Question 2

In Question 1, we read our data into a pandas DataFrame. Use one of the pandas DataFrame attributes to get the number of columns and rows of our dataset. How many columns and rows are there? Use f-strings to print a message, for example:

There are 123 columns in the DataFrame!
There are 321 rows in the DataFrame!
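A minimal sketch of this kind of message, using a tiny made-up DataFrame (`toy_df`, a stand-in rather than the real myDF) and the `shape` attribute, which is one of several attributes that could be used here:

```python
import pandas as pd

# A tiny invented DataFrame standing in for myDF.
toy_df = pd.DataFrame(
    {"make": ["ford", "honda", "kia"], "odometer": [120000, 85000, 40000]}
)

# shape is a (rows, columns) tuple, so it can be unpacked into two names.
n_rows, n_cols = toy_df.shape

# Expressions inside {} are evaluated and inserted into the f-string.
message = (
    f"There are {n_cols} columns in the DataFrame!\n"
    f"There are {n_rows} rows in the DataFrame!"
)
print(message)
```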

In project 1, we learned how to read a csv file in, line-by-line, and print values. Use the csv package to print just the first row, which should contain the names of the columns, OR instead of using the csv package, use one of the pandas attributes from myDF (to print the column names).
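For the csv route, the sketch below reads only the first row of an in-memory stand-in for a file. The column names and values are invented; with a real file you would pass the object returned by `open(...)` to `csv.reader` instead.

```python
import csv
import io

# In-memory stand-in for a csv file; the contents are made up.
fake_file = io.StringIO("id,price,year\n1,7999,2012\n2,3500,2005\n")

reader = csv.reader(fake_file)
header = next(reader)  # the first row the reader yields is the header row
print(header)
```

Calling `next` once avoids looping over the whole file when only the header is needed.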

Items to submit
  • The output from printing the f-strings.

  • Python code used to solve the problem.

Question 3

Use the csv or pandas package to get a list called our_columns that contains the column names. Add a string, "extra", to the end of our_columns. Print the second value in the list. Without using a loop, print the 1st, 3rd, 5th, etc. elements of the list. Print the last four elements of the list ( "state", "lat", "long", and "extra") by accessing their negative index.
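The indexing and slicing ideas in this question can be sketched on a small made-up list (`letters` is illustrative, not the column list itself):

```python
letters = ["a", "b", "c", "d", "e", "f"]
letters.append("g")          # add one item to the end of the list

second = letters[1]          # indexing starts at 0, so index 1 is the 2nd item
every_other = letters[::2]   # start at index 0 and step by 2
last_three = letters[-3:]    # negative indices count back from the end

print(second)
print(every_other)
print(last_three)
```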

"extra" doesn’t belong in our list. You can easily remove this value from the list by doing the following:

our_columns.pop(25)
# or even this, as pop removes the last value by default
our_columns.pop()

The problem with this solution is that you must know the index of the value you’d like to remove, and sometimes you do not. Instead, please show how to use a list method to remove "extra" by value rather than by index.

Items to submit
  • Python code used to solve the problem.

  • The output from running your code.

Question 4

matplotlib is one of the primary plotting packages in Python. You are provided with the following code:

my_values = tuple(myDF.loc[:, 'odometer'].dropna().to_list())

The result is a tuple containing the odometer readings from all of the vehicles in our dataset. Create a line plot of the odometer readings.

Well, that plot doesn’t seem too informative. Let’s first sort the values in our tuple:

my_values.sort()

What happened? A tuple is immutable. What this means is that once the contents of a tuple are declared they cannot be modified. For example:

# This will fail because tuples are immutable
my_values[0] = 100

You can read a good article about this here. In addition, here is a great post that gives you an idea when using a tuple might be a good idea. Okay, so let’s go back to our problem. We know that lists are mutable (and therefore sortable), so convert my_values to a list and then sort, and re-plot.
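The behavior described above can be seen on a small made-up tuple. This is only a sketch on toy values (`readings` is a hypothetical name), showing both failures and two ways to get sorted values out of a tuple.

```python
readings = (30, 10, 20)  # a small invented tuple

try:
    readings[0] = 100    # item assignment is not supported on tuples
except TypeError as err:
    print("TypeError:", err)

try:
    readings.sort()      # tuples have no in-place sort method
except AttributeError as err:
    print("AttributeError:", err)

as_list = list(readings)        # copy the values into a mutable list
as_list.sort()                  # lists can be sorted in place

also_sorted = sorted(readings)  # sorted() takes any iterable and returns a new list

print(as_list, also_sorted, readings)  # the original tuple is unchanged
```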

It looks like there are some (potential) outliers that are making our plot look a little wonky. For the sake of seeing how the plot would look, use negative indexing to plot the sorted values minus the last 50 values (the 50 highest values). The new plot may not look that different; that is okay.
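As a sketch of the negative-index idea on toy numbers (dropping the 2 largest values here rather than 50), a slice with a negative stop keeps everything except the tail of a sorted list:

```python
values = sorted([5, 3, 900, 1, 4, 1000, 2])  # ascending order

# A negative stop index counts from the end, so this drops the last 2 items,
# which are the 2 largest values because the list is sorted.
trimmed = values[:-2]

print(values)
print(trimmed)
```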

To prevent plotting values on the same plot, close your plot with the close method, for example:

import matplotlib.pyplot as plt
my_values = [1,2,3,4,5]
plt.plot(my_values)
plt.show()
plt.close()

Items to submit
  • Python code used to solve the problem.

  • The output from running your code.

Question 5

We’ve covered a lot in this project! Use what you’ve learned so far to do one (or more) of the following tasks:

  • Create a cool graphic using matplotlib, that summarizes some data from our dataset.

  • Use pandas and your investigative skills to sift through the dataset and glean an interesting factoid.

  • Create some commented coding examples that highlight the differences between lists and tuples. Include at least 3 examples.

Items to submit
  • Python code used to solve the problem.

  • The output from running your code.