STAT 19000: Project 3 — Spring 2021

Motivation: A dictionary (referred to as a dict) is one of the most useful data structures in Python. You can think about them as a data structure containing key: value pairs. Under the hood, a dict is essentially a data structure called a hash table. Hash tables are a data structure with a useful set of properties. The time needed for searching, inserting, or removing a piece of data has a constant average lookup time, meaning that no matter how big your hash table grows to be, inserting, searching, or deleting a piece of data will usually take about the same amount of time. (The worst case time increases linearly.) Dictionaries (dict) are used a lot, so it is worthwhile to understand them. Although not used quite as often, another important data type called a set, is also worthwhile learning about.

Dictionaries, often referred to as dicts, are really powerful. There are two primary ways to "get" information from a dict. One is to use the get method, the other is to use square brackets and strings. Test out the following to understand the differences between the two:

my_dict = {"fruits": ["apple", "orange", "pear"], "person": "John", "vegetables": ["carrots", "peas"]}
# If "person" is indeed a key, they will function the same way
my_dict["person"]
my_dict.get("person")
# If the key does not exist, like below, they will not
# function the same way.
my_dict.get("height") # Returns None when key doesn't exist
print(my_dict.get("height")) # By printing, we can see None in this case
my_dict["height"] # Throws a KeyError exception because the key, "height" doesn't exist

Context: In our third project, we introduce some basic data types, and we demonstrate some familiar control flow concepts, all while digging right into a dataset. Throughout the course, we will slowly introduce concepts from pandas, and popular plotting packages.

Scope: dicts, sets, if/else statements, opening files, tuples, lists

Learning objectives
  • Explain what is a dict is and why it is useful.

  • Understand how a set works and when it could be useful.

  • List the differences between lists & tuples and when to use each.

  • Gain familiarity with string methods, list methods, and tuple methods.

  • Gain familiarity with dict methods.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/craigslist/vehicles.csv

Questions

Question 1

In project 2 we learned how to read in data using pandas. Read in the (/class/datamine/data/craigslist/vehicles.csv) dataset into a DataFrame called myDF using pandas. In R we can get a sneak peek at the data by doing something like:

head(myDF) # where myDF is a data.frame

There is a very similar (and aptly named method) in pandas that allows us to do the exact same thing with a pandas DataFrame. Get the head of myDF, and take a moment to consider how much time it would take to get this information if we didn’t have this nice head method.

Items to submit
  • Python code used to solve the problem.

  • The head of the dataset.

Question 2

Dictionaries, often referred to as dicts, are really powerful. There are two primary ways to "get" information from a dict. One is to use the get method, the other is to use square brackets and strings. Test out the following to understand the differences between the two:

my_dict = {"fruits": ["apple", "orange", "pear"], "person": "John", "vegetables": ["carrots", "peas"]}
# If "person" is indeed a key, they will function the same way
my_dict["person"]
my_dict.get("person")
# If the key does not exist, like below, they will not
# function the same way.
my_dict.get("height") # Returns None when key doesn't exist
print(my_dict.get("height")) # By printing, we can see None in this case
my_dict["height"] # Throws a KeyError exception because the key, "height" doesn't exist

Look at the dataset. Create a dict called my_dict that contains key:value pairs where the keys are years, and the values are a single int representing the number of vehicles from that year on craigslist. Use the year column, a loop, and a dict to accomplish this. Print the dictionary. You can use the following code to extract the year column as a list. In the next project we will learn how to loop over pandas DataFrames.

If you get a KeyError, remember, you must declare each key value pair just like any other variable. Use the following code to initialize each year key to the value 0.

myyears = myDF['year'].dropna().to_list()
# get a list containing each unique year
unique_years = list(set(myyears))
# for each year (key), initialize the value (value) to 0
my_dict = {}
for year in unique_years:
    my_dict[year] = 0

Here are some of the results you should get:

print(my_dict[1912]) # 5
print(my_dict[1982]) # 185
print(my_dict[2014]) # 31703

There is a special kind of dict called a defaultdict, that allows you to give default values to a dict, giving you the ability to "skip" initialization. We will show you this when we release the solutions to this project! It is not required, but it is interesting to know about!

Items to submit
  • Python code used to solve the problem.

  • my_dict printed.

Question 3

After completing question (2) you can easily access the number of vehicles from a given year. For example, to get the number of vehicles on craigslist from 1912, just run:

my_dict[1912]
# or
my_dict.get(1912)

A dict stores its data in key:value pairs. Identify a "key" from my_dict, as well as the associated "value". As you can imagine, having data in this format can be very beneficial. One benefit is the ability to easily create a graphic using matplotlib. Use matplotlib to create a bar graph with the year on the x-axis, and the number of vehicles from that year on the y-axis.

If when you end up seeing something like <BarContainer object of X artists>, you should probably end the code chunk with plt.show() instead. What is happening is Python is trying to print the plot object. That text is the result. To instead display the plot you need to call plt.show().

To use matplotlib, first import it:

import matplotlib.pyplot as plt
# now you can use it, for example
plt.plot([1,2,3,1])
plt.show()
plt.close()

The keys method and values method from dict could be useful here.

Items to submit
  • Python code used to solve the problem.

  • The resulting plot.

  • A sentence giving an example of a "key" and associated "value" from my_dict (e.g., a sentence explaining the 1912 example above).

Question 4

In the hint in question (2), we used a set to quickly get a list of unique years in a list. Some other common uses of a set are when you want to get a list of values that are in one list but not another, or get a list of values that are present in both lists. Examine the following code. You’ll notice that we are looping over many values. Replace the code for each of the three examples below with code that uses no loops whatsoever.

listA = [1, 2, 3, 4, 5, 6, 12, 12]
listB = [2, 1, 7, 7, 7, 2, 8, 9, 10, 11, 12, 13]
# 1. values in list A but not list B
# values in list A but not list B
onlyA = []
for valA in listA:
    if valA not in listB and valA not in onlyA:
        onlyA.append(valA)
print(onlyA) # [3, 4, 5, 6]
# 2. values in listB but not list A
onlyB = []
for valB in listB:
    if valB not in listA and valB not in onlyB:
        onlyB.append(valB)
print(onlyB) # [7, 8, 9, 10, 11, 13]
# 3. values in both lists
# values in both lists
in_both_lists = []
for valA in listA:
    if valA in listB and valA not in in_both_lists:
        in_both_lists.append(valA)
print(in_both_lists) # [1,2,12]

You should use a set.

In addition to being easier to read, using a set is much faster than loops!

A set is a group of values that are unordered, unchangeable, and no duplicate values are allowed. While they aren’t used a lot, they can be useful for a few common tasks like: removing duplicate values efficiently, efficiently finding values in one group of values that are not in another group of values, etc.

Items to submit
  • Python code used to solve the problem.

  • The output from running the code.

Question 5

The value of a dictionary does not have to be a single value (like we’ve shown so far). It can be anything. Observe that there is latitude and longitude data for each row in our DataFrame (lat and long, respectively). Wouldn’t it be useful to be able to quickly "get" pairs of latitude and longitude data for a given state?

First, run the following code to get a list of tuples where the first value is the state, the second value is the lat, and the third value is the long.

states_list = list(myDF.loc[:, ["state", "lat", "long"]].dropna().to_records(index=False))
states_list[0:3] # [('az', 34.4554, -114.269), ('or', 46.1837, -123.824), ('sc', 34.9352, -81.9654)]
# to get the first tuple
states_list[0] # ('az', 34.4554, -114.269)
# to get the first value in the first tuple
states_list[0][0] # az
# to get the second tuple
states_list[1] # ('or', 46.1837, -123.824)
# to get the first value in the second tuple
states_list[1][0] # or

If you have an issue where you cannot append values to a specific key, make sure to first initialize the specific key to an empty list so the append method is available to use.

Now, organize the latitude and longitude data in a dictionary called geoDict such that each state from the state column is a key, and the respective value is a list of tuples, where the first value in each tuple is the latitude (lat) and the second value is the longitude (long). For example, the first 2 (lat,long) pairs in Indiana ("in") are:

geoDict.get("in")[0:2] # [(39.0295, -86.8675), (38.8585, -86.4806)]
len(geoDict.get("in")) # 5687

Now that you can easily access latitude and longitude pairs for a given state, run the following code to plot the points for Texas (the state value is "tx"). Include the the graphic produced below in your solution, but feel free to experiment with other states.

You do NOT need to include this portion of Question 5 in your Markdown .Rmd file. We cannot get this portion to build in Markdown, but please do include it in your Python .py file.

from shapely.geometry import Point
import geopandas as gpd
from geopandas import GeoDataFrame
usa = gpd.read_file('/anvil/projects/tdm/data/boundaries/cb_2018_us_state_20m.shp')
usa.crs = {'init': 'epsg:4269'}
pts = [Point(y,x) for x, y in geoDict.get("tx")]
gdf = gpd.GeoDataFrame(geometry=pts, crs = 4269)
fig, gax = plt.subplots(1, figsize=(10,10))
base = usa[usa['NAME'].isin(['Hawaii', 'Alaska', 'Puerto Rico']) == False].plot(ax=gax, color='white', edgecolor='black')
gdf.plot(ax=base, color='darkred', marker="*", markersize=10)
plt.show()
plt.close()
# to save to jpg:
plt.savefig('q5.jpg')
Items to submit
  • Python code used to solve the problem.

  • Graphic file (q5.jpg) produced for the given state.

Question 6

Use your new skills to extract some sort of information from our dataset and create a graphic. This can be as simple or complicated as you are comfortable with!

Items to submit
  • Python code used to solve the problem.

  • The graphic produced using the code.