STAT 19000: Project 5 — Spring 2022

Motivation: We will pause in our series of pandas and numpy projects to learn one of the most important parts of writing programs — functions! Functions allow us to reuse snippets of code effectively. Functions are a great way to reduce the repetition of code and also keep the code organized and readable.

Context: We are focusing on learning about writing functions in Python.

Scope: python, functions, pandas, matplotlib

Learning Objectives
  • Understand what a function is.

  • Understand the components of a function in Python.

  • Differentiate between positional and keyword arguments.

The following questions will use the following dataset(s):

  • /depot/datamine/data/whin/190/stations.csv

  • /depot/datamine/data/whin/190/observations.csv


We are very lucky to have great partners in the Wabash Heartland Innovation Network (WHIN)! They generously provide us with access to their API (here) for educational purposes. You’ve most likely either used their API in a previous project, or you’ve worked with a sample of their data to solve some sort of data-driven problem.

In this project, we will be using a slightly modified sample of their dataset to learn more about how to write functions.

Question 1

First, read both datasets into variables named stations and obs. Next, take a look at the head of both dataframes. You will notice that the station_id column in the obs dataframe appears to correspond to the id column in the stations dataframe. This is a fairly common occurrence when data has been normalized for a database. For this project, we will work with a single merged dataset.

pandas has a merge method that can be used to join two dataframes based on a common column. Here the id column from the stations dataframe matches the station_id column in the obs dataframe. Here is the explanation on merge.

Use left_on to specify the name of the column in the "left" dataframe. Use right_on to specify the name of the column in the "right" dataframe. Make the "left" dataframe be obs. Use the value "left" for the how argument to specify a left join.

Once merged, you will notice that, in the new dataframe dat, the id column from the obs dataframe is now labeled id_x, and the id column from the stations dataframe is now labeled id_y.

Use the pandas drop method to remove the id_y column. Use the pandas rename method to rename id_x to id, and name to station_name.

Great! We have cleaned up our dataframe so it is easier to work with, while learning a variety of useful pandas methods.
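The merge, drop, and rename steps above can be sketched on two tiny stand-in dataframes. This is only an illustration: the rows and the temp column are made up, and only the id, station_id, and name columns are taken from the description above.

```python
import pandas as pd

# Hypothetical miniature stand-ins for stations.csv and observations.csv.
stations = pd.DataFrame({"id": [1, 2], "name": ["North Farm", "South Farm"]})
obs = pd.DataFrame({
    "id": ["obs_a", "obs_b", "obs_c"],
    "station_id": [1, 2, 1],
    "temp": [20.5, 21.0, 19.8],  # made-up measurement column
})

# Left join: keep every observation and attach its matching station row.
dat = obs.merge(stations, how="left", left_on="station_id", right_on="id")

# Both inputs had an "id" column, so pandas labels them id_x and id_y.
dat = dat.drop(columns=["id_y"])
dat = dat.rename(columns={"id_x": "id", "name": "station_name"})

print(dat.columns.tolist())
```

Note that how, left_on, and right_on are passed as keyword arguments, so their order in the call does not matter.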

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

When looking at the new dataset, you may have noticed a mix of letters and numbers in the id column. Below are a few samples of the contents of that column.

id sample

The mix of numbers and letters in this column is a variation on ksuid, a K-sortable globally unique id. Ksuids are useful because they are sortable by time and are unique identifiers (there is a minimal chance that any two ids would be the same). If you are interested, you can read more here.

Next, write a function called get_datetime that accepts a ksuid (as a string) and returns the datetime.

You can use the parse method to decode a ksuid.

from cyksuid import ksuid

mydatetime = ksuid.parse('1NnyYGMtAHBFDYWOBlsDlqppzVI').datetime

Don’t forget to remove the "obs_" from the beginning of the ksuid.
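Removing the prefix is a good candidate for a small function of its own. Below is a minimal sketch, assuming the prefix is always at the very start of the id. The name strip_prefix and the "sta_" example are hypothetical and are not part of the dataset or of cyksuid.

```python
def strip_prefix(obs_id, prefix="obs_"):
    """Return obs_id without the leading prefix, if the prefix is present."""
    if obs_id.startswith(prefix):
        return obs_id[len(prefix):]
    return obs_id


# Positional argument; prefix falls back to its default value "obs_".
print(strip_prefix("obs_1NnyYGMtAHBFDYWOBlsDlqppzVI"))

# Keyword arguments; "sta_" is a made-up prefix used only for illustration.
print(strip_prefix(obs_id="sta_abc", prefix="sta_"))
```

The returned string is what you would then hand to ksuid.parse inside get_datetime.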

The following code should result in the following output.

# ksuids is a list of id values (the list itself is not shown here)
for k in ksuids:
    print(get_datetime(k))

2019-07-10 04:00:00
2019-07-10 04:15:00
2019-07-11 04:00:00
2019-07-11 04:15:00
2019-07-11 04:30:00

To verify the ordering claim (that is, that sorting the ksuids puts the observations in chronological order), first use the sample method to get 10 random id values from the dat dataframe.

Next, sort the values, loop through the sorted list, and use your get_datetime function to print each datetime.

Can you confirm that sorting the ksuids automatically sorts the observations by datetime?
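If you would rather check the ordering with code than by eye, one possible sketch is shown below. The function is_chronological and the fake_times lookup are hypothetical; the lookup stands in for your get_datetime function so that the example runs without cyksuid.

```python
from datetime import datetime


def is_chronological(ids, to_datetime):
    """Return True if sorting the ids as strings also orders their datetimes."""
    ordered = [to_datetime(i) for i in sorted(ids)]
    return all(earlier <= later for earlier, later in zip(ordered, ordered[1:]))


# Made-up ids and datetimes, used only to exercise the function.
fake_times = {
    "obs_a1": datetime(2019, 7, 10, 4, 0),
    "obs_a2": datetime(2019, 7, 10, 4, 15),
    "obs_b1": datetime(2019, 7, 11, 4, 0),
}

print(is_chronological(["obs_b1", "obs_a2", "obs_a1"], fake_times.get))  # True
```

With the real data, you would pass the sampled id values as the first argument and your own get_datetime as the second.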

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

In this dataset we are given latitude and longitude values in degrees. We want to convert the degrees to radians. Write a function called degrees_to_radians that accepts a latitude or longitude value in degrees, and returns the same value in radians.

The formula is shown below. Note that arctan2(0, -1) evaluates to $\pi$.

$\text{degrees} \cdot \operatorname{arctan2}(0, -1) / 180$

numpy has all of the needed functions for this!

import numpy as np


Make sure to convert your result from a pandas Series to a float.
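One way the conversion could look is sketched below. It assumes the function receives either a plain number or a pandas Series holding exactly one value; the use of .item() to pull out that value is one choice among several.

```python
import numpy as np
import pandas as pd


def degrees_to_radians(value):
    """Convert one latitude or longitude value from degrees to radians."""
    # .item() extracts the single value from a one-element Series (or a plain
    # number) and raises an error if more than one value is passed in.
    degrees = np.asarray(value, dtype=float).item()
    # np.arctan2(0, -1) evaluates to pi.
    return float(degrees * np.arctan2(0, -1) / 180)


print(degrees_to_radians(180))                # approximately pi
print(degrees_to_radians(pd.Series([90.0])))  # approximately pi / 2
```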

To test out your function you can use:

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Write a function called get_distance that accepts two pandas Series, each containing a latitude and a longitude value, and returns the distance between the two points in kilometers.

You can do this by using the Haversine formula.

$2*r*arcsin(\sqrt{sin^2(\frac{\phi_2 - \phi_1}{2}) + cos(\phi_1)*cos(\phi_2)*sin^2(\frac{\lambda_2 - \lambda_1}{2})})$


  • $r$ is the radius of the Earth; we can use 6367.4447 kilometers

  • $\phi_1$ and $\phi_2$ are the latitude coordinates of the two points

  • $\lambda_1$ and $\lambda_2$ are the longitude coordinates of the two points

In the formula above, the latitudes and longitudes need to be converted from degrees to radians. Your function from Question 3 will be perfect for this!

You can even put your degrees_to_radians function inside the get_distance function. A function defined within another function is called a "nested" function, and a function that exists to support another function is often called a "helper" function. If you have code that will be used multiple times, it is beneficial to create a helper function.

It is common practice in the Python world to add an underscore as a prefix to helper functions. It is a sign that the function is just for "internal" use and should largely be ignored by the user. Follow this practice and prefix your degrees_to_radians function with an underscore.

numpy has all of the needed functions for this!

import numpy as np
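Putting the Haversine formula and the nested helper together might look like the sketch below. It assumes each Series is a single row with entries labeled latitude and longitude; those labels and the two test points are assumptions for illustration, not values from the WHIN data.

```python
import numpy as np
import pandas as pd


def get_distance(point1, point2, r=6367.4447):
    """Haversine distance in kilometers between two latitude/longitude points."""

    def _degrees_to_radians(degrees):
        # Nested helper; np.arctan2(0, -1) evaluates to pi.
        return float(degrees) * np.arctan2(0, -1) / 180

    phi1 = _degrees_to_radians(point1["latitude"])
    phi2 = _degrees_to_radians(point2["latitude"])
    lam1 = _degrees_to_radians(point1["longitude"])
    lam2 = _degrees_to_radians(point2["longitude"])

    # The expression under the square root in the Haversine formula.
    inner = (np.sin((phi2 - phi1) / 2) ** 2
             + np.cos(phi1) * np.cos(phi2) * np.sin((lam2 - lam1) / 2) ** 2)
    return float(2 * r * np.arcsin(np.sqrt(inner)))


# Hypothetical points on the equator, a quarter of the way around the Earth.
a = pd.Series({"latitude": 0.0, "longitude": 0.0})
b = pd.Series({"latitude": 0.0, "longitude": 90.0})
print(get_distance(a, b))
```

Because r has a default value, it can be left out of the call or overridden with a keyword argument, for example get_distance(a, b, r=6371).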


Test your function on the 2 rows with the following id values.

id sample
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Great! Make sure to note these solutions for future use.

Next, write a function called plot_stations. plot_stations should accept a dataset as an argument and produce a plot with the station locations plotted on a map.

For consistency, we will use plotly to produce the plot. This stackoverflow post shows some examples. For further understanding, here is the explanation of the function.

We want to be careful not to plot the same point over and over. To avoid that, reduce the dataset (inside the function) so that each pair of latitude and longitude values is plotted only once.

Set hover_name to "station_id" so that hovering over a point will display the station id.

Set scope to "usa" to reduce the map to the USA. Be sure to zoom in on the map so you can see the stations within Indiana!

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

