STAT 19000: Project 6 — Spring 2022

Motivation: We will pause in our series of pandas and numpy projects to learn one of the most important parts of writing programs — functions! Functions allow us to reuse snippets of code effectively. Functions are a great way to reduce the repetition of code and also keep the code organized and readable.

Context: We are focusing on learning about writing functions in Python.

Scope: python, functions, pandas, matplotlib

Learning Objectives
  • Understand what a function is.

  • Understand the components of a function in python.

  • Differentiate between positional and keyword arguments.

Make sure to read about, and use the template found here, and the important information about projects submissions here.


The following questions will use the following dataset(s):

  • /depot/datamine/data/whin/190/combined.csv

  • /depot/datamine/data/flights/subset/*.csv


Question 1

When submitting your .ipynb file for this project, if the .ipynb file doesn’t render in Gradescope, please export the notebook as a PDF and submit that as well — you will be helping the graders a lot!

In project 5, you read in two separate, but related datasets, and used the pandas merge method to combine them. In this project, we’ve provided you with a combined dataset already, called combined.csv.

Read in combined.csv into a data frame called dat.

Your friend shared the following code with you.

import as px

def plot_stations(df, *ids):
    df = df.groupby("station_id").head(1).loc[df['station_id'].isin(ids), ('station_id', 'latitude', 'longitude')]
    fig = px.scatter_geo(df, lat="latitude", lon="longitude", scope="usa",
    fig.update_layout(geo = dict(projection_scale=7, center=dict(lat=df['latitude'].iloc[0], lon=df['longitude'].iloc[0])))"jpg")

In order for your plotly maps to show up properly in Gradescope, you must use the renderer="jpg" option. In addition, this removes the ability to zoom in on the map. The update_layout method can be safely copied and pasted into all following functions you write so that the images that are rendered are zoomed in. This is critical so graders can properly see your work — feel free to post questions in Piazza if you have questions.

Please do the following:

  • Give a 1-2 sentence explanation of what this function does.

  • Use the function to plot 2 or more stations, BUT, use the function in two different ways (with the same result). Use tuple unpacking in 1 call of the plot_stations function, and do not in the other.

I would highly recommend taking the time to read through the entire article here. It is a very detailed article going through all the things you can do with functions in Python. The section on tuple packing and tuple unpacking may be particularly useful to you!

The documentation for the scatter_geo function can be found here.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

In project 5, question 5, you wrote a function called plot_stations that given a data frame, would plot the locations of the stations on the map.

Modify your plot_stations function so that it has an argument called weighted that defaults to False. If weighted is True, then the stations will be plotted with a size proportional to the number observations at each station.

plot_stations(df) # plots all stations same size
plot_stations(df, weighted=False) # plots all stations same size
plot_stations(df, weighted=True) # plots all stations with size proportional to the number of observations for the station

You can find the documentation on the scatter_geo function here.

This section will review default parameters.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

There are many columns in our data frame with numeric data. Some examples are: temperature_high, temperature_low, barometric_pressure, wind_speed_high, etc. Wouldn’t it be (kind of) cool to have an option in our plot_stations function that would weight the size of the points on the map based on those values instead of the number of observations?

Modify the function so that it has another argument called weight_by that defaults to None. If weight_by is None (and weighted is True), the points on the plot should be sized by number of observations (like in question 2). Otherwise, weight_by can accept a string with the name of the column to base the point sizes on. For example: plot_stations(dat, weighted=True, weight_by="temperature_high" would create a plot where the size of the points are based on the median value of temperature_high by station.

Please note, if weighted is False, then points should not be weighted regardless of the value of weight_by.

Of course, not all of the columns in our dataset are appropriate to weight by. Please demonstrate your function works by running the following calls to plot_stations.

plot_stations(dat, weighted=True, weight_by="temperature_high")
plot_stations(dat, weighted=True, weight_by="temperature_low")
plot_stations(dat, weighted=True, weight_by="wind_speed_high")
plot_stations(dat, weighted=False, weight_by="barometric_pressure")
plot_stations(dat, weighted=True, weight_by=None)

The wind_speed_high plot will have the most pronounced differences in size, but still rather small.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

You’ve learned a lot about plotting maps in plotly, the groupby method (most likely), and hopefully functions as well!

Check out all of the datasets in the /depot/datamine/data/flights/subset directory. Write a function that creates any new plot using some or all of the data in the subset directory. The plots could be maps, other plots, anything you want! The goal should be to make the function useful for exploring flight data in the provided format. Take advantage of the tuple packing and unpacking, default arguments, etc. You could even have a function inside another function (a helper function). Do you best to challenge yourself and have fun. Any solid effort will receive full credit.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5 (optional, 0 pts)

Write a function that accepts the WHIN weather dataset (as a data frame), and an argument n. This function should plot the largest n distances between stations on a map. See here for examples of plotting lines on a map.

If you are feeling very adventurous, there is a data structure called a kdtree that you can use to very efficiently find the n closest or furthest points, however, this is probably not necessary as there are not that many distances to calculate for this dataset.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.