STAT 19000: Project 11 — Spring 2021

Motivation: We’ve had a pretty intense series of projects recently, and, although you may not have digested everything fully, you may be surprised at how far you’ve come! What better way to realize this than to take a look at some familiar questions that you’ve solved in the past in R, and solve them in Python instead? You will (a) have the solutions in R to compare and contrast with what you come up with in Python, and (b) be able to fill in any gaps you find along the way.

Context: We’ve just finished a two-project series where we built a beer recommendation system using Python. In this project, we are going to take a (hopefully restful) step back and tackle some familiar data wrangling tasks, but in Python instead of R.

Scope: python, r

Learning objectives
  • Use numpy, scipy, and pandas to solve a variety of data-driven problems.

  • Demonstrate the ability to read and write data of various formats using various packages.

Make sure to read about and use the template found here, and read the important information about project submissions here.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/fars

Questions

Question 1

The fars dataset contains a series of folders labeled by year. Each year folder contains (at least) the files ACCIDENT.CSV, PERSON.CSV, and VEHICLE.CSV. If you take a peek at any ACCIDENT.CSV file in any year, you’ll notice that the column YEAR only contains the last two digits of the year. Your goal is a YEAR column that contains the full four-digit year. Use the pd.concat function to create a DataFrame called accidents that combines the ACCIDENT.CSV files from the years 1975 through 1981 (inclusive) into one big dataset. After (or before) creating that accidents DataFrame, change the values in the YEAR column from two digits to four digits (i.e., paste a 19 onto the front of each year value).

One way to append a string to every value in a column is to first convert the column to str using astype and then use the + operator, as usual (to prepend instead, put the string on the left side of the +):

myDF["myCol"].astype(str) + "appending_this_string"
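To make the hint concrete, here is a minimal sketch on toy data rather than the real files. The two small DataFrames stand in for yearly files you would normally load with pd.read_csv, and the column name case_id is a made-up placeholder, not a column from the dataset.

```python
import pandas as pd

# Toy stand-ins for two yearly files (hypothetical values and column names).
frames = [
    pd.DataFrame({"case_id": [1, 2], "YEAR": [75, 75]}),
    pd.DataFrame({"case_id": [1, 2, 3], "YEAR": [76, 76, 76]}),
]

# Stack the frames on top of each other and renumber the rows from 0.
combined = pd.concat(frames, ignore_index=True)

# Prepend "19" to the two-digit year, then convert back to an integer.
combined["YEAR"] = ("19" + combined["YEAR"].astype(str)).astype(int)

print(combined)
```

Converting back to int is optional; it just keeps YEAR numeric so it sorts and compares like a number.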
Items to submit
  • Python code used to solve the problem.

  • head of the accidents dataframe.

Question 2

Using the new accidents data frame that you created in (1), how many accidents involved 1 or more drunk drivers and a school bus?

Look at the variables DRUNK_DR and SCH_BUS.
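One common way to count rows that satisfy two conditions at once is to combine boolean masks with &. The sketch below uses a tiny toy DataFrame, and it assumes (for the toy data only) that a value of 1 in SCH_BUS marks a school bus; check how the real column is coded before relying on that.

```python
import pandas as pd

# Toy data only; the coding of SCH_BUS here is an assumption.
toy = pd.DataFrame({
    "DRUNK_DR": [0, 1, 2, 0, 1],
    "SCH_BUS": [0, 1, 0, 1, 1],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than >= and ==.
mask = (toy["DRUNK_DR"] >= 1) & (toy["SCH_BUS"] == 1)

# True counts as 1, so summing the mask counts the matching rows.
n_matches = int(mask.sum())
print(n_matches)
```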

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 3

Again using the accidents data frame: For accidents involving 1 or more drunk drivers and a school bus, how many happened in each of the 7 years? Which year had the largest number of these types of accidents?

Does the groupby method seem familiar to you? It should! It is extremely similar to tapply in R. Typically, functions that behave like tapply are called something like "groupby"; R is the oddball this time.
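As a rough illustration of the tapply comparison, here is a sketch on toy data. The column flag is a hypothetical boolean column marking the rows of interest; it is not part of the dataset.

```python
import pandas as pd

# Toy data; "flag" is a hypothetical column marking rows of interest.
toy = pd.DataFrame({
    "YEAR": [1975, 1975, 1976, 1976, 1976, 1977],
    "flag": [True, False, True, True, False, True],
})

# Roughly what tapply(toy$flag, toy$YEAR, sum) would do in R:
# split by YEAR, then sum the flag values within each group.
counts = toy.groupby("YEAR")["flag"].sum()
print(counts)

# idxmax returns the index label (here, the year) of the largest count.
busiest_year = counts.idxmax()
print(busiest_year)
```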

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 4

Again using the accidents data frame: Calculate the mean number of motorists involved in an accident (column PERSONS) with i drunk drivers (column DRUNK_DR), where i takes the values from 0 through 6.
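The same groupby pattern works with mean instead of sum. Here is a minimal sketch on toy data, where the value 99 is a made-up out-of-range entry included only to show the filtering step.

```python
import pandas as pd

# Toy data; the 99 is a hypothetical out-of-range value.
toy = pd.DataFrame({
    "DRUNK_DR": [0, 0, 1, 1, 2, 99],
    "PERSONS": [2, 4, 1, 3, 5, 7],
})

# Keep only rows where DRUNK_DR is between 0 and 6 (inclusive on both ends).
subset = toy[toy["DRUNK_DR"].between(0, 6)]

# Mean of PERSONS within each DRUNK_DR group.
means = subset.groupby("DRUNK_DR")["PERSONS"].mean()
print(means)
```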

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 5

Break the day into portions, as follows: midnight to 6AM, 6AM to 12 noon, 12 noon to 6PM, 6PM to midnight, other. Find the total number of fatalities that occur during each of these time intervals. Also, find the average number of fatalities per crash that occurs during each of these time intervals.

You’ll want to pay special attention to the include_lowest option of pandas.cut (similar to the include.lowest argument of R’s cut).
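To see why include_lowest matters, here is a small sketch with made-up hour values. By default pandas.cut builds intervals that are open on the left and closed on the right, so a value equal to the lowest bin edge falls outside every bin unless include_lowest=True.

```python
import pandas as pd

# Made-up hour values, including one that sits exactly on the lowest edge.
hours = pd.Series([0, 3, 6, 12])
bins = [0, 6, 12]

# Default: intervals are (0, 6] and (6, 12], so the value 0 gets no bin (NaN).
default_cut = pd.cut(hours, bins=bins)

# include_lowest=True makes the first interval include its left edge.
inclusive_cut = pd.cut(hours, bins=bins, include_lowest=True)

print(default_cut)
print(inclusive_cut)
```

Note that with the default right=True, a value on an interior edge (like 6 here) lands in the lower bin. If you want each bin to include its left edge instead, pandas.cut also accepts right=False.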

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.