# STAT 19000: Project 11 — Spring 2021

Motivation: We’ve had a pretty intense series of projects recently, and, although you may not have digested everything fully, you may be surprised at how far you’ve come! What better way to realize this than to take a look at some familiar questions that you’ve already solved in R, and solve them in Python instead? You will (a) have the R solutions on hand to compare and contrast with what you come up with in Python, and (b) be able to fill in any gaps you find along the way.

Context: We’ve just finished a two-project series where we built a beer recommendation system using Python. In this project, we are going to take a (hopefully restful) step back and tackle some familiar data wrangling tasks, but in Python instead of R.

Scope: python, r

Learning objectives
• Use numpy, scipy, and pandas to solve a variety of data-driven problems.

• Demonstrate the ability to read and write data of various formats using various packages.

Make sure to read about, and use, the template found here, and read the important information about project submissions here.

## Dataset

The following questions will use the dataset found in Scholar:

`/class/datamine/data/fars`

## Questions

### Question 1

The `fars` dataset contains a series of folders labeled by year. In each year folder there are (at least) the files `ACCIDENT.CSV`, `PERSON.CSV`, and `VEHICLE.CSV`. If you take a peek at any `ACCIDENT.CSV` file in any year, you’ll notice that the column `YEAR` only contains the last two digits of the year. Your goal is a `YEAR` column that contains the full four-digit year. Use the `pd.concat` function to create a DataFrame called `accidents` that combines the `ACCIDENT.CSV` files from the years 1975 through 1981 (inclusive) into one big dataset. Either before or after creating the `accidents` DataFrame, change the values in the `YEAR` column from two digits to four digits (i.e., paste a 19 onto the front of each year value).

One way to append a string to every value in a column is to first convert the column to `str` using `astype` and then use the `+` operator, like normal: `myDF["myCol"].astype(str) + "appending_this_string"`. To prepend instead, put the string on the left side of the `+`.

Items to submit
• Python code used to solve the problem.

• `head` of the `accidents` dataframe.
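
The read, concatenate, and fix-the-year pattern described above can be sketched on toy data. This is only an illustration, not the expected solution: the two in-memory "files" and the columns `ST_CASE` and `PERSONS` used here are stand-ins, and the real `ACCIDENT.CSV` files have many more columns.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two yearly ACCIDENT.CSV files (toy contents).
toy_files = {
    1975: "ST_CASE,YEAR,PERSONS\n10001,75,2\n10002,75,1\n",
    1976: "ST_CASE,YEAR,PERSONS\n10001,76,3\n",
}

# Read each "file" into its own DataFrame, then stack them row-wise.
frames = [pd.read_csv(io.StringIO(text)) for text in toy_files.values()]
accidents = pd.concat(frames, ignore_index=True)

# Prepend "19" to the two-digit year, then convert back to an integer.
accidents["YEAR"] = ("19" + accidents["YEAR"].astype(str)).astype(int)

print(accidents.head())
```

With the real data, the `io.StringIO(text)` argument would presumably be replaced by a path built from the year, such as `f"/class/datamine/data/fars/{year}/ACCIDENT.CSV"`, looping over `range(1975, 1982)`.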

### Question 2

Using the new `accidents` data frame that you created in (1), how many accidents involved both 1 or more drunk drivers and a school bus?

Look at the variables `DRUNK_DR` and `SCH_BUS`.

Items to submit
• Python code used to solve the problem.

• Output from running your code.
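
Counting rows that satisfy two conditions at once is usually done with a boolean mask. The sketch below uses toy data and assumes, purely for illustration, that `SCH_BUS` is coded as 1 when a school bus was involved and 0 otherwise; check the actual coding in the data before relying on it.

```python
import pandas as pd

# Toy data; the 0/1 coding of SCH_BUS is an assumption for this sketch.
toy = pd.DataFrame({
    "DRUNK_DR": [0, 1, 2, 0, 1],
    "SCH_BUS":  [1, 1, 0, 0, 1],
})

# Combine the two conditions with & (each condition needs its own parentheses).
mask = (toy["DRUNK_DR"] >= 1) & (toy["SCH_BUS"] == 1)

# Summing a boolean Series counts the True values.
n_matches = int(mask.sum())
print(n_matches)  # 2
```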

### Question 3

Again using the `accidents` data frame: For accidents involving 1 or more drunk drivers and a school bus, how many happened in each of the 7 years? Which year had the largest number of these types of accidents?

Does the `groupby` method seem familiar to you? It should! It is extremely similar to `tapply` in R. Typically, functions that behave like `tapply` are called something like "groupby"; R is the oddball this time.

Items to submit
• Python code used to solve the problem.

• Output from running your code.
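
As a rough illustration of the `groupby` idea mentioned above, the following sketch filters a toy data frame and counts the remaining rows per year. The data and the 0/1 coding of `SCH_BUS` are assumptions made for the example.

```python
import pandas as pd

# Toy data; values and the 0/1 coding of SCH_BUS are assumptions.
toy = pd.DataFrame({
    "YEAR":     [1975, 1975, 1976, 1976, 1976, 1977],
    "DRUNK_DR": [1, 0, 1, 2, 1, 1],
    "SCH_BUS":  [1, 1, 1, 1, 0, 1],
})

# Keep only the rows of interest, then count rows within each year.
subset = toy[(toy["DRUNK_DR"] >= 1) & (toy["SCH_BUS"] == 1)]
per_year = subset.groupby("YEAR").size()

# idxmax returns the index label (here, the year) of the largest count.
busiest_year = per_year.idxmax()

print(per_year)
print(busiest_year)  # 1976
```

This plays roughly the role of `tapply(..., length)` in R, with `size()` doing the counting.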

### Question 4

Again using the `accidents` data frame: For each value of `i` from 0 through 6, calculate the mean number of motorists (column `PERSONS`) involved in accidents with `i` drunk drivers (column `DRUNK_DR`).

Items to submit
• Python code used to solve the problem.

• Output from running your code.
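
A grouped mean follows the same pattern as a grouped count. The sketch below uses toy values and, as one possible design choice, reindexes the result so that every `i` from 0 through 6 appears even when the toy data has no accidents for that value.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "DRUNK_DR": [0, 0, 1, 1, 2],
    "PERSONS":  [1, 3, 2, 4, 5],
})

# Mean of PERSONS within each DRUNK_DR group.
mean_persons = toy.groupby("DRUNK_DR")["PERSONS"].mean()

# Show every i in 0..6; groups absent from the data become NaN.
mean_persons = mean_persons.reindex(range(7))

print(mean_persons)
```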

### Question 5

Break the day into portions, as follows: midnight to 6AM, 6AM to 12 noon, 12 noon to 6PM, 6PM to midnight, other. Find the total number of fatalities that occur during each of these time intervals. Also, find the average number of fatalities per crash that occurs during each of these time intervals.

You’ll want to pay special attention to the `include_lowest` option of `pandas.cut` (similar to the `include.lowest` option of R’s `cut`).

Items to submit
• Python code used to solve the problem.

• Output from running your code.
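
To see why `include_lowest` matters, here is a small sketch with toy data. The column names `HOUR` and `FATALS`, the use of 99 as an "unknown hour" code, and the bin edges are all assumptions made for illustration; verify them against the actual data.

```python
import pandas as pd

# Toy data; HOUR, FATALS, and the 99 "unknown" code are assumptions.
toy = pd.DataFrame({
    "HOUR":   [0, 3, 6, 11, 12, 17, 18, 23, 99],
    "FATALS": [1, 2, 1, 1, 3, 1, 1, 2, 1],
})

# With integer hours, these edges give [0, 5], (5, 11], (11, 17], (17, 23].
bins = [0, 5, 11, 17, 23]
labels = ["midnight-6AM", "6AM-noon", "noon-6PM", "6PM-midnight"]

# Without include_lowest, hour 0 falls outside (0, 5] and becomes NaN.
without_lowest = pd.cut(toy["HOUR"], bins=bins, labels=labels)

# With include_lowest=True, the first bin also contains its left edge.
portion = pd.cut(toy["HOUR"], bins=bins, labels=labels, include_lowest=True)

# Anything that fell outside every bin goes into an explicit "other" category.
portion = portion.cat.add_categories(["other"]).fillna("other")

# Total and average fatalities per crash within each portion of the day.
summary = toy.groupby(portion, observed=False)["FATALS"].agg(["sum", "mean"])
print(summary)
```

In this toy example, `without_lowest` has two missing values (hours 0 and 99), while `portion` sends only the 99 to "other".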