TDM 10200: Project 11 — Spring 2023

Motivation: Learning how to merge/join dataframes to access more information.

Scope: python, pandas, os

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

'/anvil/projects/tdm/data/fars'

ONE

Go ahead and list what is in /anvil/projects/tdm/data/fars. You do not need to read in all of the files. The fars directory contains years including 1975-2017. Each year also contains at least 3 CSV files. The one that we will be looking at is the ACCIDENTS.CSV

  1. List what files are in the year 1985

  2. Read in the ACCIDENTS.CSV and then go ahead and change the values in the YEAR column from two digits to four digits. For example, we should change 89 to 1989. Do this by adding a 19 to each year value.

  3. Now combine the MONTH, DAY, YEAR columns into a new column called DATE

Helpful Hint (for b)

We can append strings to every value in a column by first converting the column to str using astype then use the + operator:

myDF["myCol"].astype(str) + "appending_this_string"
  • append in coding takes an object and adds it to an existing list

Helpful Hint (for c)

If you see the numbers 99 or 9 it is an indicator that the information is unknown. If you want to learn more see here

Items to submit
  • Answers to the questions a,b,c above.

  • Code used to solve this problem.

  • Output from running the code.

TWO

What we want to do now is create a Dataframe called accidents that joins the ACCIDENT.CSV files from the years 1985-1989 (inclusive) into one large Dataframe.

Insider Knowledge

The Pandas library has three main functions that combine data.
merge() is typically used for combining data based on common columns or indices. Merge is similar to the join function in SQL. Important to note that merge() will default to an inner join unless specified.
join() is typically used for combining data based on a key column or an index.
concat() is typically used for combining Dataframes across rows or columns.

There are several different forms of joins we will just discuss two here.

  • inner-will return only matching rows from the tables, you will lose the rows that do not have a match in the other Dataframe’s key column.

  • outer- will return every row from both the left and right dataset. If the left dataset does not have a value for a specific row it will be left empty rather than the entire row be removed same goes for the right dataset

A great visual can be found here

Items to submit
  • Answers to the question above

  • Code used to solve this problem

  • Output from running the code.

THREE

Using the new accidents Dataframe that you just created, let’s take a look at some of the data.

  1. Change the values in the YEAR column from a 2 digit year to a 4 digit year, like we did in the last question, but using a different method.

  2. How many accidents are there in which one or more drunk drivers were involved in an accident with a school bus?

Helpful Hint (for a)

use the to_datetime function

df[''] = pd.to_datetime(df[''], format='%y').dt.strftime('%Y')
Helpful Hint (for b)

look at the specifically the variables DRUNK_DR and SCH_BUS

Items to submit
  • Answers to the two questions

  • Code used to solve this problem.

  • Output from running the code.

FOUR

  1. Find how many accidents happen in total per year between 1 or more drunk drivers and school bus.

    1. what year had the lowest number of accidents

    2. what year had the most number of accidents

  2. Now we want to consider which days of the week had the most accidents occur

  3. Is there a time of day where you see more accidents? Using 12am-6am/ 6am-12pm/ 12pm-6pm/ 6pm-12am as your time frames.

Items to submit
  • Answers to the 3 questions above

  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.