# STAT 19000: Project 9 — Spring 2022

Motivation: Learning how to wrangle and clean up data using pandas is extremely useful. It takes lots of practice to start to feel comfortable.

Context: At this point in the semester, we have a solid grasp of the basics of Python and are looking to build our skills by using `pandas` to solve data-driven problems.

Scope: Python, pandas

Learning Objectives
• Distinguish the differences between numpy, pandas, DataFrames, Series, and ndarrays.

• Demonstrate the ability to use pandas and the built-in DataFrame and Series methods to perform some of the most common data wrangling operations.

Make sure to read about and use the template found here, and read the important information about project submissions here.

## Dataset(s)

The questions below use the following dataset(s):

• `/depot/datamine/data/disney/total.parquet`

## Questions

### Question 1

Let’s start by reading in the cleaned-up, combined dataset. It is essentially the same data you produced through much of your processing in project 7.

How many rows of data are there for each ride?
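
As a minimal sketch of the counting step, the example below uses a small hypothetical frame in place of the real file. It assumes the ride identifier lives in a `ride_name` column (the name used later in this project); the toy values are made up for illustration only.

```python
import pandas as pd

# On the cluster you would read the real file instead, for example:
# df = pd.read_parquet("/depot/datamine/data/disney/total.parquet")

# Hypothetical stand-in data so the sketch runs anywhere.
df = pd.DataFrame({
    "ride_name": ["ride_a", "ride_a", "ride_a", "ride_b", "ride_b"],
    "SPOSTMIN": [20.0, None, 35.0, 10.0, None],
    "SACTMIN": [None, 15.0, None, None, 8.0],
})

# value_counts() returns the number of rows for each distinct ride.
rows_per_ride = df["ride_name"].value_counts()
print(rows_per_ride)
```

`df.groupby("ride_name").size()` gives the same counts, sorted by ride name rather than by frequency.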

Items to submit
• Code used to solve this problem.

• Output from running the code.

### Question 2

Recall that a single row of data either has a value for `SPOSTMIN` or `SACTMIN`, but not both. How many rows of data are there in total? How many non-null rows for `SPOSTMIN`? How many non-null rows for `SACTMIN`? Create a new dataframe called `reduced` where:

• Each row has a value for both `SPOSTMIN` and `SACTMIN`. The value in the `SPOSTMIN` column is the `SPOSTMIN` value recorded closest in time (measured in seconds) to the datetime of the `SACTMIN` value.

• There is a new column called `time_diff` that is the difference (in seconds) between the datetime of the `SACTMIN` value and the datetime of its closest `SPOSTMIN` value.
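
The counting part of this question can be sketched as follows. The frame here is a hypothetical stand-in for the real data, and the variable names are chosen only for illustration.

```python
import pandas as pd

# Hypothetical stand-in data: each row has SPOSTMIN or SACTMIN, never both.
df = pd.DataFrame({
    "SPOSTMIN": [20.0, None, None, 35.0],
    "SACTMIN": [None, 15.0, 18.0, None],
})

total_rows = len(df)
posted_rows = df["SPOSTMIN"].notna().sum()  # non-null posted wait times
actual_rows = df["SACTMIN"].notna().sum()   # non-null actual wait times

# Sanity check on the "but not both" claim: count rows where both are present.
both_rows = (df["SPOSTMIN"].notna() & df["SACTMIN"].notna()).sum()

print(total_rows, posted_rows, actual_rows, both_rows)
```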

This is the toughest question in this project, so it is OK if it takes you a bit more time to think of a solution.

Check out the `shift` method in the `pandas` documentation. You could write a function that operates on a single dataframe (think of a dataframe for a single ride), adds a variety of columns using the `shift` method, and systematically sets the `SPOSTMIN` and `time_diff` values accordingly. This function could then be applied using the `groupby` method. This is one potential way to solve the problem! Don’t worry too much about edge cases: as long as you are close, you will get full credit.
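
For comparison, here is a rough sketch of a different route that swaps the `shift` idea for `pandas.merge_asof` with `direction="nearest"`. It is not the approach described in the hint, only one possible alternative. It assumes the timestamp column is named `datetime` and the ride column is named `ride_name`; the helper `nearest_posted`, the `posted_datetime` column, and the toy data are hypothetical.

```python
import pandas as pd

def nearest_posted(df: pd.DataFrame) -> pd.DataFrame:
    """Pair each actual wait time with the closest posted wait time of the same ride."""
    # Rows that carry an actual wait time (left side of the join).
    actual = (
        df.loc[df["SACTMIN"].notna(), ["ride_name", "datetime", "SACTMIN"]]
        .sort_values("datetime")
    )
    # Rows that carry a posted wait time (right side of the join).
    posted = (
        df.loc[df["SPOSTMIN"].notna(), ["ride_name", "datetime", "SPOSTMIN"]]
        .rename(columns={"datetime": "posted_datetime"})
        .sort_values("posted_datetime")
    )
    # merge_asof needs both sides sorted by their time key; by= keeps rides separate.
    merged = pd.merge_asof(
        actual,
        posted,
        left_on="datetime",
        right_on="posted_datetime",
        by="ride_name",
        direction="nearest",
    )
    # Positive time_diff means the posted time was recorded before the actual time.
    merged["time_diff"] = (
        merged["datetime"] - merged["posted_datetime"]
    ).dt.total_seconds()
    # Rides with no posted value at all end up with NaN and are dropped.
    return merged.dropna(subset=["SPOSTMIN"])

# Hypothetical toy data for two rides.
toy = pd.DataFrame({
    "ride_name": ["a", "a", "a", "a", "b", "b"],
    "datetime": pd.to_datetime([
        "2022-01-01 10:00:00", "2022-01-01 10:05:00",
        "2022-01-01 10:07:00", "2022-01-01 10:30:00",
        "2022-01-01 10:00:00", "2022-01-01 10:02:00",
    ]),
    "SPOSTMIN": [20.0, None, None, 35.0, 10.0, None],
    "SACTMIN": [None, 15.0, 18.0, None, None, 8.0],
})

reduced = nearest_posted(toy)
print(reduced)
```

Unlike `shift`, which only looks at neighbouring rows, this version searches all posted times for the same ride, at the cost of splitting the data into two frames first.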

Items to submit
• Code used to solve this problem.

• Output from running the code.

### Question 3

How many fewer rows does `reduced` have than the original dataset? What does the `time_diff` column look like?

In project 7 you calculated the median `SPOSTMIN` and `SACTMIN` by `ride_name`. Perform the same operation on `reduced`. Are the `SACTMIN` and `SPOSTMIN` medians closer together or further apart than in the uncleaned data from project 7?

Do you think that, overall, the data in `reduced` is close enough (by time) to be able to draw comparisons? Why or why not?
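
One way to look at `time_diff` and the per-ride medians is sketched below. The `reduced` frame here is a small hypothetical example with made-up values, and the `gap` column is introduced only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the reduced dataframe built in question 2.
reduced = pd.DataFrame({
    "ride_name": ["a", "a", "a", "b", "b"],
    "SPOSTMIN": [20.0, 20.0, 35.0, 10.0, 15.0],
    "SACTMIN": [15.0, 18.0, 30.0, 8.0, 12.0],
    "time_diff": [300.0, 420.0, -600.0, 120.0, 5400.0],
})

# Summary statistics (count, mean, quartiles, min, max) for time_diff.
time_diff_summary = reduced["time_diff"].describe()

# Median posted and actual wait time per ride, plus the gap between them.
medians = reduced.groupby("ride_name")[["SPOSTMIN", "SACTMIN"]].median()
medians["gap"] = medians["SPOSTMIN"] - medians["SACTMIN"]

print(time_diff_summary)
print(medians)
```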

Items to submit
• Code used to solve this problem.

• Output from running the code.

### Question 4

Any observation where the (absolute) `time_diff` is greater than an hour is probably not very high quality. Remove said observations from `reduced`. How many rows are left in `reduced`?
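
The filter can be sketched with a boolean mask on the absolute value of `time_diff`. The frame below is a hypothetical example, and it treats a difference of exactly one hour as acceptable, which is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the reduced dataframe.
reduced = pd.DataFrame({
    "ride_name": ["a", "a", "a", "b", "b", "b"],
    "time_diff": [300.0, 420.0, -600.0, 120.0, 5400.0, -4000.0],
})

one_hour = 60 * 60  # seconds

# Keep rows whose absolute time difference is at most one hour.
reduced = reduced.loc[reduced["time_diff"].abs() <= one_hour].reset_index(drop=True)

print(len(reduced))
```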

Finally, explore the refined dataset, `reduced`, further. Write down a question you would like to have answered and what you think the answer will be, then do your best to use the dataset to answer it.

Your analysis should include: a question, your hypothesis, at least 1 graphic, any and all code you used, and your conclusions. You will not be graded on whether or not you are correct, but rather the effort you put into your analysis. Any good effort including the requirements will receive full credit. Have fun!

Items to submit
• Code used to solve this problem.

• Output from running the code.

Please double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.