STAT 19000: Project 9 — Spring 2022

Motivation: Learning how to wrangle and clean up data using pandas is extremely useful. It takes lots of practice to start to feel comfortable.

Context: At this point in the semester, we have a solid grasp of the basics of Python, and are looking to build our pandas skills by using the library to solve data-driven problems.

Scope: Python, pandas

Learning Objectives
  • Distinguish the differences between numpy, pandas, DataFrames, Series, and ndarrays.

  • Demonstrate the ability to use pandas and the built-in DataFrame and Series methods to perform some of the most common operations used when wrangling data.

Make sure to read about and use the template found here, and to read the important information about project submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/disney/total.parquet

Questions

Question 1

Let’s start by reading in the cleaned up and combined dataset. This is just the cleaned up dataset — essentially the same thing you got as a result from much of your processing from project 7.

How many rows of data are there for each ride?
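As a starting point, counting rows per ride could look something like the sketch below. The small DataFrame is made-up illustration data, not the real dataset; the column name ride_name is taken from Question 3, and the real file would presumably be loaded with pd.read_parquet, as noted in the comment.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the real data. On the cluster, the
# real file would be read with something like:
#   df = pd.read_parquet("/depot/datamine/data/disney/total.parquet")
df = pd.DataFrame({
    "ride_name": ["dwarfs", "dwarfs", "pirates", "dwarfs", "pirates"],
    "SPOSTMIN": [45.0, np.nan, 10.0, 50.0, np.nan],
    "SACTMIN": [np.nan, 38.0, np.nan, np.nan, 12.0],
})

# value_counts returns one entry per distinct ride with its row count.
rows_per_ride = df["ride_name"].value_counts()
print(rows_per_ride)
```

An equivalent alternative is df.groupby("ride_name").size(), which returns the same counts ordered by ride name instead of by count.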

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Recall that a single row of data either has a value for SPOSTMIN or SACTMIN, but not both. How many rows of data are there in total? How many non-null rows for SPOSTMIN? How many non-null rows for SACTMIN? Create a new dataframe called reduced where:

  • Each row has a value for both SPOSTMIN and SACTMIN. The value in the SPOSTMIN column is the SPOSTMIN value recorded closest in time (measured in seconds) to the datetime of that row's SACTMIN value.

  • There is a new column called time_diff that is the difference (in seconds) between the datetime of the SACTMIN value and the datetime of the associated closest SPOSTMIN value.
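For the counting part of this question, a minimal sketch on made-up data is shown below. It relies on the fact that the count method ignores missing values; the numbers here are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: each row has either SPOSTMIN or SACTMIN, never both.
df = pd.DataFrame({
    "SPOSTMIN": [45.0, np.nan, 10.0, 50.0, np.nan],
    "SACTMIN": [np.nan, 38.0, np.nan, np.nan, 12.0],
})

total_rows = len(df)  # total number of rows

# count() reports the number of non-null entries in each column.
non_null = df[["SPOSTMIN", "SACTMIN"]].count()
print(total_rows)
print(non_null)
```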

This is the toughest question for this project. So it is OK if it takes you a bit more time to think of a solution.

Check out the shift method in the pandas documentation. You could write a function that operates on a single dataframe (think a dataframe for a single ride), adds a variety of columns to the dataset using the shift method, and systematically sets the SPOSTMIN values and time_diff values accordingly. This function could then be applied to each ride using the groupby method. This is one potential way to solve the problem!
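The idea in the hint can be sketched roughly as follows. This is one simplified possibility, not the intended solution: it only looks at the rows immediately before and after each SACTMIN row, it assumes the timestamp column is named datetime, it assumes time_diff is signed (positive when the posted time came before the actual time), and the helper name attach_closest_posted is made up for illustration. The data is also made up.

```python
import numpy as np
import pandas as pd

def attach_closest_posted(ride: pd.DataFrame) -> pd.DataFrame:
    """For one ride, pair each SACTMIN row with the nearer adjacent SPOSTMIN row."""
    ride = ride.sort_values("datetime").copy()

    # Posted wait time in the row just before and just after each row.
    prev_post = ride["SPOSTMIN"].shift(1)
    next_post = ride["SPOSTMIN"].shift(-1)

    # Seconds to the previous and next rows.
    prev_gap = (ride["datetime"] - ride["datetime"].shift(1)).dt.total_seconds()
    next_gap = (ride["datetime"].shift(-1) - ride["datetime"]).dt.total_seconds()

    # A neighbor only counts if it actually holds a posted time.
    prev_gap = prev_gap.where(prev_post.notna(), np.inf)
    next_gap = next_gap.where(next_post.notna(), np.inf)

    use_prev = prev_gap <= next_gap
    actual = ride["SACTMIN"].notna()

    ride.loc[actual, "SPOSTMIN"] = prev_post.where(use_prev, next_post)[actual]
    # Assumed sign convention: actual datetime minus posted datetime.
    ride["time_diff"] = prev_gap.where(use_prev, -next_gap)

    # Keep only SACTMIN rows that found a posted neighbor.
    return ride[actual & ride["SPOSTMIN"].notna()]

# Hypothetical data for two rides.
df = pd.DataFrame({
    "ride_name": ["A", "A", "A", "A", "B", "B"],
    "datetime": pd.to_datetime([
        "2022-01-01 10:00", "2022-01-01 10:05", "2022-01-01 10:20",
        "2022-01-01 10:22", "2022-01-01 11:00", "2022-01-01 11:10",
    ]),
    "SPOSTMIN": [30.0, np.nan, 40.0, np.nan, np.nan, 10.0],
    "SACTMIN": [np.nan, 25.0, np.nan, 35.0, 5.0, np.nan],
})

# Process each ride separately; a loop over the groups is used here in
# place of groupby(...).apply(...) so the ride_name column is always kept.
reduced = pd.concat(
    [attach_closest_posted(group) for _, group in df.groupby("ride_name")]
)
print(reduced)
```

Because only adjacent rows are checked, a SACTMIN row surrounded by other SACTMIN rows is dropped in this sketch. Forward- and backward-filling the posted values and their timestamps would be one way to handle that case.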

Don’t worry too much about edge cases — as long as you are close, you will get full credit.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

How many fewer rows does reduced have than the original dataset? What does the time_diff column look like?

In project 7 you calculated the median SPOSTMIN and SACTMIN by ride_name. Perform the same operation on reduced. Are the SACTMIN and SPOSTMIN medians closer together or further apart than they were in the uncleaned data from project 7?
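A per-ride median comparison could be sketched as below. The reduced DataFrame here is made-up illustration data, and the gap column is a hypothetical helper added only to make the comparison easy to read.

```python
import pandas as pd

# Hypothetical stand-in for the reduced dataframe from Question 2.
reduced = pd.DataFrame({
    "ride_name": ["dwarfs", "dwarfs", "dwarfs", "pirates", "pirates"],
    "SPOSTMIN": [30.0, 40.0, 50.0, 10.0, 20.0],
    "SACTMIN": [25.0, 35.0, 30.0, 5.0, 15.0],
})

# Median posted and actual wait time for each ride.
medians = reduced.groupby("ride_name")[["SPOSTMIN", "SACTMIN"]].median()

# Difference between the two medians, per ride.
medians["gap"] = medians["SPOSTMIN"] - medians["SACTMIN"]
print(medians)
```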

Do you think that, overall, the data in reduced is close enough (by time) to be able to draw comparisons? Why or why not?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Any observation where the (absolute) time_diff is greater than an hour is probably not very high quality. Remove said observations from reduced. How many rows are left in reduced?
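The filtering step might look like the sketch below, assuming time_diff is stored in seconds as described in Question 2, so one hour is 3600. The values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for reduced, with time_diff in seconds.
reduced = pd.DataFrame({
    "ride_name": ["dwarfs", "dwarfs", "pirates", "pirates", "pirates"],
    "time_diff": [300.0, -4000.0, 3600.0, -120.0, 7200.0],
})

# Keep rows whose absolute time_diff is at most one hour (3600 seconds);
# rows strictly greater than an hour are removed.
reduced = reduced[reduced["time_diff"].abs() <= 3600]
rows_left = len(reduced)
print(rows_left)
```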

Finally, explore the refined dataset, reduced, further. Write down a question you would like to have answered and what you think the answer will be, then do your best to use the dataset to answer your question.

Your analysis should include: a question, your hypothesis, at least 1 graphic, any and all code you used, and your conclusions. You will not be graded on whether or not you are correct, but rather the effort you put into your analysis. Any good effort including the requirements will receive full credit. Have fun!

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.