TDM 10200: Project 9 — Spring 2023

Motivation: Working in pandas can be fun! Knowing how to wrangle and clean data in pandas is a helpful skill to have in your tool belt!

Context: Now that we are more comfortable building functions and using pandas, we want to keep building those skills and use pandas to solve data-driven problems.

Scope: python, pandas, numpy

Make sure to read about and use the template found here, and read the important information about project submissions here.

Dataset(s)

When launching Jupyter Notebook on Anvil, you will need to use 2 cores.

The following questions will use the following dataset(s):

/anvil/projects/tdm/data/disney/total.parquet

Helpful Hints
import pandas as pd
disney = pd.read_parquet('/anvil/projects/tdm/data/disney/total.parquet')
Insider Knowledge

It is helpful to use a Parquet file when we need efficient storage. If we tried to read in all the .csv files in the disney folder, the kernel would crash. In short, the Parquet format supports high-performance compression and encoding schemes for dealing with large amounts of complex data. It is a column-oriented file format, while .csv files tend to be row-oriented.
You can read more about row-oriented versus column-oriented databases here.
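
To make the row versus column distinction concrete, here is a minimal sketch in plain Python. The ride names and wait times are made up for illustration, and this only mimics the two layouts; it is not how Parquet is actually implemented.

```python
# Toy wait-time records; the values are made up for illustration.
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"ride_name": "ride_a", "wait_minutes": 10},
    {"ride_name": "ride_b", "wait_minutes": 25},
    {"ride_name": "ride_a", "wait_minutes": 40},
]

# Column-oriented layout: each field is stored as its own list.
columns = {
    "ride_name": [r["ride_name"] for r in rows],
    "wait_minutes": [r["wait_minutes"] for r in rows],
}

# Averaging one field in the row layout has to visit every record...
row_mean = sum(r["wait_minutes"] for r in rows) / len(rows)

# ...while the column layout only needs the one list it cares about.
col_mean = sum(columns["wait_minutes"]) / len(columns["wait_minutes"])

print(row_mean, col_mean)
```

Both layouts give the same answer. The difference is how much data has to be touched to get it, which matters a lot when there are many columns and many rows.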

ONE

Luckily, this data has already been cleaned before we read it in. It has also been recently updated and now contains a lot more information, i.e., it has more data from more rides.

  1. Since there is a lot of new ride data, let’s print the name of each ride.

  2. How many rows of data are there for each ride?

  3. What is different about the information that you receive from groupby() versus value_counts()? Which one yields the information asked for in question 1b? Why?

  4. Go ahead and import the numpy package and see if you can find the frequency of JUST the ride named hall_of_presidents from the column ride_name. Under Helpful Hint there are two different ways to do that, but can you come up with a third?
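
If you have not used both methods before, the sketch below runs them side by side on a tiny made-up DataFrame. The name toy and its values are hypothetical and only share the column name ride_name with the real data. It is meant as a starting point for your own comparison on the disney data, not as a complete answer.

```python
import pandas as pd

# Hypothetical toy data; only the column name ride_name matches the real dataset.
toy = pd.DataFrame({
    "ride_name": ["ride_a", "ride_b", "ride_a", "ride_a", "ride_b"],
    "wait_minutes": [10, 25, 40, 15, 30],
})

# groupby() on its own returns a GroupBy object; nothing is counted
# until an aggregation such as size() is applied.
grouped = toy.groupby("ride_name")
rows_per_group = grouped.size()

# value_counts() counts the rows for each distinct value directly,
# sorted from the most frequent value to the least frequent.
counts = toy["ride_name"].value_counts()

print(type(grouped).__name__)
print(rows_per_group)
print(counts)
```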

Helpful Hint
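import numpy as np
# Option 1: keep only the rows for this ride, then count them
disney[disney.ride_name == 'hall_of_presidents'].shape[0]
# OR
# Option 2: sum the Boolean mask (each True counts as 1)
(disney['ride_name'] == 'hall_of_presidents').sum()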
Insider Knowledge
  • Note that, before the output lists all the unique values in the column ride_name, it tells you that the result is an array. An array is an ordered collection of elements in which every value has the same data type.
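
Here is a small sketch of what that looks like on made-up data. The toy column is forced to the object dtype so that unique() hands back a NumPy array; with other dtypes (or other pandas versions) the returned array type can differ, so treat this as an illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, explicitly stored with the object dtype.
toy = pd.DataFrame({
    "ride_name": pd.Series(["ride_a", "ride_b", "ride_a"], dtype="object"),
})

# unique() returns the distinct values in order of first appearance.
names = toy["ride_name"].unique()

print(type(names))   # the container type
print(names.dtype)   # the single data type shared by every element
print(list(names))
```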

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • Answer to questions a,b,c,d

TWO

Create a new function that accepts a ride name as an argument, and prints two things: (1) the first year the data for that ride was collected, and (2) the most recent year that the data for that ride was collected.
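
As a rough sketch of the shape such a function could take, the example below works on a made-up DataFrame. The column name date and the function name first_and_last_year are assumptions for illustration; check the actual column names in the disney data before adapting it.

```python
import pandas as pd

# Hypothetical toy data; the "date" column name is an assumption.
toy = pd.DataFrame({
    "ride_name": ["ride_a", "ride_a", "ride_b", "ride_a", "ride_b"],
    "date": pd.to_datetime([
        "2015-03-01", "2018-07-04", "2016-01-15", "2021-12-25", "2019-05-05",
    ]),
})

def first_and_last_year(df, ride):
    """Print and return the earliest and latest year with data for `ride`."""
    # Keep only this ride's rows, then pull the year out of each date.
    years = df.loc[df["ride_name"] == ride, "date"].dt.year
    if years.empty:
        raise ValueError(f"no rows found for ride {ride!r}")
    first, last = int(years.min()), int(years.max())
    print(f"{ride}: first year {first}, most recent year {last}")
    return first, last

result = first_and_last_year(toy, "ride_a")
```

Returning the two years in addition to printing them is optional, but it makes the function easier to check.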

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

THREE

Notice that the dataset has two columns, SPOSTMIN and SACTMIN. Each row has a value for either SPOSTMIN or SACTMIN, but not both.

  1. How many total rows of data do we have?

  2. How many rows have a non-null value for SPOSTMIN?

  3. How many rows have a non-null value for SACTMIN?

  4. Combine columns SPOSTMIN and SACTMIN to create a new variable named newcolumn

  5. What is the length of newcolumn? Is that the same as the number of rows in the disney dataframe?

Helpful Hints

It might be useful to use the combine_first function for question 3d.
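
Here is a minimal sketch of combine_first on made-up data. The DataFrame toy and its values are hypothetical; only the column names come from the dataset. It fills the gaps in one column with the values from the other.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each row has a value in exactly one of the two columns.
toy = pd.DataFrame({
    "SPOSTMIN": [10.0, np.nan, 30.0, np.nan],
    "SACTMIN": [np.nan, 12.0, np.nan, 45.0],
})

# notna() marks the non-null entries; summing the mask counts them.
posted_count = toy["SPOSTMIN"].notna().sum()
actual_count = toy["SACTMIN"].notna().sum()

# combine_first keeps SPOSTMIN where it is present and
# falls back to SACTMIN where SPOSTMIN is missing.
newcolumn = toy["SPOSTMIN"].combine_first(toy["SACTMIN"])

print(posted_count, actual_count)
print(newcolumn)
print(len(newcolumn), len(toy))
```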

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • Answer to questions a,b,c,d,e

FOUR

  1. Find the max and min SACTMIN time for each ride

  2. Find the max and min SPOSTMIN time for each ride

  3. Find the average SPOSTMIN time for each ride

  4. Find the average SACTMIN time for each ride

Helpful Hint

Note that the value -999 indicates that the attraction was closed.
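
The sketch below shows, on made-up data, how the -999 marker can distort a minimum or an average, and one possible way to handle it by turning it into a missing value first. Whether closed rows should be excluded is a judgment call for your own analysis; the DataFrame toy is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; -999 marks a closed attraction.
toy = pd.DataFrame({
    "ride_name": ["ride_a", "ride_a", "ride_a", "ride_b", "ride_b"],
    "SPOSTMIN": [10.0, -999.0, 30.0, 5.0, 15.0],
})

# Without any handling, the closed marker becomes the "minimum" wait time.
raw_min = toy.groupby("ride_name")["SPOSTMIN"].min()

# One option: replace -999 with NaN, which pandas skips when aggregating.
cleaned = toy.assign(SPOSTMIN=toy["SPOSTMIN"].replace(-999, np.nan))
summary = cleaned.groupby("ride_name")["SPOSTMIN"].agg(["min", "max", "mean"])

print(raw_min)
print(summary)
```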

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • Answer to questions a-d

FIVE

  1. Find the date that each ride was most frequently checked.

  2. What was the most commonly closed ride? (Again, note that the value -999 indicates that the attraction was closed.)
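
As a rough illustration of the two ideas on made-up data, the sketch below finds the most frequent date within each group and counts a marker value per group. The DataFrame toy, the column name date, and the choice to look for -999 in SPOSTMIN are all assumptions; the real data may need a different column or extra cleaning.

```python
import pandas as pd

# Hypothetical toy data; the "date" column name is an assumption.
toy = pd.DataFrame({
    "ride_name": ["ride_a", "ride_a", "ride_a", "ride_b", "ride_b", "ride_b"],
    "date": pd.to_datetime([
        "2020-01-01", "2020-01-01", "2020-01-02",
        "2020-01-03", "2020-01-04", "2020-01-04",
    ]),
    "SPOSTMIN": [10.0, -999.0, 20.0, -999.0, -999.0, 5.0],
})

# mode() returns the most frequent value(s) in sorted order;
# iloc[0] picks the earliest one if there is a tie.
most_checked = toy.groupby("ride_name")["date"].agg(lambda s: s.mode().iloc[0])

# Build a Boolean mask for closed rows, then count the True values per ride.
closed_counts = (toy["SPOSTMIN"] == -999).groupby(toy["ride_name"]).sum()
most_closed = closed_counts.idxmax()

print(most_checked)
print(closed_counts)
print(most_closed)
```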

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • Answer to questions a and b

Please double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.