TDM 10200: Project 9 — Spring 2024

Motivation: Working with pandas can be fun! Learning how to manipulate and clean up data in pandas is a helpful tool to have in your tool belt! Dive into pandas to transform and analyze data, equipping you with key data science skills.

Context: Hopefully, most students are feeling pretty comfortable now, with building functions and using pandas. In this project, we will continue working with pandas. We want to get better at analyzing big datasets and solving problems with data.

Scope: python, pandas

Reading and Resources

Dataset

/anvil/projects/tdm/data/whin/weather.csv

We added five new videos to help you with Project 9.

You need to use 2 cores for your Jupyter Lab session for Project 9 this week.

You can use pd.set_option('display.max_columns', None) if you want to see all of the columns in a very wide data frame.

This data is similar to last week’s data, from Project 8, but we are using the csv file instead of the parquet file this week.

Please take a look at the .head(), .shape, .dtypes, .info(), .describe(), etc., to remind yourself about this data. Since we have explored the WHIN weather data set a little bit already, we will dive right into some explorations.

We will assume that you read the data into a data frame called myDF.

Questions

Question 1 (2 points)

  1. Use the method value_counts() to get the number of records for each station.

  2. Use the method groupby() to get the number of records for each station.

  3. Explain the difference between the methodologies of using these two methods. We love getting your feedback. Which method do you prefer?

Question 2 (2 points)

  1. Find out how many null records exist in myDF, within each individual column. (Your answer should specify, for each column, how many null records are in that column.)

  2. Now count the total number of null values in the entire data frame myDF. (In other words, add up the values from all of the counts in part 2a.)

  3. Drop rows with any null values in myDF. Save the resulting cleaned data set into a new DataFrame called myDF_cleaned.

  4. Just to make sure that you did this properly, check myDF_cleaned carefully: Are there any null values remaining in myDF_cleaned? (There should not be.) How many rows and columns are in myDF_cleaned?

There are a variety of ways to approach this question.

  • The isnull() method is useful to find null records.

  • The sum() and dropna() methods might be useful on this question too.

Question 3 (2 points)

  1. Go back to the original data frame myDF. Create a new data frame, by removing all rows from myDF in which the column temperature is a null value.

  2. Combine the columns latitude and longitude into a new column called location with '_' in between.

  3. Grouping the data by the location column, find the average temperature for each location, and print your results: Each line that you print should have 1 location and 1 average temperature for that location.

  • You may refer to combine columns

  • The data in the new location column should be strings.

  • Drop only the rows from the original data frame myDF that have a null value in the temperature column.

Question 4 (2 points)

  1. Wrap your work from Question 3 into a function. This function should take a data frame as a parameter, and should drop records with null value in the data frame’s temperature column. The function should also create a new location column, and should calculate the average temperature (grouped by location). The function should return the Series of average temperatures (grouped by location).

Question 5 (2 points)

  1. Now considering the column called wind_gust_speed_mph (instead of temperature), do the work from questions 3 and 4 again.

  2. Which location has the largest average value of the wind_gust_speed_mph ?

Project 09 Assignment Checklist

  • Jupyter Lab notebook with your code, comments and output for the assignment

    • firstname-lastname-project09.ipynb.

  • Python file with code and comments for the assignment

    • firstname-lastname-project09.py

  • Submit files through Gradescope

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.