TDM 10200: Project 7 — 2024

Motivation: Pandas is a powerful tool to manipulate and analysis data in Python. We can use Pandas to extract information about one or more variables on data frames, sometimes we can group the data according to one variable and summarizing another variable within each of those groups.

Context: Understanding how to use Pandas and be able to develop functions allows for a systematic approach to analyzing data.

Scope: Pandas and functions

Reading and Resources

Make sure to read about, and use the template found here, and the important information about projects submissions here.
Python Functions
pandas Series
pandas aggregation functions

Datasets

/anvil/projects/tdm/data/noaa

Just as we did in Project 3, please remember to use header=None when you read in the data set, and remember to use:

names=["id","date","element_code","value","mflag","qflag","sflag","obstime"]

We added six new videos to help you with Project 7.

Questions

For all questions in this project, consider only the rows from the NOAA data that are for US records, namely, in which the first field starts with the letters `US`.

Dr Ward used this code in the videos, which might help you during your work:

def get_noaa_data (myyear: int) -> pd.DataFrame:
    """
    This function accepts a 4-digit year as input, and returns a data frame that contains the NOAA data for that year

    Args:
    myyear (int): This is a 4-digit year for which we will load a data frame to be returned.

    Returns:
    myDF (pd.DataFrame): This is the data frame that contains the NOAA data for that year
    """
    myfilepath = f'/anvil/projects/tdm/data/noaa/{myyear}.csv'
    mycolumnnames=["id","date","element_code","value","mflag","qflag","sflag","obstime"]
    myDF = pd.read_csv(myfilepath, names=mycolumnnames)
    myDF['date'] = pd.to_datetime(myDF['date'], format="%Y%m%d")
    return myDF

Question 1 (2 points)

For the year 1880, find how many rows are for US records. Then do this again (in separate cells) for the years 1881, 1882, 1883.
Now make a dictionary that has these years as the four keys, and has the counts (that you discovered in question 1a) as the values.
Now that you know how to do the work from questions 1a and 1b, wrap your work into a function that accepts a list of years as input, and returns a dictionary as the output.
Run the function for the years in the range from 1880 to 1883 (inclusive), and make sure that the results agree with your results from question 1a and 1b.

The following sample code can be used to get US records

us_records = df[df['id'].str.startswith('US')]

The shape of a data frame contains the number of rows and the number of columns.

The range of years to test is range(1880,1884); remember that this will include 1880 to 1883 and will not include the year 1884 (Python always drops the last year in a range.

Question 2 (2 points)

Revise your work from question 1b (before you built your function!), so that the dictionary is in reverse order. The code from the tip might help. Test your work on the years 1880 through 1883, and make sure that the resulting dictionary is in descending order.
Now that you know how to do the work from questions 2a, wrap your work from question 2a into a function that accepts a list of years as input, and returns a dictionary in descending order as the output.
Run the function for the years in the range from 1880 to 1883 (inclusive), and make sure that the results agree with your results from question 2a.

This code takes a dictionary called mydict and puts it into descending sorted order.

mydescendingdict = dict([key, mydict[key]] for key in sorted(mydict, key=mydict.get, reverse=True))

Question 3 (2 points)

For the year 1880, find how many rows (which are for US records) are SNOW days with a positive amount of snowfall. In other words look for rows with three conditions: The first field starts with US, and the element_code is SNOW, and the value is strictly positive. Then do this again (in separate cells) for the years 1881, 1882, 1883.
Now make a dictionary that has these years as the four keys, and has the counts (that you discovered in question 3a) as the values.
Now that you know how to do the work from questions 3a and 3b, wrap your work into a function that accepts a list of years as input, and returns a dictionary as the output.
Run the function for the years in the range from 1880 to 1883 (inclusive), and make sure that the results agree with your results from question 3a and 3a.

Question 4 (2 points)

For the year 1880, consider only the types of rows from question 3a (which are for US records, with element_code as SNOW, and with value as strictly positive). Group those rows according to the id, and determine which id has the largest number of snowfall days. Then do this again (in separate cells) for the years 1881, 1882, 1883.
Now make a dictionary that has these years as the four keys, and has the id values with the largest number of snowfall days in each of these individual years.
Now that you know how to do the work from questions 4a and 4b, wrap your work into a function that accepts a list of years as input, and returns a dictionary as the output.
Run the function for the years in the range from 1880 to 1883 (inclusive), and make sure that the results agree with your results from question 4a and 4a.

Question 5 (2 points)

For the year 1880, consider only the types of rows from question 3a/4a (again, which are for US records, with element_code as SNOW, and with value as strictly positive). Group those rows according to the id, and determine which id has the largest amount of snowfall (in other words, sum the snowfall amounts for each id). Then do this again (in separate cells) for the years 1881, 1882, 1883.
Now make a dictionary that has these years as the four keys, and has the id values with the largest amount of snowfall in each of these individual years.
Now that you know how to do the work from questions 5a and 5b, wrap your work into a function that accepts a list of years as input, and returns a dictionary as the output.
Run the function for the years in the range from 1880 to 1883 (inclusive), and make sure that the results agree with your results from question 5a and 5b.

Project 07 Assignment Checklist

Jupyter Lab notebook with your code, comments and output for the assignment
- firstname-lastname-project07.ipynb.
Python file with code and comments for the assignment
- firstname-lastname-project07.py
Submit files through Gradescope

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.