Writing Functions with the WHIN Dataset

This example is from TDM 102 Project 9 Spring 2024.

These example(s) depend on the database:

/anvil/projects/tdm/data/whin/weather.parquet

Learn more about the dataset here.

3aa. Go back to the original data frame myDF. Create a new data frame, by removing all rows from myDF in which the column temperature is a null value.*

import pandas as pd
myDF = pd.read_csv('/anvil/projects/tdm/data/whin/weather.csv')
myDF = myDF.dropna()
myDF['location']=myDF['latitude'].astype(str)+'_'+myDF['longitude'].astype(str)

import pandas as pd

myDF = myDF.dropna(subset=['temperature'])  # Drop NaN values

3ab. Combine the columns latitude and longitude into a new column called location with '_' in between.

import pandas as pd
myDF = pd.read_csv('/anvil/projects/tdm/data/whin/weather.csv')
myDF = myDF.dropna()
myDF['location']=myDF['latitude'].astype(str)+'_'+myDF['longitude'].astype(str)

3ac. Grouping the data by the location column, find the average temperature for each location, and print your results: Each line that you print should have 1 location and 1 average temperature for that location.

myDF.groupby('location')['temperature'].mean().sort_values(ascending = False)

Location	Temperature
40.53722_-86.95342	57.639716
40.10483_-86.86619	55.795827
40.70178_-86.70649	55.642011
40.97061_-86.35304	55.632911
40.38539_-87.51034	55.435129
40.51432_-86.45856	55.400226
40.59057_-86.39155	55.369460
40.84436_-86.18173	55.298587
40.43134_-86.53464	55.102383
40.58629_-87.43654	55.085078
40.78049_-86.89576	55.032875
40.37634_-86.59591	54.916286
40.5485_-87.12477	54.867862
41.01865_-86.7101	54.675659
40.98224_-86.38542	54.493940
40.16179_-87.35246	53.826035
40.42124_-86.84642	50.878671
40.93894_-86.47418	50.700275
40.2709571472446_-87.1486035394257	49.977179
40.38116_-86.40269	47.996461
40.3840069271445_-87.31664031744	44.836799
40.3861140862113_-87.1012964844704	42.005992

4a. Wrap your work from Question 3 into a function. This function should take a data frame as a parameter, and should drop records with null value in the data frame’s temperature column. The function should also create a new location column, and should calculate the average temperature (grouped by location). The function should return the Series of average temperatures (grouped by location).

import pandas as pd

def avg_temp(file_path):
    df = pd.read_csv(file_path)


    df_cleaned = df.dropna(subset=['temperature']).copy()


    df_cleaned['location'] = df_cleaned['latitude'].astype(str) + '_' + df_cleaned['longitude'].astype(str)


    avg_temp_loc = df_cleaned.groupby('location')['temperature'].mean()

    return avg_temp_loc

avg_temp('/anvil/projects/tdm/data/whin/weather.csv')

Location	Temperature
40.10483_-86.86619	56.057495
40.1492432088387_-86.7371410131454	61.413324
40.16179_-87.35246	54.052897
40.2709571472446_-87.1486035394257	49.803342
40.2967994865571_-87.3902853950858	57.855233
40.30156_-87.48248	58.889855
40.37634_-86.59591	54.794076
40.38116_-86.40269	47.765668
40.3840069271445_-87.31664031744	59.552215
40.38539_-87.51034	55.115662
40.3861140862113_-87.1012964844704	59.383360
40.42124_-86.84642	50.742410
40.43134_-86.53464	55.081819
40.48079_-87.20682	51.730032
40.4869749155437_-87.4914180859923	58.971553
40.51432_-86.45856	55.183031
40.53722_-86.95342	55.070667
40.5485_-87.12477	54.898998
40.58629_-87.43654	54.676701
40.59057_-86.39155	55.307808
40.70178_-86.70649	55.482287
40.78049_-86.89576	54.791890
40.84436_-86.18173	54.942505
40.93894_-86.47418	49.883742
40.9700622558594_-86.9013715815608	59.919768
40.97061_-86.35304	55.292530
40.98224_-86.38542	54.206106
41.01865_-86.7101	53.836086