Writing Functions with the WHIN Dataset

This example is from TDM 102 Project 9 Spring 2024.

These example(s) depend on the database:

  • /anvil/projects/tdm/data/whin/weather.parquet

Learn more about the dataset here.

3aa. Go back to the original data frame myDF. Create a new data frame, by removing all rows from myDF in which the column temperature is a null value.*

import pandas as pd
myDF = pd.read_csv('/anvil/projects/tdm/data/whin/weather.csv')
myDF = myDF.dropna()
myDF['location']=myDF['latitude'].astype(str)+'_'+myDF['longitude'].astype(str)
import pandas as pd

myDF = myDF.dropna(subset=['temperature'])  # Drop NaN values

3ab. Combine the columns latitude and longitude into a new column called location with '_' in between.

import pandas as pd
myDF = pd.read_csv('/anvil/projects/tdm/data/whin/weather.csv')
myDF = myDF.dropna()
myDF['location']=myDF['latitude'].astype(str)+'_'+myDF['longitude'].astype(str)

3ac. Grouping the data by the location column, find the average temperature for each location, and print your results: Each line that you print should have 1 location and 1 average temperature for that location.

myDF.groupby('location')['temperature'].mean().sort_values(ascending = False)
Location Temperature

40.53722_-86.95342

57.639716

40.10483_-86.86619

55.795827

40.70178_-86.70649

55.642011

40.97061_-86.35304

55.632911

40.38539_-87.51034

55.435129

40.51432_-86.45856

55.400226

40.59057_-86.39155

55.369460

40.84436_-86.18173

55.298587

40.43134_-86.53464

55.102383

40.58629_-87.43654

55.085078

40.78049_-86.89576

55.032875

40.37634_-86.59591

54.916286

40.5485_-87.12477

54.867862

41.01865_-86.7101

54.675659

40.98224_-86.38542

54.493940

40.16179_-87.35246

53.826035

40.42124_-86.84642

50.878671

40.93894_-86.47418

50.700275

40.2709571472446_-87.1486035394257

49.977179

40.38116_-86.40269

47.996461

40.3840069271445_-87.31664031744

44.836799

40.3861140862113_-87.1012964844704

42.005992

4a. Wrap your work from Question 3 into a function. This function should take a data frame as a parameter, and should drop records with null value in the data frame’s temperature column. The function should also create a new location column, and should calculate the average temperature (grouped by location). The function should return the Series of average temperatures (grouped by location).

import pandas as pd

def avg_temp(file_path):
    df = pd.read_csv(file_path)


    df_cleaned = df.dropna(subset=['temperature']).copy()


    df_cleaned['location'] = df_cleaned['latitude'].astype(str) + '_' + df_cleaned['longitude'].astype(str)


    avg_temp_loc = df_cleaned.groupby('location')['temperature'].mean()

    return avg_temp_loc
avg_temp('/anvil/projects/tdm/data/whin/weather.csv')
Location Temperature

40.10483_-86.86619

56.057495

40.1492432088387_-86.7371410131454

61.413324

40.16179_-87.35246

54.052897

40.2709571472446_-87.1486035394257

49.803342

40.2967994865571_-87.3902853950858

57.855233

40.30156_-87.48248

58.889855

40.37634_-86.59591

54.794076

40.38116_-86.40269

47.765668

40.3840069271445_-87.31664031744

59.552215

40.38539_-87.51034

55.115662

40.3861140862113_-87.1012964844704

59.383360

40.42124_-86.84642

50.742410

40.43134_-86.53464

55.081819

40.48079_-87.20682

51.730032

40.4869749155437_-87.4914180859923

58.971553

40.51432_-86.45856

55.183031

40.53722_-86.95342

55.070667

40.5485_-87.12477

54.898998

40.58629_-87.43654

54.676701

40.59057_-86.39155

55.307808

40.70178_-86.70649

55.482287

40.78049_-86.89576

54.791890

40.84436_-86.18173

54.942505

40.93894_-86.47418

49.883742

40.9700622558594_-86.9013715815608

59.919768

40.97061_-86.35304

55.292530

40.98224_-86.38542

54.206106

41.01865_-86.7101

53.836086