Writing Functions with the WHIN Dataset
This example is from TDM 102 Project 9 Spring 2024.
These example(s) depend on the database:
-
/anvil/projects/tdm/data/whin/weather.parquet
Learn more about the dataset here.
3aa. Go back to the original data frame myDF
. Create a new data frame, by removing all rows from myDF
in which the column temperature
is a null value.*
import pandas as pd
myDF = pd.read_csv('/anvil/projects/tdm/data/whin/weather.csv')
myDF = myDF.dropna()
myDF['location']=myDF['latitude'].astype(str)+'_'+myDF['longitude'].astype(str)
import pandas as pd
myDF = myDF.dropna(subset=['temperature']) # Drop NaN values
3ab. Combine the columns latitude
and longitude
into a new column called location
with '_' in between.
import pandas as pd
myDF = pd.read_csv('/anvil/projects/tdm/data/whin/weather.csv')
myDF = myDF.dropna()
myDF['location']=myDF['latitude'].astype(str)+'_'+myDF['longitude'].astype(str)
3ac. Grouping the data by the location
column, find the average temperature for each location, and print your results: Each line that you print should have 1 location and 1 average temperature for that location.
myDF.groupby('location')['temperature'].mean().sort_values(ascending = False)
Location | Temperature |
---|---|
40.53722_-86.95342 |
57.639716 |
40.10483_-86.86619 |
55.795827 |
40.70178_-86.70649 |
55.642011 |
40.97061_-86.35304 |
55.632911 |
40.38539_-87.51034 |
55.435129 |
40.51432_-86.45856 |
55.400226 |
40.59057_-86.39155 |
55.369460 |
40.84436_-86.18173 |
55.298587 |
40.43134_-86.53464 |
55.102383 |
40.58629_-87.43654 |
55.085078 |
40.78049_-86.89576 |
55.032875 |
40.37634_-86.59591 |
54.916286 |
40.5485_-87.12477 |
54.867862 |
41.01865_-86.7101 |
54.675659 |
40.98224_-86.38542 |
54.493940 |
40.16179_-87.35246 |
53.826035 |
40.42124_-86.84642 |
50.878671 |
40.93894_-86.47418 |
50.700275 |
40.2709571472446_-87.1486035394257 |
49.977179 |
40.38116_-86.40269 |
47.996461 |
40.3840069271445_-87.31664031744 |
44.836799 |
40.3861140862113_-87.1012964844704 |
42.005992 |
4a. Wrap your work from Question 3 into a function. This function should take a data frame as a parameter, and should drop records with null value in the data frame’s temperature column. The function should also create a new location column, and should calculate the average temperature (grouped by location). The function should return the Series of average temperatures (grouped by location).
import pandas as pd
def avg_temp(file_path):
df = pd.read_csv(file_path)
df_cleaned = df.dropna(subset=['temperature']).copy()
df_cleaned['location'] = df_cleaned['latitude'].astype(str) + '_' + df_cleaned['longitude'].astype(str)
avg_temp_loc = df_cleaned.groupby('location')['temperature'].mean()
return avg_temp_loc
avg_temp('/anvil/projects/tdm/data/whin/weather.csv')
Location | Temperature |
---|---|
40.10483_-86.86619 |
56.057495 |
40.1492432088387_-86.7371410131454 |
61.413324 |
40.16179_-87.35246 |
54.052897 |
40.2709571472446_-87.1486035394257 |
49.803342 |
40.2967994865571_-87.3902853950858 |
57.855233 |
40.30156_-87.48248 |
58.889855 |
40.37634_-86.59591 |
54.794076 |
40.38116_-86.40269 |
47.765668 |
40.3840069271445_-87.31664031744 |
59.552215 |
40.38539_-87.51034 |
55.115662 |
40.3861140862113_-87.1012964844704 |
59.383360 |
40.42124_-86.84642 |
50.742410 |
40.43134_-86.53464 |
55.081819 |
40.48079_-87.20682 |
51.730032 |
40.4869749155437_-87.4914180859923 |
58.971553 |
40.51432_-86.45856 |
55.183031 |
40.53722_-86.95342 |
55.070667 |
40.5485_-87.12477 |
54.898998 |
40.58629_-87.43654 |
54.676701 |
40.59057_-86.39155 |
55.307808 |
40.70178_-86.70649 |
55.482287 |
40.78049_-86.89576 |
54.791890 |
40.84436_-86.18173 |
54.942505 |
40.93894_-86.47418 |
49.883742 |
40.9700622558594_-86.9013715815608 |
59.919768 |
40.97061_-86.35304 |
55.292530 |
40.98224_-86.38542 |
54.206106 |
41.01865_-86.7101 |
53.836086 |