Pandas Aggregate Functions
groupby
The groupby
function is both very powerful and very commonly used with DataFrames and Series.
For a DataFrame, groupby
groups each unique value in a given column (or set of columns) and allows you to perform operations on those groups. For example, given a sample of the flights data what if you wanted to calculate the average departure delay, DepDelay
, for each Year
of data in the Year
column?
import pandas as pd
myDF = pd.read_csv("./flights_sample.csv")
myDF.head()
Year Month DayofMonth ... NASDelay SecurityDelay LateAircraftDelay 0 1987 10 14 ... NaN NaN NaN 1 1990 10 15 ... NaN NaN NaN 2 1990 10 17 ... NaN NaN NaN 3 1990 10 18 ... NaN NaN NaN 4 1991 10 19 ... NaN NaN NaN
Now we can use the groupby
functionality to get the average for each Year
:
myDF.groupby("Year").mean()
Month DayofMonth DayOfWeek ... NASDelay SecurityDelay LateAircraftDelay Year ... 1987 10.0 14.000000 3.000000 ... NaN NaN NaN 1990 10.0 16.666667 5.666667 ... NaN NaN NaN 1991 10.0 21.800000 3.800000 ... NaN NaN NaN
As you can see, the average for each column is now calculated for each of the 3 Year
values in the dataset: 1987, 1990, and 1991. If you wanted the DepDelay
column by itself you can isolate it before or after the calculation:
# Before
myDF.groupby("Year")['DepDelay'].mean()
or
# After
myDF.groupby("Year").mean()['DepDelay']
Year 1987 11 1990 3 1991 6
You have the ability to group by multiple variables as well! For example, you could find the mean DepDelay
for each day of the week for each year:
myDF.groupby(["Year", "DayOfWeek"])['DepDelay'].mean()
Year DayOfWeek 1987 3 11 1990 4 -1 6 11 7 -1 1991 1 19 3 -2 4 -2 5 1 6 14
You may notice that the output of the groupby
statements results in a Series, not a DataFrame. To move the Year
and DayOfWeek
indexes back to be columns of a DataFrame, you can use the reset_index()
method:
myDF.groupby(["Year", "DayOfWeek"])['DepDelay'].mean().reset_index()
Year DayOfWeek DepDelay 0 1987 3 11 1 1990 4 -1 2 1990 6 11 3 1990 7 -1 4 1991 1 19 5 1991 3 -2 6 1991 4 -2 7 1991 5 1 8 1991 6 14
These are only some of the methods that you can utilize while grouping data. Two additional examples are included below:
How do I calculate the number of rows for each Year
?
myDF.groupby("Year")["DepDelay"].count()
How do I calculate the median DepDelay
by Year
?
myDF.groupby("Year")["DepDelay"].median()
agg
Another powerful addition to the Pandas groupby
method is the agg
method. The end result of agg
is similar to the functionality described above. However, it allows for more aggregations to be performed at once:
list_1 = ['Wisconsin', 'IU', 'Rutgers', 'Michigan State', 'Ohio State']
list_2 = ['home', 'away', 'home', 'away', 'home']
list_3 = [85, 78, 90, 75, 74]
list_4 = [500, 1000, 430, 4800, 10000]
myDF = pd.DataFrame(zip(list_1, list_2, list_3, list_4), columns=['opponent', 'location', 'temp', 'attendance'])
print(myDF.groupby('location').agg({'temp': 'mean', 'attendance': 'mean'}).reset_index())
location temp attendance 0 away 76.5 2900.000000 1 home 83.0 3643.333333