Pandas Indexing
Pandas Methods
Pandas has some very beneficial indexing functions that allow additional logic to be added when working with data in a DataFrame or Series. Two of the most common methods are iloc
and loc
. The loc
method is primarily label based while iloc
is primarily and integer based indexer. For example, given the following DataFrame:
import pandas as pd
myDF = pd.read_csv("./flights_sample.csv")
myDF.head()
Year Month DayofMonth ... NASDelay SecurityDelay LateAircraftDelay 0 1987 10 14 ... NaN NaN NaN 1 1990 10 15 ... NaN NaN NaN 2 1990 10 17 ... NaN NaN NaN 3 1990 10 18 ... NaN NaN NaN 4 1991 10 19 ... NaN NaN NaN
The loc
function allows us to add logic to our indexing. If we wanted to get only the rows where Year
is 1990 we could try:
love_the_90s = myDF.loc[myDF.loc[:, 'Year'] == 1990, :]
print(love_the_90s)
Year Month DayofMonth ... NASDelay SecurityDelay LateAircraftDelay 1 1990 10 15 ... NaN NaN NaN 2 1990 10 17 ... NaN NaN NaN 3 1990 10 18 ... NaN NaN NaN
Alternatively, you could write the code below to get the same outcome:
love_the_90s = myDF.loc[myDF['Year'] == 1990, :]
Year Month DayofMonth ... NASDelay SecurityDelay LateAircraftDelay 1 1990 10 15 ... NaN NaN NaN 2 1990 10 17 ... NaN NaN NaN 3 1990 10 18 ... NaN NaN NaN
The :
simply means "all rows" if it is before the comma, and "all columns" if it is after the comma. In the same way, having myDF['Year'] == 1990
before the first comma, means to filter rows where myDF['Year'] == 1990
results in True
.
Note you could use iloc
instead of loc
for any of the previous examples. Both iloc
and loc
allow for boolean (logical) based indexing. Since myDF['Year'] == 1990
or myDF.loc[:, 'Year'] == 1990
both result in a Series of True
and False
values either will work.
Similarly, we can use either loc
or iloc
to isolate a single column or Series:
just_the_year = myDF.loc[:, 'Year']
print(just_the_year)
0 1987 1 1990 2 1990 3 1990 4 1991 5 1991 6 1991 7 1991 8 1991
just_the_year = myDF.iloc[:, 0] #Year is the first column.
print(just_the_year)
0 1987 1 1990 2 1990 3 1990 4 1991 5 1991 6 1991 7 1991 8 1991
The same two techniques can both be used for multiple rows as well:
year_month = myDF.loc[:, ('Year', 'Month')]
print(year_month)
year_month = myDF.iloc[:, 0:2]
print(year_month)
Year Month 0 1987 10 1 1990 10 2 1990 10 3 1990 10 4 1991 10 5 1991 10 6 1991 10 7 1991 10 8 1991 10
As mentioned above, the first entry in loc
and iloc
idicated the rows. If we wanted to isolate specific rows in a DataFrame we could use similar log to the previous example but for the rows entry:
i_want_three_rows = myDF.loc[0:2, :]
print(i_want_three_rows)
Year Month DayofMonth ... NASDelay SecurityDelay LateAircraftDelay 0 1987 10 14 ... NaN NaN NaN 1 1990 10 15 ... NaN NaN NaN 2 1990 10 17 ... NaN NaN NaN
Using iloc
looks pretty similar. However it should be noted that because the iloc
indexing is not inclusive it only returns 2 rows instead of 3.
i_want_three_rows = myDF.iloc[0:2, :]
print(i_want_three_rows)
Year Month DayofMonth ... NASDelay SecurityDelay LateAircraftDelay 0 1987 10 14 ... NaN NaN NaN 1 1990 10 15 ... NaN NaN NaN
Even though the code may look pretty similar the functionality behind loc
and iloc
is very different. We can show the difference in behavior if we change the index of the DataFrame:
list_1 = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
list_2 = ['Apple', 'Banana', 'Coffee', 'Nothing', 'Oatmeal']
list_3 = [1, 3, 6, 8, 1]
column_names = ['day_of_the_week', 'breakfast', 'nonsense']
index_values = [1,1,2,2,3]
myDF = pd.DataFrame(zip(list_1, list_2, list_3), columns=column_names, index=index_values)
print(myDF)
day_of_the_week breakfast nonsense 1 Monday Apple 1 1 Tuesday Banana 3 2 Wednesday Coffee 6 2 Thursday Nothing 8 3 Friday Oatmeal 1
print(myDF.loc[0:2,:])
day_of_the_week breakfast nonsense 1 Monday Apple 1 1 Tuesday Banana 3 2 Wednesday Coffee 6 2 Thursday Nothing 8
In this example the loc
function is saying that it should return any indexes that have values of 0, 1, or 2 [0:2]
. For the example DataFrame it only returns the rows with 1 or 2 since we don’t have any 0 indicies.
print(myDF.iloc[0:2,:])
day_of_the_week breakfast nonsense 1 Monday Apple 1 1 Tuesday Banana 3
In comparison the iloc
will only get the rows in positions 0 and 1. In this case they both happen to have an index of 1.
You can also index on both rows and columns:
rows_and_columns = myDF.iloc[0:2, 0:2]
print(rows_and_columns)
rows_and_columns = myDF.loc[0:1, ('day_of_the_week', 'breakfast')]
print(rows_and_columns)
day_of_the_week breakfast 1 Monday Apple 1 Tuesday Banana
The logic that loc
and iloc
allow is one of the most impactful features of Pandas indexing. In addition, the logic statements can be chained together. For example, if you wanted to get the rows for Monday
or that didn’t have Breakfast
you could do the following:
monday_or_breakfast = myDF.loc[(myDF.loc[:, "day_of_the_week"]=="Monday") | (myDF.loc[:, "breakfast"]=="Nothing"), :]
print(monday_or_breakfast)
day_of_the_week breakfast nonsense 1 Monday Apple 1 2 Thursday Nothing 8
Note: in this example the parentheses ()
are critical. Withouth the parentheses Python doesn’t know how to evaulate the multiple statements and will error. The following will not work:
monday_or_breakfast = myDF.loc[myDF.loc[:, "day_of_the_week"]=="Monday" | myDF.loc[:, "breakfast"]=="Nothing", :]
print(monday_or_breakfast)
You can use &
for the logical AND just as you can use |
for the logical OR:
apple_and_one = myDF.loc[(myDF.loc[:, "breakfast"]=="Apple") & (myDF.loc[:, "nonsense"]==1), :]
print(apple_and_one)
day_of_the_week breakfast nonsense 1 Monday Apple 1
Resources
If you haven’t done it before the 10 minute intro to Pandas is a great quick introduction. Similarly, the Pandas documentation has a good section on indexing and selecting data.