Indexing

Basics and Simple Accessing Methods

It is implausible that you will always need an entire dataset when doing analysis, so being able to efficiently pick subsets of data is an important skill. pandas has some vital indexing functions that add logical parameters when working with a DataFrame or Series.


[] and .

Using brackets and periods to index in pandas are the most basic methods of indexing, and thus allow for simpler subset selection. Both are limited to column selection, so you won’t be able to select the first row of an object with df[1] or df.1.

This indexing is done based on the column name — if we had a column named "columnA", we could select it using df['columnA'] or df.columnA.

Selecting multiple columns is also possible by nesting a list within your brackets: df[['columnA', 'columnB', 'columnC']]. This is not possible when using . to select attributes — you are additionally unable to use . to select:

  • Columns that match names of pandas methods — max, head, describe, etc.

  • Identifiers, including keywords, whitespace, and literals. If your column has a space in its name (which can happen often), you cannot select it with `.`


Slicing

If we consider simple [] and . indexing to be column selection, we can consider : indexing to be row selection. You still start by using brackets, but when : is included, the rows are now selected instead of columns. Only our rows can be selected when using simple slicing, but as you will soon see, slicing is a tool that is used heavily in more complex indexing.

: slicing is inclusive, with an implied 0 for the start and the object’s last index implied for the end. Thus, the input print(df[:]) selects all rows and is the same as print(df). You can add a negative sign to the end index to exclude the last n elements.

If you want to skip certain elements, you can add another colon after the stop index, which would take the following form: df[0:4:2]. There is an implied skip of 1 (meaning no skip at all), meaning print(df[:]) == print(df[::]) == print(df). You can add a negative sign in front of the skip parameter to slice the object in reverse order.


isin()

The isin method (derived from is in) is a pandas method that mirrors the in keyword in other programming languages. It’s a Series method that is passed a list (or list-like object), returning True or False depending on the value’s inclusion in the list.

The potential applications are simple to imagine — what if I only wanted to select certain states from a DataFrame containing information on the United States? How about information when the records include the day of the week, and I only want Mondays and Thursdays? These desires are very easily translated into code.

days = pd.Series(['Monday', 'Thursday', 'Wednesday', 'Saturday', 'Saturday', 'Monday'])
print(days.isin(['Monday', 'Thursday']))
0     True
1     True
2    False
3    False
4    False
5     True
print(days[days.isin(['Monday', 'Thursday'])])
0      Monday
1    Thursday
5      Monday

DataFrames have their own isin method, but as we’ll demonstrate in the next section, using the Series method is better when combined with the loc/iloc functions.


Examples

For the following, we use the code np.random.randn(num_rows, num_columns) to generate a random DataFrame. The point is not for the output values to match, but rather the rows/columns.

Return the column A from DataFrame df.

Click to see solution
# method 1: []
print(df['A'])
# method 2: .
print(df.A)
0   -2.676859
1    0.110410
2    1.263104
3    0.161416
4   -0.213868

Return columns A, B, and D from DataFrame df.

Click to see solution
print(df[['A', 'B', 'D']])
          A         B         D
0  0.461834  0.456688 -1.061509
1  1.003698  1.115509  0.120536
2  0.814746  2.793606 -0.281329
3  0.766533  0.138788  0.479603
4 -0.084290 -0.141935  0.755774

Return every other row in DataFrame numbers.

Click to see solution
print(numbers[::2])
          A         B         C         D
0 -0.234193 -0.775527 -1.250210  1.421642
2 -3.402812  0.388646  1.199761  1.366917
4 -0.373406  0.868126 -0.063795  1.202232
6 -0.872389  1.717326 -0.709681 -0.339897

Return all of numbers except the last 3 rows.

Click to see solution
print(numbers[:-3])
          A         B         C         D
0 -0.293664 -0.072110  0.937070  1.611655
1 -0.431300  1.992882  0.175886 -0.777462
2 -0.014344 -0.018958  0.085689  1.749314
3  0.577731 -0.505912  1.576066 -0.688136
4 -1.125129  1.710249  1.230097  0.634027


Multi-dimensional Indexing: loc and iloc

Fortunately, there are many ways to select multiple dimensions for subsetting, and they generally allow conditionals to further specify the data points you want.

The loc method allows for label-based indexing, while iloc is primarily integer-based. These are two of the best functions to use when indexing, so we’ll be using plenty of examples to demonstrate their versatility.

We can use either loc or iloc to isolate a single column or Series. The differences between the two are strict; if you try to use row/column indices for loc or row/column names with iloc, you’ll get an error.

# option 1: loc
just_the_year = myDF.loc[:, 'Year']
# option 2: iloc, matches option 1
just_the_year = myDF.iloc[:, 0]  # Year is the first column.
print(just_the_year)
0    1987
1    1990
2    1990
3    1990
4    1991
5    1991
6    1991
7    1991
8    1991

The comma is important here — everything before the comma indicates row selection, while everything after indicates column selection. You should recognize the : from slicing, and we can now slice on both dimensions thanks to loc!

The same two techniques can both be used for multiple rows and columns as well. Something to keep in mind is that .iloc slicing is exclusive, contrasting the inclusivity of standard slicing and .loc slicing.

year_month = myDF.loc[:, ('Year', 'Month')]
year_month = myDF.iloc[:, 0:2]
print(year_month)
   Year  Month
0  1987     10
1  1990     10
2  1990     10
3  1990     10
4  1991     10
5  1991     10
6  1991     10
7  1991     10
8  1991     10
putting_it_together = myDF.loc[0:2, ['Year', 'Month', 'DayofMonth']]
putting_it_together = myDF.iloc[0:3, 0:3]
print(putting_it_together)
   Year  Month  DayofMonth
0  1987     10          14
1  1990     10          15
2  1990     10          17


Lots of loc and iloc examples can look similar, but when we change the index of a DataFrame, we get to see some crucial differences:

list_1 = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
list_2 = ['Apple', 'Banana', 'Coffee', 'Nothing', 'Oatmeal']
list_3 = [1, 3, 6, 8, 1]

column_names = ['day_of_week', 'breakfast', 'rating']
index_values = [1,1,2,2,3]

myDF = pd.DataFrame(zip(list_1, list_2, list_3), columns=column_names, index=index_values)
print(myDF)
      day_of_week breakfast    rating
1          Monday     Apple         1
1         Tuesday    Banana         3
2       Wednesday    Coffee         6
2        Thursday   Nothing         8
3          Friday   Oatmeal         1
print(myDF.loc[0:2,:])
      day_of_week breakfast    rating
1          Monday     Apple         1
1         Tuesday    Banana         3
2       Wednesday    Coffee         6
2        Thursday   Nothing         8

Since loc is symbol-based, we will get all the rows with the index of 0, 1, or 2, instead of getting the first 3 rows. We get a much different result when directly swapping loc with iloc:

print(myDF.iloc[0:2,:])
      day_of_week breakfast    rating
1          Monday     Apple         1
1         Tuesday    Banana         3


Boolean Statements

Utilizing logical statements when indexing is vital — you will commonly encounter situations where using sequences to select data does not suffice. When the values you want are scattered throughout the DataFrame, you’ll want conditional statements.

Let’s say we want only Mondays and the days where no breakfast was eaten. We can combine conditional statements as follows:

monday_or_breakfast = myDF.loc[(myDF.loc[:, "day_of_week"]=="Monday") | (myDF.loc[:, "breakfast"]=="Nothing"), :]
print(monday_or_breakfast)
      day_of_week breakfast    rating
1          Monday     Apple         1
2        Thursday   Nothing         8

Nice! The parentheses () are critical here — pandas doesn’t know how to evaulate multiple statements with | alone and will throw an error. The following will not work (and also looks worse):

monday_or_breakfast = myDF.loc[myDF.loc[:, "day_of_week"]=="Monday" | myDF.loc[:, "breakfast"]=="Nothing", :]

You can use & for the logical AND just as you can use | for the logical OR:

apple_and_one = myDF.loc[(myDF.loc[:, "breakfast"]=="Apple") & (myDF.loc[:, "rating"]==1), :]
print(apple_and_one)
      day_of_week breakfast    rating
1          Monday     Apple         1


Examples

"flights_sample.csv" example

Click to see solution

Take the following example from the "flights_sample.csv" file:

import pandas as pd

myDF = pd.read_csv("/anvil/projects/tdm/data/flights/subset/flights_sample.csv")
print(myDF.head())
   Year  Month  DayofMonth  ...  NASDelay  SecurityDelay  LateAircraftDelay
0  1987     10          14  ...       NaN            NaN                NaN
1  1990     10          15  ...       NaN            NaN                NaN
2  1990     10          17  ...       NaN            NaN                NaN
3  1990     10          18  ...       NaN            NaN                NaN
4  1991     10          19  ...       NaN            NaN                NaN

This call to head would be the equivalent to the line myDF.loc[0:4, :].

Let’s say we want only the data from 1990 — the best strategy is to use a conditional.

# option 1: nested .loc call
love_the_90s = myDF.loc[myDF.loc[:, 'Year'] == 1990, :]
# option 2: nested bracket selection; equivalent to 1
love_the_90s = myDF.loc[myDF['Year'] == 1990, :]
print(love_the_90s)
   Year  Month  DayofMonth  ...  NASDelay  SecurityDelay  LateAircraftDelay
1  1990     10          15  ...       NaN            NaN                NaN
2  1990     10          17  ...       NaN            NaN                NaN
3  1990     10          18  ...       NaN            NaN                NaN

Cool! This gives us what we want…​ but what’s with the weird nesting? The thing with conditionals is that the statement only evaluates to True or False — boolean values, in other words. In our example, the statement myDF['Year'] == 1990 on its own would evaluate to:

0    False
1     True
2     True
3     True
4    False

Obviously this only gives us information on Year and its values, but now we know which rows contain the information we want. We now nest our row selection in its proper location within our outer .loc call, then use : to select all columns of myDF. This is why nesting was necessary.


Now, how do we repeat this example with .iloc? We can try swapping .loc with .iloc directly in Option 1, but we’ll run into the error message ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types. This is a long way of saying we can only include numbers or a boolean list when using .iloc.

Knowing the above knowledge, we see that Year is the first column in myDF, corresponding to 0 (since pandas uses 0-indexing). This works for the inner call, but notice how we said iloc works on boolean lists, not boolean Series. If we try to use iloc on the outer call, we get another error, meaning that we need to keep it as loc:

# option 3: nested .iloc call
love_the_90s = myDF.loc[myDF.iloc[:, 'Year'] == 1990, :]
print(love_the_90s)
   Year  Month  DayofMonth  ...  NASDelay  SecurityDelay  LateAircraftDelay
1  1990     10          15  ...       NaN            NaN                NaN
2  1990     10          17  ...       NaN            NaN                NaN
3  1990     10          18  ...       NaN            NaN                NaN

Return the entries for which country is in Europe within DataFrame people.

Click to see solution
people = pd.DataFrame({'first_init': list('SRSE'), 'last_init': list('MVNS'),
                       'country': ['Ireland', 'Singapore', 'Japan', 'Netherlands']})

print(people.loc[people.country.isin(['Ireland', 'Netherlands']), :])
  first_init last_init      country
0          S         M      Ireland
3          E         S  Netherlands