Filtering and Selecting
In this document we will show how to:
-
Select and filter rows and columns using
[]
,loc
, andiloc
. -
Use
isin
for filtering specific values. -
Modify DataFrame values based on conditions.
-
Summarize datasets with
describe()
.
Basics and Simple Accessing Methods
It is not likely that you will always need an entire dataset when doing analysis, so being able to efficiently pick subsets of data is an important skill. Pandas
has some vital indexing functions that add logical parameters when working with a DataFrame
or Series
.
[]
and .
Using brackets and periods to index in pandas
are the most basic methods of indexing, and thus allow for simpler subset selection. Both are limited to column selection, so you won’t be able to select the first row of an object with df[1]
or df.1
.
This indexing is done based on the column name — if we had a column named "columnA", we could select it using df['columnA']
or df.columnA
.
Selecting multiple columns is also possible by nesting a list within your brackets: df[['columnA', 'columnB', 'columnC']]
. This is not possible when using .
to select attributes — you are additionally unable to use .
to select:
-
Columns that match names of
pandas
methods —max
,head
,describe
, etc. -
Identifiers, including keywords, whitespace, and literals. If your column has a space in its name (which can happen often), you cannot select it with `.`
Slicing
If we consider simple []
and .
indexing to be column selection, we can consider :
indexing to be row selection. You still start by using brackets, but when :
is included, the rows are now selected instead of columns. Only our rows can be selected when using simple slicing, but as you will soon see, slicing is a tool that is used heavily in more complex indexing.
:
slicing is inclusive, with an implied 0 for the start and the object’s last index implied for the end. Thus, the input print(df[:])
selects all rows and is the same as print(df)
. You can add a negative sign to the end index to exclude the last n elements.
If you want to skip certain elements, you can add another colon after the stop index, which would take the following form: df[0:4:2]
. There is an implied skip of 1 (meaning no skip at all), meaning print(df[:]) == print(df[::]) == print(df)
. You can add a negative sign in front of the skip parameter to slice the object in reverse order.
isin()
The isin
method (derived from is in) is a pandas
method that mirrors the in
keyword in other programming languages. It’s a Series
method that is passed a list (or list-like object), returning True
or False
depending on the value’s inclusion in the list.
The potential applications are simple to imagine — what if I only wanted to select certain states from a DataFrame
containing information on the United States? How about information when the records include the day of the week, and I only want Mondays and Thursdays? These desires are very easily translated into code.
days = pd.Series(['Monday', 'Thursday', 'Wednesday', 'Saturday', 'Saturday', 'Monday'])
print(days.isin(['Monday', 'Thursday']))
0 True 1 True 2 False 3 False 4 False 5 True
print(days[days.isin(['Monday', 'Thursday'])])
0 Monday 1 Thursday 5 Monday
DataFrames
have their own isin
method, but as we’ll demonstrate in the next section, using the Series
method is better when combined with the loc
/iloc
functions.
Examples
For the following, we use the code np.random.randn(num_rows, num_columns)
to generate a random DataFrame
. The point is not for the output values to match, but rather the rows/columns.
Return the column A
from DataFrame df
.
Click to see solution
# method 1: []
print(df['A'])
# method 2: .
print(df.A)
0 -2.676859 1 0.110410 2 1.263104 3 0.161416 4 -0.213868
Return columns A
, B
, and D
from DataFrame df
.
Click to see solution
print(df[['A', 'B', 'D']])
A B D 0 0.461834 0.456688 -1.061509 1 1.003698 1.115509 0.120536 2 0.814746 2.793606 -0.281329 3 0.766533 0.138788 0.479603 4 -0.084290 -0.141935 0.755774
Multi-dimensional Indexing: loc
and iloc
Fortunately, there are many ways to select multiple dimensions for subsetting, and they generally allow conditionals to further specify the data points you want.
The loc
method allows for label-based indexing, while iloc
is primarily integer-based. These are two of the best functions to use when indexing, so we’ll be using plenty of examples to demonstrate their versatility.
We can use either loc
or iloc
to isolate a single column or Series
. The differences between the two are strict; if you try to use row/column indices for loc
or row/column names with iloc
, you’ll get an error.
Indexing Options with DataFrame
The following table comes from the O’Reily textbook Python for Data Analysis:
Type |
Notes |
|
Select single column or sequence of columns from the DataFrame; special case conveniences: Boolean array (filter rows), (slice rows), or Boolean DataFrame (set values based on some criterion) |
|
Select single row or subset of rows from the DataFrame by label |
|
Select single colum or subset of columns by label |
|
Select both row(s) and column(s) by label |
|
Select both row(s) and column(s) by label |
|
Select both row(s) and column(s) by integer position |
|
Select both row(s) and column(s) by integer position |
|
df.at[row, col] |
|
Select either rows or columns by labels |
Extracting Rows/Columns from a Dataset
There are many ways to select and view data from a pandas DataFrame. We will use the following dataset(s) to understand filtering and selecting:
/anvil/projects/tdm/data/death_records/DeathRecords.csv
Let’s use our dataset to try the first indexing option.
import pandas as pd
myDF = pd.read_csv("/anvil/projects/tdm/data/death_records/DeathRecords.csv")
myDF['ResidentStatus']
The output of selecting one columns using []
is a pandas Series
:
0 20 1 1 2 1 3 1 4 1 .. 2631166 3 2631167 3 2631168 3 2631169 4 2631170 3 Name: ResidentStatus, Length: 2631171, dtype: int64
If we wanted to extract multiple columns we, can use a list of the column names inside the double brackets `[[]].
myDF[['ResidentStatus', 'Age']]
The output of selecting multiple columns using the double brackets is a pandas DataFrame
:
ResidentStatus Age 0 1 87 1 1 58 2 1 75 3 1 74 4 1 64 ... ... ... 2631166 3 84 2631167 3 74 2631168 3 7 2631169 4 49 2631170 3 39 2631171 rows × 2 columns
The iloc function
The iloc
function allows us to do integer-based indexing (an easy way to remember this is that the i in iloc stands for integer!)
For example, let’s select the very first observation of ResidentStatus
using iloc[]
(index 0).
myDF.iloc[0, 1]
1
Now let’s select multiple columns of the first obervation. In this example, myDF.iloc[0, 1:10] selects the first row (index 0) and the columns from positions 1 to 9 (remember that Python slicing is zero-indexed and excludes the stop index).
myDF.iloc[0,1:10]
ResidentStatus 1 Education1989Revision 0 Education2003Revision 2 EducationReportingFlag 1 MonthOfDeath 1 Sex M AgeType 1 Age 87 AgeSubstitutionFlag 0 Name: 0, dtype: object
We can also use iloc[]
to select the first row (index 0) and all columns using (:):
myDF.iloc[0, :]
Id 1 ResidentStatus 1 Education1989Revision 0 Education2003Revision 2 EducationReportingFlag 1 MonthOfDeath 1 Sex M AgeType 1 Age 87 AgeSubstitutionFlag 0 AgeRecode52 43 AgeRecode27 23 AgeRecode12 11 InfantAgeRecode22 0 PlaceOfDeathAndDecedentsStatus 4 MaritalStatus M DayOfWeekOfDeath 4 CurrentDataYear 2014 InjuryAtWork U MannerOfDeath 7 MethodOfDisposition C Autopsy N ActivityCode 99 PlaceOfInjury 99 Icd10Code I64 CauseRecode358 238 CauseRecode113 70 InfantCauseRecode130 0 CauseRecode39 24 NumberOfEntityAxisConditions 1 NumberOfRecordAxisConditions 1 Race 1 BridgedRaceFlag 0 RaceImputationFlag 0 RaceRecode3 1 RaceRecode5 1 HispanicOrigin 100 HispanicOriginRaceRecode 6 Name: 0, dtype: object
Next, we can use iloc[]
to select multiple rows from a single column. The code below returns the first 7 observations in the datset from the 7th column (Age).
myDF.iloc[0:7, 10]
0 43 1 37 2 41 3 40 4 38 5 44 6 42 Name: AgeRecode52, dtype: int64
Now, let’s filter our dataset to only include a few columns and rename the DataFrame.
filtered_columns = ['Id', 'ResidentStatus', 'Sex', 'Age', 'Race', 'MaritalStatus']
filtered_myDF = myDF[filtered_columns]
filtered_myDF
Id ResidentStatus Sex Age Race MaritalStatus 0 1 1 M 87 1 M 1 2 1 M 58 1 D 2 3 1 F 75 1 W 3 4 1 M 74 1 D 4 5 1 M 64 1 D ... ... ... ... ... ... ...
Finally, let’s try selecting multiple rows and multiple columns at the same time. When selecting multiple rows and multiple columns using iloc, the output is a subset of the DataFrame that contains the specified rows and all the columns. In this example, myDF.iloc[[0, 7, 9, 10], :] specifies the selection of rows 0, 7, 9, and 10 and all columns:
filtered_myDF.iloc[[0, 7, 9, 10], :]
Id ResidentStatus Sex Age Race MaritalStatus 0 1 1 M 87 1 M 7 8 1 M 55 2 S 9 10 1 M 23 1 S 10 11 1 F 79 1 W
The loc function
DataFrames in pandas allow for label-based indexing and integer-based indexing. The loc
function is for label-based indexing.
In our dataset, our rows are integers, so we can use integers as our row labels. Let’s extract the first observation for the column Age
filtered_myDF.loc[0, 'Age']
87
Now, let’s select all columns except one specific column using the loc[]
function. Let’s exclude the column Race
:
filtered_myDF.loc[:, filtered_myDF.columns != 'Race']
Id ResidentStatus Sex Age MaritalStatus 0 1 1 M 87 M 1 2 1 M 58 D 2 3 1 F 75 W 3 4 1 M 74 D 4 5 1 M 64 D ... ... ... ... ... ...
Select the first 10 observations for the column Age
using loc[]
:
filtered_myDF.loc[:10, 'Age']
0 87 1 58 2 75 3 74 4 64 5 93 6 82 7 55 8 86 9 23 10 79 Name: Age, dtype: int64
Filtering our Dataset
Let’s filter for one category. Let’s try the ==
to see which death records are for females only.
filtered_myDF['Sex'] == "F"
0 False 1 False 2 True 3 False 4 False ... 2631166 False 2631167 True 2631168 False 2631169 False 2631170 False Name: Sex, Length: 2631171, dtype: bool
When we evaluated the condition myDF['Sex'] == "F", it produced a Boolean series where each value corresponds to whether the condition was True or False for each row in the DataFrame. This Boolean series can be used to filter the DataFrame.
If we want to see only the rows where the Sex column is "F" (females), we can use this condition directly to subset the DataFrame as shown below:
filtered_myDF[filtered_myDF['Sex'] == "F"]
Id ResidentStatus Sex Age Race MaritalStatus 2 3 1 F 75 1 W 5 6 1 F 93 1 W 8 9 1 F 86 1 W 10 11 1 F 79 1 W 12 13 1 F 85 1 W ... ... ... ... ... ... 1299710 rows × 6 columns
We can also use .loc
for filtering for females.
filtered_myDF.loc[filtered_myDF['Sex'] == "F"]
Id ResidentStatus Sex Age Race MaritalStatus 2 3 1 F 75 1 W 5 6 1 F 93 1 W 8 9 1 F 86 1 W 10 11 1 F 79 1 W 12 13 1 F 85 1 W ... ... ... ... ... ...
Now let’s filter for two things. Let’s filter for Females who are 114 years old. Suprisingly, some people do live that long based on our dataset!
filtered_myDF[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114)]
Id ResidentStatus Sex Age Race MaritalStatus 265482 265483 1 F 114 1 W 1304830 1304831 1 F 114 1 W 1372655 1372656 1 F 114 2 W 1981235 1981236 1 F 114 2 W 2407245 2407246 1 F 114 4 M
Another method that would get us the same results:
filtered_myDF.loc[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114)]
Id ResidentStatus Sex Age Race MaritalStatus 265482 265483 1 F 114 1 W 1304830 1304831 1 F 114 1 W 1372655 1372656 1 F 114 2 W 1981235 1981236 1 F 114 2 W 2407245 2407246 1 F 114 4 M
Filtering and Modifying the Dataset
Let’s say there was a data entry mistake, and all females who are 114 should actually be 100 years old! Let’s fix this data entry error.
filtered_myDF.loc[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114), 'Age'] = 100
We can check whether it worked by trying to filter for females who are 114 again. The results should be 0 observations because we set them equal to 100.
filtered_myDF.loc[(filtered_myDF['Sex'] == "F") & (filtered_myDF['Age'] == 114)]
Id ResidentStatus Sex Age Race MaritalStatus