Getting Started with Exploratory Data Analysis
Exploratory Data Analysis or EDA is one of the most important steps when understanding your data. During EDA you learn about the data that you have available to you, and develop some questions or goals with the data (if you don’t already have them).
The most common operations start with reading data into a DataFrame, accessing the DataFrames’s attributes, and using the DataFrame’s methods to perform operations on the underlying data or with other DataFrames.
In this document we will:
-
Show how to load and inspect data using pandas (
read_csv
,head
,len
,shape
). -
Explore data attributes (
columns
,unique
,value_counts
,isin
) -
Perform data transformations (
rename
,iloc
) -
Summarize datasets with
describe()
Basic Functions in EDA
Here we list some commonly used functions used for EDA and DataFrames. You can explore how they get used in the code examples below.
The following functions are for the pandas package, but most data manipulation packages (like numpy) will have similar functionality, sometimes with the same names even. |
-
head()
-
describe()
-
len()
-
columns
-
shape
-
iloc
-
unique()
-
info()
-
value_counts()
-
isin()
Reading in the Data
The pandas
library provides various functions to load tabular data into a DataFrame. Since our dataset is in CSV format (which uses a comma as the default delimiter), we’ll use the read_csv()
function. For additional methods to import data, refer to this resource: pandas official documentation.
Please watch the videos below for an additional walk through:
We will start by reading in the airports dataset.
/anvil/projects/tdm/data/flights/subset/airports.csv
import pandas as pd
myDF = pd.read_csv("/anvil/projects/tdm/data/flights/subset/airports.csv")
Let’s use the .head()
function to view its contents.
myDF.head()
iata airport city state country lat long 0 00M Thigpen Bay Springs MS USA 31.953765 -89.234505 1 00R Livingston Municipal Livingston TX USA 30.685861 -95.017928 2 00V Meadow Lake Colorado Springs CO USA 38.945749 -104.569893 3 01G Perry-Warsaw Perry NY USA 42.741347 -78.052081 4 01J Hilliard Airpark Hilliard FL USA 30.688012 -81.905944
What if I wanted to display the first n
rows of my DataFrame?
In this case Pandas has a hand built-in head
function. By default head
will return the first 5 rows. We can also pass an n=
argument to the function if we want a different number of rows:
print(myDF.head(n=2))
iata airport city state country lat long 0 00M Thigpen Bay Springs MS USA 31.953765 -89.234505 1 00R Livingston Municipal Livingston TX USA 30.685861 -95.017928
Size of the Data
If we wanted to extract the number of rows in the dataset we could use the len()
function.
len(myDF)
Now let’s use .shape
function to see how many rows AND columns we have in our data.
# Getting the number of rows and columns
print(myDF.shape)
This returns a tuple with the first value as the number of rows and the second as the number of columns:
(3376, 7)
Data Extraction
Let’s say we wanted to just view the 11th row in the dataframe, you can use the .iloc function in pandas. Note that .iloc uses zero-based indexing, so the 11th row corresponds to index 10.
myDF.iloc[10,]
iata 04M airport Calhoun County city Pittsboro state MS country USA lat 33.930112 long -89.342852 Name: 10, dtype: object
The Variables in the Data
Now let’s use .columns
function to see the names of the columns we have in our dataset.
myDF.columns
Index(['iata', 'airport', 'city', 'state', 'country', 'lat', 'long'], dtype='object')
The variables in this dataset are:
-
iata - abbreviation used to identify airline
-
airport - airport name
-
city - The city where the airport is located
-
state - The state where the airport is located
-
country - The country where the airport is located
-
lat - latitude
-
long - longitude
Renaming Columns
What if I wanted to change the name of one of my columns?
myDF = myDF.rename(columns={'long': 'lon'})
You could also add the inplace=True
argument to make the change directly to the DataFrame:
myDF.rename(columns={'long': 'lon'}, inplace=True)
Either method would result in the long
column being renamed to lon
in this example.
myDF.rename(columns={'long': 'lon'}, inplace=True)
print(myDF.columns)
Index(['iata', 'airport', 'city', 'state', 'country', 'lat', 'lon'], dtype='object')
Dataset Summary with describe()
myDF.describe()
lat lon count 3376.000000 3376.000000 mean 40.036524 -98.621205 std 8.329559 22.869458 min 7.367222 -176.646031 25% 34.688427 -108.761121 50% 39.434449 -93.599425 75% 43.372612 -84.137519 max 71.285448 145.621384
The describe()
function in pandas generates a summary of descriptive statistics for numeric columns in the dataset. Based on our output, we can see that the average latitude in our dataset is approximately 40.04, while the average longitude is around -98.62. This suggests that, on average, the airports in this dataset are located in the central United States, as these coordinates fall near the center of the United States.
Unique values in a column
Additionally, if we wanted to see how many unique countries we have in the airports dataset we could use the unique
function. When performing EDA, it’s often useful to understand the quantity and uniqueness of a specific category, making this function particularly useful.
unique_countries = myDF['country'].unique()
print(unique_countries)
['USA' 'Thailand' 'Palau' 'N Mariana Islands' 'Federated States of Micronesia']
Understanding the columns in your dataset is a critical step when conducting initial exploratory data analysis.
Value Counts
When working with categorical data, value counts is also a useful function. The function values_counts() returns the number of times each value appears in the column. The output will be sorted in descending order when using value_counts()
.
airport_counts = myDF['airport'].value_counts() #One categorical variable
print(airport_counts)
airport Municipal 5 Jackson County 5 Monroe County 5 Lancaster 4 Plymouth Municipal 4 .. Chehalis-Centralia 1 Charlotte/Douglas International 1 Clearwater Air Park 1 Camarillo 1 Zanesville Municipal 1 Name: count, Length: 3245, dtype: int64
country_airport_counts = myDF[['country', 'airport']].value_counts() #Two categorical variables
print(country_airport_counts)
country airport USA Jackson County 5 Monroe County 5 Municipal 5 Plymouth Municipal 4 Lancaster 4 .. Georgetown-Scott County 1 Geraldine 1 Gettysburg & Travel Center 1 Gettysburg Municipal 1 Zephyrhills Municipal 1 Name: count, Length: 3245, dtype: int64