Getting Started with Exploratory Data Analysis

Exploratory Data Analysis or EDA is one of the most important steps when understanding your data. During EDA you learn about the data that you have available to you, and develop some questions or goals with the data (if you don’t already have them).

The most common operations start with reading data into a DataFrame, accessing the DataFrames’s attributes, and using the DataFrame’s methods to perform operations on the underlying data or with other DataFrames.

In this document we will:

  • Show how to load and inspect data using pandas (read_csv, head, len, shape).

  • Explore data attributes (columns, unique, value_counts, isin)

  • Perform data transformations (rename, iloc)

  • Summarize datasets with describe()

Basic Functions in EDA

Here we list some commonly used functions used for EDA and DataFrames. You can explore how they get used in the code examples below.

The following functions are for the pandas package, but most data manipulation packages (like numpy) will have similar functionality, sometimes with the same names even.
  • head()

  • describe()

  • len()

  • columns

  • shape

  • iloc

  • unique()

  • info()

  • value_counts()

  • isin()

Reading in the Data

The pandas library provides various functions to load tabular data into a DataFrame. Since our dataset is in CSV format (which uses a comma as the default delimiter), we’ll use the read_csv() function. For additional methods to import data, refer to this resource: pandas official documentation.

Please watch the videos below for an additional walk through:

We will start by reading in the airports dataset.

/anvil/projects/tdm/data/flights/subset/airports.csv

import pandas as pd
myDF = pd.read_csv("/anvil/projects/tdm/data/flights/subset/airports.csv")

Let’s use the .head() function to view its contents.

myDF.head()
  iata               airport              city state country        lat        long
0  00M              Thigpen        Bay Springs    MS     USA  31.953765  -89.234505
1  00R  Livingston Municipal        Livingston    TX     USA  30.685861  -95.017928
2  00V           Meadow Lake  Colorado Springs    CO     USA  38.945749 -104.569893
3  01G          Perry-Warsaw             Perry    NY     USA  42.741347  -78.052081
4  01J      Hilliard Airpark          Hilliard    FL     USA  30.688012  -81.905944

What if I wanted to display the first n rows of my DataFrame?

In this case Pandas has a hand built-in head function. By default head will return the first 5 rows. We can also pass an n= argument to the function if we want a different number of rows:

print(myDF.head(n=2))
  iata               airport              city state country        lat        long
0  00M              Thigpen        Bay Springs    MS     USA  31.953765  -89.234505
1  00R  Livingston Municipal        Livingston    TX     USA  30.685861  -95.017928

Size of the Data

If we wanted to extract the number of rows in the dataset we could use the len() function.

len(myDF)

Now let’s use .shape function to see how many rows AND columns we have in our data.

# Getting the number of rows and columns
print(myDF.shape)

This returns a tuple with the first value as the number of rows and the second as the number of columns:

(3376, 7)

Data Extraction

Let’s say we wanted to just view the 11th row in the dataframe, you can use the .iloc function in pandas. Note that .iloc uses zero-based indexing, so the 11th row corresponds to index 10.

myDF.iloc[10,]
iata                  04M
airport    Calhoun County
city            Pittsboro
state                  MS
country               USA
lat             33.930112
long           -89.342852
Name: 10, dtype: object

The Variables in the Data

Now let’s use .columns function to see the names of the columns we have in our dataset.

myDF.columns
Index(['iata', 'airport', 'city', 'state', 'country', 'lat', 'long'], dtype='object')

The variables in this dataset are:

  • iata - abbreviation used to identify airline

  • airport - airport name

  • city - The city where the airport is located

  • state - The state where the airport is located

  • country - The country where the airport is located

  • lat - latitude

  • long - longitude

Renaming Columns

What if I wanted to change the name of one of my columns?

myDF = myDF.rename(columns={'long': 'lon'})

You could also add the inplace=True argument to make the change directly to the DataFrame:

myDF.rename(columns={'long': 'lon'}, inplace=True)

Either method would result in the long column being renamed to lon in this example.

myDF.rename(columns={'long': 'lon'}, inplace=True)
print(myDF.columns)
Index(['iata', 'airport', 'city', 'state', 'country', 'lat', 'lon'], dtype='object')

Dataset Summary with describe()

myDF.describe()
 	       lat 	           lon
count 	3376.000000 	3376.000000
mean 	40.036524 	    -98.621205
std 	8.329559 	    22.869458
min 	7.367222 	   -176.646031
25% 	34.688427 	   -108.761121
50% 	39.434449 	   -93.599425
75% 	43.372612 	   -84.137519
max 	71.285448 	   145.621384

The describe() function in pandas generates a summary of descriptive statistics for numeric columns in the dataset. Based on our output, we can see that the average latitude in our dataset is approximately 40.04, while the average longitude is around -98.62. This suggests that, on average, the airports in this dataset are located in the central United States, as these coordinates fall near the center of the United States.

Unique values in a column

Additionally, if we wanted to see how many unique countries we have in the airports dataset we could use the unique function. When performing EDA, it’s often useful to understand the quantity and uniqueness of a specific category, making this function particularly useful.

unique_countries = myDF['country'].unique()
print(unique_countries)
['USA' 'Thailand' 'Palau' 'N Mariana Islands'
 'Federated States of Micronesia']

Understanding the columns in your dataset is a critical step when conducting initial exploratory data analysis.

Value Counts

When working with categorical data, value counts is also a useful function. The function values_counts() returns the number of times each value appears in the column. The output will be sorted in descending order when using value_counts().

airport_counts = myDF['airport'].value_counts() #One categorical variable
print(airport_counts)
airport
Municipal                          5
Jackson County                     5
Monroe County                      5
Lancaster                          4
Plymouth Municipal                 4
                                  ..
Chehalis-Centralia                 1
Charlotte/Douglas International    1
Clearwater Air Park                1
Camarillo                          1
Zanesville Municipal               1
Name: count, Length: 3245, dtype: int64
country_airport_counts = myDF[['country', 'airport']].value_counts() #Two categorical variables
print(country_airport_counts)
country  airport
USA      Jackson County                 5
         Monroe County                  5
         Municipal                      5
         Plymouth Municipal             4
         Lancaster                      4
                                       ..
         Georgetown-Scott County        1
         Geraldine                      1
         Gettysburg  & Travel Center    1
         Gettysburg Municipal           1
         Zephyrhills Municipal          1
Name: count, Length: 3245, dtype: int64

Is In

The isin() function checks whether each element meets the condition specified by returning a boolean series.

myDF['country'].isin(['Thailand'])
0       False
1       False
2       False
3       False
4       False
        ...
3371    False
3372    False
3373    False
3374    False
3375    False
Name: country, Length: 3376, dtype: bool