Data Cleaning and Handling Missing Data

In this document, we will:

  • Understanding how to handle missing data

  • Learn why managing missing data is crucial for maintaining accuracy and reliability in data analysis.

  • Filtering out missing data: Use functions like dropna() to remove rows or columns with missing values, including options like applying thresholds.

  • Filling in missing data: Replace missing values using methods such as constants, statistical measures (mean/median), or interpolation techniques.

We will use the following dataset(s) to understand how to handle missing data:

/anvil/projects/tdm/data/zillow/Metro_time_series.csv

Handling Missing Data in Pandas

Pandas simplifies managing missing data for both numeric and non-numeric datasets. By default, descriptive statistics exclude missing values, ensuring accurate calculations.

For numeric data, pandas uses NaN (Not a Number) to represent missing values, while for non-numeric data, it may use None or NaN, depending on the data type. While pandas handles missing values by default, it is crucial to understand the data that is missing and how it might impact your analysis. Properly addressing missing data is essential for maintaining accuracy and ensuring data validation in your workflows.

Before identifying the missing data, let’s try to understand how big our dataset actually is.

import pandas as pd
myDF = pd.read_csv("/anvil/projects/tdm/data/zillow/Metro_time_series.csv")
rows, columns = myDF.shape
print(f"Total rows: {rows}")
print(f"Total columns: {columns}")
Total rows: 211182
Total columns: 95

Now, let’s summarize the number of missing values per column. We will use the pd.set_option function in pandas to adjust the display settings for how data is presented. Since by default, pandas limits the number of rows displayed when printing a DataFrame or Series.

pd.set_option('display.max_rows', None)
missing_summary = myDF.isnull().sum()
print(missing_summary)
Date                                                                  0
RegionName                                                            0
AgeOfInventory                                                   186486
DaysOnZillow_AllHomes                                            186200
InventorySeasonallyAdjusted_AllHomes                             142830
InventoryRaw_AllHomes                                            142830
InventorySeasonallyAdjusted_BottomTier                           171715
InventorySeasonallyAdjusted_MiddleTier                           169004
InventorySeasonallyAdjusted_TopTier                              165669
MedianListingPricePerSqft_1Bedroom                               203737
MedianListingPricePerSqft_2Bedroom                               172797
MedianListingPricePerSqft_3Bedroom                               151062
MedianListingPricePerSqft_4Bedroom                               166676
MedianListingPricePerSqft_5BedroomOrMore                         189536
MedianListingPricePerSqft_AllHomes                               139414
MedianListingPricePerSqft_CondoCoop                              193400
MedianListingPricePerSqft_DuplexTriplex                          201602
MedianListingPricePerSqft_SingleFamilyResidence                  139639
MedianListingPrice_1Bedroom                                      206071
MedianListingPrice_2Bedroom                                      178938
MedianListingPrice_3Bedroom                                      157181
MedianListingPrice_4Bedroom                                      173576
MedianListingPrice_5BedroomOrMore                                194056
MedianListingPrice_AllHomes                                      143712
MedianListingPrice_CondoCoop                                     196537
MedianListingPrice_DuplexTriplex                                 202966
MedianListingPrice_SingleFamilyResidence                         143683
MedianPctOfPriceReduction_AllHomes                               152838
MedianPctOfPriceReduction_CondoCoop                              200622
MedianPctOfPriceReduction_SingleFamilyResidence                  153190
MedianPriceCutDollar_AllHomes                                    152838
MedianPriceCutDollar_CondoCoop                                   200622
MedianPriceCutDollar_SingleFamilyResidence                       153190
MedianRentalPricePerSqft_1Bedroom                                202843
MedianRentalPricePerSqft_2Bedroom                                196267
MedianRentalPricePerSqft_3Bedroom                                194592
MedianRentalPricePerSqft_4Bedroom                                203685
MedianRentalPricePerSqft_5BedroomOrMore                          208512
MedianRentalPricePerSqft_AllHomes                                190257
MedianRentalPricePerSqft_CondoCoop                               204331
MedianRentalPricePerSqft_DuplexTriplex                           206190
MedianRentalPricePerSqft_MultiFamilyResidence5PlusUnits          196292
MedianRentalPricePerSqft_SingleFamilyResidence                   190806
MedianRentalPricePerSqft_Studio                                  203875
MedianRentalPrice_1Bedroom                                       202020
MedianRentalPrice_2Bedroom                                       195328
MedianRentalPrice_3Bedroom                                       194864
MedianRentalPrice_4Bedroom                                       204249
MedianRentalPrice_5BedroomOrMore                                 208852
MedianRentalPrice_AllHomes                                       191841
MedianRentalPrice_CondoCoop                                      205136
MedianRentalPrice_DuplexTriplex                                  203760
MedianRentalPrice_MultiFamilyResidence5PlusUnits                 194944
MedianRentalPrice_SingleFamilyResidence                          191947
MedianRentalPrice_Studio                                         203171
ZHVIPerSqft_AllHomes                                              34619
PctOfHomesDecreasingInValues_AllHomes                            167893
PctOfHomesIncreasingInValues_AllHomes                            167893
PctOfHomesSellingForGain_AllHomes                                210378
PctOfHomesSellingForLoss_AllHomes                                210378
PctOfListingsWithPriceReductionsSeasAdj_AllHomes                 147734
PctOfListingsWithPriceReductionsSeasAdj_BottomTier               197894
PctOfListingsWithPriceReductionsSeasAdj_CondoCoop                198774
PctOfListingsWithPriceReductionsSeasAdj_MiddleTier               196750
PctOfListingsWithPriceReductionsSeasAdj_SingleFamilyResidence    147822
PctOfListingsWithPriceReductionsSeasAdj_TopTier                  194110
PctOfListingsWithPriceReductions_AllHomes                        147734
PctOfListingsWithPriceReductions_BottomTier                      197348
PctOfListingsWithPriceReductions_CondoCoop                       198774
PctOfListingsWithPriceReductions_MiddleTier                      196345
PctOfListingsWithPriceReductions_SingleFamilyResidence           147822
PctOfListingsWithPriceReductions_TopTier                         193158
PriceToRentRatio_AllHomes                                        153683
Sale_Counts_Msa                                                  200580
Sale_Counts_Seas_Adj_Msa                                         200580
Sale_Prices_Msa                                                  201793
InventoryTierShare_BottomTier_AllHomes                           171715
InventoryTierShare_MiddleTier_AllHomes                           169004
InventoryTierShare_TopTier_AllHomes                              165669
ZHVI_1bedroom                                                    121824
ZHVI_2bedroom                                                     67888
ZHVI_3bedroom                                                     39562
ZHVI_4bedroom                                                     44115
ZHVI_5BedroomOrMore                                               75668
ZHVI_AllHomes                                                     36803
ZHVI_BottomTier                                                   61978
ZHVI_CondoCoop                                                   115924
ZHVI_MiddleTier                                                   36797
ZHVI_SingleFamilyResidence                                        36740
ZHVI_TopTier                                                      33776
ZRI_AllHomes                                                     153154
ZRI_AllHomesPlusMultifamily                                      152961
ZriPerSqft_AllHomes                                              154370
Zri_MultiFamilyResidenceRental                                   164451
Zri_SingleFamilyResidenceRental                                  153178
dtype: int64

Some columns have a significant number of missing values, but it’s important to assess this in relation to the entire dataset. Calculating the proportion of missing values compared to the total dataset provides better context and helps us evaluate the severity of the missing data.

Calculate the percentage of missing values for each column

missing_percentage = myDF.isnull().mean() * 100
print(missing_percentage)
Date                                                              0.000000
RegionName                                                        0.000000
AgeOfInventory                                                   88.305822
DaysOnZillow_AllHomes                                            88.170393
InventorySeasonallyAdjusted_AllHomes                             67.633605
InventoryRaw_AllHomes                                            67.633605
InventorySeasonallyAdjusted_BottomTier                           81.311381
InventorySeasonallyAdjusted_MiddleTier                           80.027654
InventorySeasonallyAdjusted_TopTier                              78.448447
MedianListingPricePerSqft_1Bedroom                               96.474605
MedianListingPricePerSqft_2Bedroom                               81.823735
MedianListingPricePerSqft_3Bedroom                               71.531665
MedianListingPricePerSqft_4Bedroom                               78.925287
MedianListingPricePerSqft_5BedroomOrMore                         89.750073
MedianListingPricePerSqft_AllHomes                               66.016043
MedianListingPricePerSqft_CondoCoop                              91.579775
MedianListingPricePerSqft_DuplexTriplex                          95.463629
MedianListingPricePerSqft_SingleFamilyResidence                  66.122586
MedianListingPrice_1Bedroom                                      97.579813
MedianListingPrice_2Bedroom                                      84.731653
MedianListingPrice_3Bedroom                                      74.429165
MedianListingPrice_4Bedroom                                      82.192611
MedianListingPrice_5BedroomOrMore                                91.890407
MedianListingPrice_AllHomes                                      68.051254
MedianListingPrice_CondoCoop                                     93.065223
MedianListingPrice_DuplexTriplex                                 96.109517
MedianListingPrice_SingleFamilyResidence                         68.037522
MedianPctOfPriceReduction_AllHomes                               72.372645
MedianPctOfPriceReduction_CondoCoop                              94.999574
MedianPctOfPriceReduction_SingleFamilyResidence                  72.539326
MedianPriceCutDollar_AllHomes                                    72.372645
MedianPriceCutDollar_CondoCoop                                   94.999574
MedianPriceCutDollar_SingleFamilyResidence                       72.539326
MedianRentalPricePerSqft_1Bedroom                                96.051273
MedianRentalPricePerSqft_2Bedroom                                92.937372
MedianRentalPricePerSqft_3Bedroom                                92.144217
MedianRentalPricePerSqft_4Bedroom                                96.449982
MedianRentalPricePerSqft_5BedroomOrMore                          98.735688
MedianRentalPricePerSqft_AllHomes                                90.091485
MedianRentalPricePerSqft_CondoCoop                               96.755879
MedianRentalPricePerSqft_DuplexTriplex                           97.636162
MedianRentalPricePerSqft_MultiFamilyResidence5PlusUnits          92.949210
MedianRentalPricePerSqft_SingleFamilyResidence                   90.351450
MedianRentalPricePerSqft_Studio                                  96.539951
MedianRentalPrice_1Bedroom                                       95.661562
MedianRentalPrice_2Bedroom                                       92.492731
MedianRentalPrice_3Bedroom                                       92.273016
MedianRentalPrice_4Bedroom                                       96.717050
MedianRentalPrice_5BedroomOrMore                                 98.896686
MedianRentalPrice_AllHomes                                       90.841549
MedianRentalPrice_CondoCoop                                      97.137067
MedianRentalPrice_DuplexTriplex                                  96.485496
MedianRentalPrice_MultiFamilyResidence5PlusUnits                 92.310898
MedianRentalPrice_SingleFamilyResidence                          90.891743
MedianRentalPrice_Studio                                         96.206590
ZHVIPerSqft_AllHomes                                             16.392969
PctOfHomesDecreasingInValues_AllHomes                            79.501567
PctOfHomesIncreasingInValues_AllHomes                            79.501567
PctOfHomesSellingForGain_AllHomes                                99.619286
PctOfHomesSellingForLoss_AllHomes                                99.619286
PctOfListingsWithPriceReductionsSeasAdj_AllHomes                 69.955773
PctOfListingsWithPriceReductionsSeasAdj_BottomTier               93.707797
PctOfListingsWithPriceReductionsSeasAdj_CondoCoop                94.124499
PctOfListingsWithPriceReductionsSeasAdj_MiddleTier               93.166084
PctOfListingsWithPriceReductionsSeasAdj_SingleFamilyResidence    69.997443
PctOfListingsWithPriceReductionsSeasAdj_TopTier                  91.915978
PctOfListingsWithPriceReductions_AllHomes                        69.955773
PctOfListingsWithPriceReductions_BottomTier                      93.449252
PctOfListingsWithPriceReductions_CondoCoop                       94.124499
PctOfListingsWithPriceReductions_MiddleTier                      92.974307
PctOfListingsWithPriceReductions_SingleFamilyResidence           69.997443
PctOfListingsWithPriceReductions_TopTier                         91.465182
PriceToRentRatio_AllHomes                                        72.772774
Sale_Counts_Msa                                                  94.979686
Sale_Counts_Seas_Adj_Msa                                         94.979686
Sale_Prices_Msa                                                  95.554072
InventoryTierShare_BottomTier_AllHomes                           81.311381
InventoryTierShare_MiddleTier_AllHomes                           80.027654
InventoryTierShare_TopTier_AllHomes                              78.448447
ZHVI_1bedroom                                                    57.686735
ZHVI_2bedroom                                                    32.146679
ZHVI_3bedroom                                                    18.733604
ZHVI_4bedroom                                                    20.889564
ZHVI_5BedroomOrMore                                              35.830705
ZHVI_AllHomes                                                    17.427148
ZHVI_BottomTier                                                  29.348145
ZHVI_CondoCoop                                                   54.892936
ZHVI_MiddleTier                                                  17.424307
ZHVI_SingleFamilyResidence                                       17.397316
ZHVI_TopTier                                                     15.993787
ZRI_AllHomes                                                     72.522279
ZRI_AllHomesPlusMultifamily                                      72.430889
ZriPerSqft_AllHomes                                              73.098086
Zri_MultiFamilyResidenceRental                                   77.871694
Zri_SingleFamilyResidenceRental                                  72.533644
dtype: float64

Remove rows or columns containing missing values

There are different ways to filter out missing data in pandas. One approach is using pandas.isnull() combined with boolean indexing to manually exclude rows or columns with missing values. This method offers control over how and where you handle missing data. Alternatively, pandas provides built-in functions that allow you to efficiently remove rows or columns containing missing values, making it easier to clean the dataset based on specific criteria.

Let’s show we can filter out missing Data using the dropna() function. The dropna() function in pandas is used for removing rows or columns with missing data. By default, dropna() removes rows where any value is missing. The code below removes all rows that have at least one missing value.

dropna_df = myDF.dropna()

We can also drop columns with all missing values. Axis=1 applies the operation to columns, how='all' ensures only columns where all values are missing are dropped.

dropna_df_axis = myDF.dropna(axis=1, how='all')

If we wanted to apply a threshold for removing misisng values, we can use the thresh function to drop columns with a minimum number of non-missing values. Let’s say we wanted to drop columns with 10,000 or more missing values. We could type:

filtered_df_10000 = myDF.dropna(axis=1, thresh=10000)

rows, columns = filtered_df_10000.shape

print(f"Total rows: {rows}")
print(f"Total columns: {columns}")
Total rows: 211182
Total columns: 76

Notice how the number of columns went down from 95 colums to 76 columns.

Addressing Missing Data

Instead of removing missing data and potentially losing valuable information, you can handle gaps by filling them using various techniques. The fillna() function is commonly used for this purpose.

For instance, you can fill missing values with a specific constant by providing the desired value to fillna():

filled_df_constant = myDF.fillna(0)

Interpolation

Interpolation is also a technique for estimating missing values based on surrounding data points. In pandas, the interpolate() function provides several methods to fill gaps, such as linear and other methods. It’s usually applied to numeric variables.

Using linear method:

interpolate_linear = myDF.select_dtypes(include='number').interpolate(method='linear')

You can also interpolate missing values using previous or next values.

Using forward method:

interpolated_df_forward = myDF.ffill()

Using backwards method:

interpolated_df_back = myDF.bfill()

Let’s demonstrate how the forward-fill method works by selecting a sample of the data and comparing it before and after applying the method.

sample_data_age_inventory = myDF[['AgeOfInventory']].tail(30)

print("Original Data:")
print(sample_data_age_inventory)

filled_data_age_inventory = sample_data_age_inventory.ffill()

# Display the data after forward fill
print("\nData After Forward Fill:")
print(filled_data_age_inventory)
Original Data:
        AgeOfInventory
211152           134.0
211153           155.0
211154             NaN
211155           137.0
211156            76.0
211157             NaN
211158           105.0
211159             NaN
211160             NaN
211161           112.0
211162             NaN
211163             NaN
211164            94.0
211165             NaN
211166             NaN
211167            95.0
211168            94.0
211169             NaN
211170            62.0
211171            76.0
211172             NaN
211173            88.0
211174             NaN
211175             NaN
211176             NaN
211177            60.0
211178            64.0
211179             NaN
211180             NaN
211181            92.0

Data After Forward Fill:
        AgeOfInventory
211152           134.0
211153           155.0
211154           155.0
211155           137.0
211156            76.0
211157            76.0
211158           105.0
211159           105.0
211160           105.0
211161           112.0
211162           112.0
211163           112.0
211164            94.0
211165            94.0
211166            94.0
211167            95.0
211168            94.0
211169            94.0
211170            62.0
211171            76.0
211172            76.0
211173            88.0
211174            88.0
211175            88.0
211176            88.0
211177            60.0
211178            64.0
211179            64.0
211180            64.0
211181            92.0

Mean or Median

The fillna() function can also fill missing data using statistical measures like the mean or median of a column. These methods are often preferred because they provide a reasonable estimate of the missing values. Using the mean or median helps to avoid introducing bias that could occur with other methods, like filling with a constant value. Some argue that the median is ideal, as it is less affected by outliers.

Let’s show how you can fill in the median with sample data (last 30 rows). Let’s fill in the column 'AgeofInventory' with the median:

sample_data_age_inventory = myDF[['AgeOfInventory']].tail(30)

print("Original Data:")
print(sample_data_age_inventory)

filled_data_age_inventory = sample_data_age_inventory.fillna(sample_data_age_inventory.median())

print("Data After Filling with Median:")
print(filled_data_age_inventory)
Original Data:
        AgeOfInventory
211152           134.0
211153           155.0
211154             NaN
211155           137.0
211156            76.0
211157             NaN
211158           105.0
211159             NaN
211160             NaN
211161           112.0
211162             NaN
211163             NaN
211164            94.0
211165             NaN
211166             NaN
211167            95.0
211168            94.0
211169             NaN
211170            62.0
211171            76.0
211172             NaN
211173            88.0
211174             NaN
211175             NaN
211176             NaN
211177            60.0
211178            64.0
211179             NaN
211180             NaN
211181            92.0

Data After Filling with Median:
        AgeOfInventory
211152           134.0
211153           155.0
211154            94.0
211155           137.0
211156            76.0
211157            94.0
211158           105.0
211159            94.0
211160            94.0
211161           112.0
211162            94.0
211163            94.0
211164            94.0
211165            94.0
211166            94.0
211167            95.0
211168            94.0
211169            94.0
211170            62.0
211171            76.0
211172            94.0
211173            88.0
211174            94.0
211175            94.0
211176            94.0
211177            60.0
211178            64.0
211179            94.0
211180            94.0
211181            92.0