Introduction

We are going to continue to see if we can uncover clues that will help the investigators uncover the murderer.

To do this we are going to use some controversial data, phone records.

In this case we simulated the data as an example, but it is important to note that many companies and applications do track location data for their users.

Geospatial Data

Today we get to work with a really interesting type of data, geospatial data.

This data looks very similar to the data that we’ve used so far with one important difference. This data has information about the location in the world.

Location can be found in different ways. In this case we have latitude and longitdue.

The other cool part about geospatial data is that we have many open-source options. For this part of the process we’ll start with the DC street map data that is provided by the city.

Practice Exercises

  1. We will start by importing all the different Python packages that we will need and taking a look at the DC street map data.

    Click to see solution
    import pandas as pd
    import geopandas as gpd
    import matplotlib.pyplot as plt
    import random
    import numpy as np
    
    from datetime import datetime, timedelta
    from shapely.geometry import Point, Polygon
    dc_streets = gpd.read_file('../data/dc_roads/Roads.shp')
    fig, ax = plt.subplots(figsize = (15,12))
    
    dc_streets.plot(ax = ax)
    
    plt.show()
    plt.close('all')
    High Level View of the Streets of DC
    Figure 1. High Level View of the Streets of DC

    This is a very interesting map, but it’s hard to read. We will do something about that, but first we can take a look at the raw data as well.

  2. Let’s print out the raw data to better understand the dataset.

    Click to see solution
    print(dc_streets.head())
       FEATURECOD   DESCRIPTIO  CAPTUREYEA CAPTUREACT     GIS_ID  OBJECTID  \
    0        1060        Alley  2015-04-24          E  RoadPly_1         1
    1        1065  Paved Drive  2015-04-24          E  RoadPly_2         2
    2        1070  Parking Lot  2015-04-24          E  RoadPly_3         3
    3        1050         Road  2015-04-24          E  RoadPly_4         4
    4        1050         Road  2015-04-24          E  RoadPly_5         5
    
       SHAPEAREA  SHAPELEN                                           geometry
    0          0         0  POLYGON ((-77.07695 38.92945, -77.07686 38.929...
    1          0         0  POLYGON ((-77.07839 38.93672, -77.07839 38.936...
    2          0         0  POLYGON ((-77.07602 38.94230, -77.07613 38.942...
    3          0         0  POLYGON ((-77.07870 38.94405, -77.07870 38.943...
    4          0         0  POLYGON ((-77.07542 38.92373, -77.07543 38.923...

    We can see that the DC streets data looks pretty similar to what we’ve been working with.

    The biggest difference are the geometry rows. These are sets of latitude and longitude points that Python understands spatially.

    This means that we can use these points for different sptial functions, like finding distance.

    First though, lets cut down our huge map to our area of interest. We’ll define our general area of interest around the Mall using Google Maps.

  3. Let’s cut down our map to a smaller scale.

    Click to see solution
    area_of_interest = [-77.062859, 38.880868, -76.982087, 38.915758]
    
    smaller_map = gpd.clip(dc_streets, area_of_interest)
    fig, ax = plt.subplots(figsize = (15,15))
    
    smaller_map.plot(ax = ax)
    plt.plot(-76.9926056723681, 38.90839920511692, c='orange', marker="*", markersize=30)
    
    plt.show()
    plt.close('all')
    Focused View of DC Streets
    Figure 2. Focused View of DC Streets
  4. Now that we have our smaller map we can take a look at our cell phone data.

    Click to see solution
    cell_phone_data = gpd.read_file('../data/cell_phone_records.geojson')
    print(cell_phone_data.head())
                         date  id        lat        lon  \
    0 2022-05-02 00:17:14.404   0  38.890393 -77.011107
    1 2022-05-02 00:27:14.404   0  38.905440 -76.982952
    2 2022-05-02 00:37:14.404   0  38.901316 -76.991544
    3 2022-05-02 00:47:14.404   0  38.908996 -77.048789
    4 2022-05-02 00:57:14.404   0  38.893913 -77.032013
    
                         geometry
    0  POINT (-77.01111 38.89039)
    1  POINT (-76.98295 38.90544)
    2  POINT (-76.99154 38.90132)
    3  POINT (-77.04879 38.90900)
    4  POINT (-77.03201 38.89391)

    This data looks pretty similar to our DC streets data. It has dates and times as well as the ID for the caller. The most important parts are the lat and lon columns that we can use to map the caller’s path.

  5. Let’s map out our data and see what it looks like!

    Details
    n = len(cell_phone_data['id'].unique())
    color = iter(plt.cm.rainbow(np.linspace(0, 1, n)))
    
    fig, ax1 = plt.subplots(1, 1, figsize=(15, 8))
    
    smaller_map.plot(ax = ax1)
    plt.plot(-76.9926056723681, 38.90839920511692, c='orange', marker="*", markersize=30)
    
    for i in range(0, cell_phone_data['id'].max()):
        person = cell_phone_data.loc[cell_phone_data['id'] == i].sort_values(by='date')
        plt.plot(person['lon'], person['lat'], c=next(color), linestyle='--')
    
    plt.show()
    plt.close('all')
    Map of DC with Phone Paths
    Figure 3. Map of DC with Phone Paths

    This is an interesting map, but it’s very noisy. For our next step lets see if we can limit the map to only calls that went near where our victim was found.

    In this section you’ll see references to crs with some odd looking numbers and letters after them like epsg:4326. These are what are known as coordinate reference systems.

    Their specific detail is beyond the scope of this workshop, but at a high-level they help convert coordinates to address different ways of looking at the earth.

    In our case, we have to convert the CRS' to something that is better for measuring distance in North America.

  6. Lets calculate the distance from our victim’s location to all of the lat/lon points that we have in the phone records.

    Details
    starting_point = gpd.GeoSeries([Point(-77.03718028811417, 38.88978312185629) for i in range(len(cell_phone_data))], crs='epsg:4326')
    
    cell_phone_data = cell_phone_data.to_crs('EPSG:32633')
    starting_point = starting_point.to_crs('EPSG:32633')
    
    cell_phone_data['distance'] = cell_phone_data.distance(starting_point)
    
    cell_phone_data['distance_miles'] = cell_phone_data['distance'] * 0.000621371
    
    print(cell_phone_data[cell_phone_data['distance_miles'] < 1])
                           date  id        lat        lon  \
    4   2022-05-02 00:57:14.404   0  38.893913 -77.032013
    10  2022-05-02 01:57:14.404   0  38.881946 -77.035672
    12  2022-05-02 02:17:14.404   0  38.892572 -77.026673
    18  2022-05-02 03:17:14.404   0  38.886712 -77.041343
    22  2022-05-02 03:57:14.404   0  38.885509 -77.046967
    ..                      ...  ..        ...        ...
    758 2022-05-01 20:30:00.000  25  38.896051 -77.043628
    762 2022-05-01 21:10:00.000  25  38.889835 -77.040619
    763 2022-05-01 21:20:00.000  25  38.892248 -77.036607
    766 2022-05-01 21:50:00.000  25  38.895527 -77.029528
    767 2022-05-01 22:00:00.000  25  38.892288 -77.033660
    
                                  geometry     distance  distance_miles
    4    POINT (-6130636.381 10277516.684)  1016.764717        0.631788
    10   POINT (-6132711.509 10278137.551)  1395.192331        0.866932
    12   POINT (-6130913.298 10276796.415)  1526.595396        0.948582
    18   POINT (-6131829.871 10278869.453)   787.591153        0.489386
    22   POINT (-6131997.658 10279654.020)  1542.512663        0.958473
    ..                                 ...          ...             ...
    758  POINT (-6130170.922 10279090.606)  1415.560118        0.879588
    762  POINT (-6131286.540 10278738.958)   473.128919        0.293989
    763  POINT (-6130893.552 10278164.074)   441.023789        0.274039
    766  POINT (-6130371.878 10277159.468)  1459.603516        0.906955
    767  POINT (-6130909.354 10277758.863)   654.950097        0.406967
  7. Now we can use visualizations to see where a good cutoff point may be for "close" to our victim.

    Details
    fix, ax1 = plt.subplots(1, 1, figsize=(8,6))
    
    ax1 = plt.hist(cell_phone_data['distance_miles'], bins=25)
    
    plt.show()
    plt.close('all')
    Histogram of Distance from Point of Interest
    Figure 4. Histogram of Distance from Point of Interest
  8. This helps, but let’s look at distances less than 1 mile.

    Details
    fix, ax1 = plt.subplots(1, 1, figsize=(8,6))
    
    ax1 = plt.hist(cell_phone_data.loc[cell_phone_data['distance_miles'] < 1]['distance_miles'], bins=25)
    
    plt.show()
    plt.close('all')
    Histogram of Distance from Point of Interest < 1 Mile
    Figure 5. Histogram of Distance from Point of Interest < 1 Mile

    Based on the histogram it looks like we could use half a mile as a good cutoff.

  9. Using half a mile lets see what our updated map looks like.

    Details
    id_of_interest = cell_phone_data.loc[cell_phone_data['distance_miles'] < 0.5]['id'].unique()
    cell_phone_data['close_point'] = cell_phone_data['id'].isin(id_of_interest)
    cell_phone_data_reduced = cell_phone_data.loc[cell_phone_data['close_point'] == True].reset_index()
    n = len(cell_phone_data_reduced['id'].unique())
    color_1 = iter(plt.cm.rainbow(np.linspace(0, 1, n)))
    
    fig, ax1 = plt.subplots(1, 1, figsize=(15, 8))
    
    smaller_map.plot(ax = ax1)
    plt.plot(-76.9926056723681, 38.90839920511692, c='orange', marker="*", markersize=30)
    
    for i in cell_phone_data_reduced['id'].unique():
        person = cell_phone_data_reduced.loc[cell_phone_data_reduced['id'] == i].sort_values(by='date')
        plt.plot(person['lon'], person['lat'], c=next(color_1), linestyle='--')
    
    plt.show()
    plt.close('all')
    DC Street Map with Distance Filtered Call Routes
    Figure 6. DC Street Map with Distance Filtered Call Routes

    We’ve reduced the noise some, but it’s still pretty hard to read.

    The investigators told us that due to the location the crime had to take place at nigh. We can use the time of the calls to see if we can reduce our data further.

  10. We will extract time and then filter to calls that only happened late at night.

    Details
    cell_phone_data_reduced['hour'] = cell_phone_data_reduced['date'].apply(lambda x: x.hour)
    cell_phone_data_reduced['minute'] = cell_phone_data_reduced['date'].apply(lambda x: x.minute)
    cell_phone_data_reduced.head()
       index                    date  id        lat        lon  \
    0      0 2022-05-02 00:17:14.404   0  38.890393 -77.011107
    1      1 2022-05-02 00:27:14.404   0  38.905440 -76.982952
    2      2 2022-05-02 00:37:14.404   0  38.901316 -76.991544
    3      3 2022-05-02 00:47:14.404   0  38.908996 -77.048789
    4      4 2022-05-02 00:57:14.404   0  38.893913 -77.032013
    
                                geometry     distance  distance_miles  \
    0  POINT (-6131415.531 10274679.579)  3588.816997        2.229987
    1  POINT (-6128984.462 10270666.938)  7951.698847        4.940955
    2  POINT (-6129644.341 10271886.629)  6597.517065        4.099506
    3  POINT (-6127856.245 10279670.550)  3739.428950        2.323573
    4  POINT (-6130636.381 10277516.684)  1016.764717        0.631788
    
       close_point  hour  minute
    0         True     0      17
    1         True     0      27
    2         True     0      37
    3         True     0      47
    4         True     0      57

    Now that we have hour and minute extracted lets filter our data to calls between 11 PM and 4 AM and map them.

    cell_phone_data_reduced_night = cell_phone_data_reduced.loc[(cell_phone_data_reduced['hour'] > 23) | (cell_phone_data_reduced['hour'] < 4)]
    
    id_count = cell_phone_data_reduced_night['id'].unique()
    print("We have {} late night IDs".format(len(id_count)))
    We have 7 late night IDs
    cell_phone_data['final_id'] = cell_phone_data['id'].isin(id_count)
    final_data = cell_phone_data.loc[cell_phone_data['final_id'] == True].reset_index()
    n = len(final_data['id'].unique())
    color_1 = iter(plt.cm.rainbow(np.linspace(0, 1, n)))
    
    fig, ax1 = plt.subplots(1, 1, figsize=(15, 8))
    
    smaller_map.plot(ax = ax1)
    plt.plot(-76.9926056723681, 38.90839920511692, c='orange', marker="*", markersize=30)
    
    for i in final_data['id'].unique():
        person = final_data.loc[final_data['id'] == i].sort_values(by='date')
        plt.plot(person['lon'], person['lat'], c=next(color_1), linestyle='--', label="caller {}".format(i))
    
    plt.legend()
    plt.show()
    plt.close('all')
    DC Street Map with Calls Filtered by Time
    Figure 7. DC Street Map with Calls Filtered by Time

    Narrowing our data down to 7 suspects is great!

    To help the investigators we can also break each of the 7 calls into its own graph.

  11. Let’s create a new vis that has a separate graph for each of the late-night calls.

    Details
    ids_to_plot = final_data['id'].unique()
    color_1 = iter(plt.cm.rainbow(np.linspace(0, 1, 7)))
    
    fig, axs = plt.subplots(nrows=3, ncols=3, figsize=(25, 20))
    
    for id, ax in zip(ids_to_plot, axs.ravel()):
        smaller_map.plot(ax = ax, alpha=0.25)
        single_caller = final_data.loc[final_data['id'] == id]
        ax.plot(single_caller['lon'], single_caller['lat'], c=next(color_1), linestyle='--')
        ax.plot(-76.9926056723681, 38.90839920511692, c='orange', marker="*", markersize=30)
    DC Street Map with Individual Calls - Part 1
    Figure 8. DC Street Map with Individual Calls - Part 1
    DC Street Map with Individual Calls - Part 2
    Figure 9. DC Street Map with Individual Calls - Part 2

    As a last analysis step we can create an animation that will show the caller’s path through DC.

  12. Let’s animate the paths based on id.

    Details
    from matplotlib.animation import FuncAnimation
    from IPython.display import HTML
    # We can use this list ot pick the ID that we are interested in.
    print(final_data['id'].unique())
    [ 0  6 14 15 16 24 25]
    data_subset = final_data.loc[final_data['id'] == 25].reset_index()
    
    fig = plt.figure(figsize=(15, 10))
    ax = plt.axes(xlim=(-77.062859, -76.982087), ylim=(38.880868, 38.915758))
    line, = ax.plot([], [], lw=2)
    
    x_data = []
    y_data = []
    
    def init():
        line.set_data([], [])
        return line,
    
    def animate(i):
        x_data.append(data_subset['lon'][i])
        y_data.append(data_subset['lat'][i])
    
        line.set_xdata(x_data)
        line.set_ydata(y_data)
    
        return line
    
    smaller_map.plot(ax = ax, alpha=0.25)
    anim = FuncAnimation(fig, animate, frames=len(data_subset), init_func=init, interval=300)
    
    HTML(anim.to_jshtml())

    This will create an animation that will display in our browser.

    We can update the final_data.loc[final_data['id'] == 25] ID to animate the different paths.

    Animation of the Caller’s Path
    Figure 10. Animation of the Caller’s Path

    We can pass these 7 suspects on to our next phase of investigation!