Basic EDA Using the Vehicles Database

This example is based on questions 2-4 from TDM 102 Project 2 Spring 2024.

These example(s) depend on the database:

  • /anvil/projects/tdm/data/craigslist/vehicles.csv

This dataset consists of vehicle listings sourced from Craigslist.org. It includes comprehensive information provided by Craigslist on car sales, featuring details such as price, condition, manufacturer, latitude/longitude, and 18 additional categories.

Find more about the dataset here.

For the example below, you need to use 3 cores in your Jupyter Lab session. If you started your Jupyter Lab session with only 1 core, just close your Jupyter Lab session and start a new session that uses 3 cores. Otherwise, your kernel will crash when you load the data.

Read in the dataset /anvil/projects/tdm/data/craigslist/vehicles.csv into a pandas DataFrame called mydf. (Optional: If you want to, you can use the first column id as the DataFrame’s index, but this is not required.)

myDF = pd.read_csv("/anvil/projects/tdm/data/craigslist/vehicles.csv",index_col=0)

First Five Rows of Vehicles Dataset

Display the first five rows of the mydf DataFrame:

myDF.head()

id

url

region

region_url

price

state

7222695916

prescott.craigslist.org/cto/d/prescott…​;

prescott

prescott.craigslist.org

6000

az

7218891961

fayar.craigslist.org/ctd/d/bentonville…​;

fayetteville

fayar.craigslist.org

11900

ar

7221797935

keys.craigslist.org/cto/d/summerland-k…​;

florida keys

keys.craigslist.org

21000

fl

7222270760

worcester.craigslist.org/cto/d/west-br…​;

worcester / central MA

worcester.craigslist.org

1500

ma

7210384030

greensboro.craigslist.org/cto/d/trinit…​;

greensboro

greensboro.craigslist.org

4900

nc

5 rows × 25 columns

Last Five Rows of Vehicles Dataset

Display the last five rows of the mydf DataFrame:

myDF.tail()

id

url

region

region_url

price

year

manufacturer

model

state

…​

7301591192

wyoming.craigslist.org/ctd/…​;

wyoming

wyoming.craigslist.org

23590

2019.0

nissan

maxima s sedan 4d

wy

…​

7301591187

wyoming.craigslist.org/ctd/…​;

wyoming

wyoming.craigslist.org

30590

2020.0

volvo

s60 t5 momentum sedan 4d

wy

…​

7301591147

wyoming.craigslist.org/ctd/…​;

wyoming

wyoming.craigslist.org

34990

2020.0

cadillac

xt4 sport suv 4d

wy

…​

7301591140

wyoming.craigslist.org/ctd/…​;

wyoming

wyoming.craigslist.org

28990

2018.0

lexus

es 350 sedan 4d

wy

…​

7301591129

wyoming.craigslist.org/ctd/…​;

wyoming

wyoming.craigslist.org

30590

2019.0

bmw

4 series 430i gran coupe

wy

…​

5 rows × 25 columns

Rows and Columns in Vehicles Dataset

Display how many rows and columns there are in the entire DataFrame mydf using .info():

mydf.info()
<class 'pandas.core.frame.DataFrame'>
Index: 426880 entries, 7222695916 to 7301591129
Data columns (total 25 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   url           426880 non-null  object
 1   region        426880 non-null  object
 2   region_url    426880 non-null  object
 3   price         426880 non-null  int64
 4   year          425675 non-null  float64
 5   manufacturer  409234 non-null  object
 6   model         421603 non-null  object
 7   condition     252776 non-null  object
 8   cylinders     249202 non-null  object
 9   fuel          423867 non-null  object
 10  odometer      422480 non-null  float64
 11  title_status  418638 non-null  object
 12  transmission  424324 non-null  object
 13  VIN           265838 non-null  object
 14  drive         296313 non-null  object
 15  size          120519 non-null  object
 16  type          334022 non-null  object
 17  paint_color   296677 non-null  object
 18  image_url     426812 non-null  object
 19  description   426810 non-null  object
 20  county        0 non-null       float64
 21  state         426880 non-null  object
 22  lat           420331 non-null  float64
 23  long          420331 non-null  float64
 24  posting_date  426812 non-null  object
dtypes: float64(5), int64(1), object(19)
memory usage: 84.7+ MB

Display how many rows and columns there are in the entire DataFrame mydf using the shape function:

rows, columns = mydf.shape
print(f'DataFrame has {rows} rows')
print(f'DataFrame has {columns} columns')
DataFrame has 426880 rows
DataFrame has 25 columns

List of Columns in Vehicles Dataset

Display a list of all the column names in the DataFrame mydf:

columns = mydf.columns
print(f"DataFrame's column name list: {columns.tolist()}")
DataFrame's column name list: ['url', 'region', 'region_url', 'price', 'year', 'manufacturer', 'model', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'VIN', 'drive', 'size', 'type', 'paint_color', 'image_url', 'description', 'county', 'state', 'lat', 'long', 'posting_date']

Price in Vehicles Dataset

Use the data from mydf to answer how many vehicles have a price that is strictly larger than $6000:

rows,columns = mydf[mydf['price']>6000].shape
print(f'{rows} cars are more than $6000 in the list')
312633 cars are more than $6000 in the list

OR

<class 'pandas.core.frame.DataFrame'>
Index: 312633 entries, 7218891961 to 7301591129
Data columns (total 25 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   url           312633 non-null  object
 1   region        312633 non-null  object
 2   region_url    312633 non-null  object
 3   price         312633 non-null  int64
 4   year          311494 non-null  float64
 5   manufacturer  301006 non-null  object
 6   model         309569 non-null  object
 7   condition     191086 non-null  object
 8   cylinders     179159 non-null  object
 9   fuel          310303 non-null  object
 10  odometer      310804 non-null  float64
 11  title_status  307086 non-null  object
 12  transmission  311016 non-null  object
 13  VIN           214441 non-null  object
 14  drive         220821 non-null  object
 15  size          79013 non-null   object
 16  type          253846 non-null  object
 17  paint_color   227459 non-null  object
 18  image_url     312595 non-null  object
 19  description   312593 non-null  object
 20  county        0 non-null       float64
 21  state         312633 non-null  object
 22  lat           309371 non-null  float64
 23  long          309371 non-null  float64
 24  posting_date  312595 non-null  object
dtypes: float64(5), int64(1), object(19)
memory usage: 62.0+ MB

Vehicles in Indiana:

Show how many vehicles are from Indiana:

mydf['state']= mydf['state'].str.lower()
rows,columns = mydf[mydf['state']=='in'].shape
print(f'{rows} cars are from indiana in the list')
5704 cars are from indiana in the list

Vehicles in Texas:

Show how many vehicles are from Texas:

rows,columns = mydf[mydf['state']=='tx'].shape
print(f'{rows} cars are from texas in the list')
22945 cars are from texas in the list

Regions in the Vehicles Dataset:

Display all of the regions listed in the data frame:

mydf['region'].unique().tolist()
['prescott',
 'fayetteville',
 'florida keys',
 'worcester / central MA',
 'greensboro',
 'hudson valley',
 'medford-ashland',
 'erie',
 'el paso',
 'bellingham',
 'skagit / island / SJI',
 'la crosse',
 'auburn',
 'birmingham',
 'dothan',
 'florence / muscle shoals',
 'gadsden-anniston',
 'huntsville / decatur',
 'mobile',
 'montgomery',
 'tuscaloosa',
 'anchorage / mat-su',
 'fairbanks',
 'kenai peninsula',
 'southeast alaska',
 'flagstaff / sedona',
 'mohave county',
 'phoenix',
 'show low',
 'sierra vista',
 'tucson',
 'yuma',
 'fort smith',
 'jonesboro',
 'little rock',
 'texarkana',
 'bakersfield',
 'chico',
 'fresno / madera',
 'gold country',
 'hanford-corcoran',
 'humboldt county',
 'imperial county',
 'inland empire',
 'los angeles',
 'mendocino county',
 'merced',
 'modesto',
 'monterey bay',
 'orange county',
 'palm springs',
 'redding',
 'reno / tahoe',
 'sacramento',
 'san diego',
 'san luis obispo',
 'santa barbara',
 'santa maria',
 'SF bay area',
 'siskiyou county',
 'stockton',
 'susanville',
 'ventura county',
 'visalia-tulare',
 'yuba-sutter',
 'boulder',
 'colorado springs',
 'denver',
 'eastern CO',
 'fort collins / north CO',
 'high rockies',
 'pueblo',
 'western slope',
 'eastern CT',
 'hartford',
 'new haven',
 'northwest CT',
 'washington, DC',
 'delaware',
 'daytona beach',
 'ft myers / SW florida',
 'gainesville',
 'heartland florida',
 'jacksonville',
 'lakeland',
 'north central FL',
 'ocala',
 'okaloosa / walton',
 'orlando',
 'panama city',
 'pensacola',
 'sarasota-bradenton',
 'south florida',
 'space coast',
 'st augustine',
 'tallahassee',
 'tampa bay area',
 'treasure coast',
 'albany',
 'athens',
 'atlanta',
 'augusta',
 'brunswick',
 'columbus',
 'macon / warner robins',
 'northwest GA',
 'savannah / hinesville',
 'statesboro',
 'valdosta',
 'hawaii',
 'boise',
 'east idaho',
 'lewiston / clarkston',
 'pullman / moscow',
 "spokane / coeur d'alene",
 'twin falls',
 'bloomington-normal',
 'champaign urbana',
 'chicago',
 'decatur',
 'la salle co',
 'mattoon-charleston',
 'peoria',
 'quad cities, IA/IL',
 'rockford',
 'southern illinois',
 'springfield',
 'st louis, MO',
 'western IL',
 'bloomington',
 'evansville',
 'fort wayne',
 'indianapolis',
 'kokomo',
 'lafayette / west lafayette',
 'muncie / anderson',
 'richmond',
 'south bend / michiana',
 'terre haute',
 'ames',
 'cedar rapids',
 'des moines',
 'dubuque',
 'fort dodge',
 'iowa city',
 'mason city',
 'omaha / council bluffs',
 'sioux city',
 'southeast IA',
 'waterloo / cedar falls',
 'kansas city, MO',
 'lawrence',
 'manhattan',
 'northwest KS',
 'salina',
 'southeast KS',
 'southwest KS',
 'topeka',
 'wichita',
 'bowling green',
 'eastern kentucky',
 'huntington-ashland',
 'lexington',
 'louisville',
 'owensboro',
 'western KY',
 'baton rouge',
 'central louisiana',
 'houma',
 'lafayette',
 'lake charles',
 'monroe',
 'new orleans',
 'shreveport',
 'maine',
 'annapolis',
 'baltimore',
 'cumberland valley',
 'eastern shore',
 'frederick',
 'southern maryland',
 'western maryland',
 'boston',
 'cape cod / islands',
 'south coast',
 'western massachusetts',
 'ann arbor',
 'battle creek',
 'central michigan',
 'detroit metro',
 'flint',
 'grand rapids',
 'holland',
 'jackson',
 'kalamazoo',
 'lansing',
 'muskegon',
 'northern michigan',
 'port huron',
 'saginaw-midland-baycity',
 'southwest michigan',
 'the thumb',
 'upper peninsula',
 'bemidji',
 'brainerd',
 'duluth / superior',
 'fargo / moorhead',
 'mankato',
 'minneapolis / st paul',
 'rochester',
 'southwest MN',
 'st cloud',
 'gulfport / biloxi',
 'hattiesburg',
 'meridian',
 'north mississippi',
 'southwest MS',
 'columbia / jeff city',
 'joplin',
 'kansas city',
 'kirksville',
 'lake of the ozarks',
 'southeast missouri',
 'st joseph',
 'st louis',
 'billings',
 'bozeman',
 'butte',
 'eastern montana',
 'great falls',
 'helena',
 'kalispell',
 'missoula',
 'asheville',
 'boone',
 'charlotte',
 'eastern NC',
 'hickory / lenoir',
 'outer banks',
 'raleigh / durham / CH',
 'wilmington',
 'winston-salem',
 'grand island',
 'lincoln',
 'north platte',
 'scottsbluff / panhandle',
 'elko',
 'las vegas',
 'central NJ',
 'jersey shore',
 'north jersey',
 'south jersey',
 'albuquerque',
 'clovis / portales',
 'farmington',
 'las cruces',
 'roswell / carlsbad',
 'santa fe / taos',
 'binghamton',
 'buffalo',
 'catskills',
 'chautauqua',
 'elmira-corning',
 'finger lakes',
 'glens falls',
 'ithaca',
 'long island',
 'new york city',
 'oneonta',
 'plattsburgh-adirondacks',
 'potsdam-canton-massena',
 'syracuse',
 'twin tiers NY/PA',
 'utica-rome-oneida',
 'watertown',
 'new hampshire',
 'bismarck',
 'grand forks',
 'north dakota',
 'akron / canton',
 'ashtabula',
 'chillicothe',
 'cincinnati',
 'cleveland',
 'dayton / springfield',
 'lima / findlay',
 'mansfield',
 'northern panhandle',
 'parkersburg-marietta',
 'sandusky',
 'toledo',
 'tuscarawas co',
 'youngstown',
 'zanesville / cambridge',
 'fort smith, AR',
 'lawton',
 'northwest OK',
 'oklahoma city',
 'stillwater',
 'texoma',
 'tulsa',
 'bend',
 'corvallis/albany',
 'east oregon',
 'eugene',
 'klamath falls',
 'oregon coast',
 'portland',
 'roseburg',
 'salem',
 'altoona-johnstown',
 'harrisburg',
 'lancaster',
 'lehigh valley',
 'meadville',
 'philadelphia',
 'pittsburgh',
 'poconos',
 'reading',
 'scranton / wilkes-barre',
 'state college',
 'williamsport',
 'york',
 'rhode island',
 'charleston',
 'columbia',
 'florence',
 'greenville / upstate',
 'hilton head',
 'myrtle beach',
 'northeast SD',
 'pierre / central SD',
 'rapid city / west SD',
 'sioux falls / SE SD',
 'south dakota',
 'chattanooga',
 'clarksville',
 'cookeville',
 'knoxville',
 'memphis',
 'nashville',
 'tri-cities',
 'abilene',
 'amarillo',
 'austin',
 'beaumont / port arthur',
 'brownsville',
 'college station',
 'corpus christi',
 'dallas / fort worth',
 'deep east texas',
 'del rio / eagle pass',
 'galveston',
 'houston',
 'killeen / temple / ft hood',
 'laredo',
 'lubbock',
 'mcallen / edinburg',
 'odessa / midland',
 'san angelo',
 'san antonio',
 'san marcos',
 'southwest TX',
 'tyler / east TX',
 'victoria',
 'waco',
 'wichita falls',
 'logan',
 'ogden-clearfield',
 'provo / orem',
 'salt lake city',
 'st george',
 'vermont',
 'charlottesville',
 'danville',
 'fredericksburg',
 'harrisonburg',
 'lynchburg',
 'new river valley',
 'norfolk / hampton roads',
 'roanoke',
 'southwest VA',
 'winchester',
 'kennewick-pasco-richland',
 'moses lake',
 'olympic peninsula',
 'seattle-tacoma',
 'wenatchee',
 'yakima',
 'eastern panhandle',
 'morgantown',
 'southern WV',
 'west virginia (old)',
 'appleton-oshkosh-FDL',
 'eau claire',
 'green bay',
 'janesville',
 'kenosha-racine',
 'madison',
 'milwaukee',
 'northern WI',
 'sheboygan',
 'wausau',
 'wyoming']
len(mydf['region'].unique().tolist())
404

OR you could use the nunique function:

mydf['region'].nunique()
404