Basic EDA Using the Vehicles Database
This example is based on questions 2-4 from TDM 102 Project 2 Spring 2024.
These example(s) depend on the database:
-
/anvil/projects/tdm/data/craigslist/vehicles.csv
This dataset consists of vehicle listings sourced from Craigslist.org. It includes comprehensive information provided by Craigslist on car sales, featuring details such as price, condition, manufacturer, latitude/longitude, and 18 additional categories.
Find more about the dataset here.
For the example below, you need to use 3 cores in your Jupyter Lab session. If you started your Jupyter Lab session with only 1 core, just close your Jupyter Lab session and start a new session that uses 3 cores. Otherwise, your kernel will crash when you load the data. We added a video about starting an anvil session with more cores |
Read in the dataset /anvil/projects/tdm/data/craigslist/vehicles.csv
into a pandas
DataFrame called mydf
. (Optional: If you want to, you can use the first column id
as the DataFrame’s index, but this is not required.)
myDF = pd.read_csv("/anvil/projects/tdm/data/craigslist/vehicles.csv",index_col=0)
First Five Rows of Vehicles Dataset
Display the first five rows of the mydf
DataFrame:
myDF.head()
id |
url |
region |
region_url |
price |
state |
7222695916 |
prescott |
6000 |
az |
||
7218891961 |
fayetteville |
11900 |
ar |
||
7221797935 |
florida keys |
21000 |
fl |
||
7222270760 |
worcester / central MA |
1500 |
ma |
||
7210384030 |
greensboro |
4900 |
nc |
5 rows × 25 columns
Last Five Rows of Vehicles Dataset
Display the last five rows of the mydf
DataFrame:
myDF.tail()
id |
url |
region |
region_url |
price |
year |
manufacturer |
model |
state |
… |
7301591192 |
wyoming |
23590 |
2019.0 |
nissan |
maxima s sedan 4d |
wy |
… |
||
7301591187 |
wyoming |
30590 |
2020.0 |
volvo |
s60 t5 momentum sedan 4d |
wy |
… |
||
7301591147 |
wyoming |
34990 |
2020.0 |
cadillac |
xt4 sport suv 4d |
wy |
… |
||
7301591140 |
wyoming |
28990 |
2018.0 |
lexus |
es 350 sedan 4d |
wy |
… |
||
7301591129 |
wyoming |
30590 |
2019.0 |
bmw |
4 series 430i gran coupe |
wy |
… |
5 rows × 25 columns
Rows and Columns in Vehicles Dataset
Display how many rows and columns there are in the entire DataFrame mydf using .info()
:
mydf.info()
<class 'pandas.core.frame.DataFrame'> Index: 426880 entries, 7222695916 to 7301591129 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 url 426880 non-null object 1 region 426880 non-null object 2 region_url 426880 non-null object 3 price 426880 non-null int64 4 year 425675 non-null float64 5 manufacturer 409234 non-null object 6 model 421603 non-null object 7 condition 252776 non-null object 8 cylinders 249202 non-null object 9 fuel 423867 non-null object 10 odometer 422480 non-null float64 11 title_status 418638 non-null object 12 transmission 424324 non-null object 13 VIN 265838 non-null object 14 drive 296313 non-null object 15 size 120519 non-null object 16 type 334022 non-null object 17 paint_color 296677 non-null object 18 image_url 426812 non-null object 19 description 426810 non-null object 20 county 0 non-null float64 21 state 426880 non-null object 22 lat 420331 non-null float64 23 long 420331 non-null float64 24 posting_date 426812 non-null object dtypes: float64(5), int64(1), object(19) memory usage: 84.7+ MB
Display how many rows and columns there are in the entire DataFrame mydf
using the shape
function:
rows, columns = mydf.shape
print(f'DataFrame has {rows} rows')
print(f'DataFrame has {columns} columns')
DataFrame has 426880 rows DataFrame has 25 columns
List of Columns in Vehicles Dataset
Display a list of all the column names in the DataFrame mydf
:
columns = mydf.columns
print(f"DataFrame's column name list: {columns.tolist()}")
DataFrame's column name list: ['url', 'region', 'region_url', 'price', 'year', 'manufacturer', 'model', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'VIN', 'drive', 'size', 'type', 'paint_color', 'image_url', 'description', 'county', 'state', 'lat', 'long', 'posting_date']
Price in Vehicles Dataset
Use the data from mydf
to answer how many vehicles have a price that is strictly larger than $6000:
rows,columns = mydf[mydf['price']>6000].shape
print(f'{rows} cars are more than $6000 in the list')
312633 cars are more than $6000 in the list
OR
<class 'pandas.core.frame.DataFrame'> Index: 312633 entries, 7218891961 to 7301591129 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 url 312633 non-null object 1 region 312633 non-null object 2 region_url 312633 non-null object 3 price 312633 non-null int64 4 year 311494 non-null float64 5 manufacturer 301006 non-null object 6 model 309569 non-null object 7 condition 191086 non-null object 8 cylinders 179159 non-null object 9 fuel 310303 non-null object 10 odometer 310804 non-null float64 11 title_status 307086 non-null object 12 transmission 311016 non-null object 13 VIN 214441 non-null object 14 drive 220821 non-null object 15 size 79013 non-null object 16 type 253846 non-null object 17 paint_color 227459 non-null object 18 image_url 312595 non-null object 19 description 312593 non-null object 20 county 0 non-null float64 21 state 312633 non-null object 22 lat 309371 non-null float64 23 long 309371 non-null float64 24 posting_date 312595 non-null object dtypes: float64(5), int64(1), object(19) memory usage: 62.0+ MB
Vehicles in Indiana:
Show how many vehicles are from Indiana:
mydf['state']= mydf['state'].str.lower()
rows,columns = mydf[mydf['state']=='in'].shape
print(f'{rows} cars are from indiana in the list')
5704 cars are from indiana in the list
Vehicles in Texas:
Show how many vehicles are from Texas:
rows,columns = mydf[mydf['state']=='tx'].shape
print(f'{rows} cars are from texas in the list')
22945 cars are from texas in the list
Regions in the Vehicles Dataset:
Display all of the regions listed in the data frame:
mydf['region'].unique().tolist()
['prescott', 'fayetteville', 'florida keys', 'worcester / central MA', 'greensboro', 'hudson valley', 'medford-ashland', 'erie', 'el paso', 'bellingham', 'skagit / island / SJI', 'la crosse', 'auburn', 'birmingham', 'dothan', 'florence / muscle shoals', 'gadsden-anniston', 'huntsville / decatur', 'mobile', 'montgomery', 'tuscaloosa', 'anchorage / mat-su', 'fairbanks', 'kenai peninsula', 'southeast alaska', 'flagstaff / sedona', 'mohave county', 'phoenix', 'show low', 'sierra vista', 'tucson', 'yuma', 'fort smith', 'jonesboro', 'little rock', 'texarkana', 'bakersfield', 'chico', 'fresno / madera', 'gold country', 'hanford-corcoran', 'humboldt county', 'imperial county', 'inland empire', 'los angeles', 'mendocino county', 'merced', 'modesto', 'monterey bay', 'orange county', 'palm springs', 'redding', 'reno / tahoe', 'sacramento', 'san diego', 'san luis obispo', 'santa barbara', 'santa maria', 'SF bay area', 'siskiyou county', 'stockton', 'susanville', 'ventura county', 'visalia-tulare', 'yuba-sutter', 'boulder', 'colorado springs', 'denver', 'eastern CO', 'fort collins / north CO', 'high rockies', 'pueblo', 'western slope', 'eastern CT', 'hartford', 'new haven', 'northwest CT', 'washington, DC', 'delaware', 'daytona beach', 'ft myers / SW florida', 'gainesville', 'heartland florida', 'jacksonville', 'lakeland', 'north central FL', 'ocala', 'okaloosa / walton', 'orlando', 'panama city', 'pensacola', 'sarasota-bradenton', 'south florida', 'space coast', 'st augustine', 'tallahassee', 'tampa bay area', 'treasure coast', 'albany', 'athens', 'atlanta', 'augusta', 'brunswick', 'columbus', 'macon / warner robins', 'northwest GA', 'savannah / hinesville', 'statesboro', 'valdosta', 'hawaii', 'boise', 'east idaho', 'lewiston / clarkston', 'pullman / moscow', "spokane / coeur d'alene", 'twin falls', 'bloomington-normal', 'champaign urbana', 'chicago', 'decatur', 'la salle co', 'mattoon-charleston', 'peoria', 'quad cities, IA/IL', 'rockford', 'southern illinois', 'springfield', 'st louis, MO', 'western IL', 'bloomington', 'evansville', 'fort wayne', 'indianapolis', 'kokomo', 'lafayette / west lafayette', 'muncie / anderson', 'richmond', 'south bend / michiana', 'terre haute', 'ames', 'cedar rapids', 'des moines', 'dubuque', 'fort dodge', 'iowa city', 'mason city', 'omaha / council bluffs', 'sioux city', 'southeast IA', 'waterloo / cedar falls', 'kansas city, MO', 'lawrence', 'manhattan', 'northwest KS', 'salina', 'southeast KS', 'southwest KS', 'topeka', 'wichita', 'bowling green', 'eastern kentucky', 'huntington-ashland', 'lexington', 'louisville', 'owensboro', 'western KY', 'baton rouge', 'central louisiana', 'houma', 'lafayette', 'lake charles', 'monroe', 'new orleans', 'shreveport', 'maine', 'annapolis', 'baltimore', 'cumberland valley', 'eastern shore', 'frederick', 'southern maryland', 'western maryland', 'boston', 'cape cod / islands', 'south coast', 'western massachusetts', 'ann arbor', 'battle creek', 'central michigan', 'detroit metro', 'flint', 'grand rapids', 'holland', 'jackson', 'kalamazoo', 'lansing', 'muskegon', 'northern michigan', 'port huron', 'saginaw-midland-baycity', 'southwest michigan', 'the thumb', 'upper peninsula', 'bemidji', 'brainerd', 'duluth / superior', 'fargo / moorhead', 'mankato', 'minneapolis / st paul', 'rochester', 'southwest MN', 'st cloud', 'gulfport / biloxi', 'hattiesburg', 'meridian', 'north mississippi', 'southwest MS', 'columbia / jeff city', 'joplin', 'kansas city', 'kirksville', 'lake of the ozarks', 'southeast missouri', 'st joseph', 'st louis', 'billings', 'bozeman', 'butte', 'eastern montana', 'great falls', 'helena', 'kalispell', 'missoula', 'asheville', 'boone', 'charlotte', 'eastern NC', 'hickory / lenoir', 'outer banks', 'raleigh / durham / CH', 'wilmington', 'winston-salem', 'grand island', 'lincoln', 'north platte', 'scottsbluff / panhandle', 'elko', 'las vegas', 'central NJ', 'jersey shore', 'north jersey', 'south jersey', 'albuquerque', 'clovis / portales', 'farmington', 'las cruces', 'roswell / carlsbad', 'santa fe / taos', 'binghamton', 'buffalo', 'catskills', 'chautauqua', 'elmira-corning', 'finger lakes', 'glens falls', 'ithaca', 'long island', 'new york city', 'oneonta', 'plattsburgh-adirondacks', 'potsdam-canton-massena', 'syracuse', 'twin tiers NY/PA', 'utica-rome-oneida', 'watertown', 'new hampshire', 'bismarck', 'grand forks', 'north dakota', 'akron / canton', 'ashtabula', 'chillicothe', 'cincinnati', 'cleveland', 'dayton / springfield', 'lima / findlay', 'mansfield', 'northern panhandle', 'parkersburg-marietta', 'sandusky', 'toledo', 'tuscarawas co', 'youngstown', 'zanesville / cambridge', 'fort smith, AR', 'lawton', 'northwest OK', 'oklahoma city', 'stillwater', 'texoma', 'tulsa', 'bend', 'corvallis/albany', 'east oregon', 'eugene', 'klamath falls', 'oregon coast', 'portland', 'roseburg', 'salem', 'altoona-johnstown', 'harrisburg', 'lancaster', 'lehigh valley', 'meadville', 'philadelphia', 'pittsburgh', 'poconos', 'reading', 'scranton / wilkes-barre', 'state college', 'williamsport', 'york', 'rhode island', 'charleston', 'columbia', 'florence', 'greenville / upstate', 'hilton head', 'myrtle beach', 'northeast SD', 'pierre / central SD', 'rapid city / west SD', 'sioux falls / SE SD', 'south dakota', 'chattanooga', 'clarksville', 'cookeville', 'knoxville', 'memphis', 'nashville', 'tri-cities', 'abilene', 'amarillo', 'austin', 'beaumont / port arthur', 'brownsville', 'college station', 'corpus christi', 'dallas / fort worth', 'deep east texas', 'del rio / eagle pass', 'galveston', 'houston', 'killeen / temple / ft hood', 'laredo', 'lubbock', 'mcallen / edinburg', 'odessa / midland', 'san angelo', 'san antonio', 'san marcos', 'southwest TX', 'tyler / east TX', 'victoria', 'waco', 'wichita falls', 'logan', 'ogden-clearfield', 'provo / orem', 'salt lake city', 'st george', 'vermont', 'charlottesville', 'danville', 'fredericksburg', 'harrisonburg', 'lynchburg', 'new river valley', 'norfolk / hampton roads', 'roanoke', 'southwest VA', 'winchester', 'kennewick-pasco-richland', 'moses lake', 'olympic peninsula', 'seattle-tacoma', 'wenatchee', 'yakima', 'eastern panhandle', 'morgantown', 'southern WV', 'west virginia (old)', 'appleton-oshkosh-FDL', 'eau claire', 'green bay', 'janesville', 'kenosha-racine', 'madison', 'milwaukee', 'northern WI', 'sheboygan', 'wausau', 'wyoming']
len(mydf['region'].unique().tolist())
404
OR you could use the nunique
function:
mydf['region'].nunique()
404