TDM 10100: R Project 13 — 2024
Motivation: It is fun and straightforward to do mapping in R!
Context: We will create several maps in R, including one with a Thanksgiving theme.
Scope: Maps in R are large, so when you download your Jupyter Lab project to your computer and then upload it to Gradescope, you won’t be able to easily view it in Gradescope, but that’s OK. The graders will still be able to see it.
Dataset(s)
This project will use the following dataset(s):
- /anvil/projects/tdm/data/craigslist/vehicles.csv (Craigslist vehicles)
- /anvil/projects/tdm/data/taxi/yellow/yellow_tripdata_2015-11.csv (New York City taxi cab data from November 2015)
Questions
Please use 3 cores in your Jupyter Lab session for this project.
Question 1 (2 pts)
First load: library(data.table) so that you have the fread function available, and also options(repr.matrix.max.cols=50) so that you can see 50 columns.
Also load: options(jupyter.rich_display = T) so that you can draw maps in Jupyter Lab.
For this project, we need two maps packages: library(leaflet) and library(sf)
For this first question, we make a simple map with three pairs of latitude and longitude values.
We call our data frame testDF, and it is important that we name the columns:
testDF <- data.frame(c(40.4259, 41.8781, 39.0792), c(-86.9081, -87.6298, -84.17704))
names(testDF) <- c("lat", "long")
The whole data frame looks like this, just three rows and two columns:
testDF
Now we can define the points that we want to plot on a map:
points <- st_as_sf( testDF, coords=c("long", "lat"), crs=4326)
and we can render the map. We make each dot have radius 1, but you are welcome to change the radius if you want to:
addCircleMarkers(addTiles(leaflet( testDF )), radius=1)
Show the map with the three points.
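The nested call above can also be written with the pipe operator that the leaflet package re-exports. The following is a minimal sketch of one equivalent form, assuming the leaflet package is installed. It names the coordinate columns explicitly with lng = ~long and lat = ~lat instead of relying on leaflet to guess them from the column names. The variable name mymap is only for illustration.

```r
library(leaflet)

# Same three points as in the text above
testDF <- data.frame(lat  = c(40.4259, 41.8781, 39.0792),
                     long = c(-86.9081, -87.6298, -84.17704))

# Pipe form of addCircleMarkers(addTiles(leaflet(testDF)), radius=1),
# with the coordinate columns named explicitly (illustrative choice)
mymap <- leaflet(testDF) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~long, lat = ~lat, radius = 1)

# To render the map in Jupyter Lab, evaluate mymap as the last line of a cell
```

Storing the map in a variable is optional; it simply makes it easy to add more layers later before displaying it.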
Question 2 (2 pts)
Now load the Craigslist data as follows:
myDF <- fread("/anvil/projects/tdm/data/craigslist/vehicles.csv",
stringsAsFactors = TRUE, nrows=100)
Examine the head of myDF and also the dim of myDF, and see which columns are the state and the long and the lat columns.
Now read ALL of the rows of the data set into a new data frame, but only the three columns called state and long and lat (the other columns will not be needed).
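One way to read only a few columns is the select argument of fread. The sketch below illustrates this on a small temporary file, not on the real vehicles.csv. The file contents and the names tmpfile and smallDF are hypothetical, and the data.table package is assumed to be installed.

```r
library(data.table)

# Hypothetical stand-in for vehicles.csv, written to a temporary file
tmpfile <- tempfile(fileext = ".csv")
writeLines(c("id,price,state,lat,long",
             "1,5000,in,40.4259,-86.9081",
             "2,7500,il,41.8781,-87.6298",
             "3,3000,in,,"), tmpfile)

# select= keeps only the named columns; leaving out nrows reads every row
smallDF <- fread(tmpfile, select = c("state", "long", "lat"))

dim(smallDF)    # number of rows and columns that were read
names(smallDF)  # only the three selected columns remain
```

Reading only the needed columns keeps memory use down, which matters more for the full Craigslist file than for this toy example.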
Make a subset of this new data frame, satisfying 3 conditions, namely, the state variable indicates that the data is from Indiana, and the long and lat values are not missing.
(state=="in") & (!is.na(long)) & (!is.na(lat))
You should now have a data frame with 3 columns and 5634 rows.
Display the dimension of your new data frame, which should have 3 columns and 5634 rows.
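To see how the three conditions combine, here is a small sketch on a hypothetical toy data frame (toyDF and indianaDF are illustrative names, and the values are made up). Only rows that pass all three tests are kept.

```r
# Hypothetical toy listings; "in" is the state code used for Indiana
toyDF <- data.frame(state = c("in", "in", "il", "in"),
                    long  = c(-86.9081, NA, -87.6298, -86.1581),
                    lat   = c(40.4259, 39.7684, 41.8781, NA))

# Keep a row only if it is from Indiana AND has both coordinates present
indianaDF <- subset(toyDF, (state == "in") & (!is.na(long)) & (!is.na(lat)))

dim(indianaDF)  # 1 row and 3 columns: only the first row passes all three tests
```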
Question 3 (2 pts)
Now make a plot of the data frame that you created in question 2, using these two lines of R:
points <- st_as_sf( mynewdataframe, coords=c("long", "lat"), crs=4326)
addCircleMarkers(addTiles(leaflet( mynewdataframe )), radius=1)
Please note that, with Craigslist, people can list items from anywhere in the country. So there are some items outside Indiana, even though we selected only the items that are supposed to be from the State of Indiana. BUT fortunately, you will see that most people’s listings are accurate. In other words, if you zoom in and out on the map, you will see that most of the dots appear in Indiana.
Show the map with the Craigslist data from Indiana. (Some of the data points will be outside Indiana, but most of them will be in the State of Indiana.)
Question 4 (2 pts)
In question 4 and question 5, we will verify the path of the Thanksgiving parade in New York City from 2015, as shown on this image:
You can import the New York taxi cab data from November 2015 as follows:
myDF <- fread("/anvil/projects/tdm/data/taxi/yellow/yellow_tripdata_2015-11.csv", tz="")
(The tz="" indicates that the time zone is not given in this data.)
Make a new data frame called thanksgivingdayDF by using the subset function, with the option grepl("2015-11-26", tpep_pickup_datetime) to extract the rows of the data from Thanksgiving day. Your new data frame should have 242393 rows and 19 columns.
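The grepl function returns TRUE for each string that contains the given pattern, so it works as a row filter on pickup times stored as text. A minimal sketch on hypothetical toy data (toyDF and toydayDF are illustrative names, and the rides are made up):

```r
# Hypothetical pickup times stored as strings, as in the taxi data
toyDF <- data.frame(
  tpep_pickup_datetime = c("2015-11-25 23:59:10", "2015-11-26 00:00:05",
                           "2015-11-26 10:15:00", "2015-11-27 08:30:00"),
  passenger_count = c(1, 2, 1, 3))

# Keep only the rows whose pickup string contains the date 2015-11-26
toydayDF <- subset(toyDF, grepl("2015-11-26", tpep_pickup_datetime))

nrow(toydayDF)  # 2 rows: the two rides on November 26
```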
Create two times in R, one for the start of the parade, and one for the end of the parade:
paradestart <- strptime("2015-11-26 09:00:00", format="%Y-%m-%d %H:%M:%S", tz="EST")
paradeend <- strptime("2015-11-26 12:00:00", format="%Y-%m-%d %H:%M:%S", tz="EST")
Now make a vector of times (converting the pickup times from the taxi cab rides from strings into times).
mytimes <- strptime(thanksgivingdayDF$tpep_pickup_datetime, format="%Y-%m-%d %H:%M:%S", tz="")
Finally, make a new data frame called finalDF from the data frame thanksgivingdayDF, using the subset function with the condition (mytimes > paradestart) & (mytimes < paradeend).
Your data frame finalDF should have 28704 rows.
Display the dimension of your data frame called finalDF, which should have 28704 rows.
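The time-window filter can be tried on a few made-up pickup times first. In the sketch below, toyDF, toyfinalDF, fmt, and keep are hypothetical names. As a simplifying assumption, all three calls to strptime use tz="EST" so that the result does not depend on the time zone of the R session; in the project code, the pickup times use tz="", which means the session's local time zone.

```r
# Hypothetical pickup times around the parade window
toyDF <- data.frame(
  tpep_pickup_datetime = c("2015-11-26 08:59:59", "2015-11-26 09:00:00",
                           "2015-11-26 10:30:00", "2015-11-26 12:00:01"))

fmt <- "%Y-%m-%d %H:%M:%S"
paradestart <- strptime("2015-11-26 09:00:00", format = fmt, tz = "EST")
paradeend   <- strptime("2015-11-26 12:00:00", format = fmt, tz = "EST")

# Convert the strings to times (same time zone as the parade times: an assumption)
mytimes <- strptime(toyDF$tpep_pickup_datetime, format = fmt, tz = "EST")

# Strict inequalities: a pickup at exactly 09:00:00 or 12:00:00 is excluded
keep <- (mytimes > paradestart) & (mytimes < paradeend)
toyfinalDF <- subset(toyDF, keep)

nrow(toyfinalDF)  # 1 row: only the 10:30:00 pickup is strictly inside the window
```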
Question 5 (2 pts)
If you examine the head of finalDF, you see that the latitude and longitude values are called pickup_latitude and pickup_longitude.
We want them to be called lat and long instead, so we can make a new data frame as follows:
testDF <- data.frame( finalDF$pickup_latitude, finalDF$pickup_longitude)
names(testDF) <- c("lat","long")
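As a small variation, the column names can also be given directly inside data.frame, which avoids the separate names step. This is only a sketch on hypothetical values; toyfinalDF stands in for finalDF.

```r
# Hypothetical stand-in for finalDF with two pickup locations
toyfinalDF <- data.frame(pickup_latitude  = c(40.7580, 40.7484),
                         pickup_longitude = c(-73.9855, -73.9857))

# Name the columns while building the data frame
testDF <- data.frame(lat  = toyfinalDF$pickup_latitude,
                     long = toyfinalDF$pickup_longitude)

names(testDF)  # "lat" "long"
```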
Finally, plot the latitude and longitude values from testDF using a smaller radius than you used in Question 1 and Question 3. We suggest radius=.1.
You will notice that taxi cabs were unable to pick up passengers on the route of the Thanksgiving Day parade because those roads were closed. Please zoom into the map and verify this, comparing your map to the parade route map:
Show the map with the data from Thanksgiving morning on November 26, 2015, at the time of the parade.
Because of the maps in this project, when you upload your work to Gradescope, it will say: "Large file hidden. You can download it using the button above." That is what the graders will do, namely, they will download it when they are grading it. This warning is expected because your maps are large, and that is totally OK.
Submitting your Work
This project gives you familiarity with mapping in R.
- firstname_lastname_project13.ipynb
You must double check your You will not receive full credit if your