TDM 10100: R Project 13 — 2024

Motivation: It is fun and straightforward to do mapping in R!

Context: We will create several maps in R, including one with a Thanksgiving theme.

Scope: Maps in R are large, so when you download your Jupyter Lab project to your computer and then upload it to Gradescope, you won’t be able to easily view it in Gradescope, but that’s OK. The graders will still be able to see it.

Learning Objectives:
  • Learn how to make maps in R.

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following dataset(s):

  • /anvil/projects/tdm/data/craigslist/vehicles.csv (Craigslist vehicles)

  • /anvil/projects/tdm/data/taxi/yellow/yellow_tripdata_2015-11.csv (New York City taxi cab data from November 2015)

Questions

Please use 3 cores in your Jupyter Lab session for this project.

Question 1 (2 pts)

First load: library(data.table) so that you have the fread function available, and also options(repr.matrix.max.cols=50) so that you can see 50 columns.

Also load: options(jupyter.rich_display = T) so that you can draw maps in Jupyter Lab.

For this project, we need two maps packages: library(leaflet) and library(sf)

For this first question, we simply make a simple map, with three latitude and longitude values.

We call our data frame testDF, and it is important that we name the columns:

testDF <- data.frame(c(40.4259, 41.8781, 39.0792), c(-86.9081, -87.6298, -84.17704))
names(testDF) <- c("lat", "long")

The whole data frame looks like this, just three rows and two columns:

testDF

Now we can define the points that we want to plot on a map:

points <- st_as_sf( testDF, coords=c("long", "lat"), crs=4326)

and we can render the map. We make each dot have radius 1, but you are welcome to change the radius if you want to:

addCircleMarkers(addTiles(leaflet( testDF )), radius=1)
Deliverables

Show the map with the three points.

Question 2 (2 pts)

Now load the Craiglist data as follows:

myDF <- fread("/anvil/projects/tdm/data/craigslist/vehicles.csv",
              stringsAsFactors = TRUE, nrows=100)

Examine the head of myDF and also the dim of myDF, and see which columns are the state and the long and the lat columns.

Now read ALL of the rows of the data set into a new data frame, but only the three columns called state and long and lat (the other columns will not be needed).

Make a subset of this new data frame, satisfying 3 conditions, namely, the state variable indicates that the data is from Indiana, and the long and lat values are not missing.

(state=="in") & (!is.na(long)) & (!is.na(lat))

You should now have a data frame with 3 columns and 5634 rows.

Deliverables

Display the dimension of your new data frame, which should have 3 columns and 5634 rows.

Question 3 (2 pts)

Now make a plot of the data frame that you created in question 2, using these two lines of R:

points <- st_as_sf( mynewdataframe, coords=c("long", "lat"), crs=4326)
addCircleMarkers(addTiles(leaflet( mynewdataframe )), radius=1)

Please note that, with Craigslist, people can list items from anywhere in the country. So there are some items outside Indiana, even though we selected only the items that are supposed to be from the State of Indiana. BUT fortunately, you will see that most people’s listings are accurate. In other words, if you zoom in and out on the map, you will see that most of the dots appear in Indiana.

Deliverables

Show the map with the Craiglist data from Indiana. (Some of the data points will be outside Indiana, but most of them will be in the State of Indiana.)

Question 4 (2 pts)

In question 4 and question 5, we will verify the path of the Thanksgiving parade in New York City from 2015, as shown on this image:

You can import the New York taxi cab data from November 2015 as follows:

myDF <- fread("/anvil/projects/tdm/data/taxi/yellow/yellow_tripdata_2015-11.csv", tz="")

(The tz="" indicates that the time zone is not given in this data.)

Make a new data frame called thanksgivingdayDF by using the subset function, with the option grepl("2015-11-26", tpep_pickup_datetime) to extract the rows of the data from Thanksgiving day. Your new data frame should have 242393 rows and 19 columns.

Create two times in R, one for the start of the parade, and one for the end of the parade:

paradestart <- strptime("2015-11-26 09:00:00", format="%Y-%m-%d  %H:%M:%S", tz="EST")

paradeend <- strptime("2015-11-26 12:00:00", format="%Y-%m-%d  %H:%M:%S", tz="EST")

Now make a vector of times (converting the pickup times from the taxi cab rides from strings into times).

mytimes <- strptime(thanksgivingdayDF$tpep_pickup_datetime, format="%Y-%m-%d  %H:%M:%S", tz="")

Finally, make a new data frame called finalDF from the data frame thanksgivingdayDF, using the subset function with the condition (mytimes > paradestart) & (mytimes < paradeend).

Your data frame finalDF should have 28704 rows.

Deliverables

Display the dimension of your data frame called finalDF, which should have 28704 rows.

Question 5 (2 pts)

If you examine the head of finalDF, you see that the latitude values are called pickup_latitude and pickup_longitude.

We want them to be called lat and long instead, so we can make a new data frame as follows:

testDF <- data.frame( finalDF$pickup_latitude, finalDF$pickup_longitude)
names(testDF) <- c("lat","long")

Finally, plot the latitude and longitude values from testDF using a smaller radius than you used in Question 1 and Question 3. We suggest radius=.1.

You will notice that taxi cabs were unable to pickup passengers on the route of the Thanksgiving Day parade because those roads were closed. Please zoom into the map and verify this, comparing your map to the parade route map:

Deliverables

Show the map with the data from Thanksgiving morning on November 26, 2015, at the time of the parade.

Because of the maps in this project, when you upload your work to Gradescope, it will say: "Large file hidden. You can download it using the button above." That is what the graders will do, namely, they will download it when they are grading it. This warning is expected because your maps are large, and that is totally OK.

Submitting your Work

This project gives you familiarity with mapping in R.

Items to submit
  • firstname_lastname_project13.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.