TDM 10100: Project 4 — Fall 2023

Many data science tools, including R, have powerful ways to index data.

Most operations in R are vectorized, so there is little to no need to write loops.
R code also typically uses indexing in place of an if statement.

  • Sequential statements run one after another, e.g.

    1. print line 45

    2. print line 15

if/else statements choose which code to run based on a logical condition.

if statement example:

x <- 7
if (x > 0) {
  print("Positive number")
}

else statement example:

x <- -10
if (x >= 0) {
  print("Non-negative number")
} else {
  print("Negative number")
}

In R, we can classify many numbers all at once:

x <- c(-10,3,1,-6,19,-3,12,-1)
mysigns <- rep("Non-negative number", times = length(x))
mysigns[x < 0] <- "Negative number"
mysigns
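
The same classification can also be written with the vectorized ifelse function. The sketch below is an alternative for comparison; the name mysigns2 is only for illustration.

```r
x <- c(-10, 3, 1, -6, 19, -3, 12, -1)

# Approach from the text: start with a default label, then overwrite by index
mysigns <- rep("Non-negative number", times = length(x))
mysigns[x < 0] <- "Negative number"

# Alternative: ifelse() tests every element of x at once
mysigns2 <- ifelse(x < 0, "Negative number", "Non-negative number")

# Both approaches give the same labels
identical(mysigns, mysigns2)
```

Index assignment tends to scale better when there are more than two categories, because each category only needs one more line.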

Context: As we continue to become more familiar with R, this project will help reinforce the many ways of indexing data in R.

Scope: R, data.frames, indexing.

Make sure to read about and use the template found here, and read the important information about project submissions here.

Using the seminar-r kernel, let's first see all of the files that are in the craigslist folder:

list.files("/anvil/projects/tdm/data/craigslist")

Remember:

  • To see the size of the CSV file (in bytes):

file.info("/anvil/projects/tdm/data/craigslist/vehicles.csv")$size

  • You can also use 'file.info' to see other information about the file.
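
As a small sketch of that other information, the example below calls file.info on a temporary file, so that it runs even without access to the Anvil path. The temporary file and its contents are made up for illustration.

```r
# Write a small temporary CSV so the example runs anywhere
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:3, b = c("x", "y", "z")), tmp, row.names = FALSE)

info <- file.info(tmp)
info$size    # size in bytes
info$isdir   # FALSE, because tmp is a file and not a directory
info$mtime   # time of the last modification

# Convert bytes to megabytes (using 1024^2 bytes per megabyte)
info$size / 1024^2
```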

After looking at several of the files, we will go ahead and read the vehicles data into a data frame:

myDF <- read.csv("/anvil/projects/tdm/data/craigslist/vehicles.csv", stringsAsFactors = TRUE)

It is important that, each time we look at data, we start by becoming familiar with the contents of the data.
In past projects we have looked at the head/tail along with the structure and the dimensions of the data. We want to continue this practice.

This dataset has 25 columns. We are unable to see all of them without adjusting the display width. We can do this by setting the following options:

options(repr.matrix.max.cols=25, repr.matrix.max.rows=200)

Recall from the previous project that we can also make the output in R look more natural this way:

options(jupyter.rich_display = FALSE)

  • Use 'head' to look at the first 6 rows

     head(myDF)

  • Use 'tail' to look at the last 6 rows

     tail(myDF)

  • Use 'str' to check the structure

     str(myDF)

  • Use 'dim' to check the dimensions

     dim(myDF)
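
These functions can be tried on any data frame. The sketch below uses mtcars, a small dataset that ships with R, as a stand-in for myDF; the n argument changes how many rows head and tail return.

```r
# mtcars ships with R; it stands in for myDF here
head(mtcars, n = 3)   # first 3 rows instead of the default 6
tail(mtcars, n = 3)   # last 3 rows
dim(mtcars)           # number of rows, then number of columns
names(mtcars)         # column names
```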

To sort a single vector, you can index it with the result of the order function:

head(myDF$year[order(myDF$year)])

You can also use the sort function. By default, it sorts in ascending order. If you want the order to be descending, use decreasing = TRUE as an argument:

head(sort(myDF$year, decreasing = TRUE))
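
One reason to know order as well as sort is that order returns row positions, which can be used to sort a whole data frame by one of its columns. A minimal sketch, using a made-up data frame called toyDF rather than the craigslist data:

```r
# Hypothetical toy data frame, not the craigslist data
toyDF <- data.frame(year = c(2010, 1998, 2021, 2005),
                    price = c(5000, 1200, 30000, 3500))

# order() returns the row positions that would sort the column
order(toyDF$year)

# Use those positions as row indices to sort the whole data frame
sortedDF <- toyDF[order(toyDF$year), ]
sortedDF

# The same idea, in descending order
toyDF[order(toyDF$year, decreasing = TRUE), ]
```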

Vectorization

Most of R’s functions are vectorized, which means that the function will be applied to all elements of a vector, without needing to loop through the elements one at a time. The most common way to access individual elements is by using the [] symbol for indexing.
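
A short sketch of both ideas, vectorized operations and indexing with [], using made-up values (the names prices, doubled, and isExpensive are only for illustration):

```r
prices <- c(1500, 25000, 800, 12000)   # hypothetical values

# Vectorized arithmetic: every element is doubled, no loop needed
doubled <- prices * 2

# A vectorized comparison gives one TRUE/FALSE per element
isExpensive <- prices >= 10000

prices[2]             # indexing by position
prices[c(1, 3)]       # indexing by several positions
prices[isExpensive]   # indexing by a logical vector
```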

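The cut function is also vectorized: it assigns each element of a numeric vector to an interval. For example:

cut(myvector, breaks = c(-Inf,10,50,200,Inf), labels = c("a","b","c","d"), right = FALSE)

The breaks argument divides the range of myvector into the intervals below. The argument right = FALSE makes each interval include its left endpoint and exclude its right endpoint; by default (right = TRUE), cut does the opposite and uses intervals of the form (a, b].
- (-∞, 10)
- [10, 50)
- [50, 200)
- [200, ∞)

The labels argument gives the label that is assigned to each interval:
- Values less than 10: Will be labeled as "a".
- Values in the range [10, 50): Will be labeled as "b".
- Values in the range [50, 200): Will be labeled as "c".
- Values 200 and above: Will be labeled as "d".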
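
To see how values on the boundaries are handled, here is a small sketch with made-up values for myvector. It passes right = FALSE so that each interval is closed on the left, as in [10, 50); with the default right = TRUE, a value of exactly 10 would fall into the first interval instead.

```r
myvector <- c(5, 10, 49, 50, 199, 200, 1000)   # hypothetical values

# right = FALSE makes each interval include its left endpoint: [a, b)
mycats <- cut(myvector, breaks = c(-Inf, 10, 50, 200, Inf),
              labels = c("a", "b", "c", "d"), right = FALSE)
mycats

# Count how many values fall in each category
table(mycats)
```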

Questions

Question 1 (1.5 pts)

  1. How many unique states are there in total? Which five of the states have the most occurrences?

  2. How many cars have a price that is greater than or equal to $2000?

  3. What is the average price of the vehicles in the dataset?
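
The functions that are useful for questions like these can be tried out on small vectors first. The sketch below uses made-up vectors called states and prices, not the craigslist data, so the numbers it produces are not the answers.

```r
# Hypothetical toy vectors, not the craigslist data
states <- c("in", "tx", "in", "ca", "tx", "in")
prices <- c(1500, 2000, NA, 8000, 500, 2500)

# Number of distinct values
length(unique(states))

# The 2 most frequent values, largest count first
head(sort(table(states), decreasing = TRUE), n = 2)

# A logical vector sums as TRUE = 1 and FALSE = 0
sum(prices >= 2000, na.rm = TRUE)

# na.rm = TRUE drops missing values before averaging
mean(prices, na.rm = TRUE)
```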

Question 2 (1.5 pts)

  1. Create a new column mileage_category in your data.frame that categorizes the vehicle's mileage into different buckets by using the cut function on the odometer column.

    1. "Low": [0, 50000)

    2. "Moderate": [50000, 100000)

    3. "High": [100000, 150000)

    4. "Very High": [150000, Inf)

  2. Create a new column called has_VIN that flags whether or not the listed vehicle has a VIN provided.

  3. Create a new column called description_length to categorize listings based on the length of their descriptions (in terms of the number of characters).

    1. "Very Short": [0, 50)

    2. "Short": [50, 100)

    3. "Medium": [100, 200)

    4. "Long": [200, 500)

    5. "Very Long": [500, Inf)

You may count the number of characters using the nchar function:

mynchar <- nchar(as.character(myDF$description))

Remember to consider empty values and/or NA values.
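
To see why empty and NA values need attention, the sketch below runs nchar on a made-up factor called description. Treating a missing description as length 0 is only one possible choice, shown here as an assumption.

```r
# Hypothetical toy column, not the craigslist data
description <- factor(c("Runs great", "", NA, "Clean title, one owner"))

desc <- as.character(description)
mynchar <- nchar(desc)
mynchar   # the empty string has length 0; the NA stays NA

# One possible choice (an assumption): treat missing descriptions as length 0
mynchar[is.na(mynchar)] <- 0

# Flag the entries that are neither missing nor empty
hasText <- !is.na(desc) & desc != ""
hasText
```

The same kind of logical flag, built from is.na and a comparison with "", can be used for any column in which a value may be missing or blank.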

Question 3 (1.5 pts)

  1. Using the table function, and the new column mileage_category that you created in Question 2, find the number of cars in each of the different mileage categories.

  2. Using the table function, and the new column has_VIN that you created in Question 2, identify how many vehicles have a VIN and how many do not have a VIN.

  3. Using the table function, and the new column description_length that you created in Question 2, identify how many vehicles are in each of the categories of description length.

Question 4 (1.5 pts)

Preparing for Mapping

  1. Extract all of the data for Texas into a data.frame called myTexasDF

  2. Identify the most popular state from myDF, and extract all of the data from that state into a data.frame called popularStateDF

  3. Create a third data.frame called myFavoriteDF with the data from a state of your choice
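
Extracting rows like this relies on logical indexing of a data.frame. A minimal sketch on a made-up data frame (toyDF, myTxDF, topState, and topDF are illustrative names, and the values are not from the craigslist data):

```r
# Hypothetical toy data frame, not the craigslist data
toyDF <- data.frame(state = c("tx", "in", "tx", "ca", "in", "tx"),
                    price = c(5000, 1200, 30000, 3500, 900, 7000))

# Keep the rows where the condition is TRUE, and keep all of the columns
myTxDF <- toyDF[toyDF$state == "tx", ]

# Find the most frequent value, then subset on it
topState <- names(which.max(table(toyDF$state)))
topDF <- toyDF[toyDF$state == topState, ]
```

Note the comma inside the brackets: the condition selects rows, and leaving the column position empty keeps every column.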

Question 5 (2 pts)

Mapping

  1. Using the R package leaflet, make 3 maps of the USA, namely, one map for the data in each of the data.frames from Question 4.
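
As a starting point, here is a minimal sketch of a leaflet map with one circle marker per row. It assumes the leaflet package is installed, and it uses a made-up data frame whose column names lat and long are assumptions; check the names of the coordinate columns in your own data.frames.

```r
library(leaflet)

# Hypothetical toy data; the column names lat and long are assumptions
toyDF <- data.frame(lat = c(30.27, 29.76, 32.78),
                    long = c(-97.74, -95.37, -96.80))

# Drop rows with missing coordinates before mapping
toyDF <- toyDF[!is.na(toyDF$lat) & !is.na(toyDF$long), ]

mymap <- leaflet(toyDF)
mymap <- addTiles(mymap)   # default OpenStreetMap background tiles
mymap <- addCircleMarkers(mymap, lng = ~long, lat = ~lat, radius = 3)
mymap   # displays the map in Jupyter Lab
```

With many rows, small circle markers tend to be easier to read than the default pin markers.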

Submitting your Work

Well done, you’ve finished Project 4! Make sure that all of the below files are included in your submission, and feel free to come to seminar, post on Piazza, or visit some office hours if you have any further questions.

Project 4 Assignment Checklist

  • Code used to solve questions 1 to 5

  • All of your code and comments, and the output from running the code, in a Jupyter Lab file:

    • firstname-lastname-project04.ipynb.

  • All of your code and comments in an R File:

    • firstname-lastname-project04.R.

  • Submit files through Gradescope

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.

Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.