TDM 10100: Project 4 — Fall 2022

Many data science tools, including R, have powerful ways to index data.

Insider Knowledge

R typically has operations that are vectorized and there is little to no need to write loops.
R typically also uses indexing instead of using an if statement.

  • Sequential statements run one after another, e.g.,

    1. print line 45

    2. print line 15

if/else statements choose which block of code runs, based on a logical condition.

if statement example:

x <- 7
if (x > 0) {
  print("Positive number")
}

else statement example:

x <- -10
if (x >= 0) {
  print("Non-negative number")
} else {
  print("Negative number")
}

In R, we can classify many numbers all at once:

x <- c(-10,3,1,-6,19,-3,12,-1)
mysigns <- rep("Non-negative number", times=8)
mysigns[x < 0] <- "Negative number"
mysigns
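
As a minimal sketch of the same idea, the classification above can also be written with ifelse(), which tests the condition element by element. The name mysigns2 is introduced here only for illustration, and length(x) replaces the hard-coded 8 so the code still works if x changes length.

```r
x <- c(-10, 3, 1, -6, 19, -3, 12, -1)

# Indexing approach: start with a default label, then overwrite the negatives
mysigns <- rep("Non-negative number", times = length(x))
mysigns[x < 0] <- "Negative number"

# ifelse() approach: one vectorized call, no loop and no if statement
mysigns2 <- ifelse(x < 0, "Negative number", "Non-negative number")

# Both approaches give the same labels
identical(mysigns, mysigns2)

# Count how many values fall under each label
table(mysigns)
```

Either form is fine; the indexing form extends more naturally when there are more than two categories.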

Context: As we continue to become more familiar with R, this project will help reinforce the many ways of indexing data in R.

Scope: R, data.frames, indexing.

Make sure to read about and use the template found here, and read the important information about project submissions here.

Using the f2022-s2023-r kernel, let's first see all of the files that are in the craigslist folder:

list.files("/anvil/projects/tdm/data/craigslist")

After looking at several of the files, we will go ahead and read in the data frame on the vehicles:

myDF <- read.csv("/anvil/projects/tdm/data/craigslist/vehicles.csv", stringsAsFactors = TRUE)

Helpful Hints

Remember:

  • To see the file size (in bytes) of the CSV:

file.info("/anvil/projects/tdm/data/craigslist/vehicles.csv")$size

  • You can also use file.info to see other information about the file.
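
A small sketch of what file.info returns, using a temporary file so that it runs anywhere (the temporary file and the name info are assumptions for illustration, not part of the project data):

```r
# Write a tiny CSV to a temporary location so the example is self-contained
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:3, b = c("x", "y", "z")), tmp, row.names = FALSE)

# file.info() returns a data.frame with one row per file
info <- file.info(tmp)

info$size           # size in bytes
info$size / 1024^2  # the same size expressed in megabytes
info$isdir          # FALSE for a regular file
info$mtime          # time of last modification
```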

ONE

It is so important that, each time we look at data, we start by becoming familiar with the data.
In past projects we have looked at the head/tail along with the structure and the dimensions of the data. We want to continue this practice.

This dataset has 25 columns, and we are unable to see all of them without adjusting the display width. We can do this with:

options(repr.matrix.max.cols=25, repr.matrix.max.rows=200)

and we also remember (from the previous project) that we can set the output in R to look more natural this way:

options(jupyter.rich_display = FALSE)

Helpful Hint

You can look at the first 6 rows (head), the last 6 rows (tail), the structure (str), and/or the dimensions (dim) of the dataset.
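
The four functions in this hint can be tried on a small stand-in data frame first. This is only a sketch: toyDF and its values are made up for illustration, and the column names are chosen to resemble the kind of data in this project.

```r
# A made-up data frame standing in for myDF
toyDF <- data.frame(
  region = c("indianapolis", "chicago", "indianapolis", "denver",
             "chicago", "austin", "denver", "chicago"),
  year   = c(2009, 2015, 2011, NA, 2020, 1998, 2011, 2015),
  price  = c(4500, 12000, 800, 7000, 3000, 15000, 0, 9999)
)

head(toyDF, n = 3)  # first 3 rows (the default is 6)
tail(toyDF, n = 3)  # last 3 rows
str(toyDF)          # column names, types, and a preview of the values
dim(toyDF)          # number of rows, then number of columns
```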

  1. How many unique regions are there in total? Name 5 of the different regions that are included in this dataset.

  2. How many cars were manufactured in 2011 or later (that is, model year 2011 or newer)?

  3. In what year was the oldest model manufactured? In what year was the most recent model manufactured? In which year were the most cars manufactured?

Helpful Hint

To sort and order a single vector you can use this code:

head(myDF$year[order(myDF$year)])

You can also use the sort function, as demonstrated in earlier projects.
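
As a hedged sketch of the tools that are useful for these questions, here they are on made-up vectors (year, region, and counts below are illustrative values, not the project data). Note how each function treats NA values.

```r
# Made-up values for illustration only
year   <- c(2009, 2015, 2011, NA, 2020, 1998, 2011, 2015, 2015)
region <- c("indianapolis", "chicago", "indianapolis", "denver", "chicago",
            "austin", "denver", "chicago", "austin")

sort(year)          # sorted values; sort() drops NA by default
year[order(year)]   # same order, but order() keeps NA at the end

length(unique(region))        # number of distinct regions
sum(year >= 2011, na.rm = TRUE)  # count of values meeting a condition
range(year, na.rm = TRUE)     # smallest and largest value

# table() counts each distinct value; which.max() finds the largest count
counts <- table(year)
names(counts)[which.max(counts)]
```

Without na.rm = TRUE, sum() and range() return NA as soon as the vector contains a missing value.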

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • Answers to the 3 questions above.

TWO

  1. Create a new column in your data.frame that is labeled newflag which indicates if the vehicle for sale has been labeled as like new. In other words, the column newflag should be TRUE if the vehicle on that row is like new, and FALSE otherwise.

  2. Create a new column called pricecategory that is

    1. cheap for vehicles less than or equal to $1,500

    2. average for vehicles strictly more than $1,500 but less than or equal to $10,000

    3. expensive for vehicles strictly more than $10,000

  3. How many cars are there in each of these three price categories?

Helpful Hint

Remember to consider any 0 values and/or NA values.
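
One possible way to build such columns with indexing is sketched below on a made-up data frame. The column name condition is an assumption about where the like new label is stored, toyDF and its values are illustrative, and whether a price of 0 should count as cheap is a decision left to you.

```r
# Made-up data with a missing condition and a missing price
toyDF <- data.frame(
  condition = c("like new", "good", NA, "like new", "fair"),
  price     = c(0, 1500, 9000, NA, 25000)
)

# condition == "like new" would give NA on the third row;
# %in% returns FALSE there instead, so the flag is always TRUE or FALSE
toyDF$newflag <- toyDF$condition %in% "like new"

# Start with NA everywhere, then fill in one category at a time.
# which() drops the NA comparisons, so rows with a missing price stay NA.
toyDF$pricecategory <- NA_character_
toyDF$pricecategory[which(toyDF$price <= 1500)] <- "cheap"
toyDF$pricecategory[which(toyDF$price > 1500 & toyDF$price <= 10000)] <- "average"
toyDF$pricecategory[which(toyDF$price > 10000)] <- "expensive"

# useNA = "ifany" makes the missing values visible in the counts
table(toyDF$pricecategory, useNA = "ifany")
```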

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • The answer to the questions above.

THREE

Vectorization

Most of R’s functions are vectorized, which means that the function will be applied to all elements of a vector, without needing to loop through the elements one at a time. The most common way to access individual elements is by using the [] symbol for indexing.
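
A short sketch of both ideas, using made-up odometer readings (the names odometer, km, and km_loop are illustrative only):

```r
# Made-up odometer readings in miles
odometer <- c(42000, 120500, 87000, 15000)

# Vectorized: one expression converts every element to kilometers
km <- odometer * 1.609344

# The equivalent loop, shown only for comparison
km_loop <- numeric(length(odometer))
for (i in seq_along(odometer)) {
  km_loop[i] <- odometer[i] * 1.609344
}
all.equal(km, km_loop)

# Indexing with []
odometer[2]                 # a single element, by position
odometer[c(1, 4)]           # several elements, by position
odometer[-1]                # everything except the first element
odometer[odometer > 50000]  # elements where a logical condition is TRUE
```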

  1. Using the table() function, and the column myDF$newflag, identify how many vehicles are like new and how many vehicles are not like new.

  2. Now using the cut function and appropriate breaks, create a new column called newpricecategory. Verify that this column is identical to the previously created pricecategory column, created in question TWO.

  3. Make another column called odometerage, which has values new or middle age or old, according to whether the odometer is (respectively): less than or equal to 50000; strictly greater than 50000 and less than or equal to 100000; or strictly greater than 100000. How many cars are in each of these categories?

Helpful Hint
cut(myvector, breaks = c(0, 10, 50, 200), labels = c("a", "b", "c"))
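
A minimal sketch of how cut behaves, with made-up values. The key points are that cut needs one more break than it has labels, that intervals are closed on the right by default, and that values outside the breaks become NA. The names myvector, groups, and wide are illustrative.

```r
myvector <- c(5, 10, 11, 50, 51, 200)

# Four breaks define three intervals: (0,10], (10,50], (50,200]
groups <- cut(myvector, breaks = c(0, 10, 50, 200), labels = c("a", "b", "c"))
groups
table(groups)

# The left endpoint is excluded by default, so 0 falls outside (0,10]
cut(0, breaks = c(0, 10, 50, 200))

# -Inf and Inf as outer breaks catch values of any size
wide <- cut(c(-5, 300), breaks = c(-Inf, 10, 50, Inf), labels = c("a", "b", "c"))
wide

# cut() returns a factor; convert before comparing with a character vector
all(as.character(groups) == c("a", "a", "b", "b", "c", "c"))
```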
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • The answer to the questions above.

FOUR

Preparing for Mapping

  1. Extract all of the data for indianapolis into a data.frame called myIndy.

  2. Identify the most popular region from myDF, and extract all of the data from that region into a data.frame called popularRegion.

  3. Create a third data.frame with the data from a region of your choice.
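
Extracting rows by a condition can be sketched on a small made-up data frame as follows. The names toyDF, toyIndy, topRegion, and toyPopular are illustrative, and the values are not taken from the project data.

```r
# Made-up data; stringsAsFactors = TRUE mirrors how myDF was read in
toyDF <- data.frame(
  region = c("indianapolis", "chicago", "indianapolis", "denver",
             "chicago", "chicago"),
  price  = c(4500, 12000, 800, 7000, 3000, 15000),
  stringsAsFactors = TRUE
)

# Keep the rows where the condition is TRUE; the empty slot after
# the comma keeps every column
toyIndy <- toyDF[toyDF$region == "indianapolis", ]

# The most frequent region: count with table(), then take the largest count
topRegion <- names(which.max(table(toyDF$region)))
topRegion

# subset() is an alternative to bracket indexing
toyPopular <- subset(toyDF, region == topRegion)

dim(toyIndy)
dim(toyPopular)
```

When region is a factor, the extracted data frame still remembers all of the original levels; droplevels() removes the unused ones if that matters for later tables.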

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • The answer to the questions above.

FIVE

Mapping

Using the R package leaflet, make 3 maps of the USA, namely, one map for the data in each of the data.frames from question FOUR.
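
A minimal sketch of a leaflet map, assuming the data.frame has latitude and longitude columns named lat and long (an assumption about the column names; check your own data with names()). The toy data frame below is made up, and the requireNamespace check only guards against leaflet not being installed.

```r
# Made-up points near Indianapolis, standing in for a data.frame like myIndy
toy <- data.frame(
  lat   = c(39.77, 39.79, 39.74),
  long  = c(-86.16, -86.10, -86.20),
  price = c(4500, 12000, 800)
)

if (requireNamespace("leaflet", quietly = TRUE)) {
  # Start a map from the data, add the default background tiles,
  # then draw one small circle per row
  m <- leaflet::leaflet(toy)
  m <- leaflet::addTiles(m)
  m <- leaflet::addCircleMarkers(m, lng = ~long, lat = ~lat, radius = 4)
  m  # in a notebook, this displays the map
}
```

Rows with missing coordinates cannot be drawn, so it can help to remove them before mapping.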

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • The answers to the 3 questions above.

Please double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, we recommend downloading your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.