# TDM 10100: Project 4 — Fall 2022

Many data science tools, including R, have powerful ways to index data.

Insider Knowledge

Operations in R are typically vectorized, so there is little to no need to write loops.
Likewise, R code often uses indexing where other languages would use an if statement.

• Sequential statements run one after another, e.g.

1. print line 45

2. print line 15

if/else statements choose which code runs next based on a logical condition.

if statement example:

```r
x <- 7
if (x > 0) {
  print("Positive number")
}
```

else statement example:

```r
x <- -10
if (x >= 0) {
  print("Non-negative number")
} else {
  print("Negative number")
}
```

In `R`, we can classify many numbers all at once:

```r
x <- c(-10, 3, 1, -6, 19, -3, 12, -1)
# Start with a default label for every element, then overwrite the negatives
mysigns <- rep("Non-negative number", times = length(x))
mysigns[x < 0] <- "Negative number"
mysigns
```
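As a small aside, the same classification can also be written with `ifelse()`, which picks a label element by element. The sketch below repeats the indexing approach and checks that both give the same result; the names `by_index` and `by_ifelse` are made up for this illustration.

```r
x <- c(-10, 3, 1, -6, 19, -3, 12, -1)

# Approach 1: fill in a default label, then overwrite by logical index
by_index <- rep("Non-negative number", times = length(x))
by_index[x < 0] <- "Negative number"

# Approach 2: ifelse() chooses between two labels for each element
by_ifelse <- ifelse(x < 0, "Negative number", "Non-negative number")

# Both approaches should agree
identical(by_index, by_ifelse)

# Count how many values received each label
table(by_index)
```

Either form avoids writing a loop; which one reads better is largely a matter of taste.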

Context: As we continue to become more familiar with `R`, this project will help reinforce the many ways of indexing data in `R`.

Scope: R, data.frames, indexing.

Make sure to read about and use the template found here, and read the important information about project submissions here.

Using the f2022-s2023-r kernel, let's first see all of the files that are in the `craigslist` folder:

``list.files("/anvil/projects/tdm/data/craigslist")``

After looking at several of the files, we will go ahead and read the vehicles data into a data frame:

``myDF <- read.csv("/anvil/projects/tdm/data/craigslist/vehicles.csv", stringsAsFactors = TRUE)``

Remember:

• To see the size of the CSV file (in bytes):

`file.info("/anvil/projects/tdm/data/craigslist/vehicles.csv")$size`

• You can also use `file.info` to see other information about the file.
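To get a feel for what `file.info` returns without touching the Anvil paths, here is a minimal sketch that writes a small throwaway CSV to a temporary file and inspects it. The temporary file and its contents are invented for this example.

```r
# Write a tiny CSV to a temporary location (hypothetical example data)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:3, b = c("x", "y", "z")), tmp, row.names = FALSE)

# file.info() returns a data.frame with one row per file
info <- file.info(tmp)

info$size    # file size in bytes
info$isdir   # FALSE for a regular file
info$mtime   # time of last modification
names(info)  # all of the fields that are available
```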

### ONE

It is so important that, each time we look at data, we start by becoming familiar with the data.
In past projects we have looked at the head/tail along with the structure and the dimensions of the data. We want to continue this practice.

This dataset has 25 columns, and we cannot see all of them without adjusting the display width. We can do this with:

``options(repr.matrix.max.cols=25, repr.matrix.max.rows=200)``

and we also remember (from the previous project) that we can set the output in `R` to look more natural this way:

``options(jupyter.rich_display = F)``

You can look at the first 6 rows (`head`), the last 6 rows (`tail`), the structure (`str`), and/or the dimensions (`dim`) of the dataset.
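As a quick refresher, the sketch below runs these four functions on a small made-up data frame. The name `toyDF` and its values are invented for illustration; the same calls work on `myDF`.

```r
# A small, made-up data frame standing in for the real data
toyDF <- data.frame(
  region = c("indianapolis", "chicago", "denver", "chicago",
             "austin", "denver", "chicago", "boston"),
  year   = c(2012, 2008, 2015, 1999, 2020, 2011, 2003, 2017),
  price  = c(8500, 1200, 15000, 900, 32000, 7000, 0, NA)
)

head(toyDF)     # first 6 rows by default
tail(toyDF, 3)  # last 3 rows (the second argument sets how many)
str(toyDF)      # column names, types, and a preview of the values
dim(toyDF)      # number of rows, then number of columns
```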

1. How many unique regions are there in total? Name 5 of the different regions that are included in this dataset.

2. How many cars were manufactured in 2011 or afterwards, i.e., they are 2011 models or newer?

3. In what year was the oldest model manufactured? In what year was the most recent model manufactured? In which year were the most cars manufactured?

To sort and order a single vector you can use this code:

``head(myDF$year[order(myDF$year)])``

You can also use the `sort` function, as demonstrated in earlier projects.
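The sketch below shows, on two short made-up vectors, a few building blocks that are useful for questions like these: counting distinct values, counting values that satisfy a condition, and sorting. The vectors `region` and `year` here are invented and are not taken from the dataset.

```r
# Made-up vectors for illustration only
region <- c("chicago", "denver", "chicago", "austin", "denver", "chicago")
year   <- c(2008, 2015, NA, 2020, 2011, 2008)

length(unique(region))           # number of distinct values
sum(year >= 2011, na.rm = TRUE)  # how many values are 2011 or later
range(year, na.rm = TRUE)        # smallest and largest value

sort(year)                       # sorted values; sort() drops NA by default
year[order(year)]                # same ordering, but NA is kept at the end

# Most common value: tabulate, then take the name of the largest count
counts <- table(year)
names(counts)[which.max(counts)]
```

Note that `sort()` and `order()` treat `NA` differently by default, which matters when a column has missing values.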

Items to submit
• Code used to solve this problem.

• Output from running the code.

• Answers to the 3 questions above.

### TWO

1. Create a new column in your data.frame that is labeled `newflag` which indicates if the vehicle for sale has been labeled as `like new`. In other words, the column `newflag` should be `TRUE` if the vehicle on that row is `like new`, and `FALSE` otherwise.

2. Create a new column called `pricecategory` that is

1. `cheap` for vehicles less than or equal to \$1,500

2. `average` for vehicles strictly more than \$1,500 but less than or equal to \$10,000

3. `expensive` for vehicles strictly more than \$10,000

3. How many cars are there in each of these three `pricecategory` values?

Remember to consider any 0 values and/or `NA` values.
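As a hedged sketch of the indexing pattern on a tiny made-up data frame: the column names `is_like_new` and `band` are hypothetical, and the toy values are invented. The point is how `NA` behaves, not the answer for the real data.

```r
# Tiny made-up data frame, including a 0 price and some NA values
toy <- data.frame(
  condition = c("like new", "good", NA, "like new", "fair"),
  price     = c(1500, 0, 12000, NA, 9000)
)

# %in% returns FALSE (not NA) when the value is missing
toy$is_like_new <- toy$condition %in% "like new"

# Start with NA everywhere, then fill in each category by index.
# which() drops NA comparisons, so rows with a missing price stay NA.
toy$band <- NA_character_
toy$band[which(toy$price <= 1500)] <- "cheap"
toy$band[which(toy$price > 1500 & toy$price <= 10000)] <- "average"
toy$band[which(toy$price > 10000)] <- "expensive"

toy
table(toy$band, useNA = "ifany")  # include the NA count in the table
```

Here a price of 0 lands in `cheap`; whether that is what you want for the real data is a decision you should make and explain.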

Items to submit
• Code used to solve this problem.

• Output from running the code.

• The answer to the questions above.

### THREE

Vectorization

Most of R’s functions are vectorized, which means that the function will be applied to all elements of a vector, without needing to loop through the elements one at a time. The most common way to access individual elements is by using the `[]` symbol for indexing.
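To make this concrete, here is a minimal sketch comparing a loop with the equivalent vectorized expression, followed by a few uses of `[]`. The `odometer` vector is made up for this example.

```r
# Made-up odometer readings in miles
odometer <- c(42000, 125000, 87000, 15000)

# Loop version: convert to kilometers one element at a time
km_loop <- numeric(length(odometer))
for (i in seq_along(odometer)) {
  km_loop[i] <- odometer[i] * 1.609344
}

# Vectorized version: the multiplication applies to every element at once
km_vec <- odometer * 1.609344

all.equal(km_loop, km_vec)  # both versions give the same values

# Indexing with []
odometer[2]                 # a single element by position
odometer[c(1, 4)]           # several elements by position
odometer[odometer > 50000]  # elements selected by a logical condition
```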

1. Using the `table()` function, and the column `myDF$newflag`, identify how many vehicles are `like new` and how many vehicles are not `like new`.

2. Now using the `cut` function and appropriate `breaks`, create a new column called `newpricecategory`. Verify that this column is identical to the previously created `pricecategory` column, created in question TWO.

3. Make another column called `odometerage`, which has values `new` or `middle age` or `old`, according to whether the odometer is (respectively): less than or equal to 50000; strictly greater than 50000 and less than or equal to 100000; or strictly greater than 100000. How many cars are in each of these categories?

``cut(myvector, breaks = c(0, 10, 50, 200), labels = c("a", "b", "c"))``
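
Two details of `cut` are easy to miss: it needs one more break than it has labels, and by default each interval includes its right endpoint but not its left one. The sketch below illustrates both on a made-up vector and compares the result with the indexing approach. The names `scores`, `grp`, and `manual` are invented for this example.

```r
# Made-up values, chosen to sit on and around the break points
scores <- c(5, 10, 11, 50, 51, 200, NA)

# Four breaks define three intervals: (-Inf, 10], (10, 50], (50, Inf]
grp <- cut(scores, breaks = c(-Inf, 10, 50, Inf),
           labels = c("low", "mid", "high"))
grp
table(grp)

# The same categories built by indexing, for comparison
manual <- rep(NA_character_, length(scores))
manual[which(scores <= 10)] <- "low"
manual[which(scores > 10 & scores <= 50)] <- "mid"
manual[which(scores > 50)] <- "high"

# cut() returns a factor, so convert it before comparing with a character vector
identical(as.character(grp), manual)
```

Using `-Inf` and `Inf` as the outer breaks is one way to make sure no value falls outside the intervals; values outside the breaks would otherwise become `NA`.
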
Items to submit
• Code used to solve this problem.

• Output from running the code.

• The answer to the questions above.

### FOUR

Preparing for Mapping

1. Extract all of the data for `indianapolis` into a `data.frame` called `myIndy`

2. Identify the most popular region from `myDF`, and extract all of the data from that region into a `data.frame` called `popularRegion`.

3. Create a third `data.frame` with the data from a region of your choice
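
The row-subsetting pattern behind these steps can be sketched on a small made-up data frame. The names `toy`, `denverDF`, `top`, and `topDF` are hypothetical and the values are invented.

```r
# Made-up data frame for illustration only
toy <- data.frame(
  region = c("chicago", "denver", "chicago", "austin", "denver", "chicago"),
  price  = c(1200, 15000, 900, 32000, 7000, 4000)
)

# Keep the rows where the condition is TRUE; the empty slot after
# the comma means "keep all columns"
denverDF <- toy[toy$region == "denver", ]
denverDF

# Find the most frequent value, then subset on it
counts <- table(toy$region)
top <- names(counts)[which.max(counts)]
top
topDF <- toy[toy$region == top, ]
dim(topDF)
```

If the column being compared contains `NA` values, wrapping the condition in `which()` keeps rows of `NA` out of the result.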

Items to submit
• Code used to solve this problem.

• Output from running the code.

• The answer to the questions above.

### FIVE

Mapping

Using the R package `leaflet`, make 3 maps of the USA, namely, one map for the data in each of the `data.frames` from question FOUR.
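
Below is a minimal sketch of one way to build such a map, assuming the `leaflet` package is installed and that the data frame has numeric coordinate columns. The column names `lat` and `long`, the data frame `pts`, and the approximate coordinates are assumptions made for this example; check the column names in your own data frames before adapting it.

```r
library(leaflet)

# Made-up points with approximate coordinates (illustrative only)
pts <- data.frame(
  lat  = c(39.77, 41.88, 39.74),
  long = c(-86.16, -87.63, -104.99)
)

m <- leaflet(pts)                    # start a map widget from the data
m <- addTiles(m)                     # add the default base map tiles
m <- addCircleMarkers(m, lng = ~long, lat = ~lat, radius = 3)  # one marker per row
m <- setView(m, lng = -98.58, lat = 39.83, zoom = 4)           # roughly center on the USA
m                                    # display the map in the notebook
```

Rows with missing coordinates cannot be placed on the map, so it may help to check for `NA` values in the coordinate columns first.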

Items to submit
• Code used to solve this problem.

• Output from running the code.

• The answers to the 3 questions above.

Please double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.