# TDM 10100: Project 4 — Fall 2023

Many data science tools, including `R`, have powerful ways to index data.

`R` typically has operations that are vectorized, so there is little to no need to write loops. `R` also typically uses indexing instead of an `if` statement.

Sequential statements run one after another, e.g. print line 45, then print line 15. `if`/`else` statements create an order of direction based on a logical condition.

`if` statement example:

```r
x <- 7
if (x > 0) {
  print("Positive number")
}
```

`else` statement example:

```r
x <- -10
if (x >= 0) {
  print("Non-negative number")
} else {
  print("Negative number")
}
```

In `R`, we can classify many numbers all at once:

```r
x <- c(-10,3,1,-6,19,-3,12,-1)
mysigns <- rep("Non-negative number", times=8)
mysigns[x < 0] <- "Negative number"
mysigns
```

Context: As we continue to become more familiar with `R`, this project will help reinforce the many ways of indexing data in `R`.

Scope: R, data.frames, indexing.

Make sure to read about, and use, the template found here, and the important information about project submissions here.

Using the `seminar-r` kernel, let's first see all of the files that are in the `craigslist` folder:

``list.files("/anvil/projects/tdm/data/craigslist")``
Remember: To see the file size of the CSV (i.e. how large it is), use:

``file.info("/anvil/projects/tdm/data/craigslist/vehicles.csv")$size``

You can also use `file.info` to see other information about the file.
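As a small sketch of the other information `file.info` returns, the example below writes a throwaway CSV to a temporary location and inspects it. The file and its contents are made up for illustration and are not part of the Craigslist data.

```r
# Create a small temporary CSV; this file is hypothetical and only for illustration.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:3, b = c("x", "y", "z")), tmp, row.names = FALSE)

info <- file.info(tmp)   # a one-row data.frame of file metadata
names(info)              # column names include size, isdir, mode, mtime, ...
info$size                # size in bytes
info$isdir               # FALSE, because tmp is a regular file
info$mtime               # time of last modification

unlink(tmp)              # clean up the temporary file
```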

After looking at several of the files, we will go ahead and read the vehicles data into a data frame:

``myDF <- read.csv("/anvil/projects/tdm/data/craigslist/vehicles.csv", stringsAsFactors = TRUE)``

It is important that, each time we look at data, we start by becoming familiar with the contents of the data.
In past projects we have looked at the head/tail along with the structure and the dimensions of the data. We want to continue this practice.

This dataset has 25 columns. We are unable to see it all without adjusting the width. We can do this by

``options(repr.matrix.max.cols=25, repr.matrix.max.rows=200)``

and we also remember (from the previous project) that we can set the output in `R` to look more natural this way:

``options(jupyter.rich_display = F)``
Use `head` to look at the first 6 rows:

``head(myDF)``

Use `tail` to look at the last 6 rows:

``tail(myDF)``

Use `str` to check the structure:

``str(myDF)``

Use `dim` to check the dimensions:

``dim(myDF)``

To sort and order a single vector, you can use this code:

``head(myDF$year[order(myDF$year)])``

You can also use the `sort` function. By default, it sorts in ascending order. If you want the order to be descending, use `decreasing = TRUE` as an argument:

``head(sort(myDF$year, decreasing = TRUE))``
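The difference between `order` and `sort` is easier to see on a tiny example. The sketch below uses a made-up data frame (`toyDF` and its values are hypothetical): `order` returns positions, which is why it can also be used to reorder whole rows.

```r
# Toy data frame; the values are invented for illustration.
toyDF <- data.frame(year  = c(2010, 1999, 2021, 2005),
                    price = c(5000, 1200, 30000, 3500))

ord <- order(toyDF$year)     # positions that would sort year ascending
ord                          # 2 4 1 3
toyDF$year[ord]              # same values as sort(toyDF$year)

sortedDF <- toyDF[ord, ]     # reorder entire rows by year
newestFirst <- toyDF[order(toyDF$year, decreasing = TRUE), ]
```

`sort` returns the sorted values themselves, so it is convenient for a single vector, while `order` is the usual choice when the rows of a data.frame need to follow one of its columns.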

vectorization

Most of R’s functions are vectorized, which means that the function will be applied to all elements of a vector, without needing to loop through the elements one at a time. The most common way to access individual elements is by using the `[]` symbol for indexing.
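A minimal sketch of both ideas, using a made-up vector `v`: the functions act on every element at once, and `[]` selects elements by position or by a logical condition.

```r
v <- c(4, 9, 16, 25)   # made-up values for illustration

sqrt(v)                # applied to every element: 2 3 4 5
v * 2                  # 8 18 32 50

v[2]                   # by position: 9
v[c(1, 4)]             # several positions: 4 25
v[v > 10]              # by logical condition: 16 25
v[-1]                  # everything except the first element: 9 16 25
```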

```r
cut(myvector, breaks = c(-Inf,10,50,200,Inf), labels = c("a","b","c","d"), right = FALSE)
```

The `breaks` values divide the range of `myvector` into the following intervals:

- (-∞, 10)
- [10, 50)
- [50, 200)
- [200, ∞)

The `labels` values are assigned as follows:

- Values less than 10 are labeled "a".
- Values in the range [10, 50) are labeled "b".
- Values in the range [50, 200) are labeled "c".
- Values 200 and above are labeled "d".

Note that `right = FALSE` is what makes each interval closed on the left and open on the right. By default, `cut` uses intervals of the form (a, b], so a value exactly equal to a break would fall into the lower bucket.
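To check how `cut` treats values that sit exactly on a break, here is a small sketch with made-up numbers chosen to hit the boundaries. It compares left-closed intervals (`right = FALSE`) with the default behavior of `cut`, which uses right-closed intervals.

```r
myvector <- c(5, 10, 49.9, 50, 200, 1000)   # made-up values on and around the breaks

# right = FALSE gives intervals of the form [a, b)
leftClosed <- cut(myvector, breaks = c(-Inf, 10, 50, 200, Inf),
                  labels = c("a", "b", "c", "d"), right = FALSE)
as.character(leftClosed)    # "a" "b" "b" "c" "d" "d"

# The default (right = TRUE) gives intervals of the form (a, b]
rightClosed <- cut(myvector, breaks = c(-Inf, 10, 50, 200, Inf),
                   labels = c("a", "b", "c", "d"))
as.character(rightClosed)   # "a" "a" "b" "b" "c" "d"
```

Only the boundary values 10, 50, and 200 change buckets between the two calls, so it is worth deciding which side should be closed before bucketing real data.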

## Questions

### Question 1 (1.5 pts)

1. How many unique states are there in total? Which five of the states have the most occurrences?

2. How many cars have a price that is greater than or equal to \$2000?

3. What is the average price of the vehicles in the dataset?
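The building blocks for this kind of counting and averaging can be sketched on a toy data frame. Everything below (`toyDF`, its columns, and its values) is invented for illustration; it shows the general technique only, not the answer for the real data.

```r
# Toy data; the values are invented for illustration only.
toyDF <- data.frame(state = c("in", "oh", "in", "mi", "oh", "in"),
                    price = c(1500, 2000, NA, 8000, 500, 12000))

length(unique(toyDF$state))                  # number of distinct values: 3
stateCounts <- sort(table(toyDF$state), decreasing = TRUE)
head(stateCounts, n = 2)                     # the two most frequent values

sum(toyDF$price >= 2000, na.rm = TRUE)       # TRUE counts as 1, so this counts rows: 3
mean(toyDF$price, na.rm = TRUE)              # average of the non-missing prices: 4800
```

The `na.rm = TRUE` argument matters whenever a column contains missing values; without it, `sum` and `mean` return `NA`.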

### Question 2 (1.5 pts)

1. Create a new column `mileage_category` in your data.frame that categorizes the vehicle’s mileage into different buckets by using the `cut` function on the `odometer` column.

    1. "Low": [0, 50000)

    2. "Moderate": [50000, 100000)

    3. "High": [100000, 150000)

    4. "Very High": [150000, Inf)

2. Create a new column called `has_VIN` that flags whether or not the listed vehicle has a VIN provided.

3. Create a new column called `description_length` to categorize listings based on the length of their descriptions (in terms of the number of characters).

    1. "Very Short": [0, 50)

    2. "Short": [50, 100)

    3. "Medium": [100, 200)

    4. "Long": [200, 500)

    5. "Very Long": [500, Inf)

You may count the number of characters using the `nchar` function:

``mynchar <- nchar(as.character(myDF$description))``

Remember to consider empty values and/or `NA` values.
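To see why empty and `NA` values need separate attention, here is a small sketch on three made-up strings. The vector `desc`, the flag `hasText`, and the two-bucket `cut` are hypothetical and only illustrate the behavior.

```r
desc <- c("Runs great", "", NA)   # one normal, one empty, one missing value

len <- nchar(desc)                # 10, 0, NA: a missing string has a missing length
is.na(len)                        # FALSE FALSE TRUE
len == 0                          # FALSE TRUE NA

# A flag that is TRUE only for values that are present and non-empty
hasText <- !is.na(desc) & desc != ""
hasText                           # TRUE FALSE FALSE

# cut() keeps NA lengths as NA; length 0 falls in the first bucket when right = FALSE
bucket <- cut(len, breaks = c(0, 5, Inf), labels = c("short", "long"), right = FALSE)
as.character(bucket)              # "long" "short" NA
```

An empty string and a missing value are different things in `R`, so a check for one does not catch the other.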

### Question 3 (1.5 pts)

1. Using the `table` function, and the new column `mileage_category` that you created in Question 2, find the number of cars in each of the different mileage categories.

2. Using the `table` function, and the new column `has_VIN` that you created in Question 2, identify how many vehicles have a VIN and how many do not have a VIN.

3. Using the `table` function, and the new column `description_length` that you created in Question 2, identify how many vehicles are in each of the categories of description length.
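As a reminder of how `table` behaves, the sketch below counts a small made-up factor (`categ` is a hypothetical name). By default `table` silently drops `NA` values; the `useNA` argument makes them visible.

```r
# Made-up category values, including one missing entry
categ <- factor(c("Low", "High", "Low", NA, "Moderate", "Low"),
                levels = c("Low", "Moderate", "High"))

tab <- table(categ)                        # counts per level, NA excluded
tab                                        # Low 3, Moderate 1, High 1

tabWithNA <- table(categ, useNA = "ifany") # adds an <NA> count when NAs exist
tabWithNA

table(c(TRUE, FALSE, TRUE))                # also works on logical flags: FALSE 1, TRUE 2
```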

### Question 4 (1.5 pts)

Preparing for Mapping

1. Extract all of the data for Texas into a data.frame called `myTexasDF`.

2. Identify the most popular state from `myDF`, and extract all of the data from that state into a data.frame called `popularStateDF`.

3. Create a third data.frame called `myFavoriteDF` with the data from a state of your choice.
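Extracting rows by a condition is ordinary logical indexing on a data.frame. The sketch below uses an invented `toyDF`; the names `ohioDF`, `topState`, and `topDF` are hypothetical and the values are not from the project data.

```r
# Toy data; the values are invented for illustration only.
toyDF <- data.frame(state = c("in", "oh", "in", "mi", "oh", "in"),
                    price = c(1500, 2000, 2500, 8000, 500, 12000))

# Logical index on the rows; the empty slot after the comma keeps all columns
ohioDF <- toyDF[toyDF$state == "oh", ]

# The most frequent value is the first name of the sorted table
topState <- names(sort(table(toyDF$state), decreasing = TRUE))[1]   # "in"

# subset() is an alternative way to select rows by a condition
topDF <- subset(toyDF, state == topState)
```

One difference between the two forms: if the condition contains `NA`, `[` returns rows of `NA` for those positions, while `subset` drops them.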

### Question 5 (2 pts)

Mapping

1. Using the `R` package `leaflet`, make 3 maps of the USA: one map for the data in each of the data.frames from Question 4.
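A minimal sketch of a `leaflet` map, assuming the package is installed. The data frame `pts`, its column names `lng` and `lat`, and the approximate coordinates are hypothetical; the columns holding coordinates in the project data may be named differently.

```r
# Hypothetical points (approximate coordinates); the column names are assumptions.
pts <- data.frame(lng = c(-86.91, -86.16, NA),
                  lat = c(40.43, 39.77, 41.00))

# Rows with a missing coordinate cannot be drawn, so drop them first
pts <- pts[!is.na(pts$lng) & !is.na(pts$lat), ]

if (requireNamespace("leaflet", quietly = TRUE)) {
  library(leaflet)
  m <- leaflet(pts)                  # start a map widget from the data
  m <- addTiles(m)                   # default OpenStreetMap background
  m <- addCircleMarkers(m, lng = ~lng, lat = ~lat, radius = 3)   # one marker per row
  m                                  # displaying the widget renders the map
}
```

If rich display was turned off earlier with `options(jupyter.rich_display = F)`, it may need to be turned back on for the map to render in Jupyter Lab.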

### Submitting your Work

Well done, you’ve finished Project 4! Make sure that all of the below files are included in your submission, and feel free to come to seminar, post on Piazza, or visit some office hours if you have any further questions.

Project 4 Assignment Checklist

• Code used to solve questions 1 to 5

• All of your code and comments, and the output from running the code, in a Jupyter Lab file:

• `firstname-lastname-project04.ipynb`.

• All of your code and comments in an R File:

• `firstname-lastname-project04.R`.

• Submit files through Gradescope

You must double check your `.ipynb` after submitting it in Gradescope. A very common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. Please take the time to double check your work. See here for instructions on how to double check this. You will not receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.

Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.