# TDM 10100: Project 3 — Fall 2023

Motivation: `data.frames` are the primary data structure you will work with when using R. It is important to understand how to insert, retrieve, and update data in a `data.frame`.

Context: In Project 2 we ran our first R code, learned about vectors and indexing, and explored some basic functions in R. In this project, we will continue to reinforce what we’ve already learned and learn more about how dataframes, formally called `data.frame`, work in R.

Scope: r, data.frames, factors

Learning Objectives
• Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.

• Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

• Read and write basic (csv) data.

• Explain and demonstrate: positional, named, and logical indexing.

• List the differences between lists, vectors, factors, and data.frames, and when to use each.

Make sure to read about and use the template found here, as well as the important information about project submissions here.

## Dataset(s)

The following questions will use the following dataset(s):

• `/anvil/projects/tdm/data/craigslist/vehicles.csv`

## Setting Up

First, let’s take a look at all of the data available to students. To do this, we will use a new function, `list.files`, to list all of the files in the craigslist folder.

Run the command below using the seminar-r kernel to view all the files in the folder.

``list.files("/anvil/projects/tdm/data/craigslist")``

As you can see, we have two different files worth of information from Craigslist. For this project, we are interested in looking at the `vehicles.csv` file.

Before we read in the data, we should check the size of the file to get an idea of how big it is. This is important because if the file is too large, we may need more cores for our project, or else our kernel will 'die'.

We can check the size of our file (in bytes) using the following command.

``file.info("/anvil/projects/tdm/data/craigslist/vehicles.csv")$size``

You can also use `file.info` to see other information about the file:

• `size` (double): file size in bytes.

• `isdir` (logical): is the file a directory?

• `mode` (integer of class "octmode"): the file permissions, printed in octal, for example 644.

• `mtime`, `ctime`, `atime` (integer of class "POSIXct"): file modification, ‘last status change’, and last access times.

• `uid` (integer): the user ID of the file’s owner.

• `gid` (integer): the group ID of the file’s group.

• `uname` (character): `uid` interpreted as a user name.

• `grname` (character): `gid` interpreted as a group name. (Unknown user and group names will be `NA`.)
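As a minimal sketch of how `file.info` can be explored safely, the example below builds a small temporary file (a made-up stand-in, not the Craigslist data) and inspects a few of its fields. The conversion to gigabytes divides by 1024^3, which is one common convention; dividing by 1e9 instead gives decimal gigabytes.

```r
# Create a small temporary CSV file to inspect (illustrative data only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,price", "1,5000", "2,7500"), tmp)

info <- file.info(tmp)

info$size     # file size in bytes
info$isdir    # FALSE, because tmp is a regular file
info$mtime    # last modification time

# Convert bytes to gigabytes (binary convention: 1024^3 bytes per GB)
sizeGB <- info$size / 1024^3
sizeGB

# file.size() is a shortcut for file.info(...)$size
sameSize <- file.size(tmp) == info$size
sameSize
```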

Now that we have made sure our file isn’t too big (1.44 GB), let’s read it into a dataframe in the same way that we have done in the previous two projects.

 We recommend using 2 cores for your Jupyter Lab session this week.

Now we can read in the data and get started with our analysis.

``myDF <- read.csv("/anvil/projects/tdm/data/craigslist/vehicles.csv")``
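To get a feel for what `read.csv` returns before loading the full file, here is a small sketch that writes a made-up three-row CSV to a temporary file and reads it back. The file contents and the name `toyDF` are hypothetical. It also shows how blank fields are imported with the default settings: a blank in a numeric column becomes `NA`, while a blank in a text column stays an empty string.

```r
# Write a tiny CSV with two blank fields (illustrative data only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,year,model",
             "1,2015,civic",
             "2,,f-150",
             "3,2019,"), tmp)

toyDF <- read.csv(tmp)
toyDF

is.na(toyDF$year)    # TRUE only for the row whose year was blank
toyDF$model == ""    # TRUE only for the row whose model was blank

# Write the data frame back out; row.names = FALSE avoids an extra index column
out <- tempfile(fileext = ".csv")
write.csv(toyDF, out, row.names = FALSE)
readLines(out)
```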

## Questions

### Question 1 (1 pt)

1. How many rows and columns does our dataframe have?

2. What type(s) of data are in this dataframe (for example, numerical values, text strings, etc.)?

3. Write 1-2 sentences giving an overall description of our data.

As we stressed in Project 2, familiarizing yourself with the data you are going to work with is an important first step. For this question, we want to figure out how many rows and columns are in our data, along with the types of data in our data frame. The hint below contains all of the functions that we need to solve this problem. (We also covered these functions in detail in Project 2, so feel free to reference the previous project if you want more information.)
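Before running `dim()`, `str()`, and `head()` on the full dataset, it may help to see what they report on a small made-up data frame. The name `toyDF` and its values are hypothetical and only serve as an illustration.

```r
# A tiny data frame with one integer, one numeric, and one text column
toyDF <- data.frame(
  id    = 1:6,
  price = c(5000, 7500, 12000, 3000, 9900, 15000),
  model = c("civic", "f-150", "camry", "corolla", "silverado", "accord")
)

dim(toyDF)            # number of rows, then number of columns
nrow(toyDF)           # rows only
ncol(toyDF)           # columns only
str(toyDF)            # compact summary: each column's type and first values
head(toyDF, n = 3)    # first 3 rows (the default is 6)
sapply(toyDF, class)  # class of each column, as a named character vector
```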

When answering sub-question 3, consider talking about where the data appears to be taken from, what the data contains, and any important details that immediately stand out to you about the data.

The `head()`, `dim()`, and `str()` functions could be helpful in answering this question.

Items to submit
• The number of rows and columns in our dataframe, in a markdown cell.

• The types of data in our dataframe, in a markdown cell.

• 1-2 sentences summarizing our data.

### Question 2 (1 pt)

1. Print the number of `NA` values in the `year` column of `myDF`, and the percentage of the total number of rows in `myDF` that this represents.

2. Create a new data frame called `goodyearsDF` with only the rows of `myDF` that have a defined `year` (non `NA` values). Print the `head` of this new data frame.

3. Create a new data frame called `missingyearsDF` with only the rows of `myDF` that are missing data in the `year` column. Print the `head` of this new data frame.

Now that we have a better understanding of the general structure and contents of our data, let’s focus on some specific patterns in our data that may make analysis more challenging.

Often, one of these patterns is missing data. This can come in many forms, such as NA, NaN, NULL, or simply a blank space in one of our dataframe’s cells. When performing data analysis, it is important to consider missing data and decide how to handle it appropriately.
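These different kinds of "missing" behave differently in R, so it can help to try them on a tiny vector first. The sketch below uses made-up values; it is meant only to illustrate the distinctions, not to prescribe how the project data should be cleaned.

```r
x <- c(1, NA, NaN, 4)

is.na(x)     # TRUE for both NA and NaN
is.nan(x)    # TRUE only for NaN

mean(x)                  # a missing result, because missing values propagate
mean(x, na.rm = TRUE)    # 2.5, the mean of 1 and 4

# NULL means "no object at all" and has length zero
is.null(NULL)
length(NULL)
length(c(1, NULL, 3))    # NULL is dropped when combined into a vector

# An empty string is not NA; it has to be checked for explicitly
s <- c("in", "", NA)
is.na(s)
s == ""
```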

In this question, we will look at filtering out rows with missing data. The `R` function `is.na()` returns `TRUE` if the corresponding data is missing and `FALSE` if it is not missing. An exclamation mark changes `TRUE` to `FALSE` and changes `FALSE` to `TRUE`. For this reason, `!is.na()` indicates which data are not `NA` values, in other words, which data are not missing. As an example, if we wanted to create a new dataframe with all of the rows that are not missing the latitude values, we could use any of the following equivalent methods:

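```r
# Method 1: subset(), referring to the column as myDF$lat
goodlatitudeDF <- subset(myDF, !is.na(myDF$lat))
# Method 2: subset(), referring to the column by name only
goodlatitudeDF <- subset(myDF, !is.na(lat))
# Method 3: square-bracket indexing with a logical condition on the rows
goodlatitudeDF <- myDF[!is.na(myDF$lat), ]
```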

In the second method, the `subset` function knows that we are working with `myDF`, so we do not need to specify that `lat` is the latitude column in the `myDF` data frame, and instead, we can just refer to `lat` and the `subset` function knows that we are referring to a column.

In the third method, when we write `myDF[ , ]` we put things before the comma that are conditions on the rows, and we put things after the comma that are conditions on the columns. So we are saying that we want rows of `myDF` for which the `lat` values are not `NA`, and we want all of the columns of `myDF`.
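The row and column slots inside the square brackets accept positions, names, or logical vectors. A small sketch on a made-up data frame (the name `toyDF` and its values are hypothetical) shows each style:

```r
toyDF <- data.frame(
  id    = 1:4,
  lat   = c(40.4, NA, 41.9, 39.8),
  state = c("in", "oh", "il", "in")
)

secondRow <- toyDF[2, ]                   # positional: row 2, all columns
states    <- toyDF[, "state"]             # named: the state column, all rows
keep      <- !is.na(toyDF$lat)            # logical vector, one entry per row
goodRows  <- toyDF[keep, c("id", "lat")]  # logical rows, named columns

secondRow
states
keep
goodRows
```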

If we compare the sizes of the original data frame and this new data frame, we can see that some rows were removed.

``dim(myDF)``
``dim(goodlatitudeDF)``
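The same comparison can be checked end to end on a small made-up data frame (the name `toyDF` and its values are hypothetical). The sketch confirms that the three filtering methods select the same rows, and that the number of rows removed matches the number of `NA` values counted with `sum(is.na(...))`.

```r
toyDF <- data.frame(id = 1:5, lat = c(40.4, NA, 41.9, NA, 39.8))

# The three filtering methods applied to the toy data
a <- subset(toyDF, !is.na(toyDF$lat))
b <- subset(toyDF, !is.na(lat))
d <- toyDF[!is.na(toyDF$lat), ]

# Check that the results agree with each other
isTRUE(all.equal(a, b))
isTRUE(all.equal(a, d))

# Count the missing values: each TRUE counts as 1 when summed
numMissing <- sum(is.na(toyDF$lat))
numMissing
nrow(toyDF) - nrow(a)    # rows removed by the filter

# The mean of a logical vector is the proportion of TRUE values
pctMissing <- 100 * mean(is.na(toyDF$lat))
pctMissing
```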

To answer Question 2, work with the `year` column instead, applying the same techniques demonstrated above for the `lat` column. The `lat` examples are there only to show how missing data can be handled.

Items to submit
• The number of NA values in the `year` column of `myDF` and the percentage of the total number of rows in `myDF` that this represents, in a markdown cell.

• A dataframe called `goodyearsDF` containing only the rows in `myDF` that have a defined `year` (non-`NA` values), and print the `head` of that data frame.

• A dataframe called `missingyearsDF` containing only the rows in `myDF` that are missing the `year` data, and print the `head` of that data frame.

### Question 3 (2 pts)

Use the `myDF` data.frame for this question.

1. Print the mean price of vehicles by `year` during the last 20 years.

2. Find which `year` of vehicle appears most frequently in our data, and how frequently it occurs.

 Using the `aggregate` function is one possible way to solve this problem. An example of finding the mean `price` for each `type` of car is shown here: ``aggregate(price ~ type, data = myDF, FUN = mean)``

We want you to (instead) find the mean `price` for cars by `year`.
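To see how the formula interface of `aggregate` behaves, here is a sketch on a made-up data frame (the name `toyDF`, the `odometer` column, and all values are hypothetical). Two details are worth noticing: rows with a missing `price` are dropped by default, so a group with only missing prices disappears from the result, and the data can be filtered with `subset` before aggregating.

```r
toyDF <- data.frame(
  type     = c("sedan", "truck", "sedan", "truck", "suv"),
  price    = c(10000, 20000, 14000, 30000, NA),
  odometer = c(50000, 120000, 80000, 30000, 60000)
)

# Mean price per type; the suv row has an NA price, so it is omitted
avgAll <- aggregate(price ~ type, data = toyDF, FUN = mean)
avgAll

# Restrict the rows first, then aggregate the filtered data
lowMiles <- subset(toyDF, odometer < 100000)
avgLow <- aggregate(price ~ type, data = lowMiles, FUN = mean)
avgLow
```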

Finding the most frequent value in our data can be done using `table`, which we have talked about previously, in conjunction with the `which.max` function. An example of finding the most frequent type of car is shown here: ``which.max(table(myDF$type))``

Now we want you to (instead) find the year in which the most cars appear in the data set.
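Keep in mind that `which.max` applied to a table returns the position of the largest count, labeled with the corresponding value; it does not return the count itself. A small sketch with a made-up vector (hypothetical data) shows how to recover both pieces:

```r
types <- c("truck", "sedan", "truck", "suv", "truck", "sedan", "truck")

tbl <- table(types)         # counts per value, ordered alphabetically by name
tbl

top <- which.max(tbl)       # named integer: position of the largest count
top

mostCommon <- names(top)    # the most frequent value
howOften   <- tbl[[top]]    # the count stored at that position
mostCommon
howOften
max(tbl)                    # the same count, obtained directly
```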

Items to submit
• The mean price of each year of vehicle for the last 20 years, in a markdown cell.

• The most frequent year in our data, and how frequently it occurred.

### Question 4 (2 pts)

1. Among the `region_url` values in the data set, which `region_url` is most popular?

2. What are the three most popular states, in terms of the number of Craigslist listings that appear?

Use the `table`, `sort`, and `tail` commands to find the most popular `region_url` and the most popular three states.

(These two questions are not related to each other. In other words, when you look for the three states that appear most frequently, they have nothing at all to do with the region_url that you found.)
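One way the `table`, `sort`, and `tail` commands fit together, sketched on a made-up vector of state codes (hypothetical data): `table` counts each value, `sort` orders the counts from smallest to largest, and `tail` keeps the last few entries, which are the largest.

```r
states <- c("in", "oh", "in", "il", "in", "oh", "mi", "oh", "in", "il")

counts <- table(states)    # il = 2, in = 4, mi = 1, oh = 3
sorted <- sort(counts)     # increasing order by default
sorted

topTwo <- tail(sorted, n = 2)    # the two largest counts, largest last
topTwo
names(topTwo)

# Alternative: sort in decreasing order and take the head instead
head(sort(counts, decreasing = TRUE), n = 2)
```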

Items to submit
• The most popular `region_url`.

• The three states that appear most frequently.

### Question 5 (2 pts)

1. In Question 3, we found the average price of vehicles by year. ("Average" and "mean" are two different words for the very same concept.) Choose at least two different plot types in R, and create two plots that show the average vehicle price by year.

2. Write 3-5 sentences detailing any patterns present in the data along with your personal observations (e.g., shape, outliers, etc.).
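As one possible starting point, the sketch below draws two different base R plot types from a small made-up table of yearly averages. The data frame `avgDF`, its column names, and its values are hypothetical; substitute your own aggregated result. Both plots set a title and axis labels through `main`, `xlab`, and `ylab`.

```r
# Made-up yearly averages, standing in for a real aggregated result
avgDF <- data.frame(
  year     = 2001:2005,
  avgPrice = c(4200, 4800, 5100, 5900, 6400)
)

# Plot type 1: points connected by lines (type = "b" draws both)
plot(avgDF$year, avgDF$avgPrice, type = "b",
     main = "Average price by year (made-up data)",
     xlab = "Year", ylab = "Average price")

# Plot type 2: bar plot with one bar per year
mids <- barplot(avgDF$avgPrice, names.arg = avgDF$year,
                main = "Average price by year (made-up data)",
                xlab = "Year", ylab = "Average price")
```

`barplot` returns the horizontal midpoints of the bars, stored here in `mids`, which can be useful if you later want to add text or points above each bar.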

Remember, all plots should have a title and appropriate axis labels. Axes should also be scaled appropriately. It is also necessary to explain your plot using a few sentences.

Items to submit
• 2 different plots of average price of vehicle by year.

• A 3-5 sentence explanation of any patterns present in the data along with your personal observations.

• `firstname-lastname-project03.ipynb`.

• `firstname-lastname-project03.R`.

You must double check your `.ipynb` after submitting it in Gradescope. A very common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. Please take the time to double check your work. See here for instructions on how to double check this. You will not receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.