TDM 10100: Project 3 — Fall 2023

Motivation: data.frames are the primary data structure you will work with when using R. It is important to understand how to insert, retrieve, and update data in a data.frame.

Context: In Project 2 we ran our first R code, learned about vectors and indexing, and explored some basic functions in R. In this project, we will continue to reinforce what we’ve already learned and learn more about how data frames (formally, the data.frame class) work in R.

Scope: r, data.frames, factors

Learning Objectives
  • Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.

  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • List the differences between lists, vectors, factors, and data.frames, and when to use each.

Make sure to read about and use the template found here, and review the important information about project submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/craigslist/vehicles.csv

Setting Up

First, let’s take a look at all of the data available to students. To do this, we are going to use a new function, shown below, that lists all of the files in the craigslist folder.

Let’s run the below command using the seminar-r kernel to view all the files in the folder.

list.files("/anvil/projects/tdm/data/craigslist")

As you can see, we have two different files of Craigslist data. For this project, we are interested in the vehicles.csv file.

Before we read in the data, we should check the size of the file to get an idea of how big it is. This is important because, if the file is too large, we may need to request more cores for our Jupyter Lab session, or else our kernel may 'die' while reading it.

We can check the size of our file (in bytes) using the following command.

file.info("/anvil/projects/tdm/data/craigslist/vehicles.csv")$size

You can also use file.info to see other information about the file.

  • size (double): file size in bytes.

  • isdir (logical): is the file a directory?

  • mode (integer of class "octmode"): the file permissions, printed in octal, for example 644.

  • mtime, ctime, atime (integers of class "POSIXct"): file modification, ‘last status change’, and last access times.

  • uid (integer): the user ID of the file’s owner.

  • gid (integer): the group ID of the file’s group.

  • uname (character): uid interpreted as a user name.

  • grname (character): gid interpreted as a group name.

(Unknown user and group names will be NA.)
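
For example, here is a minimal sketch that stores the file.info() record and converts the size to gigabytes (the exact values you see will depend on the file):

info <- file.info("/anvil/projects/tdm/data/craigslist/vehicles.csv")
info$isdir          # FALSE: this is a regular file, not a directory
info$size / 1024^3  # size converted from bytes to gigabytes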

Now that we have made sure our file isn’t too big (1.44 GB), let’s read it into a dataframe in the same way that we have done in the previous two projects.

We recommend using 2 cores for your Jupyter Lab session this week.

Now we can read in the data and get started with our analysis.

myDF <- read.csv("/anvil/projects/tdm/data/craigslist/vehicles.csv")

Questions

Question 1 (1 pt)

  1. How many rows and columns does our dataframe have?

  2. What type(s) of data are in this dataframe (for example, numerical values, text strings, etc.)?

  3. 1-2 sentences giving an overall description of our data.

As we stressed in Project 2, familiarizing yourself with the data you are going to work with is an important first step. For this question, we want to figure out how many rows and columns are in our data along with what the types of data are in our data frame. The hint below contains all of the functions that we need to solve this problem. (We also covered these functions in detail in Project 2, so feel free to reference the previous project if you want more information.)

When answering the third sub-question, consider describing where the data appears to come from, what the data contains, and any important details that immediately stand out to you.

The head(), dim(), and str() functions could be helpful in answering this question.
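
For instance, a minimal sketch of how these functions might be used (adapt the output into your markdown answers):

dim(myDF)   # number of rows and columns
str(myDF)   # column names and the type of data stored in each column
head(myDF)  # first few rows, to get a feel for the contents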

Items to submit
  • The number of rows and columns in our dataframe, in a markdown cell.

  • The types of data in our dataframe, in a markdown cell.

  • 1-2 sentences summarizing our data.

Question 2 (1 pt)

  1. Print the number of NA values in the 'year' column of myDF, and the percentage of the total number of rows in myDF that this represents.

  2. Create a new data frame called goodyearsDF with only the rows of myDF that have a defined year (non NA values). Print the head of this new data frame.

  3. Create a new data frame called missingyearsDF with only the rows of myDF that are missing data in the year column. Print the head of this new data frame.

Now that we have a better understanding of the general structure and contents of our data, let’s focus on some specific patterns in our data that may make analysis more challenging.

Often, one of these patterns is missing data. Missing data can come in many forms, such as NA, NaN, NULL, or simply a blank space in one of our dataframe’s cells. When performing data analysis, it is important to consider missing data and decide how to handle it appropriately.
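
As a quick illustration of how R treats these values (a small sketch you can run in a scratch cell):

is.na(NA)      # TRUE: NA marks a missing value
is.na(NaN)     # TRUE: NaN ("not a number") also counts as missing for is.na()
is.nan(NA)     # FALSE: NA is missing, but it is not "not a number"
is.null(NULL)  # TRUE: NULL is the absence of a value altogether
length(NULL)   # 0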

In this question, we will look at filtering out rows with missing data. The R function is.na() returns TRUE or FALSE according to whether the corresponding data is missing or not missing (respectively). An exclamation mark changes TRUE to FALSE and changes FALSE to TRUE. For this reason, !is.na() indicates which data are not NA values, in other words, which data are not missing. As an example, if we wanted to create a new dataframe with all of the rows that are not missing the latitude values, we could use any of the following equivalent methods:

goodlatitudeDF <- subset(myDF, !is.na(myDF$lat))
goodlatitudeDF <- subset(myDF, !is.na(lat))
goodlatitudeDF <- myDF[!is.na(myDF$lat), ]

In the second method, the subset function knows that we are working with myDF, so we do not need to write myDF$lat; we can refer simply to lat, and the subset function knows that we mean the lat column of myDF.

In the third method, when we write myDF[ , ] we put things before the comma that are conditions on the rows, and we put things after the comma that are conditions on the columns. So we are saying that we want rows of myDF for which the lat values are not NA, and we want all of the columns of myDF.

If we compare the sizes of the original data frame and this new data frame, we can see that some rows were removed.

dim(myDF)
dim(goodlatitudeDF)
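
Staying with the lat column as the example, here is a short sketch of counting the missing values and the percentage of rows they represent (you will do the same kind of thing for the year column):

sum(is.na(myDF$lat))                     # number of rows missing a latitude value
100 * sum(is.na(myDF$lat)) / nrow(myDF)  # percentage of all rows in myDF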

To answer question 2, we want you to work (instead) with the year column, trying the same techniques that we demonstrated above for the lat column. We simply used the lat column in these examples so that you have a model for how to deal with missing data in the year column.

Items to submit
  • The number of NA values in the year column of myDF and the percentage of the total number of rows in myDF that this represents, in a markdown cell.

  • A dataframe called goodyearsDF containing only the rows in myDF that have a defined year (non NA values), and print the head of that data frame.

  • A dataframe called missingyearsDF containing only the rows in myDF that are missing the year data, and print the head of that data frame.

Question 3 (2 pts)

Use the myDF data.frame for this question.

  1. Print the mean price of vehicles by year during the last 20 years.

  2. Find which year of vehicle appears most frequently in our data, and how frequently it occurs.

Using the aggregate function is one possible way to solve this problem. An example of finding the mean price for each type of car is shown here:

aggregate(price ~ type, data = myDF, FUN = mean)

We want you to (instead) find the mean price for cars by year.
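
Note that aggregate works just as well on a filtered data frame. For example, the sketch below restricts to rows with a non-missing latitude (reusing the Question 2 example) before aggregating by type; the same idea can be used to restrict to the years you care about:

aggregate(price ~ type, data = subset(myDF, !is.na(lat)), FUN = mean)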

Finding the most frequent value in our data can be done using table, which we have talked about previously, in conjunction with the which.max function. An example of finding the most frequent type of car is shown here:

which.max(table(myDF$type))
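
As a sketch of getting both pieces of information at once, again using the type column as the example:

type_counts <- table(myDF$type)  # counts for each type of car
names(which.max(type_counts))    # the most frequent type
max(type_counts)                 # how many times it occurs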

Now we want you to (instead) find the year in which the most cars appear in the data set.

Items to submit
  • The mean price of each year of vehicle for the last 20 years, in a markdown cell.

  • The most frequent year in our data, and how frequently it occurred.

Question 4 (2 pts)

  1. Among the region_url values in the data set, which region_url is most popular?

  2. What are the three most popular states, in terms of the number of craigslist listings that appear?

Use the table, sort, and tail commands to find the most popular region_url and the most popular three states.

(These two questions are not related to each other. In other words, when you look for the three states that appear most frequently, they have nothing at all to do with the region_url that you found.)
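
A sketch of the general pattern, once more using the type column as a stand-in (swap in the column you actually need):

tail(sort(table(myDF$type)), n = 3)  # count each value, sort ascending, keep the 3 largest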

Items to submit
  • The most popular region_url.

  • The three states that appear most frequently.

Question 5 (2 pts)

  1. In question 3, we found the average price of vehicles by year. ("Average" and "mean" are two different words for the very same concept.) Choose at least two different plot types in R, and create two plots that show the average vehicle price by year.

  2. Write 3-5 sentences detailing any patterns present in the data along with your personal observations (e.g., shape, outliers, etc.).

Remember, all plots should have a title and appropriate axis labels. Axes should also be scaled appropriately. It is also necessary to explain your plot using a few sentences.
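
As a reminder of the relevant arguments, here is a tiny sketch using made-up placeholder values (the x and y vectors below are not the vehicle data):

x <- 2004:2023                           # placeholder years
y <- rnorm(20, mean = 15000, sd = 2000)  # placeholder average prices
plot(x, y, type = "b",
     main = "Average price by year (example)",
     xlab = "Year",
     ylab = "Average price (dollars)")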

Items to submit
  • 2 different plots of average price of vehicle by year.

  • A 3-5 sentence explanation of any patterns present in the data along with your personal observations.

Submitting your Work

Nice work, you’ve finished Project 3! Make sure that all of the below files are included in your submission, and feel free to come to seminar, post on Piazza, or visit some office hours if you have any further questions.

Items to submit
  • firstname-lastname-project03.ipynb.

  • firstname-lastname-project03.R.

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or it does not render properly in gradescope. Please ask a TA if you need help with this.

Please make sure to double check that your submission is complete, and contains all of your code and output, before submitting. If you are on a spotty internet connection, it is recommended that you download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.