TDM 10100: Project 2 — 2023

In this project we will dive in head-first and learn some of the basics while solving data-driven problems.

5 Basic Types of Data

  • Values like 1.5 are called numeric values, real numbers, decimal numbers, etc.

  • Values like 7 are called integers or whole numbers.

  • Values TRUE or FALSE are called logical values or Boolean values.

  • Text values consist of sequences of characters and are called strings (R calls them character values). A string can hold a single word or many words.

  • Values such as 3 + 2i are called complex numbers. We usually do not encounter these in The Data Mine.
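As a small illustration of the five types above, the sketch below asks R for the class of one example value of each type. The values are made up for illustration. Note that R stores a bare 7 as a numeric value; the L suffix is what requests an integer.

```r
# Hypothetical example values, one per basic type
example_values <- list(1.5, 7L, TRUE, "IND", 3 + 2i)

# class() reports how R stores each value
value_classes <- sapply(example_values, class)
print(value_classes)

# A bare 7 (without the L suffix) is stored as numeric, not integer
bare_seven_class <- class(7)
print(bare_seven_class)
```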

R and Python both have their advantages and disadvantages. A key part of learning data science methods is to understand the situations in which R is a more helpful tool to use, or Python is a more helpful tool to use. Both of them are good for their own purposes. In a similar way, hammers and screwdrivers and drills and many other tools are useful for construction, but they all have their own individual purposes.

In addition, there are many other languages and tools. For example, Julia, Rust, and Go are emerging as relatively newer languages, each with its own advantages.

Context: In the last project we set the stage for the rest of the semester. We got some familiarity with our project templates, and modified and ran some examples.

In this project, we will continue to use R within Jupyter Lab to solve problems. Soon, you will see how powerful R is and why it is often more effective than using spreadsheets as a tool for data analysis.

Learning Objectives
  • Be aware of different concepts, such as lists, vectors, factors, and data.frames, and when to apply them.

  • Be able to explain and demonstrate: positional, named, and logical indexing.

  • Read and write basic (csv) data using R.

  • Identify good and bad aspects of simple plots.

Make sure to read about and use the template found here, and read the important information about project submissions here.

Dataset(s)

The following questions will use the following dataset:

  • /anvil/projects/tdm/data/flights/subset/1995.csv

Questions

Question 1 (1 pt)

  1. How many columns does this data frame have? (0.25 pts)

  2. How many rows does this data frame have? (0.25 pts)

  3. What type(s) of data are in this data frame (for example, numerical values and/or text strings)? (0.5 pts)

"Kernel died" is a common error you could encounter during the semester. If you get a pop-up that says your kernel died, it typically means one of two things: 1) Anvil is down, so be sure to check your email and Piazza for updates; or 2) you need more cores for your project, so try starting a new session with an additional core. If you are already using more than 4 cores, a lack of cores is NOT the problem.

For this project, you will probably need to reserve 2-3 cores. Also, remember to use the seminar-r kernel going forward in this class.

It is important to get a good understanding of the dataset(s) with which you are working. This is the best first step to help solve any data-driven problems.

We are going to use the read.csv() function to load our datasets into a data frame.

To read data in R from a CSV file (.csv), you use the following command:

myDF <- read.csv("/anvil/projects/tdm/data/flights/subset/1995.csv")

R is a case-sensitive language, so if you try to take the head of mydf instead of myDF, it will not work.

Here myDF is a variable - a name that references our data frame. In practice, you should always use names that are specific, descriptive, and meaningful.
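To see read.csv() work without access to Anvil, here is a minimal sketch that writes a tiny data frame to a temporary CSV file and reads it back. The data frame toy_flights, its values, and the temporary path are all hypothetical; only the column names are borrowed from the flights data.

```r
# Hypothetical toy data; the values are made up for illustration
toy_flights <- data.frame(
  Origin   = c("IND", "ORD", "ORD"),
  Dest     = c("ORD", "IND", "ATL"),
  Distance = c(177, 177, 606)
)

# Write the toy data to a temporary CSV file (no row names column)
toy_path <- tempfile(fileext = ".csv")
write.csv(toy_flights, toy_path, row.names = FALSE)

# Read the file back into a new data frame, just like with the real data
toy_copy <- read.csv(toy_path)
head(toy_copy)
```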

We want to use functions such as head, tail, dim, summary, str, and class to get a better understanding of our data frame.

  • head(myDF) - Look at the head (or top) of the data frame

  • tail(myDF) - Look at the tail (or bottom) of the data frame

  • class(myDF$Dest) - Return the type of data in a column of the data frame, for instance, in a column that stores the destination of flights (Dest)

  • Try and figure out dim, summary, and str on your own, but we give some details about them in the video as well.
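The sketch below shows how these inspection functions behave on a small, made-up data frame. The name toy_flights and its contents are hypothetical, so the results will differ from what you see with the real flights data.

```r
# Hypothetical toy data frame with 4 rows and 3 columns
toy_flights <- data.frame(
  Origin   = c("IND", "ORD", "ORD", "ATL"),
  Dest     = c("ORD", "IND", "ATL", "IND"),
  Distance = c(177, 177, 606, 432)
)

# dim() gives the number of rows first, then the number of columns
toy_dim <- dim(toy_flights)
print(toy_dim)

# str() lists each column with its type and first few values
str(toy_flights)

# summary() gives a quick statistical overview of a column
summary(toy_flights$Distance)

# class() gives the type of data stored in a single column
dest_class <- class(toy_flights$Dest)
print(dest_class)
```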

Items to submit
  • Code used to solve sub-questions 1, 2, and 3, and the output from running that code.

  • The number of columns and rows in the data frame, in a markdown cell.

  • A list of all of the types of data present in the data frame, in a markdown cell.

Question 2 (1 pt)

  1. What type of data is in the vector myairports? (0.5 pts)

  2. The vector myairports contains all of the airports where flights departed from in 1995. Print the first 250 of those airports. (Do not print all of the airports, because there are 5327435 such values!) How many of the first 250 flights departed from O’Hare? (0.5 pts)

A vector is a simple way to store a sequence of data. The data can be numeric data, logical data, textual data, etc.

Let’s create a new vector called myairports containing all of the origin airports (i.e., the airports where the flights departed) from the column myDF$Origin of the data frame myDF. We can do this using the $ operator. Documentation on the $ operator can be found here, and an example of how to use it is given below.

newVector <- myDF$ColumnName

# to generate our vector, this would look like
myairports <- myDF$Origin

The head() function may help you with the second sub-question.
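As a rough sketch of the same idea on made-up data, the example below pulls a column out with $, takes the first few values with head(), and counts matches by summing a logical comparison. The names toy_flights, toy_origins, and first_three are hypothetical.

```r
# Hypothetical toy data; the real Origin column is far longer
toy_flights <- data.frame(
  Origin = c("ORD", "IND", "ORD", "ATL", "ORD")
)

# Pull one column out of the data frame as a vector
toy_origins <- toy_flights$Origin
origin_class <- class(toy_origins)
print(origin_class)

# Keep only the first 3 values
first_three <- head(toy_origins, n = 3)
print(first_three)

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
ord_count <- sum(first_three == "ORD")
print(ord_count)
```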

Items to submit
  • Code used to create myairports and to solve the above sub-questions, and the output from running that code.

  • The type of data in your myairports vector in a markdown cell.

  • The number of flights that are from O’Hare in the first 250 entries of your myairports vector, in a markdown cell.

Question 3 (2 pts)

  1. How many flights departed from Indianapolis (IND) in 1995? How many flights landed in Indianapolis (IND) in 1995? (1 pt)

  2. Consider the flight data from row 894 of the data frame. What airport did it depart from? Where did it arrive? (0.5 pts)

  3. How many flights have a distance of less than 200 miles? (0.5 pts)

There are many different ways to access data after we load it, and each has its own use case. One of the most common ways to access data is called indexing. Indexing is a way of selecting or excluding specific elements in our data. This is best shown through examples, some of which can be found here.

Typically we use brackets [ ] when indexing. For example, we can select a specific column and a certain range within that column. Comparison operators that help us select elements include:

  • < less than

  • > greater than

  • <= less than or equal to

  • >= greater than or equal to

  • == is equal to

  • != is not equal to

Many programming languages, such as Python and C, are called "zero-indexed". This means that they begin counting from '0' instead of '1'. Because R is not zero-indexed, we can count like humans normally do. In other words, R starts numbering with row '1'.
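Here is a small sketch, using a made-up vector called toy_distances, of how counting from 1 and the comparison operators work together. A comparison returns one TRUE or FALSE per element, and placing that result inside the brackets keeps only the TRUE positions.

```r
# Hypothetical distances in miles
toy_distances <- c(177, 606, 95, 1846)

# R starts counting at 1, so this is the first element
first_value <- toy_distances[1]
print(first_value)

# A comparison gives one logical value per element
is_short <- toy_distances < 200
print(is_short)

# Using the logical vector inside [ ] keeps only the TRUE positions
short_distances <- toy_distances[is_short]
print(short_distances)
```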

Helpful Examples
# get all of the data between row "row_index_start" and "row_index_end"
myDF[row_index_start:row_index_end,]

# get all of the data from row 3 of myDF
myDF[3,]

# get all of the data from column 5 of myDF
myDF[,5]

# get every row of data in the columns between
# myfirstcolumn and mylastcolumn
myDF[,myfirstcolumn:mylastcolumn]


# get the first 250 values from column 17
head(myDF[,17], n=250)

# retrieve all of the Distance values that are greater than 100
myDF$Distance[myDF$Distance > 100]

# retrieve all of the Origin values that are equal to "ORD"
myDF$Origin[myDF$Origin == "ORD"]

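The examples above can also be tried on a small, made-up data frame. The sketch below shows positional, named, and logical indexing side by side, plus two common ways to count matches. The name toy_flights and its values are hypothetical.

```r
# Hypothetical toy data frame
toy_flights <- data.frame(
  Origin   = c("IND", "ORD", "ORD", "ATL"),
  Dest     = c("ORD", "IND", "ATL", "IND"),
  Distance = c(177, 177, 606, 432)
)

# Positional indexing: everything in row 2
row_two <- toy_flights[2, ]
print(row_two)

# Named indexing: only the Origin and Dest columns of row 2
row_two_airports <- toy_flights[2, c("Origin", "Dest")]
print(row_two_airports)

# Logical indexing: keep the rows where the condition is TRUE
short_flights <- toy_flights[toy_flights$Distance < 200, ]
short_count <- nrow(short_flights)
print(short_count)

# Counting matches directly by summing a logical vector
ind_arrivals <- sum(toy_flights$Dest == "IND")
print(ind_arrivals)
```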
Items to submit
  • Code used to solve each sub-question above, and the output from running it.

  • The number of flights that departed from Indianapolis in our data, in a markdown cell.

  • The number of flights that landed in Indianapolis in our data, in a markdown cell.

  • The origin and destination airport from row 894 of the data frame, in a markdown cell.

  • The number of flights that have distances less than 200 miles, in a markdown cell.

Question 4 (2 pts)

  1. Rank the airline companies (in the column myDF$UniqueCarrier) according to their popularity (i.e., according to the number of flights on each airline). (1 pt)

  2. Now find the ten airplanes that had the most flights in 1995. List them in order, from most popular to least popular. Do you notice anything unusual about the results? (1 pt)

Oftentimes we will be dealing with enormous quantities of data, and it just isn’t feasible to try and look at the data point-by-point in order to summarize the entire data frame. When we find ourselves in a situation like this, the table() function is here to save the day!

Take a look at this link for some examples of how to use the table() function in R. Once you have a good understanding of how it works, try to answer the two sub-questions above using the table() function. You may need to use some other basic R functions as well.

It is useful to use functions in R and see how they behave, and then to take a function of the result, and take a function of that result, etc. For instance, it is common to summarize a vector in a table, and then sort the results, and then take the first few largest or smallest values. This is known as "nesting" functions, and is common throughout programming.
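A minimal sketch of the table, sort, and head pattern described above, using a made-up vector of carrier codes. The name toy_carriers and its values are hypothetical. Each step is stored in its own variable first, and the nested one-line version follows.

```r
# Hypothetical carrier codes
toy_carriers <- c("DL", "AA", "DL", "UA", "DL", "AA")

# Step 1: count how many times each value appears
carrier_counts <- table(toy_carriers)

# Step 2: sort the counts from largest to smallest
sorted_counts <- sort(carrier_counts, decreasing = TRUE)

# Step 3: keep only the first 2 entries
top_two <- head(sorted_counts, n = 2)
print(top_two)

# The same three steps, nested into a single line
top_two_nested <- head(sort(table(toy_carriers), decreasing = TRUE), n = 2)
print(top_two_nested)
```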

Items to submit
  • Code used to solve the sub-questions above, and the output from running it.

  • The airline company codes in order of popularity, in a markdown cell.

  • The ten airplane tail codes with the most flights in our data, ordered from most flights to least flights, in a markdown cell.

Question 5 (2 pts)

  1. Using the R built-in function hist(), create a histogram of flight distances. Make sure your plot has an appropriate title and labelled axes for full credit. (1 pt)

  2. Write 2-3 sentences detailing any patterns you see in your plot and what those patterns tell you about the distance of flights in this dataset. (1 pt)

Graphs are a very important tool in analyzing data. By visualizing our data in any of a number of ways, we can discover patterns that may not be as readily apparent by simply looking at tables. As such, creating and reading graphs is a vital part of every data scientist's skill set.

In this question, we would like you to get comfortable with plotting in R. There are a number of built-in tools for basic plotting in this language, but we will focus on histograms here. Using the Distance column of our data frame, create a histogram of the distribution of distances for our data. Then, write a few sentences describing your plot, any patterns you see, and what the distribution as a whole looks like.

Documentation on R histograms may help you understand how to complete this question.
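As a rough sketch, the example below draws a histogram of a few made-up distances and sets the title and axis labels through the main, xlab, and ylab arguments of hist(). The vector toy_distances and the label text are hypothetical; choose a title and labels that describe your own plot.

```r
# Hypothetical distances in miles
toy_distances <- c(95, 177, 177, 250, 432, 606, 740, 1846)

# main sets the title; xlab and ylab set the axis labels
toy_hist <- hist(
  toy_distances,
  main = "Distribution of Toy Flight Distances",
  xlab = "Distance (miles)",
  ylab = "Number of flights"
)

# hist() also returns the bin counts, which can be inspected directly
print(toy_hist$counts)
```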

Items to submit
  • Code used to generate your histogram.

  • A histogram of the distances of flights in our data with a title and labelled axes.

  • 2-3 sentences about the patterns in the data, and what those patterns tell you about the greater data, in a markdown cell.

Submitting your Work

Congratulations, you’ve finished Project 2! Make sure that all of the below files are included in your submission, and feel free to come to seminar, post on Piazza, or visit some office hours if you have any further questions.

Items to submit
  • firstname-lastname-project02.ipynb.

  • firstname-lastname-project02.R.

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.

Please double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, we recommend downloading your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.

Here is the Zoom recording of the 4:30 PM discussion with students from 28 August 2023: