TDM 10100: Project 2 — 2022

The R[1] environment is a powerful tool to perform data analysis. R’s language is often compared to Python. Both languages have their advantages and disadvantages, and both are worth learning.

In this project we will dive in head first and learn some of the basics while solving data-driven problems.

5 basic types of data
  • Values like 1.5 are called numeric values, real numbers, decimal numbers, etc.

  • Values like 7 are called integers or whole numbers.

  • Values TRUE or FALSE are called logical values or Boolean values.

  • Text values such as "IND" are called strings or character values. A string is a sequence of characters.

  • Values such as 3 + 2i[2] are called complex numbers. We usually do not encounter these in The Data Mine.
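
As a rough illustration of the five types listed above, the sketch below asks R for the class of one example value of each type. The names `values` and `types` are made up for this example. Note that a bare 7 is stored as a numeric value; the `L` suffix is what asks R for an integer.

```r
# One example value of each basic type (hypothetical examples)
values <- list(1.5, 7L, TRUE, "IND", 3 + 2i)

# class() reports the type of each value
types <- sapply(values, class)
print(types)

# A bare 7, without the L suffix, is stored as a numeric value
class(7)
```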

R and Python both have their advantages and disadvantages. A key part of learning data science methods is to understand the situations in which R is a more helpful tool to use, or Python is a more helpful tool to use. Both of them are good for their own purposes. In a similar way, hammers and screwdrivers and drills and many other tools are useful for construction, but they all have their own individual purposes.

In addition, there are many other languages and tools. For example, Julia, Rust, and Go are relatively newer languages that each have their own advantages.

Context: In the last project we set the stage for the rest of the semester. We got some familiarity with our project templates, and modified and ran some examples.

In this project, we will continue to use R within Jupyter Lab to solve problems. Soon, you will see how powerful R is and why it is often more effective than using spreadsheets as a tool for data analysis.

Scope: R, vectors, lists, indexing

Learning Objectives
  • Be aware of different concepts, such as lists, vectors, factors, and data.frames, and know when to apply them.

  • Be able to explain and demonstrate: positional, named, and logical indexing.

  • Read and write basic (csv) data using R.

  • Identify good and bad aspects of simple plots.

Make sure to read about, and use, the template found here, as well as the important information about project submissions here.

Dataset(s)

The following questions will use the following dataset(s):

myDF <- read.csv("/anvil/projects/tdm/data/flights/subset/1995.csv", stringsAsFactors = TRUE)

ONE

The data that we may be working with does not always come to us neat and clean[3]. It is important to get a good understanding of the dataset(s) with which you are working. This is the best first step to help solve any data-driven problems.

Insider Knowledge

Datasets can be thought of as one or more observations of one or more variables. For most datasets, each row is an observation and each column is a variable. (Some datasets may not follow that convention.)

We are going to use the read.csv function to load our datasets into a dataframe named …​
We want to use functions such as head, tail, dim, summary, str, and class to get a better understanding of our dataframe (DF).
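
To see how these functions behave without loading the full flights file, here is a minimal sketch on a tiny made-up dataset. The names `toyDF`, `toyfile`, and `toyread`, and the values in the table, are assumptions for illustration only. The sketch writes the toy data to a temporary csv file, reads it back with read.csv, and then inspects the result.

```r
# Hypothetical toy data, standing in for the much larger flights file
toyDF <- data.frame(
  Origin   = c("ORD", "IND", "ORD", "ATL"),
  Dest     = c("IND", "ORD", "ATL", "ORD"),
  Distance = c(177, 177, 606, 606)
)

# Write the toy data to a temporary csv file, then read it back
toyfile <- tempfile(fileext = ".csv")
write.csv(toyDF, toyfile, row.names = FALSE)
toyread <- read.csv(toyfile, stringsAsFactors = TRUE)

dim(toyread)               # number of rows, then number of columns
str(toyread)               # compact overview of every column
class(toyread$Dest)        # text columns are read as factors here
summary(toyread$Distance)  # minimum, quartiles, mean, and maximum
```

With stringsAsFactors = TRUE, columns of text are stored as factors rather than as plain character strings, which is worth keeping in mind when checking the type of a column.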

Helpful Hints
#looks at the head of the dataframe
head(myDF)
#looks at the tail of the dataframe
tail(myDF)
#returns the type of data in a column of the dataframe, for instance, the type of data in the column that stores the destination airports of the flights
class(myDF$Dest)
  1. How many columns does this dataframe have?

  2. How many rows does this dataframe have?

  3. What type(s) of data are in this dataframe (for example, numerical values and/or text strings)?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • The answers to the 3 questions above.

TWO

We can create a new vector[4] containing all of the origin airports (i.e., the airports where the flights departed) from the column myDF$Origin of the data frame myDF.

#takes the selected information from the dataframe and puts it into a new vector called `myairports`
myairports <- myDF$Origin
Insider Knowledge

A vector is a simple way to store a sequence of data. The data can be numeric data, logical data, textual data, etc.
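
The sketch below shows a few common operations on a small vector of text values. The vector `toyairports` and its contents are made up for this example; the same functions can be applied to a much longer vector.

```r
# Hypothetical short vector of origin airports
toyairports <- c("ORD", "IND", "ORD", "ATL", "ORD", "IND")

class(toyairports)        # type of data stored in the vector
length(toyairports)       # how many values the vector holds
head(toyairports, n = 3)  # the first 3 values only

# A comparison gives TRUE or FALSE for each element;
# sum() counts the TRUE values because each TRUE counts as 1
ordcount <- sum(toyairports == "ORD")
ordcount
```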

To assist with this question, please also see the end of the video from Question 1 (above).

  1. What type of data is in the vector myairports?

  2. The vector myairports contains all of the airports where flights departed in 1995. Print the first 250 of those airports. [Do not print all of the airports, because there are 5327435 such values!] How many of the first 250 flights departed from O’Hare?

  3. How many flights departed from O’Hare altogether in 1995?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • The answers to the 3 questions above.

THREE

Indexing

Insider Knowledge

Accessing data can be done in many ways. One of those ways is called indexing. Typically we use brackets [ ] when indexing. By doing this we can select or even exclude specific elements. For example, we can select a specific column and a certain range within the column. Some examples of symbols to help us select elements include:
* < less than
* > greater than
* <= less than or equal to
* >= greater than or equal to
* == is equal to
* != is not equal to
It is also important to note that indexing in R begins at 1. (This means that the first row of the dataframe is row 1.)
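
As a small sketch of positional, named, and logical indexing, consider the made-up named vector below. The name `toydist` and its values are assumptions for illustration only.

```r
# Hypothetical named vector of flight distances in miles
toydist <- c(first = 177, second = 606, third = 2475, fourth = 95)

toydist[1]              # positional: the first element (indexing starts at 1)
toydist[2:3]            # positional: elements 2 through 3
toydist[-1]             # negative position: everything except the first element
toydist["third"]        # named: the element called "third"
toydist[toydist < 200]  # logical: only the elements less than 200

# Counting how many elements satisfy a condition
shortcount <- sum(toydist < 200)
shortcount
```

Logical indexing keeps the elements for which the comparison is TRUE, so the comparison operators listed above can be used directly inside the brackets.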

Helpful Hints
#finding data by their indices (a single column is a vector, so no comma is used inside the brackets)
myDF$Distance[row_index_start:row_index_end]
#creates a new vector with the specific info
mynewvector <- myDF$putcolumnnamehere
#all of the data from row 3
myDF[3,]
#all of the data in all of the rows, with columns between myfirstcolumn and mylastcolumn
myDF[,myfirstcolumn:mylastcolumn]
#and/or
#the first 250 values from column 17
head(myDF[,17], n=250)
#creates a new vector containing all of the distances that are greater than 2000
longdistances <- myDF$Distance[myDF$Distance > 2000]
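
The same ideas can be tried on a tiny data frame before running them on millions of rows. This is only a sketch: `toyDF`, `fromORD`, `toIND`, and the values in the table are hypothetical.

```r
# Hypothetical toy data frame with a few flights
toyDF <- data.frame(
  Origin   = c("ORD", "IND", "ORD", "ATL"),
  Dest     = c("IND", "ORD", "ATL", "IND"),
  Distance = c(177, 177, 606, 432)
)

toyDF[2, ]                      # all of the columns from row 2
toyDF[, 1:2]                    # all of the rows, columns 1 through 2
toyDF$Distance[2:3]             # elements 2 through 3 of one column
toyDF[toyDF$Origin == "ORD", ]  # only the rows whose origin is ORD

# Counting rows that satisfy a condition
fromORD <- sum(toyDF$Origin == "ORD")
toIND <- sum(toyDF$Dest == "IND")
```

A data frame takes a row index and a column index separated by a comma, while a single column extracted with $ is a vector and takes only one index.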
  1. How many flights departed from Indianapolis (IND) in 1995? How many flights landed there?

  2. Consider the flight data from row 894 of the data frame. What airport did it depart from? Where did it arrive?

  3. How many flights have a distance of less than 200 miles?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • The answers to the 3 questions above.

FOUR

Summarizing vectors using tables

The table command is helpful for summarizing large quantities of data.

Insider Knowledge

It is useful to use functions in R and see how they behave, and then to take a function of the result, and take a function of that result, etc. For instance, it is common to summarize a vector in a table, and then sort the results, and then take the first few largest or smallest values. Remember also that R is a case-sensitive language.

table(myDF$Origin)   # summarizes how many flights departed from each airport
sort(table(myDF$Origin))   # sorts those results in increasing numeric order
tail(sort(table(myDF$Origin)),n=10)  # finds the 10 most popular airports, according to the number of flights that departed from each airport.
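
Here is a minimal sketch of the same chain of functions on a short made-up vector. The names `toycarriers`, `counts`, `ranked`, and `top2` are assumptions for illustration. It also uses the decreasing argument of sort, so that the largest counts come first.

```r
# Hypothetical short vector of carrier codes
toycarriers <- c("DL", "AA", "DL", "UA", "DL", "AA")

counts <- table(toycarriers)              # how many times each code appears
ranked <- sort(counts, decreasing = TRUE) # largest counts first
top2 <- head(ranked, n = 2)               # the two most common codes

ranked
names(top2)
```

Using sort with decreasing = TRUE followed by head gives the most common values in order from most to least popular, whereas sort followed by tail lists them from least to most popular.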
  1. Rank the airline companies (in the column myDF$UniqueCarrier) according to their popularity, i.e., according to the number of flights on each airline.

  2. Which are the three most popular airlines from 1995?

  3. Now find the ten airplanes that had the most flights in 1995. List them in order, from most popular to least popular. Do you notice anything unusual about the results?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • The answers to the 3 questions above.

FIVE

Basic graph types are helpful for visualizing data. They can be an important tool in discovering insights into the data you are working with.
R has a number of tools built in for basic graphs, such as scatter plots, bar charts, histograms, etc.

Insider Knowledge

A dot plot, also known as a dot chart, is similar to a bar chart or a scatter plot. In R, the categories are displayed along the vertical axis and the corresponding values are displayed along the horizontal axis.

We can assign each group a color to help tell the groups apart when plotting a dot chart.

We can also plot a column that we find interesting, to take a look at what the data might show us. For example, if we wanted to see whether the number of flights differs across the days of the week, we could use hist.

mydays <- myDF$DayOfWeek
hist(mydays)
Helpful Hints
mycities <- tail(sort(table(myDF$Origin)),n=10)
dotchart(mycities, pch = 21, bg = "green", pt.cex = 1.5)
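
One way to keep the number of plotted points small is to summarize the data with table before plotting. The sketch below does this on a short made-up vector of day codes; `toydays` and `daycounts` are hypothetical names. For discrete values such as day codes, a bar plot of the table is another option alongside hist.

```r
# Hypothetical day-of-week codes for a handful of flights
toydays <- c(1, 1, 2, 3, 3, 3, 5, 7, 7)

# Summarize first: one count per distinct day, instead of one point per flight
daycounts <- table(toydays)

# Bar plot of the counts
barplot(daycounts, xlab = "Day of week", ylab = "Number of flights")

# Dot chart of the same counts, with the day codes as labels
dotchart(as.numeric(daycounts), labels = names(daycounts),
         pch = 21, bg = "green", pt.cex = 1.5,
         xlab = "Number of flights")
```

Here only five bars or dots are drawn, one for each distinct day code in the toy data, no matter how many flights the original vector holds.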
  1. Pick a column of data that you are interested in studying, or a question that you want answered. Create either a plot, or a dotchart. Before making the plot, think about how many dots will be displayed on your plot or dotchart. If you try to display millions of dots, you might cause your Jupyter Lab session to freeze or crash. It is useful to think ahead and to consider how your plot might look, before you accidentally try to display millions of dots.

  2. Describe any patterns you may see in your plot or your dotchart. If there are none, that is okay, and you can just write "there seem to be no patterns."

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • The plot or dotchart and your commentary about what you created and what you observed.

Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.


1. R is case sensitive
3. "Raw data" vs "Clean data". Some datasets require "cleaning", such as removing duplicates, removing null values, and discarding irrelevant data.