TDM 10100: Project 4 - Data Types

Project Objectives

Motivation: Learning about data types and how to work with them is a useful skill for anyone working with data.

Context: We will learn how to begin working with data from the web, and will begin learning about vectors and subsetting data.

Scope: R, data types, vectors, plots

Learning Objectives

Learn about vectors in R
Practice moving around data
Make interpretable outputs

Make sure to read about and use the template found here, and read the important information about project submissions here.

Dataset

/anvil/projects/tdm/data/titanic/titanic.csv

The Titanic struck an iceberg on April 14, 1912, and sank on April 15. This was one of the deadliest large-scale catastrophes involving ships during a time of peace, resulting in the deaths of up to 1,635 people.

This dataset provides a glimpse into the demographics and circumstances of the passengers during this historical event. While it does not come close to accounting for all of the passengers on the Titanic, this dataset contains cleaned information about the passengers with the fewest gaps in what we know about them, allowing for smoother use in a project.

The original source for this dataset can be found here, providing more insight into the description and what others have done with it. The data is also available on GitHub here.

There are 12 columns and 891 rows of data for us to work with. These columns include:

Survived - whether or not the passenger survived (0 = no, 1 = yes)
Pclass - class of the passenger based on ticket
Name - name of the passenger
Sex - recorded gender of the passenger
Age - age in years of the passenger

There are more columns going into more detail about family, ticket, cabin, etc. of each passenger, but this project does not go into detail on those. You are welcome to continue working with this dataset and explore it however you would like. Feel free to include any interesting insights you find in this project notebook as well.

Questions

Question 1 (2 points)

Vectors are fundamental data structures and are used frequently for storing and manipulating data. A vector contains an ordered collection of values, all of the same type.

A vector holds a single data type (numeric, character, logical, …). If you try to combine elements of different types, R will usually convert them all to a common type; some operations will fail instead.

Example:

A numeric vector would contain only numerical values
my_num <- c(1, 2, 3, 4, 5, 6, 7)
Running my_num outputs: 1 2 3 4 5 6 7

A character vector would usually contain only string values
my_str <- c("This", "is", "my", "vector")
Running my_str outputs: 'This' 'is' 'my' 'vector'

Combining the numeric vector my_num and the character vector my_str does not work well
c(my_num, my_str)
Output: '1' '2' '3' '4' '5' '6' '7' 'This' 'is' 'my' 'vector'

The numeric data gets converted to strings to make all of the values in the vector of one type.
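As a quick check of this conversion rule, here is a small sketch using made-up vectors (nums, words, and flags_and_nums are hypothetical names, not part of the project). It shows that class() reports the common type after combining, and that converting back with as.numeric() only recovers the values that looked like numbers.

```r
# Hypothetical toy vectors, not taken from the dataset
nums  <- c(1, 2, 3)
words <- c("a", "b")

# Combining numeric and character values: everything becomes character
mixed <- c(nums, words)
class(mixed)

# Logical values are converted to numbers (TRUE = 1, FALSE = 0) when mixed with numbers
flags_and_nums <- c(TRUE, FALSE, 2)
class(flags_and_nums)

# Converting back: number-like strings are recovered, other strings become NA
# (R gives a warning here, which we silence for the example)
back <- suppressWarnings(as.numeric(mixed))
back
```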

Read in the titanic dataset using one of the following three methods:

1 - Download the titanic.csv file from GitHub (found here), upload it to your JupyterLab files, and read it into your notebook using read.csv and your personal file path. Use getwd() to find your working directory (the file path to your open notebook), and modify it slightly to read in the dataset.

2 - If you do not want to have to download anything, copy and paste the URL of the raw format of this dataset into your read.csv statement and continue from there.

3 - You can get the data directly from our Anvil directory. We added the data to: '/anvil/projects/tdm/data/titanic/titanic.csv'

With the titanic data read in as myDF, look at the str() (structure) and summary() of the dataset. This shows the type of data within each column.
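To see what str() and summary() report without needing the file, here is a minimal sketch that reads a few made-up rows with read.csv. The rows, the passenger names, and the name toy_df are assumptions for illustration only; with the real data you would pass a file path or URL to read.csv instead of the text argument.

```r
# Made-up rows shaped like a few of the titanic columns (not real data)
csv_text <- "Survived,Pclass,Name,Sex,Age
1,1,Passenger A,female,30
0,3,Passenger B,male,
0,2,Passenger C,male,14.5"

# read.csv() can read from a string through its text argument
toy_df <- read.csv(text = csv_text)

str(toy_df)      # one line per column: its type and the first few values
summary(toy_df)  # per-column summaries, including a count of NA values
dim(toy_df)      # number of rows, then number of columns
```

Notice that the empty Age field in the second row is read in as NA.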

Create two vectors: my_names and my_ages, containing the Name and Age columns, respectively, of myDF.

Combine my_names and my_ages to create a new vector.

Before you run this vector of mixed types, think about it. Running this will output all of the entries of the rows from the first column AND the second column. The output should contain 1782 values, one column after the other. (This isn’t great to work with when there are this many entries, but we will not have to work with this combined vector beyond this.)

As an alternative to creating a vector where the data has been converted to one type, you could create a two-column dataframe containing just the columns you have selected. This isn’t a vector and would serve a different purpose, but it is another way to pull some data and create a subset.

Look at the dimensions of this new dataframe to confirm you have 891 rows, 2 columns.
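The contrast between the two approaches can be sketched on a tiny made-up dataframe (toy_df, combined_vec, and small_df are hypothetical names, and the rows are invented). This is only an illustration of the idea, assuming character columns are not converted to factors.

```r
# Tiny made-up dataframe (not real data)
toy_df <- data.frame(
  Name = c("Passenger A", "Passenger B", "Passenger C"),
  Age  = c(30, NA, 14.5),
  Sex  = c("female", "male", "male"),
  stringsAsFactors = FALSE
)

# One long vector: the names first, then the ages, all converted to character
combined_vec <- c(toy_df$Name, toy_df$Age)
length(combined_vec)
class(combined_vec)

# A two-column dataframe: each column keeps its own class
small_df <- toy_df[, c("Name", "Age")]
dim(small_df)
sapply(small_df, class)
```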

Deliverables

1.1 Vector containing all rows for the variables name and age converted to the character class.
1.2 Smaller dataframe containing two columns with their original classes
1.3 3-5 sentences on how vectors differ from dataframes in R

Question 2 (2 points)

It is easy to learn a lot about data just by looking at how it is made up. Check out the length, class, type (typeof()), and the 45th entry of both my_names and my_ages.

For my_names, the class and the type are the same ("character"), while my_ages has class "numeric" and type "double". The difference here is that class() is how R treats the data when it is worked with, and typeof() is how the data is stored.

In R, you can quickly explore what a function does and what arguments it takes by using '?'. For example, to view the documentation for 'class()', you can type:

?class

my_ages is of type "double". Data can be of the class numeric. When this is the case, it is stored as either a "double" (decimal) or "integer" (whole number) internally.
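Here is a short sketch of how class() and typeof() can agree or differ, using three made-up vectors (whole, decimals, and words are hypothetical names). The L suffix is how R writes an integer literal.

```r
whole    <- c(1L, 2L, 3L)   # the L suffix makes these integers
decimals <- c(1, 2.5, 3)    # plain numbers are stored as doubles
words    <- c("a", "b")

class(whole);    typeof(whole)      # "integer" and "integer"
class(decimals); typeof(decimals)   # "numeric" and "double"
class(words);    typeof(words)      # "character" and "character"

# Length and a single entry, pulled by position
length(decimals)
decimals[2]
```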

Looking at a specific entry with my_ages[45], we’re pulling a single value. R prints it in the simplest format for us to read: 19 instead of 19.00, because this is a whole number and does not need the decimal places. But it is still stored as a "double". Why?

If you use print() to display all of the values of my_ages, even whole numbers like 19 will be displayed with decimal places.
When we look at more entries from this column, such as all ages listed as less than 20 years old, there are entries such as 14.5 that show there are, indeed, decimal values in this column. Because a vector holds only one type, every value in this column is stored as a double, and for good reason: the values did not have to be rounded just to be stored a certain way.
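The printing behavior can be seen on a two-value example (one_age and mixed_ages are hypothetical names). A single whole number prints without decimals, but once a decimal value shares the vector, R prints every entry with the same number of decimal places.

```r
one_age    <- 19
mixed_ages <- c(19, 14.5)

print(one_age)      # [1] 19
print(mixed_ages)   # [1] 19.0 14.5

# Both are stored the same way, no matter how they are printed
typeof(one_age)
typeof(mixed_ages)
```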

It is often helpful not to have NA values in the way and taking up space when we are trying to look through data. Include !is.na() in your condition so the NA values are not shown in this search for ages.

Leaving out the NAs will greatly shorten the list, so that it only shows the entries where an actual value is listed as the person’s age.

In a similar format, we can find the names that go along with each of the "under 20" ages.

Save the non-NA values of my_ages that are under 20 as my_selection.

Use paste() to bring together my_ages and my_names filtered by my_selection. Save this as age_names and print it.
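The general pattern of filtering out NA values and pasting two vectors together can be sketched on made-up data. The names toy_names, toy_ages, keep, and labels are hypothetical, and storing the filter as a logical vector is just one way to do it (an assumption, not the required approach).

```r
toy_names <- c("Passenger A", "Passenger B", "Passenger C", "Passenger D")
toy_ages  <- c(30, NA, 14.5, 19)

# A comparison with NA gives NA, so the NA entry sneaks into the result
toy_ages[toy_ages < 20]

# Adding !is.na() makes the condition FALSE wherever the age is missing
keep <- !is.na(toy_ages) & toy_ages < 20
toy_ages[keep]

# The same logical vector can filter both vectors, so the entries stay lined up
labels <- paste(toy_names[keep], toy_ages[keep], sep = ": ")
print(labels)
```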

Deliverables

2.1 How are typeof(), class(), and mode() different, and which do you prefer?
2.2 2-3 sentences on how else we could’ve dealt with the NA values
2.3 Names and ages of passengers under 20 years old

Question 3 (2 points)

Looking at the Survived column of the titanic dataset, we can see that it contains binary values for each person’s life status:

0 = Dead/did not survive
1 = Alive/did survive

We’re going to create a column containing the life status and sex of each person. This allows for future data analysis when looking at the counts of what sort of people survived, and so on.

But first it would be helpful to convert these values from 0/1 to Dead/Alive. This column is numeric data, and we want to make a new column containing labels for each value to make it easier to understand.

The factor() function takes the original vector, often numeric or character (the Survived column in this case), splits it up based on its unique values, and applies a label to each.

For example, if we wanted to split the Pclass (passenger class) column and add labels, we could factor() the Pclass based on each of the three choices (1, 2, 3), relabel them (First Class, Second Class, Third Class), and save this as a new column. Using the basic structure:

myDF$Passenger_Class <- factor(myDF$Pclass, levels = c( , , ), labels = c("", "", ""))

Make sure to fill out this code with the class values and the label you want each to have.

The class() of Passenger_Class is "factor", but the typeof() remains "integer" as with Pclass. This column still contains representation for those 1, 2, 3 values, just with the class labels for each numerical value.
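To see the class() and typeof() behavior of a factor on its own, here is a sketch with an unrelated, made-up example (size_code and size are hypothetical names, and the shirt-size labels are invented for illustration).

```r
# Made-up codes: 1, 2, 3 stand for three shirt sizes
size_code <- c(2L, 1L, 3L, 2L)

size <- factor(size_code,
               levels = c(1, 2, 3),
               labels = c("Small", "Medium", "Large"))

size               # prints the labels instead of the codes
class(size)        # "factor"
typeof(size)       # "integer": the codes are still stored underneath
as.integer(size)   # recovers the underlying codes
table(size)        # counts per label, in level order
```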

Returning to the Survived column, use factor() to create a new column Status containing "Dead" and "Alive" labels on the values.

Now we get to combine this new Status column with the Sex column to create Combined. paste() makes this easy and fairly painless. The separator between the two values in each row is totally up to you. Some commonly used ones are:

", " → Alive, female
" " → Alive female
" - " → Alive - female
" | " → Alive | female

Anything could be used, but these are what you would commonly see, and are often used to separate words.
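The sep argument of paste() is what controls the separator. A minimal sketch on two made-up entries (status, sex, and status_f are hypothetical names) shows a few options, and also that paste() uses the labels of a factor rather than its underlying codes.

```r
status <- c("Alive", "Dead")
sex    <- c("female", "male")

paste(status, sex, sep = ", ")   # "Alive, female" "Dead, male"
paste(status, sex)               # the default separator is a single space
paste(status, sex, sep = " | ")

# With a factor, paste() works with the labels
status_f <- factor(c(1, 0), levels = c(0, 1), labels = c("Dead", "Alive"))
paste(status_f, sex, sep = " - ")
```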

Make sure to view some of this column to ensure everything looks how you would like it to. When you look at the table of this Combined column, check that it contains all four possible combinations.

Often it is helpful (or even fun) to make a visual to go along with findings from a table. Please use a barplot() to show the values from this table visually, and customize as you would like. There should be a bar for each category:

Alive, female
Alive, male
Dead, female
Dead, male

You can sort the data, add different colors, rotate the plot, rotate the labels, etc. Just make sure there are axis labels and a title that make sense for what you are showing.
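As a sketch of the table-to-barplot workflow, the example below uses six made-up entries (combined, counts, and counts_sorted are hypothetical names, and the colors and labels are arbitrary choices). It is one possible layout, not the expected answer.

```r
# Six made-up entries (not real data)
combined <- c("Alive, female", "Dead, male", "Dead, male",
              "Alive, male", "Dead, female", "Alive, female")

counts <- table(combined)                        # counts per unique value
counts_sorted <- sort(counts, decreasing = TRUE) # largest bar first

barplot(counts_sorted,
        main = "Made-up counts by status and sex",
        xlab = "Status and sex",
        ylab = "Number of people",
        col = "steelblue",
        cex.names = 0.8)   # shrink the category labels so they fit
```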

Deliverables

3.1 Table of the Passenger_Class column
3.2 Plot of the Combined column
3.3 What is another column combination you think would be insightful? Why?

Question 4 (2 points)

The end goal of these last two questions is to create a plot that provides some insight into how the age and sex of a person relate to whether or not they survived, and how common each of these outcomes is.

We’re going to be working with the Age column now. The ages of the passengers range from 0.42 to 80 (make sure to find this yourself!), with many values in between.

Vectors are often shown in examples that take numbers like 1, 2, 3, 4 and combine them to show how a vector can be created. These don’t really help with understanding how or why this would be done in the real world, especially when working with big datasets.

With the titanic data in myDF, we’re going to look at the table of the Age column and choose a cut-off for what counts as "old". This will include the ages 61 - 80.

Create a vector old_ages containing the ages 61, 62, 63, 64, 65, 66, 70, 70.5, 71, 74, and 80.

The values in the old_ages vector match those in the upper values of the Age column, but are not directly tied to the dataset just yet. Printing this vector just shows the numbers 61, 62, 63, 64, 65, 66, 70, 70.5, 71, 74, and 80, with no counts related to any of them.

To find the values from the Age column that match those in this old_ages, use %in% to show which elements of the Age column are found in this vector. Name this result old.

old <- myDF$Age[myDF$Age %in% old_ages]
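To see what %in% returns before it is used for subsetting, here is a sketch on a short made-up vector (ages, targets, and matched are hypothetical names). One useful detail is that %in% gives FALSE for NA, while a comparison like > gives NA.

```r
ages    <- c(22, 61, NA, 70.5, 35, 80)
targets <- c(61, 70.5, 80)

# One TRUE/FALSE per element of ages; the NA entry gives FALSE
ages %in% targets

# Using that logical vector to subset keeps only the matching values
matched <- ages[ages %in% targets]
matched

# A comparison keeps the NA instead
ages[ages > 60]
```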

As fun as it was to have a long section of text saying 61, 62, 63, 64, 65, 66, 70, 70.5, 71, 74, and 80, it is often simpler to use a range when selecting values like this.

What we’re about to do is completely inefficient, but it gives us some practice working with and manipulating vectors.

First, create a vector my_vec1 containing all the values from the Age column that are less than or equal to 10, and print it. Put that aside, and create a new vector my_vec2 containing all the values from the Age column that are greater than 10 and less than or equal to 60.

Remember to remove NA values!

It is important to notice that there were no decimal values between 60 and 61. my_vec2 contains 60, and old contains 61. If there had been an in-between, we would’ve needed to do this differently.

With my_vec1 and my_vec2, combine them to make the vector people.
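The idea of selecting by range, dropping NA values, and combining the pieces with c() can be sketched on a short made-up vector (ages, low, mid, and both are hypothetical names, and the cut-offs here simply mirror the ones described above).

```r
ages <- c(4, 10, NA, 35, 60, 61, 0.42)

# Each condition is combined with !is.na() so missing values are dropped
low <- ages[!is.na(ages) & ages <= 10]
mid <- ages[!is.na(ages) & ages > 10 & ages <= 60]

low
mid

# c() joins the two vectors end to end
both <- c(low, mid)
both
length(both)
```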

This question could have been simpler without making my_vec1 and my_vec2, but now you have seen how to split data like this and put it back together.

From people, we need to create three more vectors:

children: range of ages <= 20
young: range of ages > 20 & <= 40
adult: range of ages > 40 & <= 60

Looking at the table of each range will confirm that they do contain the correct values.

To add labels to the Age column that correspond to these different ranges, use %in% again to filter each entry of the Age column and assign each value a name. Save this as AgeGroup rather than overwriting the Age column.
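One way to assign labels with %in% is to start with an empty label vector and fill it in one group at a time. The sketch below uses made-up scores rather than ages (score, low, high, and label are hypothetical names); values that match no group, including NA, are left unlabeled.

```r
score <- c(3, 8, 5, NA, 9)

# Made-up groups of values
low  <- c(1, 2, 3, 4)
high <- c(8, 9, 10)

# Start with all labels missing, then fill in each group
label <- rep(NA_character_, length(score))
label[score %in% low]  <- "low"
label[score %in% high] <- "high"

label
table(label, useNA = "ifany")   # counts per label, including unlabeled entries
```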

Deliverables

4.1 Vectors children, young, adult, and old containing the values from Age
4.2 Why does it matter that we have split up the ages by these labels for each range?
4.3 What ranges would you have preferred to split the Age column by? Why?

Question 5 (2 points)

It is often useful to create a visualization to help better understand comparisons. In the AgeGroup column, we have different counts of occurrences of each age range from the Age column.

In the Combined column, there are different counts of each survived Status (dead or alive) sorted by Sex. Having these values is great, but we haven’t done any analysis with them yet.

Going back to that plot we wanted to make since the start of Question 4, we are going to use AgeGroup and Combined to make it.

To start, make one table showing both the AgeGroup and Combined together. You usually make a table showing the counts for each unique value in one column. Here, it is useful to show the unique values from one column, and then those from another, and cross them to get the counts from each pairing.
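Crossing two columns works by passing both to table(). Here is a minimal sketch on five made-up entries (age_band, result, and tab are hypothetical names): the first argument becomes the rows and the second becomes the columns.

```r
age_band <- c("young", "old", "young", "young", "old")
result   <- c("yes", "no", "no", "yes", "no")

# Rows come from the first vector, columns from the second
tab <- table(age_band, result)
tab

# A single pairing can be looked up by its row and column names
tab["young", "yes"]
dim(tab)
```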

This will give us some insight about what sorts of people survived (or didn’t). This is not a huge dataset, so the values can be on the smaller side, but this does make it easier to fully grasp what the changes in values are.

Comparing 3 vs 70 is generally simpler than comparing 10,000 vs 11,000.

Save the table as a new variable, and plot it.

There are some choices when it comes to this:

Mosaicplot: the sizes of the boxes on the plot are directly related to the ratios of the values in the table. This helps to create a visual sense of "this is much bigger than that".
Barplot: use 'beside=TRUE' to view the plotted values alongside each other per category to directly compare their values against those in their group as well as across the entire plot. The colors represent each age group.
Heatmap: plot each column on an axis and have the intensity of the color be representative of the value on the table.

You are welcome to use any of these plotting methods (make two plots), and can bring in your own as you see fit. Make sure to include a legend when the color values are important to reading the plot. Also, the values from Combined are rather long, so you may have to adjust the margin values.
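As a sketch of two of these plot types, the example below plots a small made-up two-way table (age_band, result, and tab are hypothetical names; the colors and margin sizes are arbitrary). par(mar = ...) sets the bottom, left, top, and right margins, and saving its return value lets us restore the old settings afterwards.

```r
age_band <- c("young", "old", "young", "young", "old")
result   <- c("yes", "no", "no", "yes", "no")
tab <- table(age_band, result)

# Widen the bottom margin to leave room for long, rotated labels
old_par <- par(mar = c(8, 4, 4, 2))

# beside = TRUE draws one bar per row of the table within each column group
barplot(tab,
        beside = TRUE,
        col = c("steelblue", "orange"),
        legend.text = rownames(tab),
        las = 2,
        main = "Made-up counts by group",
        ylab = "Count")

par(old_par)   # restore the previous margins

# Box sizes in a mosaic plot follow the proportions in the table
mosaicplot(tab, main = "Made-up counts by group", color = TRUE)
```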

Deliverables

5.1 First plot method from using the table of AgeGroup and Combined
5.2 Second plot method from using the table of AgeGroup and Combined
5.3 Share some findings (2-4 sentences) about what you have found from these plots

Submitting your Work

Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.

Items to submit

firstname_lastname_project4.ipynb

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not. Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.