TDM 10100: Project 4 - Data Types
Project Objectives
Motivation: Learning about data types and how to work with them is a useful skill for anyone working with data.
Context: We will learn how to begin working with data from the web, and will begin learning about vectors and subsetting data.
Scope: R, data types, vectors, plots
Dataset
- /anvil/projects/tdm/data/titanic/titanic.csv
The Titanic struck an iceberg on April 14, 1912, and sank on April 15. This was one of the deadliest large-scale catastrophes involving ships during a time of peace, resulting in the deaths of up to 1,635 people.
This dataset provides a glimpse into the demographics and circumstances of the passengers during this historical event. While it is far from accounting for all of the passengers on the Titanic, it contains cleaned information about the passengers with the fewest gaps in what we know about them, allowing for smoother use in a project.
The original source for this dataset can be found here, providing more insight into the description and what others have done with it. The data is also available on GitHub here.
There are 12 columns and 891 rows of data for us to work with. These columns include:
- Survived - whether or not the passenger survived (0 = no, 1 = yes)
- Pclass - class of the passenger based on ticket
- Name - name of the passenger
- Sex - recorded gender of the passenger
- Age - age in years of the passenger
There are more columns going into more detail about family, ticket, cabin, etc. of each passenger, but this project does not go into detail on those. You are welcome to continue working with this dataset and explore it however you would like. Feel free to include any interesting insights you find in this project notebook as well.
Questions
Question 1 (2 points)
Vectors are fundamental data structures and are used frequently for storing and manipulating data. A vector contains an ordered collection of values, all of the same type of data.
A vector can contain only a single data type (numeric, character, logical, …). If you try to add elements of different types, R will try to convert them all to a common type, or the operation will fail.
Example:
A numeric vector would contain only numerical values
my_num <- c(1, 2, 3, 4, 5, 6, 7)
Running my_num outputs:
1 2 3 4 5 6 7
A character vector would contain only string values
my_str <- c("This", "is", "a", "string")
Running my_str outputs:
'This' 'is' 'a' 'string'
Combining the numeric vector my_num and the character vector my_str does not keep the numbers numeric:
c(my_num, my_str)
Output:
'1' '2' '3' '4' '5' '6' '7' 'This' 'is' 'a' 'string'
The numeric data gets converted to strings to make all of the values in the vector of one type.
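To see the conversion for yourself, here is a minimal sketch using toy vectors in the spirit of the example above. The names combined and recovered are made up for illustration.

```r
# Toy vectors, in the spirit of the example above
my_num <- c(1, 2, 3, 4, 5, 6, 7)
my_str <- c("This", "is", "a", "string")

# c() coerces everything to the most flexible type, here character
combined <- c(my_num, my_str)
class(my_num)     # "numeric"
class(combined)   # "character"

# The first seven entries are now strings such as "1", not numbers.
# as.numeric() converts them back when they still look like numbers.
recovered <- as.numeric(combined[1:7])
sum(recovered)    # 28
```

Checking class() before and after a combination is a quick way to catch an unintended conversion.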
Read in the titanic dataset using one of the following three methods:
1 - Download the titanic.csv file from GitHub (found here), upload it to your Jupyterlab files, and read it into your notebook using read.csv and your personal file path. Use getwd() to find your working directory / file path to your open notebook. Modify it slightly to read in the dataset.
2 - If you do not want to have to download anything, copy and paste the URL of the raw format of this dataset into your read.csv statement and continue from there.
3 - You can get the data directly from our Anvil directory. We added the data to: '/anvil/projects/tdm/data/titanic/titanic.csv'
With the titanic data read in as myDF, look at the str() (structure) and summary() of the dataset. This shows the type of data within each column.
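If you would like to see what str() and summary() report before touching the real file, here is a small sketch. The three-row CSV text and the name toyDF are invented for illustration and are not the real titanic data. For the real file you would pass a file path or URL to read.csv instead of text =.

```r
# Hypothetical three-row CSV, invented for illustration only
csv_text <- "Survived,Pclass,Name,Sex,Age
0,3,Passenger A,male,22
1,1,Passenger B,female,38
1,3,Passenger C,female,NA"

# read.csv() can parse a string through its text argument
toyDF <- read.csv(text = csv_text)

str(toyDF)      # one line per column, with the type of each column
summary(toyDF)  # quick summary of each column, including NA counts
dim(toyDF)      # 3 rows, 5 columns
```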
Create two vectors, my_names and my_ages, containing the Name and Age columns, respectively, of myDF.
Combine my_names and my_ages to create a new vector.
Before you run this vector of mixed types, think about it. Running this will output all of the entries from the first column AND the second column. The output should contain 1,782 values: the 891 names followed by the 891 ages, all converted to character.
As an alternative to creating a vector where the data has been converted to one type, you could create a two-column dataframe containing just the columns you have selected. This isn’t a vector and serves a different purpose, but it is another way to pull some data and create a subset.
Look at the dimensions of this new dataframe to confirm you have 891 rows, 2 columns.
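The difference between the two approaches can be sketched on a tiny made-up data frame. Everything here (toyDF, the passenger names, the Fare values) is hypothetical and only stands in for the real data.

```r
# Hypothetical toy data frame, not the real titanic data
toyDF <- data.frame(
  Name = c("Passenger A", "Passenger B", "Passenger C"),
  Age  = c(22, 38, 0.75),
  Fare = c(7.25, 71.28, 8.05),
  stringsAsFactors = FALSE
)

# Approach 1: one long vector, everything coerced to character
mixed <- c(toyDF$Name, toyDF$Age)
class(mixed)    # "character"
length(mixed)   # 6 (3 names followed by 3 ages)

# Approach 2: a two-column data frame, each column keeps its own class
two_cols <- toyDF[, c("Name", "Age")]
dim(two_cols)           # 3 rows, 2 columns
sapply(two_cols, class) # Name is character, Age is numeric
```

The vector loses the numeric type and the row pairing, while the data frame keeps both.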
1.1 Vector containing titanic data all converted to the character class
1.2 Smaller dataframe containing two columns with their original classes
1.3 3-5 sentences on how vectors differ from dataframes in R
Question 2 (2 points)
It is easy to learn a lot about data just by looking at how it is made up. Check out the length, class, type, and 45th entry of both my_names and my_ages.
In R, you can quickly explore what a function does and what arguments it takes by using '?'. For example, to view the documentation for 'class()', you can type: ?class
my_ages is of type "double". Data can be of the class numeric. When this is the case, it is stored as either a "double" (decimal) or "integer" (whole number) internally.
Looking at the specific entry with my_ages[45], we’re pulling a single value. R prints it in the simplest format for us to read, 19 instead of 19.00, because this is a whole number and does not need the decimal places. But this is still stored as a "double". Why?
If you use print() to display all of the values of my_ages, even whole numbers like 19 will be displayed with a decimal place.
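Here is a small sketch of these checks on a made-up vector. The name toy_ages and its values are hypothetical.

```r
# Hypothetical ages, including one decimal value and one NA
toy_ages <- c(19, 14.5, NA, 32)

length(toy_ages)   # 4
class(toy_ages)    # "numeric"
typeof(toy_ages)   # "double"
mode(toy_ages)     # "numeric"

# A single whole-number entry prints without decimals...
toy_ages[1]
# ...but it is still stored as a double
typeof(toy_ages[1])  # "double"

# An integer has to be requested explicitly, for example with the L suffix
typeof(19L)          # "integer"

# Printing the whole vector formats every entry with the same number of
# decimal places, so 19 and 32 pick up a decimal from 14.5
print(toy_ages)
```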
When we look at more entries from this column, such as all ages under 20, there are entries such as 14.5 that show there are, indeed, decimal values in this column. A vector holds only one type, so every value in the column is stored as a double, and for good reason: these values were not rounded just to be stored a certain way.
It is often helpful to keep NA values out of the way when we are trying to look through data. Include !is.na() in the condition so the NA values are not shown in this search for ages.
Leaving out the NAs greatly shortens the list, showing only the entries where an actual value is listed as the person’s age.
In a similar format, we can find the names that go along with each of the "under 20" ages.
Save the non-NA values of my_ages that are under 20 as my_selection.
Use paste() to bring together my_ages and my_names filtered by my_selection. Save this as age_names and print() it.
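One possible way to read these steps, sketched on made-up vectors. Treating the selection as a logical index (keep below) is an assumption, and all names and values here are hypothetical.

```r
# Hypothetical names and ages, kept in matching order
toy_names <- c("Passenger A", "Passenger B", "Passenger C", "Passenger D")
toy_ages  <- c(19, 14.5, NA, 32)

# Without an NA check, the NA entry comes along in the result
toy_ages[toy_ages < 20]

# A logical index that is TRUE only for non-NA ages under 20
keep <- !is.na(toy_ages) & toy_ages < 20
toy_ages[keep]

# The same index lines up the names with their ages
age_labels <- paste(toy_names[keep], toy_ages[keep], sep = ": ")
print(age_labels)
```

Because the same index is applied to both vectors, each name stays paired with the correct age.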
2.1 How are typeof(), class(), and mode() different, and which do you prefer?
2.2 2-3 sentences on how else we could’ve dealt with the NA values
2.3 Names and ages of passengers under 20 years old
Question 3 (2 points)
Looking at the Survived column of the titanic dataset, we can see that it contains binary values for each person’s life status:
- 0 = Dead/did not survive
- 1 = Alive/did survive
We’re going to create a column containing the life status and sex of each person. This allows for future data analysis when looking at the counts of what sort of people survived, and so on.
But first it would be helpful to convert these values from 0/1 to Dead/Alive. This column is numeric data, and we want to make a new column containing labels for each value to make it easier to understand.
The factor() function takes the original vector - often numeric or character (the Survived column in this case) - splits it up based on the unique values, and applies a label to each.
For example, if we wanted to split the Pclass (passenger class) column and add labels, we could factor() the Pclass based on each of the three choices (1, 2, 3), relabel them (First Class, Second Class, Third Class), and save this as a new column. Using the basic structure:
myDF$Passenger_Class <- factor(myDF$Pclass, levels = c( , , ), labels = c("", "", ""))
Make sure to fill out this code with the class values and the label you want each to have.
The class() of Passenger_Class is "factor", but the typeof() remains "integer" as with Pclass. This column still stores the values 1, 2, 3 internally, just with a class label attached to each numerical value.
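The same pattern can be sketched on an unrelated, made-up vector. The deck codes and labels below are hypothetical and are not columns in the titanic data.

```r
# Hypothetical integer codes and labels, invented for illustration
deck_code <- c(2L, 1L, 3L, 1L, 2L)

# levels lists the values to look for, labels gives each one a name
deck <- factor(deck_code,
               levels = c(1, 2, 3),
               labels = c("Upper", "Middle", "Lower"))

deck          # Middle Upper Lower Upper Middle
class(deck)   # "factor"
typeof(deck)  # "integer"
levels(deck)  # "Upper" "Middle" "Lower"
table(deck)   # Upper 2, Middle 2, Lower 1
```

The order of labels must match the order of levels, since they are paired by position.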
Returning to the Survived column, use factor() to create a new column Status containing "Dead" and "Alive" labels on the values.
Now we get to combine this new Status column with the Sex column to create Combined. paste() makes this easy and fairly painless. The separator between the two values in each row is up to you. Some commonly used ones are:
- ", " → Alive, female
- " " → Alive female
- " - " → Alive - female
- " | " → Alive | female
Anything could be used, but these are what you would commonly see, and are often used to separate words.
Make sure to view some of this column to ensure everything looks how you would like it to. When you look at the table of this Combined column, check that it contains all four possible combinations.
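As a sketch of why that check matters, here is paste() and table() on made-up vectors. The names answer and group and their values are hypothetical.

```r
# Hypothetical factor and character vector of the same length
answer <- factor(c(0, 1, 1, 0, 1), levels = c(0, 1), labels = c("No", "Yes"))
group  <- c("red", "blue", "red", "red", "blue")

# paste() works element by element and uses the factor labels
combined <- paste(answer, group, sep = ", ")
head(combined)

# table() counts each unique pairing
combo_counts <- table(combined)
combo_counts
length(combo_counts)  # 3, not 4: "No, blue" never occurs in this toy data
```

If a combination is missing from the table, either it truly never occurs or something went wrong when building the column.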
Often it is helpful (or even fun) to make a visual to go along with findings from a table. Please use a barplot() to show the values from this table visually, and customize as you would like. There should be a bar for each category:
- Alive, female
- Alive, male
- Dead, female
- Dead, male
You can sort the data, add different colors, rotate the plot, rotate the labels, etc. Just make sure there are axis labels and a title that make sense to what you are showing.
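A few of these options can be sketched with made-up counts. The counts, category names, colors, and labels below are all hypothetical.

```r
# Hypothetical counts, named by category
counts <- c("No, red" = 2, "Yes, blue" = 2, "Yes, red" = 1)

# Sort so the largest bars come first
counts <- sort(counts, decreasing = TRUE)

# barplot() returns the x positions of the bar midpoints
mids <- barplot(counts,
                main = "Toy counts per category",
                xlab = "Category",
                ylab = "Count",
                col  = c("steelblue", "tomato", "gray60"),
                las  = 1)  # las controls the rotation of the axis labels
```

Setting horiz = TRUE in barplot() rotates the whole plot, which can help when category names are long.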
3.1 Table of the Passenger_Class column
3.2 Plot of the Combined column
3.3 What is another column combination you think would be insightful? Why?
Question 4 (2 points)
The end goal of these last two questions is to create a plot that could provide some insight to how the age and sex of a person relate to whether or not they survived, and how common each of these occurrences are.
We’re going to be working with the Age column now. The ages of the passengers range from 0.42 to 80 (make sure to find this yourself!), with many values in between.
Examples often build vectors from numbers like 1, 2, 3, 4 to show how a vector can be created. These don’t really show how or why this would be done in the real world, especially when working with big datasets.
With the titanic data in myDF, we’re going to look at the table of the Age column and choose a cut-off for what counts as "old". This will include the ages 61 - 80.
Create a vector old_ages containing the ages 61, 62, 63, 64, 65, 66, 70, 70.5, 71, 74, and 80.
The values in the old_ages vector match those in the upper values of the Age column, but are not directly tied to the dataset just yet. Printing this vector just shows the numbers 61, 62, 63, 64, 65, 66, 70, 70.5, 71, 74, and 80, with no counts related to any of them.
To find the values from the Age column that match those in old_ages, use %in% to show which elements of the Age column are found in this vector. Name this result old.
old <- myDF$Age[myDF$Age %in% old_ages]
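How %in% behaves can be sketched on a short made-up vector. The names toy_age, old_toy, and matched are hypothetical.

```r
# Hypothetical ages, including an NA and a repeated value
toy_age <- c(22, 61, NA, 70.5, 35, 80, 61)
old_toy <- c(61, 70.5, 80)

# %in% returns TRUE or FALSE for every element, never NA
toy_age %in% old_toy   # FALSE TRUE FALSE TRUE FALSE TRUE TRUE

# Using that result as an index keeps only the matching values
matched <- toy_age[toy_age %in% old_toy]
matched                # 61 70.5 80 61

# table() now gives a count for each matched value
table(matched)
```

Since %in% returns FALSE for NA, no separate is.na() check is needed in this case.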
As fun as it was to have a long section of text saying 61, 62, 63, 64, 65, 66, 70, 70.5, 71, 74, and 80, it is often simpler to use a range when selecting values like this.
What we’re about to do is completely inefficient, but it gives us some practice working with and manipulating vectors.
First, create a vector my_vec1 containing all the values from the Age column that are less than or equal to 10 and print it. Put that aside, and create a new vector my_vec2 containing all the values from the Age column that are greater than 10 and less than or equal to 60.
Remember to remove NA values!
It is important to notice that there were no decimal values between 60 and 61.
Combine my_vec1 and my_vec2 to make the vector people.
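The range-and-combine pattern can be sketched on a made-up vector. All names and values here are hypothetical.

```r
# Hypothetical ages with one NA
toy_age <- c(4, 22, NA, 70.5, 35, 9, 60, 10.5)

# Values at or below 10, with NA removed
vec1 <- toy_age[!is.na(toy_age) & toy_age <= 10]                 # 4 9

# Values above 10 and up to (and including) 60, with NA removed
vec2 <- toy_age[!is.na(toy_age) & toy_age > 10 & toy_age <= 60]  # 22 35 60 10.5

# c() glues the two pieces together, first vec1 and then vec2
both <- c(vec1, vec2)

# The same values could be pulled with a single condition
one_step <- toy_age[!is.na(toy_age) & toy_age <= 60]
identical(sort(both), sort(one_step))  # TRUE
```

The two results hold the same values but in a different order, because c() places every element of vec1 before vec2.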
This question could have been simpler without making my_vec1 and my_vec2 first, since people could be built in a single step.
From people, we need to create three more vectors:
- children: range of ages <= 20
- young: range of ages > 20 & <= 40
- adult: range of ages > 40 & <= 60
Looking at the table of each range will confirm that they do contain the correct values.
To add labels to the Age column that correspond to these different ranges, use %in% again to filter each entry of the Age column and assign each value a name. Save this as AgeGroup rather than overwriting the Age column.
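One way this kind of labeling could look, sketched on a made-up data frame. The group vectors (small, medium, large) and the Group column are hypothetical stand-ins, not the vectors from this project.

```r
# Hypothetical data frame and value groups
toyDF  <- data.frame(Age = c(4, 22, NA, 70.5, 35, 60))
small  <- c(4)
medium <- c(22, 35)
large  <- c(60, 70.5)

# Start with an all-NA character column, then fill in one group at a time
toyDF$Group <- NA_character_
toyDF$Group[toyDF$Age %in% small]  <- "small"
toyDF$Group[toyDF$Age %in% medium] <- "medium"
toyDF$Group[toyDF$Age %in% large]  <- "large"

toyDF
# Rows with a missing Age match no group, so their label stays NA
table(toyDF$Group, useNA = "ifany")
```

Writing to a new column keeps the original numeric values available for later checks.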
4.1 Vectors children, young, adult, and old containing the values from Age
4.2 Why does it matter that we have split up the ages by these labels for each range?
4.3 What ranges would you have preferred to split the Age column by? Why?
Question 5 (2 points)
It is often useful to create a visualization to help better understand comparisons. In the AgeGroup column, we have different counts of occurrences of each age range from the Age column.
In the Combined column, there are different counts of each survived Status (dead or alive) sorted by Sex. Having these values is great, but we haven’t done any analysis with them yet.
Going back to that plot we wanted to make since the start of Question 4, we are going to use AgeGroup and Combined to make it.
To start, make one table showing both the AgeGroup and Combined together. You usually make a table showing the counts for each unique value in one column. Here, it is useful to show the unique values from one column, and then those from another, and cross them to get the counts from each pairing.
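Crossing two columns can be sketched with two short made-up vectors. The names group and outcome and their values are hypothetical.

```r
# Hypothetical vectors of the same length, one entry per person
group   <- c("child", "adult", "adult", "child", "adult")
outcome <- c("Yes", "No", "Yes", "Yes", "No")

# Passing two vectors to table() counts every pairing
cross <- table(group, outcome)
cross

# Rows come from the first vector, columns from the second
dim(cross)             # 2 rows, 2 columns
cross["adult", "No"]   # 2
cross["child", "No"]   # 0, a pairing that never occurs still gets a cell
```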
This will give us some insight about what sorts of people survived (or didn’t). This is not a huge dataset, so the values can be on the smaller side, but this does make it easier to fully grasp what the changes in values are.
Comparing 3 vs 70 is generally simpler than comparing 10,000 vs 11,000.
Save the table as a new variable, and plot it.
There are some choices when it comes to this:
- Mosaicplot: the sizes of the boxes on the plot are directly related to the ratios of the values in the table. This helps to create a visual sense of "this is much bigger than that".
- Barplot: use 'beside=TRUE' to view the plotted values alongside each other per category to directly compare their values against those in their group as well as across the entire plot. The colors represent each age group.
- Heatmap: plot each column on an axis and have the intensity of the color be representative of the value on the table.
You are welcome to use any or all plotting methods (make two plots), and can bring in your own as you see fit. Make sure to include a legend when the color values are important to reading the plot. Also, the values from Combined are rather long, so you may have to adjust the margin values some.
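Two of these plot types, along with a margin adjustment, can be sketched on a small made-up table. The data, titles, and margin sizes below are hypothetical choices.

```r
# Hypothetical two-way table
group   <- c("child", "adult", "adult", "child", "adult", "child")
outcome <- c("Yes", "No", "Yes", "Yes", "No", "No")
cross   <- table(group, outcome)

# Widen the bottom margin (bottom, left, top, right) for long labels,
# saving the old settings so they can be restored afterwards
old_par <- par(mar = c(8, 4, 4, 2))

# Grouped bars: one cluster per column, one colored bar per row,
# with the legend taken from the row names
mids <- barplot(cross,
                beside = TRUE,
                legend.text = TRUE,
                las = 2,
                main = "Toy counts by outcome and group",
                ylab = "Count")

# Mosaic plot: box sizes follow the proportions in the table
mosaicplot(cross, main = "Toy mosaic of group by outcome", color = TRUE)

# Restore the previous graphics settings
par(old_par)
```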
5.1 First plot method from using the table of AgeGroup and Combined
5.2 Second plot method from using the table of AgeGroup and Combined
5.3 Share some findings (2-4 sentences) about what you have found from these plots
Submitting your Work
Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.
- firstname_lastname_project[].ipynb
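You must double check your .ipynb after submitting it in Gradescope. You will not receive full credit if your .ipynb does not contain all of the information you expect it to, or if it does not render properly in Gradescope.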