TDM 10200: Project 4 - Data Types

Project Objectives

Motivation: Understanding data types and how to work with them is an essential skill for anyone working with data.

Context: We will learn how to work with data from the web and introduce vectors and data subsetting.

Scope: Python, data types, vectors, plots

Learning Objectives

Learn about lists in Python
Practice moving around data
Make interpretable outputs

Dataset

/anvil/projects/tdm/data/titanic/titanic.csv

Titanic Data

The Titanic struck an iceberg on April 14, 1912, and sank on April 15. This was one of the deadliest peacetime maritime disasters, resulting in the deaths of up to 1,635 people.

This dataset provides a glimpse into the demographics and circumstances of the passengers during this historical event. While it is far from a full account of everyone aboard the Titanic, it contains cleaned information about the passengers with the fewest gaps in their records, which makes it smoother to use in a project.

The original source for this dataset can be found here, with more detail about the data and what others have done with it. The data is also available on GitHub here.

There are 12 columns and 891 rows of data for us to work with. These columns include:

Survived - whether or not the passenger survived (0 = no, 1 = yes)
Pclass - class of the passenger based on ticket
Name - name of the passenger
Sex - recorded gender of the passenger
Age - age in years of the passenger

There are more columns going into more detail about family, ticket, cabin, etc. of each passenger, but this project does not go into detail on those. You are welcome to continue working with this dataset and explore it however you would like. Feel free to include any interesting insights you find in this project notebook as well.

If you use AI in any way (for debugging, research, etc.), we now require that you submit a link to the entire chat history. For example, if you used ChatGPT, there is a “Share” option in the conversation sidebar. Click on “Create Link” and add the shareable link as part of your citation.

The project template in the Examples Book now has a “Link to AI Chat History” section; please have this included in all your projects. If you did not use any AI tools, you may write “None”.

We allow using AI for learning purposes; however, all submitted materials (code, comments, and explanations) must be your own work and in your own words. No content or ideas should be directly applied or copied and pasted into your projects. Please refer to the-examples-book.com/projects/spring2026/syllabus#guidance-on-generative-ai. Failing to follow these guidelines is considered academic dishonesty.

Questions

Question 1 (2 points)

Lists are a kind of fundamental data structure, and are frequently used for storing and manipulating data. A list contains an ordered collection of values, which can be of any type. Data types can be mixed within the same list.

A vector (a close relative in R of the Python list) can contain only a single data type (numeric, string, logical, etc.). Elements of mixed types are converted to a common data type in a vector. This does not have to happen in lists.

For Example: A numeric list would contain numerical values
my_num = [1, 2, 3, 4, 5, 6, 7]
Running my_num outputs: [1, 2, 3, 4, 5, 6, 7]

A character list would contain string values
my_str = ["This", "is", "my", "list"]
Running my_str outputs: ['This', 'is', 'my', 'list']

Combining the numeric my_num list and the character my_str list puts these lists together, one after the other
combined = list(my_num) + list(my_str)
Output: [1, 2, 3, 4, 5, 6, 7, 'This', 'is', 'my', 'list']

The numeric values did not convert to character values here - this is good!
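One way to confirm this is to inspect the type of each element. The sketch below uses a shortened version of the lists above; the name element_types is only for illustration and is not part of the project.

```python
my_num = [1, 2, 3]
my_str = ["This", "is", "my", "list"]

# The + operator joins two lists end to end
combined = my_num + my_str

# Collect the type name of every element to show nothing was converted
element_types = [type(x).__name__ for x in combined]

print(combined)        # [1, 2, 3, 'This', 'is', 'my', 'list']
print(element_types)   # ['int', 'int', 'int', 'str', 'str', 'str', 'str']
```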

Read in the titanic dataset with pd.read_csv from our Anvil directory. The data is located at '/anvil/projects/tdm/data/titanic/titanic.csv'.

With the titanic data read in as myDF, use

.info() to learn about the columns
.describe() to learn about the values

Create two lists: my_names and my_ages, containing the Name and Age columns, respectively, of myDF. Combine my_names and my_ages to create a new list.

An alternative to creating a list where two long lists have been lined up, one after the other, is to create a small dataframe containing just the columns you have selected - in this case, the Name and Age columns.

Look at the dimensions of this new dataframe to confirm that there are 891 rows, 2 columns.
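The difference between the long combined list and the small dataframe can be sketched on a made-up three-row table. Everything prefixed with toy_ below is a hypothetical stand-in, not the real Titanic data, and the in-memory CSV text simply takes the place of the file on Anvil.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the real file (one Age is missing)
toy_csv = io.StringIO(
    "Name,Age,Sex\n"
    "Passenger A,22,male\n"
    "Passenger B,,female\n"
    "Passenger C,4,female\n"
)
toy_df = pd.read_csv(toy_csv)

# Turn each column into a plain list, then join the lists end to end
toy_names = list(toy_df["Name"])
toy_ages = list(toy_df["Age"])
long_list = toy_names + toy_ages

# Double brackets select several columns and return a dataframe
small_df = toy_df[["Name", "Age"]]

print(len(long_list))   # 6 entries: 3 names followed by 3 ages
print(small_df.shape)   # (3, 2): rows first, then columns
```

The long list loses the link between a name and its age, while the dataframe keeps each pair on the same row.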

Deliverables

1.1 Read in the titanic dataset as myDF.
1.2 A single list containing all of the entries from the lists of my_names and my_ages.
1.3 A smaller dataframe containing just the Name and Age columns from myDF.

Question 2 (2 points)

It is easy to learn a lot about data just by looking at how it is made up. Check out the shape, dtype, and 45th entry of both my_names and my_ages.

Remember that R starts row counting at 1, while Python starts at 0. So, please use my_names[44] and my_ages[44] to look at the 45th row of the dataset.

The outputs of my_names.dtype and myDF.info() show the Name column’s data type under two different names: object and O. These are the same thing, since O is the short form of object. This data type can store strings, lists, mixed types, and many other things in Python.

Find and show the ages of the people who were strictly less than 20 years old.

my_ages[my_ages < 20] works fine here. Any comparison with a missing (NaN) value evaluates to False, so the missing ages do not show up in this filtering.

In a similar fashion, use my_names[my_ages < 20] to find the names that go with the age entries that are less than 20.
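The behavior of missing values in a comparison can be checked on a tiny example. The two series below are hypothetical stand-ins for my_names and my_ages, assuming both are pandas Series that share the same index.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins; the second age is missing
toy_names = pd.Series(
    ["Passenger A", "Passenger B", "Passenger C", "Passenger D", "Passenger E"]
)
toy_ages = pd.Series([22.0, np.nan, 4.0, 19.0, 35.0])

# Comparing NaN with 20 gives False, so the missing age is filtered out
mask = toy_ages < 20

print(mask.tolist())              # [False, False, True, True, False]
print(toy_ages[mask].tolist())    # [4.0, 19.0]

# The same Boolean mask picks the matching rows of another series
print(toy_names[mask].tolist())   # ['Passenger C', 'Passenger D']
```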

Save the "under 20" condition as a Boolean mask named my_selection with my_selection = my_ages < 20.

For each of my_names and my_ages, take only the values where my_selection is True. Paste the filtered series of my_names and my_ages together, and save the result as my_result as follows:

my_result = my_names[my_selection] + ", age " + my_ages[my_selection].astype(str)

It is useful to convert everything here to strings, since the + operator cannot directly combine a string with a numeric value.
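A small sketch of why the conversion matters, using hypothetical two-element series in place of the filtered my_names and my_ages. The flag raised_type_error exists only to make the failure visible.

```python
import pandas as pd

# In plain Python, adding a string and a float fails
raised_type_error = False
try:
    "age " + 4.0
except TypeError:
    raised_type_error = True
print(raised_type_error)   # True

# Hypothetical stand-ins for the filtered series
toy_names = pd.Series(["Passenger A", "Passenger B"])
toy_ages = pd.Series([4.0, 19.0])

# astype(str) turns each number into text, so + can join the pieces
toy_result = toy_names + ", age " + toy_ages.astype(str)
print(toy_result.tolist())   # ['Passenger A, age 4.0', 'Passenger B, age 19.0']
```

Because Age is stored as a floating-point column, the text keeps the decimal part (4.0 rather than 4).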

Deliverables

2.1 Who was the 45th passenger entry, and how old were they?
2.2 What type of data did my_names contain? What did my_ages contain?
2.3 Display the output of my_result.

Question 3 (2 points)

Looking at the Survived column of the titanic dataset, we can see that it contains binary values for each person’s life status:

0 = Dead/did not survive
1 = Alive/did survive

We’re going to create a column containing the life status and sex of each person. This can allow for future data analysis when looking at the counts of what sort of people survived.

The following example will use the Pclass column. Do NOT use this column for your work. For deliverables, you will create a Status column for the 0 and 1 values of the Survived column.

First, we need to convert the Pclass 1, 2, and 3 values to First Class, Second Class, and Third Class. This Pclass column contains numeric data, and we want to make a new column containing labels for each 1, 2, and 3 values to make it easier to interpret.

# Create a categorical Passenger_Class column
# Rename each numeric category (1, 2, 3) with a text label;
# the column stays categorical and keeps its order
myDF["Passenger_Class"] = pd.Categorical(
    myDF["Pclass"],
    categories=[1, 2, 3],
    ordered=True
).rename_categories(["First Class", "Second Class", "Third Class"])

Another approach to creating this new column is to map a new label onto each of the old values. No categories or ordering get carried over to the new column. This method is for when you just want to create name labels.

# If there is a 1 value, label it as "First Class"
# If there is a 2 value, label it as "Second Class"
# If there is a 3 value, label it as "Third Class"
myDF["Passenger_Class"] = myDF["Pclass"].map({
    1: "First Class",
    2: "Second Class",
    3: "Third Class"
})

Using pd.Categorical preserves the fact that the variable is categorical and ordered, which can be important for sorting and analysis. In contrast, map() only replaces values with labels and does not retain any category or ordering information.
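The difference can be seen by applying both approaches to a made-up four-row column. The dataframe toy and the column names AsCategorical and AsMapped are hypothetical and chosen only for this illustration.

```python
import pandas as pd

toy = pd.DataFrame({"Pclass": [3, 1, 2, 3]})
labels = ["First Class", "Second Class", "Third Class"]

# Approach 1: ordered categorical with renamed categories
toy["AsCategorical"] = pd.Categorical(
    toy["Pclass"], categories=[1, 2, 3], ordered=True
).rename_categories(labels)

# Approach 2: plain label lookup through a dictionary
toy["AsMapped"] = toy["Pclass"].map(dict(zip([1, 2, 3], labels)))

# The categorical column remembers its categories and their order
print(toy["AsCategorical"].cat.ordered)            # True
print(list(toy["AsCategorical"].cat.categories))   # ['First Class', 'Second Class', 'Third Class']

# Because it is ordered, it can be compared against one of its labels
below_third = toy["AsCategorical"] < "Third Class"
print(below_third.tolist())                        # [False, True, True, False]

# The mapped column holds the same labels but is not categorical
print(isinstance(toy["AsMapped"].dtype, pd.CategoricalDtype))   # False
```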

Either method is acceptable for creating the Passenger_Class column, and you may use the same approach when creating the Status column in your own code. For the Status column, assign the labels Dead and Alive to the binary values in the Survived column.

Combine the new Status column with the Sex column to create Combined. One way to do this is to use the .str accessor with its .cat() method, as in .str.cat(). The separator between the two values in each row is entirely up to you. Some commonly used examples include:

", " → Alive, female
" " → Alive female
" - " → Alive - female
" | " → Alive | female
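A minimal sketch of .str.cat() on a made-up three-row table. The dataframe toy is hypothetical, and ", " is just one of the separators listed above.

```python
import pandas as pd

# Hypothetical three-row stand-in for the Status and Sex columns
toy = pd.DataFrame({
    "Status": ["Alive", "Dead", "Alive"],
    "Sex": ["female", "male", "female"],
})

# Join the two columns row by row, placing sep between the values
toy["Combined"] = toy["Status"].str.cat(toy["Sex"], sep=", ")

print(toy["Combined"].tolist())   # ['Alive, female', 'Dead, male', 'Alive, female']

# value_counts() tallies how often each combined label appears
counts = toy["Combined"].value_counts()
print(counts)
```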

Find how the passengers are distributed across the four Status and Sex pairs.

Deliverables

3.1 Show the value counts of the Status column.
3.2 Show the value counts of the Combined column.
3.3 Which method did you use to add labels to the Survived column values to create Status?

Question 4 (2 points)

Lists are often shown in examples as simple, containing meaningless numbers, strings, and Boolean values to help with teaching Python.

Look at the value counts of the Age column. The passengers aboard the Titanic spanned a huge range of ages. With myDF, we’re going to look at the values of the Age column and choose a cut-off for what counts as "old". This will include the ages 61 - 80. Now, create a list old_ages containing the ages 61, 62, 63, 64, 65, 66, 70, 70.5, 71, 74, 80.

The values in the old_ages list match those in the upper values of the Age column, but are not directly tied to the dataset just yet. Printing this list just shows the numbers 61, 62, 63, 64, 65, 66, 70, 70.5, 71, 74, and 80, with no actual Age column values related to any of them.

To find the values from the Age column that match those in this old_ages, use .isin() to show which elements of the Age column are found in this list. Name this result old:

old = df["[column_name]"][df["[column_name]"].isin(old_ages)]
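How .isin() behaves can be sketched on a short made-up series. The names toy_ages and toy_old_ages are hypothetical stand-ins for the Age column and the old_ages list.

```python
import pandas as pd

# Hypothetical stand-ins for the Age column and the old_ages list
toy_ages = pd.Series([22.0, 61.0, 70.5, 35.0, 80.0])
toy_old_ages = [61, 62, 70.5, 80]

# isin() returns True wherever the value appears in the list
mask = toy_ages.isin(toy_old_ages)
print(mask.tolist())      # [False, True, True, False, True]

# Indexing with the mask keeps only the matching entries
toy_old = toy_ages[mask]
print(toy_old.tolist())   # [61.0, 70.5, 80.0]
```

Values in the list that never occur in the series (62 here) are simply ignored.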

As fun as it was to have to write a section of text "61, 62, 63, 64, 65, 66, 70, 70.5, 71, 74, 80" to create this list, it is often simpler to use a range when selecting values like this.

What we’re about to do is completely inefficient, but it does give us some practice working with and manipulating lists.

First, create a list my_list1 containing all the values from the Age column that are less than or equal to 10, and print the results. Put that aside, and create a new list my_list2 containing all the values from the Age column that are greater than 10 and less than or equal to 60.

It is important to notice that the Age column has no values strictly between 60 and 61 (such as 60.5). my_list2 ends at 60 and old begins at 61, so the boundary between the two groups is clear. If there had been in-between values, we would have needed a different approach to define the groups.

With my_list1 and my_list2, combine them to make the list people.
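The filtering and combining steps can be sketched on a short made-up series. All names starting with toy_ are hypothetical. Note that each condition needs its own parentheses when two are joined with &.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Age column, including one missing value
toy_ages = pd.Series([4.0, 10.0, 10.5, 35.0, 60.0, 61.0, np.nan])

# Values less than or equal to 10
toy_list1 = toy_ages[toy_ages <= 10].tolist()

# Values greater than 10 and up to 60; & combines the two conditions
toy_list2 = toy_ages[(toy_ages > 10) & (toy_ages <= 60)].tolist()

# Plain lists are joined end to end with +
toy_people = toy_list1 + toy_list2

print(toy_list1)    # [4.0, 10.0]
print(toy_list2)    # [10.5, 35.0, 60.0]
print(toy_people)   # [4.0, 10.0, 10.5, 35.0, 60.0]
```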

This question could have been simpler without making my_list1 and my_list2, but now you know for certain that this is correct, and know how to deal with data like that.

Deliverables

4.1 How old was the youngest passenger aboard the Titanic? How old was the oldest?
4.2 Display the value counts of old.
4.3 Combine my_list1 and my_list2 to create people.

Question 5 (2 points)

From people, we need to create three more lists:

children: range of ages <= 20
young: range of ages > 20 & <= 40
adult: range of ages > 40 & <= 60

Check out the value counts of each list to confirm that it contains the expected values. Print the minimum and maximum values of each list to show the range that it covers.

Add labels to the Age column that correspond to these different ranges. Use .isin() again to do this. BUT there is something new to do here. We will be taking the values from Age for each value in each list, but will be creating a new column AgeGroup that will assign a label to each value based on what age range it belongs to.

In question 4, we created old like this: old = myDF["Age"][myDF["Age"].isin(old_ages)]. Here, to create the Old values in AgeGroup, we would do something like this: myDF.loc[myDF["Age"].isin(old), "AgeGroup"] = "Old". Display the value counts of AgeGroup once you have created values within it for each of the four age range lists. Finally, use pd.crosstab() to create a table that will compare the value combinations between the AgeGroup and Combined columns.
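The labeling and cross-tabulation steps can be sketched on a made-up four-row table. The dataframe toy and the lists toy_children and toy_old are hypothetical, and only two of the four age groups are labeled to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row stand-in for myDF
toy = pd.DataFrame({
    "Age": [4.0, 35.0, 70.5, np.nan],
    "Combined": ["Alive, female", "Dead, male", "Dead, male", "Alive, female"],
})
toy_children = [4.0]
toy_old = [70.5]

# .loc[row_mask, column] assigns the label only on the matching rows;
# rows that match no list are left as missing values
toy.loc[toy["Age"].isin(toy_children), "AgeGroup"] = "Children"
toy.loc[toy["Age"].isin(toy_old), "AgeGroup"] = "Old"

print(toy["AgeGroup"].value_counts())

# Count how often each AgeGroup and Combined pair occurs;
# rows with a missing AgeGroup are left out of the table
table = pd.crosstab(toy["AgeGroup"], toy["Combined"])
print(table)
```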

Deliverables

5.1 Lists children, young, adult, and old containing values that represent the Age column values.
5.2 Create the AgeGroup column that contains an age range label for each value of Age.
5.3 Cross tabulate AgeGroup with Combined and display the results.

Submitting your Work

Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.

Items to submit

firstname_lastname_project4.ipynb

It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used. Please make sure that your work is your own work, and that any outside sources (people, internet pages, generative AI, etc.) are cited properly in the project template.

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not.

Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.