TDM 10200: Project 2 — 2023

Motivation: Pandas will enable us to work with data in Python (in a similar way to the data frames that we learned in R in the fall semester).

Context: This is our second project and we will continue to introduce some basic data types and go thru some similar control flow concepts like we did in R.

Scope: tuples, lists, if statements, opening files

Learning Objectives
  • List the differences between lists & tuples and when to use each.

  • Gain familiarity with string methods, list methods, and tuple methods.

  • Demonstrate the ability to read and write data of various formats using various packages.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/death_records/DeathRecords.csv

Questions

ONE

pandas is an integral tool for various data science tasks in Python. You can read a quick intro here.

Let’s first learn how to create a simple dataframe. If we have a store that we only sell hats that are blue and white and shoes that are red and purple.
Imagine its a Saturday and we wanted to keep track of our customer’s purchase. Our first sale is a red shoe and a blue hat. Second sale is a purple shoe and a blue hat. Third is a red shoe and a blue hat, fourth is a purple shoe and a white hat, fifth is a red shoe and a white hat, the sixth and seventh sale are red shoes and blue hats.

So it looks a bit like this:

data = {
    'shoes':['red', 'purple', 'red', 'purple', 'red', 'red', 'red'],
    'hats': ['blue', 'blue', 'blue', 'white', 'white', 'blue', 'blue']
}
  1. Create a data set named 'data'

  2. Take the data you created and make it into a dataframe named store

  3. Now change the index numbers 0-6 to customers Jay, Mary, Bill, Chris, Martha, Karen, Rob

Helpful Hint
store = pd.DataFrame(data, index=['Jay', 'Mary', 'Bill', 'Chris', 'Martha','Karen', 'Rob'])

store
Insider Knowledge

Pandas allows you to extract data from a CSV (comma-separated values) file. Pandas is a great way to get acquainted with your data, including the ability to clean, transform, and analyze data.

The two main components of pandas are the series and DataFrame. A series is one dimensional (you can think of it as a column of data) and a DataFrame is a table made up of a collection of series.

Notice that the indexing for our dataframe starts at 0. In python, the indexing starts at 0, as compared to R in the fall semester, where the indexing began at 1. This is an important fact to remember.

Items to submit
  • Code used to answer the question.

  • Result of code.

TWO

Open a new notebook in Jupyter Lab, and select the f2022-2023 kernel. We want to go ahead and read in the dataset /anvil/projects/tdm/data/death_records/DeathRecords.csv into a pandas DataFrame called myDF.

  1. Find the information for the 11th row in the dataframe

  2. Find the last five rows of the data frame

  3. Find how many rows and columns there are in the entire dataframe

  4. Print just the column names

Helpful Hints
.head()
.tail()
.shape
Items to submit
  • Code used to answer the question.

  • Result of code.

THREE

Let’s look for specific information in our dataframe so we can become a bit more familiar with what it contains.

  1. How many people over the age of 52 are on this list?

  2. How many males vs how many females

  3. How many females that are over the age of 70 on this list?

Items to submit
  • Code used to answer the question.

  • Result of code.

FOUR

Now that we have a bit of familiarity with the data, let’s introduce another common python package called matplotlib Let’s create a graphic using this package.

  1. Create a graphic that illustrates then number of people who are divorced, married, single, unmarried, or widowed.

  2. Create another graphic that illustrates the distribution of the age of the person at the time of death.

Helpful Hint
import matplotlib.pyplot as plt
Insider Knowledge

Matplotlib is a data visualization and plotting library for Python. It provides easy ways to visualize data.

Items to submit
  • Code used to answer the question.

  • Result of code.

FIVE

Now that you are familiar with the data and have an introduction to plotting, create a plot of your choice, to summarize something that you find interesting about the data.

  1. Use pandas and your investigative skills to look thru the data and find an interesting fact, and then create a graphic that summarizes some of the data from our dataset!

Items to submit
  • Code used to answer the question.

  • Result of code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.