TDM 10200: Project 5 — 2023

Motivation: Once we have some data analysis working in Python, we often want to wrap it into a function. Dr Ward usually tests anything that he wrote (usually 5 times), to make sure it works, before wrapping it into a function. Once we are sure our analysis works, if we wrap it into a function, it can usually be easier to use.

Context: Functions also help us to put our work into bite-size pieces that are easier to understand. The basic idea is similar to functions from R or from other languages and tools.

Scope: functions

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

/anvil/projects/tdm/data/craigslist/vehicles.csv

/anvil/projects/tdm/data/flights/subset/

/anvil/projects/tdm/data/death_records/DeathRecords.csv

We have written several examples to get you started with functions here. All 6 of the videos for this project are given at the top of this page. At the beginning, you probably only need to read the section on Arguments.

If it helps, you also have a longer article available here. It is a very detailed article going through many things that you can do with functions in Python. In particular, the section on Argument Passing might be helpful.

Questions

ONE

Read in the dataset from Project 4

/anvil/projects/tdm/data/craigslist/vehicles.csv

and name it cars.

  1. Write a function called mycarcount that takes two parameters: cars as a data frame, and year as an integer, and outputs the number of cars from that year. (Alternatively, you can just use 1 argument, the year, as a parameter, and then read through the cars data frame inside the function. Either way is OK.)

  2. Run the function for each of the years from Project 4, Question 4, namely, for the years 2011, 1989, 1997. Make sure that your answers agree with the results from that earlier project.

You can solve this question in a couple different ways. You can either read in the entire dataset into a data frame called cars or you can read just a few lines at a time, using the chunksize = 10000 and the iterrows like in the videos. Either way is OK!

Items to submit
  • Code used to answer the question.

  • Result of code.

TWO

  1. Run the function mycarcount for each year in the data set. (Of course, be sure to only run it once for each year!)

  2. Now make sure that the results agree, if you compare with the value_counts() from the year column.

It will take a long time to run mycarcount on each year in the data set, so you might want to start by running mycarcount on just a few years, for instance, on 5 years or 10 years, to make sure that things work, before running mycarcount on all of the years.

Items to submit
  • Code used to answer the question.

  • Result of code.

THREE

Use the csv data sets from this directory (the same as from Project 3):

/anvil/projects/tdm/data/flights/subset/

  1. Write a function that takes two parameters: myorigin as a string with three characters, and year as an integer, and outputs the number of flights that depart during that year, from the Origin airport indicated in myorigin.

  2. Test your function for a few years and airports of your choice. You can choose! Do your results look reasonable, i.e., do the airports in the big cities have lots of flights, compared to airports in smaller cities?

  3. Run the function for each of the years from 1987 to 2008, checking how many flights depart from IND in each year. Make sure that you use the method from the end of Project 3, Question 5.

For this question, you should not read the full data frame all at once, but instead, you should just a few lines at a time, using the chunksize = 10000 and the iterrows like in the videos.

It will take a long time to run your function on each year in the data set, so you might want to start by running your function on just a few years, for instance, on 3 years or 5 years, to make sure that things work, before running your function on all of the years.

Helpful Hint
total_count = 0
for df in pd.read_csv(putthefilenamehere, chunksize=10000):
    for index, row in df.iterrows():
        if row['Origin'] == myorigin:
            total_count += 1
Items to submit
  • Code used to answer the question.

  • Result of code.

FOUR

Extend your function from Question 3 as follows:

  1. Modify your function so that it takes three parameters: myorigin and mydest as strings that each have three characters, and year as an integer, and outputs the number of flights that depart during that year, from the Origin airport indicated in myorigin, and arrive at the Dest airport indicated in mydest.

  2. Test your function for a few years and pairs of airports (origin and destination airports) of your choice. Do the results look reasonable, e.g., if you compare popular flight paths, versus unpopular flight paths?

  3. Run the function for each of the years from 1987 to 2008, checking how many flights depart from IND and arrive at ORD in each year.

Again, for this question, you should not read the full data frame all at once, but instead, you should just a few lines at a time, using the chunksize = 10000 and the iterrows like in the videos.

Again, it will take a long time to run your function on each year in the data set, so you might want to start by running your function on just a few years, for instance, on 3 years or 5 years, to make sure that things work, before running your function on all of the years.

Items to submit
  • Code used to answer the question.

  • Result of code.

FIVE

Use the csv data set for the DeathRecords from Project 2:

/anvil/projects/tdm/data/death_records/DeathRecords.csv

  1. Write a function that takes two parameters: Sex (which will be F or M) and MaritalStatus (D or M or S or U or W), and outputs the number of people with the indicated Sex and MaritalStatus in the data set. (If you look at an earlier version of this question, in which we asked about the year of death, well, everyone in the data set died in 2014, so you do not need to worry about the year of death.)

You can solve this question in a couple different ways. You can either read in the entire dataset into a data frame, or you can read just a few lines at a time, using the chunksize = 10000 and the iterrows like in the videos. Either way is OK!

Items to submit
  • Code used to answer the question

  • Result of the code

TA applications for The Data Mine are currently being accepted. Please visit us here to apply!

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.