STAT 39000: Project 13 — Spring 2022

Motivation: This year you’ve been exposed to a lot of powerful (and perhaps new-to-you) tools and concepts. It would be really impressive if you were able to retain all of it, but realistically that probably didn’t happen. It takes lots of practice for these skills to develop. One common term you may hear thrown around is ETL. It stands for Extract, Transform, Load. You may or may not ever have to work with an ETL pipeline; however, it is a worthwhile exercise to plan one out.

Context: This is the first of the final two projects where you will map out an ETL pipeline, along with the remaining typical tasks of a full data science project, and then execute your plan. It wouldn’t be practical to make this exhaustive, but the idea is to think about and plan out the various steps in a project and execute it the best you can given time and resource constraints.

Scope: Python

Learning Objectives
  • Describe and plan out an ETL pipeline to solve a problem of interest.

  • Create a flowchart mapping out the steps in the project.

Make sure to read about, and use, the template found here, as well as the important information about project submissions here.

Questions

Question 1

Create a problem statement for the project. What question are you interested in answering? What theory do you have that you’d like to show might be true? This could be anything. Some examples could be:

  • Should you draft running backs before wide receivers in fantasy football?

  • Are news articles more "positive" or "negative" on nytimes.com vs. washingtonpost.com?

  • Is the number of stars on an Amazon review important for telling whether the review is fake or not?

  • Are flight delays more likely to happen in the summer or winter?

The question you want to answer can be as simple or complex as you want it to be.

When coming up with the problem statement, please take into consideration that in this project, and the next, we will ask you to utilize skills you were exposed to this year. Things like: SLURM, joblib, pytorch, JAX, docker/singularity, fastapi, sync/async, pdoc, pytest, etc. It is likely that you will want to use other skills from previous years as well. Things like: web scraping, writing scripts, data wrangling, SQL, etc.

Try to think of a question that could be solved by utilizing some of these skills.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Read about ETL pipelines here. Summarize each part of the pipeline (extract, transform, and load) in your own words. Follow this up by looking at the image at the top of this section of "R for Data Science". Where do you think the ETL pipeline could be added to this workflow? Read about Dr. Wickham's definition of tidy data. After reading about his definition, do you think the "Tidy" step in the chart is potentially different from the "transform" step in the ETL pipeline?

There is no single correct answer to this question. Just think about it and describe what you think.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Flowcharts are an incredibly useful tool that can help you visualize and plan a project from start to finish. They can help you realize which parts of the project you are not clear on, which could save a lot of work during implementation. Read about the various flowchart shapes here, and plan out your ETL pipeline and the remaining project workflow using this free online tool. Include the image of your flowchart in your notebook.
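If you export your flowchart as an image, one convenient way to include it in your notebook is to embed it with IPython's display utilities. This is just a convenience sketch; the filename flowchart.png is a placeholder for whatever you named your exported image.

```python
# Embed the exported flowchart image directly in a Jupyter notebook cell.
# "flowchart.png" is a placeholder filename -- use the name of your exported image.
from IPython.display import Image

Image(filename="flowchart.png", width=800)
```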

You are not required to follow this flowchart exactly. You will have an opportunity to point out any changes you ended up making to your project flow later on.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Your project will, more or less, consist of a few "major" steps:

  • Extract: scrape data, run database queries, find and download data files, etc.

  • Transform: data wrangling using pandas, tidyverse, JAX, numpy, etc.

  • Load: load data into a database or a file that represents your "data warehouse".

  • Import/tidy: Grab data from your "data warehouse" and tidy it if necessary.

  • Iterate: Modify/visualize/model your data.

  • Communicate: Share your deliverable(s).

Of course, you don’t need to include all of these steps. Any well-planned approach will receive full credit.

This can be further boiled down to just a few steps (a rough code sketch of how they might fit together follows the list):

  • Data collection/cleaning.

  • Analysis/modeling/visualization.

  • Report.
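As a point of reference only, here is a minimal sketch of how these stages might hang together in Python for a project like the flight-delay example from Question 1. The source URL, column names, and the sqlite file used as the "data warehouse" are all hypothetical placeholders; your own extract step may involve scraping, API calls, or database queries instead.

```python
import sqlite3

import pandas as pd


def extract(url: str) -> pd.DataFrame:
    """Extract: pull the raw data (here, a hypothetical CSV of flight records)."""
    return pd.read_csv(url)


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Transform: wrangle the raw data -- drop incomplete rows, derive a season column."""
    df = df.dropna(subset=["dep_delay", "month"])
    df["season"] = df["month"].map(
        lambda m: "summer" if m in (6, 7, 8) else "winter" if m in (12, 1, 2) else "other"
    )
    return df


def load(df: pd.DataFrame, db_path: str = "warehouse.db") -> None:
    """Load: write the cleaned data into a sqlite file acting as the 'data warehouse'."""
    conn = sqlite3.connect(db_path)
    df.to_sql("flights", conn, if_exists="replace", index=False)
    conn.close()


if __name__ == "__main__":
    # The URL below is a placeholder for wherever your raw data actually lives.
    raw = extract("https://example.com/flights.csv")
    load(transform(raw))
```

Each of these functions would naturally live in its own script or module, and the sqlite file then becomes the input to your import/tidy step.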

Implement your data collection/cleaning step. Be sure to submit any relevant files and code (e.g. python script(s), R script(s), or simply some code cells in a Jupyter Notebook, etc.) in your submission.

To get full credit, simply choose at least 2 of the following skills to incorporate into this step (or these steps); a short sketch combining a couple of them follows the list:

  • Google style docstrings, or tidyverse style comments if utilizing R.

  • Singularity/docker (if, for example, you wanted to use a container image to run your code repeatably).

  • sync/async code (if, for example, you wanted to speed up code that has a lot of I/O).

  • joblib (if, for example, you wanted to speed up the scraping of many files).

  • SLURM (if, for example, you wanted to speed up the scraping of many files).

  • requests/selenium (if, for example, you need to scrape data as a part of your collection process).

  • If you choose to use sqlite as your intermediate "data warehouse" (instead of something easier like a csv or parquet file), this will count as a skill.

  • If you use argparse and build a functioning Python script, this will count as a skill.

  • If you write pytest tests for your code, this will count as a skill.
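For instance (as an illustration only, not a requirement), the hedged sketch below combines two of these skills: a functioning argparse script with Google style docstrings that loads what it collects into a sqlite "data warehouse". The URL, JSON shape, and table name are hypothetical placeholders for whatever your actual collection step looks like.

```python
import argparse
import sqlite3

import requests


def fetch_titles(url: str) -> list[str]:
    """Download a page and pull out article titles.

    Args:
        url: Page to scrape (hypothetical endpoint returning a JSON list).

    Returns:
        A list of title strings.
    """
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return [item["title"] for item in response.json()]


def save_titles(titles: list[str], db_path: str) -> None:
    """Append the collected titles to a sqlite table acting as the warehouse.

    Args:
        titles: Titles to store.
        db_path: Path to the sqlite database file.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS titles (title TEXT)")
    conn.executemany("INSERT INTO titles VALUES (?)", [(t,) for t in titles])
    conn.commit()
    conn.close()


def main() -> None:
    """Parse command line arguments and run the collection step."""
    parser = argparse.ArgumentParser(description="Collect article titles.")
    parser.add_argument("url", help="URL to scrape (hypothetical).")
    parser.add_argument("--db", default="warehouse.db", help="sqlite output file.")
    args = parser.parse_args()
    save_titles(fetch_titles(args.url), args.db)


if __name__ == "__main__":
    main()
```

Running something like `python collect.py https://example.com/articles.json --db warehouse.db` in a terminal is also a natural place to grab the screenshots requested below.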

Make sure to include a screenshot or two of you actually using your deliverable(s) in your notebook (for example, if your deliverable was a script, show some screenshots of your terminal running the code). In addition, make sure to clearly indicate which of the "skills" you chose to use for this step.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Even if you read about ETL pipelines, you are probably still not exactly sure what a "data warehouse" is. Browse the internet and read about data warehouses. In your own words, summarize what a data warehouse is and its typical components.

Here are some common data warehouse products:

  • Snowflake

  • Google BigQuery

  • Amazon Redshift

  • Apache Hive

  • Databricks Lakehouse Platform

Choose one of these products to read about, describe 2-3 things it appears to be able to do, and explain why (or when) you think that functionality would be useful.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.