STAT 39000: Project 14 — Spring 2022

Motivation: This year you’ve been exposed to a lot of powerful (and maybe new for you) tools and concepts. It would be really impressive if you were able to retain all of it, but realistically that probably didn’t happen. It takes lots of practice for these skills to develop. One common term you may hear thrown around is ETL, which stands for Extract, Transform, Load. You may or may not ever have to work with an ETL pipeline, but it is a worthwhile exercise to plan one out.

Context: This is the second of the final two projects, in which you map out an ETL pipeline and the remaining typical tasks of a full data science project, and then execute your plan. It wouldn’t be practical to make this exhaustive, but the idea is to think about and plan out the various steps in a project and execute them the best you can given time and resource constraints.

Scope: Python

Learning Objectives
  • Describe and plan out an ETL pipeline to solve a problem of interest.

  • Create a flowchart mapping out the steps in the project.

Make sure to read about and use the template found here, as well as the important information about project submissions here.


Question 1

If you skipped project 13, please go back and complete project 13 and submit it as your project 14 submission. Please make a bold note at the top of your submission "This is my project 14, but is really project 13", so graders know what to expect. Thanks!

In the previous project, you probably spent most of the time reading about ETL, flowcharts, and data warehouses, and planning out your project. The more you have planned out, the less time implementation will likely take. Your project probably looks something like this now.

  • Data collection/cleaning.

  • Analysis/modeling/visualization.

  • Report.

In this project, you will complete those last two steps.

Import data from your "data warehouse" and perform an analysis to answer the problem statement you created in the previous project. Your analysis should contain:

  • 1 or more data visualizations.

  • 1 or more sets of summary data (think .describe() from pandas or summary/prop.table from R).
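As a rough illustration of the summary-data requirement, the sketch below loads rows from a stand-in "data warehouse" and produces two sets of summary data with .describe(). The in-memory SQLite database, the trips table, and its columns are all made up for the example; substitute whatever storage and schema you planned in the previous project.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the "data warehouse": an in-memory SQLite
# database with a made-up `trips` table. Replace with your own source.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE trips (city TEXT, distance_km REAL);
    INSERT INTO trips VALUES
        ('Lafayette', 2.0), ('Lafayette', 4.0),
        ('Indianapolis', 10.0), ('Indianapolis', 14.0);
    """
)

# Extract: pull the table into a DataFrame.
trips = pd.read_sql_query("SELECT city, distance_km FROM trips", conn)
conn.close()

# Summary data set 1: count, mean, std, min, quartiles, and max overall.
overall = trips["distance_km"].describe()

# Summary data set 2: the same statistics computed per city.
by_city = trips.groupby("city")["distance_km"].describe()

print(overall)
print(by_city)
```

Grouping before calling .describe() is one simple way to get a second, more specific summary from the same data.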

Feel free to utilize the transformers package and the wide variety of pre-built models provided at huggingface.

Alternatively, you can build an API and/or dashboard using fastapi (or any other framework like django, flask, shiny, etc.). Simply make sure to include your code and screenshots of you utilizing the API or using the dashboard.

For either of the options above (summary data/visualizations or API/dashboard), in order to get full credit, choose at least 2 of the following skills to incorporate into this step (or these steps):

  • Google style docstrings, or tidyverse style comments if utilizing R.

  • Singularity/docker (if, for example, you wanted to use a container image to run your code repeatably, or run your API/dashboard).

  • sync/async code (if, for example, you wanted to speed up code that has a lot of I/O).

  • joblib (if, for example, you wanted to speed up a parallelizable task or computation).

  • SLURM (if, for example, you wanted to speed up a parallelizable task or computation).

  • If you use argparse and build a functioning Python script, this will count as a skill.

  • If you write pytest tests for your code, this will count as a skill.

  • Use JAX (for example jax.jit) or pytorch for some numeric computation.
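To make a few of the skills above concrete, here is a small sketch that combines Google style docstrings, an argparse-based script, and a pytest-style test in one file. The summarize function and the command-line interface are invented for illustration; your own script would wrap whatever analysis your project needs.

```python
import argparse
import statistics
from typing import Optional


def summarize(values: list[float]) -> dict[str, float]:
    """Compute basic summary statistics for a list of numbers.

    Args:
        values: The numbers to summarize. Must not be empty.

    Returns:
        A dict with the keys "n", "mean", and "median".

    Raises:
        ValueError: If `values` is empty.
    """
    if not values:
        raise ValueError("values must not be empty")
    return {
        "n": len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
    }


def main(argv: Optional[list[str]] = None) -> int:
    """Parse command-line arguments and print the summary.

    Args:
        argv: Arguments to parse; argparse falls back to `sys.argv[1:]`
            when this is None.

    Returns:
        The process exit code (0 on success).
    """
    parser = argparse.ArgumentParser(description="Summarize a list of numbers.")
    parser.add_argument("values", nargs="+", type=float, help="numbers to summarize")
    args = parser.parse_args(argv)
    for name, value in summarize(args.values).items():
        print(f"{name}: {value}")
    return 0


def test_summarize():
    """pytest collects functions named test_*; plain asserts are enough."""
    assert summarize([1.0, 2.0, 6.0]) == {"n": 3, "mean": 3.0, "median": 2.0}


# Demonstration call with an explicit argument list. In a real script you
# would instead end the file with:
#     if __name__ == "__main__":
#         raise SystemExit(main())
main(["1", "2", "6"])
```

Passing argv into main explicitly keeps the command-line logic testable without touching sys.argv. In practice the test function would usually live in a separate file such as test_summarize.py (a hypothetical name) and be run with the pytest command.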

Make sure to include a screenshot or two actually using your deliverable(s) in your notebook (for example, if it was a script, show some screenshots of your terminal running the code). In addition, make sure to clearly indicate which of the "skills" you chose to use for this step.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

The final task, to create your deliverable to communicate your results, is probably the most important part of a typical project. It is important that people understand what you did, why it is important for answering your question, and why it provides value. Learning how to make a good slide deck is a really useful skill!

In our case, it makes more sense to have a Jupyter Notebook, since those are easy to read in Gradescope, and get the point across.

After your question 1 results are entered in your Jupyter Notebook, under the "Question 2" heading, create your deliverable. Use markdown cells to beautifully format the information you want to present. Include everything starting with your problem statement, leading all the way up to your conclusions (even if just anecdotal conclusions). Include code, graphics, and screenshots that are important to the story. Of course, you don’t need to include code from scripts (in the notebook — we do want all scripts from question 1 (if any) included in your submission), but you can mention that you had a script called that did X, Y, and Z.

The goal of this deliverable is that an outsider could read your notebook (starting from question 2) and understand what question you had, what you did (and why), and what the results were. Any good effort will receive full credit.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

In the previous project, you were asked to create a flowchart to describe the steps in your system/project. As you began implementing things, you may or may not have changed your original plan. If you did, update your flowchart and include it in your notebook. Otherwise, include your old flowchart and explain that you didn’t change anything.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

It has been a fun year. We hope that you learned something new!

  • Write 3 (or more) of your least favorite topics and/or projects from this past year (for STAT 39000).

  • Write 3 (or more) of your most favorite projects/topics, and/or 3 topics you wish you were able to learn more about.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.