TDM 30100: Project 2 — 2023

Motivation: Documentation is one of the most critical parts of a project. There are so many tools that are specifically designed to help document a project, and each have their own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can’t go wrong with tools like Sphinx, or pdoc.

Context: This is the first project in a 3-project series where we explore thoroughly documenting Python code, while solving data-driven problems.

Scope: Python, documentation

Learning Objectives
  • Use Sphinx to document a set of Python code.

  • Use pdoc to document a set of Python code.

  • Write and use code that serializes and deserializes data.

  • Learn the pros and cons of various serialization formats.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/apple/health/watch_dump.xml

Questions

In this project we will work with pdoc to build some simple documentation, review some Python skills that may be rusty, and learn about a serialization and deserialization of data — a common component to many data science and computer science projects, and a key topics to understand when working with APIs.

For the sake of clarity, this project will have more deliverables than the "standard" .ipynb notebook, .py file containing Python code, and PDF. In this project, we will ask you to submit an additional PDF showing the documentation webpage that you will have built by the end of the project. How to do this will be made clear in the given question.

Make sure to select 3 cores of memory for this project, otherwise you may get an issue reading the dataset in question 3.

Question 1 (2 pts)

  1. Create a module-level docstring as described below. You will submit the Python file containing this docstring at the end of this project.

Let’s start by navigating to ondemand.anvil.rcac.purdue.edu, and launching a Jupyter Lab instance. In the previous project, you learned how to run various types of code in a Jupyter notebook (the .ipynb file). Jupyter Lab is actually much more useful. You can open terminals on Anvil (the cluster), as well as open an editor for .R files, .py files, or any other text-based file.

Give it a try. In the "Other" category in the Jupyter Lab home page, where you would normally select the "seminar" kernel, instead select the "Python File" option. Upon clicking the square, you will be presented with a file called untitled.py. Rename this file to firstname-lastname-project02.py (where firstname and lastname are your first and last name, respectively).

Make sure you are in your $HOME directory when clicking the "Python File" square. Otherwise you may get an error stating you do not have permissions to create the file.

Read the "3.8.2 Modules" section of Google’s Python Style Guide. Each individual .py file is called a Python "module". It is good practice to include a module-level docstring at the top of each module. Create a module-level docstring for your new module. Rather than giving an explanation of the module, and usage examples, instead include a short description (in your own words, 3-4 sentences) of the terms "serialization" and "deserialization". In addition, list a few (at least 2) examples of different serialization formats, and include a brief description of the format, and some advantages and disadvantages of each. Lastly, if you could break all serialization formats into 2 broad categories, what would those categories be, and why?

Any good answer for the "2 broad categories" will be accepted. With that being said, a hint would be to think of what the serialized data looks like (if you tried to open it in a text editor, for example), or how it is read.

Save your module.

Relevant topics: pdoc, Sphinx, Docstrings & Comments

Items to submit
  • A module-level docstring with all of the above requirements.

Question 2 (2 pts)

  1. Write a line of bash to generate documentation using pdoc.

Now, in Jupyter Lab, open a new notebook using the "seminar" kernel.

You can have both the Python file and the notebook open in separate Jupyter Lab tabs for easier navigation.

Fill in a code cell for question 1 with a Python comment.

# See firstname-lastname-project02.py

For this question, read the pdoc section, and run a bash command to generate the documentation for your module that you created in the previous question, firstname-lastname-project02.py. To do this, look at the example provided in the book. Everywhere in the example in the pdoc section of the book where you see "mymodule.py" replace it with your module’s name — firstname-lastname-project02.py.

As an optional step, you can write bash code to create a documentation directory for this project. This is not required, but is good practice in order to keep your projects organized. An example of how to do this is below, using the -p flag to only create the directory if it does not already exist:

mkdir -p $HOME/project2/docs

Use python3 not python in your command.

We are expecting you to run the command in a bash cell (likely using line magic). Additionally, ensure that your documentation directory is empty, as you will submit it at the end of this project and it should only have files for this project in it. Below are a few hints on how to get started writing this command.

python3 -m pdoc [other commands here]

Use the -o flag to specify the output directory — I would suggest making it somewhere in your $HOME directory to avoid permissions issues.

For example, I used $HOME/project2/docs.

You can use the d flag to specify the type of of docstring you are using. For example, you can use -d numpy to specify that you are using numpy-style docstrings.

Once complete, on the left-hand side of the Jupyter Lab interface, navigate to your documentation directory. You should see something called firstname-lastname-project02.html. To view this file in your browser, right click on the file, and select Open in New Browser Tab. A new browser tab should open with your freshly made documentation. Pretty cool!

Ignore the index.html file — we are looking for the firstname-lastname-project02.html file.

You may have noticed that the docstrings are (partially) markdown-friendly. Try introducing some markdown formatting in your docstring for more appealing documentation.

Relevant topics: pdoc, Sphinx, Docstrings & Comments

Items to submit
  • Any and all bash used to generate your documentation.

Question 3 (2 pts)

  1. Write a function called get_records_for_date with the functionality described below.

  2. Write a Google-style docstring for the function, and regenerate your documentation.

Any references to "watch data" just mean the dataset for this project.

In your firstname-lastname-project02.py file, write a function called get_records_for_date that accepts an lxml etree (of our watch data, via etree.parse), and a datetime.date, and returns a list of Record Elements, for a given date. Raise a TypeError if the date is not a datetime.date, or if the etree is not an lxml.etree. This should be included in both your .ipynb and .py files.

Use the Google Python Style Guide’s "Functions and Methods" section to write the docstring for this function. Be sure to include type annotations for the parameters and return value.

Re-generate your documentation. How does the updated documentation look? You may notice that the formatting is pretty ugly and things like "Args" or "Returns" are not really formatted in a way that makes it easy to read.

Use the -d flag to specify the format as "google", and re-generate your documentation. How does the updated documentation look?

The following code should help get you started.

import lxml
import lxml.etree
from datetime import datetime, date

def get_records_for_date(tree: lxml.etree._ElementTree, for_date: date) -> list[lxml.etree._Element]:
    # docstring goes here

    # test if `tree` is an `lxml.etree._ElementTree`, and raise TypeError if not

    # test if `for_date` is a `datetime.date`, and raise TypeError if not

    # loop through the records in the watch data using the xpath expression `/HealthData/Record`

    # how to see a record, in case you want to. (DO NOT PUT WITHIN THE FOR LOOP, OR YOU WILL GET A LOT OF OUTPUT AND POTENTIALLY AN ERROR)
    print(lxml.etree.tostring(record))

    # test if the record's `startDate` is the same as `for_date`, and append to a list if it is

    # return the list of records

# how to test this function
tree = lxml.etree.parse('/anvil/projects/tdm/data/apple/health/watch_dump.xml')
chosen_date = datetime.strptime('2019/01/01', '%Y/%m/%d').date()
my_records = get_records_for_date(tree, chosen_date)
my_records
output
[<Element Record at 0x7ffb7c27a440>,
 <Element Record at 0x7ffb7c27a480>,
 <Element Record at 0x7ffb7c27a4c0>,
 <Element Record at 0x7ffb7c27a500>,
 <Element Record at 0x7ffb7c27a540>,
 <Element Record at 0x7ffb7c27a580>,
 <Element Record at 0x7ffb7c27a5c0>,
 <Element Record at 0x7ffb7c27a600>,
 <Element Record at 0x7ffb7764e3c0>,
 <Element Record at 0x7ffb7764e400>,
 <Element Record at 0x7ffb7764e440>,
 <Element Record at 0x7ffb7764e480>,
 ....

The following is some code that will be helpful to test the types.

from datetime import datetime, date

isinstance(some_date_object, date) # test if some_date_object is a date
isinstance(some_xml_tree_object, lxml.etree._ElementTree) # test if some_xml_tree_object is an lxml.etree._ElementTree

To loop through records, you can use the xpath method.

for record in tree.xpath('/HealthData/Record'):
    # do something with record

The attrib method will allow you to access a specific attribute of a record. For example, record.attrib['endDate'] will return the endDate attribute of a record. However, this simply returns a string and not a datetime date object. If you are having trouble figuring out how to appropriately make the comparison between for_date and the date of a record, take a look back at the above code for testing your function. It may include some functions to help you out.

Relevant topics: pdoc, Sphinx, Docstrings & Comments

Items to submit
  • get_records_for_date function as described above, with an appropriate, Google-style docstring.

Question 4 (2 pts)

  1. Modify your module so you do not need to pass the -d flag in order to let pdoc know that you are using Google-style docstrings.

  2. Write a new function called quad that calculates the roots of a given quadratic equation.

  3. Write a docstring for this function that includes math formulas, and render them appropriately using pdoc.

  4. Add a logo to your documentation.

This was hopefully a not-too-difficult project that gave you some exposure to tools in the Python ecosystem, as well as chipped away at any rust you may have had with writing Python code.

To end things off, investigate the official pdoc documentation in order to answer the rest of this question.

You will notice that there is a way to specify the docstring format in your module, so that you do not need to pass the -d flag when generating your documentation. Modify your module so that you do not need to pass the -d flag when generating your documentation.

Next, write a function called quad that accepts 3 parameters representing the coefficients of a quadratic equation, a, b, and c, and prints the roots of the equation. Raise a TypeError if any of the parameters are not int or float. Raise a ValueError if a is 0. Each root should be separated by a comma. Write a docstring for this function that includes math formulas, and render them appropriately using pdoc. Ensure that this function appears in both your .ipynb and .py files.

Below is the quadratic formula you should implement for this question:

$x=\frac{-b\pm\sqrt{b^2-4ac}}{2a}$

Lastly, add a logo to your documentation. You can use the Purdue logo, or any other logo you would like (as long as it is appropriate).

At the time of this project’s writing, the Purdue logo can be found at this link.

Relevant topics: pdoc, Sphinx, Docstrings & Comments

Items to submit
  • Modified module to specify your docstring format.

  • New quad function as described above.

  • Appropriate docstring for quad function, including properly rendered math formula.

  • Documentation with logo.

Submitting your Work

The submission requirements for this project are a bit complicated. Please take care to read this section carefully to ensure you recieve full credit for the work you did.

Items to submit

For this project, please submit the following files:

  • The .ipynb file with:

  • a simple comment for question 1,

  • a bash cell for question 2 with code that generates your pdoc html documentation,

  • a code cell with your get_records_for_date function (for question 3)

  • a code cell with the results of running +

# read in the watch data
tree = lxml.etree.parse('/anvil/projects/tdm/data/apple/health/watch_dump.xml')

chosen_date = datetime.strptime('2019/01/01', '%Y/%m/%d').date()
my_records = get_records_for_date(tree, chosen_date)
my_records
  • a bash code cell with the code that generates your pdoc documentation as described in question 4.

  • a code cell with your quad function (for question 4)

  • a code cell with the results of running +

quad(3, -11, 4)
  • An .html file with your newest set of documention (including your question 4 modifications)

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.

Here is the Zoom recording of the 4:30 PM discussion with students from 28 August 2023: