TDM 30100: Project 2 — 2022

Motivation: Documentation is one of the most critical parts of a project. There are so many tools that are specifically designed to help document a project, and each have their own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can’t go wrong with tools like Sphinx, or pdoc.

Context: This is the first project in a 3-project series where we explore thoroughly documenting Python code, while solving data-driven problems.

Scope: Python, documentation

Learning Objectives
  • Use Sphinx to document a set of Python code.

  • Use pdoc to document a set of Python code.

  • Write and use code that serializes and deserializes data.

  • Learn the pros and cons of various serialization formats.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/apple/health/watch_dump.xml

Questions

In this project we will work with pdoc to build some simple documentation, review some Python skills that may be rusty, and learn about a serialization and deserialization of data — a common component to many data science and computer science projects, and a key topics to understand when working with APIs.

For the sake of clarity, this project will have more deliverables than the "standard" .ipynb notebook, .py file containing Python code, and PDF. In this project, we will ask you to submit an additional PDF showing the documentation webpage that you will have built by the end of the project. How to do this will be made clear in the given question.

Make sure to select 4096 MB of RAM for this project. Otherwise you may get an issue reading the dataset in question 3.

Question 1

Let’s start by navigating to ondemand.anvil.rcac.purdue.edu, and launching a Jupyter Lab instance. In the previous project, you learned how to run various types of code in a Jupyter notebook (the .ipynb file). Jupyter Lab is actually much more useful. You can open terminals on Anvil (the cluster), as well as open a an editor for .R files, .py files, or any other text-based file.

Give it a try. In the "Other" category in the Jupyter Lab home page, where you would normally select the "f2022-s2023" kernel, instead select the "Python File" option. Upon clicking the square, you will be presented with a file called untitled.py. Rename this file to firstname-lastname-project02.py (where firstname and lastname are your first and last name, respectively).

Make sure you are in your $HOME directory when clicking the "Python File" square. Otherwise you may get an error stating you do not have permissions to create the file.

Read the "3.8.2 Modules" section of Google’s Python Style Guide. Each individual .py file is called a Python "module". It is good practice to include a module-level docstring at the top of each module. Create a module-level docstring for your new module. Rather than giving an explanation of the module, and usage examples, instead include a short description (in your own words, 3-4 sentences) of the terms "serialization" and "deserialization". In addition, list a few (at least 2) examples of different serialization formats, and include a brief description of the format, and some advantages and disadvantages of each. Lastly, if you could break all serialization formats into 2 broad categories, what would those categories be, and why?

Any good answer for the "2 broad categories" will be accepted. With that being said, a hint would be to think of what the serialized data looks like (if you tried to open it in a text editor, for example), or how it is read.

Save your module.

Relevant topics: pdoc, Sphinx, Docstrings & Comments

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Now, in Jupyter Lab, open a new notebook using the "f2022-s2023" kernel.

You can have both the Python file and the notebook open in separate Jupyter Lab tabs for easier navigation.

Fill in a code cell for question 1 with a Python comment.

# See firstname-lastname-project02.py

For this question, read the pdoc section, and run a bash command to generate the documentation for your module that you created in the previous question, firstname-lastname-project02.py. To do this, look at the example provided in the book. Everywhere in the example in the pdoc section of the book where you see "mymodule.py" replace it with your module’s name — firstname-lastname-project02.py.

Use python3 not python in your command.

We are expecting you to run the command in a bash cell, however, if you decide to run it in a terminal, please make sure to document your command. In addition, you’ll need to run the following in order for pdoc to be recognized as a module.

module use /anvil/projects/tdm/opt/core
module load tdm
module load python/f2022-s2023

Then you can run your command.

python3 -m pdoc other commands here

Use the -o flag to specify the output directory — I would suggest making it somewhere in your $HOME directory to avoid permissions issues.

For example, I used $HOME/output.

Once complete, on the left-hand side of the Jupyter Lab interface, navigate to your output directory. You should see something called firstname-lastname-project02.html. To view this file in your browser, right click on the file, and select Open in New Browser Tab. A new browser tab should open with your freshly made documentation. Pretty cool!

Ignore the index.html file — we are looking for the firstname-lastname-project02.html file.

You may have noticed that the docstrings are (partially) markdown-friendly. Try introducing some markdown formatting in your docstring for more appealing documentation.

Relevant topics: pdoc, Sphinx, Docstrings & Comments

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

When I refer to "watch data" I just mean the dataset for this project.

Write a function to called get_records_for_date that accepts an lxml etree (of our watch data, via etree.parse), and a datetime.date, and returns a list of Record Elements, for a given date. Raise a TypeError if the date is not a datetime.date, or if the etree is not an lxml.etree.

Use the Google Python Style Guide’s "Functions and Methods" section to write the docstring for this function. Be sure to include type annotations for the parameters and return value.

Re-generate your documentation. How does the updated documentation look? You may notice that the formatting is pretty ugly and things like "Args" or "Returns" are not really formatted in a way that makes it easy to read.

Use the -d flag to specify the format as "google", and re-generate your documentation. How does the updated documentation look?

The following code should help get you started.

import lxml
import lxml.etree
from datetime import datetime, date

def get_records_for_date(tree: lxml.etree._ElementTree, for_date: date) -> list[lxml.etree._Element]:
    # docstring goes here

    # test if `tree` is an `lxml.etree._ElementTree`, and raise TypeError if not

    # test if `for_date` is a `datetime.date`, and raise TypeError if not

    # loop through the records in the watch data using the xpath expression `/HealthData/Record`
        # how to see a record, in case you want to
        print(lxml.etree.tostring(record))

        # test if the record's `startDate` is the same as `for_date`, and append to a list if it is

    # return the list of records

# how to test this function
tree = etree.parse('/anvil/projects/tdm/data/apple/health/watch_dump.xml')
chosen_date = datetime.strptime('2019/01/01', '%Y/%m/%d').date()
my_records = get_records_for_date(tree, chosen_date)
my_records
output
[<Element Record at 0x7ffb7c27a440>,
 <Element Record at 0x7ffb7c27a480>,
 <Element Record at 0x7ffb7c27a4c0>,
 <Element Record at 0x7ffb7c27a500>,
 <Element Record at 0x7ffb7c27a540>,
 <Element Record at 0x7ffb7c27a580>,
 <Element Record at 0x7ffb7c27a5c0>,
 <Element Record at 0x7ffb7c27a600>,
 <Element Record at 0x7ffb7764e3c0>,
 <Element Record at 0x7ffb7764e400>,
 <Element Record at 0x7ffb7764e440>,
 <Element Record at 0x7ffb7764e480>,
 ....

The following is some code that will be helpful to test the types.

from datetime import datetime, date

isinstance(some_date_object, date) # test if some_date_object is a date
isinstance(some_xml_tree_object, lxml.etree._ElementTree) # test if some_xml_tree_object is an lxml.etree._ElementTree

To loop through records, you can use the xpath method.

for record in tree.xpath('/HealthData/Record'):
    # do something with record

Relevant topics: pdoc, Sphinx, Docstrings & Comments

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

This was hopefully a not-too-difficult project that gave you some exposure to tools in the Python ecosystem, as well as chipped away at any rust you may have had with writing Python code.

Finally, investigate the official pdoc documentation, and make at least 2 changes/customizations to your module. Some examples are below — feel free to get creative and do something with pdoc outside of this list of options:

  • Modify the module so you do not need to pass the -d flag in order to let pdoc know that you are using Google-style docstrings.

  • Change the logo of the documentation to your own logo (or any logo you’d like).

  • Add some math formulas and change the output accordingly.

  • Edit and customize pdoc’s jinja2 template (or CSS).

For this project, please submit the following files:

  • The .ipynb file with:

  • a simple comment for question 1,

  • a bash cell for question 2 with code that generates your pdoc html documentation,

  • a code cell with your get_records_for_date function (for question 3)

  • a code cell with the results of running +

# read in the watch data
tree = lxml.etree.parse('/anvil/projects/tdm/data/apple/health/watch_dump.xml')

chosen_date = datetime.strptime('2019/01/01', '%Y/%m/%d').date()
my_records = get_records_for_date(tree, chosen_date)
my_records
  • a bash code cell with the code that generates your pdoc html documentation (using the google styles)

  • a markdown cell describing the changes you made for question 4.

  • An .html file with your newest set of documention (including your question 4 modifications)

Relevant topics: pdoc, Sphinx, Docstrings & Comments

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.