STAT 39000: Project 3 — Fall 2021

Thank yourself later and document now

Motivation: Documentation is one of the most critical parts of a project. There are many tools specifically designed to help document a project, and each has its own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can’t go wrong with tools like Sphinx or pdoc.

Context: This is the second project in a 2-project series where we explore thoroughly documenting Python code, while solving data-driven problems.

Scope: Python, documentation

Learning Objectives
  • Use Sphinx to document a set of Python code.

  • Use pdoc to document a set of Python code.

  • Write and use code that serializes and deserializes data.

  • Learn the pros and cons of various serialization formats.

Make sure to read about and use the template found here, and read the important information about project submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/apple/health/watch_dump.xml

Questions

In this project, we are going to use the most popular Python documentation generation tool, Sphinx, to generate documentation for the module we created in project (2). If you chose to skip project (2), the module, in its entirety, will be posted by this upcoming Monday at the latest. You do not need that module to complete this project, and your module from project (2) does not need to be perfect.

Last project was more challenging than intended. This project will provide a bit of a reprieve, and should (hopefully) be fun to mess around with.

project_02_module.py

"""This module is for project 2 for STAT 39000.

**Serialization:** Serialization is the process of taking a set or subset of data and transforming it into a specific file format that is designed for transmission over a network, storage, or some other specific use-case.

**Deserialization:** Deserialization is the opposite process from serialization where the serialized data is reverted back into its original form.

The following are some common serialization formats:

- JSON
- Bincode
- MessagePack
- YAML
- TOML
- Pickle
- BSON
- CBOR
- Parquet
- XML
- Protobuf

**JSON:** One of the more widespread serialization formats, JSON has the advantages that it is human readable, and has an excellent set of optimized tools written to serialize and deserialize. In addition, it has first-rate support in browsers. A disadvantage is that it is not a fantastic format storage-wise (it takes up lots of space), and parsing large JSON files can use a lot of memory.

**MessagePack:** MessagePack is a non-human-readable file format (binary) that is extremely fast to serialize and deserialize, and is extremely efficient space-wise. It has excellent tooling in many different languages. It is still not the *most* space efficient, or *fastest* to serialize/deserialize, and remains impossible to work with in its serialized form.

Generally, each format is either *human-readable* or *not*. Human readable formats are able to be read by a human when opened up in a text editor, for example. Non human-readable formats are typically in some binary format and will look like random nonsense when opened in a text editor.
"""

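import lxml
import lxml.etree
import msgpack  # third-party package used by from_msgpack and to_msgpack
from lxml import etree  # lets the code below call etree.Element directly
from datetime import datetime, date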


def my_function(a, b):
    """
    >>> my_function(2, 3)
    6
    >>> my_function('a', 3)
    'aaa'
    >>> my_function(1, 3)
    4
    """
    return a * b


def get_records_for_date(tree: lxml.etree._ElementTree, for_date: date) -> list:
    """
    Given an `lxml.etree` object and a `datetime.date` object, return a list of records
    with the startDate equal to `for_date`.

    Args:
        tree (lxml.etree): The watch_dump.xml file as an `lxml.etree` object.
        for_date (datetime.date): The date for which returned records should have a startDate equal to.

    Raises:
        TypeError: If `tree` is not an `lxml.etree` object.
        TypeError: If `for_date` is not a `datetime.date` object.

    Returns:
        list: A list of records with the startDate equal to `for_date`.
    """

    if not isinstance(tree, lxml.etree._ElementTree):
        raise TypeError('tree must be an lxml.etree')

    if not isinstance(for_date, date):
        raise TypeError('for_date must be a datetime.date')

    results = []
    for record in tree.xpath('/HealthData/Record'):
        if for_date == datetime.strptime(record.attrib.get('startDate'), '%Y-%m-%d %X %z').date():
            results.append(record)

    return results


def from_msgpack(file: str) -> lxml.etree._Element:
    """
    Given the absolute path of a msgpack file, return the deserialized `lxml.Element` object.

    Args:
        file (str): The absolute path of the msgpack file to deserialize.

    Raises:
        TypeError: If `file` is not a `str`.

    Returns:
        lxml.Element: The deserialized `lxml.Element` object.
    """

    if not isinstance(file, str):
        raise TypeError('file must be a str')

    with open(file, 'rb') as f:
        d = msgpack.load(f)

    e = etree.Element('Record')
    for key, value in d.items():
        e.attrib[key] = str(value)

    return e


def to_msgpack(element: lxml.etree._Element, file: str) -> None:
    """
    Given an `lxml.Element` object and a file path, serialize the `lxml.Element` object to
    a msgpack file at the given file path.

    Args:
        element (lxml.Element): The element to serialize.
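        file (str): The absolute path of the msgpack file to save to.

    Raises:
        TypeError: If `file` is not a `str`.
        TypeError: If `element` is not an `lxml.Element`.
        ValueError: If `element` is missing, or has an empty value for, any of `type`, `sourceVersion`, `unit`, or `value`.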

    Returns:
        None: None
    """

    if not isinstance(file, str):
        raise TypeError('file must be a str')

    if not isinstance(element, lxml.etree._Element):
        raise TypeError('element must be an lxml.Element')

    # Test if `type`, `sourceVersion`, `unit`, and `value` are present in the element.
    d = dict(element.attrib)
    if not d.get('type') or not d.get('sourceVersion') or not d.get('unit') or not d.get('value'):
        raise ValueError('element must have all of the following keys: type, sourceVersion, unit, and value')

    # Remove "other" keys from the dict
    keys_to_remove = []
    for key in d.keys():
        if key not in ['type', 'sourceVersion', 'unit', 'value']:
            keys_to_remove.append(key)

    for key in keys_to_remove:
        del d[key]

    with open(file, 'wb') as f:
        msgpack.dump(d, f)

if __name__ == '__main__':
    import doctest
    doctest.testmod()
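
To make the serialization and deserialization ideas from the module docstring concrete, here is a minimal sketch that uses only the standard library. It swaps in json (human-readable) and pickle (binary) in place of MessagePack so that nothing extra needs to be installed. The record dictionary and its values are made up for illustration and are not taken from watch_dump.xml.

```python
import json
import pickle

# Hypothetical record attributes, shaped like the four keys kept by to_msgpack.
record = {
    "type": "HKQuantityTypeIdentifierHeartRate",
    "sourceVersion": "7.0",
    "unit": "count/min",
    "value": "72",
}

# Serialize: turn the in-memory dict into text (JSON) or bytes (pickle).
as_json = json.dumps(record)
as_pickle = pickle.dumps(record)

# Deserialize: recover an equal dict from each serialized form.
from_json = json.loads(as_json)
from_pickle = pickle.loads(as_pickle)

print(as_json)            # readable in any text editor
print(type(as_pickle))    # <class 'bytes'>, not meant for human eyes
print(from_json == record, from_pickle == record)
```

Both round trips return a dictionary equal to the original; the difference is in what the serialized form looks like and how much space it takes.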

Question 1

Please use Firefox for this project. If you choose to use Chrome, the appearance of the documentation will be horrible. If you choose to use Chrome anyway, it is recommended that you change a setting in Chrome, temporarily, for this project, by typing (where you would normally put the URL):

chrome://flags

Then, search for "samesite". For "SameSite by default cookies", change from "Default" to "Disabled", and restart the browser.

  • Create a new folder in your $HOME directory called project3.

  • Create a new Jupyter notebook in that folder called project3.ipynb, based on the normal project template.

    The majority of this notebook will just contain a single bash cell with the commands used to re-generate the documentation. This is okay, and by design. The main deliverable for this project will end up being the PDF of the documentation’s HTML page.

  • Copy and paste the code from project (2)'s firstname-lastname-project02.py module into the $HOME/project3 directory. You can rename it to firstname_lastname_project03.py.

  • In a bash cell in your Jupyter notebook, make sure you cd into the project3 folder, and run the following command:

    python -m sphinx.cmd.quickstart ./docs -q -p project3 -a "Kevin Amstutz" -v 1.0.0 --sep

    Please replace "Kevin Amstutz" with your own name.

    What does each of these arguments do? Check out this page of the official documentation.
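
As a quick reference while you read the official documentation, the following sketch pairs each argument of the quickstart command with a one-line summary and then reassembles the command. The summaries are paraphrased, and "Your Name" is a placeholder; nothing here runs Sphinx itself.

```python
import shlex

# Each argument of the quickstart command, paired with a short description.
quickstart_args = [
    ("./docs", "directory in which the documentation project is created"),
    ("-q", "quiet mode: skip the interactive prompts"),
    ("-p project3", "project name"),
    ('-a "Your Name"', "author name (placeholder, use your own)"),
    ("-v 1.0.0", "project version"),
    ("--sep", "keep the source and build directories separate"),
]

command = ["python", "-m", "sphinx.cmd.quickstart"]
for flag, description in quickstart_args:
    command.extend(shlex.split(flag))
    print(f"{flag:<18} {description}")

# The assembled command, ready to paste into a bash cell.
print(shlex.join(command))
```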

You should be left with a newly created docs folder within your project3 folder. Your structure should look something like the following.

project03 folder contents
project03(1)
├── 39000_f2021_project03_solutions.ipynb(2)
├── docs(3)
│   ├── build (4)
│   ├── make.bat
│   ├── Makefile (5)
│   └── source (6)
│       ├── conf.py (7)
│       ├── index.rst (8)
│       ├── _static
│       └── _templates
└── kevin_amstutz_project03.py(9)

5 directories, 6 files
1 Our module folder (named project03)
2 Your project notebook (probably named something like firstname_lastname_project03.ipynb)
3 Your documentation folder
4 Your empty build folder where generated documentation will be stored
5 The Makefile used to run the commands that generate your documentation. Make the following changes:
# replace
SPHINXOPTS    ?=
SPHINXBUILD   ?= sphinx-build
SOURCEDIR     = source
BUILDDIR      = build

# with the following
SPHINXOPTS    ?=
SPHINXBUILD   ?= python -m sphinx.cmd.build
SOURCEDIR     = source
BUILDDIR      = build
6 Your source folder. This folder contains all hand-typed documentation.
7 Your conf.py file. This file contains the configuration for your documentation. Make the following changes:
# CHANGE THE FOLLOWING CONTENT FROM:

# -- Path setup --------------------------------------------------------------

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))

# TO:

# -- Path setup --------------------------------------------------------------

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
sys.path.insert(0, os.path.abspath('../..'))
8 Your index.rst file. This file (and all files ending in .rst) is written in reStructuredText — a Markdown-like syntax.
9 Your module. This is the module containing the code from the previous project, with nice, clean docstrings.
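
Why '../..' in conf.py? Relative paths there are resolved from the documentation root, which is the folder containing conf.py (docs/source), and the module lives two levels above it. The following sketch shows the resolution on a hypothetical home directory (/home/student is an assumption; substitute your own).

```python
import posixpath

# Hypothetical location of conf.py's folder; substitute your own $HOME.
conf_dir = "/home/student/project3/docs/source"

# '.' stays in docs/source, where there is no Python module to import.
print(posixpath.normpath(posixpath.join(conf_dir, ".")))

# '../..' climbs two levels, to the folder holding the project module.
module_dir = posixpath.normpath(posixpath.join(conf_dir, "../.."))
print(module_dir)  # /home/student/project3
```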

Finally, with the modifications above having been made, run the following command in a bash cell in Jupyter notebook to generate your documentation.

cd $HOME/project3/docs
make html

Once complete, your module folder's structure should look something like the following.

project03 folder contents
project03
├── 39000_f2021_project03_solutions.ipynb
├── docs
│   ├── build
│   │   ├── doctrees
│   │   │   ├── environment.pickle
│   │   │   └── index.doctree
│   │   └── html
│   │       ├── genindex.html
│   │       ├── index.html
│   │       ├── objects.inv
│   │       ├── search.html
│   │       ├── searchindex.js
│   │       ├── _sources
│   │       │   └── index.rst.txt
│   │       └── _static
│   │           ├── alabaster.css
│   │           ├── basic.css
│   │           ├── custom.css
│   │           ├── doctools.js
│   │           ├── documentation_options.js
│   │           ├── file.png
│   │           ├── jquery-3.5.1.js
│   │           ├── jquery.js
│   │           ├── language_data.js
│   │           ├── minus.png
│   │           ├── plus.png
│   │           ├── pygments.css
│   │           ├── searchtools.js
│   │           ├── underscore-1.13.1.js
│   │           └── underscore.js
│   ├── make.bat
│   ├── Makefile
│   └── source
│       ├── conf.py
│       ├── index.rst
│       ├── _static
│       └── _templates
└── kevin_amstutz_project03.py

9 directories, 29 files

In the left-hand pane in the Jupyter Lab interface, navigate to $HOME/project3/docs/build/html/, and right click on the index.html file and choose Open in New Browser Tab. You should now be able to see your documentation in a new tab.

Make sure you are able to generate the documentation before you proceed, otherwise, you will not be able to continue to modify, regenerate, and view your documentation.
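
If you would like a quick programmatic check that the build worked, a small helper like the one below can confirm that the HTML entry page exists. The function name documentation_is_built is made up for this sketch, and it assumes the folder layout shown above.

```python
import os
from pathlib import Path

def documentation_is_built(project_dir):
    """Return True if Sphinx's HTML entry page exists under project_dir."""
    index_page = Path(project_dir) / "docs" / "build" / "html" / "index.html"
    return index_page.is_file()

# Assumes the project lives in $HOME/project3, as in the steps above.
print(documentation_is_built(Path(os.path.expanduser("~")) / "project3"))
```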

Items to submit
  • Code used to solve this problem (in 2 Jupyter bash cells).

Question 2

One of the most important documents in any package or project is the README.md file. This file is so important that version control companies like GitHub and GitLab will automatically display it below the repository's contents. This file contains things like instructions on how to install the package, usage examples, lists of dependencies, license links, etc. Check out some popular GitHub repositories for projects like numpy, pytorch, or any other repository you’ve come across that you believe does a good job explaining the project.

In the docs/source folder, create a new file called README.rst. Choose 3-5 of the following "types" of reStructuredText from this webpage, and create a fake README. The content can be Lorem Ipsum type of content as long as it demonstrates 3-5 of the types of reStructuredText.

  • Inline markup

  • Lists and quote-like blocks

  • Literal blocks

  • Doctest blocks

  • Tables

  • Hyperlinks

  • Sections

  • Field lists

  • Roles

  • Images

  • Footnotes

  • Citations

  • Etc.

Make sure to include at least 1 section. This counts as 1 of your 3-5.
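
One detail that is easy to get wrong: a section title needs an underline at least as long as the title text. The sketch below builds a tiny fake README as a Python string, with a helper (rst_section, a made-up name) that always produces a correctly sized underline. The content is placeholder text showing a section, inline markup, a bullet list, and a literal block.

```python
def rst_section(title, underline="="):
    """Return a reStructuredText section heading for the given title."""
    # The underline must be at least as long as the title text.
    return f"{title}\n{underline * len(title)}\n"

readme = "\n".join([
    rst_section("Fake Project"),
    "Lorem ipsum with *emphasis*, **strong emphasis**, and ``inline code``.",
    "",
    "* first bullet item",
    "* second bullet item",
    "",
    "A literal block follows the double colon::",
    "",
    "    pip install fake-project",
    "",
])

print(readme)
```

You could paste the printed text into docs/source/README.rst, or simply type the reStructuredText by hand.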

Once complete, add a reference to your README to the index.rst file. To add a reference to your README.rst file, open the index.rst file in an editor and add "README" as follows.

index.rst
.. project3 documentation master file, created by
   sphinx-quickstart on Wed Sep  1 09:38:12 2021.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to project3's documentation!
====================================

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   README

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

Make sure "README" is aligned with ":caption:" — it should be 3 spaces from the left before the "R" in "README".

In a new bash cell in your notebook, regenerate your documentation. Check out the resulting index.html page, and click on the links. Pretty great!

Items to submit
  • Code used to solve this problem.

  • Screenshot or PDF labeled "question02_results".

Question 3

The pdoc package was specifically designed to generate documentation for Python modules using the docstrings in the module. As you may have noticed, this is not "native" to Sphinx.

Sphinx has extensions. One such extension is the autodoc extension. This extension provides the same sort of functionality that pdoc provides natively.

To use this extension, modify the conf.py file in the docs/source folder.

# -- General configuration ---------------------------------------------------

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    'sphinx.ext.autodoc'
]

Next, update your index.rst file so autodoc knows which modules to extract data from.

.. project3 documentation master file, created by
   sphinx-quickstart on Wed Sep  1 09:38:12 2021.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to project3's documentation!
====================================

.. automodule:: firstname_lastname_project03
    :members:

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   README

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

In a new bash cell in your notebook, regenerate your documentation. Check out the resulting index.html page, and click on the links. Not too bad!
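
To get a feel for the kind of information autodoc pulls from a module, here is a rough illustration using the standard library's inspect module. This is not how autodoc is implemented; collect_docstrings and the demo module are hypothetical names invented for this sketch.

```python
import inspect
import types

def collect_docstrings(module):
    """Map each function defined in `module` to its cleaned-up docstring."""
    docs = {}
    for name, obj in inspect.getmembers(module, inspect.isfunction):
        # Skip functions that were merely imported into the module.
        if obj.__module__ == module.__name__:
            docs[name] = inspect.getdoc(obj) or ""
    return docs

# A tiny stand-in module (hypothetical) so the sketch runs without any files.
demo = types.ModuleType("demo")
exec(
    "def double(x):\n"
    "    '''Return twice the value of x.'''\n"
    "    return 2 * x\n",
    demo.__dict__,
)

print(collect_docstrings(demo))  # {'double': 'Return twice the value of x.'}
```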

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Okay, while the documentation looks pretty good, clearly, Sphinx does not recognize Google style docstrings. As you may have guessed, there is an extension for that.

Add the napoleon extension to your conf.py file.

# -- General configuration ---------------------------------------------------

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.napoleon'
]

In a new bash cell in your notebook, regenerate your documentation. Check out the resulting index.html page, and click on the links. Much better!
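
For comparison, the sketch below shows the same hypothetical function documented twice: once with a Google style docstring (the style used in the project module) and once with the reStructuredText field lists that Sphinx understands without any extension. The second docstring is a hand-written approximation of the equivalent markup, not actual napoleon output.

```python
def scale(value, factor):
    """Multiply a value by a factor (Google style docstring).

    Args:
        value (float): The number to scale.
        factor (float): The multiplier.

    Returns:
        float: The scaled number.
    """
    return value * factor


def scale_rest(value, factor):
    """Multiply a value by a factor (reStructuredText field lists).

    :param value: The number to scale.
    :type value: float
    :param factor: The multiplier.
    :type factor: float
    :returns: The scaled number.
    :rtype: float
    """
    return value * factor
```

The Google style version is easier to read in the source file, which is a large part of why the napoleon extension is worth enabling.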

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

To make it explicitly clear what files to submit for this project:

  • firstname_lastname_project03.py

  • firstname_lastname_project03.ipynb

  • firstname_lastname_project03.pdf (result of exporting .ipynb to PDF)

  • firstname_lastname_project03_webpage.pdf (result of printing documentation webpage to PDF)

At this stage, you should have a pretty nice set of documentation, with really nice in-code documentation in the form of docstrings. However, there is still another "thing" to add to your docstrings that can take them to the next level.

doctest is a standard library tool that allows you to include code, along with its expected output, inside your docstring. Not only can this be nice for the user to see, but both pdoc and Sphinx apply special formatting to such additions to a docstring.

Write a super simple function. It could be as simple as adding a couple of digits and returning a value. The following is an example. Come up with your own function with at least 1 passing test and 1 failing test (like the example).

def add(value1, value2):
    """Function to add two values.

    The first example below will pass (because 1+1 is 2), the second will fail (because 1+2 is not 5)

    >>> add(1, 1)
    2

    >>> add(1, 2)
    5
    """
    return value1 + value2

Here, ">>>" represents the Python REPL prompt and is followed by code demonstrating how you would use the function. The line immediately following is the expected output.

Make sure your function actually does something so you can test to see if it is working as intended or not.

To use doctest, add the following to the bottom of your firstname_lastname_project03.py file.

if __name__ == '__main__':
    import doctest
    doctest.testmod()

Now, in a new bash cell in your notebook, run the following command.

python kevin_amstutz_project03.py -v

This will actually run your example code in the docstring and compare the output to the expected result! Very cool. We will learn more about this in the next couple of projects.

When including the -v option, both passing and failing tests will be printed. Without the -v option, only failing tests will be printed.
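
If you are curious how the pass and fail counts come about, the doctest module can also be driven directly from Python. The sketch below runs the examples in a single function's docstring and returns the counts; run_docstring_tests is a made-up helper name, and the add function is the example from above.

```python
import doctest

def add(value1, value2):
    """Function to add two values.

    >>> add(1, 1)
    2

    >>> add(1, 2)
    5
    """
    return value1 + value2

def run_docstring_tests(func):
    """Run the doctest examples in func's docstring; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, None
    )
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report; only the counts are of interest here.
    result = runner.run(test, out=lambda text: None)
    return result.failed, result.attempted

print(run_docstring_tests(add))  # (1, 2): one failure out of two examples
```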

Now, regenerate your documentation again and check it out. Notice how the lines in the docstring are neatly formatted? Pretty great.

Okay, last but not least, check out the themes here, and choose one of the themes listed, regenerate your documentation, and save the webpage to a PDF for submission. Note that each theme may have slightly different requirements on how to "activate" it. For example, to use the "Readable" theme, you must add the following to your conf.py file.

import sphinx_readable_theme
html_theme = 'readable'
html_theme_path = [sphinx_readable_theme.get_html_theme_path()]

You can change a theme by changing the value of html_theme in the conf.py file.

If a theme doesn’t work, just select a different theme.
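
As a fallback, Sphinx also ships with a handful of built-in themes that need no extra import or html_theme_path setting. A minimal conf.py fragment, assuming you are happy with one of the bundled themes, might look like this:

```python
# conf.py fragment: switch to a theme that ships with Sphinx itself.
# 'alabaster' is the default; other bundled names include 'classic',
# 'sphinxdoc', 'nature', and 'bizstyle'.
html_theme = 'nature'
```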

Unlike pdoc which only supports HTML output, Sphinx supports many output formats, including PDF. If interested, feel free to use the following code to generate a PDF of your documentation.

module load texlive/20200406
python -m sphinx.cmd.build -M latexpdf $HOME/project3/docs/source $HOME/project3/docs/build

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted.