STAT 39000: Project 4 — Fall 2021

Write it. Test it. Change it. Bop it?

Motivation: Code, especially newly written code, is refactored, updated, and improved frequently. It is for these reasons that testing code is imperative. Testing code is a good way to ensure that code is working as intended. When a change is made to code, you can run a suite of tests, and feel confident (or at least more confident) that the changes you made are not introducing new bugs. While methods of programming like TDD (test-driven development) are popular in some circles, and unpopular in others, what is agreed upon is that writing good tests is a useful skill and a good habit to have.

Context: This is the first of a series of two projects that explore writing unit tests, and doc tests. In The Data Mine, we will focus on using pytest, doc tests, and mypy, while writing code to manipulate and work with data.

Scope: Python, testing, pytest, mypy, doc tests

Learning Objectives
  • Write and run unit tests using pytest.

  • Include and run doc tests in your docstrings, using doctest.

  • Gain familiarity with mypy, and explain why static type checking can be useful.

  • Comprehend what a function is, and the components of a function in Python.

Make sure to read about and use the template found here, and read the important information about project submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/apple/health/2021/*

Questions

At the end of this project, you will need to submit the following:

  • A notebook (.ipynb) with the output of running your tests, and any other code.

  • An updated watch_data.py file with the doctests you wrote, fixed function, and custom function.

  • A test_watch_data.py file with the pytest tests you wrote.

Question 1

XPath expressions, while useful, have a very big limitation: the entire XML document must be read into memory. This is a problem for large XML documents. For example, parsing the export.xml file in the Apple Health data takes nearly 7GB of memory when the file is only 980MB.

prof.py
from memory_profiler import profile

@profile
def main():
    import lxml.etree

    tree = lxml.etree.parse("/home/kamstut/apple_health_export/export.xml")

if __name__ == '__main__':
    main()

python -m memory_profiler prof.py

Output
Filename: prof.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     3     36.5 MiB     36.5 MiB           1   @profile
     4                                         def main():
     5     38.5 MiB      2.0 MiB           1       import lxml.etree
     6
     7   6975.3 MiB   6936.8 MiB           1       tree = lxml.etree.parse("/home/kamstut/apple_health_export/export.xml")

This is a very common problem, not just for reading XML files, but for dealing with larger datasets in general. You will not always have an abundance of memory to work with.

To get around this issue, you will notice we take a streaming approach, where only parts of a file are read into memory at a time, processed, and then freed.
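A minimal sketch of the streaming idea, using the standard library's xml.etree.ElementTree.iterparse in place of lxml so that it runs without extra packages. The tiny in-memory document and the count_workouts helper are made up for illustration; they are not part of the apple_watch_parser library.

```python
import io
import xml.etree.ElementTree as ET

# A tiny stand-in for export.xml (hypothetical sample data).
SAMPLE_XML = b"""<HealthData>
  <Record type="HeartRate" value="61"/>
  <Workout workoutActivityType="Running" duration="30"/>
  <Record type="HeartRate" value="140"/>
  <Workout workoutActivityType="Walking" duration="12"/>
</HealthData>"""


def count_workouts(source):
    """Count Workout elements while processing one element at a time."""
    count = 0
    # iterparse yields each element as soon as its closing tag is read,
    # so elements can be processed and released one at a time.
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == "Workout":
            count += 1
        # Free the element's contents once we are done with it.
        element.clear()
    return count


print(count_workouts(io.BytesIO(SAMPLE_XML)))  # 2
```

The same loop works on a file path instead of a BytesIO object, which is where the memory savings matter.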

Copy our library from /depot/datamine/data/apple/health/apple_watch_parser and import it into your code cell for question 1. Examine the code and test out at least 2 of the methods or functions.

To copy the library run the following in a new cell.

%%bash

cp -r /depot/datamine/data/apple/health/apple_watch_parser $HOME

To import and use the library, make sure your notebook (let’s say my_notebook.ipynb) is in the same directory (the $HOME directory) as the apple_watch_parser directory. Then, you can import and use the library as follows.

from apple_watch_parser import watch_data

dat = watch_data.WatchData("/depot/datamine/data/apple/health/2021")
print(dat)

You may be asking yourself "well, what does that dat = watch_data.WatchData("/depot/datamine/data/apple/health/2021") line do, other than let us print the WatchData object?" The answer is that it creates a WatchData object, which lets us access and use the methods within it using dot notation. Any function defined inside the WatchData class is called a method and can be accessed using dot notation.

For example, if we had a function called my_function that was declared inside the WatchData class, we would call it as follows:

from apple_watch_parser import watch_data

dat = watch_data.WatchData("/depot/datamine/data/apple/health/2021")
dat.my_function(argument1, argument2)

Hopefully this is a good hint on how to use the dot notation to call methods in the WatchData class!
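As a rough illustration of the same idea with a toy class (StepCounter and its methods are hypothetical and are not part of the apple_watch_parser library):

```python
class StepCounter:
    """A toy class used only to illustrate calling methods."""

    def __init__(self, name):
        # Attributes store the object's data.
        self.name = name
        self.steps = 0

    def add_steps(self, n):
        # A method is a function defined inside the class; `self` is the object.
        self.steps += n
        return self.steps

    def __str__(self):
        # print(obj) uses __str__ when it is defined.
        return f"{self.name}: {self.steps} steps"


counter = StepCounter("monday")   # create the object
counter.add_steps(1200)           # call a method with dot notation
counter.add_steps(300)
print(counter)                    # monday: 1500 steps
```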

If you run help(watch_data.time_difference), you will get some nice info about the function, including the note "Given two strings in the format matching the format in Apple Watch data: YYYY-MM-DD HH:MM:SS -XXXX". What does this mean? It describes the layout of the date/time strings, which can be expressed using date/time format codes (see here).

Let’s say you have a string 2018-05-21 04:35:49 -0500, and you want to convert it to a datetime object. To do so you would run the following.

import datetime

my_datetime_string = '2018-05-21 04:35:49 -0500'
my_datetime = datetime.datetime.strptime(my_datetime_string, '%Y-%m-%d %H:%M:%S %z')

The string '%Y-%m-%d %H:%M:%S %z' is a combination of format codes (see here). In order to convert from a string to a datetime object, you need to use a combination of format codes that matches the format of the string. In this case, the string is '2018-05-21 04:35:49 -0500'. The "2018" part matches "%Y" from the format codes. The "05" part matches "%m" from the format codes. The "21" part matches "%d" from the format codes. The "04" part matches "%H" from the format codes. The "35" part matches "%M" from the format codes. The "49" part matches "%S" from the format codes. The " -0500" part matches "%z" from the format codes. If your datetime string follows a different format, you would need to modify the combination of format codes so that it matches your datetime string.
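For instance, a string in a different layout needs a different combination of format codes. The sample string below is made up for illustration.

```python
import datetime

# Day/month/year with a 12-hour clock and an AM/PM marker.
other_string = "21/05/2018 04:35 PM"
other_datetime = datetime.datetime.strptime(other_string, "%d/%m/%Y %I:%M %p")

print(other_datetime.hour)    # 16
print(other_datetime.month)   # 5

# A mismatched format raises ValueError instead of guessing.
try:
    datetime.datetime.strptime(other_string, "%Y-%m-%d %H:%M:%S %z")
except ValueError as err:
    print("could not parse:", err)
```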

Then, once you have a datetime object, you can do all sorts of fun things. The most obvious of which is converting the date back into a string, but formatting it exactly how you want. For example, let’s say we don’t want a string to have all the details '2018-05-21 04:35:49 -0500' has, and instead just want the month, day, and year using forward slashes instead of hyphens.

my_datetime.strftime('%m/%d/%Y') # '05/21/2018'
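Another thing a datetime object allows is arithmetic. The sketch below subtracts two timestamps in the Apple Watch format; seconds_between is a hypothetical helper written for this example and is not necessarily how watch_data.time_difference is implemented.

```python
import datetime

APPLE_FORMAT = "%Y-%m-%d %H:%M:%S %z"


def seconds_between(start, end):
    """Return the number of seconds from start to end (both strings)."""
    start_dt = datetime.datetime.strptime(start, APPLE_FORMAT)
    end_dt = datetime.datetime.strptime(end, APPLE_FORMAT)
    # Subtracting two datetimes gives a timedelta; the UTC offsets are
    # taken into account because both objects are timezone-aware.
    return (end_dt - start_dt).total_seconds()


print(seconds_between("2018-05-21 04:35:49 -0500", "2018-05-21 05:05:49 -0500"))  # 1800.0
print(seconds_between("2018-05-21 04:35:49 -0500", "2018-05-21 04:35:49 -0600"))  # 3600.0
```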
Items to submit
  • Code used to solve this problem — code that imports and uses our library and at least 2 of the methods or functions.

  • Output from running the code that uses 2 of the methods.

Question 2

As you may have noticed, the code contains fairly thorough docstrings. This is a good thing, and it is a good goal to aim for when writing your own Python functions, classes, modules, etc.

In the previous project, you got a small taste of using doctest to test your code using examples embedded in docstrings. This is a great way to test parts of your code that are simple, straightforward, and don’t involve extra data or fixtures.

Examine the code, and determine which functions and/or methods are good candidates for doctests. Modify the docstrings to include at least 3 doctests each, and run the following to test them out!

Include the following doctest in the calculate_speed function. This does not count as 1 of your 3 doctests for this function. It will fail for this question — that is okay!

>>> calculate_speed(5.0, .55, output_distance_unit = 'm')
Traceback (most recent call last):
    ...
ValueError: output_distance_unit must be 'mi' or 'km'

Make sure to include the expected output of each doctest below each line starting with >>>. This means in the code chunk shown above, you should include the "Traceback", "...", and "ValueError" lines as the expected output. Literally just copy and paste that entire code chunk into the calculate_speed docstring.
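To make the layout concrete, here is a small sketch of a docstring with doctests, including one that expects an exception. The miles_to_km function is hypothetical and is not part of watch_data.py.

```python
import doctest


def miles_to_km(miles):
    """Convert a distance in miles to kilometers.

    >>> miles_to_km(1.0)
    1.609344
    >>> round(miles_to_km(26.2), 2)
    42.16
    >>> miles_to_km(-1.0)
    Traceback (most recent call last):
        ...
    ValueError: miles must be non-negative
    """
    if miles < 0:
        raise ValueError("miles must be non-negative")
    return miles * 1.609344


if __name__ == "__main__":
    # Runs every example found in this module's docstrings.
    doctest.testmod()
```

Each >>> line is executed, and whatever follows it in the docstring is compared against the actual output. For exceptions, only the first "Traceback" line and the final exception line need to match; the "..." stands in for the stack details.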

%%bash

python $HOME/apple_watch_parser/watch_data.py -v

If you need to read in data or type a lot in order to use a function or method, a doctest is probably not the right approach. Hint, hint, try the functions rather than methods.

There are 2 functions that are good candidates for doctests.

Don’t forget to add the following code to the bottom of watch_data.py so doctests will run properly.

if __name__ == '__main__':
    import doctest
    doctest.testmod()
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

In question 2, we wrote a doctest for the calculate_speed function. Figure out why the doctest fails, and make modifications to the function so it passes the doctest. Do not modify the doctest.

When you update the calculate_speed function, be sure to first save the watch_data.py file and then re-import the package so that your modifications take effect.

Remember we want you to change the calculate_speed function to pass the doctest — not change the doctest to make it pass.

The output of calculate_speed(5.0, .55, output_distance_unit = 'm') is 9.09090909090909, but we want it to be ValueError: output_distance_unit must be 'mi' or 'km' because 'm' isn’t one of the two valid values, 'mi' or 'km'. Modify the calculate_speed function so it raises that error when the output_distance_unit parameter is not one of the two valid values.

Look carefully at the _convert_distance helper function — that is where you will want to make modifications. Your logic within each distance_unit if statement should be along the lines of: "Is the output_distance_unit parameter 'mi'? If so, convert and/or return this distance. Is it 'km'? If so, convert and/or return this distance. Otherwise, raise an error because output_distance_unit should only be 'mi' or 'km'."
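The branching described above follows a common validation pattern. Here is a sketch on a made-up helper (convert_km is hypothetical; it is not the library's _convert_distance and has a different signature):

```python
def convert_km(distance_km, output_unit="km"):
    """Convert a distance given in kilometers to the requested unit."""
    if output_unit == "km":
        return distance_km
    elif output_unit == "mi":
        return distance_km * 0.621371
    else:
        # Fail loudly instead of silently returning an unconverted value.
        raise ValueError("output_unit must be 'mi' or 'km'")


print(convert_km(10.0, output_unit="km"))   # 10.0

try:
    convert_km(10.0, output_unit="m")
except ValueError as err:
    print(err)   # output_unit must be 'mi' or 'km'
```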

To run the doctest:

%%bash

python $HOME/apple_watch_parser/watch_data.py -v

This is what doctests are for! They help you easily identify that something fundamental has changed and that the code isn’t ready for production. You can imagine a scenario where you automatically run all doctests before releasing a new product, and have that system notify you when a test fails. Very cool!

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

While doctests are good for simple testing, a package like pytest is better for more involved tests. For the standalone functions, write at least 2 tests each using pytest. Make sure these tests use different inputs than your doctests did. It’s not hard to come up with lots of tests!

This could end up being just 2 functions that run a total of 4 tests. That is okay, as long as each function has at least 2 assert statements.

Start by adding a new file called test_watch_data.py to your $HOME/apple_watch_parser directory. Then, fill the file with your tests. When ready to test, run the following in a new cell.

%%bash

cd $HOME/apple_watch_parser
python -m pytest
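As a sketch of what a test file might look like: pytest collects functions whose names start with test_ from files named test_*.py, and a plain assert is enough to express an expectation. The pace_per_km function is hypothetical and is defined inline only to keep the example self-contained; in test_watch_data.py you would import the functions from watch_data instead.

```python
import math


def pace_per_km(minutes, distance_km):
    """Hypothetical function under test: minutes taken per kilometer."""
    if distance_km <= 0:
        raise ValueError("distance_km must be positive")
    return minutes / distance_km


def test_pace_per_km_typical():
    assert pace_per_km(30.0, 5.0) == 6.0
    # Compare floats with a tolerance rather than exact equality.
    assert math.isclose(pace_per_km(25.0, 3.0), 8.3333333, rel_tol=1e-6)


def test_pace_per_km_rejects_zero_distance():
    try:
        pace_per_km(30.0, 0.0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected a ValueError")
```

pytest also provides pytest.raises as a more concise way to check for exceptions; the try/except form is used here only so the sketch does not need to import pytest.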

You may have noticed that we arbitrarily chose to place some functions outside of our WatchData class, and others inside. There is no hard and fast rule to determine if a function belongs inside or outside of a class. In general, however, if a function is related to the class, and works with the attributes/data of the class, it should be inside the class. If the function has no relationship to the class, or could be useful using other types of data, it should be outside of the class.

Of course, there are exceptions to this rule, and it is possible to write static methods for a class, which operate independently of the class and its attributes. We chose to write the functions outside of the class, more for demonstration purposes than anything else. They are functions that would most likely not be useful in any other context, but sort of demonstrate the concept and allow us to have good functions to practice writing doctests and pytest tests without fixtures.
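To illustrate the distinction, here is a sketch with a toy class (Route, its methods, and km_to_miles are all hypothetical):

```python
def km_to_miles(km):
    # Module-level function: needs no object and works on any number.
    return km * 0.621371


class Route:
    def __init__(self, distances_km):
        self.distances_km = distances_km

    def total_km(self):
        # Regular method: depends on this object's data via self.
        return sum(self.distances_km)

    @staticmethod
    def is_valid_distance(km):
        # Static method: grouped with the class, but uses no attributes.
        return km >= 0


route = Route([1.5, 2.5, 6.0])
print(route.total_km())               # 10.0
print(Route.is_valid_distance(-1))    # False
print(km_to_miles(route.total_km()))
```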

In the following project, we will continue to learn about pytest, including some more advanced features, like fixtures.

Relevant topics: pytest

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Explore the data — there is a lot! Think of a function that could be useful for this module that would live outside of the WatchData class. Write the function. Include Google style docstrings, doctests (at least 2), and pytest tests (at least 2, different from your doctests). Re-run both your doctest tests and pytest tests.

You can simply add this function to your watch_data.py module, and run the tests just like you did for the previous questions!

Your function doesn’t need to be useful for data outside the WatchData class (you won’t lose credit if it isn’t), but make an attempt! There are more types of elements and data that you can look at, other than just the Workout tags in the export.xml file. There is GPX data (XML data that can be used to map a workout route) in the /depot/datamine/data/apple/health/2021/workout-routes/ directory. Lots of options!

One way to peek around at the data (without having your notebook/kernel crash due to out of memory (OOM) errors) is something like the following:

from lxml import etree

tree = etree.iterparse("/depot/datamine/data/apple/health/2021/export.xml")
ct = 0
for event, element in tree:
    if element.tag == 'Workout':
        print(etree.tostring(element))
        ct += 1
        if ct > 100:
            break
    else:
        element.clear()

# to extract an element's attributes
element.attrib # dict-like object
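Since element.attrib behaves like a dictionary, the attributes of each element can be copied into ordinary dicts for later analysis. A sketch using the standard library's xml.etree.ElementTree on made-up sample data (the attribute names and values are assumptions for illustration; check the real file for the attributes it actually contains):

```python
import io
import xml.etree.ElementTree as ET

SAMPLE_XML = b"""<HealthData>
  <Workout workoutActivityType="Running" duration="30.5" durationUnit="min"/>
  <Workout workoutActivityType="Walking" duration="12.0" durationUnit="min"/>
</HealthData>"""

workouts = []
for _event, element in ET.iterparse(io.BytesIO(SAMPLE_XML), events=("end",)):
    if element.tag == "Workout":
        # Copy the attributes before clearing, since clear() empties attrib too.
        workouts.append(dict(element.attrib))
    element.clear()

total_minutes = sum(float(w["duration"]) for w in workouts)
print(len(workouts))    # 2
print(total_minutes)    # 42.5
```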

Relevant topics: pytest, html, xml

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted.