STAT 39000: Project 4 — Fall 2021
Write it. Test it. Change it. Bop it?
Motivation: Code, especially newly written code, is refactored, updated, and improved frequently. It is for these reasons that testing code is imperative. Testing code is a good way to ensure that code is working as intended. When a change is made to code, you can run a suite of tests and feel confident (or at least more confident) that the changes you made are not introducing new bugs. While methods of programming like TDD (test-driven development) are popular in some circles and unpopular in others, what is agreed upon is that writing good tests is a useful skill and a good habit to have.
Context: This is the first of a series of two projects that explore writing unit tests and doc tests. In The Data Mine, we will focus on using pytest, doc tests, and mypy, while writing code to manipulate and work with data.
Scope: Python, testing, pytest, mypy, doc tests
Dataset(s)
The following questions will use the following dataset(s):
- /depot/datamine/data/apple/health/2021/*
Questions
At the end of this project, you will need to submit the following:
Question 1
XPath expressions, while useful, have a very big limitation: the entire XML document must be read into memory. This is a problem for large XML documents. For example, parsing the export.xml file in the Apple Health data takes nearly 7GB of memory when the file is only 980MB.
    from memory_profiler import profile

    @profile
    def main():
        import lxml.etree

        tree = lxml.etree.parse("/home/kamstut/apple_health_export/export.xml")

    if __name__ == '__main__':
        main()
    python -m memory_profiler prof.py
    Filename: prof.py

    Line #    Mem usage    Increment  Occurences   Line Contents
    ============================================================
         3     36.5 MiB     36.5 MiB           1   @profile
         4                                         def main():
         5     38.5 MiB      2.0 MiB           1       import lxml.etree
         6
         7   6975.3 MiB   6936.8 MiB           1       tree = lxml.etree.parse("/home/kamstut/apple_health_export/export.xml")
This is a very common problem, not just for reading XML files, but for dealing with larger datasets in general. You will not always have an abundance of memory to work with.
To get around this issue, you will notice we take a streaming approach, where only parts of a file are read into memory at a time, processed, and then freed.
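As a rough illustration of the streaming idea (not necessarily how our library implements it), the sketch below uses the standard library's `xml.etree.ElementTree.iterparse` to count elements while clearing each one after it has been processed. The function name `count_records` and the tiny sample document are made up for this example.

```python
import io
import xml.etree.ElementTree as ET


def count_records(source, tag="Record"):
    """Count elements named `tag` while reading the XML incrementally."""
    count = 0
    # iterparse yields each element once its closing tag has been read
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
        # drop the element's text, attributes, and children once processed
        elem.clear()
    return count


# A tiny stand-in document; the element and attribute names are illustrative
sample = io.BytesIO(
    b"<HealthData>"
    b"<Record type='HeartRate' value='61'/>"
    b"<Record type='HeartRate' value='64'/>"
    b"<Workout duration='30'/>"
    b"</HealthData>"
)
print(count_records(sample))
```

The same function accepts a file path in place of the in-memory sample. Cleared elements are still referenced (as empty shells) by the root, so treat this as a sketch of the approach and not a fully tuned parser.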
Copy our library from /depot/datamine/data/apple/health/apple_watch_parser and import it into your code cell for question 1. Examine the code and test out at least 2 of the methods or functions.
To copy the library run the following in a new cell.
To import and use the library, make sure your notebook (let’s say
You may be asking yourself "well, what does that For example, if we had a function called
Hopefully this is a good hint on how to use the dot notation to call methods in the |
If you run Let’s say you have a string
The string '%Y-%m-%d %H:%M:%S %z' is made up of format codes (see Python's datetime documentation for the full list). To convert a string to a datetime object, you need a combination of format codes that matches the format of the string. In this case, the string is '2018-05-21 04:35:49 -0500', and its parts match the format codes as follows:
- "2018" matches "%Y"
- "05" matches "%m"
- "21" matches "%d"
- "04" matches "%H"
- "35" matches "%M"
- "49" matches "%S"
- "-0500" matches "%z"
If your datetime string follows a different format, modify the combination of format codes so that it matches your string. Once you have a datetime object, you can do all sorts of fun things. The most obvious is converting the date back into a string formatted exactly how you want. For example, let's say we don't want a string with all the details that '2018-05-21 04:35:49 -0500' has, and instead just want the month, day, and year, using forward slashes instead of hyphens.
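A minimal sketch of that round trip, using only the standard library's `datetime` module. The variable names are chosen for illustration.

```python
from datetime import datetime

raw = '2018-05-21 04:35:49 -0500'

# string -> datetime: the format codes must mirror the layout of the string
parsed = datetime.strptime(raw, '%Y-%m-%d %H:%M:%S %z')

# datetime -> string: pick only the parts we want, separated by slashes
formatted = parsed.strftime('%m/%d/%Y')
print(formatted)  # 05/21/2018
```

Because `%z` was part of the format, `parsed` is timezone-aware and carries the -05:00 offset with it.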
- Code used to solve this problem: code that imports and uses our library and at least 2 of the methods or functions.
- Output from running the code that uses 2 of the methods.
Question 2
As you may have noticed, the code contains fairly thorough docstrings. This is a good thing, and it is a good goal to aim for when writing your own Python functions, classes, modules, etc.
In the previous project, you got a small taste of using doctest to test your code using in-comment code. This is a great way to test parts of your code that are simple, straightforward, and don't involve extra data or fixtures.
Examine the code, and determine which functions and/or methods are good candidates for doctests. Modify the docstrings to include at least 3 doctests each, and run the following to test them out!
Include the following doctest in the calculate_speed function. This does not count as 1 of your 3 doctests for this function. It will fail for this question, and that is okay!
    >>> calculate_speed(5.0, .55, output_distance_unit = 'm')
    Traceback (most recent call last):
        ...
    ValueError: output_distance_unit must be 'mi' or 'km'
Make sure to include the expected output of each doctest below each line starting with >>>.
    %%bash
    python $HOME/apple_watch_parser/watch_data.py -v
If you need to read in data or type a lot in order to use a function or method, a doctest is probably not the right approach. Hint, hint: try the functions rather than the methods.
There are 2 functions that are good candidates for doctests.
Don’t forget to add the following code to the bottom of
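For `python some_module.py -v` to run any doctests, the module has to invoke the doctest runner when it is executed as a script. A common footer for this is sketched below. It is offered as an assumption about what is needed, not necessarily the exact snippet intended here, and `miles_to_km` is a hypothetical function used only to give the runner something to test.

```python
import doctest


def miles_to_km(miles: float) -> float:
    """Convert miles to kilometers (hypothetical example function).

    >>> miles_to_km(0.0)
    0.0
    >>> round(miles_to_km(1.0), 3)
    1.609
    """
    return miles * 1.609344


if __name__ == "__main__":
    # with -v on the command line, testmod reports every example it runs
    doctest.testmod()
```

Without `-v`, `doctest.testmod()` prints nothing when every example passes, which is why the verbose flag is handy while you are still writing tests.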
- Code used to solve this problem.
- Output from running the code.
Question 3
In question 2, we wrote a doctest for the calculate_speed function. Figure out why the doctest fails, and make modifications to the function so it passes the doctest. Do not modify the doctest.
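It may help to see how doctest checks an example that is expected to raise. The sketch below uses a hypothetical function, `safe_sqrt`, unrelated to the library. doctest ignores the traceback body (the indented `...` line) and compares only the exception type and message on the final line, so the message raised by the function must match the docstring exactly.

```python
import doctest
import math


def safe_sqrt(x: float) -> float:
    """Return the square root of a non-negative number (hypothetical example).

    >>> safe_sqrt(9.0)
    3.0
    >>> safe_sqrt(-1.0)
    Traceback (most recent call last):
        ...
    ValueError: x must be non-negative
    """
    if x < 0:
        # the type and message here are what doctest compares against
        raise ValueError("x must be non-negative")
    return math.sqrt(x)


# run just this function's examples; nothing is printed when they all pass
doctest.run_docstring_examples(safe_sqrt, globals())
```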
When you update the |
Remember we want you to change the |
The output of |
Look carefully at the |
To run the doctest:
    %%bash
    python $HOME/apple_watch_parser/watch_data.py -v
This is what doctests are for! They help you easily identify that something fundamental has changed and the code isn't ready for production. You can imagine a scenario where you automatically run all doctests before releasing a new product, and have that system notify you when a test fails. Very cool!
- Code used to solve this problem.
- Output from running the code.
Question 4
While doctests are good for simple testing, a package like pytest is better. For the standalone functions, write at least 2 tests each using pytest. Make sure these tests test different inputs than your doctests did. It's not hard to come up with lots of tests!
This could end up being just 2 functions that run a total of 4 tests. That is okay, as long as each function has at least 2 assert statements.
Start by adding a new file called test_watch_data.py to your $HOME/apple_watch_parser directory. Then, fill the file with your tests. When ready to test, run the following in a new cell.
    %%bash
    cd $HOME/apple_watch_parser
    python -m pytest
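For reference, pytest collects files named `test_*.py` and runs every function in them whose name starts with `test_`, and a plain `assert` is all a test needs. The sketch below shows the general shape of such a file. `pace_minutes_per_km` is a hypothetical function defined inline so the example is self-contained; in your own test file you would import the functions under test from the library module instead.

```python
# A hypothetical test module; in practice the function under test
# would be imported from the module being tested.
import math


def pace_minutes_per_km(distance_km: float, minutes: float) -> float:
    """Hypothetical function under test: minutes taken per kilometer."""
    if distance_km <= 0:
        raise ValueError("distance_km must be positive")
    return minutes / distance_km


def test_pace_typical_run():
    # 5 km in 25 minutes is 5 minutes per km
    assert pace_minutes_per_km(5.0, 25.0) == 5.0


def test_pace_fractional_distance():
    # compare floating point results with a tolerance
    assert math.isclose(pace_minutes_per_km(2.5, 10.0), 4.0)
```

Each test function checks one behavior and is named for it, so a failure report from pytest points directly at what broke.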
You may have noticed that we arbitrarily chose to place some functions outside of our Of course, there are exceptions to this rule, and it is possible to write static methods for a class, which operate independently of the class and its attributes. We chose to write the functions outside of the class, more for demonstration purposes than anything else. They are functions that would most likely not be useful in any other context, but sort of demonstrate the concept and allow us to have good functions to practice writing doctests and |
In the following project, we will continue to learn about pytest, including some more advanced features, like fixtures.
Relevant topics: pytest
- Code used to solve this problem.
- Output from running the code.
Question 5
Explore the data. There is a lot! Think of a function that could be useful for this module that would live outside of the WatchData class. Write the function. Include Google style docstrings, doctests (at least 2), and pytest tests (at least 2, different from your doctests). Re-run both your doctest tests and pytest tests.
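As a reminder of the layout, a Google style docstring groups its content under section headers such as `Args`, `Returns`, `Raises`, and `Examples`, and doctests can live in the `Examples` section. The function `average` below is a hypothetical illustration of the format only, not a suggestion for what to write.

```python
from typing import Sequence


def average(values: Sequence[float]) -> float:
    """Compute the arithmetic mean of a sequence of numbers.

    Args:
        values: The numbers to average. Must not be empty.

    Returns:
        The mean of ``values`` as a float.

    Raises:
        ValueError: If ``values`` is empty.

    Examples:
        >>> average([1.0, 2.0, 3.0])
        2.0
        >>> average([10.0])
        10.0
    """
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)
```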
You can simply add this function to your |
Your function doesn’t need to be useful for data outside the |
One way to peek around at the data (without having your notebook/kernel crash due to out of memory (OOM) errors) is something like the following:
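A minimal sketch of one such approach, assuming the file is plain text: read only the first few lines instead of loading everything. The helper name `peek` and the temporary demo file are made up for this example. Point `peek` at one of the dataset files to use it for real.

```python
import os
import tempfile
from itertools import islice


def peek(path: str, n: int = 10) -> list:
    """Return the first n lines of a text file without reading the rest."""
    with open(path, "r", encoding="utf-8") as f:
        # file objects iterate lazily, so islice stops after n lines
        return [line.rstrip("\n") for line in islice(f, n)]


# Demo on a throwaway file standing in for a large export
with tempfile.NamedTemporaryFile(
    "w", suffix=".xml", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("\n".join(f"<row id='{i}'/>" for i in range(1000)))

preview = peek(tmp.name, n=3)
os.remove(tmp.name)
print(preview)
```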
- Code used to solve this problem.
- Output from running the code.
Please double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.