TDM 30100: Project 4 — 2022

Motivation: Documentation is one of the most critical parts of a project. Many tools are specifically designed to help document a project, and each has its own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can’t go wrong with tools like Sphinx or pdoc.

Context: This is the third project in a 3-project series where we explore thoroughly documenting Python code, while solving data-driven problems.

Scope: Python, documentation

Learning Objectives
  • Use Sphinx to document a set of Python code.

  • Use pdoc to document a set of Python code.

Make sure to read about and use the template found here, and the important information about project submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/goodreads/goodreads_book_authors.json

  • /anvil/projects/tdm/data/goodreads/goodreads_book_series.json

  • /anvil/projects/tdm/data/goodreads/goodreads_books.json

  • /anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json

Questions

Question 1

The listed datasets are fairly large, and interesting! They are JSON-formatted data, with one JSON object per line, so each row of a file can be read in and processed individually. Take a look at a single row.

%%bash

head -n 1 /anvil/projects/tdm/data/goodreads/goodreads_books.json

This is nice, because you can process a single row at a time. Any time you can do something like this, it is easy to break a problem into smaller pieces and speed up processing. The following demonstrates how to read in and process a single line.

import json

with open("/anvil/projects/tdm/data/goodreads/goodreads_books.json") as f:
    for line in f:
        print(line)
        parsed = json.loads(line)
        print(f"{parsed['isbn']=}")
        print(f"{parsed['num_pages']=}")
        break

In this project, the overall goal is to implement functions that perform certain operations, write the best docstrings you can, and use your choice of pdoc or Sphinx to generate a pretty set of documentation.

Begin this project by choosing a tool, pdoc or Sphinx, and setting up a firstname-lastname-project04.py module that will host your Python functions. In addition, create a Jupyter Notebook that will be used to test out your functions and generate your documentation. At the end of this project, your deliverable will be your .ipynb notebook and either a series of screenshots that capture your documentation, or a PDF created by exporting the resulting webpage of documentation.
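For example, if you choose pdoc, generating HTML documentation from your module can be as simple as the following sketch. The exact invocation is an assumption here: newer versions of pdoc accept an output directory with -o, while the older pdoc3 package uses pdoc --html --output-dir instead.

%%bash

# render the module's docstrings to HTML (newer pdoc; pdoc3 syntax differs)
python -m pdoc firstname-lastname-project04.py -o docs/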

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Write a function called scrape_image_from_url that accepts a URL (as a string) and returns a bytes object of the data. Make sure scrape_image_from_url cleans up after itself and doesn’t leave any image files on the filesystem. The following steps outline one approach; a sketch assembling them appears after the list.

  1. Create a variable with a temporary file name using the uuid package.

  2. Use the requests package to get the response.

    import requests
    
    response = requests.get(url, stream=True)
    
    # then the first argument to copyfileobj will be response.raw
  3. Open the file (in binary write mode) and use the shutil package’s copyfileobj function to copy response.raw to the file.

  4. Open the file and read the contents into a bytes object.

    You can verify that you have a bytes object by:

    type(my_object)
    output
    bytes
  5. Use os.remove to remove the image file.

  6. Return the bytes object.
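Putting the steps together, a minimal sketch might look like the following. The docstring is left as a one-line stub, since writing a thorough docstring is part of the assignment.

import os
import shutil
import uuid

import requests

def scrape_image_from_url(url: str) -> bytes:
    """Scrape an image from a URL and return the image data as a bytes object."""
    # step 1: a unique temporary file name, so repeated runs never collide
    temp_filename = str(uuid.uuid4())

    # step 2: stream the response so the raw bytes can be copied straight to disk
    response = requests.get(url, stream=True)

    # step 3: copy response.raw into the temporary file
    with open(temp_filename, "wb") as outfile:
        shutil.copyfileobj(response.raw, outfile)

    # step 4: read the contents back in as a bytes object
    with open(temp_filename, "rb") as infile:
        image_bytes = infile.read()

    # step 5: clean up after ourselves
    os.remove(temp_filename)

    # step 6: return the bytes object
    return image_bytes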

You can verify your function works by running the following:

import shutil
import requests
import os
import uuid
import hashlib

url = 'https://images.gr-assets.com/books/1310220028m/5333265.jpg'
my_bytes = scrape_image_from_url(url)
m = hashlib.sha256()
m.update(my_bytes)
m.hexdigest()
output
ca2d4506088796d401f0ba0a72dda441bf63ca6cc1370d0d2d1d2ab949b00d02
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Write a function called json_to_sql that accepts a single row of the goodreads_books.json file (as a string), a table name (as a string), as well as a set of keys to "skip". This function should then return a string that is a valid INSERT INTO SQL statement. See here for an example of an INSERT INTO statement.

The following is a real example you can test out.

with open("/anvil/projects/tdm/data/goodreads/goodreads_books.json") as f:
    for line in f:
        first_line = str(line)
        break

first_line
json_to_sql(first_line, 'books', {'series', 'popular_shelves', 'authors', 'similar_books'})
output
"INSERT INTO books (isbn,text_reviews_count,country_code,language_code,asin,is_ebook,average_rating,kindle_asin,description,format,link,publisher,num_pages,publication_day,isbn13,publication_month,edition_information,publication_year,url,image_url,book_id,ratings_count,work_id,title,title_without_series) VALUES ('0312853122','1','US','','','false','4.00','','','Paperback','https://www.goodreads.com/book/show/5333265-w-c-fields','St. Martin's Press','256','1','9780312853129','9','','1984','https://www.goodreads.com/book/show/5333265-w-c-fields','https://images.gr-assets.com/books/1310220028m/5333265.jpg','5333265','3','5400751','W.C. Fields: A Life on Film','W.C. Fields: A Life on Film');"

Here is some (maybe) helpful logic, assembled into a sketch after the list:

  1. Use json.loads to convert the JSON string to a dict.

  2. Remove all key:value pairs from the dict where the key is in the skip set.

  3. Form a string of comma separated keys.

  4. Form a string of comma separated, single-quoted values.

  5. Assemble the INSERT INTO statement.
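Following that logic, a minimal sketch might look like the following. Note that, just like the expected output above, this sketch does not escape single quotes that appear inside values (for example, St. Martin's Press), so the resulting statements would need further cleaning before being run against a real database.

import json

def json_to_sql(row: str, table_name: str, skip: set) -> str:
    """Convert a single JSON-formatted row to an INSERT INTO SQL statement."""
    # step 1: convert the json string to a dict
    parsed = json.loads(row)

    # step 2: remove all key:value pairs where the key is in the skip set
    filtered = {key: value for key, value in parsed.items() if key not in skip}

    # steps 3 and 4: comma-separated keys, and comma-separated single-quoted values
    keys = ",".join(filtered.keys())
    values = ",".join(f"'{value}'" for value in filtered.values())

    # step 5: assemble the INSERT INTO statement
    return f"INSERT INTO {table_name} ({keys}) VALUES ({values});"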

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Create a new function that does something interesting with one or more of these datasets. Just like the previous functions, make sure to include a detailed and clear docstring.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Generate your final documentation, and assemble and submit your deliverables:

  • .ipynb file testing out your functions.

  • firstname-lastname-project04.py module that includes all of your functions and associated docstrings.

  • Screenshots and/or a PDF exported from your resulting documentation web page. Basically, something that shows us your resulting documentation.
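If you chose Sphinx, the typical workflow involves a few more steps. The following is only a sketch under assumed paths and defaults; sphinx-quickstart asks interactive questions, and you will likely need to enable the autodoc extension in the generated conf.py before the stubs render your docstrings.

%%bash

# one-time scaffolding of a Sphinx project in docs/
sphinx-quickstart docs

# generate .rst stubs from the modules in the current directory
sphinx-apidoc -o docs .

# build the HTML documentation into docs/_build
sphinx-build -b html docs docs/_build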

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended that you download your submission after submitting to make sure that what you think you submitted was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.