TDM 30100: Project 5 — 2023

Motivation: Documentation is one of the most critical parts of a project. Many tools are specifically designed to help document a project, and each has its own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can’t go wrong with tools like Sphinx or pdoc.

Context: This is the third project in a 3-project series where we explore thoroughly documenting Python code, while solving data-driven problems.

Scope: Python, documentation

Learning Objectives
  • Use Sphinx to document a set of Python code.

  • Use pdoc to document a set of Python code.

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_book_authors.json

  • /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_book_series.json

  • /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_books.json

  • /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_reviews_dedup.json

Questions

The listed datasets are fairly large, and interesting! They are JSON formatted, with one record per line, so each row of a single JSON file can be read in and processed individually. Take a look at a single row.

%%bash

head -n 1 /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_books.json

This is nice because you can process each row individually. Anytime you can do something like this, it is easy to break a problem into smaller pieces and speed up processing. The following demonstrates how to read in and process a single line.

import json

with open("/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_books.json") as f:
    for line in f:
        print(line)
        parsed = json.loads(line)
        print(f"{parsed['isbn']=}")
        print(f"{parsed['num_pages']=}")
        break

In this project, the overall goal will be to implement functions that perform certain operations, write the best docstrings you can, and use your choice of pdoc or Sphinx to generate a pretty set of documentation.

Begin this project by choosing a tool, pdoc or Sphinx, and setting up a firstname-lastname-project05.py module that will host your Python functions. In addition, create a Jupyter Notebook that will be used to test out your functions and generate your documentation. At the end of this project, your deliverable will be your .ipynb notebook and either a series of screenshots that capture your documentation, or a PDF created by exporting the resulting webpage of documentation.
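Since the docstrings are the heart of this project, here is a minimal, hypothetical example of the kind of module and function docstring that pdoc or Sphinx can render (NumPy style is shown; Google style or plain reStructuredText also works, and Sphinx typically needs the sphinx.ext.napoleon extension for NumPy/Google style). The function count_lines below is only a placeholder to illustrate the format, not part of the assignment.

"""firstname-lastname-project05: functions for working with the Goodreads datasets.

Each function in this module carries a docstring that pdoc or Sphinx can
render into HTML documentation.
"""


def count_lines(filename: str) -> int:
    """Count the number of records in a file with one JSON object per line.

    Parameters
    ----------
    filename : str
        Path to a file containing one JSON object per line.

    Returns
    -------
    int
        The number of lines (records) in the file.
    """
    with open(filename) as f:
        return sum(1 for _ in f)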

Question 1 (1 pt)

  1. In your Jupyter Notebook, run the example code above and explore the output to understand the selected data.

Question 2 (2 pts)

  1. Write a function called scrape_image_from_url that accepts a URL (as a string), and returns a bytes object of the data.

  2. Write a function called display_image_from_bytes that displays the image directly, without saving the image to disk.

Make sure scrape_image_from_url cleans up after itself and doesn’t leave any image files on the filesystem. One possible implementation of each function is sketched after its step list below.

scrape_image_from_url

  1. Create a variable with a temporary file name using the uuid package.

  2. Use the requests package to get the response.

    import requests
    
    response = requests.get(url, stream=True)
    
    # then the first argument to copyfileobj will be response.raw
  3. Open the file (in binary write mode) and use the shutil package’s copyfileobj function to copy response.raw to the file.

  4. Open the file and read the contents into a bytes object.

    You can verify that you have a bytes object by running:

    type(my_object)

    which should output:

    bytes
  5. Use os.remove to remove the image file.

  6. Return the bytes object.
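
Putting the steps above together, a minimal sketch of scrape_image_from_url might look like the following (the .jpg suffix on the temporary file name and the lack of error handling are assumptions; adapt as you see fit).

import os
import shutil
import uuid

import requests


def scrape_image_from_url(url: str) -> bytes:
    """Download the image at the given URL and return its contents as a bytes object.

    A temporary file is used during the download and removed before
    returning, so no image files are left on the filesystem.
    """
    # 1. temporary file name using the uuid package
    temp_filename = f"{uuid.uuid4()}.jpg"

    # 2. get the response, streaming the body
    response = requests.get(url, stream=True)

    # 3. copy response.raw into the temporary file
    with open(temp_filename, "wb") as outfile:
        shutil.copyfileobj(response.raw, outfile)

    # 4. read the contents back into a bytes object
    with open(temp_filename, "rb") as infile:
        contents = infile.read()

    # 5. remove the temporary image file
    os.remove(temp_filename)

    # 6. return the bytes object
    return contents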

display_image_from_bytes

  1. Convert the byte data into a readable image format using an image processing library.

  2. Open and display this readable image.

from PIL import Image
from io import BytesIO
from IPython.display import display
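
With those imports, one possible sketch of display_image_from_bytes is shown below; wrapping the bytes in a BytesIO buffer and opening it with PIL is just one reasonable approach.

from io import BytesIO

from IPython.display import display
from PIL import Image


def display_image_from_bytes(image_bytes: bytes) -> None:
    """Display an image in the notebook directly from a bytes object.

    Nothing is written to disk; the bytes are wrapped in an in-memory
    buffer, opened with PIL, and displayed inline.
    """
    # 1. wrap the raw bytes in a file-like, in-memory buffer and open it as an image
    image = Image.open(BytesIO(image_bytes))

    # 2. display the image inline in the Jupyter notebook
    display(image)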

You can verify your function works by running the following:

import shutil
import requests
import os
import uuid
import hashlib

url = 'https://images.gr-assets.com/books/1310220028m/5333265.jpg'
my_bytes = scrape_image_from_url(url)
m = hashlib.sha256()
m.update(my_bytes)
m.hexdigest()
display_image_from_bytes(my_bytes)
Output:

ca2d4506088796d401f0ba0a72dda441bf63ca6cc1370d0d2d1d2ab949b00d02

(the downloaded image is displayed)

Question 3 (2 pts)

  1. Write a Python function called top_reviewers that reads the file Goodreads_reviews_parsed.json and returns the IDs of the top 5 users who have provided the most reviews.

The following shows how to test the function

filename = "Goodreads_reviews_parsed.json"
print(top_reviewers(filename))
  1. When you run this code with the provided sample JSON file, the IDs of the top 5 users with the most reviews will be printed.

  2. If there are ties in the number of reviews, your function should return the users that appear first in the file (one possible approach is sketched below).
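
Here is a minimal sketch of one way top_reviewers could be written, assuming the file contains one JSON object per line with a user_id field (as in the Goodreads review data); adjust the field name if your parsed file differs.

import json
from collections import Counter


def top_reviewers(filename: str) -> list:
    """Return the IDs of the 5 users with the most reviews in the given file.

    The file is assumed to contain one JSON object per line, each with a
    'user_id' field. For equal counts, Counter.most_common preserves the
    order in which users were first encountered, so ties go to the users
    that appear first in the file.
    """
    counts = Counter()
    with open(filename) as f:
        for line in f:
            review = json.loads(line)
            counts[review["user_id"]] += 1

    # keep only the user IDs of the top 5
    return [user_id for user_id, _ in counts.most_common(5)]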

Question 4 (2 pts)

  1. Create a new function that does something interesting with one or more of these datasets. Just like the previous functions, make sure to include a detailed and clear docstring.

Question 5 (1 pt)

  1. Generate your final documentation, and assemble and submit your deliverables:

    • Screenshots and/or a PDF exported from your resulting documentation web page

Project 05 Assignment Checklist

  • Jupyter .ipynb file with your code, comments, and output for the assignment

    • firstname-lastname-project05.ipynb.

  • Screenshots and/or a PDF exported from your resulting documentation web page, showing your output

  • An HTML file (.htm or .html) with your newest set of documentation.

  • Submit files through Gradescope

Please double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended that you download your submission after submitting it, to make sure that what you think you submitted was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.