TDM 30100: Project 3 - Python Web Scraping and Images

Project Objectives

Data is everywhere around us, and it is one of our most crucial resources. Much of it is readily accessible; however, we need to know the available tools and how to use them to get the most out of any type of data.

NASA's Image of the Day is a page on NASA's website that features a different photograph of the universe each day, usually accompanied by descriptive text. We will use it to learn the fundamentals of web scraping and to see how images are represented and modified in Python.

Learning Objectives
  • Learn about basic Python web scraping: BeautifulSoup, requests

  • Work with images: OpenCV, skimage

Make sure to read about and use the template found here, along with the important information about project submissions here.

Dataset

We will use www.nasa.gov/image-of-the-day/ to practice working with websites, and in later questions we will work with the images themselves.

Please use 16 Cores and 32GB RAM for this project. It is the third option when you are starting the Datamine Server.

Questions

Question 1 (2 points)

Before working with the images on the website, we need to access the page content, parse its HTML, and extract the parts we need. Web scraping is the technique used to gather large amounts of data from websites, after which we can modify or analyze it.

BeautifulSoup is a Python library widely used in web scraping: it is built specifically for parsing HTML and XML documents, and it lets us pull specific information out of a webpage easily.

We will also use the requests module, which lets us send HTTP requests. An HTTP request is sent to a server to obtain information or to ask the server to perform a specific action.

Getting familiar with these tools is essential, since so much public data is accessible through web scraping; just as we go through the EDA process in other projects, we can similarly transform unstructured data and formats into usable resources.

Let’s try it out:

import requests
from bs4 import BeautifulSoup

If necessary:

pip install bs4 requests

We can get all the image URLs using the code below.

def getdata(url):
    req = requests.get(url)
    return req.text

data = getdata("https://www.nasa.gov/image-of-the-day/")
parse = BeautifulSoup(data, 'html.parser')

for item in parse.find_all('img'):
    print(item['src'])

requests.get(url) sends an HTTP GET request to the url. After asking the server for the page content, the response is stored in req. We then access the text part of the response and return the string containing the HTML content.

parse = BeautifulSoup(data, 'html.parser') initializes a BeautifulSoup object assigned to parse by passing in the HTML string. html.parser is a built-in Python HTML parser. We no longer have to work with raw strings, since we now have a structured HTML document.

find_all() finds all occurrences of elements with the given tag name. Here we search the entire HTML document and obtain the list of <img> tags (the images on the website). The image URL/file path is specified through the src attribute; for an HTML <img> tag it is required both so we can locate the image and so it actually appears on the page. For every <img> found we access its src.
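To see find_all() in isolation, here is a minimal offline sketch using a small hypothetical HTML snippet (no network request needed):

```python
# Parse a tiny hand-written HTML string and collect every <img> src,
# exactly as we do with the real NASA page.
from bs4 import BeautifulSoup

html = """
<html><body>
  <img src="https://example.com/a.jpg" alt="first image">
  <img src="/relative/b.png">
  <p>No image here</p>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
srcs = [img['src'] for img in soup.find_all('img')]
print(srcs)  # ['https://example.com/a.jpg', '/relative/b.png']
```

Note that find_all('img') skips the <p> tag entirely: only elements with the requested tag name are returned.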

Deliverables
  • 1a. Run the code above and make sure to write documentation.

  • 1b. Write a few sentences explaining why we are using beautifulsoup and requests.

  • 1c. Show the output of the code and check that you can see the images.

Question 2 (2 points)

from urllib.parse import urljoin
import pandas as pd

urllib is a package that lets us fetch URLs and work with them (parsing, accessing and reading, and handling errors). urllib.parse is specifically for parsing URLs, meaning we can take sections of a URL string and combine them to produce an absolute URL.
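A quick sketch of how urljoin behaves: a relative path is resolved against the base URL, while an already-absolute URL passes through unchanged (the example paths are illustrative):

```python
from urllib.parse import urljoin

base = 'https://www.nasa.gov/image-of-the-day/'

# Relative path: resolved against the site root
print(urljoin(base, '/assets/photo.jpg'))
# https://www.nasa.gov/assets/photo.jpg

# Absolute URL: returned as-is
print(urljoin(base, 'https://assets.science.nasa.gov/x.jpg'))
# https://assets.science.nasa.gov/x.jpg
```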

def get_image(parse):
    site = 'https://www.nasa.gov/image-of-the-day/'
    image = []

    for img in parse.find_all('img'):
        src = img.get('src', ' ')

        if not src.endswith('[email protected]'):
            full_url = urljoin(site, src)
            alt_text = img.get('alt', ' ')
            image.append({'url': full_url,
                          'alt_text': alt_text,
                          'filename': src.split('/')[-1].split('?')[0]})
    return pd.DataFrame(image)

The site variable holds the base URL, which we later use to create absolute URLs for images that had relative paths. After getting all the <img> elements, we access the src of each one.

full_url = urljoin(site, src) combines the site URL with the relative image path in src. alt_text = img.get('alt', ' ') extracts the alternative-text attribute of each image; it is another HTML <img> attribute that describes the image's content/context.

We can then append each image's data (full URL, description as a blank string if there is none, and file name) to the list, and return it converted to a pandas dataframe.

How 'filename': src.split('/')[-1].split('?')[0] works: this just extracts the file name. Take https://assets.science.nasa.gov/dynamicimage/assets/science/missions/hubble/nebulae/emission/Hubble_30Dor_potw2531a.jpg?w=1024 as an example.

  • First, split src on '/'. We end up with:

'https:', '', 'assets.science.nasa.gov', 'dynamicimage', 'assets', 'science', 'missions', 'hubble', 'nebulae', 'emission', 'Hubble_30Dor_potw2531a.jpg?w=1024'

  • Then take the last element with [-1]. This gives us:

'Hubble_30Dor_potw2531a.jpg?w=1024'

  • When we split 'Hubble_30Dor_potw2531a.jpg?w=1024' with '?' as the delimiter, we get:

'Hubble_30Dor_potw2531a.jpg', 'w=1024'

  • Finally, take the first element with [0] (Python numbers things from 0) and we are left with:

'Hubble_30Dor_potw2531a.jpg'
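The walkthrough above, run as code:

```python
# Split on '/' to drop the path, then on '?' to drop the query string,
# leaving just the file name.
src = ('https://assets.science.nasa.gov/dynamicimage/assets/science/missions/'
       'hubble/nebulae/emission/Hubble_30Dor_potw2531a.jpg?w=1024')

filename = src.split('/')[-1].split('?')[0]
print(filename)  # Hubble_30Dor_potw2531a.jpg
```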
Deliverables
  • 2a. Run the code above and make sure to document the code.

  • 2b. Write a few sentences about the role of urllib.parse here

  • 2c. Output head of the dataframe and the number of images.

  • 2d. Output the shape, unique values, and duplicate filenames of the dataframe

  • 2e. Create a new dataframe that drops the duplicates, only keeping the first instance. Output the new shape and check there are no duplicates.

Question 3 (2 points)

(The solution for this question may throw an image-size warning in pink, but that is OK.)

We will actually see all the photos in this question.

matplotlib expects image data to come from local files or arrays, not just a URL string, so we can't load our images directly from their URLs. We need to convert the image bytes into a format matplotlib understands.

The imports needed for this question are below.

from PIL import Image as PILImage
from io import BytesIO
import matplotlib.pyplot as plt

We can write a for loop iterating over the dataframe's rows (index, row data):

for i, (_, row) in enumerate(df.iterrows()):
    img_data = requests.get(row['url']).content
    img = PILImage.open(BytesIO(img_data))
    axes[i].imshow(img)

With requests.get(), we first get the returned response object containing the needed information and data. PIL can work with many different file types and allows us to access and manipulate image files. This code creates a Pillow Image object from the loaded image data (a byte string). We can then display the image as usual: axes[i].imshow(img) places the image on the subplot at the current index in matplotlib.
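A self-contained sketch of the display step as a whole, including the axes setup the fragment above assumes. To keep it runnable without a network connection, it takes raw image bytes directly (in practice these would come from requests.get(row['url']).content); the grid layout and the helper name show_images are illustrative assumptions, not the required solution:

```python
# Lay out downloaded image bytes on a grid of matplotlib subplots.
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt
from PIL import Image as PILImage
from io import BytesIO

def show_images(items, cols=3):
    """items: list of (image_bytes, title) pairs."""
    rows = (len(items) + cols - 1) // cols  # enough rows for all images
    fig, axes = plt.subplots(rows, cols, figsize=(4 * cols, 4 * rows))
    axes = axes.flatten()
    for i, (img_bytes, title) in enumerate(items):
        img = PILImage.open(BytesIO(img_bytes))  # bytes -> Pillow Image
        axes[i].imshow(img)
        axes[i].set_title(title)
    for ax in axes[len(items):]:  # hide any unused subplots
        ax.axis('off')
    return fig

# Demo with a tiny in-memory image standing in for a downloaded one
buf = BytesIO()
PILImage.new('RGB', (10, 10), 'red').save(buf, format='PNG')
fig = show_images([(buf.getvalue(), 'red square')])
print(len(fig.axes))  # 3 (one used, two hidden)
```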

We can also add titles for each image:

if row['alt_text'].strip():
    title = row['alt_text']
else:
    title = row['filename']

if len(title) > 30:
    shorten_title = title[:30] + '...'
else:
    shorten_title = title

axes[i].set_title(shorten_title)

Here we chose to use the alt attribute (the textual description of the image) as the title, falling back to the file name, and we truncate titles that are too long for display purposes.

Deliverables
  • 3a. Write a function that outputs all images in the dataframe.

  • 3b. Write a few sentences explaining the method used to display the images. Also explain what the resulting dataframe represents/contains.

Question 4 (2 points)

cv2 (OpenCV) is a very helpful Python library for working with images.

Imports:

import cv2
import numpy as np

Pick a photo you want to work with.

image_url = new_df.iloc[30]['url']

Now we get the image. One approach is to use response = requests.get(image_url) and np.asarray() to convert the bytes from response.content into a bytearray and then into a numpy array. Note that imdecode() reads images from a buffer in memory and expects a 1-D uint8 array as input.

cv2.cvtColor() allows us to convert the color space of an image. It takes the image we want to change as a parameter and returns the new image. gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) Here we used cv2.COLOR_BGR2GRAY, which converts a BGR image to a grayscale image. imdecode() uses BGR channel ordering by default (and, for reference, so does imread(), another very commonly used method that loads an image from a specified file).

Deliverables
  • 4a. Code and output of the original image (feel free to pick) and after converting into grayscale.

  • 4b. Output height, width, channel, and shape of the image.

  • 4c. What do the dimensions represent? Also why do you think grayscale images can be useful? In what applications and what about the changed structure of the image?

Question 5 (2 points)

In this question we will split the image into RGB using two methods.

cv2: B, G, R = cv2.split(img) splits the original BGR image into its blue, green, and red channels (note the order: OpenCV stores channels as blue, green, red). The BGR image is a numpy array of shape (height, width, channels): height is the number of rows of pixels, width is the number of columns of pixels, and the channel index is 0, 1, or 2 for blue, green, and red respectively.

skimage: This is another Python library for image processing, designed to integrate well with Python's other scientific computing libraries such as numpy and scipy.

Note that in skimage, io.imread() can load images directly from URLs. It returns a numpy array, with RGB as the default channel ordering.

from skimage import io

img = new_df.iloc[30]['url']
img2 = io.imread(img)
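Because io.imread() returns a plain numpy array in RGB order, the channels can be pulled out with simple slicing; here a small synthetic RGB array stands in for the downloaded image so the sketch runs offline:

```python
import numpy as np

# 2x2 RGB image where only the red channel is set
rgb = np.zeros((2, 2, 3), dtype=np.uint8)
rgb[:, :, 0] = 200  # channel 0 is red in RGB ordering

r, g, b = rgb[:, :, 0], rgb[:, :, 1], rgb[:, :, 2]
print(r[0, 0], g[0, 0], b[0, 0])  # 200 0 0
```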
Deliverables
  • 5a. Code and output of an image after splitting into the three RGB colour channels using cv2

  • 5b. Code and output of an image after splitting into the three RGB colour channels using skimage

Submitting your Work

Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.

Items to submit
  • firstname_lastname_project3.ipynb

It is necessary to document your work, with comments about each solution. All of your work needs to be your own, with citations to any source that you used. Please make sure that any outside sources (people, internet pages, generative AI, etc.) are cited properly in the project template.

You must double check your .ipynb after submitting it to Gradescope. A very common mistake is to assume that your .ipynb file has rendered properly and contains your code, markdown, and code output when it may not.

Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.