TDM 40100: Project 12 — 2023

Motivation: In general, scraping data from websites has always been a popular topic in The Data Mine. In addition, it was one of the requested topics. We will continue use books.toscrape.com to practice scraping skills, visualize scrapped data, use sqlite3 to save scrapped data to database

Context: This is a third project focusing on web scraping combined with sqlite3

Scope: Python, web scraping, selenium, BeautifulSoup, sqlite3

Learning Objectives
  • Visualize scraped data.

  • Create tables for scraped data

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Questions

Question 1 (2 pts)

In the previous project, you have been able to scrape data from books.toscrape.com

Now let’s visualize your scrapped data

  1. Please visualize the books' price of music category with a bar plot. Split the prices into three price ranges: below 20, 20-30, above 30

You may need to change the price to float, like

prices = [float(book['price'].replace('£','')) for book in books]

books is the book list from the previous project’s function "get_all_books", like this:

books = get_all_books("Music_14")

You may use sum to group prices, like

price_less_20 = sum(1 for price in prices if price<20)
price_20_30 = sum(1 for price in prices if 30<=price<50)
...

You may use a bar chart, like

price_counts = [price_less_20, price_20_30,price_above_30]
labels = ["1","2","3"]
plt.bar(labels,price_counts,color=['purple','orange','green'])
# More plt settings and display statements

Question 2 (2 pts)

  1. Write CREATE TABLE statements to create 2 tables, namely, a categories table and a books table.

Check on the website for category information. The categories table may contain following fields - 'id' a unique identifier for each category, auto increment - 'category' like 'poetry_23'

Check on the website for book information. The "books" table may contain following fields - 'id' a unique identifier for each category, auto increment - 'title' like 'A light in the Attic" - 'category' like 'poetry_23' - 'price' like 51.77 - 'availability' like 'in stock(22 available)'

Use sqlite3 to create the tables in a database called $HOME/onlinebooks.db. You can do all of this from within Jupyter Lab.

%sql sqlite:///$HOME/onlinebooks.db
%%sql

CREATE TABLE ...

Run the following queries to confirm and show your table schemas.

PRAGMA table_info(categories);
PRAGMA table_info(books);

Question 3 (2 pts)

  1. Update the function "get_category" from project 11. After you get the information about categories from the website, populate the "categories" table with that data.

  2. Run a couple of queries that demonstrate that the data was successfully inserted into the database.

Here is partial code to assist.

import sqlite3
# connect to database
conn = sqlite3.connect('onlinebooks.db')
cur = conn.cursor()
for category in categories:
    cur.execute('INSERT INTO CATEGORIES (CATEGORY) VALUES (?)',(category,))
conn.commit()
conn.close()

Question 4 (2 pts)

  1. Update the function "get_all_books" from project 11. After you get the information about books from from website, populate the "books" table with that data. You may need to scrap new data for a new field of "category" that the book belongs to.

  2. Run a couple of queries that demonstrate that the data was successfully inserted into the database.

In project 11, we used an associate array to hold the book_info like this:

book_info = {
    ....book = {
    "title":title,
    "price":price,
    "category":category_name,
    "availability":availability
}
}

We may need to use a different data structure like tuple to hold book information since we need to insert it to the books table, like this:

book_info =(title,price,category_name,availability)

Here is partial code to assist.

import sqlite3

...
# code to get book information
book_info =(title,price,category_name,availability)
# connect to database
conn = sqlite3.connect('onlinebooks.db')
cur = conn.cursor()
for article in articles:
    cur.execute('INSERT INTO BOOKS (title,category,price,availability) VALUES (?,?,?,?)',book_info)
conn.commit()
conn.close()

Project 12 Assignment Checklist

  • Jupyter Lab notebook with your code, comments and output for the assignment

    • firstname-lastname-project12.ipynb

  • Submit files through Gradescope

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.