STAT 29000: Project 4 — Spring 2021

Motivation: In this project we will continue to hone your web scraping skills, introduce you to some "gotchas", and give you a little bit of exposure to a powerful tool called cron.

Context: We are in the second to last project focused on web scraping. This project will introduce some supplementary tools that work well with web scraping: cron, sending emails from Python, etc.

Scope: python, web scraping, selenium, cron

Learning objectives
  • Review and summarize the differences between XML and HTML/CSV.

  • Use the requests package to scrape a web page.

  • Use the lxml package to filter and parse data from a scraped web page.

  • Use the beautifulsoup4 package to filter and parse data from a scraped web page.

  • Use selenium to interact with a browser in order to get a web page to a desired state for scraping.

Make sure to read about and use the template found here, and read the important information about project submissions here.

Questions

Question 1

Check out the following website: project4.tdm.wiki

Use selenium to scrape and print the 6 colors of pants offered.

You may have to interact with the webpage for certain elements to render.

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 2

Websites are updated frequently. You can imagine a scenario where a change in a website is a sign that there is more data available, or that something of note has happened. This is a fake website designed to help students emulate real changes to a website. Specifically, there is one part of the website that has two possible states (let’s say, state A and state B). Upon refreshing the website, or scraping the website again, there is an x% chance that the website will be in state A and a (100-x)% chance that the website will be in state B.

Describe the two states (that is, the element or set of elements that changes as you refresh the page), and scrape the website enough times to estimate x.

You will need to interact with the website to "see" the change.

Since we are just asking about a state, and not any specific element, you could use the page_source attribute of the selenium driver to scrape the entire page instead of trying to use xpath expressions to find a specific element.

Your estimate of x does not need to be perfect.
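
One way to think about the estimate is as a sample proportion: scrape the page many times and report the share of scrapes that landed in state A. The following is only a minimal sketch under stated assumptions. The function name, the two callables, and the "STATE-A"/"STATE-B" markers are hypothetical, and a seeded random generator stands in for the real scrape so the example runs without a browser.

```python
import random
from typing import Callable


def estimate_state_a_percent(
    fetch_page_source: Callable[[], str],
    is_state_a: Callable[[str], bool],
    n_samples: int = 200,
) -> float:
    """Return the percentage of sampled pages that were in state A."""
    hits = sum(is_state_a(fetch_page_source()) for _ in range(n_samples))
    return 100 * hits / n_samples


# Hypothetical stand-in for the real scrape. With selenium, this callable
# would reload the page, perform the needed interaction, and return
# driver.page_source.
rng = random.Random(0)


def fake_fetch() -> str:
    return "<p>STATE-A</p>" if rng.random() < 0.3 else "<p>STATE-B</p>"


estimate = estimate_state_a_percent(fake_fetch, lambda src: "STATE-A" in src, 1000)
print(f"Estimated x: {estimate:.1f}%")
```

More samples give a tighter estimate, at the cost of more requests to the website.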

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

  • What state A and B represent.

  • An estimate for x.

Question 3

Dig into the changing "thing" from question (2). What specifically is changing? Use selenium and xpath expressions to scrape and print the content. What are the two possible values for the content?

Due to the changes that occur when a button is clicked, I’d highly advise you to use the data-color attribute in your xpath expression instead of contains(text(), 'blahblah').

parent:: and following-sibling:: may be useful xpath axes to use.
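
To illustrate how an attribute predicate combines with the parent:: and following-sibling:: axes, here is a small sketch using the lxml package on a made-up HTML string. The markup, class names, and status text are all hypothetical. The real page is structured differently, so the expression will need to be adapted.

```python
from lxml import html

# Hypothetical markup; the real page's tags and attribute values may differ.
page = """
<html><body>
  <div class="option">
    <div class="swatch"><button data-color="blue">Blue</button></div>
    <p class="status">Status text for blue</p>
  </div>
  <div class="option">
    <div class="swatch"><button data-color="red">Red</button></div>
    <p class="status">Status text for red</p>
  </div>
</body></html>
"""

tree = html.fromstring(page)

# Select the button by its data-color attribute, step up to its parent div,
# then move across to the <p> sibling that follows that parent.
status = tree.xpath(
    "//button[@data-color='blue']/parent::div/following-sibling::p/text()"
)
print(status)
```

Because the predicate matches on an attribute rather than on the visible text, the expression keeps working even if the button's label changes after a click.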

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

Question 4

The following code allows you to send an email using Python from your Purdue email account. Replace the username and password with your own information and send a test email to yourself to ensure that it works.

Do NOT include your password in your homework submission. Any time you need to type your password in your final submission, just put something like "SUPERSECRETPASSWORD" or "MYPASSWORD".
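
One possible way to keep the password out of your code entirely is to read it at run time. The sketch below is an illustration, not a course requirement. The function name get_password and the environment variable name PURDUE_PASSWORD are hypothetical.

```python
import os
from getpass import getpass


def get_password(env_var: str = "PURDUE_PASSWORD") -> str:
    """Read the password from the environment, else prompt without echoing."""
    password = os.environ.get(env_var)
    if password is None:
        # Only reached in an interactive session; nothing is shown on screen.
        password = getpass("Purdue password: ")
    return password
```

The result could then be passed as the my_password argument, so the password itself never appears in the file you submit.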

To include an image (or screenshot) in RMarkdown, try ![](./my_image.png) where my_image.png is inside the same folder as your .Rmd file.

The spacing and tabs near the message variable are very important. Make sure to copy the code exactly. Otherwise, your subject may not end up in the subject of your email, or the email could end up being blank when sent.
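
To see why the spacing matters, consider how Python treats triple-quoted strings: a backslash right after the opening quotes suppresses the first newline, and every later line keeps whatever indentation it has in the source file. The short sketch below uses made-up variable names and values to compare the two layouts.

```python
my_subject = "demo"
my_message = "body text"

# Flush-left layout: the string starts directly with "Subject:".
flush_left = f'''\
Subject: {my_subject}

{my_message}'''

# Indented layout without the backslash: the string starts with a newline,
# and each line carries four leading spaces.
indented = f'''
    Subject: {my_subject}

    {my_message}'''

print(repr(flush_left))
print(repr(indented))
```

Comparing the two printed representations shows the extra newline and leading spaces that creep in when the text is indented to match the surrounding function body.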

Questions 4 and 5 were inspired by examples and borrowed from the code found at the Real Python website.

def send_purdue_email(my_purdue_email, my_password, to, my_subject, my_message):
  import smtplib, ssl
  from email.mime.text import MIMEText
  from email.mime.multipart import MIMEMultipart

  message = MIMEMultipart("alternative")
  message["Subject"] = my_subject
  message["From"] = my_purdue_email
  message["To"] = to

  # Create the plain-text and HTML version of your message
  text = f'''\
Subject: {my_subject}
To: {to}
From: {my_purdue_email}

{my_message}'''
  html = f'''\
<html>
  <body>
    {my_message}
  </body>
</html>
'''
  # Turn these into plain/html MIMEText objects
  part1 = MIMEText(text, "plain")
  part2 = MIMEText(html, "html")

  # Add HTML/plain-text parts to MIMEMultipart message
  # The email client will try to render the last part first
  message.attach(part1)
  message.attach(part2)

  context = ssl.create_default_context()
  with smtplib.SMTP("smtp.purdue.edu", 587) as server:
    server.ehlo()  # Can be omitted
    server.starttls(context=context)
    server.ehlo()  # Can be omitted
    server.login(my_purdue_email, my_password)
    server.sendmail(my_purdue_email, to, message.as_string())

# this sends an email from [email protected] to [email protected]
# replace supersecretpassword with your own password
# do NOT include your password in your homework submission.
send_purdue_email("[email protected]", "supersecretpassword", "[email protected]", "put subject here", "put message body here")
Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

  • Screenshot showing that you received the email.

Question 5

The following is the content of a new Python script called is_in_stock.py:

def send_purdue_email(my_purdue_email, my_password, to, my_subject, my_message):
  import smtplib, ssl
  from email.mime.text import MIMEText
  from email.mime.multipart import MIMEMultipart

  message = MIMEMultipart("alternative")
  message["Subject"] = my_subject
  message["From"] = my_purdue_email
  message["To"] = to

  # Create the plain-text and HTML version of your message
  text = f'''\
Subject: {my_subject}
To: {to}
From: {my_purdue_email}

{my_message}'''
  html = f'''\
<html>
  <body>
    {my_message}
  </body>
</html>
'''
  # Turn these into plain/html MIMEText objects
  part1 = MIMEText(text, "plain")
  part2 = MIMEText(html, "html")

  # Add HTML/plain-text parts to MIMEMultipart message
  # The email client will try to render the last part first
  message.attach(part1)
  message.attach(part2)

  context = ssl.create_default_context()
  with smtplib.SMTP("smtp.purdue.edu", 587) as server:
    server.ehlo()  # Can be omitted
    server.starttls(context=context)
    server.ehlo()  # Can be omitted
    server.login(my_purdue_email, my_password)
    server.sendmail(my_purdue_email, to, message.as_string())

def main():
    # scrape element from question 3

    # does the text indicate it is in stock?

    # if yes, send email to yourself telling you it is in stock.

    # otherwise, gracefully end script using the "pass" Python keyword
    pass  # placeholder so the unfilled template still parses

if __name__ == "__main__":
    main()

First, make a copy of the script in your $HOME directory:

cp /class/datamine/data/scraping/is_in_stock.py $HOME/is_in_stock.py

If you now look in the "Files" tab in the lower right hand corner of RStudio, and click the refresh button, you should see the file is_in_stock.py. You can open and modify this file directly in RStudio. Before you do so, however, change the permissions of the $HOME/is_in_stock.py script so only YOU can read, write, and execute it:

chmod 700 $HOME/is_in_stock.py

The script should now appear in RStudio, in your home directory, with the correct permissions. Open the script (in RStudio) and fill in the main function as indicated by the comments. We want the script to scrape the website and check whether the pants from question 3 are in stock.
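
The decision logic described in the comments of main could be sketched as follows. This is only an illustration under assumptions. The function check_and_notify, its parameters, and the "in stock" / "Out of stock" strings are hypothetical. The real text values are whatever you found in question (3). The scraping and email steps are passed in as callables so the logic can be tried without a browser or a mail server.

```python
from typing import Callable


def check_and_notify(
    scrape_status: Callable[[], str],
    send_email: Callable[[str, str], None],
    in_stock_text: str = "in stock",
) -> bool:
    """Send one email if the scraped status matches; return whether it was sent."""
    status = scrape_status().strip().lower()
    if status == in_stock_text:
        send_email("Pants are in stock", f"Scraped status: {status}")
        return True
    # Nothing to report, so end quietly.
    return False


# Try the logic with fake scrape results and a fake email sender.
sent = []
check_and_notify(lambda: " In Stock ", lambda subj, body: sent.append((subj, body)))
check_and_notify(lambda: "Out of stock", lambda subj, body: sent.append((subj, body)))
print(sent)
```

In is_in_stock.py, scrape_status would wrap your selenium code from question (3), and send_email would wrap a call to send_purdue_email.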

A cron job is a task that runs on a fixed schedule. Create a cron job that runs your script with the command /class/datamine/apps/python/f2020-s2021/env/bin/python $HOME/is_in_stock.py every 5 minutes. Wait 10-15 minutes and verify that it is working properly. The long path, /class/datamine/apps/python/f2020-s2021/env/bin/python, simply makes sure that the script runs with access to all of the packages in our course environment. $HOME/is_in_stock.py is the path to your script ($HOME expands to /home/<my_purdue_alias>).
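
A crontab entry consists of five schedule fields (minute, hour, day of month, month, day of week) followed by the command. One common way to express "every 5 minutes" is a step value in the minute field. The sketch below assembles such a line from the paths given above and expands the step syntax to show when it would fire. The helper expand_minute_field is hypothetical and handles only the two forms shown.

```python
from typing import List


def expand_minute_field(field: str) -> List[int]:
    """Expand a cron minute field of the form '*' or '*/N' into minutes."""
    if field == "*":
        return list(range(60))
    if field.startswith("*/"):
        step = int(field[2:])
        return list(range(0, 60, step))
    raise ValueError(f"unsupported field in this sketch: {field!r}")


PYTHON = "/class/datamine/apps/python/f2020-s2021/env/bin/python"
SCRIPT = "$HOME/is_in_stock.py"

# Schedule fields first, then the command to run.
cron_line = f"*/5 * * * * {PYTHON} {SCRIPT}"
print(cron_line)
print(expand_minute_field("*/5"))
```

The printed line is what would go into the file opened by crontab -e. The second line of output lists the minutes of each hour at which the job would run.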

If you struggle to use the text editor used with the crontab -e command, be sure to continue reading the cron section of the book. We highlight another method that may be easier.

Don’t forget to copy your import statements from question (3) as well.

Once you are finished with the project, if you no longer wish to receive emails every so often, follow the instructions here to remove the cron job.

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.

  • The content of your cron job in a bash code chunk.

  • The content of your is_in_stock.py script.