STAT 29000: Project 4 — Spring 2021
Motivation: In this project we will continue to hone your web scraping skills, introduce you to some "gotchas", and give you a little bit of exposure to a powerful tool called cron.
Context: We are in the second to last project focused on web scraping. This project will introduce some supplementary tools that work well with web scraping: cron, sending emails from Python, etc.
Scope: python, web scraping, selenium, cron
Questions
Question 1
Check out the following website: project4.tdm.wiki
Use selenium to scrape and print the 6 colors of pants offered.
You may have to interact with the webpage for certain elements to render.
- Python code used to solve the problem.
- Output from running your code.
Question 2
Websites are updated frequently. You can imagine a scenario where a change in a website is a sign that there is more data available, or that something of note has happened. This is a fake website designed to help students emulate real changes to a website. Specifically, there is one part of the website that has two possible states (let’s say, state A and state B). Upon refreshing the website, or scraping the website again, there is an x% chance that the website will be in state A and a (100-x)% chance that the website will be in state B.
Describe the two states (the thing (element or set of elements) that changes as you refresh the page), and scrape the website enough to estimate x.
You will need to interact with the website to "see" the change.
Since we are just asking about a state, and not any specific element, you could use the |
Your estimate of x does not need to be perfect.
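The estimation step itself can be tried out without touching the website. The following is a minimal sketch, not a solution: `observe_state` is a hypothetical stand-in for whatever scraping code you write to decide whether the page is currently in state A, and the simulated observer (with its arbitrary 30% rate) is invented purely so the example runs offline.

```python
import math
import random


def estimate_percent(observe_state, n_trials):
    """Estimate x, the percentage of visits that land in state A.

    observe_state is a hypothetical callable that returns True when
    state A is seen. Returns (estimate, standard_error), both expressed
    in percentage points.
    """
    hits = sum(1 for _ in range(n_trials) if observe_state())
    p_hat = hits / n_trials
    std_err = math.sqrt(p_hat * (1 - p_hat) / n_trials)
    return 100 * p_hat, 100 * std_err


# Simulated observer: an assumption for illustration only, NOT the real site.
rng = random.Random(0)
fake_observer = lambda: rng.random() < 0.30

x_hat, se = estimate_percent(fake_observer, 200)
print(f"estimated x = {x_hat:.1f}% (standard error about {se:.1f} points)")
```

The standard error shrinks roughly with the square root of the number of trials, so a few hundred page loads is usually enough for an estimate that "does not need to be perfect".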
- Python code used to solve the problem.
- Output from running your code.
- What state A and B represent.
- An estimate for x.
Question 3
Dig into the changing "thing" from question (2). What specifically is changing? Use selenium and xpath expressions to scrape and print the content. What are the two possible values for the content?
Due to the changes that occur when a button is clicked, I’d highly advise you to use the |
- Python code used to solve the problem.
- Output from running your code.
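For readers new to xpath, here is a minimal offline sketch of how an xpath expression picks out one element by its attributes. The markup is hypothetical and invented for illustration; the real page's tags, ids, and class names will differ and must be found with your browser's inspector. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of xpath, rather than selenium.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for illustration only.
SNIPPET = """
<div id="product">
  <h2 class="title">Example pants</h2>
  <div class="details">
    <span class="label">Availability</span>
    <span class="value">some text that may change</span>
  </div>
</div>
""".strip()

root = ET.fromstring(SNIPPET)

# ".//" searches all descendants of the current node;
# "[@class='value']" keeps only elements whose class attribute matches.
value = root.find(".//div[@class='details']/span[@class='value']")
print(value.text)
```

With selenium, an equivalent expression (starting with `//` rather than `.//`) would be passed to `driver.find_element(By.XPATH, ...)`, and the element's `.text` attribute holds the rendered content.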
Question 4
The following code allows you to send an email using Python from your Purdue email account. Replace the username and password with your own information and send a test email to yourself to ensure that it works.
Do NOT include your password in your homework submission. Any time you need to type your password in your final submission, just put something like "SUPERSECRETPASSWORD" or "MYPASSWORD".
To include an image (or screenshot) in RMarkdown, try |
The spacing and tabs near the |
Questions 4 and 5 were inspired by examples from, and borrow code found at, the Real Python website.
def send_purdue_email(my_purdue_email, my_password, to, my_subject, my_message):
    import smtplib, ssl
    from email.mime.text import MIMEText
    from email.mime.multipart import MIMEMultipart

    message = MIMEMultipart("alternative")
    message["Subject"] = my_subject
    message["From"] = my_purdue_email
    message["To"] = to

    # Create the plain-text and HTML version of your message
    text = f'''\
Subject: {my_subject}
To: {to}
From: {my_purdue_email}
{my_message}'''
    html = f'''\
<html>
<body>
{my_message}
</body>
</html>
'''

    # Turn these into plain/html MIMEText objects
    part1 = MIMEText(text, "plain")
    part2 = MIMEText(html, "html")

    # Add HTML/plain-text parts to MIMEMultipart message
    # The email client will try to render the last part first
    message.attach(part1)
    message.attach(part2)

    context = ssl.create_default_context()
    with smtplib.SMTP("smtp.purdue.edu", 587) as server:
        server.ehlo()  # Can be omitted
        server.starttls(context=context)
        server.ehlo()  # Can be omitted
        server.login(my_purdue_email, my_password)
        server.sendmail(my_purdue_email, to, message.as_string())

# this sends an email from [email protected] to [email protected]
# replace supersecretpassword with your own password
# do NOT include your password in your homework submission.
send_purdue_email("[email protected]", "supersecretpassword", "[email protected]", "put subject here", "put message body here")
- Python code used to solve the problem.
- Output from running your code.
- Screenshot showing that you received the email.
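Before sending anything, it can help to inspect the message object on its own. The sketch below builds the same kind of two-part (plain-text plus HTML) message as `send_purdue_email`, but never opens a network connection. The helper name `build_message` and the example.com addresses are placeholders introduced here for illustration; they are not part of the provided code.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_message(sender, to, subject, body):
    """Build a plain-text + HTML message without sending it (hypothetical helper)."""
    message = MIMEMultipart("alternative")
    message["Subject"] = subject
    message["From"] = sender
    message["To"] = to
    # Attach the plain-text part first; clients try to render the last part first.
    message.attach(MIMEText(body, "plain"))
    message.attach(MIMEText(f"<html><body>{body}</body></html>", "html"))
    return message


# Placeholder addresses (example.com is reserved for documentation).
msg = build_message("sender@example.com", "recipient@example.com",
                    "put subject here", "put message body here")
print(msg["Subject"])
print([part.get_content_type() for part in msg.get_payload()])
```

Printing `msg.as_string()` shows exactly what would be handed to `server.sendmail`, which is a convenient way to check the headers without risking a typo in a real email.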
Question 5
The following is the content of a new Python script called is_in_stock.py:
def send_purdue_email(my_purdue_email, my_password, to, my_subject, my_message):
    import smtplib, ssl
    from email.mime.text import MIMEText
    from email.mime.multipart import MIMEMultipart

    message = MIMEMultipart("alternative")
    message["Subject"] = my_subject
    message["From"] = my_purdue_email
    message["To"] = to

    # Create the plain-text and HTML version of your message
    text = f'''\
Subject: {my_subject}
To: {to}
From: {my_purdue_email}
{my_message}'''
    html = f'''\
<html>
<body>
{my_message}
</body>
</html>
'''

    # Turn these into plain/html MIMEText objects
    part1 = MIMEText(text, "plain")
    part2 = MIMEText(html, "html")

    # Add HTML/plain-text parts to MIMEMultipart message
    # The email client will try to render the last part first
    message.attach(part1)
    message.attach(part2)

    context = ssl.create_default_context()
    with smtplib.SMTP("smtp.purdue.edu", 587) as server:
        server.ehlo()  # Can be omitted
        server.starttls(context=context)
        server.ehlo()  # Can be omitted
        server.login(my_purdue_email, my_password)
        server.sendmail(my_purdue_email, to, message.as_string())

def main():
    # scrape element from question 3
    # does the text indicate it is in stock?
    # if yes, send email to yourself telling you it is in stock.
    # otherwise, gracefully end script using the "pass" Python keyword
    pass  # placeholder so the template parses; replace with your own code

if __name__ == "__main__":
    main()
First, make a copy of the script in your $HOME directory:
cp /class/datamine/data/scraping/is_in_stock.py $HOME/is_in_stock.py
If you now look in the "Files" tab in the lower right-hand corner of RStudio, and click the refresh button, you should see the file is_in_stock.py. You can open and modify this file directly in RStudio. Before you do so, however, change the permissions of the $HOME/is_in_stock.py script so only YOU can read, write, and execute it:
chmod 700 $HOME/is_in_stock.py
The script should now appear in RStudio, in your home directory, with the correct permissions. Open the script (in RStudio) and fill in the main function as indicated by the comments. We want the script to scrape the website and check whether the pants from question 3 are in stock.
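The decision logic described by the comments in main can be sketched and tested offline before wiring in selenium and email. Everything named here is hypothetical: `IN_STOCK_TEXT` is an assumed phrase (substitute the text you actually observed in question 3), and `scrape` and `notify` are stand-in callables for your scraping code and for `send_purdue_email`.

```python
# Assumed phrase for illustration; replace with the text observed in question 3.
IN_STOCK_TEXT = "in stock"


def is_in_stock(text):
    """Return True when the scraped text matches the assumed in-stock phrase."""
    return text.strip().lower() == IN_STOCK_TEXT


def check_and_notify(scrape, notify):
    """Scrape once and notify only if the item is in stock.

    scrape() returns the element's text; notify(message) delivers the alert.
    Both are hypothetical callables so the logic can be exercised offline.
    """
    text = scrape()
    if is_in_stock(text):
        notify(f"The pants are available: {text!r}")
        return True
    return False  # nothing to do; the script simply ends


# Offline demonstration with fake scrapers and a list standing in for email.
sent = []
check_and_notify(lambda: "In stock", sent.append)
check_and_notify(lambda: "Sold out", sent.append)
print(sent)
```

Keeping the check separate from the scraping and emailing makes it easy to confirm the script stays silent in the "not in stock" case, which matters once cron is running it unattended.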
A cron job is a task that runs at a certain interval. Create a cron job that runs your script, /class/datamine/apps/python/f2020-s2021/env/bin/python $HOME/is_in_stock.py, every 5 minutes. Wait 10-15 minutes and verify that it is working properly. The long path, /class/datamine/apps/python/f2020-s2021/env/bin/python, simply makes sure that our script is run with access to all of the packages in our course environment. $HOME/is_in_stock.py is the path to your script ($HOME expands to /home/<my_purdue_alias>).
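To make "every 5 minutes" concrete, the sketch below assembles a crontab entry as a string and expands its minute field. It assumes the standard five-field crontab format (minute, hour, day of month, month, day of week, followed by the command), typically edited with `crontab -e`. The helper `expand_minute_field` is hypothetical and handles only the `*` and `*/N` forms; it is here to illustrate the schedule, not to replace cron.

```python
def expand_minute_field(field):
    """Expand a cron minute field of the form '*' or '*/N' into minutes 0-59.

    Only these two forms are handled; real cron also accepts lists and ranges.
    """
    if field == "*":
        return list(range(60))
    if field.startswith("*/"):
        step = int(field[2:])
        return list(range(0, 60, step))
    raise ValueError(f"unsupported minute field: {field!r}")


PYTHON = "/class/datamine/apps/python/f2020-s2021/env/bin/python"
SCRIPT = "$HOME/is_in_stock.py"

# Five time fields, then the command to run.
cron_line = f"*/5 * * * * {PYTHON} {SCRIPT}"
print(cron_line)
print(expand_minute_field("*/5"))
```

The second line printed lists the minutes of each hour at which the job fires, so after waiting 10-15 minutes you should expect two or three runs.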
If you struggle to use the text editor used with the |
Don’t forget to copy your import statements from question (3) as well.
Once you are finished with the project, if you no longer wish to receive emails every so often, follow the instructions here to remove the cron job.
- Python code used to solve the problem.
- Output from running your code.
- The content of your cron job in a bash code chunk.
- The content of your is_in_stock.py script.