STAT 29000: Project 8 — Spring 2021

Motivation: Python is an interpreted language (as opposed to a compiled language). In a compiled language, you are (mostly) unable to run and evaluate a single instruction at a time. In Python (and R, which is also an interpreted language), we can easily run and evaluate a single line of code using a REPL. In fact, this is the way you’ve been using Python to date: selecting and running pieces of Python code. Other ways to use Python include creating a package (like numpy, pandas, and pytorch) and creating scripts. You can create powerful command line interface (CLI) tools using Python. In this project, we will explore this in detail and learn how to create scripts that accept options and input and perform tasks.

Context: This is the first (of two) projects where we will learn about creating and using Python scripts.

Scope: python

Learning objectives
  • Write a Python script that accepts user inputs and returns something useful.

Make sure to read about and use the template found here, and read the important information about project submissions here.

Questions

Question 1

Often, the deliverable part of a project isn’t a custom-built package or module, but a script. A script is a .py file containing Python code that performs one or more actions. Python scripts are easy to run. For example, if you had a Python script called question01.py, you could run it by opening a terminal and typing:

python3 /path/to/question01.py

The Python interpreter then looks for the script’s entry point and starts executing. You should read this article about the main function and Python scripts. In addition, read this section, paying special attention to the shebang.
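As a rough illustration of the entry point idea, here is a minimal script skeleton. The function name main and the message it prints are hypothetical choices for this sketch and are not part of any question's solution.

```python
#!/usr/bin/env python3
"""Minimal script skeleton (hypothetical example, not a solution)."""


def main() -> str:
    # All of the script's real work would go here.
    message = "Hello from main()"
    print(message)
    return message


# __name__ equals "__main__" only when the file is executed directly
# (python3 script.py or ./script.py), not when it is imported as a module.
if __name__ == "__main__":
    main()
```

Keeping the work inside a function and calling it under the guard means the file can also be imported without side effects.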

Create a Python script called question01.py in your $HOME directory. Use the second shebang from the article: #!/usr/bin/env python3. When run, question01.py should use the sys module to print the location of the interpreter being used to run the script. For example, if we started a Python interpreter in RStudio using the following code:

datamine_py()
reticulate::repl_python()

Then, we could print the interpreter by running the following Python code one line at a time:

import sys
print(sys.executable)

Since we are using our Python environment, you should see this result: /class/datamine/apps/python/f2020-s2021/env/bin/python3. This is the fully qualified path of the Python interpreter we’ve been using for this course.

Restart your R session by clicking Session > Restart R, navigate to the "Terminal" tab in RStudio, and run the following lines in the terminal. What is the output?

# this command gives execute permissions to your script -- this only needs to be run once
chmod +x $HOME/question01.py
# execute your script
$HOME/question01.py

You can run bash code using a bash code chunk just like you would an R or Python code chunk. Simply replace "python" with "bash" in the code chunk options.

Items to submit
  • The entire question01.py script’s contents in a Python code chunk with chunk option "eval=F".

  • Output from running your code copy and pasted as text.

Question 2

Was your output in question (1) expected? Why or why not?

When we restarted the R session, the effects of datamine_py were reversed, and our course interpreter is no longer the default when running python3. It is very common to have a multitude of Python environments available to use. But when we are running a Python script, it is not convenient to have to run various commands (in our case, the single datamine_py command) in order to get our script to run the way we want it to run. In addition, if our script used a set of packages that were not installed outside of our course environment, the script would fail.
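One way to see which interpreter is in play is to compare the interpreter running the code with the first python3 found on the PATH, which is the one that #!/usr/bin/env python3 would select. The sketch below is only an illustration; the helper name interpreter_report is hypothetical.

```python
import shutil
import sys


def interpreter_report() -> dict:
    """Collect the running interpreter and the python3 found on PATH."""
    return {
        # Absolute path of the interpreter executing this code.
        "running": sys.executable,
        # First python3 on PATH, i.e. what `which python3` would print;
        # None if no python3 is on PATH.
        "python3_on_path": shutil.which("python3"),
    }


report = interpreter_report()
for name, path in report.items():
    print(f"{name}: {path}")
```

If the two paths differ, a script with the env-based shebang will not run under the interpreter you are currently using.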

In this project, since our focus is more on how to write scripts and make them work as expected, we will have some fun and experiment with some pre-trained, state-of-the-art machine learning models.

The following function accepts a string called sentence as an input and returns the sentiment of the sentence, "POSITIVE" or "NEGATIVE".

from transformers import pipeline


def get_sentiment(model, sentence: str) -> str:
    # the model returns a list with one dictionary per input sentence
    result = model(sentence)

    return result[0].get('label')


# load the default pre-trained sentiment analysis model
model = pipeline('sentiment-analysis')
print(get_sentiment(model, 'This is really great!'))
print(get_sentiment(model, 'Oh no! Thats horrible!'))

Include get_sentiment (including the import statement) in a new script, question02.py. Note that you do not have to call get_sentiment anywhere; just include it for now. Go to the terminal in RStudio and execute your script. What happens?

Remember, since our current shebang is #!/usr/bin/env python3, if our script uses one or more packages that are not installed in the current environment, the script will fail. This is what is happening. The transformers package that we use is not installed in the current environment. We do, however, have an environment that does have it installed, and its interpreter is located on Scholar at: /class/datamine/apps/python/pytorch2021/env/bin/python. Update the script’s shebang and try to run it again. Does it work now?

Depending on the state of your current environment, the original shebang, #!/usr/bin/env python3, will use the same Python interpreter and environment that python3 currently resolves to (run which python3 to see). If you haven’t run datamine_py, this will be something like /apps/spack/scholar/fall20/apps/anaconda/2020.11-py38-gcc-4.8.5-djkvkvk/bin/python or /usr/bin/python. If you have run datamine_py, this will be /class/datamine/apps/python/f2020-s2021/env/bin/python. Both environments lack the transformers package. Our other environment, whose interpreter lives at /class/datamine/apps/python/pytorch2021/env/bin/python, does have this package. The shebang is therefore critically important for any script that needs packages from a specific environment.
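To check whether a given environment can see a package without triggering an import error, something like the following sketch may help. It relies on importlib.util.find_spec from the standard library; the helper name is_installed is hypothetical, and the sketch only handles top-level package names.

```python
import importlib.util
import sys


def is_installed(package: str) -> bool:
    """Return True if the top-level `package` is importable by this interpreter."""
    return importlib.util.find_spec(package) is not None


# "transformers" is the package this project needs; "json" ships with Python.
for package in ("json", "transformers"):
    status = "available" if is_installed(package) else "missing"
    print(f"{package}: {status} in {sys.executable}")
```

Running this under each interpreter mentioned above should show which environment provides transformers.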

You can run bash code using a bash code chunk just like you would an R or Python code chunk. Simply replace "python" with "bash" in the code chunk options.

Items to submit
  • Sentence explaining why or why not the output from question (1) was expected.

  • Sentence explaining what happens when you include get_sentiment in your script and try to execute it.

  • The entirety of the updated (working) script’s content in a Python code chunk with chunk option "eval=F".

Question 3

Okay, great. We now understand that if we want to use packages from a specific environment, we need to modify our shebang accordingly. As it currently stands, our script is pretty useless. Modify the script, as a new script called question03.py, to accept a single argument. This argument should be a sentence. Your script should then print the sentence and whether the sentence is "POSITIVE" or "NEGATIVE". Use sys.argv to accomplish this. Make sure the script functions in the following way:

$HOME/question03.py This is a happy sentence, yay!
Too many arguments.
$HOME/question03.py 'This is a happy sentence, yay!'
Our sentence is: This is a happy sentence, yay!
POSITIVE
$HOME/question03.py
./question03.py requires at least 1 argument, "sentence".

One really useful way to exit the script and print a message is like this:

import sys
sys.exit(f"{__file__} requires at least 1 argument, 'sentence'")
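The behavior of sys.exit with a string can be explored without ending your session, because sys.exit works by raising SystemExit. The sketch below is illustrative only: require_sentence is a hypothetical helper, and it takes the argument list as a parameter so that it can be tried with made-up lists instead of the real sys.argv.

```python
import sys


def require_sentence(argv: list) -> str:
    """Return the single argument after the script name, or exit with a message."""
    if len(argv) < 2:
        # A string passed to sys.exit is printed to stderr and the
        # process exits with status 1.
        sys.exit(f"{argv[0]} requires at least 1 argument, 'sentence'")
    return argv[1]


# sys.exit raises SystemExit, so it can be observed without ending the session.
try:
    require_sentence(["./example.py"])
except SystemExit as err:
    print(f"exit message: {err.code}")

print(require_sentence(["./example.py", "This is a happy sentence, yay!"]))
```

Note that argv[0] is the script name, so a script called with one argument has a list of length 2.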

You can run bash code using a bash code chunk just like you would an R or Python code chunk. Simply replace "python" with "bash" in the code chunk options.

Items to submit
  • The entirety of the updated (working) script’s content in a Python code chunk with chunk option "eval=F".

  • Output from running your script with the given examples.

Question 4

If you look at the man pages for a command line tool like awk or grep (you can get these by running man awk or man grep in the terminal), you will see that CLIs typically have a variety of options. Options usually follow this format:

grep -i 'ok' some_file.txt

However, you often have two ways to use an option: either the short form (for example -i) or the long form (for example, -i is the same as --ignore-case). Sometimes options take values. If an option doesn’t take a value, you can assume that the presence of the flag means TRUE and its absence means FALSE. When using the short form, the value for the option is separated by a space (for example grep -f my_file.txt). When using the long form, the value for the option is separated by an equals sign (for example grep --file=my_file.txt).
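The flag convention described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: split_arguments is a hypothetical helper, it treats every token starting with "-" as an option, and it does not handle options that take values.

```python
def split_arguments(args: list) -> tuple:
    """Separate option-like tokens (starting with '-') from everything else.

    Hypothetical helper; it does not handle options that take values,
    such as `-f my_file.txt` or `--file=my_file.txt`.
    """
    options = [arg for arg in args if arg.startswith("-")]
    positionals = [arg for arg in args if not arg.startswith("-")]
    return options, positionals


# Both the short and the long form map to the same boolean flag.
ignore_case_names = {"-i", "--ignore-case"}

options, positionals = split_arguments(["-i", "ok", "some_file.txt"])
ignore_case = any(opt in ignore_case_names for opt in options)

print(options)
print(positionals)
print(ignore_case)
```

Because the flag is detected by membership rather than position, it can appear before or after the positional arguments.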

Modify your script (as a new question04.py) to include an option called score. When active (question04.py --score or question04.py -s), the script should return both the sentiment, "POSITIVE" or "NEGATIVE" and the probability of being accurate. Make sure that you modify your checks from question 3 to continue to work whenever we use --score or -s. Some examples below:

$HOME/question04.py 'This is a happy sentence, yay!'
Our sentence is: This is a happy sentence, yay!
POSITIVE
$HOME/question04.py --score 'This is a happy sentence, yay!'
Our sentence is: This is a happy sentence, yay!
POSITIVE: 0.999848484992981
$HOME/question04.py -s 'This is a happy sentence, yay!'
Our sentence is: This is a happy sentence, yay!
POSITIVE: 0.999848484992981
$HOME/question04.py 'This is a happy sentence, yay!' -s
Our sentence is: This is a happy sentence, yay!
POSITIVE: 0.999848484992981
$HOME/question04.py 'This is a happy sentence, yay!' --score
Our sentence is: This is a happy sentence, yay!
POSITIVE: 0.999848484992981
$HOME/question04.py 'This is a happy sentence, yay!' --value
Unknown option(s): ['--value']
$HOME/question04.py 'This is a happy sentence, yay!' --value --score
Too many arguments.
$HOME/question04.py
question04.py requires at least 1 argument, "sentence"
$HOME/question04.py --score
./question04.py requires at least 1 argument, "sentence". No sentence provided.
$HOME/question04.py 'This is one sentence' 'This is another'
./question04.py requires only 1 sentence, but 2 were provided.

You can run bash code using a bash code chunk just like you would an R or Python code chunk. Simply replace "python" with "bash" in the code chunk options.

Experiment with the provided function. You will find the probability of being accurate is already returned by the model.
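For orientation, the sentiment-analysis pipeline returns a list containing one dictionary per input sentence, with a "label" key and a "score" key. The sketch below uses a hard-coded stand-in for that return value so it runs without the transformers package; the score shown is made up, and describe is a hypothetical helper name.

```python
# Stand-in for what model(sentence) returns: a list with one dictionary per
# input sentence. The numbers here are made up for illustration.
example_result = [{"label": "POSITIVE", "score": 0.75}]


def describe(result: list, with_score: bool = False) -> str:
    """Format the first prediction, optionally with its score."""
    first = result[0]
    if with_score:
        return f"{first['label']}: {first['score']}"
    return first["label"]


print(describe(example_result))
print(describe(example_result, with_score=True))
```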

Items to submit
  • The entirety of the updated (working) script’s content in a Python code chunk with chunk option "eval=F".

  • Output from running your script with the given examples.

Question 5

Wow, that is an extensive amount of logic for a single option. Luckily, Python has the argparse package to help you build CLIs and handle situations like this. You can find the documentation for argparse here and a nice little tutorial here. Update your script (as a new question05.py) using argparse instead of custom logic. Specifically, add 1 positional argument called "sentence", and 1 optional argument "--score" or "-s". You should handle the following scenarios:

$HOME/question05.py 'This is a happy sentence, yay!'
Our sentence is: This is a happy sentence, yay!
POSITIVE
$HOME/question05.py --score 'This is a happy sentence, yay!'
Our sentence is: This is a happy sentence, yay!
POSITIVE: 0.999848484992981
$HOME/question05.py -s 'This is a happy sentence, yay!'
Our sentence is: This is a happy sentence, yay!
POSITIVE: 0.999848484992981
$HOME/question05.py 'This is a happy sentence, yay!' -s
Our sentence is: This is a happy sentence, yay!
POSITIVE: 0.999848484992981
$HOME/question05.py 'This is a happy sentence, yay!' --score
Our sentence is: This is a happy sentence, yay!
POSITIVE: 0.999848484992981
$HOME/question05.py 'This is a happy sentence, yay!' --value
usage: question05.py [-h] [-s] sentence
question05.py: error: unrecognized arguments: --value
$HOME/question05.py 'This is a happy sentence, yay!' --value --score
usage: question05.py [-h] [-s] sentence
question05.py: error: unrecognized arguments: --value
$HOME/question05.py
usage: question05.py [-h] [-s] sentence
positional arguments:
  sentence
optional arguments:
  -h, --help   show this help message and exit
  -s, --score  display the probability of accuracy
$HOME/question05.py --score
usage: question05.py [-h] [-s] sentence
question05.py: error: the following arguments are required: sentence
$HOME/question05.py 'This is one sentence' 'This is another'
usage: question05.py [-h] [-s] sentence
question05.py: error: unrecognized arguments: This is another

A good way to print the help information if no arguments are provided is:

if len(sys.argv) == 1:
    parser.print_help()
    parser.exit()
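To make the parser object in the snippet above more concrete, here is a minimal sketch of an argparse parser with one positional argument and one boolean flag. The prog name example.py and the help text for sentence are assumptions made for this illustration, and the sketch leaves out the sentiment model entirely.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one positional argument and one boolean flag."""
    # prog is set explicitly so that usage messages do not depend on how
    # this sketch is run; a real script can omit it.
    parser = argparse.ArgumentParser(prog="example.py")
    parser.add_argument("sentence", help="the sentence to analyze")
    # store_true makes the option a flag: True when present, False otherwise.
    parser.add_argument("-s", "--score", action="store_true",
                        help="display the probability of accuracy")
    return parser


parser = build_parser()

# parse_args accepts an explicit list, which is convenient for experimenting;
# with no argument it reads sys.argv[1:].
args = parser.parse_args(["-s", "This is a happy sentence, yay!"])
print(args.sentence, args.score)
```

Passing an explicit list to parse_args is a convenient way to try out different argument combinations before wiring the parser into a script.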

Include the bash code chunk option error=T to enable RMarkdown to knit and output errors.

You can run bash code using a bash code chunk just like you would an R or Python code chunk. Simply replace "python" with "bash" in the code chunk options.

Items to submit
  • Python code used to solve the problem.

  • Output from running your code.