STAT 39000: Project 6 — Fall 2021
Sharing Python code: Virtual environments & git part I
Motivation: Python is an incredible language that enables users to write code quickly and efficiently. When you choose Python, you are typically trading program speed (how fast your code runs) for developer speed (the time it takes to write a functioning program or script). This is often the right trade, depending on your staff and how much your software developers or data scientists earn. However, Python code cannot simply be compiled to machine code for a given target platform (x86_64, ARM, Darwin, etc.) and shared as a single executable. Instead, in Python you need to learn how to use virtual environments (and git) to share your code.
Context: This is the first in a series of 3 projects that explore how to set up and use virtual environments, as well as some git basics. This series is not intended to teach you everything you need to know, but rather to give you some exposure so the terminology and general ideas are not foreign to you.
Scope: Python, virtual environments, git
Questions
While this project may look like it is a lot of work, it is probably one of the easiest projects you will get this semester. The question text is long, but it is mostly just instructional content and directions. If you just carefully read through it, it will probably take you well under 1 hour to complete!
Question 1
Sign up for a free GitHub account at https://github.com. If you already have a GitHub account, perfect!
Once complete, type your GitHub username into a markdown cell.
- Your GitHub username in a markdown cell.
Question 2
We’ve created a repository for this project at github.com/TheDataMine/f2021-stat39000-project6. You’ll quickly see that the code will be ultra familiar to you. The goal of this question is to clone the repository to your $HOME directory. Some of you may already be rushing off to your Jupyter Notebook to run the following.
%%bash
git clone https://github.com/TheDataMine/f2021-stat39000-project6
Don’t! Instead, we are going to take the time to set up authentication with GitHub using SSH keys. Don’t worry, it’s way easier than it sounds!
P.S. As usual, you should have a notebook called |
The first step is to create a new SSH key pair on Brown, in your $HOME directory. To do that, simply run the following in a bash cell.
If you know what an SSH key pair is, and already have one set up on Brown, you can skip this step.
%%bash
ssh-keygen -a 100 -t ed25519 -f ~/.ssh/id_ed25519 -C "lastname_brown_key"
When prompted for a passphrase, just press enter twice without entering a passphrase. If it doesn’t prompt you, it probably already generated your keys! Congratulations! You have your new key pair!
So, what is a key pair, and what does it look like? A key pair is two files on your computer (or in this case, Brown). These files live inside the ~/.ssh directory. Take a look by running the following in a bash cell.
ls -la ~/.ssh
... id_ed25519 id_ed25519.pub ...
The first file, id_ed25519, is your private key. It is critical that you do not share this key with anybody, ever. Anybody in possession of this key can log in, as you, to any system that holds the associated public key. As such, on a shared system (with lots of users, like Brown), it is critical to assign the correct permissions to this file. Run the following in a bash cell.
chmod 600 ~/.ssh/id_ed25519
This will ensure that you, as the owner of the file, have the ability to both read and write to this file. At the same time, this prevents any other user from being able to read, write, or execute this file (with the exception of a superuser). It is also important to get the permissions of files within ~/.ssh correct, as OpenSSH will not work properly otherwise (for safety).
Great! The other file, id_ed25519.pub, is your public key. This is the key that is shareable, and that allows a third party to verify that "the user trying to access resource X has the associated private key." First, let’s set the correct permissions by running the following in a bash cell.
chmod 644 ~/.ssh/id_ed25519.pub
This will ensure that you, as the owner of the file, have the ability to both read and write to this file. At the same time, everybody else on the system will be able to read the file, but not write to or execute it.
Last, but not least, run the following to correctly set the permissions of the ~/.ssh directory.
%%bash
chmod 700 ~/.ssh
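As a quick sanity check on the three modes used above, the following sketch decodes them with Python's standard stat module. It is only an illustration: the helper name describe_mode is made up for this example, and it formats mode numbers rather than touching any files on Brown.

```python
import stat


def describe_mode(mode: int, is_dir: bool = False) -> str:
    """Return the ls-style permission string for an octal mode.

    describe_mode is a hypothetical helper written for this example.
    """
    file_type = stat.S_IFDIR if is_dir else stat.S_IFREG
    return stat.filemode(file_type | mode)


# The octal literals match the arguments passed to chmod in this question.
print("id_ed25519     ", describe_mode(0o600))               # private key
print("id_ed25519.pub ", describe_mode(0o644))               # public key
print("~/.ssh         ", describe_mode(0o700, is_dir=True))  # directory
```

Reading each string as owner, group, others: 600 is read and write for the owner only, 644 adds read-only access for everyone else, and 700 lets only the owner list, modify, and enter the directory.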
Now, take a look at the contents of your public key by running the following in a bash cell.
%%bash
cat ~/.ssh/id_ed25519.pub
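The line printed by this command has three space-separated fields: the key type, the base64-encoded key data, and the comment given with -C. The sketch below pulls those fields apart in Python. The function name parse_public_key is made up for this example, and the key it parses is a synthetic all-zero placeholder built in the code, not a real key.

```python
import base64
import struct


def parse_public_key(line: str):
    """Split an OpenSSH public key line into (key type, key bytes, comment)."""
    key_type, blob_b64, *rest = line.strip().split(" ", 2)
    blob = base64.b64decode(blob_b64)
    # The blob starts with a 4-byte big-endian length, then the key type again.
    (type_len,) = struct.unpack(">I", blob[:4])
    embedded_type = blob[4:4 + type_len].decode("ascii")
    if embedded_type != key_type:
        raise ValueError("key type does not match the encoded data")
    # A second length-prefixed field holds the raw public key bytes.
    offset = 4 + type_len
    (key_len,) = struct.unpack(">I", blob[offset:offset + 4])
    key_bytes = blob[offset + 4:offset + 4 + key_len]
    comment = rest[0] if rest else ""
    return key_type, key_bytes, comment


# Build a placeholder line with 32 zero bytes standing in for a real key.
name = b"ssh-ed25519"
placeholder_blob = struct.pack(">I", len(name)) + name + struct.pack(">I", 32) + bytes(32)
placeholder_line = (
    "ssh-ed25519 " + base64.b64encode(placeholder_blob).decode("ascii") + " lastname_brown_key"
)

key_type, key_bytes, comment = parse_public_key(placeholder_line)
print(key_type, len(key_bytes), comment)
```

To try it on your own key, pass the text of ~/.ssh/id_ed25519.pub to parse_public_key instead of the placeholder line.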
Not a whole lot to it, right? Great. Copy the output to your clipboard. Now, navigate to github.com and log in if you haven’t already. Click on your profile in the upper-right-hand corner of the screen, and then click Settings.
If you haven’t already, this is a fine time to explore the various GitHub settings, set a profile picture, add a bio, etc.
In the left-hand menu, click on SSH and GPG keys.
In the next screen, click on the green button that says New SSH key. Fill in the "Title" field with anything memorable. I like to put a description that tells me where I generated the key (on what computer), for example, "brown.rcac.purdue.edu". That way, I know whether I can delete that key down the road when cleaning things out. In the "Key" field, paste your public key (the output from running the cat command in the previous code block). Finally, click the button that says Add SSH key.
Congratulations! You should now be able to easily authenticate with GitHub from Brown, how cool! To test the connection, run the following in a cell.
!ssh -o "StrictHostKeyChecking no" -T git@github.com
If you use the following — you will get an error, but as long as it says "Hi username! …" at the top, you are good to go!
|
If you were successful, it should reply with something like:
Hi username! You've successfully authenticated, but GitHub does not provide shell access.
If it asks you something like "Are you sure you want to continue connecting (yes/no)?", type "yes" and press enter.
Okay, FINALLY, let’s get to the actual task! Clone the repository to your $HOME directory, using SSH rather than HTTPS.
If you navigate to the repository in the browser, click on the green "<> Code" button, you will get a dropdown menu that allows you to select "SSH", which will then present you with the string you can use in combination with the |
Upon success, you should see a new folder in your $HOME directory, f2021-stat39000-project6.
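To see how the HTTPS and SSH forms of a GitHub repository address relate, here is a small sketch that rewrites one into the other. The function name https_to_ssh and the repository example-user/example-repo are made up for illustration; the string GitHub shows under the "SSH" option remains the authoritative one.

```python
from urllib.parse import urlparse


def https_to_ssh(url: str) -> str:
    """Rewrite https://host/owner/repo as git@host:owner/repo.git."""
    parts = urlparse(url)
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError(f"not an HTTPS URL: {url!r}")
    path = parts.path.strip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    return f"git@{parts.netloc}:{path}.git"


# example-user/example-repo is a hypothetical repository.
print(https_to_ssh("https://github.com/example-user/example-repo"))
```

With the SSH form, git authenticates using the key pair you registered rather than asking for a username and password.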
- Code used to solve this problem.
- Output from running the code.
Question 3
Take a peek into your freshly cloned repository. You’ll notice a couple of files that you may not recognize. Focus on the pyproject.toml file, and cat it to see the contents.
The pyproject.toml file contains the build system requirements of a given Python project. It can be used with pip or some other package installer to download the packages (like pandas, for example), at the versions the project specifies, that are required in order to build and/or run the project!
Typically, when you are working on a project, and you’ve cloned the project, you want to build the exact environment that the developer had set up when developing the project. This way you ensure that you are using the exact same versions of the same packages, so you can expect things to function the same way. This is critical, as the last thing you want to have to deal with is figuring out why your code is not working when the developer’s or project maintainer’s code is.
There are a variety of popular tools that can be used for dependency management and/or virtual environment management in Python. The most popular are: conda, pipenv, and poetry.
What is a "virtual environment"? In a nutshell, a virtual environment is a Python installation such that the interpreter, libraries, and scripts that are available in the virtual environment are distinct and separate from those in other virtual environments or the system Python installation. We will dig into this more.
There are pros and cons to each of these tools, and you are free to explore and use what you like. Having used each of these tools exclusively for at least 1 year, I have had the fewest issues with poetry.
When I say "issues" here, I mean unresolved bugs with open tickets on the project’s GitHub page. For that reason, we will be using poetry for this project.
Poetry was used to create the pyproject.toml file you see in the repository. Poetry is already installed on Brown. See where by running the following in a bash cell.
which poetry
By default, when creating a virtual environment using poetry, each virtual environment will be saved to $HOME/.cache/pypoetry. While this is not particularly bad, there is a configuration option we can set that will instead store the virtual environment in a project’s own directory. This is a nice feature if you are working on a shared compute space, as it is explicitly clear where the environment is located and, theoretically, you will have access (as it is a shared space). Let’s set this up. Run the following commands.
%%bash
poetry config virtualenvs.in-project true
poetry config cache-dir "$HOME/.cache/pypoetry"
poetry config --list
This will create a config.toml file at $HOME/.config/pypoetry/config.toml, which is where your settings are saved.
Finally, let’s set up your own virtual environment to use with your cloned f2021-stat39000-project6 repository. Run the following commands.
module unload python/f2021-s2022-py3.9.6
cd $HOME/f2021-stat39000-project6
poetry install
This may take a minute or two to run.
Normally, you’d be able to skip the |
This should install all of the dependencies and the virtual environment in $HOME/f2021-stat39000-project6/.venv. To check, run the following.
ls -la $HOME/f2021-stat39000-project6/
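The same check can be sketched in Python. This example assumes the in-project environment contains a pyvenv.cfg file, which is the marker that the standard venv module and recent virtualenv releases write; the helper name find_in_project_venv is made up, and the demonstration runs against a throwaway temporary directory rather than your real project.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_in_project_venv(project_dir: Path) -> Optional[Path]:
    """Return the project's .venv directory if it looks like a virtual environment."""
    venv_dir = project_dir / ".venv"
    # Assumption: a pyvenv.cfg file marks the root of a virtual environment.
    if (venv_dir / "pyvenv.cfg").is_file():
        return venv_dir
    return None


# Demonstrate on a temporary directory with a fake, empty environment.
with tempfile.TemporaryDirectory() as tmp:
    demo_project = Path(tmp)
    before = find_in_project_venv(demo_project)
    (demo_project / ".venv").mkdir()
    (demo_project / ".venv" / "pyvenv.cfg").write_text("home = /usr/bin\n")
    after = find_in_project_venv(demo_project)

print("before creating .venv:", before)
print("after creating .venv: ", after)
```

To check the real project, call find_in_project_venv(Path.home() / "f2021-stat39000-project6") after poetry install has finished.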
To actually use this virtual environment (rather than our kernel’s Python environment, or the system Python installation), preface python commands with poetry run. For example, let’s say we want to run a script in the package. Instead of running python script.py, we can run poetry run python script.py. Test it out!
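One way to see the difference is a tiny script that reports which interpreter is running it. The sketch below could be saved under a hypothetical name such as whichenv.py and run once with python and once with poetry run python. It relies on sys.prefix differing from sys.base_prefix inside a virtual environment, which holds for environments created by venv and recent virtualenv releases.

```python
import sys


def in_virtual_environment() -> bool:
    """True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("environment:", sys.prefix)
print("virtual environment active:", in_virtual_environment())
```

If poetry run is picking up the project environment, the reported paths should point inside the project's .venv directory.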
For each bash cell when running poetry commands — it is critical the cells begin as follows:
Otherwise, poetry will not use the correct Python environment. This is a side effect of the way we have our installation. Normally, poetry will know to use the correct Python environment for the project.
We have a file called runme.py in the scripts directory ($HOME/f2021-stat39000-project6/scripts/runme.py). This script just quickly uses our package and prints some info, nothing special. Run the script using the virtual environment.
You may need to provide execute permissions to the runme files.
|
%%bash
module unload python/f2021-s2022-py3.9.6
chmod 700 $HOME/f2021-stat39000-project6/scripts/runme.py
chmod 700 $HOME/f2021-stat39000-project6/scripts/runme2.py
cd $HOME/f2021-stat39000-project6
poetry run python scripts/runme.py
The script will print the location of the |
- Code used to solve this problem.
- Output from running the code.
Question 4
Now, try to run the following script using our virtual environment: $HOME/f2021-stat39000-project6/scripts/runme2.py. What happens?
Make sure to run the script from the project folder and not from the
But do run:
|
It looks like a package wasn’t found, and should be added to our environment (and therefore our pyproject.toml file). Run the following command to install the package to your virtual environment.
module unload python/f2021-s2022-py3.9.6
cd $HOME/f2021-stat39000-project6
poetry add packagename # where packagename is the name of the package/module you want to install (that was found to be missing)
Does the pyproject.toml file reflect this change? Now try to run the script again. Voila!
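If you want to confirm from Python whether a module can be imported in the current environment, a check along these lines is one option. The helper name is_installed and the module name hypothetical_missing_package are made up for this example. Keep in mind that the name you import is not always the name you install (for example, the sklearn module comes from the scikit-learn package).

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Report whether a top-level module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


# json ships with Python; the second name is deliberately made up.
for name in ("json", "hypothetical_missing_package"):
    print(name, "->", "found" if is_installed(name) else "missing")
```

Run through poetry run python, this checks the project's virtual environment rather than the kernel's Python environment.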
- Code used to solve this problem.
- Output from running the code.
Question 5
Read about at least 1 of the 2 git workflows listed here (if you have to choose 1, I prefer the "GitHub flow" style). Describe in words the process you would use to add a function or method to our repo, step by step, in as much detail as you can. I will start for you, with the "GitHub flow" style.
- Add the function or method to the watch_data.py module in $HOME/f2021-stat39000-project6/.
- …
- Deploy the branch (this could be a website, or package being used somewhere) for final testing, before merging into the main branch where code should be pristine and able to be immediately deployed at any time and function as intended.
- …
The goal of this question is to try as hard as you can to understand at a high level what a workflow like this enables and the steps involved, and to think about it from the perspective of working with 100 other data scientists and/or software engineers. Any details, logic, or explanation you want to provide in the steps would be excellent!
You do not need to specify actual |
- Code used to solve this problem.
- Output from running the code.
Please make sure to double-check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted.