Using Data Mine Containers

What’s The Advantage of Using Containers?

This section applies to the notebook code examples found across the Starter Guides.
  • Don’t worry about packages. Each container comes prebuilt with the right packages, so there is nothing to install. No more fixing pip dependencies, figuring out which packages work together, or hunting down the versions the notebook was originally written with.

  • Data sources are integrated into the container, so you won’t have to go hunting for the dataset.

  • You can tweak the code inside the container, and then download the resulting notebook to your local machine: start with something that you know works, and build from there.

  • Somewhat future-proof: if, a few years from now, you are assigned an NLP project and you have an NLP container from the Data Mine, you can fire up that container and tweak the code to suit your needs, without figuring out which package versions you need or chasing years-old datasets whose URLs no longer work.

  • Only a few lines of terminal commands get you this code.

  • Containers are immutable, meaning that you can’t accidentally change the container image. Edit to your heart’s content while the container is running, and the original code will still pop up the next time you run the container.

  • Because the container runs Jupyter Lab, you can easily download any code edits you make, just as if you were downloading a notebook from Jupyter Lab on Anvil.

What’s In The Container?

The containers are composed of:

  • The notebook for that particular project in a fully working state.

  • All the data needed to run the notebook.

  • A requirements.txt file, optionally included to record which libraries were installed in the container (this is a direct copy of the requirements.txt used to build the container). You do not have to install anything; it is there purely for reference.

  • A startup command that opens Jupyter Lab immediately when the container is run.
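For example, if you open a terminal inside the running container and list the working directory, you will see these contents. The notebook and data names below are made up for illustration; they vary by project:

ls
# illustrative output (names vary by project):
# data/  requirements.txt  some-project.ipynb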

How To Use The Containers

Installing Docker

To run the containers, you will need Docker. If you don’t already have it installed, you can download it here.
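If you are not sure whether Docker is already on your machine, you can check from any terminal:

docker --version
# prints the installed Docker version, or errors if Docker is not installed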

Finding Container Image Names

On The Starter Guide Page

You can find the specific image name on the associated Starter Guide page in the Code Examples section. The images all follow the same format:

ghcr.io/thedatamine/starter-guides:<specific-project-tag-name>

All of the projects use the same image name, except for the final portion after the :, which is that project’s tag. For instance, the image name for the neural networks introduction container is

ghcr.io/thedatamine/starter-guides:neural-nets-intro

On The Container Tag Library

You can find a full list of all the containers we have available (at least for the Starter Guide notebooks) at the Container Tag Library page.

Downloading The Container

Download the container locally, not to Anvil.

Say we want to download the web scraping introduction container. You can find the exact command on the web scraping page (down in the Code Examples section), but the general command to pull a specific container image from a terminal is:

docker pull ghcr.io/thedatamine/starter-guides:<specific-project-tag-name>

This will pull the container image to your local machine.
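For instance, to pull the neural networks introduction image shown earlier:

docker pull ghcr.io/thedatamine/starter-guides:neural-nets-intro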

Running The Container

When you are ready to run the container, use this command at your terminal, replacing <specific-project-tag-name> with the tag for your project.

docker run -p 8888:8888 -it ghcr.io/thedatamine/starter-guides:<specific-project-tag-name>

If other applications are using port 8888, you might have trouble connecting correctly. Make sure port 8888 is free, or edit this command to map to a port that is open.

The -p switch tells Docker which host port to map to the container. In theory, you can map whichever port you want; 8888 is the most common for Jupyter Lab. Read more about the -p switch here.

Upon running the container, Docker will present a few http links that you can ctrl-click to open in your favorite web browser; pick whichever typically works on your machine. From here, it will look just the same as if you were opening Jupyter Lab on Anvil. The difference is that this Jupyter Lab is hosted locally inside the container, and you are accessing it through your browser.
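If port 8888 is already taken on your machine, you can map a free host port to the container’s port 8888 instead. A minimal sketch, using 9999 as an arbitrary choice of host port:

docker run -p 9999:8888 -it ghcr.io/thedatamine/starter-guides:<specific-project-tag-name>

With this mapping, replace 8888 with 9999 in the links Docker prints before opening them in your browser.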

Each Starter Guide page gives the shell commands along with the correctly named Docker image to pull. So, if you want the Time Series container, go to Time Series → Code Examples.

Once The Container is Running

After ctrl-clicking the link given in the terminal, you will be looking at Jupyter Lab. From here you can double-click the only notebook in the Lab, run it just like you would any notebook to verify that it works, and then edit as needed.

When done, save and download the notebook like usual.

When you are finished working in the container, press ctrl-C in your terminal to quit the container.

When you save and download, you get your modified notebook. However, if you quit the container and run it again, you will find the original notebook unmodified. This is because containers are immutable. Be sure to download your edits from Jupyter Lab to your local machine before you close the container!
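If you do forget, all is not necessarily lost: a stopped container keeps its filesystem until it is removed, so you can copy files out of it with standard Docker commands. A sketch (the notebook path below is hypothetical; adjust it to the actual path inside your container):

docker ps -a    # find the ID of the stopped container
docker cp <container-id>:/path/to/notebook.ipynb .    # copy the notebook into the current directory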

Finding All Packages Used In The Container

Open up a terminal in Jupyter Lab inside the container and type

pip freeze

to get a list of all the packages installed for that particular code example.
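If you want to keep that list for later, you can redirect it to a file with ordinary shell redirection (the file name here is arbitrary):

pip freeze > my-frozen-requirements.txt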

You will notice that there is already a requirements.txt file in the container itself, which contains a list of all the packages used at the time the notebook/container was made.

You can use

pip install -r /path/to/requirements.txt

to install packages from a requirements.txt file into your currently activated Python environment. For instance, if you download the requirements.txt to your local computer and provide the path to it, this will install all the packages that the notebook in the container was using.

If the requirements.txt file pins different versions of packages than the ones currently installed, pip may install the requested versions, which can make the versions your other Python scripts rely on unavailable. The solution is to create multiple Python environments, so that you can install different versions of packages depending on the use case. You can learn more about creating multiple Python environments here. If you are using Conda, you can learn about managing environments using Conda here.
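A minimal sketch using Python’s built-in venv module, with datamine-env as an arbitrary choice of environment name:

python -m venv datamine-env    # create an isolated environment
source datamine-env/bin/activate    # activate it (on Windows: datamine-env\Scripts\activate)
pip install -r /path/to/requirements.txt    # install the container's packages into it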