Dask

Overview

Dask is a package focused on interacting with data in Python. Dask has a lot of similar functionality to a common Python package named Pandas. The advantage of Dask is that it is a naturally parallel package. This means that the people who wrote Dask utilized parallel processing in order to help it read and work with data faster than other packages, like Pandas, that use a single processer.

Dask Dashboard

In addition to the built-in parallel processing Dask has a dashboard client that can be used to monitor the processes as they run. Because the Data Mine code is run on Purdue’s high performance computing cluster we have a few extra steps to allow us to view the dashboard.

Steps to Launch a Dask Dashboard

  1. Navigate to the Purdue HPC website and click the Remote Desktop option.

    • Note: this will require the dual-authentication for Purdue’s systems.

  2. Launch a terminal in the Remote Desktop session. This can be done by either:

    • Selecting the terminal icon that appears in the bottom center of the screen.

    • Right-clicking on the desktop and selecting the Open Terminal Here option.

  3. Once we have a terminal open we need to load the Jupyter Lab environment so that we can launch a new session. The commands to load this environment and launch a jupyter lab session are included below:

    • In the terminal type module use /scratch/brown/kamstut/tdm/opt/modulefiles

    • Once we have run the first command type module load python/jupyterlab-py3.9.6

    • These two commands are specific to the Data Mine environment. The first command makes specific modules that the staff created available to the user. The second loads the Jupyter Lab environment that we will use for the Dask dashboard.

  4. After the module has been loaded run the python -m Jupyter Lab command in the terminal. This will launch a new Jupyter Lab session and should open the browser automatically.

  5. In the Jupyter Lab session you can now launch the Dask dashboard. The code for the dashboard is included below:

from dask.distributed import Client

client = Client()
client
  1. Once the code cell above is exectued Dask should provide a link to the dashboard. Clicking the link will open the dashboard in a new browser tab.