TDM 40200: Project 02 - Basic Python, Environment & Dependency Management, Documentation

Project Objectives

This project is an introduction to Python dependency management and environments. We will explore how to create and use different environments, and how to manage them afterwards. This is closely related to dependency management: making it easy for others to take on the project, and solving problems as they arise. We will therefore look at some common tools and conflict-resolution techniques. We integrate this learning with analyzing data-driven problems; the dataset deals with global CO2 and greenhouse gas emissions. Since this is one of the introductory projects, this part will also serve as a refresher on basic Python implementation and relevant documentation methods.

Learning Objectives
  • Understand the purpose and importance of dependency management and Python environments.

  • Learn how to manage dependencies and how to use Python environments.

  • Create visualizations and write functions in Python.

  • Document Python code.

Dataset

The dataset is at /anvil/projects/tdm/data/co2/owid-co2-data.csv from github.com/owid/co2-data.

Questions

Question 1 (2 points)

In programming, a dependency is an external library or package that a project's features need in order to work. There are several reasons to manage dependencies. The first is reproducibility: regardless of where or how the code is run, we obtain the same results every time. This also means that if there is a problem and multiple people are looking at the project, they all see the same issue. The importance of reproducibility shows in all stages of development: you make things easier for yourself if you can recreate the setting in which all the code works, and easier for others when multiple people work on the project.

The most common issue that arises without dependency management is a version conflict: two or more libraries require different versions of the same package, but only one version can be installed globally on your system.

Thus, we can use virtual environments to create a separate space for one project’s dependencies, so that the project runs smoothly regardless of what is in our system’s global installation or what other projects use.

We will try creating a virtual environment using the Python 3 venv module. Open your terminal or command prompt and navigate to the directory where you want to set up your project. Then run the following command (replace env_proj with your preferred environment name if desired):

python3 -m venv env_proj

The command above creates a new environment named env_proj. You can name it anything, but the name should be easily recognizable and relevant. Use the command below to activate the environment.

source env_proj/bin/activate

A folder containing all the needed packages is created inside the project directory. We can then install whatever is needed inside the environment using pip.
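As a quick sanity check, you can confirm from inside Python whether a virtual environment is active. In a venv, sys.prefix points at the environment folder, while sys.base_prefix still points at the base Python installation. A minimal sketch:

```python
import sys

def in_virtualenv():
    """Return True when running inside a virtual environment.

    In a venv, sys.prefix points at the environment directory,
    while sys.base_prefix still points at the base Python install.
    """
    return sys.prefix != sys.base_prefix

print("virtual environment active:", in_virtualenv())
print("environment prefix:", sys.prefix)
```

Run this before and after `source env_proj/bin/activate` and you should see the answer (and the prefix) change.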

Pip is the standard package management tool in Python for libraries or modules that are not in the Python standard library. It searches the Python Package Index (PyPI), a public repository of open source packages that can be installed. Developers create a setup containing the package's metadata, build distribution files, and upload them to PyPI. When the user runs an install command, pip locates the package on PyPI, resolves its dependencies, and downloads the wheels.

  • Install (with version specification if needed)

pip install name
pip install name==version
  • Uninstall

pip uninstall name
  • Show information about installed package

pip show name
  • List installed packages

pip list [options]

There are multiple options available for most pip commands; we recommend checking the documentation for full listings (for example, pip.pypa.io/en/stable/cli/pip_install/ for pip install).

Deactivating the virtual environment is very simple with:

deactivate

Now, let’s load the dataset in the environment we created.

import pandas as pd

url = "https://raw.githubusercontent.com/owid/co2-data/master/owid-co2-data.csv"
df = pd.read_csv(url)

print("df shape:", df.shape)
df.head()

Example of ModuleNotFoundError

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[2], line 1
----> 1 import pandas as pd
      3 url = "https://raw.githubusercontent.com/owid/co2-data/master/owid-co2-data.csv"
      4 df = pd.read_csv(url)

ModuleNotFoundError: No module named 'pandas'

This error occurs when, for example, the pandas library cannot be found in the current environment: Python cannot import what does not exist.
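One defensive pattern, shown here as a stdlib-only sketch (not something this project requires), is to catch the missing-module error and print a hint about which package to install:

```python
import importlib

def safe_import(module_name):
    """Try to import a module; return it, or None with a pip hint if missing."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError:
        print(f"Module '{module_name}' not found. Try: pip install {module_name}")
        return None

math_mod = safe_import("math")            # stdlib module, always present
missing = safe_import("no_such_pkg_xyz")  # prints a pip install hint, returns None
```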

We can see this ourselves by running a command for a tool that is not yet installed:

pipreqs .

Which would result in:

/bin/bash: line 1: pipreqs: command not found

This can be confirmed with:

pip show pipreqs
WARNING: Package(s) not found: pipreqs
Note: you may need to restart the kernel to use updated packages.

We’ll install pipreqs and use it in the next question.

Example of installing with pip (these should already exist in Anvil):

pip install pandas matplotlib numpy seaborn

venv provides a very easy way to manage dependencies; however, if you need packages that cannot be installed through pip, or non-Python libraries, venv will not handle them and you may need to find another tool.

Deliverables
  • 1a. In your own words, write a few sentences explaining when and why virtual environments are used. Do the same for dependency management.

  • 1b. Create an environment for this project and install everything shown above.

  • 1c. Load the dataset into a pandas data frame and print the shape and head of the dataset. Write a few sentences on your observation and initial thoughts about the dataset.

  • 1d. Print the number of unique countries and their names.

  • 1e. Find and show the number of missing values and duplicates, and where they occur. Drop the missing values and remove duplicate rows if any exist. Show the output.

  • 1f. Filter the dataframe to get only the rows with "United States" in the 'country' column, and assign the result to a new dataframe.

Question 2 (2 points)

'pip list' and 'pip freeze' both output the installed Python packages. Try running both:

  • pip list

Package                 Version
----------------------- -----------
matplotlib              3.10.3
  • pip freeze

matplotlib==3.10.3

You’ll notice that the formatting is different. pip list returns a table with separate columns for package name and current version. pip freeze returns a list in requirements format (package-name==version), compatible with:

pip install -r requirements.txt

One can install all dependencies needed for a project using requirements.txt. On our end, we can create the file with:

pip freeze > requirements.txt
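To see concretely what this format carries, here is a small stdlib-only sketch that parses requirements-style lines into (name, version) pairs, the same package-name==version shape that pip install -r consumes:

```python
def parse_requirements(text):
    """Parse 'name==version' lines into a list of (name, version) tuples.

    Comment lines and blank lines are skipped; lines without '==' are
    kept with a version of None (plain 'name' entries are also valid).
    """
    pins = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "==" in line:
            name, version = line.split("==", 1)
            pins.append((name, version))
        else:
            pins.append((line, None))
    return pins

sample = """\
# produced by pip freeze
matplotlib==3.10.3
pandas==2.3.1
"""
print(parse_requirements(sample))
# → [('matplotlib', '3.10.3'), ('pandas', '2.3.1')]
```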

Another way we can create a requirements file is through pipreqs.

pip install pipreqs
pipreqs .

'.' indicates the current directory. pipreqs scans all .py files, collects the import and from statements, and looks up each PyPI package name and your installed version. It then writes package==version lines to requirements.txt in alphabetical order.
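To get a feel for the scanning step, here is a simplified sketch using the standard library ast module to collect top-level names from import and from statements (real pipreqs additionally maps these names to PyPI packages and installed versions):

```python
import ast

def find_imports(source_code):
    """Return the sorted set of top-level module names imported in source_code."""
    tree = ast.parse(source_code)
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # 'import os.path' should be recorded as 'os'
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            if node.module:  # skip relative imports like 'from . import x'
                names.add(node.module.split(".")[0])
    return sorted(names)

code = "import pandas as pd\nfrom matplotlib import pyplot as plt\nimport os.path\n"
print(find_imports(code))
# → ['matplotlib', 'os', 'pandas']
```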

Depending on the situation, there are flags we can use with pipreqs, including --force (overwrite an existing requirements.txt), --print (output the detected dependencies instead of writing to a file), --ignore <dir> (exclude a specific directory from the scan), --clean <file> (remove modules that are no longer imported), etc. Again, we recommend checking the documentation.

Deliverables
  • 2a. What is a requirement file and what is it used for? Also, explain in your own words two methods of creating requirements.txt mentioned above, and the difference between them.

  • 2b. Try creating requirements.txt using pip freeze and pipreqs. Show results for both.

Question 3 (2 points)

Conda is another widely used environment management system; the main difference from venv is that it works across multiple languages and is not limited to Python packages. It is also fast, since Conda installs precompiled binary packages, so users worry less about compiling extension modules of libraries written in other languages. Another difference is that while most environments are created and used at the scope of a single project, Conda environments can be shared across projects and activated from any location.

conda create --name <env>

Replace <env> with your environment name. You can also specify the Python version by adding 'python=version' at the end.

Activate with:

conda activate <env>

Deactivate with:

conda deactivate

To check what is installed in the conda environment, use the conda list command.

conda list

Note that you can check the package list of a different Conda environment with:

conda list -n <other environment name>

Remove an existing environment with:

conda remove --name <env> --all

It will remove the environment, dependencies, and packages.

You can check list of other commands at: docs.conda.io/projects/conda/en/stable/commands/index.html

We can install with:

conda install pandas matplotlib numpy seaborn

Again, we want our environments to be reproducible. Conda uses a .yml file to do so, which is a text file in YAML format (very widely used for software project configuration files). The command below outputs a file called environment.yml containing the environment's name, installed packages, versions, and other configuration details.

conda env export > environment.yml

The file structure will look something like:

name: env_proj
channels:
  - defaults
dependencies:
  - pandas=2.3.1
  - and other dependencies you've installed
  • name: The name of your conda environment

  • channels: Section listing which conda channels the packages should be obtained from. defaults (maintained by Anaconda) and conda-forge (maintained by open source contributors) are two very common channels.

  • dependencies: Section listing all packages that should be installed in the environment. Each package name follows a "-", and a version number can be specified with "=" after the name. Any pip-installed packages are listed under a "pip:" subsection.
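Since environment.yml is plain text, we can sketch a minimal reader for the simple layout above without any YAML library (a real parser should use PyYAML; this sketch only handles a flat file with a top-level name and a dependencies list):

```python
def read_simple_env_yml(text):
    """Extract the name and dependency list from a flat environment.yml.

    Only handles the simple layout shown above: a top-level 'name:' line
    and '- package' entries under 'dependencies:'. Use PyYAML for real files.
    """
    name = None
    deps = []
    in_deps = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("name:"):
            name = stripped.split(":", 1)[1].strip()
        elif stripped == "dependencies:":
            in_deps = True
        elif stripped.endswith(":"):
            in_deps = False  # some other top-level section, e.g. channels:
        elif in_deps and stripped.startswith("- "):
            deps.append(stripped[2:])
    return name, deps

yml = """name: env_proj
channels:
  - defaults
dependencies:
  - pandas=2.3.1
  - matplotlib=3.10.3
"""
print(read_simple_env_yml(yml))
# → ('env_proj', ['pandas=2.3.1', 'matplotlib=3.10.3'])
```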

In the environment we created, let’s create some visualizations from our dataset (this should also serve as a refresher on basic Python visualization methods). Here is an example:

import matplotlib.pyplot as plt
import seaborn as sns

# usa_data is the United States dataframe created in question 1f
plt.figure(figsize=(10,4))
sns.lineplot(data=usa_data, x='year', y='co2')
plt.title("USA Annual CO2 Emissions (million tonnes)")

Above is a graphical representation of annual USA carbon dioxide emissions.

Read and understand what each column represents here: github.com/owid/co2-data/blob/master/owid-co2-codebook.csv

Deliverables
  • 3a. Try creating, activating, and installing packages using Conda.

  • 3b. Create an environment.yml file. Explain what this file contains and what it is used for.

  • 3c. What are some advantages of using conda environment? Do you think there are any disadvantages? How does conda differ from some other environment management systems?

  • 3d. Graph the annual USA carbon dioxide emissions.

  • 3e. Create two more graphs representing relationships of your choice. Explain what we can infer from those graphs.

Question 4 (2 points)

We will be writing a function that outputs a graphical representation of CO2 emissions by sector for a given year. If you look at the columns of the dataset, it provides the annual CO2 emissions from each source (coal, cement, gas, oil, and other industry causes).

def plot_emission(df, year):
    sources = {'coal_co2': 'Coal', 'cement_co2': 'Cement', 'gas_co2': 'Gas',
               'oil_co2': 'Oil', 'other_industry_co2': 'Other Industry'}

Additionally, once you have your code written, you will write a docstring for it. A docstring is a way to document your code by describing its usage and functionality with a string literal. Docstrings appear first in a module or a function, and we declare them with three single or double quotes.

Documentation is important for several reasons.

  • Documentation explains not only the code itself, but also why it exists, its functionality, and the specifics of a function or class. This helps you as you code, and helps others who read your code or work on the project with you understand the implementation details. Efficient collaboration is one big part of this.

  • It supports debugging and problem solving by providing references and information about the program, which helps reduce time spent and minimize certain errors.

We will document using Google style docstrings. This format includes a description, function arguments, returns, and raises. Read section 3.8 of google.github.io/styleguide/pyguide.html.

Here is an example:

def plot_emission(df, year):
    """
    For a specific year, creates a bar plot of CO2 emission by sector.
    Sector includes Coal, Cement, Gas, Oil, and Other Industry.
    The plot displays emission values and
    how much each source contributes to total emissions by percentage.

    Args:
        df (pandas.DataFrame): owid dataset containing global CO2 and greenhouse gas emission information.
        year (int): The year for which emission information will be taken and plotted.

    Returns:
        tuple: A tuple containing the Matplotlib figure and axes objects.
    """

    sources = {'coal_co2': 'Coal', 'cement_co2': 'Cement', 'gas_co2': 'Gas',
               'oil_co2': 'Oil', 'other_industry_co2': 'Other Industry'}

... and the rest of the function.

When we write a string literal as a docstring, Python stores it in the object's __doc__ attribute, so we can access the docstring with ".__doc__" (for example, print(function_name.__doc__)).
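A quick demonstration of the __doc__ attribute on a small (hypothetical) function:

```python
def greet(name):
    """Return a greeting for the given name."""
    return f"Hello, {name}!"

print(greet.__doc__)
# → Return a greeting for the given name.

# help(greet) renders the same docstring in a formatted view.
```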
Deliverables
  • 4a. Write a function that takes in a dataframe and a year and outputs a bar plot of emissions from each source for that given year. Calculate the percentage for each source and show that as well.

  • 4b. Include Google style docstring for the function.

  • 4c. There are multiple docstring formats that can be used. Research another one of your choice and write a docstring in that format.

Question 5 (2 points)

In this question, we are going to try out pdoc, a tool for automatically generating API documentation.

First, we’ll briefly take a look at Python scripts, since we will be using one for pdoc. A Python script is a file with the .py extension, and it can be run anywhere Python is installed: from an integrated development environment (IDE), a text editor, the command line or terminal, interactive mode, or even from another script or a web browser. Python reads the given script line by line.

We can run scripts in terminal by:

python filename.py

If using a text editor or an IDE, Python should be available (otherwise, for a text editor, you can get an extension), and the script can then be run.

Notebook content can be exported as a .py module with the command below (nbconvert can convert notebook documents to other formats too):

jupyter nbconvert --to script 'project.ipynb'

From the .py module, pdoc scans our Python functions and classes and extracts the docstring documentation. An HTML page is created from that.

To use pdoc, run:

pdoc ./file_name

To create an HTML file, we can run:

pdoc ./file_name.py -o ./docs

You should see a .html file created under the docs directory.

Here is a basic example. pdoc should output something like below for you:

pdoc output

There are customizations we can add to pdoc. For example, you can add a logo with:

pdoc ./demo.py --logo "link to logo"

logo output

pdoc official documentation: https://pdoc.dev/docs/pdoc.html
Deliverables
  • 5a. Submit the .py file alongside your .ipynb

  • 5b. Try using pdoc and submit a screenshot of pdoc result.

  • 5c. Submit a screenshot of a pdoc version including a logo.

Question 6 (2 points)

Sometimes, we need to make changes to our environment, for example adding more packages or upgrading an existing package to a newer version. In our case, we just added a new pdoc install.

Here is how to make changes in a conda environment. Run the command below after editing environment.yml as needed. It updates the existing environment by comparing the packages and versions in the file with the installed ones: any packages listed in environment.yml but not present in the environment are installed, and likewise any packages present but no longer listed are removed (when --prune is specified).

conda env update --file environment.yml --prune

Deliverables
  • 6a. Edit and update environment.yml. Write down what you added to environment.yml.

  • 6b. Research another dependency management system/tool of your choice. How is it different from the ones we explored? Is there a specific case where we might prefer one management tool over another?

Submitting your Work

Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.

Items to submit
  • firstname_lastname_project2.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not. Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.