Posted: August 12, 2020

CWPK #13: Managing Python Packages and Environments

Keeping Multiple Interacting Parts Current

Early in this Cooking with Python and KBpedia series (CWPK #9), I noted the importance of Anaconda as a package and configuration manager for Python. Akin to the design of the Unix and Linux operating systems, Python applications form an ecosystem of scripts and libraries that are shared and invoked across multiple applications and uses. The largest Python repository, PyPI, itself contains more than 230,000 projects. The basic installation of Anaconda contains about 500 packages in its standard configuration.

Since the overwhelming majority of these projects exist independently of one another, and each progresses on its own schedule of improvements and new version releases, it is not hyperbole to envision the relative stability of a package installer such as Anaconda as masking a bubbling cauldron of constant package changes under the surface. To illustrate this process I will focus on one of the Anaconda packages, Pandoc, which figures prominently in the next installment of this CWPK series.

Pandoc has been around for about 15 years and is the undisputed king of applications for converting one major text format type into another. Pandoc processes external formats using what it calls ‘readers’, converts that external form into an internal representation, and then uses ‘writers’ to output that internal representation into another form useful to external applications. Generally, a given format has both a reader and a writer, though there are a few strays. In the current version of Pandoc (2.9.2.x) there are 33 readers and 55 writers.
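The reader, internal representation, and writer arrangement can be pictured with a toy sketch. This is not Pandoc's code or its data model; the `Block` class and the `read_markdown`, `write_html`, and `write_plain` functions are hypothetical names invented for illustration, and the "Markdown" handled here is only headings and paragraphs.

```python
# Toy illustration of a reader -> internal representation -> writer pipeline.
# NOT Pandoc's implementation; every name below is hypothetical.
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    kind: str  # either "heading" or "para"
    text: str


def read_markdown(source: str) -> List[Block]:
    """Reader: parse a tiny Markdown subset into the internal form."""
    blocks = []
    for line in source.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines only separate blocks
        if line.startswith("# "):
            blocks.append(Block("heading", line[2:]))
        else:
            blocks.append(Block("para", line))
    return blocks


def write_html(blocks: List[Block]) -> str:
    """Writer: render the internal form as HTML."""
    tags = {"heading": "h1", "para": "p"}
    return "\n".join(f"<{tags[b.kind]}>{b.text}</{tags[b.kind]}>" for b in blocks)


def write_plain(blocks: List[Block]) -> str:
    """A second writer: the same internal form, rendered as plain text."""
    return "\n\n".join(b.text.upper() if b.kind == "heading" else b.text for b in blocks)


if __name__ == "__main__":
    doc = read_markdown("# Title\n\nHello")
    print(write_html(doc))
    print(write_plain(doc))
```

Because every reader targets the same internal form, adding one new writer makes that output available to every existing input format, which is why the reader and writer counts can differ.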

A Python environment is a dedicated directory where specific dependencies can be stored and maintained. Environments have unique names and can be activated when you need them, allowing you to have ultimate control over the libraries that are installed at any given time. You can create as many environments as you want. Because each one is independent, they will not interact or ‘mess up’ the other. Thus, it is common for programmers to create new environments for each project that they work on. Often times, information about your environment can assist you in debugging certain errors. Starting with a clean environment for each project can help you control the number of variables to consider when looking for bugs. When it comes to creating environments, you have two choices:
  1. you can create a virtual environment (venv) using pip to install packages or
  2. create a conda environment with conda installing packages for you. [1]
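The first of these two choices can be sketched from Python itself using the standard library's `venv` module. This is a minimal sketch, not a recommended workflow: the helper name `create_project_env` and the `.venv` directory name are assumptions chosen for illustration, and `with_pip=False` is used only to keep the example quick and offline.

```python
import tempfile
import venv
from pathlib import Path


def create_project_env(project_dir: Path, name: str = ".venv") -> Path:
    """Create a virtual environment inside project_dir and return its path.

    The helper name and the default ".venv" directory are illustrative only.
    """
    env_dir = project_dir / name
    # with_pip=False skips bootstrapping pip, so nothing is downloaded;
    # pass with_pip=True if you intend to install packages into the environment.
    venv.create(env_dir, with_pip=False)
    return env_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env = create_project_env(Path(tmp))
        # pyvenv.cfg records which base interpreter the environment uses
        print((env / "pyvenv.cfg").read_text())
```

The conda counterpart is run from the command line rather than from Python: `conda create --name <env-name>` followed by `conda activate <env-name>`.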

In my own work I tend to author documents either in HTML or LibreOffice, corresponding to the *.html and *.odt formats, respectively. However, the Jupyter Notebook that we will be using for our interactive electronic notebooks represents standard formatted text in the Markdown format (*.md), which it combines with interactive portions stored as embedded JavaScript Object Notation (JSON). The combination of these narrative and interactive portions is represented by the *.ipynb format. Markdown is a plain-text format that converts readily to HTML and uses character conventions rather than bracketed tags (for example, ‘-’ for marking bullets or ‘#’ for marking headings). We’ll have many occasions to look at Markdown markup throughout this series. Since I was anticipating switching between writing narratives and interacting with code, I wanted to use my standard writing tools for longer explanations as well as to publish interactive notebook pages on static Web sites. I was investigating Pandoc as a means of ‘round-tripping’ between HTML and *.ipynb in order to leverage the strengths of each.
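To make the *.ipynb description concrete, here is a minimal sketch that assembles a two-cell notebook as JSON using only the standard library. It assumes the version 4 notebook format; the function name `minimal_notebook` and the sample cell contents are made up for illustration, and a real notebook normally carries more metadata than shown here.

```python
import json


def minimal_notebook(markdown_text: str, code_text: str) -> dict:
    """Build a bare-bones notebook structure (assumes notebook format v4)."""
    return {
        "cells": [
            {
                # Narrative portion: Markdown stored as a list of lines
                "cell_type": "markdown",
                "metadata": {},
                "source": markdown_text.splitlines(keepends=True),
            },
            {
                # Interactive portion: code plus slots for its results
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                "source": code_text.splitlines(keepends=True),
            },
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


nb = minimal_notebook("# A heading\n\n- a bullet\n", "print('hello')\n")
notebook_json = json.dumps(nb, indent=1)

if __name__ == "__main__":
    print(notebook_json)
```

Writing `notebook_json` to a file with the `.ipynb` extension should yield something Jupyter can open, with the Markdown cell rendered as narrative and the code cell left ready to run.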

A quick look at the Pandoc site showed that, indeed, both formats were supported. Further, the Pandoc documentation suggested there were ‘switches’ for the readers and writers of these two formats that would likely give me the control I needed to round-trip between formats with few or no errors. So, I downloaded the latest version of Pandoc (updating an earlier version already on my machine), and proceeded to do my set-up work in preparation for the upcoming CWPK installment #14. However, every time I ran the Pandoc command to do the conversion, I got the error message “Unknown output format.”

As I tried to debug this problem I made some discoveries. First, Pandoc was already a package included in Anaconda. Further, while I previously had Pandoc in my environment path, the new path entered when I installed Anaconda was put higher on the list, meaning the Anaconda Pandoc was invoked before the instance I had installed directly. As I investigated the Anaconda packages, I found that it was using Pandoc version 2.2.3.2, which dated from August 7, 2018. In investigating Pandoc releases, I noted that *.ipynb support was not introduced into Pandoc until version 2.6 on January 30, 2019. So, despite what the Web site stated and my own installation of the version from March 23, 2020, the actual Pandoc that was being used in my environment did not support the notebook format!
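The two discoveries here, that search-path order decides which copy of a program runs and that the copy found first was older than version 2.6, can each be checked with a few lines of Python. This is a rough sketch: `find_all_on_path` and `version_tuple` are hypothetical helper names, the path scan only tests that a file exists (it does not check execute permission), and the version parser assumes purely numeric dotted versions such as those quoted above.

```python
import os
from pathlib import Path
from typing import List, Optional, Tuple


def find_all_on_path(program: str, path: Optional[str] = None) -> List[Path]:
    """Return every file named `program` on the search path, in lookup order.

    The first entry is the one a shell would normally invoke.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    # On Windows, executables carry an extension such as .EXE
    suffixes = [""]
    if os.name == "nt":
        suffixes += [s for s in os.environ.get("PATHEXT", "").split(os.pathsep) if s]
    matches = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        for suffix in suffixes:
            candidate = Path(directory) / (program + suffix)
            if candidate.is_file():  # simplification: no execute-bit check
                matches.append(candidate)
    return matches


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '2.2.3.2' into (2, 2, 3, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


if __name__ == "__main__":
    for hit in find_all_on_path("pandoc"):
        print(hit)
    # The Anaconda copy (2.2.3.2) predates *.ipynb support (2.6)
    print(version_tuple("2.2.3.2") < version_tuple("2.6"))
```

Comparing tuples of integers avoids the trap of comparing version strings character by character, which gets cases such as 2.10 versus 2.6 wrong.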

To sate my curiosity I took a random sample of a dozen packages hosted by Anaconda and compared them to later updates that might be found elsewhere on the Web or directly from the developers. I found Anaconda was up to date in about 10 of these 12 instances. However, in the instance of Pandoc the gap was material. This raises two important points. First, when first installing Anaconda or when returning to it after a hiatus, it is important to update your existing distribution. For Anaconda, begin by updating all of the installed packages:

conda update --all

Invoking this option causes a flurry of activity as multiple packages are checked for currency, dependencies, and then proper load orders. These are the kinds of activities that formerly were painful and subject to many inadvertent conflicts as one package updated a dependency that broke another. This kind of update activity is shown by Figure 1.

Figure 1: Updating the Anaconda Environment

Second, we then need additional ways to find and install Python packages. The most common package installer in Python is pip, a leading method for accessing PyPI, though clearly Anaconda chose to use an alternate approach in conda. The philosophy of conda is to manage dependencies and interactions between packages better than pip historically has. There are other repositories that have embraced that same philosophy, and one with even greater dependency testing than conda is conda-forge, also a popular repository for data science packages. In every case I randomly checked, conda-forge had packages as recent as or more recent than those in conda. conda-forge also had the most recent version of Pandoc. Further, conda-forge can be integrated into the Anaconda package installation environment.

Installing Pandoc from the conda-forge channel can be achieved by adding conda-forge to your channels (in this case, within Anaconda) [2]:

conda config --add channels conda-forge

Once the conda-forge channel has been enabled, Pandoc can be installed with:

conda install pandoc

It is possible, obviously, to install other specific packages from conda-forge using this same command format. It is also possible to list all of the versions of Pandoc available on your platform with:

conda search pandoc --channel conda-forge

This same approach may be used for any specific package maintained by conda-forge, while keeping dependencies and Anaconda current.
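After installing, it is worth confirming that the Pandoc now found first on your path actually offers the notebook writer. The sketch below asks Pandoc for its output formats via its `--list-output-formats` option and looks for `ipynb`. The helper names `parse_format_listing` and `pandoc_output_formats` are hypothetical, and the code assumes the option prints one format name per line.

```python
import shutil
import subprocess
from typing import Optional, Set


def parse_format_listing(listing: str) -> Set[str]:
    """Split a one-format-per-line listing into a set of format names."""
    return {line.strip() for line in listing.splitlines() if line.strip()}


def pandoc_output_formats() -> Optional[Set[str]]:
    """Return the output formats of the first pandoc on PATH, or None if absent."""
    exe = shutil.which("pandoc")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--list-output-formats"],
        capture_output=True,
        text=True,
        check=True,  # raise if pandoc exits with an error
    )
    return parse_format_listing(result.stdout)


if __name__ == "__main__":
    formats = pandoc_output_formats()
    if formats is None:
        print("pandoc was not found on PATH")
    else:
        print("ipynb writer available:", "ipynb" in formats)
```

If this reports that the writer is unavailable even after the install, an older copy of Pandoc is probably still being found earlier on the path.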

Here are some resources if you wish to explore Python package management further:

We saw in CWPK #10 that Jupyter Notebook is not able to access all areas of your computer unless you place it at the root. That is never a good idea for security reasons. It is always best to keep your Python working environment sequestered to some extent. Further, as the sources above indicate, if you are to get serious with Python and engage in multiple projects, it is a good idea to use virtual environments as well as dedicated directories. I do not address the topic of virtual environments further in this series, since many who are learning Python for the first time may not need this complexity.

Another truth of such large installations as Anaconda is that it is very tricky (indeed, nearly impossible on a Windows machine) to change the directory in which it was first installed. The safest way to do so is to uninstall Anaconda and then re-install it in the new directory. That can itself be disruptive, so it is not a step to undertake lightly. It is therefore worth paying some attention to how you organize your directory structures. You are best advised to experiment a bit with your Python environment, see what is working and what is not in terms of your workflows and file locations, and then make changes if need be before committing to any true work-dependent tasks. I first introduced the question of directory structure in CWPK #9. We will continue this topic in earnest in our next installment.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.

Endnotes:

[1] Norris, Will, Jenny Palomino, and Leah Wasser. 2019. “Use Conda Environments to Manage Python Dependencies: Everything That You Need to Know.” Earth Data Science – Earth Lab. https://www.earthdatascience.org/courses/intro-to-earth-data-science/python-code-fundamentals/use-python-packages/introduction-to-python-conda-environments/ (April 10, 2020).
[2] Conda-Forge/Pandoc-Feedstock. 2020. conda-forge. Shell. https://github.com/conda-forge/pandoc-feedstock (April 10, 2020).
