Posted: August 18, 2020

owlready2 Appears to be a Capable Option

In the CWPK #2 and #4 installments to this Cooking with Python and KBpedia series, we noted that we would reach a decision point when we needed to determine how we will manipulate our knowledge graphs (ontologies) using the Python language. We have now reached that point. Our basic Python environment is set (at least in an initial specification) and we need to begin inputting and accessing KBpedia to develop and test our needed management and build functions.

In our own efforts over the past five years or more, we have used the Java OWL API initially developed by the University of Manchester. The OWL API is an integral part of the Protégé IDE (see CWPK #5) and supports OWL2. The API is actively maintained. We have been very pleased with the API’s performance and stability in our earlier KBpedia (and other ontology) efforts. In our own Clojure-based work we have used a wrapper around the OWL API. A wrapper using Python is certainly a viable (perhaps even best) approach to our current project.

We still may return to this approach for reasons of performance or capabilities, but I decided to first explore a more direct approach using a Python language option. This decision is in keeping with this series’ Python education objectives. I prefer for these lessons to use a consistent Python style and naming conventions, rather than those in Java. I was also curious to evaluate and test what presently exists in the marketplace. We may gain some advantages from a more direct approach; we may also discover some gotchas or deadends that initial due diligence missed. We can always return to Plan B with a wrapper around the existing OWL API.

If we do need to revert and take the wrapper approach, the leading candidate for the wrapper is py4j. Initial research suggests other Python bridges to Java such as Jython or JPype are less efficient and less popular than py4j. pyJNIus had a similar objective to py4j but has seen no development activity for 4-6 years. The ROBOT tool for biomedical ontologies points the way to how Python can link through py4j. Even if our Python-based approach herein works great, we still may want to embrace py4j as we move forward given the wealth of ontology-related applications written in Java. But I digress.

There is no acclaimed direct competitor to the OWL API in Python, though there are pieces that may approximate its capabilities. Frankly, after beginning my due diligence, I was surprised by the relative dearth of Python tools for working with OWL. Many of the Python projects that do or did exist date back years. There was a bulge of Python tool-making in the mid-2000s that has since cooled substantially, with two notable exceptions I discuss below.

One of those exceptions is RDFLib, a Python library for working with RDF. RDFLib provides a useful set of parsers and serializers and a plug-in architecture, but lacks direct OWL 2 support. FuXi was an OWL reasoner based on RDFLib that used a subset of OWL, but is now abandoned. SuRF is an object-RDF mapper based on RDFLib that enables manipulations of RDF triples, but is somewhat dated. rdftools had a similar objective to RDFLib, but was abandoned about five to seven years ago. owlib is a five-year-old API to OWL built using RDFLib to simplify working with OWL constructs; it has not been updated and is inactive. More recently, infixowl is an RDFLib Python binding for the OWL abstract syntax, which makes it more like the wrapper alternative. Though not immediately applicable to our OWL needs, we may later embrace RDFLib for parsers and serializers or as a useful library for the typologies in KBpedia.

Then there are a number of tools independent of RDFLib. SETH was an attempt at a Python OWL API from about a dozen years back that still required the JVM; it is now largely abandoned (though available via a CVS repository). funowl is a pythonic API that follows the OWL functional model for constructing OWL and it provides a py4j or equivalent wrapper to the standard Java OWL libraries. It appears to be active and is worth keeping an eye on. The ontobio Python module is a library for working with ontologies and associations to outside entities, though it is not an ontology manager.

Fortunately, the second exception is owlready2, a module for ontology-oriented programming in Python 3, including an optimized RDF quadstore. A number of things impressed me about owlready2 in my due diligence. First, its functionality fit the bill for what I wanted to see in an ontology manager dealing with all CRUD (create-read-update-delete) aspects of an ontology and its components. Second, I liked the intent and philosophy behind the system as expressed in its original academic paper and home Web site (see Additional Documentation below). Third, the project is being actively maintained with many releases over the past two years. Fourth, the documentation level was comparatively high for an open-source project and clearly written and understandable. And, last, there is an existing extension to owlready2 that adds support for RDFLib, should we also decide to add that route.

One concern arising from my diligence is the lack of direct Notation3 (N3) file support in owlready2, since all of KBpedia’s current ontology files are in N3. According to owlready2’s developer, Jean-Baptiste Lamy, N-Triples, which are a subset of N3, are presently supported by owlready2. We can test and see if our N3 constructs load or not. If they do not, we can save out our ontology files in RDF/XML, which owlready2 does support. (Indeed, use of the RDF/XML format has proven to be the better approach.) Alternatively, we can do file conversions with RDFLib or the Java OWL API. File format conversions and compatibility will be a constant theme in our work, and this potential hurdle is not unlike others we may face.

Thus, while the pickings were surprisingly thin for off-the-shelf OWL tools in Python, owlready2 appears to have the requisite functionality and currentness and to be a reasonable initial choice. Should this choice prove frustrating, we will likely fall back onto the py4j wrapper to the OWL API or funowl.

So, now with the choice made, it is time to set up our directory structure and install owlready2.

Here is our standard main directory structure with the owlready2 additions noted:

|-- PythonProject
    |-- Python
        |-- [Anaconda3 distribution]
    |-- Notebooks
        |-- CWPKNotebook
    |-- owlready2                # place it at top level of project
        |-- kg                   # for knowledge graphs (kgs) and ontologies
        |-- scripts              # for related Python scripts
    |-- TBA

After making these changes on disk, it is time to install owlready2, which is easy:

    conda install -c conda-forge owlready2

You will see the reports to the terminal as we noted before, and you will need to agree to proceed. Assuming no errors are encountered, you will be returned to the command window prompt. You can then invoke ‘Jupyter Notebook‘ again.

Finding and Opening Files

Let’s begin working with owlready2 by loading and reading an ontology/knowledge graph file. Let’s start with the smallest of our KBpedia ontology files, kko.owl (per the instructions above this is the kko.n3 file converted to RDF/XML in Protégé). (You may download this converted file from here.) I will also assume you stored this file under the owlready2/kg directory noted above.

Important Note: You may be working with these interactive notebooks either online with MyBinder or from your own local file system. In the first case, the files you will be using will be downloaded from GitHub; in the second case, you will be reading directly from your local directory structure. In the instructions below, and in ALL cases where external files are used, we will show you the different Python commands associated with each of these options.

As you begin to work with files in Python on Windows, here are some initial considerations:

  • In Windows, a full file directory path starts with a drive letter (C:, D:, etc.). In Linux and OS-X, it starts with “/”
  • Python lets you use OS-X/Linux style slashes “/” in Windows. It is recommended to use a format such as ‘C:/Main/FirstDirectory/second-directory/my-file.txt’
  • Relative addressing is allowed, with the current directory understood to be the one where you started your interpreter (Jupyter Notebook in our case). However, that is generally not best practice. Python embraces the concept of the Current Working Directory (CWD), the folder your Python is operating from, which might vary by application such as Jupyter Notebook. The CWD is the ‘root‘ for your current session, which means relative file addresses can be tricky to use. You are best off using absolute references to all of your files.

When you work with online file documents, you will need to use different Python commands and conventions, as the examples below show. We will offer more explanation on this specific option when the code below is presented.


To find what your CWD is for your current session:

import os
dir(os)

Note there are a couple of things going on in this snippet. First, we have imported the Python built-in module called ‘os‘. Not all commands are brought into memory when you first invoke Python. In this case, we are invoking (or ‘importing’) the os module.

Second, we have invoked the dir command to get a listing of the various functions within the os module. So, go ahead and shift+enter this cell or Run it from the Jupyter Notebook menu to see what os contains.

We can invoke other functions with a similar syntax. Another option besides dir is to get help on the current module:

help(os)

Note these same dir and help commands can be applied to any module active in the system.
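To actually retrieve the CWD value discussed above, the os module provides the getcwd function; a quick check might look like this:

```python
import os

cwd = os.getcwd()            # the directory the interpreter was started from
print(cwd)                   # e.g., the folder where Jupyter Notebook launched
print(os.path.isabs(cwd))    # the CWD is always reported as an absolute path
```

Knowing this value helps explain where any relative file references in your session will resolve.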

This next example shows another function in os called ‘walk‘. We invoke this function by calling the combined module and function notation using the dot (.) syntax (‘os.walk‘). We will add a couple more statements to get our directory listing to display (‘print()‘) the directory file names to screen:

for dirpath, dirnames, files in os.walk('.'):
    print(f'Found directory: {dirpath}')
    for file_name in files:
        print(file_name)

One of the first things you will learn about Python is that there are often multiple modules, and modules within external libraries, that may be invoked for a given task. It takes time to discover and learn these options, but that is also one of the fun parts of the language.

Our next example shows just this, using a new package, pathlib, useful for local files, that has some great path management functions. (This library will be one of our stalwarts moving forward.)

Remember we can import functions from add-ons beyond the Python built-ins. We again do so via modules using the import statement, but we now need to identify the library (or ‘package’) where that module resides. We do so via the ‘from‘ statement. Remember, external libraries need to be downloaded and registered via Anaconda (conda or conda-forge) prior to use if they are not already installed on your system. (Recall that our installed packages are at C:\1-PythonProjects\Python\pkgs, based on my own configuration.)

In this next example we use the home() method of the Path class in the pathlib package. The home() method tells us the current user’s home directory:

from pathlib import Path
home = Path.home()
print(home)
C:\Users\Michael

Windows is a tricky environment for handling file names, since the native operating system (OS) requires backslashes (‘\‘) rather than forward-slashes (‘/‘) and also requires the drive designation for absolute paths. We also have the issue of relative paths, which because of the CWD (current working directory) can get confused in Python (or rather, in our use of Python).
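One way to see this slash behavior, even without being on Windows, is with pathlib’s ‘pure’ path classes, which model a platform’s path rules without touching the file system (the path below is illustrative):

```python
from pathlib import PureWindowsPath

# pathlib accepts forward slashes and renders the native Windows separators:
p = PureWindowsPath('C:/Main/FirstDirectory/second-directory/my-file.txt')
print(p)          # C:\Main\FirstDirectory\second-directory\my-file.txt
print(p.drive)    # C:
print(p.name)     # my-file.txt
```

This is one reason pathlib will be a stalwart for us: it hides most of the separator headaches behind one consistent interface.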

One good habit is to declare your file target as a variable (say, ‘path‘), make sure the reference is good, and then refer to the ‘path‘ object in the rest of the code to prevent confusion. One code approach to this, including a print of the referenced file, is:

path = r'C:\1-PythonProjects\owlready2\kg\kko.owl'         # see (A)
# path = 'https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kko.owl'      # see (A)
with open(path) as fobj:                                   # see (B)
    for line in fobj:
        print (line, end='')

Note, this example may not work unless you are using local files.

We get the absolute file name (A) on Windows by going to its location within Windows Explorer, highlighting our desired file in the right panel, and then right-clicking on the path listing shown above the pane and choosing ‘Copy address as text’; that is the information placed between the quotes on (A). Note also the ‘r‘ switch on this line (A) (no space after ‘r‘!), which means ‘raw’ and enables the Windows backslashes to be interpreted properly. Go ahead and shift+enter this cell and see the listing (which is also useful to surface any encoding issues, which will appear at the end of the file listing should they exist).
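A defensive variant of the snippet above, assuming the same local target, first checks that the file actually exists before opening it, which avoids a raw traceback when a path is mistyped:

```python
from pathlib import Path

# Same hypothetical local target as in the example above
path = Path(r'C:\1-PythonProjects\owlready2\kg\kko.owl')
if path.is_file():
    with path.open(encoding='utf-8') as fobj:
        for line in fobj:
            print(line, end='')
else:
    print(f'File not found: {path}')
```

On a system without that file the snippet simply reports the missing path rather than raising an error.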

Now, the example above is for local files. If you are using the system via MyBinder, we need to load and view our files from online. Here is a different format for accessing such information:

import urllib.request 

path = 'https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kko.owl'
for line in urllib.request.urlopen(path):
    print(line.decode('utf-8'), end='')

A couple of items for this format deserve comment. First, we need to import a new package, urllib, that carries with it the functions and commands necessary for accessing URLs. There are multiple options available in Python for doing so. This particular one presents, IMO, one of the better formats for viewing text files. Second, we declare the UTF-8 encoding, a constant requirement and theme through the rest of this CWPK series. And, third, we add the attribute option of end='' in our print statement to eliminate the extra lines in the printout that occur without it. Python functions often have many similar options or switches available.

In any case, the above gives us the basis to load the upper ontology of KBpedia called KKO. We now turn to how we begin to manage our knowledge graphs.

Import an Ontology

So, let’s load our first ontology into owlready2 applying some of these concepts:

from owlready2 import *

# the local file option (with 'path' set to the local kko.owl file):
# onto = get_ontology(path).load()

# the remote file (URL) option (with 'path' set to the GitHub URL above):
onto = get_ontology(path).load()

Inspect Ontology Contents

We do not get a confirmation that the file loaded OK into the object we named onto, other than that no error messages appeared (which is good!). Just to test whether everything proceeded OK, let’s ask the system to return (print to screen) a known class from our kko.owl ontology called ‘Generals‘:

print(onto.Generals)
        

The same dot notation can be applied to all of the ontology components (in this case, the class ‘Generals’).

We can also list all of the classes, or the disjoint classes, in the ontology:

list(onto.classes())
list(onto.disjoint_classes())

Armed with these basics we can begin to manipulate the components in our knowledge graph, the topic for our next installment.

Additional Documentation

Here is additional documentation on owlready2:

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on August 18, 2020 at 9:40 am in CWPK, KBpedia, Semantic Web Tools | Comments (4)
The URI link reference to this post is: https://www.mkbergman.com/2347/cwpk-17-choosing-and-installing-an-owl-api/
The URI to trackback this post is: https://www.mkbergman.com/2347/cwpk-17-choosing-and-installing-an-owl-api/trackback/
Posted: August 17, 2020

Most of the Effort in Coding is in the Planning

With the environment in place, it is now time to plan the project underlying this Cooking with Python and KBpedia series. This installment formally begins Part II in our CWPK installments.

Recall from the outset that our major objectives of this initiative, besides learning Python and gaining scripts, were to manage and exploit the KBpedia knowledge graph, to expose its build and test procedures so that extensions or modifications to the baseline KBpedia may be possible by others, and to apply KBpedia to contemporary challenges in machine learning, artificial intelligence, and data interoperability. These broad objectives help to provide the organizational backbone to our plan.

We can thus see three main parts to our project. The first part deals with managing, querying, and using KBpedia as distributed. The second part emphasizes the logical build and testing regimes for the graph and how those may be applied to extensions or modifications. The last part covers a variety of advanced applications of KBpedia or its progeny. As we define the tasks in these parts of the plan, we will also identify possible gaps in our current environment that we will need to rectify for progress to continue. Some of these gaps we can identify now and so filling them will be some of our most immediate tasks. Other gaps may only arise as we work through subsequent steps. In those instances we will need to fill the gaps as encountered. Lastly, in terms of scope, while our last part deals with advanced applications that we can term ‘complete’ at some arbitrary number of applications, the truth is that applications are open-ended. We may continue to add to the roster of advanced applications as time and need allows.

Important Series Note: As first noted in CWPK #14, beginning with this installment every new CWPK article is available as an interactive Jupyter Notebook page. The first interactive installment was actually CWPK #14, and we have reached back and made those earlier pages available as well.

Each of these new CWPK installments is available both as an online interactive file or as a direct download to use locally. For the online interactive option, pick one of the *.ipynb files. The MyBinder service we are using for the online interactive version maintains a Docker image for each project. Depending on how long it has been since someone last requested a CWPK interactive page, sometimes access may be rapid since the image is in cache, or it may take a bit of time to generate another image anew. We discuss this service more in CWPK #57.

Part I: Using and Managing KBpedia

Two immediate implications of the project plan arise as we begin to think it through. First, because of our learning and tech transfer objectives for the series, we have the opportunity to rely on the electronic notebook aspects of Jupyter to deliver on these objectives. We thus need to better understand how to mix narrative, working code, and interactivity in our Jupyter Notebook pages. Second, since we need to bridge between Python programs and a knowledge graph written in OWL, we will need some form of application programming interface (API) or bridge between these programmatic and semantic worlds. It, too, is a piece that needs to be put in place at the outset.

This additional foundation then enables us to tackle key use and management aspects for the KBpedia knowledge graph. First among these tasks are the so-called CRUD (create-read-update-delete) activities for the structural components of a knowledge graph:

  • Add/delete/modify classes (concepts)
  • Add/delete/modify individuals (instances)
  • Add/delete/modify object properties
  • Add/delete/modify data properties and values
  • Add/delete/modify annotations.

We also need to expand upon these basic management functions in areas such as:

  • Advanced class specifications
  • Advanced property specifications
  • Multi-lingual annotations
  • Load/save of ontologies (knowledge graphs)
  • Copy/rename ontologies.

We also need to put in place means for querying KBpedia and using the SPARQL query language. We can enhance these basics with a rules language, SWRL. Because our use of the knowledge graph involves feeding inputs to third-party machine learners and natural language processors, we need to add scripts for writing outputs to file in various formats. We want to add to this listing some best practices and how we can package our scripts into reusable files and libraries.

Part II: Building, Testing, and Extending the Knowledge Graph

Though KBpedia is certainly usable ‘as is’ for many tasks, importantly including as a common reference nexus for interoperating disparate data, maximum advantage arises when the knowledge graph encompasses the domain problem at hand. KBpedia is an excellent starting point for building such domain ontologies. By definition, the scope, breadth, and depth of a domain knowledge graph will differ from what is already in KBpedia. Some existing areas of KBpedia are likely not needed, others are missing, and connections and entity coverage will differ as well. This part of the project deals with building and logically testing the domain knowledge graph that morphs from the KBpedia starting point.

For years now we have built KBpedia from scratch based on a suite of canonically formatted CSV input files. These input files are written in a common UTF-8 encoding and duplicate the kind of tuples found in an N3 (Notation3) RDF/OWL file. As a build progresses through its steps, various consistency and logical tests are applied to ensure the coherence of the built graph. Builds that fail these tests are flagged with errors, which require fixes to the input files before the build can resume and progress to completion. The knowledge graph that passes these logical tests might be used or altered by third-party tools, prominently including Protégé, during the use of and interaction with the graph. We thus also need methods for extracting the build files from an existing knowledge graph in order to feed the build process anew. These various workflows between graph and build scripts and tools are shown in Figure 1:

General Workflow of the KBpedia Project
Figure 1: General Workflow of the KBpedia Project

This part of the plan will address all steps in this workflow. The use of CSV flat files as the canonical transfer form between the applications also means we need to have syntax and encoding checks in the process. Many of the instructions in this part deal with good practices for debugging and fixing inconsistent or unsatisfied graphs. At least as we have managed KBpedia to date, every new coherent release requires multiple build iterations until the errors are found and corrected. (This area has potential for more automation.)

We will also spend time on the modular design of the KBpedia knowledge graph and the role of (potentially disjoint) typologies to organize and manage the entities represented by the graph. Here, too, we may want to modify individual typologies or add or delete entire ones in transitioning the baseline KBpedia to a responsive domain graph. We thus provide additional installments focused solely on typology construction, modification, and extension. Use and mapping of external sources is essential in this process, but is never cookie-cutter in nature. Having some general scripts available plus knowledge of creating new relevant Python scripts is most helpful to accommodate the diversity found in the wild. Fortunately, we have existing Clojure code for most of these components so that our planning efforts amount more to a refactoring of an existing code base into another language. Hopefully, we will also be able to improve a bit on these existing scripts.

Part III: Advanced Applications

Having full control of the knowledge graph, plus a working toolchest of applications and scripts, is a firm basis to use the now-tailored knowledge graph for machine learning and other advanced applications. The plan here is less clear than the prior two parts, though we have documented existing use cases with code to draw upon. Major installments in this part are likely in creating machine learning training sets, in creating corpora for unsupervised training, generating various types (word, statement, graph) of embedding models, selecting and generating sub-graphs, mapping external vocabularies, categorization, and natural language processing.

Lastly, we reserve a task in this plan for setting up the knowledge graph on a remote server and creating access endpoints. This task is likely to occur at the transition between Parts II and III, though it may prove opportune to do it at other steps along the way.


Posted by AI3's author, Mike Bergman Posted on August 17, 2020 at 10:01 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2345/cwpk-16-planning-the-project/
The URI to trackback this post is: https://www.mkbergman.com/2345/cwpk-16-planning-the-project/trackback/
Posted: August 14, 2020

Recipes for Jupyter Notebooks Going Forward

In the last installment of the Cooking with Python and KBpedia series, we began to learn about weaving code and narrative in a Jupyter Notebook page. We also saw that we can generate narratives to accompany our code with the Markdown mark-up language, though it is not designed (in my view) for efficient document creation. Short explanations between code snippets are fine in Jupyter Notebook, but longer narratives or ones where formatting or decorating are required are fairly difficult. (For an update, see the NB box at the conclusion of this installment.) Further, we also want to publish Web pages independent of our environment. What I describe in this CWPK installment is how I combine standard Web page editing and publishing with Jupyter, as well as the starting parts to my standard workflow.

Having a repeatable and fairly efficient workflow for formulating a lesson or question, then scoping it out framed with introduction and working parts, and then skeletonizing it such that good working templates can be put in place is important when one contemplates progressing through all of the stages of discovering, addressing, and documenting a project. In the case of this CWPK series, this is not a trifling consideration. I am anticipating literally dozens of installments in this series; heck, we are already at installment #15 and we haven’t begun yet to code anything in Python! We could stitch together more direct methods of doing a given task, but that will not necessarily arm us to do a broader set of tasks.

Not everyone prefers my style of trying to get systems and game plan in place before tackling a big task, in which case I suggest you skip to the end where we conclude with a discussion of directory organization. For this initial part, however, I will assume that you want to sometimes rely on an interactive coding environment and other times want to generate narratives efficiently. In this use case, the ability to ‘round-trip‘ between HTML editing and Jupyter is an important consideration. Efficiency and document size are relevant considerations, too.

Recall in our last installment that we pointed to two ways to get HTML pages from a Jupyter Notebook: 1) either from a download, or 2) from invoking the nbconvert service from a command window. We could not invoke nbconvert from within a notebook page because it is a Jupyter service. This next frame shows the file created from the article herein using the download method. You invoke the cell by entering shift+enter to call up the file, and then, once inspected, use Cell → All Output → Clear to clear and collapse the view area:

with open('files/cwpk-15-using-notebooks-download.txt', 'r') as f:
    print(f.read())

I should mention that both the nbconvert and download methods produce similarly bloated files. Go ahead, scroll through it. While the generated file renders very well, it is about 10x larger than the original HTML file that captures its narrative (13,644 v 397 lines; 298 K v 30 K). This bloat in file size is due to the fact that all of the style information (*.css) contained in the original document gets re-expressed in this version, along with much other styling information not directly related to this page. Thus, while the generation of the page is super easy, and renders beautifully, it is an overweight pig. We could spend some time whittling down this monster to size with some of the built-in functionality of nbconvert, but why not deal with that problem using Pandoc directly, upon which nbconvert is based?
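For instance, a direct Pandoc round trip between the downloaded HTML and Markdown might look like the commands below (file names are illustrative, and this assumes Pandoc is installed and on the PATH):

```shell
# HTML exported from the notebook -> Pandoc's own Markdown flavor
pandoc -f html -t markdown cwpk-15-using-notebooks.html -o cwpk-15-using-notebooks.md

# ... and back to standalone HTML for publishing (-s adds a full document wrapper)
pandoc -f markdown -t html -s cwpk-15-using-notebooks.md -o cwpk-15-roundtrip.html
```

Comparing the round-tripped HTML against the original is a quick way to see exactly which constructs survive the conversion and which do not.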

So, in testing the cycle from HTML to notebooks and back again, we find that certain aspects of generating project documentation present challenges. In working through the documentation for this series I have found these types of problem areas for round-tripping:

  • Use of a standard, formatted header (with logo)
  • Use of standard footers (notification boxes in our case)
  • Centering images
  • Centering text
  • Tables, and
  • Loss of the interactive functionality in the notebook in the HTML.

Only the last consideration is essential to create useful project and code documentation. However, if one likes professional, well-formatted pages with loads of images and other pretty aspects, it is worth some time to work out productive ways to handle them. In broad terms, for me, that means being able to move between Web page authoring and interactive code development, testing, and documentation. I also decided to devote some time to these questions as a way to better understand the flexibilities and power of the tools we have chosen. We will always encounter gaps in knowledge when working new problems. I’d like to find the practical balance between the de minimis path to get something done and learning enough to be able to travel similar paths in the future, perhaps even in a production mode.

Since Markdown’s native syntax covers only a subset of HTML, it is not possible to round-trip using Markdown alone within Jupyter Notebook. Fortunately, many Markdown interpreters, including Jupyter, accept some limited HTML in documents. There are two ways that may happen. The first is to use one of the so-called ‘magic’ terms in iPython, the command shell underneath Jupyter Notebook. By placing the magic term %%html at the start of a notebook code cell, we instruct the system to render that entire cell as HTML. Since it is easy to stop a cell and add a new one below it, we can ‘fence’ such aspects in our notebook code bases. I encourage you to study other ‘magic’ terms from the prior link that are shortcuts to some desired notebook capabilities.
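A cell flagged with this magic might look like the hypothetical fragment below (the image path and styling are illustrative); everything after the %%html line is rendered as raw HTML:

```
%%html
<div style="text-align: center;">
  <img src="files/cwpk-logo.png" alt="CWPK logo" />
  <p>A centered caption, which plain Markdown cannot easily express</p>
</div>
```

This gives us centered images and captions, two of the round-tripping trouble spots noted above, without leaving the notebook.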

A second way to use HTML in notebooks is to embed HTML tags. This way is trickier since the various Markdown evaluation engines — due to Markdown’s diversity of implementations — may recognize different tags or, when recognized, treat them differently. One of the reasons to embrace Pandoc, introduced in the last installment, is to accept its standard way of handling languages, markups, formats, and functions.

Boiled down to its essence, then, we have two functional challenges in round-tripping:

  1. Loss of HTML tags and styling with Markdown
  2. Loss of notebook functionality in HTML.

One of Pandoc’s attractions is that both <div> and <span> can be flagged to be skipped in the conversions, which means we can isolate our HTML changes to these tag types, with divs giving us block ‘fencing’ capabilities and spans inline ‘fencing’ capabilities. (There are also Lua filter capabilities with Pandoc to provide essentially unlimited control over conversions, but we will leave that complexity outside of our scope.) Another observation we make is that many of the difficult tags that do not round-trip well deal with styling or HTML tags that can be captured via CSS styling.

Another challenge that must be deciphered is the many flavors of Markdown that appear in the wild. Pandoc handles many flavors of Markdown, including the specified readers of markdown, markdown_strict, markdown_mmd, markdown_phpextra, and gfm (GitHub-flavored Markdown). One can actually ingest any of these flavors in Pandoc and express any of the others. As I noted in the last installment, Pandoc presently has 33 different format readers from Word docs to rich text, and can write out even more formats. For our purposes, however, it is best to choose Pandoc’s canonical internal form of markdown. Beyond translation purposes, though, the gfm option likely has the broadest external applicability.

OK, so it appears that Pandoc’s own flavor of Markdown is the best conversion target and that we will try to move problem transfer areas to div and span. As for the loss of notebook functionality in HTML, there is no direct answer. However, because an interactive notebook page is organized in a sequence of cells, we can segregate activity areas from interactive areas in our documents. That does not give us complete document convertibility, but we can do it in sections if need be after initial drafting. With this basic approach decided, we begin to work through the issues.

After testing inline styling, we see that we can find recipes that move the CSS and related HTML (such as <center> or <i> or <em>) between our HTML and notebook environments without loss. Once we go beyond proofs of concept, however, we want to be able to capture our CSS information in styling class and ID designations so that we do not need to duplicate lengthy styling code. However, handling and referencing CSS stylesheets is not straightforward given the complexity of applications and configuration files in an Anaconda Python distribution and environment. For a very useful discussion of CSS in this context see Jack Northrup’s notebook on customizing CSS.

Now, the tools at both ends of this process, Jupyter Notebook and Pandoc, each recognize that users will want their own custom.css files to guide these matters. But, of course, each tool has a different location and mechanism for specifying this. There is surprisingly little documentation or guidance on the Web for how to handle these things. Most of the references I encountered on these matters were incorrect. We have two fundamental challenges in this quest: 1) how do we define our custom.css file, and where do we locate it on disk?; and 2) what are our command-line instructions to best guide the two round-trip conversion steps? We will use Pandoc and proper stylesheet locations to answer both questions.

Let’s take the first trip of moving from an HTML draft into an operating shell for Jupyter Notebook. First, as we draft material with an HTML editor, we are going to want to store the custom.css information that we need to segregate into some pre-defined, understood location. One way to do that is through a relative location with respect to where our authored HTML document resides. An easy choice is to create a sub-directory of files that is placed immediately below where our HTML document resides. If we follow this convention, we may always find the stylesheet (CSS) in the relative location of ‘files/custom.css’. (You may name the subdirectory something different, but it is probably best to retain the name of ‘custom.css’ since that is expected by Jupyter.)

However, those same CSS specifications need to be available to Jupyter Notebook, which follows different file look-up conventions. One way to discover where Jupyter Notebook expects to find its supplementary CSS files is to open the relevant notebook page and save it as HTML (download or nbconvert or Pandoc methods). When you open the HTML file with an editor, look for the reference to ‘custom.css’. That will give you the file location in relation to your document’s root. In my case, this location is C:\1-PythonProjects\Python\Lib\site-packages\notebook\static\custom. For your own circumstance, it may also be under the \user\user\ profile area, depending on whether you first installed Anaconda for an individual user. At any rate, look for the \Lib\ directory and then follow the directory cascade under your main Python location.
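Rather than hunting through saved HTML by hand, the same look-up can be scripted. Below is a minimal sketch using only the Python standard library; it assumes the classic Notebook layout described above (a static/custom directory beneath the installed notebook package), and the function name is my own:

```python
# Sketch: locate where the classic Jupyter Notebook expects custom.css,
# without hard-coding an install path. Assumes the pre-Notebook-7
# 'static/custom' convention discussed in the text.
import importlib.util
import os

def jupyter_custom_css_path():
    """Return the expected custom.css path, or None if notebook is absent."""
    spec = importlib.util.find_spec("notebook")
    if spec is None or spec.origin is None:
        return None
    base = os.path.dirname(spec.origin)
    return os.path.join(base, "static", "custom", "custom.css")

path = jupyter_custom_css_path()
print(path if path else "notebook package not installed")
```

On my Windows setup this resolves to the site-packages location noted above; on Linux or in a per-user Anaconda install the prefix will differ, which is exactly why computing it beats guessing.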

NB: Unfortunately, should you later update your Jupyter Notebook package, you may find that your custom.css is overwritten with the standard, blank placeholder. You may again need to copy your active version into this location.

Once you have your desired CSS both where Pandoc will look for it (again, relative is best) and where Jupyter Notebook will separately look for it, we can concentrate on getting our additional styles into custom.css, which you may click on to see its contents for this current page. Once we have populated custom.css, it is now time to figure out the conversion parameters. There is much flexibility in Pandoc for all aspects of instructing the application at the command line. I present one hard-earned configuration below, but for your own purposes, I strongly recommend you inspect the PDF version of the Pandoc User Guide should you want to pursue your own modifications. At any rate, here is the basic HTML → Notebook initial conversion, using this current page as the example:

$ pandoc -f html -t ipynb+native_divs cwpk-15-using-notebooks.html -o cwpk-15-using-notebooks.ipynb

Here is what these command-line options and switches mean:

  • -f html – -f (also --from) is the source or from switch, with html indicating the source format type. Multiple types are possible, but only one may be specified at a time
  • -t ipynb – -t (also --to) is the target of the conversion, with ipynb in this case indicating a notebook document. Multiple types are possible, but only one may be specified here
  • +native_divs – this is a conversion switch that tells Pandoc to retain the content within a native HTML div in the source
  • cwpk-15-using-notebooks.html – this is the source file specification. Once an input file type is specified, defaults within Pandoc allow -f html (and the equivalents for other formats) to be omitted
  • -o cwpk-15-using-notebooks.ipynb – this is the output (-o) file name; if left unspecified, Pandoc writes the result to standard output rather than to a file.

These commands and switches require the Windows command window or PowerShell to be opened in the same directory as the *.html document you are converting when you instruct at the command line. Upon entering this command, the return of the prompt tells you the conversion ran to completion.

This command will now establish a new notebook file (*.ipynb) in the same directory. Please make sure this directory location is under the root you established when you installed Jupyter Notebook (see CWPK #10 if you need to refresh that or change locations).
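If you find yourself repeating this conversion for many installments, the same invocation can be scripted. Here is a minimal sketch using Python’s standard subprocess module; build_cmd is a hypothetical helper of my own, and the conversion only runs if pandoc is actually found on the PATH and the source file exists:

```python
# Sketch: scripted HTML -> notebook conversion mirroring the pandoc
# command-line recipe above. Nothing here is part of Pandoc itself.
import shutil
import subprocess
from pathlib import Path

def build_cmd(html_file: str) -> list:
    """Assemble pandoc arguments for an HTML -> ipynb conversion."""
    out = str(Path(html_file).with_suffix(".ipynb"))
    return ["pandoc", "-f", "html", "-t", "ipynb+native_divs",
            html_file, "-o", out]

src = "cwpk-15-using-notebooks.html"
cmd = build_cmd(src)
print(" ".join(cmd))

# Run only when pandoc is installed and the source file is present
if shutil.which("pandoc") and Path(src).exists():
    subprocess.run(cmd, check=True)
```

The guard clauses keep the script harmless to run anywhere; looping build_cmd over a directory of drafts would batch-convert a whole series.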

When you invoke Jupyter Notebook and call up the new *.ipynb file, it will open as a single Markdown cell. If you need to split that input into multiple parts in order to interweave interactive parts, double-click to edit, cut the sections you need to move, Run the cell, add a cell below, and paste the split section into the new cell. In this way, you can skeletonize your active portions with existing narrative.

Upon completing your activities and additions and code tests within Notebook, you may now save out your results to HTML for publishing elsewhere. Again, you could Download or use nbconvert, but to keep our file sizes manageable and to give ourselves the requisite control we will again do this conversion with Pandoc. After saving your work and exiting to the command window, and while still in the current working directory where the *.ipynb resides, go ahead and issue this command at the command window prompt:

$ pandoc -s -f ipynb -t html -c files/custom.css --highlight-style=kate cwpk-15-using-notebooks.ipynb -o cwpk-15-using-notebooks-test.html

There we have it! We now have our recipes to move from HTML to *.ipynb and the reverse!

Here is what the new command-line options and switches mean:

  • We have now reversed the -f and -t switches since we are now exporting to HTML; again, multiple format options may be substituted here (though specific options may change depending on format)

  • -s means to process the export as standalone, which will bring in the HTML information outside of the <body> tags
  • -c (or --css=) tells the writer where to find the supplementary, external CSS file. This example is the files subdirectory under the current notebook; the file could be whatever.css, but we keep the custom.css name to be consistent with the required name for Jupyter (even though in a different location)
  • --highlight-style=kate is one of the language syntax highlighting options available in Pandoc; there are many others available and you may also create your own
  • -o cwpk-15-using-notebooks-test.html – is an optional output only if we want to change the base name from the input name; during drafting it is recommended to use another name (-test) to prevent inadvertent overwriting of good files.

Upon executing this command, you will get a successful export, but a message indicating you did not provide a title for the project and it will default to the file name as shown by Figure 1:

Figure 1: Message at HTML Export

There are metadata options you may assign at the command line, plus, of course, many other configuration options. Again, the best consolidated source for learning about these options is the PDF Pandoc User's Guide. This document is kept current with the many revisions that occur frequently for Pandoc.

The next panel shows the HTML generated by this export. Note this document is much smaller (10x) than the version that comes from the download or nbconvert methods:

with open('files/cwpk-15-using-notebooks-html.txt', 'r') as f:
    print(f.read())

This HTML export is good for publication purposes, but lacks the interactivity of its notebook parent. You should thus refrain from such exports until development is largely complete. In any case, we still see static sections for the interactive portions of the notebook. These were styled according to custom.css.

NB: The Web pages that appear on my AI3 blog are the HTML conversions of these interactive notebook pages. The information box at the bottom of each installment page instructs as to where you may obtain the fully interactive versions.

Using similar commands you can also produce outputs in other formats, such as this one for the GitHub flavor of Markdown using this command line instruction:

$ pandoc -s -f ipynb -t gfm -c files/custom.css --highlight-style=kate cwpk-15-using-notebooks.ipynb

Note we have changed the -t option to gfm and have removed the -o output option, which means the converted result is written to standard output. Here is the output from that conversion:

with open('files/cwpk-15-using-notebooks-md.txt', 'r') as f:
    print(f.read())

You can see that headers are more cleanly shown by # symbols, and that gfm is a generally clean design. It is becoming the de facto standard for shared Markdown.

Of course, no export is necessary for the actual notebook file, since it is already plain text. As noted in earlier installments, Jupyter Notebook files are natively expressed in JavaScript Object Notation (JSON). This is the only file representation that contains transferable instructions for the interactive code cells in the notebook page. This JSON file is for the same content expressed in the files above:

with open('files/cwpk-15-using-notebooks-ipynb.txt', 'r') as f:
    print(f.read())
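As a small, self-contained illustration of that JSON structure, the following builds a minimal notebook in memory with the standard library; the cell contents are invented for the example, not taken from an actual CWPK file:

```python
# Sketch: a notebook file is just JSON, so the standard library can
# inspect it directly. We round-trip a minimal two-cell notebook here.
import json

minimal_nb = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# A heading"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print('Hello KBpedia!')"]},
    ],
})

nb = json.loads(minimal_nb)
for cell in nb["cells"]:
    print(cell["cell_type"], "->", "".join(cell["source"]))
```

Swap the json.loads call for reading a real *.ipynb file from disk and the same loop will report the cell types of any notebook in this series.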

Mastery of these tools is a career in itself. I’m sure there are better ways to write these commands or even to approach the workflow. As the examples presently stand, there are a few minor glitches that keep this round-tripping from being completely automatic. Relative file locations get re-written with an ‘attachment:‘ prefix during the round-trip, which must be removed from the HTML code to get images to display. For some strange reason, images also need to have a width entry (e.g., width="800") in order not to be converted to Markdown format. Also, in some instances HTML code within a div gets converted to Markdown syntax, which then cannot be recognized when later writing to HTML. The Pandoc system is full-featured and, I am sure, difficult to master without much use.
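The ‘attachment:’ glitch, at least, is easy to repair mechanically. Here is a sketch of a clean-up step; the function name and regular expression are my own, and cover only the src-attribute case noted above:

```python
# Sketch: post-process round-tripped HTML to undo the 'attachment:'
# prefix that gets added to relative image paths during conversion.
import re

def strip_attachment_prefix(html: str) -> str:
    """Remove 'attachment:' from src attributes so images display again."""
    return re.sub(r'src="attachment:', 'src="', html)

sample = '<img src="attachment:files/image-1.png" width="800"/>'
print(strip_attachment_prefix(sample))
# -> <img src="files/image-1.png" width="800"/>
```

Running the converted HTML through a filter like this, before publishing, removes one manual step from the workflow.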

In working with these tools, here is what I have discovered to be a good starting workflow:

  1. Author initial skeleton in HTML. Write intro, get references and links, set footer. Try to use familiar editing tools to apply desired formatting and styles
  2. Add blocks and major steps, including some thought for actual interactive pieces; name this working file with an -edit name extension to help prevent overwriting it
  3. Convert to Notebook format and transfer to notebook
  4. Work on interaction steps within Jupyter Notebook, one by one. Add a narrative lead-in and following commentary to each given step. If the narrative is too long or too involved to readily handle in the notebook with Markdown, save, and revert to the HTML version to draft the interstitial narrative for its Markdown cell
  5. Generate Markdown for the connecting cell, copy back into the working notebook page
  6. Repeat as necessary to work through the interaction steps
  7. Save, and generate the new notebook
  8. Export to multiple publication platforms.

Directory and File Structure

These conversion efforts have also helped refine the directory structure useful to this workflow. I first began laying out a broad directory structure for this project in CWPK #9. We can now add a major branch under our project for notebooks, with a major sub-branch being this one for the CWPK series. I personally use CamelCase for naming my upper directory levels, those I know will likely last for a year or more. Lower levels I tend to name in lower case with hyphen separators, more akin to the consistent treatment on Linux boxes.

Here is how I am setting up my directory structure:

|-- PythonProject                 # directory first introduced in CWPK #9
|   |-- Python
|   |   |-- [Anaconda3 distribution]
|   |-- Notebooks                 # see next directory expansion
|   |   |-- CWPKNotebook
|   |-- TBA                       # we'll add to this directory structure as we move on
|   |-- TBA

Individual notebooks should live in their own directory alongside any ancillary files related to them. For example:

Notebooks/
|-- CWPKNotebook
|   |-- cwpk-1-installment                 # one folder/notebook per installment
|   |   |-- .ipynb_checkpoints             # created automatically as notebooks are saved and checkpointed
|   |   |   +-- cwpk-1-installment.ipynb   # backup file from checkpoint, same name as current
|   |   +-- cwpk-1-installment.ipynb       # current active notebook file
|   |   +-- cwpk-1-installment-edit.html   # initial drafting file named differently to prevent overwriting
|   |   +-- cwpk-1-installment.html
|   |   +-- files                          # a sub-directory for all supporting files for that installment
|   |   |   +-- custom.css                 # same across installments, + one in the Jupyter Notebook settings
|   |   |   +-- image-1.png
|   |   |   +-- image-2.jpg
|   |   |   +-- attachment.txt
|   |-- cwpk-2-installment
|   |   +-- etc.
|   |-- cwpk-3-etc.
Note that Save and Checkpoint within Jupyter Notebook automatically creates a .ipynb_checkpoints subdirectory and populates it with the current version of the *.ipynb file. (So, don’t mix up the current file in the parent directory with this backup one.) Further, it is perhaps better to create a more streamlined version of this directory structure that would place all notebook files (*.ipynb) in a single directory with a single location for the custom.css. That approach requires more logic in the application and is harder to include in a lesson. One advantage of the somewhat duplicative structure herein is that we are able to treat each notebook installment as a standalone unit.
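To keep the current file and its checkpoint backup straight, a script can filter the checkpoint copies out by path. A small sketch, exercised here in a throwaway temporary directory rather than the real Notebooks tree:

```python
# Sketch: separate current *.ipynb files from their .ipynb_checkpoints
# backups, which share the same file name.
from pathlib import Path
import tempfile

def live_notebooks(root: Path) -> list:
    """All .ipynb files under root, excluding checkpoint copies."""
    return sorted(p for p in root.rglob("*.ipynb")
                  if ".ipynb_checkpoints" not in p.parts)

with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "cwpk-1-installment.ipynb").write_text("{}")
    ckpt = root / ".ipynb_checkpoints"
    ckpt.mkdir()
    (ckpt / "cwpk-1-installment.ipynb").write_text("{}")
    print([p.name for p in live_notebooks(root)])
    # -> ['cwpk-1-installment.ipynb']
```

Pointed at the CWPKNotebook branch above, the same function would enumerate one live notebook per installment folder.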

NB: From the perspective of CWPK #60 looking back, I find that my assumptions of how I would use Jupyter Notebook did not prove to be exactly accurate. In fact, I have found using Notebooks to be tremendously helpful and productive for all drafting activities. I like the way that the flow of cells, either code or Markdown, can lead to productive drafting. My actual experience going forward is that I completely ceased using HTML or Web pages for any drafting. The interactive notebook environment has proven to be a real favorite with me. True, I now do more drafting directly using Markdown, but even that has proven to be quick and productive.

This installment completes our first major section on set-up and configuration of our working environment. In our next installment we switch gears to working with Python and lay out our game plan for doing so.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on August 14, 2020 at 2:10 pm in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2343/cwpk-15-using-notebooks-for-cwpk-documentation/
The URI to trackback this post is: https://www.mkbergman.com/2343/cwpk-15-using-notebooks-for-cwpk-documentation/trackback/
Posted: August 13, 2020

Eventually, You May Need to Know How to Dissect a Notebook Page

We discussed in the CWPK #10 installment of this Cooking with Python and KBpedia series the role of Jupyter Notebook pages to document this entire plan. The reason we are using electronic notebooks is because, from this point forward, we will be following the discipline of literate programming. Literate programming is a style of coding introduced by Donald Knuth to combine coding statements with language narratives about what the code is doing and how it works. The paradigm, and thus electronic notebooks, is popular with data scientists because activities like machine learning also require data processing or cleaning and multiple tests with varying parameters in order to dial-in resulting models. The interactive notebook paradigm, combined with the idea of the scientist’s lab notebook, is a powerful way to instruct programming and data science.

In this installment we will dissect a Jupyter Notebook page and how we write the narrative portions in a lightweight mark-up language known as Markdown. Actually, Markdown is more of a loose affiliation of related formats, with lack of standardization posing some challenges to its use. In the next installment we will provide recipes for keeping your Markdown clean and for integrating notebook pages into your workflows and directory structures.

We first showed a Jupyter Notebook page in Figure 5 of CWPK #10. Review that installment now, make sure you have a CWPK notebook page (*.ipynb) somewhere on your machine, go to the directory where it is stored (remember that needs to be beneath the root directory you set in CWPK #10), and then bring up a command window. We’ll start up Jupyter Notebook first:

$ jupyter notebook

Assuming you are using this current notebook page as your example, your screen should look like this one. To confirm our notebook is active, type in our earlier ‘Hello KBpedia!‘ statement:

print ("Hello KBpedia!")

Now, scroll up to the top of this page and double-click anywhere in the area where the intro narrative is. You should get a screen like the one below, which I have annotated to point out some aspects of the interactive notebook page:

Figure 1: Example Markdown Cell in Edit Mode

We can see that the active area on the page, what is known as a “cell,” contains plain text (1). Also note that the dropdown menu in the header (1) tells us the cell is of the ‘Markdown’ type. There are multiple types of cells, but throughout this series we will be concentrating on the two main ones: Markdown for formatting narratives, and Code for entering and testing our scripts. Recall that Markdown uses plain text rather than embedded tags (as in HTML, for example) (2). We have conventions for designating headings (2) or links with URLs and link text (2). Most common page or text formatting such as bullets or italics or emphasized text or images has a plain text convention associated with it. In this instance, we are using the Pandoc flavor of Markdown. But also notice that we can mix many HTML elements (3) into our Markdown text to accomplish more nuanced markup. In this case, we are using the HTML <div> tag to convey style and placement information for our header with its logo.

As we open or close cells, new cells appear for entry at the bottom of our page. We can also manage these cells by inserting above or below or deleting them via two of the menu options (4). To edit, we either double-click in a Markdown cell or enter directly into a Code cell. When we have finished our changes, we can see the effect via the Run button (5) or Cell option (4), including running all cells (the complete page) or selected cells. But be careful! While we can do much entry and modification with Markdown cells, this application is not like a standard text editor. We can get instant feedback on our modifications, but saving works differently, through file checkpoints (6), and changing file names is not possible from within the notebook, where we must use the file system. We can also have multiple cells unevaluated at a given time (7). We may also choose among multiple kernels (different languages or versions, including R and others). Many of these features we will not use in this series; the resources at the end of this article provide additional links to learn more about notebooks.

To learn more about Markdown, let me recommend two terrific resources. The first is directly relevant to Jupyter Notebook, the second is for a very useful Markdown format:

When you are done working on your notebook, you can save the notebook using Widgets → Save Notebook Widgets State OR File → Save and Checkpoint and then File → Close and Halt. (You may also Logout (8), but make sure you have saved in advance.) Depending on your sequence, you may exit to the command window. If so, and the system is still running in the background, press Ctrl+c to quit the application and return to the command window prompt.

Should you want to convert your notebook to a Web page (*.html), you may use nbconvert at the command prompt when you are out of Jupyter Notebook. For the notebook file we have been using for this example, the command is (assuming you are in the same directory as the notebook file):

  $ jupyter nbconvert --to html cwpk-14-markdown-notebook-file.ipynb

This command will write out a large HTML page (large because it embeds all style information). This version pretty faithfully captures the exact look of the application on screen. See the nbconvert documentation for further details. Alternatively, you may export the notebook directly by picking File → Download as → HTML (.html). Then, save to your standard download location.

We will learn more about these saving options and ways to improve file size and faithful rendering in the next installment.

Important note: as of the forthcoming CWPK #16 installment, we will begin to distribute Jupyter Notebook files with the publication of each installment. Further, even though early installments in this series had no interactivity, we will also re-publish them as notebook files. From this point forward all new installments will include a Notebook file. Check out CWPK #16 when it is published for more details.

More Resources


Posted by AI3's author, Mike Bergman Posted on August 13, 2020 at 9:51 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2342/cwpk-14-markdown-and-anatomy-of-a-notebook-file/
The URI to trackback this post is: https://www.mkbergman.com/2342/cwpk-14-markdown-and-anatomy-of-a-notebook-file/trackback/
Posted: August 12, 2020

Keeping Multiple Interacting Parts Current

Early in this series (CWPK #9) of Cooking with Python and KBpedia, I noted the importance of Anaconda as a package and configuration manager for Python. Akin to the design of the Unix and Linux operating systems, Python applications are an ecosystem of scripts and libraries that are shared and invoked across multiple applications and uses. The largest Python repository, PyPI, itself contains more than 230,000 projects. The basic installation of Anaconda contains about 500 packages in its standard configuration.

Since the overwhelming majority of these projects exist independently of one another and each progresses on its own schedule of improvements and new version releases, it is not hyperbole to envision the relative stability of a package installer such as Anaconda as masking a bubbling cauldron of constant package changes under the surface. To illustrate this process I will focus on one of the Anaconda packages called Pandoc that figures prominently in the next installment of this CWPK series. Pandoc has been around for about 15 years and is the undisputed king of applications for converting one major text format type into another. Pandoc processes external formats using what it calls ‘readers’, converts that external form into an internal representation, and then uses ‘writers’ to output that internal representation into another form useful to external applications. Generally, a given format has both a reader and a writer, though there are a few strays. In the current version of Pandoc (2.9.2.x) there are 33 readers and 55 writers.

A Python environment is a dedicated directory where specific dependencies can be stored and maintained. Environments have unique names and can be activated when you need them, allowing you to have ultimate control over the libraries that are installed at any given time. You can create as many environments as you want. Because each one is independent, they will not interact with or ‘mess up’ one another. Thus, it is common for programmers to create new environments for each project that they work on. Oftentimes, information about your environment can assist you in debugging certain errors. Starting with a clean environment for each project can help you control the number of variables to consider when looking for bugs. When it comes to creating environments, you have two choices:
  1. you can create a virtual environment (venv) using pip to install packages or
  2. create a conda environment with conda installing packages for you. [1]

In my own work I tend to author documents either in HTML or LibreOffice, corresponding to the *.html and *.odt formats, respectively. However, the Jupyter Notebook that we will be using for our interactive electronic notebooks represents standard formatted text in the Markdown format (*.md) that it combines with the interactive portions that use embedded JavaScript Object Notation (JSON). The combination of these narrative and interactive portions is represented by the *.ipynb format. Markdown is a plain text superset of HTML that uses character conventions rather than bracketed tags (for example, ‘-‘ for marking bullets or ‘#‘ for marking headings). We’ll have many occasions to look at Markdown markup throughout this series. Since I was anticipating switching between writing narratives and interacting with code, I wanted to use my standard writing tools for longer explanations as well as to publish interactive notebook pages on static Web sites. I was investigating Pandoc as a means of ‘round-tripping‘ between HTML and *.ipynb and to leverage the strengths of each.
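To make those Markdown character conventions concrete, here is a tiny invented fragment in both forms; note the embedded HTML div, which Markdown processors pass through untouched:

```python
# Illustration only: equivalent content in Markdown conventions vs HTML
# tags. The content itself is invented for the example.
markdown_src = """# A Heading

- first bullet
- second bullet

<div style="text-align: center;">raw HTML passes through Markdown</div>
"""

html_equiv = """<h1>A Heading</h1>
<ul>
<li>first bullet</li>
<li>second bullet</li>
</ul>
<div style="text-align: center;">raw HTML passes through Markdown</div>
"""

print(markdown_src)
```

The asymmetry is visible even in this toy case: the div survives in both directions, but a converter must decide whether the heading and bullets come back as `#` and `-` or as tags, which is the round-tripping problem in miniature.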

A quick look at the Pandoc site showed that, indeed, both formats were supported. Further, the Pandoc documentation also suggested there were ‘switches’ for the readers and writers of these two formats that would likely give me the control I needed to round-trip between formats with few or no errors. So, I downloaded the latest version of Pandoc (updating an earlier version already on my machine), and proceeded to do my set-up work in preparation for the upcoming CWPK installment #14. However, every time I ran the Pandoc command to do the conversion, I repeatedly got the error message of “Unknown output format.”

As I tried to debug this problem I made some discoveries. First, Pandoc was already a package included in Anaconda. Further, while I previously had Pandoc in my environment path, the new path entered when I installed Anaconda was put higher on the list, meaning the Anaconda Pandoc was invoked before the instance I had installed directly. As I investigated the Anaconda packages, I found that it was using Pandoc version 2.2.3.2, which dated from August 7, 2018. In investigating Pandoc releases, I noted that *.ipynb support was not introduced into Pandoc until version 2.6 on January 30, 2019. So, despite what the Web site stated and my own installation of the version from March 23, 2020, the actual Pandoc that was being used in my environment did not support the notebook format!
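A quick way to see which pandoc your shell actually resolves, and what version it is, is to ask from Python; the sketch below only queries the binary if one is found on the PATH at all:

```python
# Sketch: detect which pandoc executable is resolved first on the PATH,
# and report its version. An older Anaconda copy shadowing a newer
# manual install is exactly the failure mode described in the text.
import shutil
import subprocess

exe = shutil.which("pandoc")
if exe:
    out = subprocess.run([exe, "--version"],
                         capture_output=True, text=True)
    print(exe)
    print(out.stdout.splitlines()[0])  # e.g. a 'pandoc 2.x' version line
else:
    print("pandoc not found on PATH")
```

Had I run something like this first, the 2.2.3.2-vs-2.6 gap (and thus the missing ipynb writer) would have been obvious immediately.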

To sate my curiosity I took a random sample of a dozen packages hosted by Anaconda and compared them to later updates that might be found elsewhere on the Web or directly from the developers. I found Anaconda was up to date in about 10 of these 12 instances. However, in the instance of Pandoc this gap was material. This raises two important points. First, when installing for the first time or when returning to use after a hiatus, it is important to update your existing distribution. For Anaconda, begin by updating that repository:

conda update --all

Invoking this option causes a flurry of activity as multiple packages are checked for currency, dependencies, and then proper load orders. These are the kinds of activities that formerly were painful and subject to many inadvertent conflicts as one package updated a dependency that broke another. This kind of update activity is shown by Figure 1.

Figure 1: Updating the Anaconda Environment

Second, we then need additional ways to find and install Python packages. The most common package installer in Python is pip, the leading method for accessing PyPI, though Anaconda clearly chose an alternate approach in conda. The philosophy of conda is to manage dependencies and interactions between packages better than pip historically has. Other repositories have embraced that same philosophy, and one with even greater dependency testing than conda is conda-forge, also a popular repository for data science packages. In every case I spot-checked, conda-forge had packages as recent as or more recent than conda's, including the most recent version of Pandoc. Further, conda-forge can be integrated into the Anaconda package installation environment.

Installing Pandoc from the conda-forge channel begins by adding conda-forge to your configured channels (here, within Anaconda) [2]:

conda config --add channels conda-forge

Once the conda-forge channel has been enabled, Pandoc can be installed with:

conda install pandoc

You may, of course, install any specific conda-forge package using this same command format. It is also possible to list all versions of Pandoc available for your platform with:

conda search pandoc --channel conda-forge

This same approach may be used for any package maintained by conda-forge, while keeping Anaconda and its dependencies current.
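These checks can also be scripted. The sketch below shells out to conda search pandoc --channel conda-forge --json and picks the newest version reported. It is illustrative only: it assumes conda is on the PATH, and that the 'pandoc'/'version' fields reflect conda's current JSON output rather than a guaranteed API.

```python
import json
import subprocess

def newest(versions):
    """Return the highest version string, compared as numeric tuples
    so that '2.10' correctly sorts above '2.9.2'."""
    return max(versions, key=lambda v: tuple(int(p) for p in v.split(".")))

def newest_conda_pandoc():
    """Query conda-forge for available pandoc builds; None if none found."""
    raw = subprocess.run(
        ["conda", "search", "pandoc", "--channel", "conda-forge", "--json"],
        capture_output=True, text=True,
    ).stdout
    records = json.loads(raw).get("pandoc", [])
    return newest(r["version"] for r in records) if records else None
```

The numeric-tuple comparison in `newest` matters: plain string comparison would rank "2.9.2" above "2.10", exactly the sort of subtle error that leads to stale packages going unnoticed.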

If you wish to explore Python package management and conda environments further, the guide in the endnotes is a good starting point [1].

We saw in CWPK #10 that Jupyter Notebook is not able to access all areas of your computer unless you place it at the root, which is never a good idea for security reasons. It is always best to keep your Python working environment sequestered to some extent. Further, if you get serious with Python and engage in multiple projects, it is a good idea to use virtual environments as well as dedicated directories. I do not address virtual environments further in this series, since many readers just learning Python may not need this complexity.
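Though not covered further in this series, a minimal taste of virtual environments can be had with Python's built-in venv module, a lighter-weight alternative to conda environments. The directory name below is purely illustrative:

```python
import tempfile
import venv
from pathlib import Path

# Create a throwaway environment in a temporary directory. with_pip=False
# keeps creation fast for this demo; use with_pip=True for real projects.
env_dir = Path(tempfile.mkdtemp()) / "cwpk-demo"
venv.create(env_dir, with_pip=False)

# Every venv carries a pyvenv.cfg marker file at its root.
print((env_dir / "pyvenv.cfg").exists())  # -> True
```

The conda equivalent would be `conda create --name cwpk-demo` followed by `conda activate cwpk-demo`; both approaches give each project its own isolated set of packages.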

Another truth of a large installation such as Anaconda is that it is very tricky (indeed, nearly impossible on a Windows machine) to change the directory in which it was first installed. The safest way is to uninstall Anaconda and re-install it in the new directory. That is disruptive in its own right, so it is not a step to undertake lightly. Your directory structure therefore deserves some upfront attention. It is best to play a bit with your Python environment, see what is and is not working for your workflows and file locations, and make any changes before committing to true work-dependent tasks. I first introduced the question of directory structure in CWPK #9. We will continue this topic in earnest in our next installment.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.

Endnotes:

[1] Norris, Will, Jenny Palomino, and Leah Wasser. 2019. “Use Conda Environments to Manage Python Dependencies: Everything That You Need to Know.” Earth Data Science – Earth Lab. https://www.earthdatascience.org/courses/intro-to-earth-data-science/python-code-fundamentals/use-python-packages/introduction-to-python-conda-environments/ (April 10, 2020).
[2] Conda-Forge/Pandoc-Feedstock. 2020. conda-forge. Shell. https://github.com/conda-forge/pandoc-feedstock (April 10, 2020).

Posted by AI3's author, Mike Bergman, on August 12, 2020 at 9:42 am in CWPK, KBpedia, Semantic Web Tools
The URI link reference to this post is: https://www.mkbergman.com/2341/cwpk-13-managing-python-packages-and-environments/