Posted:October 28, 2020

What Should be Simple Proves Frustratingly Complex

Sometimes the installments in this Cooking with Python and KBpedia series come together fairly quickly, sometimes not. This installment has proven to be particularly difficult. Research has spread over days, and progress has been frustratingly slow. As a result, I spread the content of developing a remote SPARQL service across two parts.

At the outset I thought it would progress rapidly: After all, is not SPARQL a proven query language with central importance to knowledge graphs? But, possibly because our focus in the series is Python, or perhaps for other reasons, I have found a dearth of examples to follow regarding setting up a Python endpoint.

You will recall we first introduced SPARQL in CWPK #25 in conjunction with the RDFLib package. We showed the flexibility and robustness of this query language to retrieve and filter any and all structural aspects of a knowledge graph. Then, in installment CWPK #50 we expanded on this basis to describe how SPARQL can be an essential component for querying and retrieving data from external sources, principally Wikidata and DBpedia.

Most all public SPARQL endpoints that presently exist (see this representative list, which is disappointingly small) are based on triple stores that come bundled with SPARQL endpoints. A few are also based on endpoint wrappers based on Java such as RDF4j or Jena and a few languages such as C (Redland) or JavaScript. These options obviously do not meet our Python objectives.

As we saw in CWPK #25, RDFLib provides SPARQL query support and also has the related SPARQLwrapper package that enables one to pose queries to external SPARQL endpoints. (easysparql provides similar functionality.) However, the objective we have to turn a local or remote instance into a SPARQL-enabled endpoint accessible to outside parties is not so easily supported. A number of years back there were the well-regarded rdflib-web apps that ran within Flask; unfortunately, this code is out of date and does not run on Python 3. There was also the adhs package that saw limited development and has not been updated in five years. In my initial diligence for this series I also found the pyLDAPI package that looked promising. However, I have not been able to find a working version of this system, and the I find the approach it takes to content negotiation for linked data to be cumbersome and tedious (see next installment).

So, based on the fragments indicated and found from these researches, I decided to tackle setting up a SPARQL endpoint largely on my own. Having established a toe-hold in our remote Linux server in the last installation, I decided to proceed by baby steps reflecting what I had already learned with our local instance to expose an endpoint on our remote server.

Step-wise Approach

We begin our process by setting up our environment, loading needed packages and KBpedia, testing them, and then proceeding to write some code to enable SPARQL queries and then to manage the application. Not knowing if all of these steps will work, I decide to approach these questions in a step-by-step manner.

1. Create a ‘sparql’ conda and Flask address

Note: I have always found the Linux vi editor to be difficult and hard to navigate, since I only use it on occasion. I now use nano as my editor replacement, since it presents key commands at the bottom of the screen useful to my occasional use, and is also part of the standard distro.

We follow the same steps that we worked out in CWPK #58 for setting up a conda virtual environment, that we will name ‘sparql’:

conda create -n sparql python=3

We get the echo to screen as the basic conda environment is created. Remember, this environment is found in the /usr/bin/python-projects/miniconda3/envs/sparql directory location. We then activate the environment:

conda activate sparql

We install some basic packages and then create our new sparql directory and the two standard stub files there:

conda install flask
conda install pip

then the two files, beginning with test_sparql.py:

from flask import Flask
app = Flask(__name__)
@app.route("/")
def hello():
return "Hello SPARQL!"

and then wsgi.py:

import sys
sys.path.insert(0, "/var/www/html/sparql/")
from test_sparql import app as application

We then proceed to set up the Apache2 configurations, placed directly below our prior similar specification in the /etc/apache2/sites-enabled directory in the 000-default.conf file:

        WSGIDaemonProcess sparql python-path=/usr/bin/python-projects/miniconda3/envs/sparql/lib/python3.8/site-packages
WSGIScriptAlias /sparql /var/www/html/sparql/wsgi.py
<Directory /var/www/html/sparql>
WSGIProcessGroup sparql
WSGIApplicationGroup %{GLOBAL}
Order deny,allow
Allow from all
</Directory>

then you can check whether the configuration is OK and re-start the server. Then, when we enter:

http://54.227.249.140/sparql

We see that the right message appears and our configuration is OK.

2. Install all needed Python packages

If you recall from the last installment, we used the minimal miniconda3 package installer for our remote Linux (Ubuntu) instance. This minimal footprint largely only installs conda and Python. That means we must install all of the needed additional packages for our current application.

We noted the pip installer before, but we are best off using one of the conda-related channels since they better check configuration dependencies. To expand our package availability from what is standard in the conda channel, we may need to add some additional channels to our base package. One of the most useful of these is conda-forge. To install it:

conda config --add channels conda-forge

It is best to install packages in bulk, since dependencies are checked at install time. One does this by listing the packages in the same command line. When doing so, you may encounter messages that one or more of the packages was not found. In these cases, you should go to the search box at https://anaconda.com, search for the package, and then note the channel in which the package is found. If that channel is not already part of your configuration, add it.

Many of the needed packages for our SPARQL implementation are found under the conda-forge channel. Here is how a bulk install may look:

conda install networkx owlready2 rdflib sparqlwrapper pandas --channel conda-forge

We also then need to install cowpoke using pip by using this command while in the sparql virtual environment:

pip install cowpoke

Every time we invoke the sparql virtual environment these packages will be available, which you can inspect using:

conda list

Also, if you want to share with others the package configuration of your conda environments, you may create the standard configuration file using this command:

conda env export > environment.yaml

The file will be written to the directory in which you invoke this command.

3. Install KBpedia ‘sandbox’ KGs

Clearly, besides the Python code, we also need the various knowledge graphs used by KBpedia. These graphs are the same *.owl (rdf/xml) files that we first discussed in CWPK #18 . We will use the same ‘sandbox’ files from that installment.

Our first need is to decide where we want to store our KBpedia knowledge graphs. For the same reasons noted above, we choose to create the directory structure of /var/data/kbpedia. Once we create these directories, we need to set up the ownership and access properties for the files we will place there. So, we navigate to the parent directory data of our target kbpedia directory and issue two statements to set the ownership and access rights to this location:

sudo chown -R user-owner:user-group kbpedia
sudo chmod -R 775 kbpedia

The -R switch means that our settings get applied recursively to all files and directories in the target directory. The permissions level (775) means that user owners or groups may write to these files (general users may not).

These permission changes now allow us to transfer our local ‘sandbox’ files to this new directory. The two files that we need to transfer using our SSH or file transfer clients are:

kbpedia_reference_concepts.owl
kko.owl

Recall these are the RDF/XML conversions of the original *.n3 files. We now have the data available on the remote instance for our SPARQL purposes.

4. Verify access and use of KBpedia and owlready2

OK, so to see that some of this is working, I pick up on the file viewing code in CWPK #18 to see if we can load and view this stuff. I enter this code into a temp.py file and run python (python temp.py) under the /var/www/html/sparql/ directory:

main = '/var/data/kbpedia/kko.owl'  

with open(main) as fobj:
for line in fobj:
print (line)

Good; we see the kko.owl file scroll by.

So, the next test is to see if owlready2 is loaded properly and we can inspect the KBpedia knowledge graph.

Picking up from some of the first tests in CWPK #20, I create a script file locally and enter these instructions (note where the kko.owl file is now located):

main = '/var/data/kbpedia/kko.owl'
skos_file = 'http://www.w3.org/2004/02/skos/core'

from owlready2 import *
kko = get_ontology(main).load()

skos = get_ontology(skos_file).load()
kko.imported_ontologies.append(skos)

list(kko.classes())

When in the sparql directory under /var/www/html/sparql, I call up Python (remember to have the sparql virtual environment active!), which gives me this command line feedback:

(sparql) root@ip-xxx-xx-x-xx:/var/www/html/sparql# python
Python 3.8.5 (default, Sep 4 2020, 07:30:14)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

and I paste the code block above at the cursor (>>>). I then hit Enter at the end of the code block, and we then see our kko classes get listed out.

Good, it appears we have the proper packages and directory locations. We can Ctrl-d (since we are on Linux) to exit the Python interactive session.

5. Create a ‘remote_access.py’ to verify a SPARQL query against the local version of the remote instance

So far, so good. We are now ready to test support for SPARQL. We again look to one of our prior installments, CWPK #25, to test whether SPARQL is working for us with all of the constituent KBpedia knowledge graphs. As we did with the prior question, we formulate a code block and invoke it interactively on the remote server with our python command. Here is the code (note that we have switched the definition of main to the full KBpedia reference concepts graph):

main = '/var/data/kbpedia/kbpedia_reference_concepts.owl'
skos_file = 'http://www.w3.org/2004/02/skos/core'
kko_file = '/var/data/kbpedia/kko.owl'

from owlready2 import *
world = World()
kb = world.get_ontology(main).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')

skos = world.get_ontology(skos_file).load()
kb.imported_ontologies.append(skos)

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)

import rdflib

graph = world.as_rdflib_graph()

form_1 = list(graph.query_owlready("""
PREFIX rc: <http://kbpedia.org/kko/rc/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?x ?label
WHERE
{
?x rdfs:subClassOf rc:Mammal.
?x skos:prefLabel ?label.
}
"""))

print(form_1)

Fantastic! This works, too, even to the level of giving us the owlready2 circular reference warnings we received when we first invoked CWPK #25!

Now, let’s also test if we can query using SPARQL to another remote endpoint from our remote instance using again more code from the CWPK #25 installment and also after importing the sparqlwrapper package:

main = '/var/data/kbpedia/kbpedia_reference_concepts.owl'
skos_file = 'http://www.w3.org/2004/02/skos/core'
kko_file = '/var/data/kbpedia/kko.owl'

from owlready2 import *
world = World()
kb = world.get_ontology(main).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')

skos = world.get_ontology(skos_file).load()
kb.imported_ontologies.append(skos)

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)

from SPARQLWrapper import SPARQLWrapper, JSON
from rdflib import Graph

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

sparql.setQuery("""
PREFIX schema: <http://schema.org/>
SELECT ?item ?itemLabel ?wikilink ?itemDescription ?subClass ?subClassLabel WHERE {
VALUES ?item { wd:Q25297630
wd:Q537127
wd:Q16831714
wd:Q24398318
wd:Q11755880
wd:Q681337
}
?item wdt:P910 ?subClass.

SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
print(results)

Most excellent! We have also confirmed we can use our remote server for remote endpoint queries.

6. Create a Flask-based SPARQL input form for the local version This progress is rewarding, but the task now becomes substantially harder. We need to set up interfaces that will allow these queries to be run from external sources to our remote instance. There are two ways we can tackle this requirement.

The first way, the subject of this particular question, is to set up a Web page form that any outside user may access from the Web to issue a SPARQL query via an editable input form. The second way, the subject of question #9, is to enable a remote query issued via sparqlwrapper and Python that goes directly to the endpoint and bypasses the need for a form.

Since we already have installed Flask and validated it in the last installment, our task under this present question is to set up the Web form (in the form of a template as used by Flask) in which we enter our SPARQL queries. Flask maps Web (HTTP) requests to Python functions, which we showed in the last installment where the /sparql URI fragment maps to the /var/www/html/sparql path and its test_sparql.py function. Flask runs this code and then displays results to the browser using HTTP protocols, with the GET method being the most common, but all HTTP methods may be supported. The Python code invoked may call up templates (based on Jinja) that can then invoke HTML pages forms and various response functions.

I noted earlier two SPARQL-related efforts, pyLDAPI and adhs. While neither appears to have a working example, both contain aspects that can inform this task and subsequent ones. A (non-working) implementation of pyLDAPI called GNAF, in particular, has a SPARQL Web page that looked to be useful as a starting template.

If you recall, Flask uses HTML-based templates as its ‘view’-related approach to the model-view-controller (MVC) design. Besides embedding standard HTML, these templates may also contain set Flask statements that relate the Web page to various model or controller commands. These templates should be placed into a set directory under the Flask directory structure. The templates can be nested within one another, useful, for example, when one wants a header and footer repeated across multiple pages, but for our instance I chose a single-page template.

In essence, I took the two main text areas from the starting GNAF template and embedded them in duplicate presentations of the header and footer from the KBpedia current Web page design. (You should know that the server hosting the subject SPARQL page is different from the physical server hosting the standard KBpedia Web site.) I took this approach because I was considering making a SPARQL query form a standard part of the main KBpedia site, which I implement at the conclusion of the next installment. Here is how the resulting Web page form looks:

KBpedia SPARQL Form
Figure 1: KBpedia SPARQL Form

Though located on a remote server different than the standard KBpedia Web site, we have designed the KBpedia SPARQL form to mimic the look of that standard site (1) with the same menu options, and both interact seamlessly. Sample SPARQL queries are provided both for the internal KBpedia knowledge graph and for external sites (2), including links (2) to additional query examples. These queries, whether samples or ones of your own crafting, can be pasted into the query entry box (3). Once pasted, you have the option to enter an external SPARQL query URL (4), pick whether your query should be directed internally to KBpedia or externally (4) (if the query is external), and to select amongst about 8 output formats (4), including standard RDF/XML, JSON, CSV, HTML, etc. Then, when you submit the query (4), the results appear in the final text box (5). If the results are helpful, you may copy them and paste them into a local file.

You can inspect this resulting SPARQL Web page at the following address (View Page Source to see the HTML):

http://sparql.kbpedia.org/

You will note that besides logo and menu items similar to the standard KBpedia site, that this form has two text areas, one for entering the SPARQL query and one for viewing subsequent results. There are also some switches regarding input and output forms. It is these switches and the two text areas that relate most directly to the next question.

Tying this form to (which, of course was actually developed in conjunction with) its accompanying code was the most difficult coding effort I have undertaken with this CWPK series to date. I cover this coding development, along with the remaining questions and related topics, in our next installment.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site. The cowpoke Python code listing covering the series is also available from GitHub.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on October 28, 2020 at 9:11 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2409/cwpk-59-adding-a-sparql-endpoint-part-i/
The URI to trackback this post is: https://www.mkbergman.com/2409/cwpk-59-adding-a-sparql-endpoint-part-i/trackback/
Posted:October 26, 2020

The Beginning of Moving Our Environment into the Cloud

Today’s installment in our Cooking with Python and KBpedia series begins a three-part mini-series of moving our environment into the clouds. This first part begins with setting up a remote instance and Web page server using the mini-framework Flask. The next installment expands on this basis to begin adding a SPARQL endpoint to the remote instance. And, then, in the concluding third part to this mini-series, we complete the endpoint and make the SPARQL service active.

We undertake this mini-series because we want to make KBpedia more open to SPARQL queries from anywhere on the Web and we will be migrating our main KBpedia Web site content to Python. The three installments in this mini-series pave the way for us to achieve these aims.

We begin today’s installment by looking at the approach and alternatives for achieving these aims, and then we proceed to outline the steps needed to set up a Python instance on a remote (cloud) instance and to begin serving Web pages.

Approach and Alternatives

We saw earlier in CWPK #25 and CWPK #50 the usefulness of the SPARQL query language to interact with knowledge graphs. Our first objective, therefore, was to establish a similar facility for KBpedia. Though we have an alternate choice to set up some form of RESTful API to KBpedia (see further Additional Documentation below), and perhaps that may make sense at some point in time, our preference is to use SPARQL given its robustness and many query examples as earlier installments document.

Further, we can foresee that our work with Python in KBpedia may warrant our moving our standard KBpedia Web site to that away from Clojure and its current Bootstrap-based Web page format. Though Python is not generally known for its Web-page serving capabilities, some exploration of that area may indicate whether we may go in that direction or not. Lastly, given our intent to make querying of KBpedia more broadly available, we also wanted to adhere to linked data standards. This latter objective is perhaps the most problematic of our aims as we will discuss in a couple of installments.

Typically, and certainly the easiest path, when one wants to set up a SPARQL endpoint with linked data ‘conneg‘ (content negotiation) is to use an existing triple store that has these capabilities built in. We already use Virtuoso as a triple store, and there are a couple of Python installation guides already available for Virtuoso. Most of the other widely available triple stores have similar capabilities.

Were we not interested in general Web page serving and were outside of the bounds of the objectives of this CWPK series, use of a triple store is the path we would recommend. However, since our aims are different, we needed to turn our attention to Python-based solutions.

From the standpoint of Web-page serving, perhaps the best known and most widely installed Python option is Django, a fully featured, open-source Web framework. Django has a similar scope and suite of capabilities to its PHP counterpart, Drupal. These are comprehensive and complicated frameworks, well suited to massive database-backed sites or ones with e-commerce or other specialty applications. We have had much experience with Drupal, and frankly did not want to tackle the complexity of a full framework.

I was much more attracted to a simpler microframework. The two leading Python options are Flask and Bottle (though there is also Falcon, which does not seem as developed). I was frankly more impressed with the documentation and growth and acceptance shown by Flask, and there appeared to be more analogous installations. Flask is based on the Jinja template engine and the Werkzeug WSGI (Web-server Gateway Interface) utility library. It is fully based on Unicode.

Another factor that needs to be considered is support for RDFlib the key package (and related) that we will be using for the SPARQL efforts. I first discussed this package in CWPK #24, though it is featured in many installments.

Basic Installation

We will be setting up these new endpoints on our cloud server, which is a large EC2 instance on Amazon Web Services running Ubuntu 18.04 LTS. Of course, this is only one of many cloud services. As a result, I will not discuss all of the preliminary steps to first securing an instance, or setting up an SSH client to access it, nor any of the initial other start-up issues. For EC2 on AWS, there are many such tutorials. Two that I have encountered in doing the research for this installment include this one and this other one. There are multiple others, and ones applicable to other providers than AWS as well.

So, we begin from the point that an instance exists and you know how to log in and do basic steps with it. Even with this simplification, I began my considerations with these questions in mind:

  1. Do I need a package manager, and if so, which one?
  2. Where should I place my Python projects within the remote instance directory structure?
  3. How do I also include the pip package environment?
  4. Should I use virtual environments?

With regard to the first question, I was sure I wanted to maintain the conda package manager, but I was not convinced I needed the full GUI and set of packages with Anaconda. I wanted to keep consistency with the set-up environment we put in place for our local instance (see CWPK #9). However, since I had gained much experience with Anaconda and conda, I felt comfortable not installing the full 4 GB full Anaconda distribution. This led me to adopt the minimal miniconda package, which has a much smaller footprint. True, I will need to install all specific Python packages and work from the command line (terminal), but that is easy given the experience I have gained.

Second, in reviewing best practices information for directory structures on Ubuntu, I decided to create an analogous ‘python-projects’ master directory, similar to what I established for the local instance, under the standard user application directory. I thus decided and created this master directory: usr/bin/python-projects.

So, having decided to use miniconda and knowing the master directory I wanted to use, I proceeded to begin to set up my remote installation. I followed a useful source for this process, though importantly updated all references to the current versions. The first step was to navigate to my master directory and then to download the 64-bit Linux installer onto my remove instance, followed by executing the installation script:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh  * make sure you use updated version

sh Miniconda3-latest-Linux-x86_64.sh    

The installation script first requires you to page through the license agreement (many lines!!) and then accept it. Upon acceptance, the Miniconda code is installed with some skeletal stubs in a new directory under your master directory, which in my case is /usr/bin/python-projects/miniconda3. The last step in the installation process asks where you want Miniconda installed. The default is to use root/miniconda3. Since I wanted to keep all of my Python project stuff isolated, I overrode this suggestion and used my standard location of /usr/bin/python-projects/miniconda3.

After all of the appropriate files are copied, I agreed to initialize Miniconda. This completes the basic Miniconda installation.

Important Note: You will need to sign out and then sign back into your SSH shell at this point before the changes become active!

After signing back in, I again navigated to my Python master directory and then installed the pip package manager since not all of the Python packages we use in this CWPK series and cowpoke are available through conda.

Virtual Environments and a Detour

In the standard use of a Linux installation, one uses the distro’s own installation and package management procedures. In the case of Ubuntu and related distros such as Debian, apt stands for ‘advanced package tool‘ and through commands such as apt-get is one means to install new capabilities to your remote instance. Other Linux distros may use yum or rpm or similar to install new packages.

Then, of course, within Python, one has the pip package installer and the conda one we are using. Further, within a Linux installation, how one may install packages and where they may be applicable depends on the user rights of the installer. That is one reason why it is important to have sudo (super user) rights as an administrator of the instance when one wants new packages to be available to all users.

These package managers may conflict in rights and, if not used properly, may act to work at cross purposes. For example, in a standard AWS EC2 instance using Ubuntu, it comes packaged with its own Python version. This default is useful for the occasional app that needs Python, but does not conform to the segregation of packages that Python often requires using its ‘virtual environments‘.

On the face of it, the use of virtual environments seems to make sense for Python: we can keep projects separate with different dependencies and even different versions of Python. I took this viewpoint at face value when I began this transition to installing Python on a remote instance.

Given this, it is important to never run sudo pip install. We need to know where and to what our Python packages apply, and generic (Linux-type) install routines confuse Linux conventions with Python ones. We are best in being explicit. There are conditions, then, where the idea of a Python virtual environment makes sense. These circumstances include, among others, a Python shop that has many projects; multiple developers with different versions and different applications; a mix of Python applications between Python 2 and 3 or where specific version dependencies create unworking conditions; and so forth.

However, what I found in migrating to a remote instance is that virtual environments, combined with a remote instance and different operating system, added complexity that made finding and correcting problems difficult. After struggling for hours trying to get systems to work, not really knowing where the problem was occurring nor where to look for diagnostics, I learned some important things the hard way. I describe below some of these lessons.

An Unfortunate Bad Fork

So, I was convinced that a virtual environment made sense and set about following some online sources that documented this path. In general, these approaches used virtualenv (or venv), a pip-based environment manager, to set up this environment. Further, since I was using Ubuntu, AWS and Apache2, these aspects added to the constraints I needed to accommodate to pursue this path.

Important Note: For my configuration, this path is a dead end! If you have a similar configuration, do NOT follow this path. The section after this one presents the correct approach!

In implementing this path, I first installed pip:

sudo apt install -y python3-pip

I was now ready to tackle the fourth and last of my installation questions, namely to provide a virtual environment for my KBpedia-related Python projects. To do so, we first need to install the virtual environment package:

sudo apt install -y python3-venv

Then, we make sure we are in the base directory where we want this virtual environment to reside. (In our case, /usr/bin/python-projects/. We also will name our virtual environment kbpedia. We first establish the virtual environment:

python3.6 -m venv kbpedia

which pre-populates a directory with some skeletal capabilities. Then we activate it:

source kbpedia/bin/activate

The virtual environment is now active, and you can work with it as if you were are the standard command prompt, though that prompt does change in form to something like (kbpedia) :/usr/bin/python-projects#. You work with this directory as you normally would, adding test code next in our case. When done working with this environment, type deactivate to return from the virtual environment.

The problem is, none of this worked for my circumstance, and likely never would. What I had neglected in taking this fork is that conda is both a package manager and a virtual environment manager. With the steps I had just taken, I had inadvertently set up multiple virtual environments, which were definitely in conflict.

The Proper Installation

Once I realized that my choice of conda meant I had already committed to a virtual environment manager, I needed to back off all of the installs I had undertaken with the bad fork. That meant I needed to uninstall all packages installed under that environment, remove the venv environment manager, remove all symbolic links, and so forth. I also needed to remove all Apache2 updates I had installed for working with wsgi. I had no confidence whatever I had installed had registered properly.

The bad fork was a costly mistake, and it took me quite of bit of time to research and find the proper commands to remove the steps I had undertaken (which I do not document here). My intent was to get back to ‘bare iron’ for my remote installation so that I could pursue a clean install based on conda.

Installing the mod-wsgi Apache Module

After reclaiming the instance, my first step was to install the appropriate Apache2 modules to work with Python and wsgi. I began by installing the WSGI module to Apache2:

sudo apt-get install libapache2-mod-wsgi-py3

Which we then test to see if it was indeed installed properly:

apt-cache show libapache2-mod-wsgi-py3

Which displays:

Package: libapache2-mod-wsgi-py3
Architecture: amd64
Version: 4.5.17-1ubuntu1
Priority: optional
Section: universe/httpd
Source: mod-wsgi
Origin: Ubuntu
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com> Original-Maintainer: Debian Python Modules Team <python-modules-team@lists.alioth.debian.org> Bugs: https://bugs.launchpad.net/ubuntu/+filebug Installed-Size: 271 Provides: httpd-wsgi Depends: apache2-api-20120211, apache2-bin (>= 2.4.16), python3 (>= 3.6), python3 (<< 3.7), libc6 (>= 2.14), libpython3.6 (>= 3.6.5) Conflicts: libapache2-mod-wsgi Filename: pool/universe/m/mod-wsgi/libapache2-mod-wsgi-py3_4.5.17-1ubuntu1_amd64.deb Size: 88268 MD5sum: 540669f9c5cc6d7a9980255656dd1273 SHA1: 4130c072593fc7da07b2ff41a6eb7d8722afd9df SHA256: 6e443114d228c17f307736ee9145d6e6fcef74ff8f9ec872c645b951028f898b Homepage: http://www.modwsgi.org/ Description-en: Python 3 WSGI adapter module for Apache The mod_wsgi adapter is an Apache module that provides a WSGI (Web Server Gateway Interface, a standard interface between web server software and web applications written in Python) compliant interface for hosting Python based web applications within Apache. The adapter provides significantly better performance than using existing WSGI adapters for mod_python or CGI. . This package provides module for Python 3.X. Description-md5: 9804c7965adca269cbc58c4a8eb236d8 </python-modules-team@lists.alioth.debian.org></ubuntu-devel-discuss@lists.ubuntu.com>

And next, we check to see if the module is properly registered:

Our first test is:

sudo apachectl -t

and the second is:

sudo apachectl -M | grep wsgi

which gives us the correct answer:

wsgi_module (shared)

Setting Up the Conda Virtual Environment

We then proceed to create the ‘kbpedia’ virtual environment under conda:

conda create -n kbpedia python=3

Which we test with the Ubuntu path inquiry:

echo $PATH

which gives us the correct path registration (the first of the five listed):

/usr/bin/python-projects/miniconda3/envs/kbpedia/bin:/usr/bin/python-projects/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin

and then we activate the ‘kbpedia’ environment:

conda activate kbpedia

Installing Needed Packages

Now that we are active in our virtual environment, we need to install our missing packages:

conda install flask
conda install pip

Installing Needed Files

For Flask to operate, we need two files. The first file is the basic calling application, test_kbpedia.py, that we place under the Web sites directory location, /var/www/html/kbpedia/ (we create the kbpedia directory). We call up the editor for this new file:

vi test_kbpedia.py

and enter the following simple Python program:

 
from flask import Flask
app = Flask(__name__)@app.route("/")
def hello():
return "Hello KBpedia!"

It is important to know that Flask comes with its own minimal Web server, useful for only demo or test purposes, and one (because it can not be secured) should NOT be used in a production environment. Nonetheless, we can test this initial Flask specification by entering either of these two commands:

 wget http://127.0.0.1/ -O- 

curl http://127.0.0.1:5000

Hmmm. These commands seem not to work. Something must not be correct in our Python file format. To get better feedback, we invoke Python and then:

flask run

This gives us the standard traceback listing that we have seen previously with Python programs. We get an error message that the name ‘app’ is not recognized. As we inspect the code more closely, we can see that one line in the example that we were following was inadvertenly truncated (denoted by the decorator ‘@’ sign). We again edit test_kbpedia.py to now appear as follows:

 
from flask import Flask
app = Flask(__name__)
@app.route("/")
def hello():
return "Hello KBpedia!"

Great! That seems to fix the problem.

We next need to enter a second Python program that tells WSGI where to find this app. We call up the editor again in the same directory location:

vi wsgi.py

(We also may use the *.wsgi file extension if we so choose; some examples prefer this second option.)

and enter this simple program into our editor:

    
import sys
sys.path.insert(0, "/var/www/html/kbpedia/")
from test_kbpedia import app as application

This program tells WSGI where to find the application, which we register via the sys package as the code states.

Lastly, to complete our needed chain of references, Apache2 needs instructions for where a ‘kbpedia’ reference in an incoming URI needs to point in order to find its corresponding resources. So, we go to the /etc/apache2/sites-enabled directory and then edit this file:

vi 000-default.conf

Under DocumentRoot in this file, enter these instructions:

        WSGIDaemonProcess kbpedia python-path=/usr/bin/python-projects/miniconda3/envs/kbpedia/lib/python3.8/site-packages
WSGIScriptAlias /kbpedia /var/www/html/kbpedia/wsgi.py
<Directory /var/www/html/kbpedia>
WSGIProcessGroup kbpedia
WSGIApplicationGroup %{GLOBAL}
Order deny,allow
Allow from all
</Directory>

This provides the path for where to find the WSGI file and Python.

We check to make sure all of these files are written correctly by entering this command:

sudo apachectl -t

When we first run this, while we get a ‘Syntax OK’ message, we also get a warning message that we need to “Set the ‘ServerName’ directive globally to suppress this message”. While there is no problem to continue in this mode, in order to understand the suite of supporting files, we navigate to where the ServerName is set in the directory /etc/apache2​ by editing this file:

vi apache2.conf

and enter this name:

ServerName localhost

Note anytime we make a change to an Apache2 configuration file, that we need to re-start the server to make sure the new configuration is now active. Here is the basic command:

sudo service apache2 restart

If there is no problem, the prompt is returned after entering the command.

We are now complete with our initial inputs. To test whether we have properly set up Flask and our Web page serving, we use the IP for our instance and enter this command in a browser:

http://54.227.249.140/kbpedia

And, the browser returns a Web page stating:

Hello KBpedia!

Fantastic! We now are serving Web pages from our remote instance.

Of course, this is only the simplest of examples, and we will need to invoke templates in order to use actual HTML and style sheets. These are topics we will undertake in the next installment. But, we have achieved a useful milestone in this three-part mini-series.

Some More conda Commands

If you want to install libraries into an environment, or you want to use that environment, you have to activate the environment first by:

conda activate kbpedia

After invoking this command, we can install desired packages into the target environment, for example:

conda install package-name

But sometimes we need to use a different channel, in which case we first need to install that channel:

conda install --channel asmeurer

then, invoke it (say):

conda install -c conda-forge package-name

To see what packages are presently available to your conda environment, type:

conda list

And, to see what environments you have set up within conda, enter:

conda env list

which, in our current circumstance, give us this result:

base                     /usr/bin/python-projects/miniconda3
kbpedia * /usr/bin/python-projects/miniconda3/envs/kbpedia

The environment shown with the asterisk (*) is the currently active one.

Another useful command to know is to get full information on your currently active conda environment. To do so, type:

conda info

in our case, that produces the following output:

     active environment : kbpedia
    active env location : /usr/bin/python-projects/miniconda3/envs/kbpedia
            shell level : 1
       user config file : /root/.condarc
 populated config files :
          conda version : 4.8.4
    conda-build version : not installed
         python version : 3.8.3.final.0
       virtual packages : __glibc=2.27
       base environment : /usr/bin/python-projects/miniconda3  (writable)
           channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /usr/bin/python-projects/miniconda3/pkgs
                          /root/.conda/pkgs
       envs directories : /usr/bin/python-projects/miniconda3/envs
                          /root/.conda/envs
               platform : linux-64
             user-agent : conda/4.8.4 requests/2.24.0 CPython/3.8.3 Linux/4.15.0-115-generic ubuntu/16.04.1 glibc/2.27
                UID:GID : 0:0
             netrc file : None
           offline mode : False

Additional Documentation

Here are some additional resources touched upon in the prior discussions:

Getting Oriented

Flask on AWS

Each of these cover some of the first steps needed to get set up on AWS, which we skip over herein:

RESTful APIs

General Flask Resources

RDFlib

Here is a nice overview of RDFlib.

General Remote Instance Set-up

A video on setting up an EC2 instance and Putty; also deals with updating Python, Filezilla and crontab.

Other Web Page Resources

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site. The cowpoke Python code listing covering the series is also available from GitHub.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on October 26, 2020 at 10:28 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2407/cwpk-58-setting-up-a-remote-instance-and-web-page-server/
The URI to trackback this post is: https://www.mkbergman.com/2407/cwpk-58-setting-up-a-remote-instance-and-web-page-server/trackback/
Posted:October 22, 2020

Notebooks are Only Interactive if You Share

We first began publishing interactive electronic notebooks for this Cooking with Python and KBpedia series in CWPK #16, though the first installment using a notebook started with CWPK #14. I had actually drafted all of the installments up to this one before that date was reached. Since one major purpose of this series was to provide hands-on training, I did not want to force those who wanted to experience some degree of interactivity to have to go through all of the steps to set up their own interactive environment. My hope was that a taste of direct involvement with the code and interactivity would itself encourage users to get more deeply involved to establish their own interactive environments.

I had encountered for myself fully interactive notebooks prior to this point, ones where all I needed to do was to click to operate, so I knew there must be a way to make my own notebooks similarly available. In order to achieve my objective, as is true with so much of this series, I was forced to do the research and discover how I could set up such a thing.

A Survey of the Options

In researching the options it was clear that a spectrum of choices existed. We have already discussed how we can create non-interactive mockups of an interactive notebook using the nbconvert option or converting a drafted notebook using Pandoc (CWPK #15). My research surfaced some additional options to render a notebook page for general Web (HTML) display:

Static Options

  1. nbconvert, but lose interactivity
  2. The Pandoc option
  3. Publish in other formats (PDF)
  4. View a non-interactive page via nbviewer by simply providing a URL, which works like nbconvert.

These options are helpful, of course, but lack the full interactivity desired.

Fully Interactive

Systems that allow code cells to be run interactively are obviously more complex than nice rendering tools. My investigations turned up a number of online services, plus ways to set up own or private servers. From the standpoint of online services, here are the leading options:

There is a Python option that does not provide complete interactivity, but simple interactions of certain aspects of certain notebook cells:

  1. nbinteract is a Python package that provides a command-line tool to generate interactive web pages from Jupyter notebooks

Then, there are a series of online services:

  1. The MyBinder option, which uses a JupyterHub server directly from a Git repository
  2. Google’s Colaboratory, which provides a Google flavor on this approach
  3. Microsoft’s Azure Notebooks, which provides a MS flavor on this approach
  4. There are other sites such as Kaggle Kernels, CoCalc, nanoHUB, or Datalore that also provide such services, some for a fee.

The other interactive approach is to not use an established service, but to set up your own server.

  1. For private repositories, one can build on BinderHub, the same technology used by MyBinder, and which runs on JupyterHub running on Kubernetes for most of its functionality, or
  2. One can run a public notebook server based on Jupyter, though it is limited to a single access user at a time, or
  3. Set up one’s own JupyterHub, similar to the BinderHub option but not limited to a Git repository.

Frankly, most of the own-server options looked to be too much work simply to support my educational objectives for the CWPK series.

The Chosen MyBinder Option

I was very much committed to have an online service that would run my full stack. I chose to implement the MyBinder option because I could see it worked and was popular, it had close ties to Jupyter and rendered notebooks the same as when using locally, was free, and seemed to have strong backing and documentation. On the other hand, MyBinder has some weaknesses and poses some challenges. Some of the key ones I knew going in or identified as I began working with the system were:

  • As a hosted service that runs its applications in containers, it could take some minutes to get the online service active when used after a hiatus, necessary to reconfigure the container specific to the application and Python modules being used
  • It reportedly has memory limitation of 1-2 GB. Memory can be an issue with CWPK locally even at the 8 GB level
  • The service needed to run off of a Git repo. I had plans to better expose all aspects of the CWPK series and its supporting software on our existing public GitHub repository; the Git requirement caused me to accelerate my exposure of this service
  • Though free now, each MyBinder application is a rather large consumer of resources. I have some concerns regarding the longer-term availability of the service
  • CWPK would have more than 60 interactive notebook pages, though I did see reference that performance issues may only arise due to multiple, concurrent use. Going in, I had no clue as to what the use factor might be for the service and whether this would pose a problem or not.

Some of these issues deserve their own commentary below.

Setting Up the Environment

Setting up a new instance at the MyBinder service is relatively straightforward. Here is the basic set-up screen on the main page:

Setting Up a New MyBinder Project
Figure 1: Setting Up a New MyBinder Project

One must first have a Git repository available to start the service. One also needs to have completed an environment configuration file (environment.yml in our case with Python) and a project README.md at the root of the master branch on the repo. In our case, it is the CWPK repository on GitHub; we also indicate we are dealing with the master branch (1). (I had some initial difficulty when I over-specified a link to an individual notebook page; removing this cleared things up.) These simple specifications create the URL that is the link to your formal online project (2). Upon launch (3), the build process is shown to screen (4), which may take some minutes to build. The set of working input specs also provide the basis for generating a link badge you can use on your own Web sites (5).

Upon completion of a successful build, one is shown the standard Jupyter Notebook entry page.

Here are two additional resources useful to setting up a MyBinder application for the first time:

Implementation Challenges

Though set up is straightforward, there are some challenges in implementing MyBinder to accommodate specific CWPK needs. Here are some of the major areas I encountered, and some steps to address them.

Importing Local Code

As has become obvious in our series to date, Python is a highly configurable environment, with literally tens of thousands of packages to choose from to invoke needed functionality. The standard environment settings appear to do a good job of allowing new packages to be specified and imported into the MyBinder system. I had confidence these could be handled appropriately.

My major concern related to CWPK‘s own cowpoke package. At the time of starting this effort, this package was not commercial grade and was not registered on major distribution networks like PyPi or conda-forge. When used locally, including cowpoke is not a big issue: we only need to include it in the local listing of site packages. But, once we rely on a cloud instance, how can we get that code into our online MyBinder system?

The answer, it turns out, is to package our code as one would normally do for a commercial package, and then to include a setup.py configuration file in our local specification. That enables us to invoke the package through the standard MyBinder environment configuration. See especially this key reference and this stub for setup.py.

Local Data Sharing and Organization

Until this point, I had been developing and refining a local file directory structure in which to put different versions of KBpedia, other input files, and example outputs. This system was being developed logically from the perspective of a local file system.

However, these files were local and not exposed for access to an online system like MyBinder. My first thought was to simply copy this structure to the GitHub repo. But the manual copying of files to a version control system is NOT efficient, and the directory structure itself did not appear suitable for a repo presence. Further, manually copying files presents an ongoing issue of keeping local and remote versions in sync. Moreover, as I began adding new daily installments to the GitHub repo, I could see in general that manual additions were not going to be sustainable.

These realizations forced two decisions. First, I would need to re-think and re-organize my directory structures to accommodate both local and repo needs. The directory structure we have developed to date now reflects this re-organization. Second, as described in the next section, I needed to cease my manual use of GitHub and fully embrace it as a version control system.

Fully Embracing GitHub

I have to admit: Every time I try to work with version control with Git, I have been confused and frustrated with how to actually get anything done. I have, hopefully, progressed a bit beyond this point, but I would caution some of you looking to move into this area that you may have to overcome poor documentation and obfuscated instructions and commands.

So, I will not divert this series to deal with how to properly set up a Git-based version control system. In brief, one needs to establish a Git repository, and on Windows set up a Git client and then (for me), set up TortoiseGit if you want to work directly in Windows and the File Explorer rather than from the command line. In the process of doing all of this, you will also need to set up a key-based access control system with PuTTy (puttygen) so that you can communicate securely between your remote instance and your local file system. These steps are more effectively described in the TortoiseGit manual or various tutorials of one aspect or another. Installation, too, can be difficult with regard to general aspects or the PuTTy keys.

The reason, of course, for accepting this set-up complexity is being able to make changes either on a local version of code or data or a remote version of the same. This is the essence of the definition of keeping systems in ‘sync’. After having worked with these systems now for some weeks, daily, I think I can offer some simple tips for how best to work with these version control systems, points which are not obvious from most written presentations:

  1. First, make sure both the remote and local sources are in sync (this is actually not such an easy point, and is often the point of failure and frustration. However, until this status of being in sync is met, none of the other points below are possible.) When working locally, it is good practice to ‘pull’ any changes first from the remote repository before you attempt to ‘push’ local changes back to it. If you run into problems at this initial point, you need to research and find a fix before moving on
  2. Remember that your version control can really only occur from the local side, where your TortoiseGit is installed. So, while changes may occur either in the remote or local repository, the control to keep things in sync will occur from the local side (TortoiseGit)
  3. Whether at the remote or local repository, make all needed changes there, including deleting files, adding files, or modifying files. Then, commit those changes to the repository at hand (local or remote). (On TortoiseGit locally, this is done via the ‘Add’ Explorer menu option for new files; use the ‘Check for modifications’ option for changed files.) Commitment to the repository at hand is needed before the version control system knows what has been formally modified
  4. Again, be cognizant of where the modifications have occurred, which in any case you will control from the local TortoiseGit. If the changes have been made locally, then ‘push’ those changes to the remote repository; if the changes have been made remotely, then ‘pull’ those changes changes back to the local.

Always make sure that as any changes are made, at either side, they are synced to the system. In this way, you can be assured that your version control system is in a stable state, and you are free to make changes on either the local or remote side. Also know you can use GitHub for keeping multiple local instances (a desktop and a laptop in my case) in sync with the remote repository. Simply follow the above guidelines for each instance.

Handling Styles (CSS)

If you recall the discussion in CWPK #15, there is a difference in where custom styles can be set when viewing notebook pages locally versus whether they are called up directly from Python. Now, as we move to an online expression of these notebooks, we again raise the question of where online custom styles can be invoked.

Perhaps in some expressions, where style overrrides can be invoked is a matter of little consequence. But this CWPK series has some specific styles in such things as pointing to warnings, pointing to online resources, etc. Having a consistent way to refer to these styles (presentation) means better efficiency.

My hope had been that with MyBinder we had some identifiable means for providing such custom.css overrides as well. Though I can see links to such in page views, and there are hints online for how to actually modify styles, I was unable to find any means for effectively doing so. My suspicion is that online interactivity such as MyBinder is still in its infancy, and the degree of control that we expect in either local or remote environments is not yet mature.

Thus, since I could find no way after many frustrating hours to provide my own specific styles, I had to make the reluctant decision to embed all such style changes in each individual notebook page. (What this means, effectively, is the specific statement of style attributes needs to be repeated each time as used [MyBinder does not support referring to a style name in a separate external file, which is the more efficient alternative].) This embedded approach is not efficient, but, like prior discussions about the use of relative addresses, sometimes being specific is the best way to ensure consistent treatment across environments.

Using MyBinder

As time has gone on, I now have learned a general workflow that reflects these realities. In general, thus, with each new CWPK installment I try to:

  1. Draft all material in the Jupyter Notebook; make sure that version embodies all desired content and style changes and updates
  2. Inspect all link references and style definitions to make sure they are absolute, and not referenced
  3. Make sure all external files and images are moved and stored on the repository systems
  4. Post the updated file to the its current repository, and then commit it
  5. Push the updated file to the remote (or local) repository
  6. Convert the *.ipynb to HTML and post on my local blog
  7. Mix and stir again.

Though it is not designed directly as such, it is also possible to analyze use of MyBinder and gain statistics of use.

Additional Documentation

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site. The cowpoke Python code listing covering the series is also available from GitHub.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on October 22, 2020 at 11:17 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2406/cwpk-57-publishing-interactive-notebooks/
The URI to trackback this post is: https://www.mkbergman.com/2406/cwpk-57-publishing-interactive-notebooks/trackback/
Posted:October 21, 2020

Working Examples are Not so Easy to Come By

Today’s installment in our Cooking with Python and KBpedia series is a great example of how impressive uses of Python can be matched with frustrations over how we get there and whether our hoped-for desires can be met. The case study we tackle in this installment is visualization of the large-scale KBpedia graph. With nearly 60,000 nodes and about a quarter of a million edge configurations, our KBpedia graph is hardly a toy example. Though there are certainly larger graphs out there, once we pass about 10,000 nodes we enter difficult territory for Python as a performant language. We’ll cover this topic and more in this installment.

Normally, what one might encounter online regarding graph visualization with Python begins with a simple example, which then goes on to discuss how to stage the input data and then make modifications to the visualization outputs. This approach does not work so well, however, when our use case scales up to the size of KBpedia. Initial toy examples do not provide good insight how the various Python visualization packages may operate at larger scales. So, while we can look at example visualizations and example code showing how to expose options, in the end whether we can get the package to perform requires us to install and test it. Since our review time is limited, and we have to in the end produce working code, we need a pretty efficient process of identifying, screening, and then selecting options.

Our desires for a visualization package thus begin with the ability to handle large graphs, including graph analytic components in addition to visualization, compatibility with our Jupyter Notebook interactive environment, ease-of-learning and -implementation, and hopefully attractive rendering of the final graph. From a graph visualization standpoint, some of our desires include:

  • Attractive outputs
  • Ability to handle a large graph with acceptable rendering speed
  • Color coding of nodes by SuperType
  • Varying node sizes depending on the importance (in-degree) of the node
  • Control over the graphical elements of the display (edge and node styles)
  • Perhaps some interactivity such as panning and zooming and tooltips when hovering over nodes, and
  • A choice of a variety of graph layout options to gauge which best displays the graph.

Preferably, whatever packages work best for these criteria also have robust supporting capabilities within the Python data science ecosystem. To test these criteria, of course, it is first necessary to stage our graph in an input form that can be rendered by the visualization package. This staging of the graph data is thus where we obviously begin.

Data Preparation

Given the visualization criteria above, we know that we want to produce an input file for a directed graph that consists of individual rows for each ‘edge’ (connection between two nodes) consisting of a source node (subclass), a target node to which it points (the parent node), a SuperType for the parent node (and possibly its matching rendering color), and a size for the parent node as measured by its number of direct subclasses. This should give us a tabular graph definition file with rows corresponding to the individual edges (subclasses of each parent) by sometime like these columns:

  RC1(source)     RC2(target)     No Subclasses(weight)     SuperType     ST color  

Different visualization packages may want this information in slightly different order, but that may be readily accomplished by shifting the order of written output.

Another thing I wanted to do was to order the SuperTypes according to the order of the universal categories as shown by kko-demo.n3. This will tend to keep the color ordering more akin to the ordering of the universal categories (see further CWPK #8 for a description of these universal categories).

It is pretty straightforward to generate a listing of hex color values from an existing large-scale bokeh color palette, as we used in the last CWPK installment. First, we count the number of categories in our use case (72 for the STs). Second, we pick one of the large (256) bokeh palettes. We then generate a listing of 72 hex colors from the palette, which we can then relate to the ST categories:

from bokeh.palettes import Plasma256, linear_palette

linear_palette(Plasma256,72)

We reverse the order to go from lighter to darker, and then correlate the hex values to the SuperTypes listed in universal category order. Our resulting custom color dictionary then becomes:

cmap = {'Constituents'           : '#EFF821',
'NaturalPhenomena' : '#F2F126',
'TimeTypes' : '#F4EA26',
'Times' : '#F6E525',
'EventTypes' : '#F8DF24',
'SpaceTypes' : '#F9D924',
'Shapes' : '#FBD324',
'Places' : '#FCCC25',
'AreaRegion' : '#FCC726',
'LocationPlace' : '#FDC128',
'Forms' : '#FDBC2A',
'Predications' : '#FDB62D',
'AttributeTypes' : '#FDB030',
'IntrinsicAttributes' : '#FCAC32',
'AdjunctualAttributes' : '#FCA635',
'ContextualAttributes' : '#FBA238',
'RelationTypes' : '#FA9C3B',
'DirectRelations' : '#F8963F',
'CopulativeRelations' : '#F79241',
'ActionTypes' : '#F58D45',
'MediativeRelations' : '#F48947',
'SituationTypes' : '#F2844B',
'RepresentationTypes' : '#EF7E4E',
'Denotatives' : '#ED7B51',
'Indexes' : '#EB7654',
'Associatives' : '#E97257',
'Manifestations' : '#E66D5A',
'NaturalMatter' : '#E46A5D',
'AtomsElements' : '#E16560',
'NaturalSubstances' : '#DE6064',
'Chemistry' : '#DC5D66',
'OrganicMatter' : '#D8586A',
'OrganicChemistry' : '#D6556D',
'BiologicalProcesses' : '#D25070',
'LivingThings' : '#CF4B74',
'Prokaryotes' : '#CC4876',
'Eukaryotes' : '#C8447A',
'ProtistsFungus' : '#C5407D',
'Plants' : '#C13C80',
'Animals' : '#BD3784',
'Diseases' : '#BA3487',
'Agents' : '#B62F8B',
'Persons' : '#B22C8E',
'Organizations' : '#AE2791',
'Geopolitical' : '#A92395',
'Symbolic' : '#A51F97',
'Information' : '#A01B9B',
'AVInfo' : '#9D189D',
'AudioInfo' : '#9713A0',
'VisualInfo' : '#9310A1',
'WrittenInfo' : '#8E0CA4',
'StructuredInfo' : '#8807A5',
'Artifacts' : '#8405A6',
'FoodDrink' : '#7E03A7',
'Drugs' : '#7901A8',
'Products' : '#7300A8',
'PrimarySectorProduct' : '#6D00A8',
'SecondarySectorProduct' : '#6800A7',
'TertiarySectorService' : '#6200A6',
'Facilities' : '#5E00A5',
'Systems' : '#5701A4',
'ConceptualSystems' : '#5101A2',
'Concepts' : '#4C02A1',
'TopicsCategories' : '#45039E',
'LearningProcesses' : '#40039C',
'SocialSystems' : '#3A049A',
'Society' : '#330497',
'EconomicSystems' : '#2D0494',
'Methodeutic' : '#250591',
'InquiryMethods' : '#1F058E',
'KnowledgeDomains' : '#15068A',
'EmergentKnowledge' : '#0C0786',
}

We now have all of the input pieces to complete our graph dataset. Fortunately, we had already developed a routine in CWPK #49 for generating an output listing from our owlready2 representation of KBpedia. We begin by loading up our necessary packages for working with this information:

from cowpoke.__main__ import *
from cowpoke.config import *
from owlready2 import *

And we follow the same configuration setup approach that we have developed for prior extractions:

### KEY CONFIG SETTINGS (see build_deck in config.py) ###             
# 'kb_src'        : 'standard'                                        # Set in master_deck
# 'loop_list'     : kko_order_dict.values(),                          # Note 1   
# 'base'          : 'C:/1-PythonProjects/kbpedia/v300/build_ins/mappings/',              
# 'ext'           : '.csv',                                         
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv',

def graph_extractor(**extract_deck):
    print('Beginning graph structure extraction . . .')
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    class_loop = extract_deck.get('class_loop')
    base = extract_deck.get('base')
    ext = extract_deck.get('ext')
    
    # Note 2
    parent_set = ['kko.SocialSystems','kko.Products','kko.Methodeutic','kko.Eukaryotes',
              'kko.ConceptualSystems','kko.AVInfo','kko.Systems','kko.Places',
              'kko.OrganicChemistry','kko.MediativeRelations','kko.LivingThings',
              'kko.Information','kko.CopulativeRelations','kko.Artifacts','kko.Agents',
              'kko.TimeTypes','kko.Symbolic','kko.SpaceTypes','kko.RepresentationTypes',
              'kko.RelationTypes','kko.OrganicMatter','kko.NaturalMatter',
              'kko.AttributeTypes','kko.Predications','kko.Manifestations',
              'kko.Constituents']

    if loop is not 'class_loop':
        print("Needs to be a 'class_loop'; returning program.")
        return
    header = ['target', 'source', 'weight', 'SuperType']
    out_file = extract_deck.get('out_file')
    cur_list = []
    with open(out_file, mode='w', encoding='utf8', newline='') as output:                                           
        csv_out = csv.writer(output)
        csv_out.writerow(header)    
        for value in loop_list:
            print('   . . . processing', value)
            s_set = []
            root = eval(value)
            s_set = root.descendants()
            frag = value.replace('kko.','')
            for s_item in s_set:
                child_set = list(s_item.subclasses())
                count = len(list(child_set))
                
# Note 3                
                if value not in parent_set:
                    for child_item in child_set:
                        s_rc = str(s_item)
                        child = str(child_item)
                        new_pair = s_rc + child
                        new_pair = str(new_pair)
                        cur_list.append(new_pair)
                        s_rc = s_rc.replace('rc.','')
                        child = child.replace('rc.','')
                        row_out = (s_rc,child,count,frag)
                        csv_out.writerow(row_out)
                elif value in parent_set:
                    for child_item in child_set:
                        s_rc = str(s_item)
                        child = str(child_item)
                        new_pair = s_rc + child
                        new_pair = str(new_pair)
                        if new_pair not in cur_list:
                            cur_list.append(new_pair)
                            s_rc = s_rc.replace('rc.','')
                            child = child.replace('rc.','')
                            row_out = (s_rc,child,count,frag)
                            csv_out.writerow(row_out)
                        elif new_pair in cur_list:
                            continue
        output.close()         
        print('Processing is complete . . .')
graph_extractor(**extract_deck)

This routine is pretty consistent with the prior version except for a few changes. First, the order of the STs in the input dictionary has changed (1), consistent with the order of the universal categories and with lower categories processed first. Since source-target pairs are only processed once, this ordering means duplicate assignments are always placed at their lowest point in the KBpedia hierarchy. Second, to help enforce this ordering, parental STs are separately noted (2) and then processed to skip source-target pairs that had been previously processed (3).

To see the output from this routine (without hex colors yet being assigned by ST), run:

import pandas as pd

df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv')

df

Evaluation of Large-scale Graph Visualization Options

In my prior work with large-scale graph visualizations, I have used Cytoscape and written about it (2008), as well as Gephi and written about it (2011). Though my most recent efforts have preferred Gephi, neither is written in Python and both are rather cumbersome to set up for a given visualization.

My interest here is either a pure Python option or one that has a ready Python wrapper. The Python visualization project, PyViz provides a great listing of the options available. Since some are less capable than others, I found Timothy Lin’s benchmark comparisons of network packages to be particularly valuable, and I have limited my evaluation to the packages he lists.

The first package is NetworkX, which is written solely in Python and is the granddaddy of network analysis packages in the language. We will use it as our starting baseline.

Lin also compares SNAP, NetworkKit, igraph, graph-tool, and Lightgraphs. I looked in detail at all of these packages except for Lightgraphs, which is written in Julia and has no Python wrapper.

Lin’s comparisons showed NetworkX to be, by far, the slowest and least performant of all of the packages tested. However, NetworkX has a rich ecosystem around it and much use and documentation. As such, it appears to be a proper baseline for the testing.

All of the remaining candidates implement their core algorithms in C or C++ for performance reasons, though Python wrappers are provided. Based on Lin’s benchmark results and visualization examples online, my initial preference was for graph-tool, followed possibly by NetworkKit. SNAP had only recently been updated by Stanford, and igraph initially appeared as more oriented to R than Python.

So, my plan was to first test NetworkX, and then try to implement one or more of the others if not satisfied.

First NetworkX Visualizations

With our data structure now in place for the entire KBpedia, it was time to attempt some visualizations using NetworkX. Though primarily an analysis package, NetworkX does support some graph visualizations, principally through graphViz or matplotlib. In this instance, we use the matplotlib option, using the spring layout.

Note in the routine below, which is fairly straightforward in nature, I inserted a print statement to separate out the initial graph construction step from graph rendering. The graph construction takes mere seconds, while rendering the graph took multiple hours.

WARNING!: The cell below takes tens of minutes to hours to run. Please do not execute unless you are able to let this run in the background.
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv')
Graphtype = nx.DiGraph()
G = nx.from_pandas_edgelist(df, edge_attr='weight', create_using=Graphtype)
print('Graph construction complete.')
pos = nx.spring_layout(G,scale=1)

nx.draw(G,pos, with_labels=True)
plt.show()
Baseline KBpedia Visualization with Labels
Figure 1: Baseline KBpedia Visualization with Labels
NOTE: The figures in this article are static captures of the interactive electronic notebook. See note at bottom for how to access these.

Since the labels render this uninterpretable, we tried the same approach without labels.

WARNING!: The cell below takes tens of minutes to hours to run. Please do not execute unless you are able to let this run in the background.
plt.figure(figsize=(8,6))
nx.draw(G,pos, with_labels=False)
plt.show() 
Baseline KBpedia Visualization without Labels
Figure 2: Baseline KBpedia Visualization without Labels

This view is hardly any better.

Given the lengthy times it took to generate these visualizations, I decided to return to our candidate list and try other packages.

More Diligence on NetworkX Alternatives

If you recall, my first preferred option was graph-tool because of its reported speed and its wide variety of graph algorithms and layouts. The problem with graph-tool, as with the other alternatives, is that a C++ compiler is required, along with other dependencies. After extensive research online, I was unable to find an example of a Windows instance that was able to install graph-tool and its dependencies successfully.

I turned next to NetworKit. Though visualization choices are limited in comparison to the other C++ alternatives, this package has clearly been designed for network analysis and has a strong basis in data science. This package does offer a Windows 10 installation path, but one that suggests adding a virtual Linux subsystem layer to Windows. Again, I deemed this to be more complexity than a single visualization component warranted.

With igraph, I went through the steps of attempting an install, but clearly was also missing dependencies and using it kept killing the kernel in Jupyter Notebook. Again, perhaps with more research and time, I could have gotten this package to work, but it seemed to impose too much effort for a Windows environment for the possible reward.

Lastly, given these difficulties, and the fact that SNAP had been under less active development in recent years, I chose not to pursue this option further.

In the end, I think with some work I could have figured out how to get igraph to install, and perhaps NetworKit as well. However, as a demo candidate being chosen for Python newbies, it struck me that no reader of this series would want to spend the time jumping through such complicated hoops in order to get a C++ option running. Perhaps in a production environment these configuration efforts may be warranted. However, for our teaching purposes, I judged trying to get a C++ installation on Windows as not worth the effort. I do believe this leaves an opening for one or more developers of these packages to figure out a better installation process for Windows. But that is a matter for the developers, not for a newbie Python user such as me.

Faster Testing of NetworkX with the Upper KBpedia

So these realizations left me with the NetworkX alternative as the prime option. Given the time it took to render the full KBpedia, I decided to use the smaller upper structure of KBpedia to work out the display and rendering options before applying it to the full KBpedia.

I thus created offline a smaller graph dataset that consisted of the 72 SuperTypes and all of their direct resource concept (RC) children. You can inspect this dataset (df_kko) in a similar matter to the snippet noted above for the full KBpedia (df).

Also, to overcome some of the display limitations of the standard NetworkX renderers, I recalled that the HoloViews package used in the last installment also had an optional component, hvPlot, designed specifically to work with NetworkX graph layouts and datasets. The advantage of this approach is that we would gain interactivity and some of the tooltips when hovering over nodes on the graph.

I literally spent days trying to get all of these components to work together in terms of my desired visualizations where SuperType nodes (and their RCs) would be colored differently and the size of the nodes would be dependent on the number of subclasses. Alas, I was unable to get these desired options to work. In part, I think this is because of the immaturity of the complete ecosystem. In part, it is also due to my lack of Python skills and the fact that the entire chain of NetworkX → bokeh → HoloViews → hvPlot each provides its own syntax and optional functions for making visualization tweaks. It is hard to know what governs what and how to get all of the parts to work together nicely.

Fortunately, with the smaller input graph set, it is nearly instantaneous to make and see changes in real time. Despite the number of tests applied, the resulting code is fairly small and straightforward:

import pandas as pd
import holoviews as hv
import networkx as nx
import hvplot.networkx as hvnx
from holoviews import opts
from bokeh.models import HoverTool

hv.extension('bokeh')

# Load the data
# on MyBinder: https://github.com/Cognonto/CWPK/blob/master/sandbox/extracts/data/kko_graph_specs.csv
df_kko = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/kko_graph_specs.csv')

# Define the graph
G_kko = nx.from_pandas_edgelist(df_kko, 'source', 'target', ['Subs', 'SuperType', 'Color'], create_using=nx.DiGraph())

pos = nx.spring_layout(G_kko, k=0.4, iterations=70)

hvnx.draw(G_kko, pos, node_color='#D6556D', alpha=0.65).opts(node_size=10, width=950, height=950, 
                           edge_line_width=0.2, tools=['hover'], inspection_policy='edges')
Smaller Scale KKO (KBpedia) Graph
Figure 3: Smaller Scale KKO (KBpedia) Graph

Final Large-scale Visualization with NetworkX

With these tests of the smaller graph complete, we are now ready to produce the final visualization of the full KBpedia graph. Though the modified code is presented below, and does run, we actually use a captured figure below the code listing to keep this page size manageable.

WARNING!: The cell below takes more than three hours to run on our standard laptop and creates a page file of 36 MB. Please do not execute unless you are able to let this run in the background.
import pandas as pd
import holoviews as hv
import networkx as nx
from holoviews import opts
import hvplot.networkx as hvnx
#from bokeh.models import HoverTool

hv.extension('bokeh')

df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv')
Graphtype = nx.DiGraph()
G = nx.from_pandas_edgelist(df, edge_attr='weight', create_using=Graphtype)
print('Graph construction complete.')

pos = nx.spring_layout(G, k=0.4, iterations=70)

hvnx.draw(G, pos, node_color='#FCCC25', alpha=0.65).opts(node_size=5, width=950, height=950, 
                           edge_line_width=0.1)

#, tools=['hover'], inspection_policy='edges'
Full-sized KBpedia Graph
Figure 4: Full-sized KBpedia Graph

Other Graphing Options

I admit I am disappointed and frustrated with available Python options to capture the full scale of KBpedia. The pure Python options are unacceptably slow. Options that promise better performance and a wider choice of layouts and visualizations are difficult, if not impossible, to install on Windows. Of all of the options directly tested, none allowed me (at least with my limited Python skill level) to vary node colors or node sizes by in-degrees.

On the other hand, we began to learn some of the robust NetworkX package and will have occasion to investigate it further in relation to network analysis (CWPK #61). Further, as a venerable package, NetworkX offers a wide spectrum of graph data formats that it can read and write. We can export our graph specifications to a number of forms that perhaps will provide better visualization choices. As examples, here are ways to specify two of NetworkX’s formats, both of which may be used as inputs to the Gephi package. (For more about Gephi and Cytoscape as options here, see the initial links at the beginning of this installment.)

import pandas as pd
import networkx as nx

df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv')
Graphtype = nx.DiGraph()
G = nx.from_pandas_edgelist(df, edge_attr='weight', create_using=Graphtype)

nx.write_gexf(G, 'C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.gexf')

print('Gephi file complete.')

nx.write_gml(G, 'C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.gml')

print('GML file complete.')

Additional Documentation

This section presents the significant amount of material reviewed in order to make the choices for use in this present CWPK installment.

First, it is possible to get online help for most options to be tested. For example:

hv.help(hvnx.draw)

And, here are some links related to options investigated for this installment, some tested, some not:

NetworkX

These same references above are also provided for the ‘latest’ version.

graph-tool

NetworKit

deepgraph

nxviz

Netwulf

igraph

SNAP

pygraphistry

ipycytoscape

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site. The cowpoke Python code listing covering the series is also available from GitHub.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on October 21, 2020 at 11:04 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2404/cwpk-56-graph-visualization-and-extraction/
The URI to trackback this post is: https://www.mkbergman.com/2404/cwpk-56-graph-visualization-and-extraction/trackback/
Posted:October 19, 2020

It’s Time for Some Pretty Figures and Charts

It is time in our Cooking with Python and KBpedia series to investigate some charting options for presenting data or output. One of the remarkable things about Python is the wealth of add-on packages that one may employ, and visualization is no exception.

What we will first do in this installment is to investigate some of the leading charting options in Python sufficient for us to make an initial selection. We want nice looking output that is easily configured and fed with selected data. We also want multiple visualization types perhaps to work from the same framework, so that we need not make single choices, but multiple ones for multiple circumstances as our visualization needs unfold.

We will next tailor up some datasets for charting. We’d like to see a distribution histogram of our typlogies. We’d like to see the distribution of major components in the system, notably classes, properties, and mappings. We’d like to see a distribution of our notable links (mappings) to external sources. And, we’d like to see the interactive effect of our disjointedness assignments between typologies. The first desires can be met with bar and pie charts. the last with some kind of interaction matrix. (We investigate the actual knowledge graph in the next CWPK installment.)

We also want to learn how to take the data as it comes to us to process into a form suitable for visualization. Naturally, since we are generating many of these datasets ourselves, we could alter the initial generating routines in order to more closely match the needs for visualization inputs. However, for now, we will take our existing outputs as is, since that is also a good use case for wrangling wild data.

Review of Visualization Options

For quite a period, my investigation of Python visualization options had been focused on individual packages. I liked the charting output of options like Seaborn and Bokeh, and knew that Matplotlib and Plotly had close ties with Jupyter Notebook. I had previously worked with JavaScript visualization toolkits, and liked their responsiveness and often interactivity. On independent grounds, I was quite impressed with the D3.js library, though I was still investigating the suitability of that to Python. Because CWPK is a series that focuses on Python, though, I had some initial prejudice to avoid JS-dominated options. I also had spent quite a bit of time looking at graph visualization (see next installment), and had some concerns that I was not yet finding a package that met my desired checklist.

As I researched further, it was clear there were going to be trade-offs when picking a single, say, charting and then graphing package. It was about this time I came across the PyViz ecosystem. (Overall helpful tools listing: https://pyviz.org/tools.html.) PyViz is nominally the visualization complement to the broader PyData community.

Jake VanderPlas pulled together a nice overview of the Python visualization landscape and how it evolved for a presentation to PyCon in 2017. Here is the summary diagram from his talk:

Python Visualization Landscape
Figure 1: Python Visualization Landscape

Source: Jake VanderPlas, “Python’s Visualization Landscape,” PyCon 2017, https://speakerdeck.com/jakevdp/pythons-visualization-landscape-pycon-2017

The trend in visualization for quite a few years has been the development of wrappers over more primitive drawing programs that abstract and make the definition of graphs and charts much easier. As these higher-level libraries have evolved they have also come to embrace multiple lower-level packages under their umbrellas. The trade-off in easier definitions of visualization objects is some lack of direct control over the output.

Because of the central role of Jupyter Notebooks in this CWPK series, and not having a more informed basis for making an alternative choice, I chose to begin our visualization efforts with Holoviews, which is an umbrella packaging over the applications as shown in the figure above. Bokeh provides a nice suite of interactive plotting and figure types. NetworkX (which is used in the next installment) has good network analysis tools and links to network graph drawing routines. And Matplotlib is another central hub for various plot types, many other Python visualization projects, color palettes, and NumPy.

Getting Started

Like most Python packages, installation of Holoviews is quite straightforward. Since I also know we will be using the bokeh plot library, we include it as well when installing the system:

   conda install -c pyviz holoviews bokeh

Generating the First Chart

The first chart we want to tackle is the distribution of major components in KBpedia, which we will visualize with a pie chart. Statistics from our prior efforts (see the prior CWPK #54) and what is generated in the Protégé interface provide our basic counts. Since the input data set is so small, we will simply enter it directly into the code. (Later examples will show how we load CSV data using pandas .)

For the pie chart we will be using, we pick the bokeh plotting package. In reviewing code samples across the Web, we pick one example and modify it for our needs. I will explain key aspects of this routine after the code listing and chart output:

import panel as pn
pn.extension()
from math import pi
import pandas as pd                                                                   # Note 1

from bokeh.palettes import Accent
from bokeh.plotting import figure
from bokeh.transform import cumsum

a = {                                                                                 # Note 2
    'Annotation': 759398,
    'Logical': 85333,
    'Declaration': 63229,
    'Other': 8274
}

data = pd.Series(a).reset_index(name='value').rename(columns={'index':'axiom'})
data['angle'] = data['value']/data['value'].sum() * 2*pi
data['color'] = Accent[len(a)]

p = figure(plot_height=350, title='Axioms in KBpedia', toolbar_location=None,         # Note 3
           tools='hover', tooltips='@axiom: @value', x_range=(-0.5, 1.0))

r = p.wedge(x=0, y=1, radius=0.4,
        start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'),    # Note 4
        line_color='white', fill_color='color', legend_field='axiom', source=data)

p.axis.axis_label=None                                                                # Note 5
p.axis.visible=False
p.grid.grid_line_color = None

bokeh_pane = pn.pane.Bokeh(p)
bokeh_pane                                                                            # Note 6
Pie Chart of KBpedia Axioms
NOTE: The figures in this article are static captures of the interactive electronic notebook. See note at bottom for how to access these.

As with our other special routines, we begin by importing the new packages that are required for the pie chart (1). One of the imports, pandas, gives us very nice ways to relate an input CSV file or entered data to pick up item labels (rows) and attributes (col). Another notable import is to pick the color palette we want to use for our figure.

As noted, because our dataset is so small, we just enter it directly into the routine (2). Note how data entry conforms to the Python dictionary format of key:value pairs. Our data section also provides how we will convert the actual numbers of our data into segment slices in the pie chart, as well as defines for us the labels to be used based on pandas’ capabilities. We also indicate how many discrete colors we wish to use from the Accents palette. (Palettes may be chosen based on a set of discrete colors over a given spectrum, or, for larger data sets, picked as an increment over a continuous color spectrum. See further Additional Documentation below.)

The next two parts dictate how we format the chart itself. The first part sets the inputs for the overall figure, such as size, aspect, title, background color and so forth ((3)). We can also invoke some tools at this point, including the useful ‘hover’ that enables us to see actual values or related when mousing over items in the final figure. The second part of this specification guides the actual chart type display, ‘wedge’ in this case because of our choice of a pie chart (4). To see the various attributes available to us, we can invoke the standard dir() Python function:

dir(p)

We continue to add the final specifications to our figure (5) and then invoke our function to render the chart (6).

We can take this same pattern and apply new data on the distribution of properties within KBpedia according to our three major types, which produces this second pie chart, again following the earlier approach:

prop = {
    'Object': 1316,
    'Data': 802,
    'Annotation': 2919
}

data = pd.Series(prop).reset_index(name='value').rename(columns={'index':'property'})
data['angle'] = data['value']/data['value'].sum() * 2*pi
data['color'] = Accent[len(prop)]

p = figure(plot_height=350, title="Properties in KBpedia", toolbar_location=None,
           tools="hover", tooltips="@property: @value", x_range=(-0.5, 1.0))

r = p.wedge(x=0, y=1, radius=0.4,
        start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'),
        line_color="white", fill_color='color', legend_field='property', source=data)

p.axis.axis_label=None
p.axis.visible=False
p.grid.grid_line_color = None

bokeh_pane = pn.pane.Bokeh(p)
bokeh_pane
Pie Chart of KBpedia Properties

More Complicated Datasets

The two remaining figures in this charting installment use a considerably more complicated dataset: an interaction matrix of the SuperTypes (STs) in KBpedia. There are more than 70 STs under the Generals branch in KBpedia, but a few of them are very-high level (Manifestations, Symbolic, Systems, ConceptualSystems, Concepts, Methodeutic, KnowledgeDomains), leaving a total of about 64 that have potentially meaningful interactions. If we assume that interactions are transitive, that gives us a total of 2016 possible pairwise combinations among these STs ((N * N-1)/2).

From a substantive standpoint, some interactions are nearly global such as for Predications (including AttributeTypes, DirectRelations, and RepresentationTypes, specifically incorporating AdjunctualAttributes, ContextualAttributes, IntrinsicAttributes, CopulativeRelations, MediativeRelations, Associatives, Denotatives, and Indexes), and about 70 pair interactions are with direct parents. When we further remove these potential interactions, we are left with about 50 remaining STs, representing a final set of 1204 ST pairwise interactions.

Of this final set, 50% (596) are completely disjoint, 646 are disjoint to max 0.5%, and only 355 (30%) have overlaps exceeding 10%.

There are two charts we want to produce from this larger dataset. The first is a histogram of the distribution of STs as measured by number of reference concepts (RCs) each contains, and the second is a heatmap of the ST interactions that meaningfully participate in disjoint assertions.

In getting the basic input data into shape, it would have been possible to rely on many standard Python packages geared to data wrangling, but the fact is that a dataset of even this size can perhaps be more effectively and quickly manipulated in a spreadsheet, which is how I approached these sets. The trick to large-scale sorts and manipulations of such data in a spreadsheet is to create temporary columns or rows in which unique sequence numbers are designed (with the numbers being calculated from a formula such a new cell ID = prior cell ID + 1), copy the formulas as values, and then include these temporary rows or columns in the global (named) block that contains all of the data. One can then do many manipulations of the data matrix and still return to desired organization and order by sorting again on these temporary sequence numbers.

Histogram Distribution of STs by RCs

Let’s first begin, then, with the routine for displaying our SuperTypes (STs) according to their count of reference concepts (RCs). We import our needed Python packages, including a variety of color palettes, and reference our source input file in CSV format. Note we are reading this input file into pandas, which we invoke in order to see the input data (ST by RC count):

import pandas as pd
from bokeh.plotting import figure, output_notebook, show, ColumnDataSource
from bokeh.models.tools import HoverTool
from bokeh.transform import factor_cmap
from bokeh.palettes import viridis, magma, Turbo256, linear_palette

output_notebook()

src = r'C:\1-PythonProjects\kbpedia\v300\extractions\data\supertypes_counts.csv'
# on MyBinder, find at: CWPK/sandbox/extracts/data/

df = pd.read_csv(src)

df

Again using pandas, we are able to relate our column data to what will be displayed in the final figure:

supertypes = df['SuperTypes']
rcs = df['RCs']

supertypes

As with our previous figures, we have to input our settings for both the overall figure and the plot type (horizontal bar, in this case):

p = figure(y_range=supertypes,
           title = 'Counts by Disjoint KBpedia SuperTypes',
           x_axis_label = 'RC Counts',
           plot_width = 800,
           plot_height = 600,
           tools = 'pan,box_select,zoom_in,zoom_out,save,reset'
           )

p.hbar(y = supertypes,
       right = rcs,
       left = 0,
       height = 0.4,
       color = 'orange',
       fill_alpha = 0.5
       )

show(p)
Bar Chart of KBpedia RCs by SuperType (single color)

This shows the ease of working directly with pandas dataframes. But, there is a built-in function called ColumnDataSource that gives us some additional flexibility:

source = ColumnDataSource(df)

st_list = source.data['SuperTypes'].tolist()

p2 = figure(y_range = st_list,                              # Note the change of source here
            title = 'Counts by Disjoint KBpedia SuperTypes',
            x_axis_label = 'RC Counts',
            plot_width = 800,
            plot_height = 600,
            tools = 'pan,box_select,zoom_in,zoom_out,save,reset'
           )

p2.hbar(y = 'SuperTypes',                                   
        right = 'RCs',                                      
        left = 0,
        height = 0.4,
        color = 'orange',
        fill_alpha = 0.5,
        source=source                                      # Note the additional source
       )

hover = HoverTool()

hover.tooltips = """
    <div>
        <div><strong>SuperType: </strong>@SuperTypes</div>
        <div><strong>RCs: </strong>@RCs</div>         
    </div>
"""
p2.add_tools(hover)

show(p2)

Next, we want to add a palette. After trying the variations first loaded, we choose Turbo256 and tell the system the number of discrete colors desired:

mypalette = linear_palette(Turbo256,50)

p2.hbar(y = 'SuperTypes',
        right = 'RCs',
        left = 0,
        height = 0.4,
        fill_color = factor_cmap(
               'SuperTypes',
               palette = mypalette,
               factors=st_list
               ),
        fill_alpha=0.9,
        source=source
)

hover = HoverTool()

hover.tooltips = """
    <div>
        <div><strong>SuperType: </strong>@SuperTypes</div>
        <div><strong>RCs: </strong>@RCs</div>         
    </div>
"""
p2.add_tools(hover)

show(p2)
Bar Chart of KBpedia RCs by SuperType (multi-color)

This now achieves the look we desire, with the bars sorted in order and a nice spectrum of colors across the bars. We also have hover tips that provide the actual data for each bar. The latter is made possible by the ColumnDataSource where we replace the standard ‘dict’ format into x, y.

Since we continue to gain a bit more tailoring and experience with each chart, we decide it is time to tackle the heatmap.

Heatmap Display

A heatmap is an interaction matrix. In our case, what we want to display are the SuperTypes that have some degree of disjointedness plotted against one another, with the number of RCs in x displayed against the RCs within y. Since, as the previous horizontal bar chart shows, we have a wide range of RC counts by SuperType, to normalize these interactions we decide to express the overlap as a percentage.

We again set up our imports and figure as before. If you want to see the actual data input file and format, invoke df_h as we did before:

import holoviews as hv
from holoviews import opts
hv.extension('bokeh', 'matplotlib')
import pandas as pd
import matplotlib

src = r'C:\1-PythonProjects\kbpedia\v300\extractions\data\st_heatmap.csv'
# on MyBinder, find at: CWPK/sandbox/extracts/data/

df_h = pd.read_csv(src)

heatmap = hv.HeatMap(df_h, kdims=['ST 1(x)', 'ST 2(y)'], vdims=['Rank', 'Overlap', 'Overlap/ST 1', 
                    'ST 1 RCs', 'ST 2 RCs'])

color_list = ['#555555', '#CFCFCF', '#C53D4D', '#D14643', '#DC5039', '#E55B30',
           '#EB6527', '#F0701E', '#F47A16', '#F8870D', '#FA9306', '#FB9E07',
           '#FBAC10', '#FBB91E', '#F9C52C', '#F6D33F', '#F3E056', '#F1EB6C',
           '#F1EE74', '#F2F381', '#F3F689', '#F5F891', '#F6F99F', '#F7FAAC',
           '#F9FBB9', '#FAFCC6', '#FCFDD3', '#FEFFE5']

# for color_list, see https://stackoverflow.com/questions/21094288/convert-list-of-rgb-codes-to-matplotlib-colormap

my_cmap = matplotlib.colors.ListedColormap(color_list, name='interact')

heatmap.opts(opts.HeatMap(tools=['hover'], cmap=my_cmap, colorbar=True, width=960, 
                          xrotation=90, height=960, toolbar='above', clim=(0, 26)))

heatmap
Overlap Heatmap of Shared RCs Between SuperTypes

All of the available palettes did not have a color spectrum we liked, plus we needed to introduce the dark gray color (where an ST is being mapped to itself and therefore needs to be excluded). Another exclusion (light gray) is to remove ST interactions with anything in its parental lineage.

As for useful interactions, we wanted a close to smooth distribution of overlap intensities across the entire spectrum of 0% overlap (no color, white) to more than 95% (dark red). We achieve this distribution by not working directly from the percentage overlap figures, but by the mapping of thse percentage overlaps to a more-or-less smooth ranking assignment from roughly 0 to 30. It is the rank value that determines the color of the interaction cell.

There are clearly many specifics that may set and tweaked for your own figures. The call below is one example of how to get explanation of these settings.

hv.help(hv.HeatMap)

Additional Documentation

Colors and Palettes

Charting

What to chart?

Heatmaps

Other

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site. The cowpoke Python code listing covering the series is also available from GitHub.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on October 19, 2020 at 11:44 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2402/cwpk-55-charting/
The URI to trackback this post is: https://www.mkbergman.com/2402/cwpk-55-charting/trackback/