Posted: October 26, 2020

The Beginning of Moving Our Environment into the Cloud

Today’s installment in our Cooking with Python and KBpedia series begins a three-part mini-series of moving our environment into the clouds. This first part begins with setting up a remote instance and Web page server using the mini-framework Flask. The next installment expands on this basis to begin adding a SPARQL endpoint to the remote instance. And, then, in the concluding third part to this mini-series, we complete the endpoint and make the SPARQL service active.

We undertake this mini-series because we want to make KBpedia more open to SPARQL queries from anywhere on the Web and we will be migrating our main KBpedia Web site content to Python. The three installments in this mini-series pave the way for us to achieve these aims.

We begin today’s installment by looking at the approach and alternatives for achieving these aims, and then we proceed to outline the steps needed to set up a Python instance on a remote (cloud) instance and to begin serving Web pages.

Approach and Alternatives

We saw earlier in CWPK #25 and CWPK #50 the usefulness of the SPARQL query language to interact with knowledge graphs. Our first objective, therefore, was to establish a similar facility for KBpedia. Though we have an alternate choice to set up some form of RESTful API to KBpedia (see further Additional Documentation below), and perhaps that may make sense at some point in time, our preference is to use SPARQL given its robustness and many query examples as earlier installments document.

Further, we can foresee that our work with Python in KBpedia may warrant moving our standard KBpedia Web site to Python and away from Clojure and its current Bootstrap-based Web page format. Though Python is not generally known for its Web-page serving capabilities, some exploration of that area may indicate whether we should go in that direction or not. Lastly, given our intent to make querying of KBpedia more broadly available, we also wanted to adhere to linked data standards. This latter objective is perhaps the most problematic of our aims, as we will discuss over the next couple of installments.

Typically, and certainly the easiest path, when one wants to set up a SPARQL endpoint with linked data ‘conneg‘ (content negotiation) is to use an existing triple store that has these capabilities built in. We already use Virtuoso as a triple store, and there are a couple of Python installation guides already available for Virtuoso. Most of the other widely available triple stores have similar capabilities.

Were we not interested in general Web page serving and were outside of the bounds of the objectives of this CWPK series, use of a triple store is the path we would recommend. However, since our aims are different, we needed to turn our attention to Python-based solutions.

From the standpoint of Web-page serving, perhaps the best known and most widely installed Python option is Django, a fully featured, open-source Web framework. Django has a similar scope and suite of capabilities to its PHP counterpart, Drupal. These are comprehensive and complicated frameworks, well suited to massive database-backed sites or ones with e-commerce or other specialty applications. We have had much experience with Drupal, and frankly did not want to tackle the complexity of a full framework.

I was much more attracted to a simpler microframework. The two leading Python options are Flask and Bottle (though there is also Falcon, which does not seem as developed). I was frankly more impressed with the documentation and growth and acceptance shown by Flask, and there appeared to be more analogous installations. Flask is based on the Jinja template engine and the Werkzeug WSGI (Web-server Gateway Interface) utility library. It is fully based on Unicode.

Another factor that needs to be considered is support for RDFlib, the key package (along with related ones) that we will be using for the SPARQL efforts. I first discussed this package in CWPK #24, though it is featured in many installments.

Basic Installation

We will be setting up these new endpoints on our cloud server, which is a large EC2 instance on Amazon Web Services running Ubuntu 18.04 LTS. Of course, this is only one of many cloud services. As a result, I will not discuss all of the preliminary steps to first securing an instance, or setting up an SSH client to access it, nor any of the initial other start-up issues. For EC2 on AWS, there are many such tutorials. Two that I have encountered in doing the research for this installment include this one and this other one. There are multiple others, and ones applicable to other providers than AWS as well.

So, we begin from the point that an instance exists and you know how to log in and do basic steps with it. Even with this simplification, I began my considerations with these questions in mind:

  1. Do I need a package manager, and if so, which one?
  2. Where should I place my Python projects within the remote instance directory structure?
  3. How do I also include the pip package environment?
  4. Should I use virtual environments?

With regard to the first question, I was sure I wanted to maintain the conda package manager, but I was not convinced I needed the full GUI and set of packages with Anaconda. I wanted to keep consistency with the set-up environment we put in place for our local instance (see CWPK #9). However, since I had gained much experience with Anaconda and conda, I felt comfortable not installing the full 4 GB Anaconda distribution. This led me to adopt the minimal miniconda package, which has a much smaller footprint. True, I will need to install all specific Python packages and work from the command line (terminal), but that is easy given the experience I have gained.

Second, in reviewing best practices information for directory structures on Ubuntu, I decided to create an analogous ‘python-projects’ master directory, similar to what I established for the local instance, under the standard user application directory. I thus decided on and created this master directory: /usr/bin/python-projects.
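On Ubuntu this takes only a couple of commands; a minimal sketch, assuming you are working as root or with sudo rights:

sudo mkdir -p /usr/bin/python-projects
cd /usr/bin/python-projects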

So, having decided to use miniconda and knowing the master directory I wanted to use, I proceeded to set up my remote installation. I followed a useful source for this process, though importantly updated all references to the current versions. The first step was to navigate to my master directory and then to download the 64-bit Linux installer onto my remote instance, followed by executing the installation script:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh    # make sure you use the updated version

sh Miniconda3-latest-Linux-x86_64.sh    

The installation script first requires you to page through the license agreement (many lines!!) and then accept it. Upon acceptance, the Miniconda code is installed with some skeletal stubs in a new directory under your master directory, which in my case is /usr/bin/python-projects/miniconda3. The last step in the installation process asks where you want Miniconda installed. The default is to use /root/miniconda3. Since I wanted to keep all of my Python project stuff isolated, I overrode this suggestion and used my standard location of /usr/bin/python-projects/miniconda3.

After all of the appropriate files are copied, I agreed to initialize Miniconda. This completes the basic Miniconda installation.

Important Note: You will need to sign out and then sign back into your SSH shell at this point before the changes become active!

After signing back in, I again navigated to my Python master directory and then installed the pip package manager since not all of the Python packages we use in this CWPK series and cowpoke are available through conda.
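The exact command is not recorded above, but with Miniconda now governing the environment, one plausible way to do this (a sketch, not necessarily the command used) is simply:

conda install pip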

Virtual Environments and a Detour

In the standard use of a Linux installation, one uses the distro’s own installation and package management procedures. In the case of Ubuntu and related distros such as Debian, apt stands for ‘advanced package tool’, and commands such as apt-get are one means to install new capabilities on your remote instance. Other Linux distros may use yum or rpm or similar to install new packages.

Then, of course, within Python, one has the pip package installer and the conda one we are using. Further, within a Linux installation, how one may install packages and where they may be applicable depends on the user rights of the installer. That is one reason why it is important to have sudo (super user) rights as an administrator of the instance when one wants new packages to be available to all users.

These package managers may conflict in rights and, if not used properly, may work at cross purposes. For example, a standard AWS EC2 instance using Ubuntu comes packaged with its own Python version. This default is useful for the occasional app that needs Python, but does not conform to the segregation of packages that Python often requires using its ‘virtual environments‘.

On the face of it, the use of virtual environments seems to make sense for Python: we can keep projects separate with different dependencies and even different versions of Python. I took this viewpoint at face value when I began this transition to installing Python on a remote instance.

Given this, it is important never to run sudo pip install. We need to know where and to what our Python packages apply, and generic (Linux-type) install routines confuse Linux conventions with Python ones. We are best served by being explicit. There are conditions, then, where the idea of a Python virtual environment makes sense. These circumstances include, among others, a Python shop that has many projects; multiple developers with different versions and different applications; a mix of Python applications between Python 2 and 3 or where specific version dependencies create unworkable conditions; and so forth.

However, what I found in migrating to a remote instance is that virtual environments, combined with a remote instance and different operating system, added complexity that made finding and correcting problems difficult. After struggling for hours trying to get systems to work, not really knowing where the problem was occurring nor where to look for diagnostics, I learned some important things the hard way. I describe below some of these lessons.

An Unfortunate Bad Fork

So, I was convinced that a virtual environment made sense and set about following some online sources that documented this path. In general, these approaches used virtualenv (or venv), a pip-based environment manager, to set up this environment. Further, since I was using Ubuntu, AWS and Apache2, these aspects added to the constraints I needed to accommodate to pursue this path.

Important Note: For my configuration, this path is a dead end! If you have a similar configuration, do NOT follow this path. The section after this one presents the correct approach!

In implementing this path, I first installed pip:

sudo apt install -y python3-pip

I was now ready to tackle the fourth and last of my installation questions, namely to provide a virtual environment for my KBpedia-related Python projects. To do so, we first need to install the virtual environment package:

sudo apt install -y python3-venv

Then, we make sure we are in the base directory where we want this virtual environment to reside (in our case, /usr/bin/python-projects/). We also will name our virtual environment kbpedia. We first establish the virtual environment:

python3.6 -m venv kbpedia

which pre-populates a directory with some skeletal capabilities. Then we activate it:

source kbpedia/bin/activate

The virtual environment is now active, and you can work with it as if you were at the standard command prompt, though that prompt does change in form to something like (kbpedia) :/usr/bin/python-projects#. You work with this directory as you normally would, adding test code next in our case. When done working with this environment, type deactivate to return from the virtual environment.

The problem is, none of this worked for my circumstance, and likely never would. What I had neglected in taking this fork is that conda is both a package manager and a virtual environment manager. With the steps I had just taken, I had inadvertently set up multiple virtual environments, which were definitely in conflict.

The Proper Installation

Once I realized that my choice of conda meant I had already committed to a virtual environment manager, I needed to back off all of the installs I had undertaken with the bad fork. That meant I needed to uninstall all packages installed under that environment, remove the venv environment manager, remove all symbolic links, and so forth. I also needed to remove all Apache2 updates I had installed for working with WSGI. I had no confidence that whatever I had installed had registered properly.

The bad fork was a costly mistake, and it took me quite a bit of time to research and find the proper commands to remove the steps I had undertaken (which I do not document here). My intent was to get back to ‘bare iron’ for my remote installation so that I could pursue a clean install based on conda.

Installing the mod-wsgi Apache Module

After reclaiming the instance, my first step was to install the appropriate Apache2 modules to work with Python and wsgi. I began by installing the WSGI module to Apache2:

sudo apt-get install libapache2-mod-wsgi-py3

Which we then test to see if it was indeed installed properly:

apt-cache show libapache2-mod-wsgi-py3

Which displays:

Package: libapache2-mod-wsgi-py3
Architecture: amd64
Version: 4.5.17-1ubuntu1
Priority: optional
Section: universe/httpd
Source: mod-wsgi
Origin: Ubuntu
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Original-Maintainer: Debian Python Modules Team <python-modules-team@lists.alioth.debian.org>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 271
Provides: httpd-wsgi
Depends: apache2-api-20120211, apache2-bin (>= 2.4.16), python3 (>= 3.6), python3 (<< 3.7), libc6 (>= 2.14), libpython3.6 (>= 3.6.5)
Conflicts: libapache2-mod-wsgi
Filename: pool/universe/m/mod-wsgi/libapache2-mod-wsgi-py3_4.5.17-1ubuntu1_amd64.deb
Size: 88268
MD5sum: 540669f9c5cc6d7a9980255656dd1273
SHA1: 4130c072593fc7da07b2ff41a6eb7d8722afd9df
SHA256: 6e443114d228c17f307736ee9145d6e6fcef74ff8f9ec872c645b951028f898b
Homepage: http://www.modwsgi.org/
Description-en: Python 3 WSGI adapter module for Apache
 The mod_wsgi adapter is an Apache module that provides a WSGI (Web Server Gateway Interface, a standard interface between web server software and web applications written in Python) compliant interface for hosting Python based web applications within Apache. The adapter provides significantly better performance than using existing WSGI adapters for mod_python or CGI.
 .
 This package provides module for Python 3.X.
Description-md5: 9804c7965adca269cbc58c4a8eb236d8

And next, we check to see if the module is properly registered:

Our first test is:

sudo apachectl -t

and the second is:

sudo apachectl -M | grep wsgi

which gives us the correct answer:

wsgi_module (shared)

Setting Up the Conda Virtual Environment

We then proceed to create the ‘kbpedia’ virtual environment under conda:

conda create -n kbpedia python=3

Which we test with the Ubuntu path inquiry:

echo $PATH

which gives us the correct path registration (the first of the five listed):

/usr/bin/python-projects/miniconda3/envs/kbpedia/bin:/usr/bin/python-projects/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin

and then we activate the ‘kbpedia’ environment:

conda activate kbpedia

Installing Needed Packages

Now that we are active in our virtual environment, we need to install our missing packages:

conda install flask
conda install pip

Installing Needed Files

For Flask to operate, we need two files. The first file is the basic calling application, test_kbpedia.py, that we place under the Web site directory location, /var/www/html/kbpedia/ (we create the kbpedia directory). We call up the editor for this new file:

vi test_kbpedia.py

and enter the following simple Python program:

 
from flask import Flask
app = Flask(__name__)@app.route("/")
def hello():
return "Hello KBpedia!"

It is important to know that Flask comes with its own minimal Web server, useful only for demo or test purposes, and one that (because it can not be secured) should NOT be used in a production environment. Nonetheless, we can test this initial Flask specification by entering either of these two commands:

 wget http://127.0.0.1/ -O- 

curl http://127.0.0.1:5000

Hmmm. These commands seem not to work. Something must not be correct in our Python file. To get better feedback, we run the Flask development server directly from the command line:

flask run

This gives us the standard traceback listing that we have seen previously with Python programs. We get an error message that the name ‘app’ is not recognized. As we inspect the code more closely, we can see that one line in the example that we were following was inadvertently truncated (denoted by the decorator ‘@’ sign). We again edit test_kbpedia.py to now appear as follows:

 
from flask import Flask
app = Flask(__name__)
@app.route("/")
def hello():
return "Hello KBpedia!"

Great! That seems to fix the problem.
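For reference, here is a hedged sketch of how the built-in development server can be run against this file for a quick local check; the FLASK_APP variable and the flask run options are standard Flask conventions, though this exact invocation is illustrative rather than the one used above:

cd /var/www/html/kbpedia
export FLASK_APP=test_kbpedia.py
flask run --host=127.0.0.1 --port=5000    # development server only; not for production
curl http://127.0.0.1:5000/

Again, this built-in server is only for testing; the Apache and mod_wsgi configuration below is what actually serves the page.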

We next need to enter a second Python program that tells WSGI where to find this app. We call up the editor again in the same directory location:

vi wsgi.py

(We also may use the *.wsgi file extension if we so choose; some examples prefer this second option.)

and enter this simple program into our editor:

    
import sys
sys.path.insert(0, "/var/www/html/kbpedia/")
from test_kbpedia import app as application

This program tells WSGI where to find the application, which we register via the sys package as the code states.

Lastly, to complete our needed chain of references, Apache2 needs instructions for where a ‘kbpedia’ reference in an incoming URI needs to point in order to find its corresponding resources. So, we go to the /etc/apache2/sites-enabled directory and then edit this file:

vi 000-default.conf

Under DocumentRoot in this file, enter these instructions:

WSGIDaemonProcess kbpedia python-path=/usr/bin/python-projects/miniconda3/envs/kbpedia/lib/python3.8/site-packages
WSGIScriptAlias /kbpedia /var/www/html/kbpedia/wsgi.py
<Directory /var/www/html/kbpedia>
    WSGIProcessGroup kbpedia
    WSGIApplicationGroup %{GLOBAL}
    Order deny,allow
    Allow from all
</Directory>

This provides the path for where to find the WSGI file and Python.

We check to make sure all of these files are written correctly by entering this command:

sudo apachectl -t

When we first run this, while we get a ‘Syntax OK’ message, we also get a warning message that we need to “Set the ‘ServerName’ directive globally to suppress this message”. While there is no problem continuing in this mode, in order to understand the suite of supporting files, we navigate to where the ServerName is set, in the /etc/apache2 directory, by editing this file:

vi apache2.conf

and enter this name:

ServerName localhost

Note that anytime we make a change to an Apache2 configuration file, we need to restart the server to make sure the new configuration is active. Here is the basic command:

sudo service apache2 restart

If there is no problem, the prompt is returned after entering the command.

Our initial inputs are now complete. To test whether we have properly set up Flask and our Web page serving, we use the IP address for our instance and enter this address in a browser:

http://54.227.249.140/kbpedia

And, the browser returns a Web page stating:

Hello KBpedia!

Fantastic! We now are serving Web pages from our remote instance.

Of course, this is only the simplest of examples, and we will need to invoke templates in order to use actual HTML and style sheets. These are topics we will undertake in the next installment. But, we have achieved a useful milestone in this three-part mini-series.

Some More conda Commands

If you want to install libraries into an environment, or you want to use that environment, you have to activate the environment first by:

conda activate kbpedia

After invoking this command, we can install desired packages into the target environment, for example:

conda install package-name

But sometimes a package is only available from a different channel, which we can name directly as part of the install command:

conda install --channel asmeurer package-name

or, using the short form of the flag (here for the popular conda-forge channel):

conda install -c conda-forge package-name

To see what packages are presently available to your conda environment, type:

conda list

And, to see what environments you have set up within conda, enter:

conda env list

which, in our current circumstance, gives us this result:

base                     /usr/bin/python-projects/miniconda3
kbpedia               *  /usr/bin/python-projects/miniconda3/envs/kbpedia

The environment shown with the asterisk (*) is the currently active one.

Another useful command to know is to get full information on your currently active conda environment. To do so, type:

conda info

in our case, that produces the following output:

     active environment : kbpedia
    active env location : /usr/bin/python-projects/miniconda3/envs/kbpedia
            shell level : 1
       user config file : /root/.condarc
 populated config files :
          conda version : 4.8.4
    conda-build version : not installed
         python version : 3.8.3.final.0
       virtual packages : __glibc=2.27
       base environment : /usr/bin/python-projects/miniconda3  (writable)
           channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /usr/bin/python-projects/miniconda3/pkgs
                          /root/.conda/pkgs
       envs directories : /usr/bin/python-projects/miniconda3/envs
                          /root/.conda/envs
               platform : linux-64
             user-agent : conda/4.8.4 requests/2.24.0 CPython/3.8.3 Linux/4.15.0-115-generic ubuntu/16.04.1 glibc/2.27
                UID:GID : 0:0
             netrc file : None
           offline mode : False

Additional Documentation

Here are some additional resources touched upon in the prior discussions:

Getting Oriented

Flask on AWS

Each of these covers some of the first steps needed to get set up on AWS, which we skip over herein:

RESTful APIs

General Flask Resources

RDFlib

Here is a nice overview of RDFlib.

General Remote Instance Set-up

A video on setting up an EC2 instance and Putty; also deals with updating Python, Filezilla and crontab.

Other Web Page Resources

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site. The cowpoke Python code listing covering the series is also available from GitHub.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted: October 22, 2020

Notebooks are Only Interactive if You Share

We first began publishing interactive electronic notebooks for this Cooking with Python and KBpedia series in CWPK #16, though the first installment using a notebook started with CWPK #14. I had actually drafted all of the installments up to this one before that date was reached. Since one major purpose of this series was to provide hands-on training, I did not want to force those who wanted to experience some degree of interactivity to have to go through all of the steps to set up their own interactive environment. My hope was that a taste of direct involvement with the code and interactivity would itself encourage users to get more deeply involved to establish their own interactive environments.

I had encountered for myself fully interactive notebooks prior to this point, ones where all I needed to do was to click to operate, so I knew there must be a way to make my own notebooks similarly available. In order to achieve my objective, as is true with so much of this series, I was forced to do the research and discover how I could set up such a thing.

A Survey of the Options

In researching the options it was clear that a spectrum of choices existed. We have already discussed how we can create non-interactive mockups of an interactive notebook using the nbconvert option or converting a drafted notebook using Pandoc (CWPK #15). My research surfaced some additional options to render a notebook page for general Web (HTML) display:

Static Options

  1. nbconvert, but lose interactivity
  2. The Pandoc option
  3. Publish in other formats (PDF)
  4. View a non-interactive page via nbviewer by simply providing a URL, which works like nbconvert.

These options are helpful, of course, but lack the full interactivity desired.

Fully Interactive

Systems that allow code cells to be run interactively are obviously more complex than nice rendering tools. My investigations turned up a number of online services, plus ways to set up one’s own or private servers. Here are the leading options:

There is a Python option that does not provide complete interactivity, but rather simple interactions with certain aspects of notebook cells:

  1. nbinteract is a Python package that provides a command-line tool to generate interactive web pages from Jupyter notebooks

Then, there are a series of online services:

  1. The MyBinder option, which uses a JupyterHub server directly from a Git repository
  2. Google’s Colaboratory, which provides a Google flavor on this approach
  3. Microsoft’s Azure Notebooks, which provides a MS flavor on this approach
  4. There are other sites such as Kaggle Kernels, CoCalc, nanoHUB, or Datalore that also provide such services, some for a fee.

The other interactive approach is to not use an established service, but to set up your own server.

  1. For private repositories, one can build on BinderHub, the same technology used by MyBinder, and which runs on JupyterHub running on Kubernetes for most of its functionality, or
  2. One can run a public notebook server based on Jupyter, though it is limited to a single access user at a time, or
  3. Set up one’s own JupyterHub, similar to the BinderHub option but not limited to a Git repository.

Frankly, most of the own-server options looked to be too much work simply to support my educational objectives for the CWPK series.

The Chosen MyBinder Option

I was very much committed to having an online service that would run my full stack. I chose to implement the MyBinder option because I could see it worked and was popular, it had close ties to Jupyter and rendered notebooks the same as when used locally, it was free, and it seemed to have strong backing and documentation. On the other hand, MyBinder has some weaknesses and poses some challenges. Some of the key ones I knew going in or identified as I began working with the system were:

  • As a hosted service that runs its applications in containers, it could take some minutes to get the online service active when used after a hiatus, necessary to reconfigure the container specific to the application and Python modules being used
  • It reportedly has a memory limitation of 1-2 GB. Memory can be an issue with CWPK locally even at the 8 GB level
  • The service needed to run off of a Git repo. I had plans to better expose all aspects of the CWPK series and its supporting software on our existing public GitHub repository; the Git requirement caused me to accelerate my exposure of this service
  • Though free now, each MyBinder application is a rather large consumer of resources. I have some concerns regarding the longer-term availability of the service
  • CWPK would have more than 60 interactive notebook pages, though I did see reference that performance issues may only arise due to multiple, concurrent use. Going in, I had no clue as to what the use factor might be for the service and whether this would pose a problem or not.

Some of these issues deserve their own commentary below.

Setting Up the Environment

Setting up a new instance at the MyBinder service is relatively straightforward. Here is the basic set-up screen on the main page:

Figure 1: Setting Up a New MyBinder Project

One must first have a Git repository available to start the service. One also needs to have completed an environment configuration file (environment.yml in our case with Python) and a project README.md at the root of the master branch on the repo. In our case, it is the CWPK repository on GitHub; we also indicate we are dealing with the master branch (1). (I had some initial difficulty when I over-specified a link to an individual notebook page; removing this cleared things up.) These simple specifications create the URL that is the link to your formal online project (2). Upon launch (3), the build process is shown to screen (4), which may take some minutes to build. The set of working input specs also provide the basis for generating a link badge you can use on your own Web sites (5).

Upon completion of a successful build, one is shown the standard Jupyter Notebook entry page.

Here are two additional resources useful to setting up a MyBinder application for the first time:

Implementation Challenges

Though set up is straightforward, there are some challenges in implementing MyBinder to accommodate specific CWPK needs. Here are some of the major areas I encountered, and some steps to address them.

Importing Local Code

As has become obvious in our series to date, Python is a highly configurable environment, with literally tens of thousands of packages to choose from to invoke needed functionality. The standard environment settings appear to do a good job of allowing new packages to be specified and imported into the MyBinder system. I had confidence these could be handled appropriately.

My major concern related to CWPK‘s own cowpoke package. At the time of starting this effort, this package was not commercial grade and was not registered on major distribution networks like PyPi or conda-forge. When used locally, including cowpoke is not a big issue: we only need to include it in the local listing of site packages. But, once we rely on a cloud instance, how can we get that code into our online MyBinder system?

The answer, it turns out, is to package our code as one would normally do for a commercial package, and then to include a setup.py configuration file in our local specification. That enables us to invoke the package through the standard MyBinder environment configuration. See especially this key reference and this stub for setup.py.
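As an indication of what that looks like, here is a minimal setup.py sketch; the metadata and the install_requires list are illustrative assumptions, not the actual cowpoke specification:

from setuptools import setup, find_packages

setup(
    name='cowpoke',                     # illustrative package metadata
    version='0.1',
    packages=find_packages(),
    install_requires=['owlready2'],     # illustrative dependency only
)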

Local Data Sharing and Organization

Until this point, I had been developing and refining a local file directory structure in which to put different versions of KBpedia, other input files, and example outputs. This system was being developed logically from the perspective of a local file system.

However, these files were local and not exposed for access to an online system like MyBinder. My first thought was to simply copy this structure to the GitHub repo. But the manual copying of files to a version control system is NOT efficient, and the directory structure itself did not appear suitable for a repo presence. Further, manually copying files presents an ongoing issue of keeping local and remote versions in sync. Moreover, as I began adding new daily installments to the GitHub repo, I could see in general that manual additions were not going to be sustainable.

These realizations forced two decisions. First, I would need to re-think and re-organize my directory structures to accommodate both local and repo needs. The directory structure we have developed to date now reflects this re-organization. Second, as described in the next section, I needed to cease my manual use of GitHub and fully embrace it as a version control system.

Fully Embracing GitHub

I have to admit: Every time I try to work with version control with Git, I have been confused and frustrated with how to actually get anything done. I have, hopefully, progressed a bit beyond this point, but I would caution some of you looking to move into this area that you may have to overcome poor documentation and obfuscated instructions and commands.

So, I will not divert this series to deal with how to properly set up a Git-based version control system. In brief, one needs to establish a Git repository and, on Windows, set up a Git client and then (as I did) set up TortoiseGit if you want to work directly in Windows and the File Explorer rather than from the command line. In the process of doing all of this, you will also need to set up a key-based access control system with PuTTY (puttygen) so that you can communicate securely between your remote instance and your local file system. These steps are more effectively described in the TortoiseGit manual or various tutorials of one aspect or another. Installation, too, can be difficult with regard to general aspects or the PuTTY keys.

The reason, of course, for accepting this set-up complexity is being able to make changes either on a local version of code or data or a remote version of the same. This is the essence of the definition of keeping systems in ‘sync’. After having worked with these systems now for some weeks, daily, I think I can offer some simple tips for how best to work with these version control systems, points which are not obvious from most written presentations:

  1. First, make sure both the remote and local sources are in sync (this is actually not such an easy point, and is often the point of failure and frustration. However, until this status of being in sync is met, none of the other points below are possible.) When working locally, it is good practice to ‘pull’ any changes first from the remote repository before you attempt to ‘push’ local changes back to it. If you run into problems at this initial point, you need to research and find a fix before moving on
  2. Remember that your version control can really only occur from the local side, where your TortoiseGit is installed. So, while changes may occur either in the remote or local repository, the control to keep things in sync will occur from the local side (TortoiseGit)
  3. Whether at the remote or local repository, make all needed changes there, including deleting files, adding files, or modifying files. Then, commit those changes to the repository at hand (local or remote). (On TortoiseGit locally, this is done via the ‘Add’ Explorer menu option for new files; use the ‘Check for modifications’ option for changed files.) Commitment to the repository at hand is needed before the version control system knows what has been formally modified
  4. Again, be cognizant of where the modifications have occurred, which in any case you will control from the local TortoiseGit. If the changes have been made locally, then ‘push’ those changes to the remote repository; if the changes have been made remotely, then ‘pull’ those changes back to the local.

Always make sure that as any changes are made, at either side, they are synced to the system. In this way, you can be assured that your version control system is in a stable state, and you are free to make changes on either the local or remote side. Also know you can use GitHub for keeping multiple local instances (a desktop and a laptop in my case) in sync with the remote repository. Simply follow the above guidelines for each instance.
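For those who prefer the command line to TortoiseGit, the same pull-commit-push cycle looks roughly like this (the branch name master matches the repo set-up noted earlier; the commit message is illustrative):

git pull origin master                        # sync down any remote changes first
git add .                                     # stage new and modified files
git commit -m "update CWPK notebook files"    # commit to the local repository
git push origin master                        # sync local changes back to the remote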

Handling Styles (CSS)

If you recall the discussion in CWPK #15, there is a difference in where custom styles can be set when viewing notebook pages locally versus whether they are called up directly from Python. Now, as we move to an online expression of these notebooks, we again raise the question of where online custom styles can be invoked.

Perhaps in some expressions, where style overrides can be invoked is a matter of little consequence. But this CWPK series has some specific styles in such things as pointing to warnings, pointing to online resources, etc. Having a consistent way to refer to these styles (presentation) means better efficiency.

My hope had been that with MyBinder we had some identifiable means for providing such custom.css overrides as well. Though I can see links to such in page views, and there are hints online for how to actually modify styles, I was unable to find any means for effectively doing so. My suspicion is that online interactivity such as MyBinder is still in its infancy, and the degree of control that we expect in either local or remote environments is not yet mature.

Thus, since I could find no way after many frustrating hours to provide my own specific styles, I had to make the reluctant decision to embed all such style changes in each individual notebook page. (What this means, effectively, is the specific statement of style attributes needs to be repeated each time as used [MyBinder does not support referring to a style name in a separate external file, which is the more efficient alternative].) This embedded approach is not efficient, but, like prior discussions about the use of relative addresses, sometimes being specific is the best way to ensure consistent treatment across environments.

Using MyBinder

As time has gone on, I have learned a general workflow that reflects these realities. Thus, in general, with each new CWPK installment I try to:

  1. Draft all material in the Jupyter Notebook; make sure that version embodies all desired content and style changes and updates
  2. Inspect all link references and style definitions to make sure they are absolute, and not relative
  3. Make sure all external files and images are moved and stored on the repository systems
  4. Post the updated file to its current repository, and then commit it
  5. Push the updated file to the remote (or local) repository
  6. Convert the *.ipynb to HTML (see the sketch just after this list) and post on my local blog
  7. Mix and stir again.
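For step 6, the conversion itself is a single nbconvert call; a sketch, with the file name purely illustrative:

jupyter nbconvert --to html cwpk-57-publishing-interactive-notebooks.ipynb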

Though it is not designed directly as such, it is also possible to analyze use of MyBinder and gain statistics of use.

Additional Documentation

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site. The cowpoke Python code listing covering the series is also available from GitHub.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted: October 21, 2020

Working Examples are Not so Easy to Come By

Today’s installment in our Cooking with Python and KBpedia series is a great example of how impressive uses of Python can be matched with frustrations over how we get there and whether our hoped-for desires can be met. The case study we tackle in this installment is visualization of the large-scale KBpedia graph. With nearly 60,000 nodes and about a quarter of a million edge configurations, our KBpedia graph is hardly a toy example. Though there are certainly larger graphs out there, once we pass about 10,000 nodes we enter difficult territory for Python as a performant language. We’ll cover this topic and more in this installment.

Normally, what one might encounter online regarding graph visualization with Python begins with a simple example, which then goes on to discuss how to stage the input data and then make modifications to the visualization outputs. This approach does not work so well, however, when our use case scales up to the size of KBpedia. Initial toy examples do not provide good insight into how the various Python visualization packages may operate at larger scales. So, while we can look at example visualizations and example code showing how to expose options, in the end whether we can get the package to perform requires us to install and test it. Since our review time is limited, and we have to produce working code in the end, we need a pretty efficient process of identifying, screening, and then selecting options.

Our desires for a visualization package thus begin with the ability to handle large graphs, including graph analytic components in addition to visualization, compatibility with our Jupyter Notebook interactive environment, ease-of-learning and -implementation, and hopefully attractive rendering of the final graph. From a graph visualization standpoint, some of our desires include:

  • Attractive outputs
  • Ability to handle a large graph with acceptable rendering speed
  • Color coding of nodes by SuperType
  • Varying node sizes depending on the importance (in-degree) of the node
  • Control over the graphical elements of the display (edge and node styles)
  • Perhaps some interactivity such as panning and zooming and tooltips when hovering over nodes, and
  • A choice of a variety of graph layout options to gauge which best displays the graph.

Preferably, whatever packages work best for these criteria also have robust supporting capabilities within the Python data science ecosystem. To test these criteria, of course, it is first necessary to stage our graph in an input form that can be rendered by the visualization package. This staging of the graph data is thus where we obviously begin.

Data Preparation

Given the visualization criteria above, we know that we want to produce an input file for a directed graph with an individual row for each ‘edge’ (connection between two nodes), consisting of a source node (subclass), a target node to which it points (the parent node), a SuperType for the parent node (and possibly its matching rendering color), and a size for the parent node as measured by its number of direct subclasses. This should give us a tabular graph definition file with rows corresponding to the individual edges (subclasses of each parent) and something like these columns:

  RC1 (source)     RC2 (target)     No. Subclasses (weight)     SuperType     ST color

Different visualization packages may want this information in slightly different order, but that may be readily accomplished by shifting the order of written output.

Another thing I wanted to do was to order the SuperTypes according to the order of the universal categories as shown by kko-demo.n3. This will tend to keep the color ordering more akin to the ordering of the universal categories (see further CWPK #8 for a description of these universal categories).

It is pretty straightforward to generate a listing of hex color values from an existing large-scale bokeh color palette, as we used in the last CWPK installment. First, we count the number of categories in our use case (72 for the STs). Second, we pick one of the large (256) bokeh palettes. We then generate a listing of 72 hex colors from the palette, which we can then relate to the ST categories:

from bokeh.palettes import Plasma256, linear_palette

linear_palette(Plasma256,72)

We reverse the order to go from lighter to darker, and then correlate the hex values to the SuperTypes listed in universal category order. Our resulting custom color dictionary then becomes:

cmap = {'Constituents'           : '#EFF821',
'NaturalPhenomena' : '#F2F126',
'TimeTypes' : '#F4EA26',
'Times' : '#F6E525',
'EventTypes' : '#F8DF24',
'SpaceTypes' : '#F9D924',
'Shapes' : '#FBD324',
'Places' : '#FCCC25',
'AreaRegion' : '#FCC726',
'LocationPlace' : '#FDC128',
'Forms' : '#FDBC2A',
'Predications' : '#FDB62D',
'AttributeTypes' : '#FDB030',
'IntrinsicAttributes' : '#FCAC32',
'AdjunctualAttributes' : '#FCA635',
'ContextualAttributes' : '#FBA238',
'RelationTypes' : '#FA9C3B',
'DirectRelations' : '#F8963F',
'CopulativeRelations' : '#F79241',
'ActionTypes' : '#F58D45',
'MediativeRelations' : '#F48947',
'SituationTypes' : '#F2844B',
'RepresentationTypes' : '#EF7E4E',
'Denotatives' : '#ED7B51',
'Indexes' : '#EB7654',
'Associatives' : '#E97257',
'Manifestations' : '#E66D5A',
'NaturalMatter' : '#E46A5D',
'AtomsElements' : '#E16560',
'NaturalSubstances' : '#DE6064',
'Chemistry' : '#DC5D66',
'OrganicMatter' : '#D8586A',
'OrganicChemistry' : '#D6556D',
'BiologicalProcesses' : '#D25070',
'LivingThings' : '#CF4B74',
'Prokaryotes' : '#CC4876',
'Eukaryotes' : '#C8447A',
'ProtistsFungus' : '#C5407D',
'Plants' : '#C13C80',
'Animals' : '#BD3784',
'Diseases' : '#BA3487',
'Agents' : '#B62F8B',
'Persons' : '#B22C8E',
'Organizations' : '#AE2791',
'Geopolitical' : '#A92395',
'Symbolic' : '#A51F97',
'Information' : '#A01B9B',
'AVInfo' : '#9D189D',
'AudioInfo' : '#9713A0',
'VisualInfo' : '#9310A1',
'WrittenInfo' : '#8E0CA4',
'StructuredInfo' : '#8807A5',
'Artifacts' : '#8405A6',
'FoodDrink' : '#7E03A7',
'Drugs' : '#7901A8',
'Products' : '#7300A8',
'PrimarySectorProduct' : '#6D00A8',
'SecondarySectorProduct' : '#6800A7',
'TertiarySectorService' : '#6200A6',
'Facilities' : '#5E00A5',
'Systems' : '#5701A4',
'ConceptualSystems' : '#5101A2',
'Concepts' : '#4C02A1',
'TopicsCategories' : '#45039E',
'LearningProcesses' : '#40039C',
'SocialSystems' : '#3A049A',
'Society' : '#330497',
'EconomicSystems' : '#2D0494',
'Methodeutic' : '#250591',
'InquiryMethods' : '#1F058E',
'KnowledgeDomains' : '#15068A',
'EmergentKnowledge' : '#0C0786',
}

We now have all of the input pieces to complete our graph dataset. Fortunately, we had already developed a routine in CWPK #49 for generating an output listing from our owlready2 representation of KBpedia. We begin by loading up our necessary packages for working with this information:

from cowpoke.__main__ import *
from cowpoke.config import *
from owlready2 import *

And we follow the same configuration setup approach that we have developed for prior extractions:

### KEY CONFIG SETTINGS (see build_deck in config.py) ###             
# 'kb_src'        : 'standard'                                        # Set in master_deck
# 'loop_list'     : kko_order_dict.values(),                          # Note 1   
# 'base'          : 'C:/1-PythonProjects/kbpedia/v300/build_ins/mappings/',              
# 'ext'           : '.csv',                                         
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv',

def graph_extractor(**extract_deck):
    print('Beginning graph structure extraction . . .')
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    class_loop = extract_deck.get('class_loop')
    base = extract_deck.get('base')
    ext = extract_deck.get('ext')
    
    # Note 2
    parent_set = ['kko.SocialSystems','kko.Products','kko.Methodeutic','kko.Eukaryotes',
              'kko.ConceptualSystems','kko.AVInfo','kko.Systems','kko.Places',
              'kko.OrganicChemistry','kko.MediativeRelations','kko.LivingThings',
              'kko.Information','kko.CopulativeRelations','kko.Artifacts','kko.Agents',
              'kko.TimeTypes','kko.Symbolic','kko.SpaceTypes','kko.RepresentationTypes',
              'kko.RelationTypes','kko.OrganicMatter','kko.NaturalMatter',
              'kko.AttributeTypes','kko.Predications','kko.Manifestations',
              'kko.Constituents']

    if loop != 'class_loop':
        print("Needs to be a 'class_loop'; returning program.")
        return
    header = ['target', 'source', 'weight', 'SuperType']
    out_file = extract_deck.get('out_file')
    cur_list = []
    with open(out_file, mode='w', encoding='utf8', newline='') as output:                                           
        csv_out = csv.writer(output)
        csv_out.writerow(header)    
        for value in loop_list:
            print('   . . . processing', value)
            s_set = []
            root = eval(value)
            s_set = root.descendants()
            frag = value.replace('kko.','')
            for s_item in s_set:
                child_set = list(s_item.subclasses())
                count = len(list(child_set))
                
# Note 3                
                if value not in parent_set:
                    for child_item in child_set:
                        s_rc = str(s_item)
                        child = str(child_item)
                        new_pair = s_rc + child
                        new_pair = str(new_pair)
                        cur_list.append(new_pair)
                        s_rc = s_rc.replace('rc.','')
                        child = child.replace('rc.','')
                        row_out = (s_rc,child,count,frag)
                        csv_out.writerow(row_out)
                elif value in parent_set:
                    for child_item in child_set:
                        s_rc = str(s_item)
                        child = str(child_item)
                        new_pair = s_rc + child
                        new_pair = str(new_pair)
                        if new_pair not in cur_list:
                            cur_list.append(new_pair)
                            s_rc = s_rc.replace('rc.','')
                            child = child.replace('rc.','')
                            row_out = (s_rc,child,count,frag)
                            csv_out.writerow(row_out)
                        elif new_pair in cur_list:
                            continue
        output.close()         
        print('Processing is complete . . .')
graph_extractor(**extract_deck)

This routine is pretty consistent with the prior version except for a few changes. First, the order of the STs in the input dictionary has changed (1), consistent with the order of the universal categories and with lower categories processed first. Since source-target pairs are only processed once, this ordering means duplicate assignments are always placed at their lowest point in the KBpedia hierarchy. Second, to help enforce this ordering, parental STs are separately noted (2) and then processed to skip source-target pairs that had been previously processed (3).

To see the output from this routine (without hex colors yet being assigned by ST), run:

import pandas as pd

df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv')

df

Evaluation of Large-scale Graph Visualization Options

In my prior work with large-scale graph visualizations, I have used Cytoscape and written about it (2008), as well as Gephi and written about it (2011). Though my most recent efforts have preferred Gephi, neither is written in Python and both are rather cumbersome to set up for a given visualization.

My interest here is either a pure Python option or one that has a ready Python wrapper. The Python visualization project, PyViz, provides a great listing of the options available. Since some are less capable than others, I found Timothy Lin’s benchmark comparisons of network packages to be particularly valuable, and I have limited my evaluation to the packages he lists.

The first package is NetworkX, which is written solely in Python and is the granddaddy of network analysis packages in the language. We will use it as our starting baseline.

Lin also compares SNAP, NetworKit, igraph, graph-tool, and LightGraphs. I looked in detail at all of these packages except for LightGraphs, which is written in Julia and has no Python wrapper.

Lin’s comparisons showed NetworkX to be, by far, the slowest and least performant of all of the packages tested. However, NetworkX has a rich ecosystem around it and much use and documentation. As such, it appears to be a proper baseline for the testing.

All of the remaining candidates implement their core algorithms in C or C++ for performance reasons, though Python wrappers are provided. Based on Lin’s benchmark results and visualization examples online, my initial preference was for graph-tool, followed possibly by NetworKit. SNAP had only recently been updated by Stanford, and igraph initially appeared more oriented to R than Python.

So, my plan was to first test NetworkX, and then try to implement one or more of the others if not satisfied.

First NetworkX Visualizations

With our data structure now in place for the entire KBpedia, it was time to attempt some visualizations using NetworkX. Though primarily an analysis package, NetworkX does support some graph visualizations, principally through Graphviz or matplotlib. In this instance, we use the matplotlib option, using the spring layout.

Note in the routine below, which is fairly straightforward in nature, I inserted a print statement to separate out the initial graph construction step from graph rendering. The graph construction takes mere seconds, while rendering the graph took multiple hours.

WARNING!: The cell below takes tens of minutes to hours to run. Please do not execute unless you are able to let this run in the background.
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv')
Graphtype = nx.DiGraph()
G = nx.from_pandas_edgelist(df, edge_attr='weight', create_using=Graphtype)
print('Graph construction complete.')
pos = nx.spring_layout(G,scale=1)

nx.draw(G,pos, with_labels=True)
plt.show()
Figure 1: Baseline KBpedia Visualization with Labels
NOTE: The figures in this article are static captures of the interactive electronic notebook. See note at bottom for how to access these.

Since the labels render this uninterpretable, we tried the same approach without labels.

WARNING!: The cell below takes tens of minutes to hours to run. Please do not execute unless you are able to let this run in the background.
plt.figure(figsize=(8,6))
nx.draw(G,pos, with_labels=False)
plt.show() 
Figure 2: Baseline KBpedia Visualization without Labels

This view is hardly any better.

Given the lengthy times it took to generate these visualizations, I decided to return to our candidate list and try other packages.

More Diligence on NetworkX Alternatives

If you recall, my first preferred option was graph-tool because of its reported speed and its wide variety of graph algorithms and layouts. The problem with graph-tool, as with the other alternatives, is that a C++ compiler is required, along with other dependencies. After extensive research online, I was unable to find an example of a Windows instance that was able to install graph-tool and its dependencies successfully.

I turned next to NetworKit. Though visualization choices are limited in comparison to the other C++ alternatives, this package has clearly been designed for network analysis and has a strong basis in data science. This package does offer a Windows 10 installation path, but one that suggests adding a virtual Linux subsystem layer to Windows. Again, I deemed this to be more complexity than a single visualization component warranted.

With igraph, I went through the steps of attempting an install, but clearly was also missing dependencies and using it kept killing the kernel in Jupyter Notebook. Again, perhaps with more research and time, I could have gotten this package to work, but it seemed to impose too much effort for a Windows environment for the possible reward.

Lastly, given these difficulties, and the fact that SNAP had been under less active development in recent years, I chose not to pursue this option further.

In the end, I think with some work I could have figured out how to get igraph to install, and perhaps NetworKit as well. However, for a demo aimed at Python newbies, it struck me that no reader of this series would want to spend the time jumping through such complicated hoops to get a C++ option running. Perhaps in a production environment these configuration efforts might be warranted, but for our teaching purposes I judged a C++ installation on Windows as not worth the effort. I do believe this leaves an opening for one or more developers of these packages to figure out a better installation process for Windows. But that is a matter for the developers, not for a newbie Python user such as me.

Faster Testing of NetworkX with the Upper KBpedia

So these realizations left me with the NetworkX alternative as the prime option. Given the time it took to render the full KBpedia, I decided to use the smaller upper structure of KBpedia to work out the display and rendering options before applying it to the full KBpedia.

I thus created offline a smaller graph dataset that consisted of the 72 SuperTypes and all of their direct resource concept (RC) children. You can inspect this dataset (df_kko) in a similar manner to the snippet noted above for the full KBpedia (df).
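For example, a quick way to eyeball that smaller dataset is with pandas (a minimal sketch; it assumes the same kko_graph_specs.csv file that is loaded in the listing further below):

import pandas as pd

# load the smaller KKO edge list and take a quick look at its shape and first rows
df_kko = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/kko_graph_specs.csv')
print(df_kko.shape)
df_kko.head()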

Also, to overcome some of the display limitations of the standard NetworkX renderers, I recalled that the HoloViews package used in the last installment has an associated package, hvPlot, designed specifically to work with NetworkX graph layouts and datasets. The advantage of this approach is that we gain interactivity and tooltips when hovering over nodes in the graph.

I literally spent days trying to get all of these components to work together for my desired visualizations, in which SuperType nodes (and their RCs) would be colored differently and node sizes would scale with the number of subclasses. Alas, I was unable to get these options to work. In part, I think this is because of the immaturity of the complete ecosystem. In part, it is also due to my limited Python skills and the fact that each layer in the NetworkX → bokeh → HoloViews → hvPlot chain provides its own syntax and optional functions for visualization tweaks. It is hard to know what governs what and how to get all of the parts to work together nicely. (A sketch of the kind of node sizing I was after, done in plain NetworkX, follows Figure 3 below.)

Fortunately, with the smaller input graph set, it is nearly instantaneous to make and see changes in real time. Despite the number of tests applied, the resulting code is fairly small and straightforward:

import pandas as pd
import holoviews as hv
import networkx as nx
import hvplot.networkx as hvnx
from holoviews import opts
from bokeh.models import HoverTool

hv.extension('bokeh')

# Load the data
# on MyBinder: https://github.com/Cognonto/CWPK/blob/master/sandbox/extracts/data/kko_graph_specs.csv
df_kko = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/kko_graph_specs.csv')

# Define the graph
G_kko = nx.from_pandas_edgelist(df_kko, 'source', 'target', ['Subs', 'SuperType', 'Color'], create_using=nx.DiGraph())

pos = nx.spring_layout(G_kko, k=0.4, iterations=70)

hvnx.draw(G_kko, pos, node_color='#D6556D', alpha=0.65).opts(node_size=10, width=950, height=950, 
                           edge_line_width=0.2, tools=['hover'], inspection_policy='edges')
Smaller Scale KKO (KBpedia) Graph
Figure 3: Smaller Scale KKO (KBpedia) Graph
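For reference, here is roughly the kind of node sizing I was after, expressed with plain NetworkX and matplotlib rather than the hvPlot chain (a sketch only: it reuses the G_kko and pos objects from the listing above, and node degree is used as a stand-in for subclass counts):

import matplotlib.pyplot as plt
import networkx as nx

# scale node sizes by degree as a rough proxy for the number of subclasses
degrees = dict(G_kko.degree())
sizes = [20 + 10 * degrees[n] for n in G_kko.nodes()]

plt.figure(figsize=(8, 8))
nx.draw(G_kko, pos, with_labels=False, node_size=sizes,
        node_color=list(degrees.values()), cmap=plt.cm.viridis)
plt.show()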

Final Large-scale Visualization with NetworkX

With these tests of the smaller graph complete, we are now ready to produce the final visualization of the full KBpedia graph. Though the modified code is presented below, and does run, we actually use a captured figure below the code listing to keep this page size manageable.

WARNING!: The cell below takes more than three hours to run on our standard laptop and creates a page file of 36 MB. Please do not execute unless you are able to let this run in the background.
import pandas as pd
import holoviews as hv
import networkx as nx
from holoviews import opts
import hvplot.networkx as hvnx
#from bokeh.models import HoverTool

hv.extension('bokeh')

df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv')
Graphtype = nx.DiGraph()
G = nx.from_pandas_edgelist(df, edge_attr='weight', create_using=Graphtype)
print('Graph construction complete.')

pos = nx.spring_layout(G, k=0.4, iterations=70)

hvnx.draw(G, pos, node_color='#FCCC25', alpha=0.65).opts(node_size=5, width=950, height=950, 
                           edge_line_width=0.1)

#, tools=['hover'], inspection_policy='edges'
Full-sized KBpedia Graph
Figure 4: Full-sized KBpedia Graph

Other Graphing Options

I admit I am disappointed and frustrated with the available Python options for capturing the full scale of KBpedia. The pure Python options are unacceptably slow. Options that promise better performance and a wider choice of layouts and visualizations are difficult, if not impossible, to install on Windows. Of all of the options directly tested, none allowed me (at least with my limited Python skill level) to vary node colors or node sizes by in-degree.

On the other hand, we began to learn some of the robust NetworkX package and will have occasion to investigate it further in relation to network analysis (CWPK #61). Further, as a venerable package, NetworkX offers a wide spectrum of graph data formats that it can read and write. We can export our graph specifications to a number of forms that perhaps will provide better visualization choices. As examples, here are ways to specify two of NetworkX’s formats, both of which may be used as inputs to the Gephi package. (For more about Gephi and Cytoscape as options here, see the initial links at the beginning of this installment.)

import pandas as pd
import networkx as nx

df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv')
Graphtype = nx.DiGraph()
G = nx.from_pandas_edgelist(df, edge_attr='weight', create_using=Graphtype)

nx.write_gexf(G, 'C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.gexf')

print('Gephi file complete.')

nx.write_gml(G, 'C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.gml')

print('GML file complete.')

Additional Documentation

This section lists the significant amount of material reviewed in order to make the choices for this CWPK installment.

First, it is possible to get online help for most options to be tested. For example:

hv.help(hvnx.draw)

And, here are some links related to options investigated for this installment, some tested, some not:

NetworkX

These same references above are also provided for the ‘latest’ version.

graph-tool

NetworKit

deepgraph

nxviz

Netwulf

igraph

SNAP

pygraphistry

ipycytoscape

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site. The cowpoke Python code listing covering the series is also available from GitHub.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on October 21, 2020 at 11:04 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2404/cwpk-56-graph-visualization-and-extraction/
Posted:October 19, 2020

It’s Time for Some Pretty Figures and Charts

It is time in our Cooking with Python and KBpedia series to investigate some charting options for presenting data or output. One of the remarkable things about Python is the wealth of add-on packages that one may employ, and visualization is no exception.

What we will first do in this installment is to investigate some of the leading charting options in Python sufficient for us to make an initial selection. We want nice looking output that is easily configured and fed with selected data. We also want multiple visualization types perhaps to work from the same framework, so that we need not make single choices, but multiple ones for multiple circumstances as our visualization needs unfold.

We will next tailor up some datasets for charting. We’d like to see a distribution histogram of our typologies. We’d like to see the distribution of major components in the system, notably classes, properties, and mappings. We’d like to see a distribution of our notable links (mappings) to external sources. And, we’d like to see the interaction effect of our disjointedness assignments between typologies. The first desires can be met with bar and pie charts, the last with some kind of interaction matrix. (We investigate the actual knowledge graph in the next CWPK installment.)

We also want to learn how to take the data as it comes to us to process into a form suitable for visualization. Naturally, since we are generating many of these datasets ourselves, we could alter the initial generating routines in order to more closely match the needs for visualization inputs. However, for now, we will take our existing outputs as is, since that is also a good use case for wrangling wild data.

Review of Visualization Options

For quite a period, my investigation of Python visualization options had been focused on individual packages. I liked the charting output of options like Seaborn and Bokeh, and knew that Matplotlib and Plotly had close ties with Jupyter Notebook. I had previously worked with JavaScript visualization toolkits, and liked their responsiveness and often interactivity. On independent grounds, I was quite impressed with the D3.js library, though I was still investigating the suitability of that to Python. Because CWPK is a series that focuses on Python, though, I had some initial prejudice to avoid JS-dominated options. I also had spent quite a bit of time looking at graph visualization (see next installment), and had some concerns that I was not yet finding a package that met my desired checklist.

As I researched further, it was clear there were going to be trade-offs when picking a single, say, charting and then graphing package. It was about this time I came across the PyViz ecosystem. (Overall helpful tools listing: https://pyviz.org/tools.html.) PyViz is nominally the visualization complement to the broader PyData community.

Jake VanderPlas pulled together a nice overview of the Python visualization landscape and how it evolved for a presentation to PyCon in 2017. Here is the summary diagram from his talk:

Python Visualization Landscape
Figure 1: Python Visualization Landscape

Source: Jake VanderPlas, “Python’s Visualization Landscape,” PyCon 2017, https://speakerdeck.com/jakevdp/pythons-visualization-landscape-pycon-2017

The trend in visualization for quite a few years has been the development of wrappers over more primitive drawing programs that abstract and make the definition of graphs and charts much easier. As these higher-level libraries have evolved they have also come to embrace multiple lower-level packages under their umbrellas. The trade-off in easier definitions of visualization objects is some lack of direct control over the output.

Because of the central role of Jupyter Notebooks in this CWPK series, and not having a more informed basis for making an alternative choice, I chose to begin our visualization efforts with HoloViews, which is an umbrella package over the applications shown in the figure above. Bokeh provides a nice suite of interactive plotting and figure types. NetworkX (which is used in the next installment) has good network analysis tools and links to network graph drawing routines. And Matplotlib is another central hub for various plot types, many other Python visualization projects, color palettes, and NumPy.

Getting Started

Like most Python packages, installation of Holoviews is quite straightforward. Since I also know we will be using the bokeh plot library, we include it as well when installing the system:

   conda install -c pyviz holoviews bokeh

Generating the First Chart

The first chart we want to tackle is the distribution of major components in KBpedia, which we will visualize with a pie chart. Statistics from our prior efforts (see CWPK #54) and what is generated in the Protégé interface provide our basic counts. Since the input data set is so small, we will simply enter it directly into the code. (Later examples will show how we load CSV data using pandas.)

For the pie chart we will be using, we pick the bokeh plotting package. In reviewing code samples across the Web, we pick one example and modify it for our needs. I will explain key aspects of this routine after the code listing and chart output:

import panel as pn
pn.extension()
from math import pi
import pandas as pd                                                                   # Note 1

from bokeh.palettes import Accent
from bokeh.plotting import figure
from bokeh.transform import cumsum

a = {                                                                                 # Note 2
    'Annotation': 759398,
    'Logical': 85333,
    'Declaration': 63229,
    'Other': 8274
}

data = pd.Series(a).reset_index(name='value').rename(columns={'index':'axiom'})
data['angle'] = data['value']/data['value'].sum() * 2*pi
data['color'] = Accent[len(a)]

p = figure(plot_height=350, title='Axioms in KBpedia', toolbar_location=None,         # Note 3
           tools='hover', tooltips='@axiom: @value', x_range=(-0.5, 1.0))

r = p.wedge(x=0, y=1, radius=0.4,
        start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'),    # Note 4
        line_color='white', fill_color='color', legend_field='axiom', source=data)

p.axis.axis_label=None                                                                # Note 5
p.axis.visible=False
p.grid.grid_line_color = None

bokeh_pane = pn.pane.Bokeh(p)
bokeh_pane                                                                            # Note 6
Pie Chart of KBpedia Axioms
NOTE: The figures in this article are static captures of the interactive electronic notebook. See note at bottom for how to access these.

As with our other special routines, we begin by importing the new packages that are required for the pie chart (1). One of the imports, pandas, gives us very nice ways to relate an input CSV file or entered data to pick up item labels (rows) and attributes (columns). Another notable import picks the color palette we want to use for our figure.

As noted, because our dataset is so small, we just enter it directly into the routine (2). Note how the data entry conforms to the Python dictionary format of key:value pairs. Our data section also specifies how we convert the actual numbers into the angular slices of the pie chart, and defines the labels to be used based on pandas’ capabilities. We also indicate how many discrete colors we wish to use from the Accent palette. (Palettes may be chosen as a set of discrete colors over a given spectrum or, for larger data sets, picked as increments over a continuous color spectrum. See further Additional Documentation below.)
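To illustrate those two ways of picking colors, here is a minimal sketch using bokeh’s palettes (the counts are arbitrary):

from bokeh.palettes import Accent, viridis, linear_palette, Turbo256

print(Accent[4])                           # four discrete colors from the small Accent palette
print(viridis(20)[:4])                     # 20 colors sampled from the continuous Viridis spectrum
print(len(linear_palette(Turbo256, 50)))   # 50 evenly spaced colors from the 256-color Turbo palette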

The next two parts dictate how we format the chart itself. The first part sets the inputs for the overall figure, such as size, aspect, title, background color, and so forth (3). We can also invoke some tools at this point, including the useful ‘hover’ that enables us to see actual values or related information when mousing over items in the final figure. The second part of this specification guides the actual chart type display, ‘wedge’ in this case because of our choice of a pie chart (4). To see the various attributes available to us, we can invoke the standard dir() Python function:

dir(p)

We continue to add the final specifications to our figure (5) and then invoke our function to render the chart (6).

We can take this same pattern and apply new data on the distribution of properties within KBpedia according to our three major types, which produces this second pie chart, again following the earlier approach:

prop = {
    'Object': 1316,
    'Data': 802,
    'Annotation': 2919
}

data = pd.Series(prop).reset_index(name='value').rename(columns={'index':'property'})
data['angle'] = data['value']/data['value'].sum() * 2*pi
data['color'] = Accent[len(prop)]

p = figure(plot_height=350, title="Properties in KBpedia", toolbar_location=None,
           tools="hover", tooltips="@property: @value", x_range=(-0.5, 1.0))

r = p.wedge(x=0, y=1, radius=0.4,
        start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'),
        line_color="white", fill_color='color', legend_field='property', source=data)

p.axis.axis_label=None
p.axis.visible=False
p.grid.grid_line_color = None

bokeh_pane = pn.pane.Bokeh(p)
bokeh_pane
Pie Chart of KBpedia Properties

More Complicated Datasets

The two remaining figures in this charting installment use a considerably more complicated dataset: an interaction matrix of the SuperTypes (STs) in KBpedia. There are more than 70 STs under the Generals branch in KBpedia, but a few of them are very high-level (Manifestations, Symbolic, Systems, ConceptualSystems, Concepts, Methodeutic, KnowledgeDomains), leaving a total of about 64 that have potentially meaningful interactions. If we assume that interactions are symmetric (order does not matter), that gives us a total of 2016 possible pairwise combinations among these STs (N * (N - 1) / 2).
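As a quick check on that arithmetic (a one-line sketch; math.comb requires Python 3.8 or later):

from math import comb

print(comb(64, 2))   # 2016 possible pairwise combinations among 64 SuperTypes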

From a substantive standpoint, some interactions are nearly global such as for Predications (including AttributeTypes, DirectRelations, and RepresentationTypes, specifically incorporating AdjunctualAttributes, ContextualAttributes, IntrinsicAttributes, CopulativeRelations, MediativeRelations, Associatives, Denotatives, and Indexes), and about 70 pair interactions are with direct parents. When we further remove these potential interactions, we are left with about 50 remaining STs, representing a final set of 1204 ST pairwise interactions.

Of this final set, about 50% (596) are completely disjoint, 646 have overlaps of at most 0.5%, and only 355 (about 30%) have overlaps exceeding 10%.

There are two charts we want to produce from this larger dataset. The first is a histogram of the distribution of STs as measured by number of reference concepts (RCs) each contains, and the second is a heatmap of the ST interactions that meaningfully participate in disjoint assertions.

In getting the basic input data into shape, it would have been possible to rely on many standard Python packages geared to data wrangling, but the fact is that a dataset of even this size can often be more effectively and quickly manipulated in a spreadsheet, which is how I approached these sets. The trick to large-scale sorts and manipulations of such data in a spreadsheet is to create temporary columns or rows of unique sequence numbers (calculated from a formula such as new cell ID = prior cell ID + 1), copy the formulas as values, and then include these temporary rows or columns in the global (named) block that contains all of the data. One can then do many manipulations of the data matrix and still return to the desired organization and order by re-sorting on these temporary sequence numbers.
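The same trick carries over directly to pandas if you prefer to stay in Python. A sketch, using the supertypes_counts.csv file introduced below (the sort column is just illustrative):

import pandas as pd

df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/supertypes_counts.csv')

# add a temporary sequence column so the original row order can be recovered later
df['seq'] = range(len(df))

# ... do whatever sorts and manipulations are needed, for example:
df = df.sort_values('RCs', ascending=False)

# restore the original ordering and drop the helper column
df = df.sort_values('seq').drop(columns='seq')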

Histogram Distribution of STs by RCs

Let’s first begin, then, with the routine for displaying our SuperTypes (STs) according to their count of reference concepts (RCs). We import our needed Python packages, including a variety of color palettes, and reference our source input file in CSV format. Note we are reading this input file into pandas, which we invoke in order to see the input data (ST by RC count):

import pandas as pd
from bokeh.plotting import figure, output_notebook, show, ColumnDataSource
from bokeh.models.tools import HoverTool
from bokeh.transform import factor_cmap
from bokeh.palettes import viridis, magma, Turbo256, linear_palette

output_notebook()

src = r'C:\1-PythonProjects\kbpedia\v300\extractions\data\supertypes_counts.csv'
# on MyBinder, find at: CWPK/sandbox/extracts/data/

df = pd.read_csv(src)

df

Again using pandas, we are able to relate our column data to what will be displayed in the final figure:

supertypes = df['SuperTypes']
rcs = df['RCs']

supertypes

As with our previous figures, we have to input our settings for both the overall figure and the plot type (horizontal bar, in this case):

p = figure(y_range=supertypes,
           title = 'Counts by Disjoint KBpedia SuperTypes',
           x_axis_label = 'RC Counts',
           plot_width = 800,
           plot_height = 600,
           tools = 'pan,box_select,zoom_in,zoom_out,save,reset'
           )

p.hbar(y = supertypes,
       right = rcs,
       left = 0,
       height = 0.4,
       color = 'orange',
       fill_alpha = 0.5
       )

show(p)
Bar Chart of KBpedia RCs by SuperType (single color)

This shows the ease of working directly with pandas dataframes. But there is a built-in bokeh structure called ColumnDataSource that gives us some additional flexibility:

source = ColumnDataSource(df)

st_list = source.data['SuperTypes'].tolist()

p2 = figure(y_range = st_list,                              # Note the change of source here
            title = 'Counts by Disjoint KBpedia SuperTypes',
            x_axis_label = 'RC Counts',
            plot_width = 800,
            plot_height = 600,
            tools = 'pan,box_select,zoom_in,zoom_out,save,reset'
           )

p2.hbar(y = 'SuperTypes',                                   
        right = 'RCs',                                      
        left = 0,
        height = 0.4,
        color = 'orange',
        fill_alpha = 0.5,
        source=source                                      # Note the additional source
       )

hover = HoverTool()

hover.tooltips = """
    <div>
        <div><strong>SuperType: </strong>@SuperTypes</div>
        <div><strong>RCs: </strong>@RCs</div>         
    </div>
"""
p2.add_tools(hover)

show(p2)

Next, we want to add a palette. After trying the variations first loaded, we choose Turbo256 and tell the system the number of discrete colors desired:

mypalette = linear_palette(Turbo256,50)

p2.hbar(y = 'SuperTypes',
        right = 'RCs',
        left = 0,
        height = 0.4,
        fill_color = factor_cmap(
               'SuperTypes',
               palette = mypalette,
               factors=st_list
               ),
        fill_alpha=0.9,
        source=source
)

hover = HoverTool()

hover.tooltips = """
    <div>
        <div><strong>SuperType: </strong>@SuperTypes</div>
        <div><strong>RCs: </strong>@RCs</div>         
    </div>
"""
p2.add_tools(hover)

show(p2)
Bar Chart of KBpedia RCs by SuperType (multi-color)

This now achieves the look we desire, with the bars sorted in order and a nice spectrum of colors across them. We also have hover tips that provide the actual data for each bar. The latter is made possible by the ColumnDataSource, which lets the hover tool reference columns by name (@SuperTypes, @RCs) rather than bare x, y values.

Since we continue to gain a bit more tailoring and experience with each chart, we decide it is time to tackle the heatmap.

Heatmap Display

A heatmap is an interaction matrix. In our case, what we want to display are the SuperTypes that have some degree of disjointedness plotted against one another, with each cell showing the overlap in RCs between the SuperType in x and the SuperType in y. Since, as the previous horizontal bar chart shows, we have a wide range of RC counts by SuperType, we decide to normalize these interactions by expressing the overlap as a percentage.
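A sketch of how such a normalized overlap might be computed from the RC sets of two SuperTypes (the sets here are hypothetical, and normalizing by the first SuperType’s count is an assumption based on the ‘Overlap/ST 1’ column used in the listing below):

# hypothetical RC sets for two SuperTypes
rcs_st1 = {'rc_a', 'rc_b', 'rc_c', 'rc_d'}
rcs_st2 = {'rc_c', 'rc_d', 'rc_e'}

overlap = len(rcs_st1 & rcs_st2)
overlap_pct = 100 * overlap / len(rcs_st1)   # percentage of ST 1's RCs shared with ST 2
print(overlap, round(overlap_pct, 1))        # 2 50.0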

We again set up our imports and figure as before. If you want to see the actual data input file and format, invoke df_h as we did before:

import holoviews as hv
from holoviews import opts
hv.extension('bokeh', 'matplotlib')
import pandas as pd
import matplotlib

src = r'C:\1-PythonProjects\kbpedia\v300\extractions\data\st_heatmap.csv'
# on MyBinder, find at: CWPK/sandbox/extracts/data/

df_h = pd.read_csv(src)

heatmap = hv.HeatMap(df_h, kdims=['ST 1(x)', 'ST 2(y)'], vdims=['Rank', 'Overlap', 'Overlap/ST 1', 
                    'ST 1 RCs', 'ST 2 RCs'])

color_list = ['#555555', '#CFCFCF', '#C53D4D', '#D14643', '#DC5039', '#E55B30',
           '#EB6527', '#F0701E', '#F47A16', '#F8870D', '#FA9306', '#FB9E07',
           '#FBAC10', '#FBB91E', '#F9C52C', '#F6D33F', '#F3E056', '#F1EB6C',
           '#F1EE74', '#F2F381', '#F3F689', '#F5F891', '#F6F99F', '#F7FAAC',
           '#F9FBB9', '#FAFCC6', '#FCFDD3', '#FEFFE5']

# for color_list, see https://stackoverflow.com/questions/21094288/convert-list-of-rgb-codes-to-matplotlib-colormap

my_cmap = matplotlib.colors.ListedColormap(color_list, name='interact')

heatmap.opts(opts.HeatMap(tools=['hover'], cmap=my_cmap, colorbar=True, width=960, 
                          xrotation=90, height=960, toolbar='above', clim=(0, 26)))

heatmap
Overlap Heatmap of Shared RCs Between SuperTypes

None of the available palettes had a color spectrum we liked, plus we needed to introduce a dark gray color for cells where an ST is mapped to itself (and therefore needs to be excluded). Another exclusion (light gray) removes ST interactions with anything in the ST’s parental lineage.

As for useful interactions, we wanted a close-to-smooth distribution of overlap intensities across the entire spectrum, from 0% overlap (no color, white) to more than 95% (dark red). We achieve this distribution not by working directly from the percentage overlap figures, but by mapping these percentage overlaps to a more-or-less smooth ranking from roughly 0 to 30. It is this rank value that determines the color of each interaction cell.
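One simple way to produce such a ranking is to bin the percentage overlaps, for instance with pandas (a sketch; the bin edges and scaling here are hypothetical, not the ones actually used for the figure):

import pandas as pd

# hypothetical percentage overlaps and bin edges mapping them onto a roughly 0-30 rank scale
pct = pd.Series([0.0, 0.3, 2.5, 12.0, 48.0, 96.0])
bins = [-0.01, 0, 0.5, 1, 2, 5, 10, 20, 40, 70, 100]
rank = pd.cut(pct, bins=bins, labels=False) * 3
print(rank.tolist())   # [0, 3, 12, 18, 24, 27]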

There are clearly many specifics that may be set and tweaked for your own figures. The call below is one example of how to get an explanation of these settings.

hv.help(hv.HeatMap)

Additional Documentation

Colors and Palettes

Charting

What to chart?

Heatmaps

Other

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site. The cowpoke Python code listing covering the series is also available from GitHub.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on October 19, 2020 at 11:44 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2402/cwpk-55-charting/
Posted:October 15, 2020

This installment in our Cooking with Python and KBpedia series covers two useful (essential?) utilities for any substantial project: stats and logging. stats refers to internal program or knowledge graph metrics, not a generalized statistical analysis package. logging is a longstanding Python module that provides persistence and superior control over using simple print statements for program tracing and debugging.

On the stats side, we will emphasize capturing metrics not already available when using Protégé, which provides its own set of useful baseline statistics. (See Figure 1.) These metrics are mostly simple counts, with some sums and averages. The results of these metrics are some of the numerical data points that we will use in the next installment on charting.

On the logging front, we will edit all of our existing routines to log to file, as well as print to screen. We can embed these routines in existing functions so that we may better track our efforts.

An Internal Stats Module

In our earlier extract-and-build routines we have already put in place the basic file and set processing steps necessary to capture additional metrics. We will add to these here, in the process creating an internal stats module in our cowpoke package.

First, there is no need to duplicate the information that already comes to us when using Protégé. Here are the standard stats provided on the main start-up screen:

Protégé Internal Stats
Figure 1: Protégé Internal Stats

We are loading here (1) our in-progress KBpedia v300. We can see that Protégé gives us counts (2) of classes (58200), object properties (1316), data properties (802), annotation properties (2919), and a few other metrics.

We will take these values as givens, and will enter them as part of the initialization for our own internal procedures (for checking totals and calculating percentages).
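For instance, a minimal sketch of such an initialization, with the counts taken from the Protégé screen in Figure 1:

# baseline counts as reported by Protégé (Figure 1)
protege_counts = {
    'classes': 58200,
    'object_properties': 1316,
    'data_properties': 802,
    'annotation_properties': 2919,
}

total_props = (protege_counts['object_properties'] + protege_counts['data_properties']
               + protege_counts['annotation_properties'])
print('annotation share of all properties: {:.1%}'.format(
      protege_counts['annotation_properties'] / total_props))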

Pyflakes is a simple Python code checker that you may want to consider. If you want to add in stylistic checks, you want flake8, which combines Pyflakes with style checks against PEP 8 or pycodestyle. Pylint is another static code style checker.

from cowpoke.__main__ import *
from cowpoke.config import *
### KEY CONFIG SETTINGS (see build_deck in config.py) ###                
# 'kb_src'        : 'standard'
# count           : 14                                                    # Note 1
# out_file        : 'C:/1-PythonProjects/kbpedia/v300/targets/stats/kko_typol_stats.csv'

from itertools import combinations                                       # Note 2

def typol_stats(**build_deck):
    kko_list = typol_dict.values()
    count = build_deck.get('count')
    out_file = build_deck.get('out_file')
    with open(out_file, 'w', encoding='utf8') as output:
        print('count,size_1,kko_1,size_2,kko_2,intersect RCs', file=output)
        for i in combinations(kko_list,2):                              
            kko_1 = i[0]                                              
            kko_2 = i[1]                                              
            kko_1_frag = kko_1.replace('kko.', '')
            kko_1 = getattr(kko, kko_1_frag)
            print(kko_1_frag)
            kko_2_frag = kko_2.replace('kko.', '')
            kko_2 = getattr(kko, kko_2_frag)     
            descent_1 = kko_1.descendants(include_self = False)       
            descent_1 = set(descent_1)
            size_1 = len(descent_1)
            descent_2 = kko_2.descendants(include_self = False)
            descent_2 = set(descent_2)
            size_2 = len(descent_2)
            intersect = descent_1.intersection(descent_2)              
            num = len(intersect)
            if num <= count:                                           
                print(num, size_1, kko_1, size_2, kko_2, intersect, sep=',', file=output)
            else: 
                print(num, size_1, kko_1, size_2, kko_2, sep=',', file=output)
    print('KKO typology intersection analysis is done.')
typol_stats(**build_deck)

The procedure above takes a few minutes to run. You can inspect what the routine produces at C:/1-PythonProjects/kbpedia/v300/targets/stats/kko_typol_stats.csv.

We can also get summary statistics from the knowledge graph using the rdflib package. Here is a modification of one of the library’s routines to obtain some VoID statistics:

import collections

from rdflib import URIRef, Graph, Literal
from rdflib.namespace import VOID, RDF

graph = world.as_rdflib_graph()
g = graph

def generate2VoID(g, dataset=None, res=None, distinctForPartitions=True):
    """
    Returns a VoID description of the passed dataset

    For more info on Vocabulary of Interlinked Datasets (VoID), see:
    http://vocab.deri.ie/void

    This only makes two passes through the triples (once to detect the types
    of things)

    The tradeoff is that lots of temporary structures are built up in memory
    meaning lots of memory may be consumed :)
    
    distinctSubjects/objects are tracked for each class/propertyPartition
    this requires more memory again

    """

    typeMap = collections.defaultdict(set)
    classes = collections.defaultdict(set)
    for e, c in g.subject_objects(RDF.type):
        classes[c].add(e)
        typeMap[e].add(c)

    triples = 0
    subjects = set()
    objects = set()
    properties = set()
    classCount = collections.defaultdict(int)
    propCount = collections.defaultdict(int)

    classProps = collections.defaultdict(set)
    classObjects = collections.defaultdict(set)
    propSubjects = collections.defaultdict(set)
    propObjects = collections.defaultdict(set)
    num_classObjects = 0
    num_propSubjects = 0
    num_propObjects = 0
    
    for s, p, o in g:

        triples += 1
        subjects.add(s)
        properties.add(p)
        objects.add(o)

        # class partitions
        if s in typeMap:
            for c in typeMap[s]:
                classCount[c] += 1
                if distinctForPartitions:
                    classObjects[c].add(o)
                    classProps[c].add(p)

        # property partitions
        propCount[p] += 1
        if distinctForPartitions:
            propObjects[p].add(o)
            propSubjects[p].add(s)

    if not dataset:
        dataset = URIRef('http://kbpedia.org/kko/rc/')

    if not res:
        res = Graph()

    res.add((dataset, RDF.type, VOID.Dataset))

    # basic stats
    res.add((dataset, VOID.triples, Literal(triples)))
    res.add((dataset, VOID.classes, Literal(len(classes))))

    res.add((dataset, VOID.distinctObjects, Literal(len(objects))))
    res.add((dataset, VOID.distinctSubjects, Literal(len(subjects))))
    res.add((dataset, VOID.properties, Literal(len(properties))))

    for i, c in enumerate(classes):
        part = URIRef(dataset + "_class%d" % i)
        res.add((dataset, VOID.classPartition, part))
        res.add((part, RDF.type, VOID.Dataset))

        res.add((part, VOID.triples, Literal(classCount[c])))
        res.add((part, VOID.classes, Literal(1)))

        res.add((part, VOID["class"], c))

        res.add((part, VOID.entities, Literal(len(classes[c]))))
        res.add((part, VOID.distinctSubjects, Literal(len(classes[c]))))

        if distinctForPartitions:
            res.add(
                (part, VOID.properties, Literal(len(classProps[c]))))
            res.add((part, VOID.distinctObjects,
                     Literal(len(classObjects[c]))))
            num_classObjects = num_classObjects + len(classObjects[c])           
            

    for i, p in enumerate(properties):
        part = URIRef(dataset + "_property%d" % i)
        res.add((dataset, VOID.propertyPartition, part))
        res.add((part, RDF.type, VOID.Dataset))

        res.add((part, VOID.triples, Literal(propCount[p])))
        res.add((part, VOID.properties, Literal(1)))

        res.add((part, VOID.property, p))

        if distinctForPartitions:

            entities = 0
            propClasses = set()
            for s in propSubjects[p]:
                if s in typeMap:
                    entities += 1
                for c in typeMap[s]:
                    propClasses.add(c)

            res.add((part, VOID.entities, Literal(entities)))
            res.add((part, VOID.classes, Literal(len(propClasses))))

            res.add((part, VOID.distinctSubjects,
                     Literal(len(propSubjects[p]))))
            res.add((part, VOID.distinctObjects,
                     Literal(len(propObjects[p]))))
            num_propSubjects = num_propSubjects + len(propSubjects[p])
            num_propObjects = num_propObjects + len(propObjects[p]) 
    print('triples:', triples)
    print('subjects:', len(subjects))
    print('objects:', len(objects))
    print('classObjects:', num_classObjects)
    print('propObjects:', num_propObjects)      
    print('propSubjects:', num_propSubjects)
     

    return res, dataset
generate2VoID(g, dataset=None, res=None, distinctForPartitions=True)
triples: 1662129
subjects: 213395
objects: 698372
classObjects: 850446
propObjects: 858445
propSubjects: 1268005
(<Graph identifier=Na47c69e2f7b84d9b911c46e2cdf0fe11 (<class 'rdflib.graph.Graph'>)>,
rdflib.term.URIRef('http://kbpedia.org/kko/rc/'))

These metrics can go into the pot with the summary statistics we also gain from Protégé. We’ll see some graphic reports on these numbers in the next installment.

Logging

I think an honest appraisal may straddle the fence about whether logging makes sense for the cowpoke package. On the one hand, we have begun to assemble a fair degree of code within the package, which perhaps would normally suggest the advisability of logging. On the other hand, we run the various scripts only sporadically, and in pieces when we do. There is no continuous production function in what we have done so far.

If we were to introduce this code into a production setting or get multiple developers involved, I would definitely argue for logging. Consider the current cowpoke code base as roughly the transition point at which this question becomes worth asking. However, since logging is good practice, and we are close to that point, let’s go ahead and invoke the capability nonetheless.

One chooses logging over the initial print statements because we gain these benefits:

  1. The ability to time stamp our logging messages
  2. The ability to keep our logging messages persistent
  3. The ability to generate messages constantly in the background for later inspection, and
  4. The ability to better organize our logging messages.

The logging module that comes with Python is quite mature and has further advantages:

  1. We can control the warning level of the messages and what warning levels trigger logging
  2. We can format the messages as we wish, and
  3. We can send our messages to file, screen, or socket.

By default, the Python logging module has five pre-set warning levels:

  • debug – detailed information, typically of interest only when diagnosing problems
  • info – confirmation that things are working as expected
  • warning – an indication that something unexpected happened, or indicative of some problem in the near future (e.g. ‘disk space low’). The software is still working as expected
  • error – due to a more serious problem, the software has not been able to perform some function, or
  • critical – a serious error, indicating that the program itself may be unable to continue running.

We’ll see in the following steps how we can configure the logger and set it up for working with our existing functions.

Configuration

Logging is organized as a tree, with the root being the system level. For a single package, it is best to set up a separate main logging branch under the root so that warnings and loggings can be treated consistently throughout the package. This design, for example, allows warning messages and logging levels to be set with a single call across the entire package (sub-branches may have their own conditions). This is what is called adding a ‘custom’ logger to your system.
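A minimal sketch of that hierarchy (the 'cowpoke.build' name is purely illustrative):

import logging

# a single branch logger for the whole package
pkg_logger = logging.getLogger('cowpoke')
pkg_logger.setLevel(logging.INFO)

# a logger in a submodule becomes a child of that branch and inherits its settings
sub_logger = logging.getLogger('cowpoke.build')
print(sub_logger.getEffectiveLevel() == logging.INFO)   # True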

Configurations may be set in Python code (the method we will use, because it is the simplest) or via a separate .ini file. Configuration settings include most of the specified items below.

Handlers

You can set up logging messages to go to console (screen) or file. In our examples below, we will do both.

Formatters

You can set up how your messages are formatted. We can also format console vs. file messages differently, as our examples below show.

Default Messages

Whenever we insert a logging message, besides setting the severity level, we may also assign a message unique to that part of the code. However, if we choose not to assign a new, specific message, the message invoked will be the default one defined in our configuration.

Example Code

Since our setup is straightforward, we will put our configuration settings into our existing config.py file and write our logging messages to the log subdirectory. Here is how our setup looks (with some in-line commentary):

import logging

# Create a custom logger
logger = logging.getLogger(__name__)                # Will invoke name of current module

# Create handlers
log_file = 'C:/1-PythonProjects/kbpedia/v300/targets/logs/kbpedia_logging.log'
#logging.basicConfig(filename=log_file,level=logging.DEBUG)
c_handler = logging.StreamHandler()                 # Separate c_ and f_ handlers
f_handler = logging.FileHandler(log_file)
c_handler.setLevel(logging.WARNING)
f_handler.setLevel(logging.DEBUG)

# Create formatters and add it to handlers          # File logs include time stamp, console does not
c_format = logging.Formatter('%(name)s - %(levelname)s - %(message)s')
f_format = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
c_handler.setFormatter(c_format)
f_handler.setFormatter(f_format)

# Add handlers to the logger
logger.addHandler(c_handler)
logger.addHandler(f_handler)

# example calls: these module-level functions go through the root logger; calling
# logger.debug(), logger.warning(), etc. would instead route messages through the
# two handlers configured above
logging.debug('This is a debug message.')
logging.info('This is an informational message.')
logging.warning('Warning! Something does not look right.')
logging.error('You have encountered an error.')
logging.critical('You have experienced a critical problem.')
WARNING:root:Warning! Something does not look right.
ERROR:root:You have encountered an error.
CRITICAL:root:You have experienced a critical problem.

Make sure you have this statement at the top of all of your cowpoke files:

  import logging

Then, as you write or update your routines, use logger.&lt;severity&gt;() statements (for example, logger.warning()) where you previously were using print. This will send messages to both console and file, subject to the severity thresholds set. It is that easy!
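In practice, the swap looks roughly like the sketch below, which trims the earlier typol_stats routine down to an illustration. (Note that the bare logging.debug()-style calls in the configuration listing above go through the root logger, which is why the captured output is prefixed with 'root'; calling the methods on logger instead routes messages through the handlers we configured.)

import logging

logger = logging.getLogger(__name__)

def typol_stats(**build_deck):
    logger.info('Starting typology intersection analysis.')      # was: print(...)
    out_file = build_deck.get('out_file')
    if out_file is None:
        logger.error('No out_file specified in build_deck.')
        return
    # ... existing processing loop goes here ...
    logger.info('KKO typology intersection analysis is done.')   # was: print(...)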

Additional Documentation

Here is some supporting documentation for today’s installment:

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site. The cowpoke Python code listing covering the series is also available from GitHub.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on October 15, 2020 at 10:52 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2401/cwpk-54-statistics-and-logging/