Posted:November 9, 2020

CWPK #63: Staging Data Sci Resources and Preprocessing

Clean Corpora and Datasets are a Major Part of the Effort

With our discussions of network analysis and knowledge extractions from our knowledge graph now behind us, we are ready to tackle the questions of analytic applications and machine learning in earnest for our Cooking with Python and KBpedia series. We will be devoting our next nine installments to this area. We devote two installments to data sources and input preparations, largely based on NLP (natural language processing) applications. Then we devote two installments to ‘standard’ machine learning (largely) using the scikit-learn packages. We next devote four installments to deep learning, split equally between the Deep Learning Graph (DGL) and PyTorch Geometric (PyG) frameworks. We conclude this Part VI with a summary and comparison of results across these installments based on the task of node classification.

In this particular installment we flesh out the plan for completing these installments and discuss data sources and completing data prep needed for the plan. We provide particular attention to the architecture and data flows within the PyTorch framework. We describe the additional Python packages we need for this work, and install and configure the first ones. We discuss general sources of data and corpora useful for machine learning purposes. Our coding efforts in this installment will obtain and clean the Wikipedia pages that supplement the two structural and annotation sources based on KBpedia that were covered in the prior installment. These three sources of structure, annotations and pages are the input basis to creating our own embeddings to be used in many of the machine learning tests.

Plan for Completion of Part VI

The broad ecosystem of Python packages I was considering looked, generally, to be good choices to work together, as first outlined in CWPK #61. I had done an adequate initial diligence. But, how all of this was to unfold, what my plan of attack should be, became driving factors I had to solve to shorten my development and coding efforts. So, with an understanding of how we could extract general information from KBpedia useful to analysis and machine learning, I needed to project out over the entire anticipated scope to see if, indeed, these initial sources looked to be the right ones for our purposes. And, if so, how shall the efforts be sequenced and what is the flow of data?

Much reading and research went into this effort. It is true, for example, that we had already prepared a pretty robust series of analytic and machine learning case studies in Clojure, available from the KBpedia Web site. I revisited each of these use cases and got some ideas of what made sense for us to attempt with Python. But I needed to understand the capabilities now available to us with Python, so I also studied each of the candidate keystone packages in some detail.

I will weave the results of this research as the next installments unfold, providing background discussion in context and as appropriate. But, in total, I formulated about 30 tasks going forward that appeared necessary to cover the defined scope. The listing below summarizes these steps, and keys the transition point (as indicated by CWPK installment number) for proceeding to each next new installment:

Formulate Part VI plan
Extract two source files from KBpedia
- structure
- annotations
Set environment up (not doing virtual)
Obtain Wikipedia articles for matching RCs
Set up gensim
Clean Wikipedia articles, all KB annotations
Set up spaCy
ID, extract phrases
Finish embeddings prep #64
- remove stoplist
- create numeric??
Create embedding models:
- word2vec and doc2vec
Text summarization for short articles (gensim)
Named entity recognition
Set up scikit-learn #65
Create master pandas file
Do event/action extraction
Do scikit-learn classifier #66
- SVM
- k-nearest neighbors
- random forests
Introduce the sklearn.metrics module and confusion matrix, etc. The standard for reporting
Discuss basic test parameters/’gold standars’
Knowledge graph embeddings #67
Create embedding models -2
- KB-struct
- KB-annot
- KB-annot-full: what is above + below
- KB-annot-page
Set up PyTorch/DLG-KE #68
Set up PyTorch/PyG
Formulate learning pathway/code
Do standard DL classifiers: #69
- TransE
- TransR
- RESCAL
- DistMult
- ComplEx
- RotatE
Do research DL classifiers: #70
- VAE
- GGSNN
- MPNN
- ChebyNet
- GCN
- SAGE
- GAT
Choose a model evaluator: #71
- scikit-learn
- pyTorch
- other?
Collate prior results
Evaluate prior results
Present comparative results

Some of these steps also needed some preliminary research before proceeding. For example, knowing I wanted to compare results across algorithms meant I needed to have a good understanding of testing and analysis requirements before starting any of the tests.

PyTorch Architecture

A critical question in contemplating this plan was how exactly data needed to be produced, staged, and then fed into the analysis portions. From the earlier investigations I had identified the three categories of knowledge grounded in KBpedia that could act as bases or features to machine learning; namely, structure, annotations and pages. I also had identified PyTorch as a shared abstraction layer for deep and machine learning.

I was particularly focused on the question of data formats and representations such that information could be readily passed from one step to the next in the analysis pipeline. Figure 1 is the resulting data flow chart and architecture that arose from these investigations.

First, the green block labeled ‘owlready2’ represents that Python package, but also the location where the intact knowledge graph of KBpedia is stored and accessed. As early installments covered, we can use either owlready2 or Protégé to manage this knowledge graph, though owlready2 is the point at which the KBpedia information is exported or extracted for downstream uses, importantly machine learning. As our owlready2 discussions also indicated, there is a close relationship between it and RDFLib (which is also the SPARQL access point). RDFLib can provide direct input into NetworkX, but that is limited to structure only.

The clearest common denominator format for entry into the machine learning pipeline is pandas via CSV files. This centrality is fortunate given that all of our prior KBpedia extract-and-build routines have been designed around this format. This format is also one of the direct feeds possible into the PyTorch datasets format, as the figure shows:

Data Flows in Machine Learning and KG Analysis

Figure 1: Data Flows in Machine Learning and Knowledge Graph Analysis

An important block on the figure is for ’embeddings’. If you recall, all text needs to first be encoded to a numeric form to be understood by the computer. This process can also undertake dimensionality reduction, important for a sparse matrix data form like language. This same ability can be applied to graph structure and interactions. Thus, the ’embedding’ block is a pivotal point at which we can represent words, sentences, paragraphs, documents, nodes, or entire graphs. We will focus much on embeddings throughout this Part VI.

For training purposes we can also feed pre-trained corpora or embeddings into the system. We address this topic in the next main section.

Figure 1 is not meant to be a comprehensive view of PyTorch, but it is one useful to understand data flows with respect to our use of the KBpedia knowledge graph. Over the course of this research, I also encountered many PyTorch-related extensions that, when warranted, I include in the discussion.

Possible Extensions

There are some extensions to the PyTorch ecosystem that we will not be using or testing in this CWPK series. Here are some of the ones that seem closest in capabilities to what we are doing with KBpedia:

PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks such as scikit-learn, XGBoost, Microsoft LightGBM, spaCy, and many more
PiePline is a neural networks training pipeline based on PyTorch. Designed to standardize training process and accelerate experiments
Catalyst helps to write full-featured deep learning pipelines in a few lines of code
Poutyne is a Keras-like framework for PyTorch and handles much of the boilerplating code needed to train neural networks
torchtext has some capabilities in language modeling, sentiment analysis, text classification, question classification, entailment, machine translation, sequence tagging, question answering, and unsupervised learning
Spotlight uses PyTorch to build both deep and shallow recommender models.

Corpora and Datasets

There are many off-the-shelf resources that can be of use when doing machine learning involving text and language. (There are as well for images, but that is out of scope to our current interests.) These resources fall into three main areas:

corpora – are language resources of either a general or domain nature, with vetted relationships or annotations between terms and concepts or other pre-processing useful to computational linguistics
pre-trained models – are pre-calculated language models, often expressing probability distributions over words or text. Some embeddings can act in this manner. Transformers use deep learning to train their representations, with BERT being a notable example
embeddings – are vector representations of chunks of text, ranging from individual words up to entire documents or languages. The numeric representation either represents a pooled statistical representation across all tokens (the so-called CBOW approach) or context and adjacency using the skip-gram or similar method. GloVe, word2vec and fastText are example methodologies for producing word embeddings.

Example corpora include Wikipedia (in multiple languages), news articles, Web crawls, and many others. Such corpora can be used as the language input basis for training various models, or may be a reference vocabulary for scoring and ranking input text. Various pre-trained language models are available, and embedding methods are available in a number of Python packages, including scikit-learn, gensim and spaCy used in cowpoke.

Pre-trained Resources

There are a number of free or open-source resources for these corpora or datasets. Some include:

Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, etc in 100+ languages
HuggingFace datasets
English Wikipedia dump
Wikipedia2Vec pre-trained embeddings
gensim datasets contain links to 8 options
word2vec pre-trained models lists 16 or so datasets
A comprehensive list of available gensim datasets and models
11 pre-trained word embedding models in various embedding formats
Some older GloVe embeddings
Word vectors for 157 languages
DBpedia entity typing + word embeddings.

Setting Up the Environment

In doing this research, I also assembled the list of Python packages needed to add these capabilities to cowpoke. Had I not just updated the conda packages, I would do so now:

conda update --all

Next, the general recommendation when installing multiple new packages in Python is to do them in one batch, which allows the package manager (conda in our circumstance) to check on version conflicts and compatibility during the install process. However, with some of the packages involved in the current expansion, there are other settings necessary that obviates this standard ‘batch’ install recommendation.

Another note is important here. In an enterprise environment with many Python projects, it is also best to install these machine learning extensions into their own virtual environment. (I covered this topic a bit in CWPK #58.) However, since we are keeping this entire series in its own environment, we will skip that step here. You may prefer the virtual option.

So, we will begin with those Python packages and frameworks that pose their own unique set-up and install challenges. We begin with PyTorch. We need to first appreciate that the rationale for PyTorch was to abstract machine learning constructs while taking advantage of graphics processing units (GPUs) (specifically, Nvidia via the CUDA interface). The CUDA architecture provides one or two orders of magnitude speed up on a local machine. Unfortunately, my local Windows machine does not have the separate Nvidia GPU, so I want to install the no CUDA option. For the PyTorch install options, visit https://pytorch.org/get-started/locally/. This figure shows my selections prior to download (yours may vary):

Figure 2: PyTorch Download Screen

In my circumstance, my local machine does not have a separate graphics processor, so I set the CUDA requirement to ‘None’ (1). I also removed the ‘torchvision’ command line specification (2) since that is an image-related package. (We may later need some libraries from this package, in which case we will then install it.) The PyTorch package is rather large, so install takes a few minutes. Here is the actual install command:

conda install pytorch cpuonly -c pytorch

Since we were not able to batch all new packages, I decide to continue with some of the other major additions in a sequential matter, with spaCy and its installation next:

conda install -c conda-forge spacy

and then gensim and its installation:

conda install -c conda-forge gensim

and then DLG, which has an installation screen similar to PyTorch in Figure 2 with the same picked options:

conda install -c dglteam dgl

The DLG-KE extension needs to be built from source for Windows, so we will hold off on that now until we need it. We next install PyTorch Geometric, which needs to be installed from a series of binaries, with CPU or GPU individually specified:

pip install torch-scatter==latest+cpu -f https://pytorch-geometric.com/whl/torch-1.6.0.html
pip install torch-sparse==latest+cpu -f https://pytorch-geometric.com/whl/torch-1.6.0.html
pip install torch-cluster==latest+cpu -f https://pytorch-geometric.com/whl/torch-1.6.0.html
pip install torch-spline-conv==latest+cpu -f https://pytorch-geometric.com/whl/torch-1.6.0.html
pip install torch-geometric

These new packages join these that are already a part of my local conda packages, and which will arise in the coming installments:

scikit-learn and tqdm.

Getting Wikipedia Pages

With these preliminaries complete, we are now ready to resume our data preparation tasks for our embedding and machine learning experiments. In the prior installment, we discussed two of the three source files we had identified for these efforts, the KBpedia structure (kbpedia/v300/extractions/data/graph_specs.csv) and the KBpedia annotations (kbpedia/v300/extractions/classes/Generals_annot_out.csv) files. In this specific section we obtain the third source file of pages from Wikipedia.

Of the 58,000 reference concepts presently contained in KBpedia, about 45,000 have a directly corresponding Wikipedia article or listing of category articles. These provide a potentially rich source of content for language models and embeddings. The challenge is how to obtain this content in a way that can be readily processed for our purposes.

We have been working with Wikipedia since its inception, so we knew that there are data sources for downloads or dumps. For example, the periodic language dumps such as https://dumps.wikimedia.org/enwiki/20200920/ may be accessed to obtain full-text versions of articles. Such dumps have been used scores of times to produce Wikipedia corpora in many different languages and for many different purposes. But, our own mappings are a mere subset, about 1% of the nearly 6 million articles in the English Wikipedia alone. So, even if we grabbed the current dump or one of the corpora so derived, we would need to process much content to obtain the subset of interest.

Unfortunately, Wikipedia does not have a direct query or SPARQL form as exists for Wikidata (which also does not have full-text articles). We could obtain the so-called ‘long abstracts’ of Wikipedia pages from DBpedia (see, for example, https://wiki.dbpedia.org/downloads-2016-10), but this source is dated and each abstract is limited to about 220 words; further, a full download of the specific file in English is about 15 GB!

The basic approach, then, appeared that I would need to download the full Wikipedia article file, figure out how to split it into parts, and then match identifiers between KBpedia mappings and the full dataset to obtain the articles of interest. This approach is not technically difficult, but it is a real pain in the ass.

So, shortly before I committed to this work effort, I challenged myself to find another way that was perhaps less onerous. Fortunately, I found the online Wikipedia service, https://en.wikipedia.org/wiki/Special:Export, that allows one to submit article names to a text box and then get the full page article back in XML format. I tested this online service with a few articles, then 100, and then ramped up to a listing of 5 K at a time. (Similar services often have governors that limit the frequency or amounts of individual requests.) This approach worked!, and within 30 min I had full articles in nine separate batches for all 45 K items in KBpedia.

Clean All Input Text

This file is a single article from the Wikipedia English dump for 1-(2-Nitrophenoxy)octane:

<page>
    <title>1-(2-Nitrophenoxy)octane</title>
    <ns>0</ns>
    <id>11793192</id>
    <revision>
      <id>891140188</id>
      <parentid>802024542</parentid>
      <timestamp>2019-04-05T23:04:47Z</timestamp>
      <contributor>
        <username>Koavf</username>
        <id>205121</id>
      </contributor>
      <minor/>
      <comment>/* top */Replace HTML with MediaWiki markup or templates, replaced: &lt;sub&gt; → {{sub| (3), &lt;/sub&gt; → }} (3)</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="2029" xml:space="preserve">{{chembox
| Watchedfields = changed
| verifiedrevid = 477206849
| ImageFile =Nitrophenoxyoctane.png
| ImageSize =240px
| ImageFile1 = 1-(2-Nitrophenoxy)octane-3D-spacefill.png
| ImageSize1 = 220
| ImageAlt1 = NPOE molecule
| PIN = 1-Nitro-2-(octyloxy)benzene
| OtherNames = 1-(2-Nitrophenoxy)octane&lt;br /&gt;2-Nitrophenyl octyl ether&lt;br /&gt;1-Nitro-2-octoxy-benzene&lt;br /&gt;2-(Octyloxy)nitrobenzene&lt;br /&gt;Octyl o-nitrophenyl ether
|Section1={{Chembox Identifiers
| Abbreviations =NPOE
| ChemSpiderID_Ref = {{chemspidercite|correct|chemspider}}
| ChemSpiderID = 148623
| InChI = 1/C14H21NO3/c1-2-3-4-5-6-9-12-18-14-11-8-7-10-13(14)15(16)17/h7-8,10-11H,2-6,9,12H2,1H3
| InChIKey = CXVOIIMJZFREMM-UHFFFAOYAD
| StdInChI_Ref = {{stdinchicite|correct|chemspider}}
| StdInChI = 1S/C14H21NO3/c1-2-3-4-5-6-9-12-18-14-11-8-7-10-13(14)15(16)17/h7-8,10-11H,2-6,9,12H2,1H3
| StdInChIKey_Ref = {{stdinchicite|correct|chemspider}}
| StdInChIKey = CXVOIIMJZFREMM-UHFFFAOYSA-N
| CASNo_Ref = {{cascite|correct|CAS}}
| CASNo =37682-29-4
| PubChem =169952
| SMILES = [O-][N+](=O)c1ccccc1OCCCCCCCC
}}
|Section2={{Chembox Properties
| Formula =C{{sub|14}}H{{sub|21}}NO{{sub|3}}
| MolarMass =251.321
| Appearance =
| Density =1.04 g/mL
| MeltingPt =
| BoilingPtC = 197 to 198
| BoilingPt_notes = (11 mm Hg)
| Solubility =
  }}
|Section3={{Chembox Hazards
| MainHazards =
| FlashPt =
| AutoignitionPt = 
 }}
}}

'''1-(2-Nitrophenoxy)octane''', also known as '''nitrophenyl octyl ether''' and abbreviated '''NPOE''', is a 
[[chemical compound]] that is used as a matrix in [[fast atom bombardment]] [[mass spectrometry]], liquid 
[[secondary ion mass spectrometry]], and as a highly [[lipophilic]] [[plasticizer]] in [[polymer]] 
[[Polymeric membrane|membranes]] used in [[ion selective electrode]]s.

== See also ==

* [[Glycerol]]
* [[3-Mercaptopropane-1,2-diol]]
* [[3-Nitrobenzyl alcohol]]
* [[18-Crown-6]]
* [[Sulfolane]]
* [[Diethanolamine]]
* [[Triethanolamine]]

{{DEFAULTSORT:Nitrophenoxy)octane, 1-(2-}}
[[Category:Nitrobenzenes]]
[[Category:Phenol ethers]]</text>
      <sha1>0n15t2w0sp7a50fjptoytuyus0vsrww</sha1>
    </revision>
  </page>

We want to extract out the specific article text (denoted by the <text> field), perhaps capture some other specific fields, remove internal tags, and then create a clean text representation that we can further process. This additional processing includes removing stoplist words, finding and identifying phrases (multiple token chunks), and then tokenizing the text suitable for processing as computer input.

There are multiple methods available for this kind of processing. One approach, for example, uses XML parsing and specific code geared to the Wikipedia dump. Another approach uses a dedicated Wikipedia extractor. There are actually a few variants of dedicated extractors.

However, one particular Python package, gensim, provides multiple utilities and Wikipedia services. Since I had already identified gensim to provide services like sentiment analysis and some other NLP capabilities, I chose to focus on using this package for the needed Wikipedia cleaning tasks.

Gensim has a gensim.corpora.wikicorpus.WikiCorpus class designed specifically for processing the Wikipedia article dump file. Fortunately, I was able to find some example code on KDnuggets that showed the way in how to process this file

https://stackoverflow.com/questions/56715394/how-do-i-use-the-wikipedia-dump-as-a-gensim-model (doc2vec example) and
https://www.kdnuggets.com/2017/11/building-wikipedia-text-corpus-nlp.html as another.

However, prior to using gensim, I needed to combine the batch outputs from my Wikipedia page retrievals into a single xml file, which I could then bzip for direct ingest by gensim. (Most gensim models and capabilities can read either bzip or text files.)

Each 5 K xml page retrieval from Wikipedia comes with its own header and closing tags. These need to be manually snipped out of the group retrieval files before combining. We prepared these into nine blocks that corresponded to each of the batch Wikipedia retrievals, and retained the header and closing tags in the first and last files respectively:

NOTE: Due to GitHub’s file size limits (of 100 MB max), the nine text files listed in the next routine have been zipped and uploaded to kbpedia.org/cwpk-text/Wikipedia-pages-1.zip. To use these files, you will need to download to your local system and unzip. You will need to increment the zip files up to #9. Then, all following routines below must be repeated locally in order to progress through the various cleaning and preparation steps.

out_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-pages-full.xml'
filenames = [r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-1.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-2.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-3.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-4.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-5.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-6.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-7.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-8.txt', 
             r'C:\1-PythonProjects\kbpedia\v300\models\inputs\Wikipedia-pages-9.txt']
with open(out_f, 'w', encoding='utf-8') as outfile:
    for fname in filenames:
        with open(fname, encoding='utf-8', errors='ignore') as infile:
            i = 0
            for line in infile:
                i = i + 1
                try:
                    outfile.write(line)
                except Exception as e:
                    print('Error at line:' + i + str(e))
            print('Now combined:' + fname)
    outfile.close 
    print('Now all files combined!')

The output of this routine is then bzipped offline, and then used as the submission to the gensim WikiCorpus function that processes the standard xml output:

"""
Creates a corpus from Wikipedia dump file.
Inspired by:
https://github.com/panyang/Wikipedia_Word2vec/blob/master/v1/process_wiki.py
"""

import sys
from gensim.corpora import WikiCorpus

in_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-pages-full.xml.bz2'
out_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-output-full.txt'

def make_corpus(in_f, out_f):

    """Convert Wikipedia xml dump file to text corpus"""

    output = open(out_f, 'w', encoding='utf-8')            # made change
    wiki = WikiCorpus(in_f)
    i = 0
    for text in wiki.get_texts():
        try:
            output.write(' '.join(map(lambda x:x.decode('utf-8'), text)) + '\n')
        except Exception as e:
            print ('Exception error: ' + str(e))
        i = i + 1
        if (i % 10000 == 0):
            print('Processed ' + str(i) + ' articles')
    output.close()
    print('Processed ' + str(i) + ' articles;')
    print('Processing complete!')

make_corpus(in_f, out_f)

We further make a smaller input file, enwiki-test-corpus.xml.bz2, with only a few records from the Wikipedia XML dump in order to speed testing of the above code.

Initial Results

Here is what the sample program produced for the entry for 1-(2-Nitrophenoxy)octane listed above:

nitrophenoxy octane also known as nitrophenyl octyl ether and abbreviated npoe is chemical compound that is used as matrix in fast atom bombardment mass spectrometry liquid secondary ion mass spectrometry and as highly lipophilic plasticizer in polymer membranes used in ion selective electrodes see also glycerol mercaptopropane diol nitrobenzyl alcohol crown sulfolane diethanolamine triethanolamine

We see a couple of things that are perhaps not in keeping with the extracted information we desire:

No title
No sentence boundaries
No internal category links
No infobox specifications

On the other hand, we do get the content from the ‘See Also’ section.

We want sentence boundaries for cleaner training purposes for word embedding models like word2vec. We want the other items so as to improve the lexical richness and context for the given concept. Further, we want two versions: one with titles as a separate field and one for learning purposes that includes the title in the lexicon (titles, after all, are preferred labels and deserve an additional frequency boost).

OK, so how does one make these modifications? My first hope was that arguments to these functions (args) might provide the specification latitude to deal with these changes. Unfortunately, none of the specified items fell into this category, though there is much latitude to modify underlying procedures. The second option was to find some third-party modification or override. Indeed, I did find one, that I found quite intriguing as a way to at least deal with sentence boundaries and possibly other areas. I spent nearly a full day trying to adapt this script, never succeeding. One fix would lead to another need for a fix, research on that problem, and then a fix and more problems. I’m sure most all of this is due to my amateur programming skills.

Still, the effort was frustrating. The good thing, however, is that in trying to work out a third-party fix, I was learning the underlying module. Eventually, it became clear if I was to address all desired areas it was smartest to modify the source directly. The three key functions that emerged as needing attention were tokenize, process_article and the class WikiCorpus(TextCorpus) code. In fact, it was the text processing heart of the last class that was the focus for changes, but the other two functions got involved because of their supporting roles. As I attempted to sub-class this basis with my own parallel approach (class KBWikiCorpus(WikiCorpus), I kept finding the need to bring into the picture more supporting functions. Some of this may have been due to nuances in how to specify imported functions and modules, which I am still learning about (see concluding installments). But it is also difficult to sub-set or modify any code.

The real impact of these investigations was to help me understand the underlying module. What at first blush looked too intimidating, now was becoming understandable. I could also see other portions of the underlying module that addressed ALL aspects of my earlier desires. Third-party modifications choose their own scope; direct modification of the underlying module provides more aspects to tweak. So, I switched emphasis from modifying a third-party overlay to directly changing the core underlying module.

Modifying WikiCorpus

We already knew the key functions needing focus. All changes to be made occur in the wikicorpus.py file that resides in your gensim package directory under Python packages. So, I make a copy of the original and name it such, then proceed to modify the base file. Though we will substitute this modified wikicorpus_kb.py file, I will also keep a backup of it as well such that we have copies of the original and modified file.

Here is the resulting modified code, with notes about key changes following the listing:

with open('files/wikicorpus_kb.py', 'r') as f:
    print(f.read())

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (C) 2010 Radim Rehurek <radimrehurek@seznam.cz>
# Copyright (C) 2012 Lars Buitinck <larsmans@gmail.com>
# Copyright (C) 2018 Emmanouil Stergiadis <em.stergiadis@gmail.com>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump.

Uses multiprocessing internally to parallelize the work and process the dump more quickly.

Notes
-----
If you have the `pattern <https://github.com/clips/pattern>`_ package installed,
this module will use a fancy lemmatization to get a lemma of each token (instead of plain alphabetic tokenizer).

See :mod:`gensim.scripts.make_wiki` for a canned (example) command-line script based on this module.

"""

import bz2
import logging
import multiprocessing
import re
import signal
from pickle import PicklingError
# LXML isn't faster, so let's go with the built-in solution
try:
    from xml.etree.cElementTree import iterparse
except ImportError:
    from xml.etree.ElementTree import iterparse


from gensim import utils
# cannot import whole gensim.corpora, because that imports wikicorpus...
from gensim.corpora.dictionary import Dictionary
from gensim.corpora.textcorpus import TextCorpus

from six import raise_from


logger = logging.getLogger(__name__)

ARTICLE_MIN_WORDS = 50
"""Ignore shorter articles (after full preprocessing)."""

# default thresholds for lengths of individual tokens
TOKEN_MIN_LEN = 2
TOKEN_MAX_LEN = 15

RE_P0 = re.compile(r'<!--.*?-->', re.DOTALL | re.UNICODE)
"""Comments."""
RE_P1 = re.compile(r'<ref([> ].*?)(</ref>|/>)', re.DOTALL | re.UNICODE)
"""Footnotes."""
RE_P2 = re.compile(r'(\n\[\[[a-z][a-z][\w-]*:[^:\]]+\]\])+$', re.UNICODE)
"""Links to languages."""
RE_P3 = re.compile(r'{{([^}{]*)}}', re.DOTALL | re.UNICODE)
"""Template."""
RE_P4 = re.compile(r'{{([^}]*)}}', re.DOTALL | re.UNICODE)
"""Template."""
RE_P5 = re.compile(r'\[(\w+):\/\/(.*?)(( (.*?))|())\]', re.UNICODE)
"""Remove URL, keep description."""
RE_P6 = re.compile(r'\[([^][]*)\|([^][]*)\]', re.DOTALL | re.UNICODE)
"""Simplify links, keep description."""
RE_P7 = re.compile(r'\n\[\[[iI]mage(.*?)(\|.*?)*\|(.*?)\]\]', re.UNICODE)
"""Keep description of images."""
RE_P8 = re.compile(r'\n\[\[[fF]ile(.*?)(\|.*?)*\|(.*?)\]\]', re.UNICODE)
"""Keep description of files."""
RE_P9 = re.compile(r'<nowiki([> ].*?)(</nowiki>|/>)', re.DOTALL | re.UNICODE)
"""External links."""
RE_P10 = re.compile(r'<math([> ].*?)(</math>|/>)', re.DOTALL | re.UNICODE)
"""Math content."""
RE_P11 = re.compile(r'<(.*?)>', re.DOTALL | re.UNICODE)
"""All other tags."""
RE_P12 = re.compile(r'(({\|)|(\|-(?!\d))|(\|}))(.*?)(?=\n)', re.UNICODE)
"""Table formatting."""
RE_P13 = re.compile(r'(?<=(\n[ ])|(\n\n)|([ ]{2})|(.\n)|(.\t))(\||\!)([^[\]\n]*?\|)*', re.UNICODE)
"""Table cell formatting."""
RE_P14 = re.compile(r'\[\[Category:[^][]*\]\]', re.UNICODE)
"""Categories."""
RE_P15 = re.compile(r'\[\[([fF]ile:|[iI]mage)[^]]*(\]\])', re.UNICODE)
"""Remove File and Image templates."""
RE_P16 = re.compile(r'\[{2}(.*?)\]{2}', re.UNICODE)
"""Capture interlinks text and article linked"""
RE_P17 = re.compile(
    r'(\n.{0,4}((bgcolor)|(\d{0,1}[ ]?colspan)|(rowspan)|(style=)|(class=)|(align=)|(scope=))(.*))|'
    r'(^.{0,2}((bgcolor)|(\d{0,1}[ ]?colspan)|(rowspan)|(style=)|(class=)|(align=))(.*))',
    re.UNICODE
)
"""Table markup"""
IGNORED_NAMESPACES = [
    'Wikipedia', 'Category', 'File', 'Portal', 'Template',
    'MediaWiki', 'User', 'Help', 'Book', 'Draft', 'WikiProject',
    'Special', 'Talk'
]
"""MediaWiki namespaces that ought to be ignored."""


def filter_example(elem, text, *args, **kwargs):
    """Example function for filtering arbitrary documents from wikipedia dump.


    The custom filter function is called _before_ tokenisation and should work on
    the raw text and/or XML element information.

    The filter function gets the entire context of the XML element passed into it,
    but you can of course choose not the use some or all parts of the context. Please
    refer to :func:`gensim.corpora.wikicorpus.extract_pages` for the exact details
    of the page context.

    Parameters
    ----------
    elem : etree.Element
        XML etree element
    text : str
        The text of the XML node
    namespace : str
        XML namespace of the XML element
    title : str
       Page title
    page_tag : str
        XPath expression for page.
    text_path : str
        XPath expression for text.
    title_path : str
        XPath expression for title.
    ns_path : str
        XPath expression for namespace.
    pageid_path : str
        XPath expression for page id.

    Example
    -------
    .. sourcecode:: pycon

        >>> import gensim.corpora
        >>> filter_func = gensim.corpora.wikicorpus.filter_example
        >>> dewiki = gensim.corpora.WikiCorpus(
        ...     './dewiki-20180520-pages-articles-multistream.xml.bz2',
        ...     filter_articles=filter_func)

    """
    # Filter German wikipedia dump for articles that are marked either as
    # Lesenswert (featured) or Exzellent (excellent) by wikipedia editors.
    # *********************
    # regex is in the function call so that we do not pollute the wikicorpus
    # namespace do not do this in production as this function is called for
    # every element in the wiki dump
    _regex_de_excellent = re.compile(r'.*\{\{(Exzellent.*?)\}\}[\s]*', flags=re.DOTALL)
    _regex_de_featured = re.compile(r'.*\{\{(Lesenswert.*?)\}\}[\s]*', flags=re.DOTALL)

    if text is None:
        return False
    if _regex_de_excellent.match(text) or _regex_de_featured.match(text):
        return True
    else:
        return False


def find_interlinks(raw):
    """Find all interlinks to other articles in the dump.

    Parameters
    ----------
    raw : str
        Unicode or utf-8 encoded string.

    Returns
    -------
    list
        List of tuples in format [(linked article, the actual text found), ...].

    """
    filtered = filter_wiki(raw, promote_remaining=False, simplify_links=False)
    interlinks_raw = re.findall(RE_P16, filtered)

    interlinks = []
    for parts in [i.split('|') for i in interlinks_raw]:
        actual_title = parts[0]
        try:
            interlink_text = parts[1]
        except IndexError:
            interlink_text = actual_title
        interlink_tuple = (actual_title, interlink_text)
        interlinks.append(interlink_tuple)

    legit_interlinks = [(i, j) for i, j in interlinks if '[' not in i and ']' not in i]
    return legit_interlinks


def filter_wiki(raw, promote_remaining=True, simplify_links=True):
    """Filter out wiki markup from `raw`, leaving only text.

    Parameters
    ----------
    raw : str
        Unicode or utf-8 encoded string.
    promote_remaining : bool
        Whether uncaught markup should be promoted to plain text.
    simplify_links : bool
        Whether links should be simplified keeping only their description text.

    Returns
    -------
    str
        `raw` without markup.

    """
    # parsing of the wiki markup is not perfect, but sufficient for our purposes
    # contributions to improving this code are welcome :)
    text = utils.to_unicode(raw, 'utf8', errors='ignore')
    text = utils.decode_htmlentities(text)  # '&amp;nbsp;' --> '\xa0'
    return remove_markup(text, promote_remaining, simplify_links)


def remove_markup(text, promote_remaining=True, simplify_links=True):
    """Filter out wiki markup from `text`, leaving only text.

    Parameters
    ----------
    text : str
        String containing markup.
    promote_remaining : bool
        Whether uncaught markup should be promoted to plain text.
    simplify_links : bool
        Whether links should be simplified keeping only their description text.

    Returns
    -------
    str
        `text` without markup.

    """
    text = re.sub(RE_P2, '', text)  # remove the last list (=languages)
    # the wiki markup is recursive (markup inside markup etc)
    # instead of writing a recursive grammar, here we deal with that by removing
    # markup in a loop, starting with inner-most expressions and working outwards,
    # for as long as something changes.
#    text = remove_template(text)                                # Note
    text = remove_file(text)
    iters = 0
    while True:
        old, iters = text, iters + 1
        text = re.sub(RE_P0, '', text)  # remove comments
        text = re.sub(RE_P1, '', text)  # remove footnotes
        text = re.sub(RE_P9, '', text)  # remove outside links
        text = re.sub(RE_P10, '', text)  # remove math content
        text = re.sub(RE_P11, '', text)  # remove all remaining tags
#        text = re.sub(RE_P14, '', text)  # remove categories                           # Note
        text = re.sub(RE_P5, '\\3', text)  # remove urls, keep description

        if simplify_links:
            text = re.sub(RE_P6, '\\2', text)  # simplify links, keep description only
        # remove table markup
        text = text.replace("!!", "\n|")  # each table head cell on a separate line
        text = text.replace("|-||", "\n|")  # for cases where a cell is filled with '-'
        text = re.sub(RE_P12, '\n', text)  # remove formatting lines
        text = text.replace('|||', '|\n|')  # each table cell on a separate line(where |{{a|b}}||cell-content)
        text = text.replace('||', '\n|')  # each table cell on a separate line
        text = re.sub(RE_P13, '\n', text)  # leave only cell content
        text = re.sub(RE_P17, '\n', text)  # remove formatting lines

        # remove empty mark-up
        text = text.replace('[]', '')
        # stop if nothing changed between two iterations or after a fixed number of iterations
        if old == text or iters > 2:
            break

    if promote_remaining:
        text = text.replace('[', '').replace(']', '')  # promote all remaining markup to plain text

    return text


def remove_template(s):
    """Remove template wikimedia markup.

    Parameters
    ----------
    s : str
        String containing markup template.

    Returns
    -------
    str
        Ð¡opy of `s` with all the `wikimedia markup template <http://meta.wikimedia.org/wiki/Help:Template>`_ removed.

    Notes
    -----
    Since template can be nested, it is difficult remove them using regular expressions.

    """
    # Find the start and end position of each template by finding the opening
    # '{{' and closing '}}'
    n_open, n_close = 0, 0
    starts, ends = [], [-1]
    in_template = False
    prev_c = None
    for i, c in enumerate(s):
        if not in_template:
            if c == '{' and c == prev_c:
                starts.append(i - 1)
                in_template = True
                n_open = 1
        if in_template:
            if c == '{':
                n_open += 1
            elif c == '}':
                n_close += 1
            if n_open == n_close:
                ends.append(i)
                in_template = False
                n_open, n_close = 0, 0
        prev_c = c

    # Remove all the templates
    starts.append(None)
    return ''.join(s[end + 1:start] for end, start in zip(ends, starts))


def remove_file(s):
    """Remove the 'File:' and 'Image:' markup, keeping the file caption.

    Parameters
    ----------
    s : str
        String containing 'File:' and 'Image:' markup.

    Returns
    -------
    str
        Ð¡opy of `s` with all the 'File:' and 'Image:' markup replaced by their `corresponding captions
        <http://www.mediawiki.org/wiki/Help:Images>`_.

    """
    # The regex RE_P15 match a File: or Image: markup
    for match in re.finditer(RE_P15, s):
        m = match.group(0)
        caption = m[:-2].split('|')[-1]
        s = s.replace(m, caption, 1)
    return s

def tokenize(content):
# ORIGINAL VERSION
#def tokenize(content, token_min_len=TOKEN_MIN_LEN, token_max_len=TOKEN_MAX_LEN, lower=True):
    """Tokenize a piece of text from Wikipedia.

    Set `token_min_len`, `token_max_len` as character length (not bytes!) thresholds for individual tokens.

    Parameters
    ----------
    content : str
        String without markup (see :func:`~gensim.corpora.wikicorpus.filter_wiki`).
    token_min_len : int
        Minimal token length.
    token_max_len : int
        Maximal token length.
    lower : bool
         Convert `content` to lower case?

    Returns
    -------
    list of str
        List of tokens from `content`.

    """
    # ORIGINAL VERSION
    # TODO maybe ignore tokens with non-latin characters? (no chinese, arabic, russian etc.)
#    return [
#        utils.to_unicode(token) for token in utils.tokenize(content, lower=lower, errors='ignore')
#        if token_min_len <= len(token) <= token_max_len and not token.startswith('_')
#    ]
# NEW VERSION
    return [token.encode('utf8') for token in utils.tokenize(content, lower=True, errors='ignore')
            if len(token) <= 15 and not token.startswith('_')]
# TO RESTORE MOST PUNCTUATION
#    return [token.encode('utf8') for token in content.split() 
#           if len(token) <= 15 and not token.startswith('_')]

def get_namespace(tag):
    """Get the namespace of tag.

    Parameters
    ----------
    tag : str
        Namespace or tag.

    Returns
    -------
    str
        Matched namespace or tag.

    """
    m = re.match("^{(.*?)}", tag)
    namespace = m.group(1) if m else ""
    if not namespace.startswith("http://www.mediawiki.org/xml/export-"):
        raise ValueError("%s not recognized as MediaWiki dump namespace" % namespace)
    return namespace


_get_namespace = get_namespace


def extract_pages(f, filter_namespaces=False, filter_articles=None):
    """Extract pages from a MediaWiki database dump.

    Parameters
    ----------
    f : file
        File-like object.
    filter_namespaces : list of str or bool
         Namespaces that will be extracted.

    Yields
    ------
    tuple of (str or None, str, str)
        Title, text and page id.

    """
    elems = (elem for _, elem in iterparse(f, events=("end",)))

    # We can't rely on the namespace for database dumps, since it's changed
    # it every time a small modification to the format is made. So, determine
    # those from the first element we find, which will be part of the metadata,
    # and construct element paths.
    elem = next(elems)
    namespace = get_namespace(elem.tag)
    ns_mapping = {"ns": namespace}
    page_tag = "{%(ns)s}page" % ns_mapping
    text_path = "./{%(ns)s}revision/{%(ns)s}text" % ns_mapping
    title_path = "./{%(ns)s}title" % ns_mapping
    ns_path = "./{%(ns)s}ns" % ns_mapping
    pageid_path = "./{%(ns)s}id" % ns_mapping

    for elem in elems:
        if elem.tag == page_tag:
            title = elem.find(title_path).text
            text = elem.find(text_path).text

            if filter_namespaces:
                ns = elem.find(ns_path).text
                if ns not in filter_namespaces:
                    text = None

            if filter_articles is not None:
                if not filter_articles(
                        elem, namespace=namespace, title=title,
                        text=text, page_tag=page_tag,
                        text_path=text_path, title_path=title_path,
                        ns_path=ns_path, pageid_path=pageid_path):
                    text = None

            pageid = elem.find(pageid_path).text
            yield title, text or "", pageid  # empty page will yield None

            # Prune the element tree, as per
            # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
            # except that we don't need to prune backlinks from the parent
            # because we don't use LXML.
            # We do this only for <page>s, since we need to inspect the
            # ./revision/text element. The pages comprise the bulk of the
            # file, so in practice we prune away enough.
            elem.clear()

_extract_pages = extract_pages  # for backward compatibility


def process_article(args):
# ORIGINAL VERSION
#def process_article(args, tokenizer_func=tokenize, token_min_len=TOKEN_MIN_LEN,
#                    token_max_len=TOKEN_MAX_LEN, lower=True):
    """Parse a Wikipedia article, extract all tokens.

    Notes
    -----
    Set `tokenizer_func` (defaults is :func:`~gensim.corpora.wikicorpus.tokenize`) parameter for languages
    like Japanese or Thai to perform better tokenization.
    The `tokenizer_func` needs to take 4 parameters: (text: str, token_min_len: int, token_max_len: int, lower: bool).

    Parameters
    ----------
    args : (str, bool, str, int)
        Article text, lemmatize flag (if True, :func:`~gensim.utils.lemmatize` will be used), article title,
        page identificator.
    tokenizer_func : function
        Function for tokenization (defaults is :func:`~gensim.corpora.wikicorpus.tokenize`).
        Needs to have interface:
        tokenizer_func(text: str, token_min_len: int, token_max_len: int, lower: bool) -> list of str.
    token_min_len : int
        Minimal token length.
    token_max_len : int
        Maximal token length.
    lower : bool
         Convert article text to lower case?

    Returns
    -------
    (list of str, str, int)
        List of tokens from article, title and page id.

    """

    text, lemmatize, title, pageid = args
    text = filter_wiki(text)
    if lemmatize:
        result = utils.lemmatize(text)
    else:
# ORIGINAL VERSION
#        result = tokenizer_func(text, token_min_len, token_max_len, lower)
# NEW VERSION
        result = tokenize(text)
#        result = title + text
    return result, title, pageid


def init_to_ignore_interrupt():
    """Enables interruption ignoring.

    Warnings
    --------
    Should only be used when master is prepared to handle termination of
    child processes.

    """
    signal.signal(signal.SIGINT, signal.SIG_IGN)


def _process_article(args):
    """Same as :func:`~gensim.corpora.wikicorpus.process_article`, but with args in list format.

    Parameters
    ----------
    args : [(str, bool, str, int), (function, int, int, bool)]
        First element - same as `args` from :func:`~gensim.corpora.wikicorpus.process_article`,
        second element is tokenizer function, token minimal length, token maximal length, lowercase flag.

    Returns
    -------
    (list of str, str, int)
        List of tokens from article, title and page id.

    Warnings
    --------
    Should not be called explicitly. Use :func:`~gensim.corpora.wikicorpus.process_article` instead.

    """
    tokenizer_func, token_min_len, token_max_len, lower = args[-1]
    args = args[:-1]

    return process_article(
        args, tokenizer_func=tokenizer_func, token_min_len=token_min_len,
        token_max_len=token_max_len, lower=lower
    )


class WikiCorpus(TextCorpus):
    """Treat a Wikipedia articles dump as a read-only, streamed, memory-efficient corpus.

    Supported dump formats:

    * <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2
    * <LANG>wiki-latest-pages-articles.xml.bz2

    The documents are extracted on-the-fly, so that the whole (massive) dump can stay compressed on disk.

    Notes
    -----
    Dumps for the English Wikipedia can be founded at https://dumps.wikimedia.org/enwiki/.

    Attributes
    ----------
    metadata : bool
        Whether to write articles titles to serialized corpus.

    Warnings
    --------
    "Multistream" archives are *not* supported in Python 2 due to `limitations in the core bz2 library
    <https://docs.python.org/2/library/bz2.html#de-compression-of-files>`_.

    Examples
    --------
    .. sourcecode:: pycon

        >>> from gensim.test.utils import datapath, get_tmpfile
        >>> from gensim.corpora import WikiCorpus, MmCorpus
        >>>
        >>> path_to_wiki_dump = datapath("enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2")
        >>> corpus_path = get_tmpfile("wiki-corpus.mm")
        >>>
        >>> wiki = WikiCorpus(path_to_wiki_dump)  # create word->word_id mapping, ~8h on full wiki
        >>> MmCorpus.serialize(corpus_path, wiki)  # another 8h, creates a file in MatrixMarket format and mapping

    """
    def __init__(self, fname, processes=None, lemmatize=utils.has_pattern(), dictionary=None,
                 filter_namespaces=('0',)):
# ORIGINAL VERSION
#                 filter_namespaces=('0',), tokenizer_func=tokenize, article_min_tokens=ARTICLE_MIN_WORDS,
#                 token_min_len=TOKEN_MIN_LEN, token_max_len=TOKEN_MAX_LEN, lower=True, filter_articles=None):
        """Initialize the corpus.

        Unless a dictionary is provided, this scans the corpus once,
        to determine its vocabulary.

        Parameters
        ----------
        fname : str
            Path to the Wikipedia dump file.
        processes : int, optional
            Number of processes to run, defaults to `max(1, number of cpu - 1)`.
        lemmatize : bool
            Use lemmatization instead of simple regexp tokenization.
            Defaults to `True` if you have the `pattern <https://github.com/clips/pattern>`_ package installed.
        dictionary : :class:`~gensim.corpora.dictionary.Dictionary`, optional
            Dictionary, if not provided,  this scans the corpus once, to determine its vocabulary
            **IMPORTANT: this needs a really long time**.
        filter_namespaces : tuple of str, optional
            Namespaces to consider.
        tokenizer_func : function, optional
            Function that will be used for tokenization. By default, use :func:`~gensim.corpora.wikicorpus.tokenize`.
            If you inject your own tokenizer, it must conform to this interface:
            `tokenizer_func(text: str, token_min_len: int, token_max_len: int, lower: bool) -> list of str`
        article_min_tokens : int, optional
            Minimum tokens in article. Article will be ignored if number of tokens is less.
        token_min_len : int, optional
            Minimal token length.
        token_max_len : int, optional
            Maximal token length.
        lower : bool, optional
             If True - convert all text to lower case.
        filter_articles: callable or None, optional
            If set, each XML article element will be passed to this callable before being processed. Only articles
            where the callable returns an XML element are processed, returning None allows filtering out
            some articles based on customised rules.

        Warnings
        --------
        Unless a dictionary is provided, this scans the corpus once, to determine its vocabulary.

        """
        self.fname = fname
        self.filter_namespaces = filter_namespaces
#        self.filter_articles = filter_articles
        self.metadata = True
        if processes is None:
            processes = max(1, multiprocessing.cpu_count() - 1)
        self.processes = processes
        self.lemmatize = lemmatize
#        self.tokenizer_func = tokenizer_func
#        self.article_min_tokens = article_min_tokens
#        self.token_min_len = token_min_len
#        self.token_max_len = token_max_len
#        self.lower = lower
#        get_title = cur_title

        if dictionary is None:
            self.dictionary = Dictionary(self.get_texts())
        else:
            self.dictionary = dictionary

    def get_texts(self):
        """Iterate over the dump, yielding a list of tokens for each article that passed
        the length and namespace filtering.

        Uses multiprocessing internally to parallelize the work and process the dump more quickly.

        Notes
        -----
        This iterates over the **texts**. If you want vectors, just use the standard corpus interface
        instead of this method:

        Examples
        --------
        .. sourcecode:: pycon

            >>> from gensim.test.utils import datapath
            >>> from gensim.corpora import WikiCorpus
            >>>
            >>> path_to_wiki_dump = datapath("enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2")
            >>>
            >>> for vec in WikiCorpus(path_to_wiki_dump):
            ...     pass

        Yields
        ------
        list of str
            If `metadata` is False, yield only list of token extracted from the article.
        (list of str, (int, str))
            List of tokens (extracted from the article), page id and article title otherwise.

        """
        articles, articles_all = 0, 0
        positions, positions_all = 0, 0
# ORIGINAL VERSION
#        tokenization_params = (self.tokenizer_func, self.token_min_len, self.token_max_len, self.lower)
#        texts = \
#            ((text, self.lemmatize, title, pageid, tokenization_params)
#             for title, text, pageid
#             in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces, self.filter_articles))
#        pool = multiprocessing.Pool(self.processes, init_to_ignore_interrupt)
# NEW VERSION
        texts = ((text, self.lemmatize, title, pageid) for title, text, pageid 
                in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
        pool = multiprocessing.Pool(self.processes)
        try:
            # process the corpus in smaller chunks of docs, because multiprocessing.Pool
            # is dumb and would load the entire input into RAM at once...

# ORIGINAL VERSION            
#            for group in utils.chunkize(texts, chunksize=10 * self.processes, maxsize=1):
# NEW VERSION
            for group in utils.chunkize_serial(texts, chunksize=10 * self.processes):
# ORIGINAL VERSION
#                for tokens, title, pageid in pool.imap(_process_article, group):
# NEW VERSION
                for tokens, title, pageid in pool.imap(process_article, group):  # chunksize=10):
                    articles_all += 1
                    positions_all += len(tokens)
                    # article redirects and short stubs are pruned here
# ORIGINAL VERSION
#                    if len(tokens) < self.article_min_tokens or \
#                            any(title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES):
# NEW VERSION FOR ENTIRE BLOCK
                    if len(tokens) < ARTICLE_MIN_WORDS or \
                          any(title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES):
                        continue
                    articles += 1
                    positions += len(tokens)
                    try:
                        if self.metadata:
                            title = title.replace(' ', '_')
                            title = (title + ',')
                            title = bytes(title, 'utf-8')
                            tokens.insert(0,title)
                            yield tokens
                        else:
                            yield tokens
                    except Exception as e:
                        print('Wikicorpus exception error: ' + str(e))
        except KeyboardInterrupt:
            logger.warn(
                "user terminated iteration over Wikipedia corpus after %i documents with %i positions "
                "(total %i articles, %i positions before pruning articles shorter than %i words)",
# ORIGINAL VERSION
#                articles, positions, articles_all, positions_all, self.article_min_tokens
# NEW VERSION
                articles, positions, articles_all, positions_all
            )
        except PicklingError as exc:
            raise_from(PicklingError('Can not send filtering function {} to multiprocessing, '
                'make sure the function can be pickled.'.format(self.filter_articles)), exc)
        else:
            logger.info(
                "finished iterating over Wikipedia corpus of %i documents with %i positions "
                "(total %i articles, %i positions before pruning articles shorter than %i words)",
# ORIGINAL VERSION
#                articles, positions, articles_all, positions_all, self.article_min_tokens
# NEW VERSION
                articles, positions, articles_all, positions_all
            )
            self.length = articles  # cache corpus length
        finally:
            pool.terminate()

Gensim provides well documented code that is written in an understandable way.

Most of the modifications I made occurred at the bottom of the code listing. However, the text routine at the top of the file allows us to tailor what page ‘sections’ are kept or not in each Wikipedia article. Because of their substantive lexical content, I add the page templates and category names to be retained with the text body.

Assuming I will want to retain these modifications and understand them at a later date, I block off all modified sections with ORIGINAL VERSION and NEW VERSION tags. One change was to remove punctuation. Another was to grab and capture the article title.

This file, then, becomes a replacement to the original wikicorpus.py code. I am cognizant that changing underlying source code for local purposes is generally considered to be a BAD idea. It very well may be so in this case. However, with the backups, and being attentive to version updates and keeping working code in sync, I guess I do not see where keeping track of a modification is any less sustainable than needing to update existing code to a modification. Both require inspection and effort. If I diff on the changed underlying module, I suspect it is of equivalent effort or lesser effort to change a third-party interface modification.

The net result is that I am now capturing the substantive content of these articles in a form I want to process.

Remove Stoplist

In my initial workflow, I had the step of stoplist removal later in the process because I thought it might be helpful to have all text prior to phrase identification. A stoplist (also known as ‘stop words‘), by the way, is a listing of very common words (mostly conjuctions, common verb tenses, articles and propositions) that can be removed from a block of text without adversely affecting its meaning or readability.

Since it proved superior to not retain these stop words when forming n-grams (see next section), I moved the routine up to be next processing of the Wikipedia pages. Here is the relevant code (NOTE: Neither the wikipedia-output-full.txt or wikipedia-output-full-stopped.txt files below are posted to GitHub due to their large (nearly GB) size; they can be reproduced using the GenSim standard stoplist as enhanced with the more_stops below):

import sys
from gensim.parsing.preprocessing import remove_stopwords  # Key line for stoplist
from smart_open import smart_open

in_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-output-full.txt'
out_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-output-full-stopped.txt'

more_stops = ['b', 'c', 'category', 'com', 'd', 'f', 'formatnum', 'g', 'gave', 'gov', 'h', 
              'htm', 'html', 'http', 'https', 'id', 'isbn', 'j', 'k', 'l', 'loc', 'm', 'n', 
              'need', 'needed', 'org', 'p', 'properties', 'q', 'r', 's', 'took', 'url', 'use', 
              'v', 'w', 'www', 'y', 'z']  
documents = smart_open(in_f, 'r', encoding='utf-8')
content = [doc.split(' ') for doc in documents]
with open(out_f, 'w', encoding='utf-8') as output:
    i = 0
    for line in content:
        try:
            line = ', '.join(line)
            line = line.replace(', ', ' ')
            line = remove_stopwords(line)  
            querywords = line.split()
            resultwords = [word for word in querywords if word.lower() not in more_stops]
            line = ' '.join(resultwords)
            line = line + '\n'
            output.write(line)
        except Exception as e:
            print ('Exception error: ' + str(e))
        i = i + 1
        if (i % 10000 == 0):
            print('Stopwords applied to ' + str(i) + ' articles')
    output.close()
    print('Stopwords applied to ' + str(i) + ' articles;')
    print('Processing complete!')

Stopwords applied to 10000 articles
Stopwords applied to 20000 articles
Stopwords applied to 30000 articles
Stopwords applied to 31157 articles;
Processing complete!

Gensim comes with its own stoplist, to which I added a few of my own, including removal of the category keyword that arose from adding that grouping. The output of this routine is the next file in the pipeline, wikipedia-output-full-stopped.txt.

Phrase Identification and Extraction

Phrases are n-grams, generally composed of two or three paired words, which are known as ‘bigrams’ and ‘trigrams’, respectively. Phrases are one of the most powerful ways to capture domain or technical language, since these compounded terms arise through the use and consensus of their users. Some phrases help disambiguate specific entities or places, as when for example ‘river’, ‘state’, ‘university’ or ‘buckeyes’ does when combined with the term ‘ohio’.

Generally, most embeddings or corpora do not include n-grams in their initial preparation. But, for the reasons above, and experience of the usefulness of n-grams to text retrieval, we decided to include phrase identification and extraction as part of our preprocessing.

Again, gensim comes with a pre-trained phrase identifier (like all gensim models, you can re-train and tune these models as you gain experience and want them to perform differently). The main work of this routine is the ngram call, wherein term adjacency is used to construct paired term indentifications. Here is the code and settings for our first pass with this function to create our initial bigrams from the stopped input text:

import sys
from gensim.models.phrases import Phraser, Phrases
from gensim.parsing.preprocessing import remove_stopwords  # Key line for stoplist
from smart_open import smart_open

in_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-output-full-stopped.txt'
out_f = r'C:\1-PythonProjects\kbpedia\v300\models\inputs\wikipedia-bigram.txt'

documents = smart_open(in_f, 'r', encoding='utf-8')
sentence_stream = [doc.split(' ') for doc in documents]
common_terms = ['aka']
ngram = Phrases(sentence_stream, min_count=3,threshold=10, max_vocab_size=80000000, 
                delimiter=b'_', common_terms=common_terms)
ngram = Phraser(ngram)
content = list(ngram[sentence_stream])
with open(out_f, 'w', encoding='utf-8') as output:
    i = 0
    for line in content:
        try:
            line = ', '.join(line)
            line = line.replace(', ', ' ')
            line = line.replace(' s ', '')
            output.write(line)
        except Exception as e:
            print ('Exception error: ' + str(e))
        i = i + 1
        if (i % 10000 == 0):
            print('ngrams calculated for ' + str(i) + ' articles')
    output.close()
    print('Calculated ngrams for ' + str(i) + ' articles;')
    print('Processing complete!')

ngrams calculated for 10000 articles
ngrams calculated for 20000 articles
ngrams calculated for 30000 articles
Calculated ngrams for 31157 articles;
Processing complete!

This routine takes about 14 minutes to run on my laptop, with the settings as shown. Note in the routine where we set the delimiter to be the underscore character; this is how we know the bigram.

Once this routine finishes, we can take its output and re-use it as input to a subsequent run. Now, we will be producing trigrams where we can match to existing bigrams. Generally, we set our thresholds and minimum counts higher. In our case, the new settings are min_count=8, threshold=50 The trigram analysis takes 19 min to run.

We have now completed our preprocessing steps for the embedding models we introduce in the next installment.

Additional Documentation

Here are many supplementary resources useful to the environment and natural language processing capabilities introduced in this installment.

PyTorch and pandas

Convert Pandas Dataframe to PyTorch Tensor – pandas → numpy → torch
PyTorch Dataset: Reading Data Using Pandas vs. NumPy
See CWPK #56 for pandas/networkx reads and imports.

PyTorch Resources and Tutorials

The Most Complete Guide to PyTorch for Data Scientists provides the basics of tensors and tensor operations and then provides a high-level overview of the PyTorch capabilities
Awesome-Pytorch-list provides multiple resource categories, with 230 in related packages alone
The official PyTorch documentation
Building Efficient Custom Datasets in PyTorch
Getting Started with PyTorch: A Deep Learning Tutorial
Incredible PyTorch is a curated list of tutorials, papers, projects, communities and more relating to PyTorch
Learning PyTorch with Examples.

spaCy and gensim

Natural Language in Python using spaCy: An Introduction
Gensim Tutorial – A Complete Beginners Guide is excellent and comprehensive.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site. The cowpoke Python code listing covering the series is also available from GitHub.

NOTE: This CWPK installment is available both as an online interactive file

or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.

I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted:November 5, 2020

CWPK #62: Network and Graph Analysis

Knowledge Graphs Deserve Attention in Their Own Right

We first introduced NetworkX in installment CWPK #56 of our Cooking with Python and KBpedia series. The purpose of NetworkX in that installment was to stage data for graph visualizations. In today’s installment, we look at the other side of the NetworkX coin; that is, as a graph analytics capability. We will also discuss NetworkX in relation to staging data for machine learning.

The idea of graphs or networks is at the center of the concept of knowledge graphs. Graphs are unique information artifacts that can be analyzed in their own right as well as being foundations for many unique analytical techniques, including for machine learning and its deep learning subset. Still, graphs as conceptual and mathematical structures are of relatively recent vintage. For example, the field known as graph theory is less than 300 years old. I outlined much of the intellectual history of graphs and their role in analysis in a 2012 article, The Age of the Graph.

Graph or network analysis has three principal aspects. The first aspect is to analyze the nature of the graph itself, with its connections, topologies and paths. The second is to use the structural aspects of the graph representation in order to conduct unique analyses. Some of these analyses relate to community or influence or relatedness. The third aspect is to use various or all aspects of the graph representation of the domain to provide, through dimensionality reduction, tractable and feature-rich methods for analyzing or conducting data science work useful to the domain. We’ll briefly cover the first two aspects in this installment. The remaining installments in this Part VI relate more to the third aspect of graph and deep graph representation.

Initial Setup

We will pick up with our NetworkX work from CWPK #56 to start this installment. (See the concluding sections below if you have not already generated the graph_specs.csv file.)

Since I have been away from the code for a bit, I first decide to make sure my Python packages are up-to-date by running this standard command:

>>>conda update --all

Then, we invoke our standard start-up routine:

from cowpoke.__main__ import *
from cowpoke.config import *
from owlready2 import *

We then want to bring NetworkX into our workspace, along with pandas for data management. The routine we are going to write will read our earlier graph_specs.csv file using pandas. We will use this specification to create a networkx representation of the KBpedia structure, and then begin reporting on some basic graph stats (which will take a few seconds to run):

import networkx as nx
import pandas as pd

df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv')
Graphtype = nx.DiGraph()
G = nx.from_pandas_edgelist(df, edge_attr='weight', create_using=Graphtype)
print('Graph construction complete.')

# Print the number of nodes in the graph
print('Number of Nodes:', len(G.nodes()))
#

print('Edges:', G.edges('Mammal'))
#
# Get the subgraph of all nodes around node
sub = [ n[1] for n in G.edges('Mammal') ]
# Actually, need to add the 'marge' node too
sub.append('Mammal')
#
# Now create a new graph, which is the subgraph
sg = nx.Graph(G.subgraph(sub))
#
# Print the nodes of the new subgraph and edges
print('Subgraph nodes:', sg.nodes())
print('Subgraph edges:', sg.edges())
#
#
# Print basic graph info
#info=nx.info(G)
print('Basic graph info:', nx.info(G))

We have picked ‘mammal’ to generate some subgraphs and we also call up basic graph info based on networkx. As a directed graph, KBpedia can be characterized by both ‘in degree’ and ‘out degree’. ‘in degree’ is the number of edges pointing to a given node (or vertex); ‘out degree’ is the opposite. The average across all nodes in KBpedia exceeds 1.3. Both measures are the same because our only edge type in this structure is subClassOf, which is transitive.

Network Metrics and Operations

So we see that our KBpedia graph has loaded properly, and now we are ready to do some basic network analysis. Most of the analysis deals with the relations structure of the graph. NetworkX has a very clean interface to common measures and metrics of graphs, as our examples below demonstrate.

‘Density‘ is the ratio of actual edges in the network to all possible edges in the network, and ranges from 0 to 1. A ‘dense’ graph is one where the number of edges is close to the maximal number of edges; a ‘sparse’ graph is the opposite. The maximal number of edges is calculated as the potential connections, or nodes X (nodes -1). This potential is multiplied by two for a directed graph, since A → B and B → A are both possible. The density is thus the actual number of connections divided by the potential number. The density of KBpedia is quite sparse.

print('Density:', nx.density(G))

‘Degree‘ is a measure to find the most important nodes in graph, since a node’s degree is the sum of its edges. You can find the degree for an individual node, or the max ones as these two algorithms indicate:

print('Degree:', nx.degree(G,'Mammal'))

‘Average clustering‘ is the sum of all node clusterings. A node is clustered if it has a relatively high number of actual links to neighbors in relation to potential links to neighbors. A small-world network is one where the distance between random nodes grows in proportion to the natural log of the number of nodes in the graph. Low average clustering is an indicator of a small-world network.

print('Average clustering:', nx.average_clustering(G))

G_node = 'Mammal'
print('Clustering for node', G_node, ':', nx.clustering(G, G_node))

‘Path length‘ is calculated as the number of hop jumps traversing two end nodes is a network. An ‘average path length‘ measures shortest paths over a graph and then averages them. A small number indicates a shorter, more easily navigated graph on average, but there can be much variance.

print('Average shortest path length:', nx.average_shortest_path_length(G))

The next three measures throw an error, since KBpedia ‘is not strongly connected.’ ‘Eccentricity‘ is the maximum length between a node and its connecting nodes in a graph, with the ‘diameter‘ being the maximum eccentricity across all nodes and the ‘radius‘ being the minimum.

print('Eccentricity:', nx.eccentricity(G))

print('Diameter:', nx.diameter(G))

print('Radius:', nx.radius(G))

The algorithms that follow take longer to calculate or produce long listings. The first such measure we see is ‘centrality‘, which in NetworkX is the number of connections to a given node, with higher connectivity a proxy for importance. Centrality can be measured in many different ways; there are multiple options in NetworkX.

# Calculate different centrality measures
print('Centrality:', nx.degree_centrality(G))
print('Centrality (eigenvector):', nx.eigenvector_centrality(G))

print('In-degree centrality:', nx.in_degree_centrality(G))

print('Out-degree centrality:', nx.out_degree_centrality(G))

Here are some longer analysis routines (unfortunately, betweenness takes hours to calculate):

# Calculate different centrality measures
print('Betweenness:', nx.betweenness_centrality(G))

As a directed graph, some NetworkX measures are not applicable. Here are some of them:

nx.is_connected(G)
nx.connected_components(G).

Subgraphs

We earlier showed code for extracting a subgraph. Here is a generalized version of that function. Replace the ‘Bird’ reference concept with any other valid RC from KBpedia:

# Provide label for current KBpedia reference concept
rc = 'Bird'
# Get the subgraph of all nodes around node
sub = [ n[1] for n in G.edges(rc) ]
# Actually, need to add the 'rc' node too
sub.append(rc)
#
# Now create a new graph, which is the subgraph
sg = nx.Graph(G.subgraph(sub))
#
# Print the nodes of the new subgraph and edges
print('Subgraph nodes:', sg.nodes())
print('Subgraph edges:', sg.edges())

DeepGraphs

There is a notable utility package called DeepGraphs (and its documentation) that appears to offer some nice partitioning and quick visualization options. I have not installed or tested it.

Full Network Exchange

So far, we have seen the use of networks in driving visualizations (CWPK #56) and, per above, as knowledge artifacts with their own unique characteristics and metrics. The next role we need to highlight for networks is as information providers and graph-based representations of structure and features to analytical applications and machine learners.

NetworkX can convert to and from other data formats:

NumPy arrays
SciPy sparse matrices, or
pandas DataFrames.

All of these are attractive because PyTorch has direct routines for them.

NetworkX can also read and write graphs in multiple formats, some of which include:

adjacency lists
multiline adjacency lists
edge lists
pickle
JSON
YAML
GraphML, and
GEFX.

There are also standard NetworkX functions to convert node and edge labels to integers (such as networkx.relabel.convert_node_labels_to_integers), relabel nodes (networkx.relabel.relabel_nodes), set node attributes (networkx.classes.function.set_node_attributes), or make deep copies (networkx.Graph.to_directed).

There are also certain packages that integrate well with NetworkX and PyTorch and related packages such as direct imports or exports to the Deep Graph Library (DGL) (see CWPK #68 and #69), or built-in converters or the DeepSNAP package may provide a direct bridge between NetworkX and PyTorch Geometric (PyG) (see CWPK #68 and #70).

However, these representations do NOT include the labeled information or annotations. Knowledge graphs, like KBpedia, have some unique aspects that are not fully captured by an existing package like NetworkX.

Fortunately, the previous extract-and-build routines at the heart of this Cooking with Python and KBpedia series are based around CSV files, the same basis as the pandas package. Via pandas we can capture the structure of KBpedia, plus its labels and annotations. Further, as we will see in the next installment, we can also capture full pages for most of these RCs in KBpedia from Wikipedia. This addition will greatly expand our context and feature basis for using KBpedia for machine learning.

For now, I present below two of these three inputs, extracted directly from the KBpedia knowledge graph.

KBpedia Structure

The first of two extraction files useful to all further installments in this Part VI provides the structure of KBpedia. This structure consists of the hierarchical relation between reference concepts using the subClassOf subsumption relation and the assignment of that RC to a typology (SuperType). I first presented this routine in CWPK #56 and it, indeed, captures the requisite structure of the graph:

### KEY CONFIG SETTINGS (see extract_deck in config.py) ###             
# 'kb_src'        : 'standard'                                        # Set in master_deck
# 'loop_list'     : kko_order_dict.values(),                          # Note 1   
# 'base'          : 'C:/1-PythonProjects/kbpedia/v300/build_ins/mappings/',              
# 'ext'           : '.csv',                                         
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv',

def graph_extractor(**extract_deck):
    print('Beginning graph structure extraction . . .')
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    class_loop = extract_deck.get('class_loop')
    base = extract_deck.get('base')
    ext = extract_deck.get('ext')
    
    # Note 2
    parent_set = ['kko.SocialSystems','kko.Products','kko.Methodeutic','kko.Eukaryotes',
              'kko.ConceptualSystems','kko.AVInfo','kko.Systems','kko.Places',
              'kko.OrganicChemistry','kko.MediativeRelations','kko.LivingThings',
              'kko.Information','kko.CopulativeRelations','kko.Artifacts','kko.Agents',
              'kko.TimeTypes','kko.Symbolic','kko.SpaceTypes','kko.RepresentationTypes',
              'kko.RelationTypes','kko.OrganicMatter','kko.NaturalMatter',
              'kko.AttributeTypes','kko.Predications','kko.Manifestations',
              'kko.Constituents']

    if loop is not 'class_loop':
        print("Needs to be a 'class_loop'; returning program.")
        return
    header = ['target', 'source', 'weight', 'SuperType']
    out_file = extract_deck.get('out_file')
    cur_list = []
    with open(out_file, mode='w', encoding='utf8', newline='') as output:                                           
        csv_out = csv.writer(output)
        csv_out.writerow(header)    
        for value in loop_list:
            print('   . . . processing', value)
            s_set = []
            root = eval(value)
            s_set = root.descendants()
            frag = value.replace('kko.','')
            for s_item in s_set:
                child_set = list(s_item.subclasses())
                count = len(list(child_set))
                
# Note 3                
                if value not in parent_set:
                    for child_item in child_set:
                        s_rc = str(s_item)
                        child = str(child_item)
                        new_pair = s_rc + child
                        new_pair = str(new_pair)
                        cur_list.append(new_pair)
                        s_rc = s_rc.replace('rc.','')
                        child = child.replace('rc.','')
                        row_out = (s_rc,child,count,frag)
                        csv_out.writerow(row_out)
                elif value in parent_set:
                    for child_item in child_set:
                        s_rc = str(s_item)
                        child = str(child_item)
                        new_pair = s_rc + child
                        new_pair = str(new_pair)
                        if new_pair not in cur_list:
                            cur_list.append(new_pair)
                            s_rc = s_rc.replace('rc.','')
                            child = child.replace('rc.','')
                            row_out = (s_rc,child,count,frag)
                            csv_out.writerow(row_out)
                        elif new_pair in cur_list:
                            continue
        output.close()         
        print('Processing is complete . . .')
graph_extractor(**extract_deck)

Note, again, the parent_set ordering of typology processing at the top of this function. This ordering processes the more distal (leaf) typologies first, and then ignores subsequent processing of identical structural relationships. This means that the graph structure is cleaner and all subsumption relations are “pushed down” to their most specific mention.

You can inspect the actual structure file produced using this routine, which is also the general basis for reading into various machine learners:

import pandas as pd

df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv')

df

KBpedia Annotations

And, we also need to bring in the annotation values. The annotation extraction routine was first presented and described in CWPK #33, and was subsequently generalized and brought into conformance with our configuration routines in CWPK #33. Note, for example, in the header definition, how we are able to handle either classes or properties. In this instance, plus all subsequent machine learning discussion, we concentrate on the labels and annotations for classes:

### KEY CONFIG SETTINGS (see extract_deck in config.py) ###                
# 'krb_src'       : 'extract'                                          # Set in master_deck
# 'descent_type'  : 'descent',
# 'loop'          : 'class_loop',
# 'loop_list'     : custom_dict.values(),                              # Single 'Generals' specified 
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/extractions/classes/Generals_annot_out.csv',
# 'render'        : 'r_label',

def annot_extractor(**extract_deck):
    print('Beginning annotation extraction . . .') 
    r_default = ''
    r_label = ''
    r_iri = ''
    render = extract_deck.get('render')
    if render == 'r_default':
        set_render_func(default_render_func)
    elif render == 'r_label':
        set_render_func(render_using_label)
    elif render == 'r_iri':
        set_render_func(render_using_iri)
    else:
        print('You have assigned an incorrect render method--execution stopping.')
        return    
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    out_file = extract_deck.get('out_file')
    class_loop = extract_deck.get('class_loop')
    property_loop = extract_deck.get('property_loop')
    descent_type = extract_deck.get('descent_type')
    """ These are internal counters used in this module's methods """
    p_set = []
    a_ser = []
    x = 1
    cur_list = []
    with open(out_file, mode='w', encoding='utf8', newline='') as output:
        csv_out = csv.writer(output)                                       
        if loop == 'class_loop':                                             
            header = ['id', 'prefLabel', 'subClassOf', 'altLabel', 
                      'definition', 'editorialNote', 'isDefinedBy', 'superClassOf']
        else:
            header = ['id', 'prefLabel', 'subPropertyOf', 'domain', 'range', 
                      'functional', 'altLabel', 'definition', 'editorialNote']
        csv_out.writerow(header)    
        for value in loop_list:                                            
            print('   . . . processing', value)                                           
            root = eval(value) 
            if descent_type == 'descent':
                p_set = root.descendants()
            elif descent_type == 'single':
                a_set = root
                p_set.append(a_set)
            else:
                print('You have assigned an incorrect descent method--execution stopping.')
                return    
            for p_item in p_set:
                if p_item not in cur_list:                                 
                    a_pref = p_item.prefLabel
                    a_pref = str(a_pref)[1:-1].strip('"\'')                
                    a_sub = p_item.is_a
                    for a_id, a in enumerate(a_sub):                        
                        a_item = str(a)
                        if a_id > 0:
                            a_item = a_sub + '||' + str(a)
                        a_sub  = a_item
                    if loop == 'property_loop':   
                        a_item = ''
                        a_dom = p_item.domain
                        for a_id, a in enumerate(a_dom):
                            a_item = str(a)
                            if a_id > 0:
                                a_item = a_dom + '||' + str(a)
                            a_dom  = a_item    
                        a_dom = a_item
                        a_rng = p_item.range
                        a_rng = str(a_rng)[1:-1]
                        a_func = ''
                    a_item = ''
                    a_alt = p_item.altLabel
                    for a_id, a in enumerate(a_alt):
                        a_item = str(a)
                        if a_id > 0:
                            a_item = a_alt + '||' + str(a)
                        a_alt  = a_item    
                    a_alt = a_item
                    a_def = p_item.definition
                    a_def = str(a_def)[2:-2]
                    a_note = p_item.editorialNote
                    a_note = str(a_note)[1:-1]
                    if loop == 'class_loop':                                  
                        a_isby = p_item.isDefinedBy
                        a_isby = str(a_isby)[2:-2]
                        a_isby = a_isby + '/'
                        a_item = ''
                        a_super = p_item.superClassOf
                        for a_id, a in enumerate(a_super):
                            a_item = str(a)
                            if a_id > 0:
                                a_item = a_super + '||' + str(a)
                            a_super = a_item    
                        a_super  = a_item
                    if loop == 'class_loop':                                  
                        row_out = (p_item,a_pref,a_sub,a_alt,a_def,a_note,a_isby,a_super)
                    else:
                        row_out = (p_item,a_pref,a_sub,a_dom,a_rng,a_func,
                                   a_alt,a_def,a_note)
                    csv_out.writerow(row_out)                               
                    cur_list.append(p_item)
                    x = x + 1
    print('Total unique IDs written to file:', x)  
    print('The annotation extraction for the', loop, 'is completed.')

You can inspect this actual file of labels and annotations using this routine:

import pandas as pd

df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/classes/Generals_annot_out.csv')

df

We will add Wikipedia pages as a third source for informing our machine learning tests and experiments in our next installment.

Untested Potentials

One area in extended NetworkX capabilities that we do not test here is community structure using the Louvain Community Detection package.

Additional Documentation

Here are additional resources on network analysis and NetworkX:

Exploring and Analyzing Network Data with Python is a good description of basic measures
Analysis of Large-Scale Networks: NetworkX provides good examples and explanations
Graph Algorithms: Graph Analysis and Graph Learning are nice examples and drawings
Network Analysis with Python and NetworkX Cheat Sheet
Functions to convert NetworkX graphs to and from other formats
Knowledge Graph Representation with Jointly Structural and Textual Encoding reflects the structure + text nature of knowledge graphs.

NOTE: This CWPK installment is available both as an online interactive file

Posted:November 2, 2020

CWPK #61: NLP, Machine Learning and Analysis

A Wealth of Applications Sets the Stage for Pay Offs from KBpedia

With this installment of the Cooking with Python and KBpedia series we move into Part VI of seven parts, a part with the bulk of the analytical and machine learning (that is, “data science”) discussion, and the last part where significant code is developed and documented. Because of the complexity of these installments, we will also be reducing the number released per week for the next month or so. We also will not be able to post fully operational electronic notebooks to MyBinder since the supporting libraries strain the limits of that service. At the conclusion of this part, which itself has 11 installments, we have four installments to wrap up the series and provide a consistent roadmap to the entire project.

Knowledge graphs are unique information artifacts, and KBpedia is further unique in terms of its consistent and logical construction as well as its incorporation of significant text content via Wikipedia pages. These characteristics provide unique value for KBpedia, but it is also a combination not duplicated anywhere else in the data science ecosystem. One of the objectives, therefore, of this part of our CWPK series is the creation of some baseline knowledge representations useful to data science aims that capture these unique characteristics.

KBpedia’s (or any knowledge graph constructed in a similar manner) combination of characteristics make it a powerful resource in three areas of data science and machine learning. First, the nearly universal scope and degree of topic coverage with about 56,000 concepts, logically organized into typologies with a high degree of disjointedness, means that accurate ‘slices’ or training sets may be extracted from KBpedia nearly instantaneously. Labeled training sets are one of the most time consuming and expensive activities in doing supervised machine learning. We can extract these nearly for free from KBpedia. Further, with its links to tens of millions of entities in its mapped knowledge bases such as Wikidata, literally tens of thousands of conceptual entities in KBpedia can be the retrieval points to nucleate training sets for fine-grained entity recognition.

Second, 80% of KBpedia’s concepts are mapped to Wikipedia articles. While many Wikipedia-based word embedding models exist, the ones in KBpedia are logically categorized and have rough equivalence in terms of scope and prominence, hopefully providing cleaner topic ‘signals’. To probe these assertions, we will create a unique KBpedia-based word embedding corpus that also leverages labels for items of structural importance, such as typology membership. We will use this corpus in many of our tests and as a general focus in our training sets.

And, third, perhaps the most important area, knowledge graphs offer unique structures and challenges for machine learning, especially innovations in geometric, heterogeneous methods for deep learning. The first generation of deep machine learning was designed for grid-patterned data and matrices through approaches such as deep neural networks, convolutional neural networks (CNN ), or recurrent neural networks (RNN). The ‘deep’ appelation comes from having multiple calculated, intermediate layers of transformations between the grid inputs and outputs for the model. Graphs, on the other hand, are heterogeneous between nodes and edges. They may be directed (subsumptive) in nature. And, for knowledge graphs, they have much labeling and annotation, including varying degrees of attribute completeness. Language embedding, itself often a product of deep learning, enables the efficient incorporation of text. It is only in the past five years that concerted attention has been devoted to better capturing this feature richness for knowledge graphs.

The eleven installments in this part will look in more depth at networks and graphs, focus on how to create training sets and embeddings for the learners, discuss some natural language packages and uses, and then look in depth at ‘standard’ machine learners and deep learners. We will install the first generation of deep graph learners and then explore some on the cutting edge. We will test many use cases, but will also try to invoke classifiers across this spectrum so that we can draw some general conclusions.

The material below introduces and tees up these topics. We describe leading Python packages for data science, and how we have architected our own approach, We have picked a particular Python machine learning framework, PyTorch, to which we will then tie four different NLP and deep learning libraries. We devote two installments each to these four libraries. The use cases we document across these installments are in addition to the existing ones we have in Clojure posted online.

So, we think we have an interesting suite of benefits to cover in this part, some arising from being based on KBpedia and some arising from the nature of knowledge graphs. On the other hand, due to the relative immaturity of the field, we are still actively learning and innovating around the juncture of AI and knowledge graphs. Thus, one of the reasons we emphasize Python ‘ecosystems’ and ‘frameworks’ in this part is to be better prepared to incorporate those innovations and learnings to come.

Background

One of the first prototypes of machine learning comes from the statistician Ronald Fisher in the 1930s regarding how to classify Iris species based on the attributes of their flowers. It was a multivariate data example using the method we today call linear discriminant analysis. This classic example is still taught. But many dozens of new algorithms and combined approaches have joined the machine learning field since then.

Figure 1 below is one way to characterize the field, with ML standing for machine learning and DL for deep learning, with this one oriented to sub-fields in which some Python package already exists:

Figure 1: Machine Learning Landscape (from S. Chen, “Machine Learning Algorithms For Beginners with Code Examples in Python”, June 2020)

There are many possible diagrams that one might prepare to show the machine learning landscape, including ones with a larger emphasis on text and knowledge graphs. Most all schematics of the field show a basic split between supervised learning and unsupervised learning, (sometimes with reinforcement learning as another main branch), with the main difference being that supervised approaches iterate to achieve statistical fit with pre-determined labels, whereas unsupervised is unlabeled. Accurate labeling can be costly and time consuming. Note that the idea of ‘classification’ is a supervised one, ‘clustering’ a notion of unsupervised.

We will include a ‘standard’ machine learning library in our proposed toolkit, the selection of which I discuss below. However, the most evaluation time I spent in researching these installments was directed to the idea of knowledge representation and embeddings applicable to graphs. Graphs pose a number of differences and challenges to standard machine learning. They have only been a recent (5 yr) focus in machine learning, which is also rapidly changing over time.

All machine learners need to operate on their feature spaces in numerical representations. Text is a tricky form because language is difficult and complex, and how to represent the tokens within our language usable by a computer needs to consider, what? Parts-of-speech, the word itself, sentence construction, semantic meaning, context, adjacency, entity recognition or characterization? These may all figure into how one might represent text. Machine learning has brought us unsupervised methods for converting words to sentences to documents and, now, graphs, to a reduced, numeric representation known as “embeddings.” The embedding method may capture one or more of these textual or structural aspects.

Much of the first interest in machine learning based on graphs was driven by these interests in embeddings for language text. Standard machine classifiers with deep learning using neural networks have given us word2vec, and more recently BERT and its dozens of variants have reinforced the usefulness of deep learning to create pre-trained text representations.

Indeed, embeddings do figure prominently in knowledge graph representation, but only as one among many useful features. Knowledge graphs with hierarchical (subsumption) relationships, as might be found in any taxonomy, become directed. Knowledge graphs are asymmetrical, and often multi-typed and sometimes multi-modal. There is heterogeneity among nodes and links or edges. Not all knowledge graphs are created equal and some of these aspects may not apply. Whether there is an accompanying richness of text description that accompanies the node or edges is another wrinkle. None of the early CNN or RNN or simple neural net approaches match well with these structures.

The general category that appears to have emerged for this scope is geometric deep learning, which applies to all forms of graphs and manifolds. There are other nuances in this area, for example whether a static representation is the basis for analysis or one that is dynamic, essentially allowing learning parameters to be changed as the deep learning progresses through its layers. But GDL has the theoretical potential to address and incorporate all of the wrinkles associated with heterogeneous knowledge graphs.

So, this discussion helps define our desired scope. We want to be able to embrace Python packages that range from simple statistics to simple machine learning, throwing in natural language processing and creating embedding representations, that can then range all the way through deep learning to the cutting-edge aspects of geometric or graph deep learning.

Leading Python Data Science Packages

This background provides the necessary context for our investigations of Python packages, frameworks, or libraries that may fulfill the data science objectives of this part. Our new components often build upon and need to play nice with some of the other requisite packages introduced in earlier installments, including pandas (CWPK #55), NetworkX (CWPK #56), and PyViz (CWPK #55). NumPy has been installed, but not discussed.

We want to focus our evaluation of Python options in these areas:

Natural Language Processing, including embeddings
‘Standard’ Machine Learning
Deep Learning and Abstraction Frameworks, and
Knowledge Graph Representation Learning.

The latter area may help us tie these various components together.

Natural Language Processing

It is not fair to say that natural language processing has become a ‘commodity’ in the data science space, but it is also true there is a wealth of capable, complete packages within Python. There are standard NLP requirements like text cleaning, tokenization, parts-of-speech identification, parsing, lemmatization, phrase identification, and so forth. We want these general text processing capabilities since they are often building blocks and sometimes needed in their own right. We also would like to add to this baseline such considerations as interoperability, creating embeddings, or other special functions.

The two leading NLP packages in Python appear to be:

NLTK – the natural language toolkit that is proven and has been a leader for twenty years
spaCy – a newer, but very impressive package oriented more to tasks, not function calls.

Other leading packages, with varying NLP scope, include:

flair – a very simple framework for state-of-the-art NLP that is based on PyTorch and works based on context
gensim – a semantic and topic modeling library; not general purpose, but with valuable capabilities
OpenNMT-py – an open source library for neural machine translation and neural sequence learning; provided for both the PyTorch and TensorFlow environments
Polyglot – a natural language pipeline that supports massive multilingual applications
Stanza – a neural network pipeline for text analytics; beyond standard functions, has multi-word token (MWT) expansion, morphological features, and dependency parsing; uses the Java CoreNLP from Stanford
TextBlob – a simplified text processor, which is an extension to NLTK.

Another key area is language embedding. Language embeddings are means to translate language into a numerical representation for use in downstream analysis, with great variety in what aspects of language are captured and how to craft them. The simplest and still widely-used representation is tf-idf (term frequency–inverse document frequency) statistical measure. A common variant after that was the vector space model. We also have latent (unsupervised) models such as LDA. A more easily calculated option is explicit semantic analysis (ESA). At the word level, two of the prominent options are word2vec and gloVe, which are used directly in spaCy. These have arisen from deep learning models. We also have similar approaches to represent topics (topicvec), sentences (sentence2vec), categories and paragraphs (Category2Vec), documents (doc2vec), node2vec or entire languages (BERT and variants and GPT-3 and related methods). In all of these cases, the embedding consists of reducing the dimensionality of the input text, which is then represented in numeric form.

There are internal methods for creating embeddings in multiple machine learning libraries. Some packages are more dedicated, such as fastText, which is a library for learning of word embeddings and text classification created by Facebook’s AI Research (FAIR) lab. Another option is TextBrewer, which is an open-source knowledge distillation toolkit based on PyTorch and which uses (among others) BERT to provide text classification, reading comprehension, NER or sequence labeling.

Closely related to how we represent text are corpora and datasets that may be used either for reference or training purposes. These need to be assembled and tested as well as software packages. The availability of corpora to different packages is a useful evaluation criterion. But, the picking of specific corpora depends on the ultimate Python packages used and the task at hand. We will return to this topic in CWPK #63.

‘Standard’ Machine Learning

Of course, nearly all of the Python packages mentioned in this Part VI have some relation to machine learning in one form or another. I call out the ‘standard’ machine learning category separately because, like for NLP, I think it makes sense to have a general learning library not devoted to deep learning but providing a repository of classic learning methods.

There really is no general option that compares with scikit-learn. It features various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN data clustering, and is designed to interoperate with NumPy and SciPy. The project is extremely active with good documentation and examples.

We’ll return to scikit-learn below.

Deep Learning and Abstraction Frameworks

Deep learning is characterized by many options, methods and philosophies, all in a fast-changing area of knowledge. New methods need to be compared on numerous grounds from feature and training set selection to testing, parameter tuning, and performance comparisons. These realities have put a premium on libraries and frameworks that wrap methods in repeatable interfaces and provide abstract functions for setting up and managing various deep (and other) learning algorithms.

The space of deep learning thus embraces many individual methods and forms, often expressed through a governing ecosystem of other tools and packages. These demands lead to a confusing and overlapping and non-intersecting space of Python options that are hard to describe and comparatively evaluate. Here are some of the libraries and packages that fit within the deep and machine learning space, including abstraction frameworks:

Chainer is an open source deep learning framework written purely in Python on top of NumPy and CuPy Python libraries
Microsoft Cognitive Toolkit (CNTK) is an open-source toolkit for commercial-grade distributed deep learning; however, it has seen its last main release in favor of the interoperable approach, ONNX (see below)
Keras is an open-source library that provides a Python interface for artificial neural networks. Keras now acts as an interface for the TensorFlow library and is built on top of Theano; it has a high-level library for working with datasets
PlaidML is a portable tensor compiler; it runs as a component under Keras
PyTorch is an open source machine learning library based on the Torch library with a very rich ecosystem of interfacing or contributing projects
TensorFlow is a well-known open source machine learning library developed by Google
Theano is a Python library and optimizing compiler for manipulating and evaluating mathematical expressions, especially matrix-valued ones; it is tightly integrated with NumPy, and uses it at the lowest level.

Keras is increasingly aligning with TensorFlow and some, like Chainer and CNTK, are being deprecated in favor of the two leading gorillas, PyTorch and TensorFlow. One approach to improve interoperability is the Open Neural Network Exchange (ONNX) with the repository available on GitHub. There are existing converters to ONNX for Keras, TensorFlow, PyTorch and scikit-learn.

A key development from deep learning of the past three years has been the usefulness of Transformers, a technique that marries encoders and decoders converging on the same representation. The technique is particularly helpful to sequential data and NLP, with state-of-the-art performance to date for:

next-sentence prediction
question answering
reading comprehension
sentiment analysis, and
paraphrasing.

Both BERT and GPT are pre-trained products that utilize this method. Both TensorFlow and PyTorch contain Transformer capabilities.

Knowledge Graph Representation Learning

As noted, most of my research for this Part VI has resided in the area of a subset of deep graph learning applicable to knowledge graphs. The leading deep learning libraries do not, in general, provide support for this area of representational learning, sometimes called knowledge representation learning (KRL) or knowledge graph embedding (KGE). Within this rather limited scope, most options also seem oriented to link prediction and knowledge graph completion (KGC), rather than the heterogeneous aspects with text and OWL2 orientation characteristic of KBpedia.

Various capabilities desired or tested for knowledge graph representational learning include:

low-dimensional vectors (embeddings) with semantic meaning
knowledge graph completion (KGC)
triple classification
entity recognition
entity disambiguation (linking)
relation extraction
recommendation systems
question answering, and
common sense reasoning.

Unsupervised graph relational learning is used for:

link prediction
graph reconstruction
visualization, or
clustering.

Supervised GRL is used for:

node classification, and
graph classification (predict node labels).

This kind of learning is a subset of the area known as geometric deep learning, deep graphs, or graph representation (or representation) learning. We thus see this rough hierarchy:

machine learning → deep learning → geometric deep learning → graph (R) learning → KG learning

In terms of specific packages or libraries, there is a wealth of options in this new field:

AmpliGraph is a suite of neural machine learning models for relational learning using supervised learning on knowledge graphs
DGL-KE is a high performance, easy-to-use, and scalable package for learning large-scale knowledge graph embeddings based on the Deep Graph Library, which is a library for GNN in PyTorch. Can run DGL-KE on CPU machine, GPU machine, as well as clusters with a set of popular models, including TransE, TransR, RESCAL, DistMult, ComplEx, and RotatE
Graph Nets is DeepMind’s library for building graph networks in TensorFlow
KGCN is the novel knowledge graph convolutional network model that is part of KGLIB; it requires GRAKN
OpenKE is an efficient implementation based on PyTorch for knowledge representation learning
OWL2Vec* represents each OWL named entity (class, instance or property) by a vector, which then can feed downstream tasks; see GitHub
PyKEEN is a Python library for training and evaluating knowledge graph embeddings; see GitHub
Pykg2vec: A Python Library for Knowledge Graph Embedding is a Python library for knowledge graph embedding and representation learning; see GitHub
PyTorch Geometric (PyG) is a geometric deep learning extension library for PyTorch with excellent documentation and an emphasis of providing wrappers to state-of-art models
RDF2Vec is an unsupervised technique that builds further on Word2Vec; RDF2Vec Light is a lightweight approach to KG embeddings
scikit-kge is a Python library to compute embeddings of knowledge graphs that ties directly into scikit-learn; an umbrella for the RESCAL, HolE, TransE, and ER-MLP algorithms; it has not been updated in five years
StellarGraph provides multiple variants of neural networks for both homogeneous and heterogeneous graphs, and relies on the TensorFlow, Keras, NetworkX, and scikit-learn libraries
TorchKGE provides knowledge graph embedding in PyTorch; it reportedly is faster than AmpiGraph and OpenKE.

One graph learning framework that caught my eye is KarateClub, an unsupervised machine learning extension library for NetworkX. I like the approach they are taking, but their library can not yet handle directed graphs. I will be checking periodically on their announced intention to extend this framework to directed graphs in the near future.

Lastly, more broadly, there is the recently announced KGTK, which is a generalized toolkit with broader capabilities for large scale knowledge graph manipulation and analysis. KGTK also puts forward a standard KG file format, among other tools.

A Generalized Python Data Science Architecture

With what we already have in hand, plus the libraries and packages described above, we have a pretty good inventory of candidates to choose from in proceeding with our next installments. Like our investigations around graphics and visualization (see CWPK #55), the broad areas of data science, machine learning, and deep learning have been evolving to one of comprehensive ecosystems. Figure 2 below presents a representation of the Python components that make sense for the machine learning and application environment. As noted, our local Windows machines lack separate GPUs (graphical processing units), so the hardware is based on a standard CPU (which has an integrated GPU that can not be separately targeted). We have already introduced and discusses some of the major Python packages and libraries, including pandas, NetworkX, and PyViz. Here is that representative data science architecture:

Figure 2: Representative Python Components (from S. Raschka et al. “Machine Learning in Python: Main Developments and technology Trends in Data Science, Machine Learning, and Artificial Intelligence”, March 31, 2020)

The defining architectural question for this Part VI is what general deep and machine learning framework we want (if any). I think using a framework makes sense over scripting together individual packages, though for some tests that still might be necessary. If I was to adopt a framework, I would also want one that has a broad set of tools in its ecosystem and common and simpler ways to define projects and manage the overall pipelines from data to results. As noted, the two major candidates appear to be TensorFlow and PyTorch.

TensorFlow has been around the longest, has, today, the strongest ecosystem, and reportedly is better for commercial deployments. Google, besides being the sponsor, uses TensorFlow in most of its ML projects and has shown a commitment to compete with the upstart PyTorch by significantly re-designing and enhancing TensorFlow 2.0.

On the other hand, I very much like the more ‘application’ orientation of PyTorch. Innovation has been fast and market share has been rising. The consensus from online reviews is that PyTorch, in comparison to TensorFlow:

runs dramatically faster on both CPU and GPU architectures
is easier to learn
produces faster graphs, and
Is more amenable to third-party tools.

Though some of the intriguing packages for TensorFlow are not apparently available for PyTorch, including Graph Nets, Keras, Plaid ML, and StellarGraph, PyTorch does have these other packages not yet mentioned that look potentially valuable down the road:

Captum – a unified and generic model interpretability library
Catalyst – a framework for deep learning R&D
DGL – the Deep Graph Libary needed for DGL-KE discussed below
fastai – simplifies training fast and accurate neural nets using modern best practices
flair – a simple framework for state-of-the-art NLP that may complement or supplement spaCy
PyTorch Geometric – a geometric deep learning extension library, also discussed below
PyTorch-NLP – a library of basic utilities for PyTorch NLP that may supplement or replace spaCy or flair, and
skorch – a scikit-learn compatible neural network library that wraps PyTorch.

One disappointment is that neither of these two leading packages directly ingest RDFLib graph files, though with PyTorch and DGL you can import or export a NetworkX graph directly. pandas is also a possible data exchange format.

Consideration of all of these points has led us to select PyTorch as the initial data science framework. It is good to know, however, that a fairly comparable alternative also exists with TensorFlow and Keras.

Finally, with respect to Figure 2 above, we have no plans at present to use the Dask package for parallelizing analytic calculations.

Four Additional Key Packages

With the PyTorch decision made, at least for the present, we are now clear to deal with specific additional packages and libraries. I highlight four of these in this section. Each of these four is the focus of two separate installments as we work to complete this Part VI. One of these four is in natural language processing (spaCy), one in general machine learning (scikit-learn), and two in deep learning with an emphasis on graphs (DGL and DGL-KE, and PyG). These choices again tend to reinforce the idea of evaluating whole ecosystems, as opposed to single packages. Note, of course, that more specifics on these four packages will be presented in the forthcoming installments.

spaCy

I find spaCy to be very impressive as our baseline NLP system, with many potentially useful extensions or compatible packages including sense2vec, spacy-stanza, spacy-wordnet, torchtext, and gensim.

The major competitor is NLTK. The reputation of the NLTK package is stellar and it has proven itself for decades. It is a more disaggregate approach often favored by scholars and researchers to enable users to build complex NLP functionality. It is therefore harder to use and configure, and is also less performant. The real differentiator, however, is the more object or application orientation of spaCy.

Though NLTK appears to have good NLP tools for processing data pipelines using text, most of these functions appear to be in spaCy and there are also the flair and PyTorch-NLP packages available in the PyTorch environment if needed. gensim looks to be a very useful enhancement to the environment because of the advanced text evaluation modes it offers, including sentiment. Not all of these will be tested during this CWPK series, but it will be good to have these general capabilities resident in cowpoke.

scikit-learn

We earlier signaled our intent to embrace scikit-learn, principally to provide basic machine learning support. scikit-learn provides a unified API to these basic tasks, including crafting pipelines and meta-functions to integrate the data flow. scikit-learn works on any numeric data stored as NumPy arrays or SciPy sparse matrices. Other types that are convertible to numeric arrays such as pandas DataFrames are also acceptable.

Some of the general ML methods, and there are about 40 supervised ones in the package, may be useful and applicable to specific circumstances include:

dimensionality reduction
model testing
preprocessing
scoring methods, and
principal component analysis (PCA).

A real test of this package will be ease of creating (and then being able to easily modify) data processing and analysis pipelines. Another test will be ingesting, using, and exporting data formats useful to the KBpedia knowledge graph. We know that scikit-learn doesn’t talk directly to NetworkX, though there may be recipes for the transfer; graphs are represented in scikit-learn as connectivity matrices. pandas can interface via common formats including CSV, Excel, JSON and SQL, or, with some processing, DataFrames. scikit-learn supports data formats from NumPy and SciPy, and it supports a datasets.load_files format that may be suitable for transferring many and longer text fields. One option that is intriguing is how to leverage the CSV flat-file orientation of our KG build and extract routines in cowpoke for data transfer and transformation.

I also want to keep an eye on the possible use of skorch to better integrate with the overall PyTorch environment, or to add perhaps needed and missing functionality or ease of development. There is much to explore with these various packages and environments.

DGL-KE

For our basic, ‘vanilla’, deep graph analysis package we have chosen the eponymous Deep Graph Library for basic graph neural network operations, which may run on CPU or GPU machines or clusters. The better interface relevant to KBpedia is through DGL-KE, a high performance, reportedly easy-to-use, and scalable package for learning large-scale knowledge graph embeddings that extends DGL. DGL-KE also comes configured with the popular models of TransE, TransR, RESCAL, DistMult, ComplEx, and RotatE.

PyTorch Geometric

PyTorch Geometric (PyG) is closely tied to PyTorch, and most impressively has uniform wrappers to about 40 state-of-art graph neural net methods. The idea of ‘message passing’ in the approach means that heterogeneous features such as structure and text may be combined and made dynamic in their interactions with one another. Many of these intrigued me on paper, and now it will be exciting to test and have the capability to inspect these new methods as they arise. DeepSNAP may provide a direct bridge between NetworkX and PyTorch Geometric.

Possible Future Extensions

During the research on this Part VI I encountered a few leads that are either not ready for prime time or are off scope to the present CWPK series. A potentially powerful, but experimental approach that makes sense is to use SPARQL as the request-and-retrieval mechanism against the graph to feed the machine learners. RDFFrames provides an imperative Python API that gets internally translated to SPARQL, and it is integrated with the PyData machine learning software stack; see GitHub. Some methods above also use SPARQL. One of the benefits of a SPARQL approach, besides its sheer query and inferencing power, is the ability to keep the knowledge graph intact without data transform pipelines. It is quite available to serve up results in very flexible formats. The relative immaturity of the approach and performance considerations may be difficult challenges to overcome.

I earlier mentioned KarateClub, a Python framework combining about 40 state-of-the-art unsupervised graph mining algorithms in the areas of node embedding, whole-graph embedding, and community detection. It builds on the packages of NetworkX, PyGSP, gensim, NumPy, and SciPy. Unfortunately, the package does not support directed graphs, though plans to do so have been stated. This project is worth monitoring.

A third intriguing area involves the use of quaternions based on Clifford algebras in their machine learning codes. Charles Peirce, the intellectual guide for the design of KBpedia, was a mathematician of some renown in his own time, and studied and applauded William Kingdon Clifford and his emerging algebra as a contemporary in the 1870s, shortly before Clifford’s untimely death. Peirce scholars have often pointed to this influence in the development of Peirce’s own algebras. I am personally interested in probing this approach to learn a bit more of Peirce’s manifest interests.

Organization of This Part’s Installments

These selections and the emphasis on our four areas lead to these anticipated CWPK installments over the coming weeks:

CWPK #61 – NLP, Machine Learning and Analysis
CWPK #62 – Network and Graph Analysis
CWPK #63 – Staging Data Sci Resources and Preprocessing
CWPK #64 – Embedding, NLP Analysis, and Entity Recognition
CWPK #65 – scikit-learn Basics and Initial Analyses
CWPK #66 – scikit-learn Classifiers
CWPK #67 – Knowledge Graph Embedding Models
CWPK #68 – Setting Up and Configuring the Deep Graph Learners
CWPK #69 – DGL-KE Classifiers
CWPK #70 – State-of-Art PyG 2 Classifiers
CWPK #71 – A Comparison of Results.

Additional Documentation

Here are some general resources:

Natural Language Processing Recipes: Best Practices and Examples – nice set of NLP notebooks
Another set of notebooks https://aihub.cloud.google.com/s?category=notebook
Machine Learning in Python: Main Developments and Technology Trends in Data Science, Machine Learning, and Artificial Intelligence see Figure 1 for a possible Python architecture diagram; 48 pp and more of an academic overview. has many links to GitHub projects
An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec a pretty comprehensive overview from 2017
Python Machine Learning Tutorials – an overview of ML tutorials from a generally good source, Real Python
PyTorch vs. TensorFlow – A Detailed Comparison – a balanced and fair comparison of the two frameworks.

Network Representational Learning

Machine Learning on Graphs: A Model and Comprehensive Taxonomy – the goal of this survey is to provide a unified view of representation learning methods for graph-structured data, to better understand the different ways to leverage graph structure in deep learning models; see GitHub GCNN TensorFlow implementation
Awesome Graph Classification – a collection of graph classification methods, covering embedding, deep learning, graph kernel and factorization papers
A Comprehensive Comparison of Unsupervised Network Representation Learning Methods – a comparison of unsupervised only and not attentive to heterogeneous networks.

Knowledge Graph Representational Learning

A Review of Relational Machine Learning for Knowledge Graphs(2015); one of the first to focus on the space
Awesome Graph Representation Learning – a curated list of awesome graph representation learning
A Survey on Knowledge Graphs: Representation, Acquisition and Applications from August 2020
Heterogeneous Network Representation Learning: Survey and Benchmark best paper for understanding the challenges of heterogeneous network embeddings
Knowledge Graph Embedding: A Survey of Approaches and Applications is an overview of embedding models of entities and relationships for knowledge base completion
Introduction to Geometric Deep Learning “GDL also shines in applications where the use of graphs is more common, like knowledge graphs.”
Geometric Deep Learning Library Comparison follow-on to the above paper. Compares PyTorch Geometric, Deep Graph Library, Graph Nets
RDF2Vec Light — A Lightweight Approach for Knowledge Graph Embeddings allows to train partial, task-specific models withonly a fraction of the computation requirements compared to other embedding ap-proaches, while retaining a high performance on multiple tasks.

NOTE: This CWPK installment is available both as an online interactive file

Posted:October 29, 2020

CWPK #60: Adding a SPARQL Endpoint – Part II

Finally Getting a Remote SPARQL Working Instance

Yesterday’s installment of Cooking with Python and KBpedia presented the first part of this two-part series on developing a SPARQL endpoint for KBpedia on a remote server. This concluding part picks up with step #7 in the stepwise approach I took to complete this task.

At the outset I thought it would progress rapidly: After all, is not SPARQL a proven query language with central importance to knowledge graphs? But, possibly because our focus in the series is Python, or perhaps for other reasons, I have found a dearth of examples to follow regarding setting up a Python SPARQL endpoint (there are some resources available related to REST APIs).

The first six steps in yesterday’s installment covered getting our environment set up on the remote Linux server, including installing the Web framework Flask and creating a virtual environment. We also presented the Web page form design and template for our SPARQL query form. This second part covers the steps of tieing this template form into actual endpoint code, which proved to be simple in presentation but exceeding difficult to formulate and debug. Once this working endpoint is in hand, I next cover the steps of giving the site an external URL address, starting and stopping the service on the server, and packaging the code for GitHub distribution. I conclude this two-part series with some lessons learned, some comments on the use and relevance of linked data, and point to additional documentation.

Step-wise Approach (con’t)

We pick up our step-wise approach here.

7. Tie SPARQL Form to Local Instance

So far, we have a local instance that works from the command line and an empty SPARQL form. We need to relate thest two pieces together. In the last installment, I noted two SPARQL-related efforts, pyLDAPI (and its GNAF example) and adhs. I could not find working examples for either, but I did consult their code frequently while testing various options.

Thus, unlike many areas throughout this CWPK series, I really had no working examples from which to build or modify our current SPARQL endpoint needs. While the related efforts above and other examples could provide single functions or small snippets, possibly as use for guidance or some modification, it was pretty clear I was going to need to build up the code step-by-step, in a similar stepwise manner to what I was following for the entire endpoint. Fortunately, as described in step #6, I did have a starting point for the Web page template using the GNAF example.

From a code standpoint, the first area we need to address is to convert our example start-up stub, what was called test_sparql.py in the CWPK #58 installment, to our main application for this endpoint. We choose to call it cowpoke-endpoint.py in keeping with its role. We will build on the earlier stub by adding most of our import and Flask-routing (‘@app.route("/")‘, for example) statements, as well as the initialization code for the endpoint functions. We will call out some specific aspects of this file as we build it.

The second coding area we need to address is how to tie the text areas in our Web form template to the actual Python code. We will put some of that code in the template and some of that code in cowpoke-endpoint.py governing getting and processing the SPARQL query. There is a useful pattern for how to relate templates to Python code via what might be entered into a text area from StackOverflow. Here is the code example that should be put within the governing template, using the important {{ url_for('submit') }}:

<form action="{{ url_for('submit') }}" method="post">
    <textarea name="text">
    <input type="submit">
</form>

and here is the matching code that needs to go into the governing Python file:

from flask import Flask, request, render_template

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('form.html')

@app.route('/submit', methods=['POST'])
def submit():
    return 'You entered: {}'.format(request.form['text'])

Note that file names, form names and routes all need to be properly identified and matched. Also note that imports need to be complete. Further notice in the file listing below that we modify the return statement. We also repeat this form related to the SPARQL results text area.

One of the challenging needs in the code development was working with a remote instance, as opposed to local code. I was also now dealing with a Linux environment, not my local Windows one. After much trial-and-error, which I’m sure is quite familiar to professional developers working in a client-server framework, I learned some valuable (essential!) lessons:

First, with my miniconda approach and its minimal starting Python basis, I needed to check every new package import required by the code and check whether it was already in the remote instance. The conda list command is important here to first check whether the package is already in the Python environment or not. If not, I would need to find the proper repository for the package and install it per the instructions in CWPK #58
I needed to make sure that the permission (Linux chmod) and ownership (Linux chown settings were properly set on the target directories for the remote instance such that I could use my SSH-based file transfer program (WinSCP in my case; Filezilla is another leading option). I simply do not do enough Linux work to be comfortable with remote editors. SSH transfer would enable me to work on the developing code in my local Windows instance
I needed to get basic templates working early, since I needed Web page targets for where the outputs or traces of the running code would display
I needed to restart the Apache2 server whenever there was a new code update to be tested. This resulted in a fairly set workflow of edit → upload → re-start Apache → call up remote Web template form (e.g., http://xx.xxx.xxx.xxx/sparql) → inspect trace or logs → rinse and repeat
Be attentive to and properly set content types, since we are moving data and results from Web forms to code and back again. Content header information can be tricky, and one needs to use cURL or wget (or Postman, which is often referenced, but I did not use). One way to inspect headers and content types is in the output Web page templates, using this code:
```
 req = request.form print(req) 
```
In HTML forms, use the < code for the left angle bracket symbol (used in SPARQL queries to denote a URI link), otherwise the link will not display on the Web page since this character is reserved
Used the standard W3C validator when needing to check encodings and Web addresses
Be extremely attentive to the use of tabs v white spaces in your Python code. Get in the habit of using spaces only, and not tabbing for indents. Editors are more forgiving in a Windows development environment; Linux ones are not.

The reason I began assembling these lessons arose from the frustrations I had in early code development. Since I was getting small pieces of the functionality running directly in Python from the command line, some of which is shown in the prior two installments, my initial failures to import these routines in a code file (*.py) and get them to work had me pulling my hair out. I simply could not understand why routines that worked directly from the command line did not work once embedded into a code file.

One discovery is that Flask does not play well with the Python list command. If one inspects prior SPARQL examples in this series (for example, CWPK #25), one can see that this construct is common with the standard query code. One adjustment, therefore, was to remove the list generator, and install a looping function for the query output. This applied to both RDFLib and owlready2.

Besides the lessons presented above, some of the hypotheses I tested to get things to work included the use of CDATA (which only applies to XML), pasting to or saving and retrieving from intermediate text files, changing content-type or mimetype, treatment of the Python multi-line convention ("""), possible use of JavaScript, and more. Probably the major issue I needed to overcome was turning on space and tab display in my local editor to remove their mixed use. This experience really brought home to me the fundamental adherence to indentation in the Python language.

Nonetheless, by following these guidelines and with eventual multiple tries, I was finally able to get a basic code block working, as documented under the next step.

8. Create and validate an external SPARQL query using SPARQLwrapper to this endpoint.

Since the approach that worked above got closer to the standard RDFLib approach, I decided to expand the query form to allow for external searches as well. Besides modifications to the Web page template, the use of external sources also invokes the SPARQLwrapper extension to RDFLib. Though its results presentation is a bit different, and we now have a requirement to also input and retrieve the URL of the external SPARQL endpoint, we were able to add this capability fairly easily.

The resulting code is actually quite simple, though the path to get there was anything but. I present below the eventual code file so developed, with code notes following the listing. You will see that, aside from the Flask code conventions and decorators, that our code file is quite similar to others developed throughout cowpoke:

from flask import Flask, Response, request, render_template            # Note 1
from owlready2 import *
import rdflib
from rdflib import Graph
import json
from SPARQLWrapper import SPARQLWrapper, JSON, XML

# load knowledge graph files
main = '/var/data/kbpedia/kbpedia_reference_concepts.owl'              # Note 2
skos_file = 'http://www.w3.org/2004/02/skos/core' 
kko_file = '/var/data/kbpedia/kko.owl'

# set up scopes and namespaces
world = World()                                                        # Note 2 
kb = world.get_ontology(main).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')
skos = world.get_ontology(skos_file).load()
kb.imported_ontologies.append(skos)
kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)

graph = world.as_rdflib_graph()

# set up Flask microservice
app = Flask(__name__)                                                  # Note 3

@app.route("/")
def sparql_form():
    return render_template('sparql_page.html')

# set up route for submitting query, receiving results 
@app.route('/submit', methods=['POST'])                                # Note 4
def submit():
#    if request.method == 'POST':
    q_submit = None
    results = ''
    if request.form['q_submit'] is None or len(request.form['q_submit']) < 5:
        return Response(
        'Your request to the SPARQL endpoint must contain a \'query\'.',
        mimetype = 'text/plain'
        )
    else:
        data = request.form['q_submit']                                # Note 5
        source = request.form['selectSource']
        format = request.form['selectFormat']
        q_url = request.values.get('q_url')
        try:                                                           # Note 6
            if source == 'kbpedia' and format == 'owlready':           # Note 7
                q_query = graph.query_owlready(data)                   # Note 8
                for row in q_query:
                    row = str(row)
                    results = results + row
                results = results.replace(']', ']\n')
            elif source == 'kbpedia' and format == 'rdflib':
                q_query = graph.query(data)                            # Note 8
                for row in q_query:
                    row = str(row)
                    results = results + row
                results = results.replace('))', '))\n')
            elif source == 'kbpedia' and format == 'xml':
                q_query = graph.query(data)
                for row in q_query:
                    row = str(row)
                    results = results + row
                results = q_query.serialize(format='xml')
                results = str(results)
                results = results.replace('<result>', '\n<result>')
            elif source == 'kbpedia' and format == 'json':
                q_query = graph.query(data)
                for row in q_query:
                    row = str(row)
                    results = results + row
                results = q_query.serialize(format='json')
                results = str(results)
                results = results.replace('}}, ', '}}, \n')
            elif source == 'kbpedia' and format == 'html':             #Note 9
                q_query = graph.query(data)
                for row in q_query:
                    row = str(row)
                    results = results + row
                results = q_query.serialize(format='csv')
                results = str(results)
                results = results.readlines()
#                table = '<html><table>'
                for row in results:
#                     row = str(row)                    
                     result = row[0]
#                    row = row.replace('\r\n', '')
#                    row = row.replace(',', '</td><td>')
#                    table += '<tr><td>' + row + '</td></tr>' + '\n'
#                table += '</table><br></html>' 
#                results = table
                return result
            elif source == 'kbpedia' and format == 'txt':
                q_query = graph.query(data)
                for row in q_query:
                    row = str(row)
                    results = results + row
                results = q_query.serialize(format='txt')
            elif source == 'kbpedia' and format == 'csv':
                q_query = graph.query(data)
                for row in q_query:
                    row = str(row)
                    results = results + row
                results = q_query.serialize(format='csv')
            elif source == 'external' and format == 'rdfxml':
                q_url = str(q_url)
                results = q_url
            elif source == 'external' and format == 'xml':
                sparql = SPARQLWrapper(q_url)
                data = data.replace('\r', '')
                sparql.setQuery(data)
                results = sparql.query()
            elif source == 'external' and format == 'json':            #Note 10
                sparql = SPARQLWrapper(q_url)
                data = data.replace('\r', '')
#                data = data.replace("\n", "\n' + '")
#                data = '"' + data + '"'
                sparql.setQuery(data)
                sparql.setReturnFormat(JSON)                           #Note 10
                results = sparql.queryAndConvert()
#                q_sparql = str(sparql)
#                results = q_sparql
            else:                                                      #Note 11
                results = ('This combination of Source + Format is not available. Here are the possible combinations:\n\n' + 
                           '    Kbpedia:   owlready2:    Formats:  owlready2\n' + 
                           '                  rdflib:                 rdflib\n' +
                           '                                             xml\n' +
                           '                                            json\n' +
                           '                                           *html\n' +
                           '                                            text\n' +
                           '                                             csv\n' +
                           '   External:  as entered:                rdf/xml\n' +
                           '                                           *json\n\n' +
                           '            * combo still buggy')
            if format == 'html':
                return Response(results, mimetype='text/html')         # Note 9, 12
            else:
                return Response(results, mimetype='text/plain')
        except Exception as e:                                         # Note 6
            return Response(
            'Error(s) found in query: ' + str(e),
            mimetype = 'text/plain'
            )

if __name__ == "__main__":
    app.run(debug=true)

Here are some annotation notes related to this code, as keyed by note number above:

There are many specific packages needed for this SPARQL application, as discussed in the main text. The major point to make here is that each of these packages needs to be loaded into the remote virtual environment, per the discussion in CWPK #58
Like other cowpoke modules, these are pretty standard calls to the needed knowledge graphs and configuration settings
These are the standard Flask calls, as discussed in the prior installment
The main routine for the application is located here. We could have chosen to break this routine into multiple files and templates, but since this application is rather straightforward, we have placed all functionality into this one function block
These are the calls that bring the assignments from the actual Web page (template) into the application
We set up a standard try . . . exception block, which allows an error, if it occurs, to exit gracefully with a possible error explanation
We set up all execution options as a two-part condition. One part is whether the source is the internal KBpedia knowledge graph (which may use either the standard rdflib or owlready2 methods) or is external (which uses the sparqlwrapper method). The second part is which of eight format options might be used for the output, though not all are available to the source options; see further Note 11. Also, most of the routines have some minor code to display results line-by-line
Here is where the graph query function differs by whether RDFLib or owlready2 is used
As of the time of release of this installment, I am still getting errors in this HTML output routine. I welcome any suggestions for working code here
As of the time of release of this installment, I am still getting errors in this JSON output routine. I have tried the standard SPARQLwrapper code, SPARQLwrapper2, and adding the JSON format to the initial sparql, all to no avail. It appears there may be some character or encoding issue in moving the query on the Web form to the function. The error also appears to occur in the line indicated. I welcome any suggestions for working code here
This is where any of the two-part combos discussed in Note #7 that do not work get captured
This if . . . else enables the HTML output option.

9. Set up an external URI to the localhost instance With this working code instance now in place, it was time to expose the service through a standard external URI. (During development we used http://xx.xxx.xxx.xxx/sparql). The URL we chose for the service is http://sparql.kbpedia.org/.

We first needed to set up a subdomain pointing to the service via our DNS provider. While we generally provide SSL support for all of our Web sites (the secure protocol behind the https: Web prefix), we decided the minor use of this SPARQL site did not warrant keeping the certificates enabled and current. So, this site is configured for http: alone.

We first configured our Flask sites as described in CWPK #58. To get this site working under the new URL, I only needed to make two changes to the earlier configuration. This configuration file is 000-default.conf and is found on my server at the /etc/apache2/sites-enabled directory. Here at the two changes, called out by note:

<VirtualHost *:80>
  ServerName sparql.kbpedia.org                        #Note 1
  ServerAdmin mike@mkbergman.com
  DocumentRoot /var/www/html

  WSGIDaemonProcess sparql python-path=/usr/bin/python-projects/miniconda3/envs/sparql/lib/python3.8/site-packages
  WSGIScriptAlias / /var/www/html/sparql/wsgi.py       #Note 2
  <Directory /var/www/html/sparql>
     WSGIProcessGroup sparql
     WSGIApplicationGroup %{GLOBAL}
     Order deny,allow
       Allow from all
  </Directory>

  ErrorLog ${APACHE_LOG_DIR}/error.log
  CustomLog ${APACHE_LOG_DIR}/access.log combined
</VirtualHost>

The first change was to add the new domain sparql.kbpedia.org under ServerName. The second change was to replace the /sparql alias to / under the WSGIScriptAlias directive.

10. Set up an automatic start/re-start cron job.

The last step under our endpoint process is to schedule a cron job on the remote server to start up the sparql virtual environment in the case of an unintended shut down or breaking of the Web site. This last task means we can let the endpoint run virtually unattended. First, let’s look at how simple a re-activation (xxx) script may look:

#!/bin/sh

conda activate sparql

Note the standard bash script header on this file. Also note our standard activation statement. One can create this file and then place it in a logical, findable location. In our instance, we will put it where the same sparql scripts exist, namely in /var/www/html/sparql/.

We next need to make sure this script is readable by our cron job. So we navigate to the the directory where this bash script is located and change its permissions:

 chmod +x re_activate.sh

Once these items are set, we are now able to add this bash script to our scheduled cron jobs. We find this specification and invoke our editor of that file by using:

 nano /etc/crontab

Using the nano editor conventions (or those of your favored editor), we can now add our new cron job in a new entry between the asterisk (*) specifications:

 30 * * * * /bin/sh /var/www/html/sparql/re_activate.sh

We have now completed all of our desired development steps for the KBpedia SPARQL endpoint. As of the release of today’s installment, the site is active.

Endpoint Packaging

I will package up this code as a separate project and repository on GitHub per the steps outlined in CWPK #46 under the MIT license, same as cowpoke. Since there are only a few files, we did not create a formal pip package. Here will be the package address:

https://github.com/Cognonto/cowpoke-endpoint

Linked Data and Why Not Employed

My original plan was to have this SPARQL site offer linked data. Linked data is where the requesting user agent may be served either semantic data such as RDF in various formats or standard HTML if the requester is a browser. It is a useful approach for the semantic Web and has a series of requirements to qualify as ‘5-star‘ linked open data.

From a technical standpoint, the nature of the requesting user agent is determined by the local Web server (Apache2 in our case), which then routes the request to produce either structured data or semi-structured HTML for displaying in a Web page through a process known as content negotiation (the term is sometimes shortened to ‘conneg’). In this manner, our item of interest can be denoted with a single URI, but the content version that gets served to the requesting agent may differ based on the nature of the agent or its request. In a data-oriented setting, for example, the requested data may be served up in a choice of formats to make it easier to consume on the receiving end.

As I noted in my initial investigations regarding Python (CWPK #58), there are not many options compared to other languages such as Java or JavaScript. One of the reasons I initially targeted pyLDAPI was that it promised to provide linked data. (RDFLib-web used to provide an option, but it is no longer maintained and does not work on Python 3.) Unfortunately, I could find no working instances of the pyLDAPI code and, when inspecting the code base itself, I was concerned about the duplicated number of Flask templates required by this approach. Given the number and diversity of classes and properties in KBpedia, my initial review suggested pyLDAPI was not a tenable approach, even if I could figure out how to get the code working.

Given the current state of development, my suggestion is to go with an established triple store with linked data support if one wants to provide linked data. It does not appear that Python has a sufficiently mature option available to make linked data available at acceptable effort.

Lessons and Possible Enhancements

The last section summarized the relative immature state of Python for SPARQL and linked data purposes. In order to get the limited SPARQL functionality working in this CWPK series I have kept my efforts limited to the SPARQL ‘SELECT’ statement and have noted many gotchas and workarounds in the dicussions over this and the prior two installments. Here are some additional lessons not already documented:

Flask apparently does not like ‘return None’
Our minimal conda installation can cause problems with ‘standard’ Python packages dropped from the miniconda3 distro. One of these is json, which I ultimately needed to obtain from conda install -c jmcmurray json.

Clearly, some corners were cut above and some aspects ignored. If one wanted to fully commercialize a Python endpoint for SPARQL based on the efforts in this and the previous CWPK installments, here are some useful additions:

Add the full suite of SPARQL commands to the endpoint (e.g., CONSTRUCT, ASK, DESCRIBE, plus other nuances)
Expand the number of output formats
Add further error trapping and feedback for poorly-formed queries, and
Make it easier to add linked data content negotiation.

Of course, these enhancements do not include more visual or user-interface assists for creating SPARQL queries in the first place. These are useful efforts in their own right.

End of Part V

This installment marks the end of our Part V: Mapping, Stats, and Other Tools. We begin Part VI next week governing natural language applications and machine learning involving KBpedia. We are also now 80% of the way through our entire CWPK series.

Additional Documentation

Here are related documents, some which extend the uses discussed herein:

Flask Resources

RDFLib

SPARQLwrapper

Other

Using cURL for SPARQL
RDFLib JSON-LD
Flask-RDF, a Flask decorator to output RDF using content negotiation
Cautions about SPARQL endpoints.

NOTE: This CWPK installment is available both as an online interactive file

Posted:October 28, 2020

CWPK #59: Adding a SPARQL Endpoint – Part I

What Should be Simple Proves Frustratingly Complex

Sometimes the installments in this Cooking with Python and KBpedia series come together fairly quickly, sometimes not. This installment has proven to be particularly difficult. Research has spread over days, and progress has been frustratingly slow. As a result, I spread the content of developing a remote SPARQL service across two parts.

You will recall we first introduced SPARQL in CWPK #25 in conjunction with the RDFLib package. We showed the flexibility and robustness of this query language to retrieve and filter any and all structural aspects of a knowledge graph. Then, in installment CWPK #50 we expanded on this basis to describe how SPARQL can be an essential component for querying and retrieving data from external sources, principally Wikidata and DBpedia.

Most all public SPARQL endpoints that presently exist (see this representative list, which is disappointingly small) are based on triple stores that come bundled with SPARQL endpoints. A few are also based on endpoint wrappers based on Java such as RDF4j or Jena and a few languages such as C (Redland) or JavaScript. These options obviously do not meet our Python objectives.

As we saw in CWPK #25, RDFLib provides SPARQL query support and also has the related SPARQLwrapper package that enables one to pose queries to external SPARQL endpoints. (easysparql provides similar functionality.) However, the objective we have to turn a local or remote instance into a SPARQL-enabled endpoint accessible to outside parties is not so easily supported. A number of years back there were the well-regarded rdflib-web apps that ran within Flask; unfortunately, this code is out of date and does not run on Python 3. There was also the adhs package that saw limited development and has not been updated in five years. In my initial diligence for this series I also found the pyLDAPI package that looked promising. However, I have not been able to find a working version of this system, and the I find the approach it takes to content negotiation for linked data to be cumbersome and tedious (see next installment).

So, based on the fragments indicated and found from these researches, I decided to tackle setting up a SPARQL endpoint largely on my own. Having established a toe-hold in our remote Linux server in the last installation, I decided to proceed by baby steps reflecting what I had already learned with our local instance to expose an endpoint on our remote server.

Step-wise Approach

We begin our process by setting up our environment, loading needed packages and KBpedia, testing them, and then proceeding to write some code to enable SPARQL queries and then to manage the application. Not knowing if all of these steps will work, I decide to approach these questions in a step-by-step manner.

1. Create a ‘sparql’ conda and Flask address

Note: I have always found the Linux vi editor to be difficult and hard to navigate, since I only use it on occasion. I now use nano as my editor replacement, since it presents key commands at the bottom of the screen useful to my occasional use, and is also part of the standard distro.

We follow the same steps that we worked out in CWPK #58 for setting up a conda virtual environment, that we will name ‘sparql’:

conda create -n sparql python=3

We get the echo to screen as the basic conda environment is created. Remember, this environment is found in the /usr/bin/python-projects/miniconda3/envs/sparql directory location. We then activate the environment:

conda activate sparql

We install some basic packages and then create our new sparql directory and the two standard stub files there:

conda install flask
conda install pip

then the two files, beginning with test_sparql.py:

from flask import Flask
app = Flask(__name__)
@app.route("/")
def hello():
    return "Hello SPARQL!"

and then wsgi.py:

import sys
sys.path.insert(0, "/var/www/html/sparql/")
from test_sparql import app as application

We then proceed to set up the Apache2 configurations, placed directly below our prior similar specification in the /etc/apache2/sites-enabled directory in the 000-default.conf file:

        WSGIDaemonProcess sparql python-path=/usr/bin/python-projects/miniconda3/envs/sparql/lib/python3.8/site-packages
        WSGIScriptAlias /sparql /var/www/html/sparql/wsgi.py
        <Directory /var/www/html/sparql>
            WSGIProcessGroup sparql
            WSGIApplicationGroup %{GLOBAL}
            Order deny,allow
            Allow from all
        </Directory>

then you can check whether the configuration is OK and re-start the server. Then, when we enter:

http://54.227.249.140/sparql

We see that the right message appears and our configuration is OK.

2. Install all needed Python packages

If you recall from the last installment, we used the minimal miniconda3 package installer for our remote Linux (Ubuntu) instance. This minimal footprint largely only installs conda and Python. That means we must install all of the needed additional packages for our current application.

We noted the pip installer before, but we are best off using one of the conda-related channels since they better check configuration dependencies. To expand our package availability from what is standard in the conda channel, we may need to add some additional channels to our base package. One of the most useful of these is conda-forge. To install it:

conda config --add channels conda-forge

It is best to install packages in bulk, since dependencies are checked at install time. One does this by listing the packages in the same command line. When doing so, you may encounter messages that one or more of the packages was not found. In these cases, you should go to the search box at https://anaconda.com, search for the package, and then note the channel in which the package is found. If that channel is not already part of your configuration, add it.

Many of the needed packages for our SPARQL implementation are found under the conda-forge channel. Here is how a bulk install may look:

conda install networkx owlready2 rdflib sparqlwrapper pandas --channel conda-forge

We also then need to install cowpoke using pip by using this command while in the sparql virtual environment:

pip install cowpoke

Every time we invoke the sparql virtual environment these packages will be available, which you can inspect using:

conda list

Also, if you want to share with others the package configuration of your conda environments, you may create the standard configuration file using this command:

conda env export > environment.yaml

The file will be written to the directory in which you invoke this command.

3. Install KBpedia ‘sandbox’ KGs

Clearly, besides the Python code, we also need the various knowledge graphs used by KBpedia. These graphs are the same *.owl (rdf/xml) files that we first discussed in CWPK #18 . We will use the same ‘sandbox’ files from that installment.

Our first need is to decide where we want to store our KBpedia knowledge graphs. For the same reasons noted above, we choose to create the directory structure of /var/data/kbpedia. Once we create these directories, we need to set up the ownership and access properties for the files we will place there. So, we navigate to the parent directory data of our target kbpedia directory and issue two statements to set the ownership and access rights to this location:

sudo chown -R user-owner:user-group kbpedia
sudo chmod -R 775 kbpedia

The -R switch means that our settings get applied recursively to all files and directories in the target directory. The permissions level (775) means that user owners or groups may write to these files (general users may not).

These permission changes now allow us to transfer our local ‘sandbox’ files to this new directory. The two files that we need to transfer using our SSH or file transfer clients are:

kbpedia_reference_concepts.owl
kko.owl

Recall these are the RDF/XML conversions of the original *.n3 files. We now have the data available on the remote instance for our SPARQL purposes.

4. Verify access and use of KBpedia and owlready2

OK, so to see that some of this is working, I pick up on the file viewing code in CWPK #18 to see if we can load and view this stuff. I enter this code into a temp.py file and run python (python temp.py) under the /var/www/html/sparql/ directory:

main = '/var/data/kbpedia/kko.owl'  

with open(main) as fobj:                           
    for line in fobj:
        print (line)

Good; we see the kko.owl file scroll by.

So, the next test is to see if owlready2 is loaded properly and we can inspect the KBpedia knowledge graph.

Picking up from some of the first tests in CWPK #20, I create a script file locally and enter these instructions (note where the kko.owl file is now located):

main = '/var/data/kbpedia/kko.owl'
skos_file = 'http://www.w3.org/2004/02/skos/core'

from owlready2 import *
kko = get_ontology(main).load()

skos = get_ontology(skos_file).load()
kko.imported_ontologies.append(skos) 

list(kko.classes())

When in the sparql directory under /var/www/html/sparql, I call up Python (remember to have the sparql virtual environment active!), which gives me this command line feedback:

(sparql) root@ip-xxx-xx-x-xx:/var/www/html/sparql# python
Python 3.8.5 (default, Sep  4 2020, 07:30:14)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

and I paste the code block above at the cursor (>>>). I then hit Enter at the end of the code block, and we then see our kko classes get listed out.

Good, it appears we have the proper packages and directory locations. We can Ctrl-d (since we are on Linux) to exit the Python interactive session.

5. Create a ‘remote_access.py’ to verify a SPARQL query against the local version of the remote instance

So far, so good. We are now ready to test support for SPARQL. We again look to one of our prior installments, CWPK #25, to test whether SPARQL is working for us with all of the constituent KBpedia knowledge graphs. As we did with the prior question, we formulate a code block and invoke it interactively on the remote server with our python command. Here is the code (note that we have switched the definition of main to the full KBpedia reference concepts graph):

main = '/var/data/kbpedia/kbpedia_reference_concepts.owl'
skos_file = 'http://www.w3.org/2004/02/skos/core' 
kko_file = '/var/data/kbpedia/kko.owl'

from owlready2 import *
world = World()
kb = world.get_ontology(main).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')

skos = world.get_ontology(skos_file).load()
kb.imported_ontologies.append(skos)

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)

import rdflib

graph = world.as_rdflib_graph()

form_1 = list(graph.query_owlready("""
  PREFIX rc: <http://kbpedia.org/kko/rc/>
  PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
  SELECT DISTINCT ?x ?label
  WHERE
  {
    ?x rdfs:subClassOf rc:Mammal.
    ?x skos:prefLabel  ?label. 
  }
"""))

print(form_1)

Fantastic! This works, too, even to the level of giving us the owlready2 circular reference warnings we received when we first invoked CWPK #25!

Now, let’s also test if we can query using SPARQL to another remote endpoint from our remote instance using again more code from the CWPK #25 installment and also after importing the sparqlwrapper package:

main = '/var/data/kbpedia/kbpedia_reference_concepts.owl'
skos_file = 'http://www.w3.org/2004/02/skos/core' 
kko_file = '/var/data/kbpedia/kko.owl'

from owlready2 import *
world = World()
kb = world.get_ontology(main).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')

skos = world.get_ontology(skos_file).load()
kb.imported_ontologies.append(skos)

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)

from SPARQLWrapper import SPARQLWrapper, JSON
from rdflib import Graph

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

sparql.setQuery("""
  PREFIX schema: <http://schema.org/>
  SELECT ?item ?itemLabel ?wikilink ?itemDescription ?subClass ?subClassLabel WHERE {
  VALUES ?item { wd:Q25297630
  wd:Q537127
  wd:Q16831714
  wd:Q24398318
  wd:Q11755880
  wd:Q681337
}
  ?item wdt:P910 ?subClass.

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
print(results)

Most excellent! We have also confirmed we can use our remote server for remote endpoint queries.

6. Create a Flask-based SPARQL input form for the local version This progress is rewarding, but the task now becomes substantially harder. We need to set up interfaces that will allow these queries to be run from external sources to our remote instance. There are two ways we can tackle this requirement.

The first way, the subject of this particular question, is to set up a Web page form that any outside user may access from the Web to issue a SPARQL query via an editable input form. The second way, the subject of question #9, is to enable a remote query issued via sparqlwrapper and Python that goes directly to the endpoint and bypasses the need for a form.

Since we already have installed Flask and validated it in the last installment, our task under this present question is to set up the Web form (in the form of a template as used by Flask) in which we enter our SPARQL queries. Flask maps Web (HTTP) requests to Python functions, which we showed in the last installment where the /sparql URI fragment maps to the /var/www/html/sparql path and its test_sparql.py function. Flask runs this code and then displays results to the browser using HTTP protocols, with the GET method being the most common, but all HTTP methods may be supported. The Python code invoked may call up templates (based on Jinja) that can then invoke HTML pages forms and various response functions.

I noted earlier two SPARQL-related efforts, pyLDAPI and adhs. While neither appears to have a working example, both contain aspects that can inform this task and subsequent ones. A (non-working) implementation of pyLDAPI called GNAF, in particular, has a SPARQL Web page that looked to be useful as a starting template.

If you recall, Flask uses HTML-based templates as its ‘view’-related approach to the model-view-controller (MVC) design. Besides embedding standard HTML, these templates may also contain set Flask statements that relate the Web page to various model or controller commands. These templates should be placed into a set directory under the Flask directory structure. The templates can be nested within one another, useful, for example, when one wants a header and footer repeated across multiple pages, but for our instance I chose a single-page template.

In essence, I took the two main text areas from the starting GNAF template and embedded them in duplicate presentations of the header and footer from the KBpedia current Web page design. (You should know that the server hosting the subject SPARQL page is different from the physical server hosting the standard KBpedia Web site.) I took this approach because I was considering making a SPARQL query form a standard part of the main KBpedia site, which I implement at the conclusion of the next installment. Here is how the resulting Web page form looks:

Figure 1: KBpedia SPARQL Form

Though located on a remote server different than the standard KBpedia Web site, we have designed the KBpedia SPARQL form to mimic the look of that standard site (1) with the same menu options, and both interact seamlessly. Sample SPARQL queries are provided both for the internal KBpedia knowledge graph and for external sites (2), including links (2) to additional query examples. These queries, whether samples or ones of your own crafting, can be pasted into the query entry box (3). Once pasted, you have the option to enter an external SPARQL query URL (4), pick whether your query should be directed internally to KBpedia or externally (4) (if the query is external), and to select amongst about 8 output formats (4), including standard RDF/XML, JSON, CSV, HTML, etc. Then, when you submit the query (4), the results appear in the final text box (5). If the results are helpful, you may copy them and paste them into a local file.

You can inspect this resulting SPARQL Web page at the following address (View Page Source to see the HTML):

http://sparql.kbpedia.org/

You will note that besides logo and menu items similar to the standard KBpedia site, that this form has two text areas, one for entering the SPARQL query and one for viewing subsequent results. There are also some switches regarding input and output forms. It is these switches and the two text areas that relate most directly to the next question.

Tying this form to (which, of course was actually developed in conjunction with) its accompanying code was the most difficult coding effort I have undertaken with this CWPK series to date. I cover this coding development, along with the remaining questions and related topics, in our next installment.

NOTE: This CWPK installment is available both as an online interactive file

Main Links

Search

Clean Corpora and Datasets are a Major Part of the Effort

Plan for Completion of Part VI

PyTorch Architecture

Possible Extensions

Corpora and Datasets

Pre-trained Resources

Setting Up the Environment

Getting Wikipedia Pages

Clean All Input Text

Initial Results

Modifying WikiCorpus

Remove Stoplist

Phrase Identification and Extraction

Additional Documentation

PyTorch and pandas

PyTorch Resources and Tutorials

spaCy and gensim

Knowledge Graphs Deserve Attention in Their Own Right

Initial Setup

Network Metrics and Operations

Subgraphs

DeepGraphs

Full Network Exchange

KBpedia Structure

KBpedia Annotations

Untested Potentials

Additional Documentation

A Wealth of Applications Sets the Stage for Pay Offs from KBpedia

Background

Leading Python Data Science Packages

Natural Language Processing

‘Standard’ Machine Learning

Deep Learning and Abstraction Frameworks

Knowledge Graph Representation Learning

A Generalized Python Data Science Architecture

Four Additional Key Packages

spaCy

scikit-learn

DGL-KE

PyTorch Geometric

Possible Future Extensions

Organization of This Part’s Installments

Additional Documentation

Network Representational Learning

Knowledge Graph Representational Learning

Finally Getting a Remote SPARQL Working Instance

Step-wise Approach (con’t)

Endpoint Packaging

Linked Data and Why Not Employed

Lessons and Possible Enhancements

End of Part V

Additional Documentation

Flask Resources

RDFLib

SPARQLwrapper

Other

What Should be Simple Proves Frustratingly Complex

Step-wise Approach