Knowledge Graphs Deserve Attention in Their Own Right

We first introduced NetworkX in installment CWPK #56 of our Cooking with Python and KBpedia series. The purpose of NetworkX in that installment was to stage data for graph visualizations. In today’s installment, we look at the other side of the NetworkX coin; that is, as a graph analytics capability. We will also discuss NetworkX in relation to staging data for machine learning.

The idea of graphs or networks is at the center of the concept of knowledge graphs. Graphs are unique information artifacts that can be analyzed in their own right as well as being foundations for many unique analytical techniques, including for machine learning and its deep learning subset. Still, graphs as conceptual and mathematical structures are of relatively recent vintage. For example, the field known as graph theory is less than 300 years old. I outlined much of the intellectual history of graphs and their role in analysis in a 2012 article, The Age of the Graph.

Graph or network analysis has three principal aspects. The first aspect is to analyze the nature of the graph itself, with its connections, topologies and paths. The second is to use the structural aspects of the graph representation in order to conduct unique analyses. Some of these analyses relate to community or influence or relatedness. The third aspect is to use various or all aspects of the graph representation of the domain to provide, through dimensionality reduction, tractable and feature-rich methods for analyzing or conducting data science work useful to the domain. We’ll briefly cover the first two aspects in this installment. The remaining installments in this Part VI relate more to the third aspect of graph and deep graph representation.

Initial Setup

We will pick up with our NetworkX work from CWPK #56 to start this installment. (See the concluding sections below if you have not already generated the graph_specs.csv file.)

Since I have been away from the code for a bit, I first decide to make sure my Python packages are up-to-date by running this standard command:

conda update --all

Then, we invoke our standard start-up routine:

from cowpoke.__main__ import *
from cowpoke.config import *
from owlready2 import *

We then want to bring NetworkX into our workspace, along with pandas for data management. The routine we are going to write will read our earlier graph_specs.csv file using pandas. We will use this specification to create a networkx representation of the KBpedia structure, and then begin reporting on some basic graph stats (which will take a few seconds to run):

import networkx as nx
import pandas as pd

df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv')
Graphtype = nx.DiGraph()
G = nx.from_pandas_edgelist(df, edge_attr='weight', create_using=Graphtype)
print('Graph construction complete.')

# Print the number of nodes in the graph
print('Number of Nodes:', len(G.nodes()))

print('Edges:', G.edges('Mammal'))
# Get the subgraph of all nodes around node
sub = [ n[1] for n in G.edges('Mammal') ]
# Actually, need to add the 'Mammal' node too
sub.append('Mammal')
# Now create a new graph, which is the subgraph
sg = nx.Graph(G.subgraph(sub))
# Print the nodes of the new subgraph and edges
print('Subgraph nodes:', sg.nodes())
print('Subgraph edges:', sg.edges())
# Print basic graph info
# Note: nx.info() was removed in NetworkX 3.0; print(G) gives a short summary there
print('Basic graph info:', nx.info(G))

We have picked ‘Mammal’ to generate some subgraphs, and we also call up basic graph info from NetworkX. As a directed graph, KBpedia can be characterized by both ‘in degree’ and ‘out degree’. ‘In degree’ is the number of edges pointing to a given node (or vertex); ‘out degree’ is the number of edges pointing away from it. The average across all nodes in KBpedia exceeds 1.3. The two averages are necessarily the same, because every edge adds one to the out degree of its source node and one to the in degree of its target node. Our only edge type in this structure is subClassOf, which is transitive.

Network Metrics and Operations

So we see that our KBpedia graph has loaded properly, and now we are ready to do some basic network analysis. Most of the analysis deals with the relations structure of the graph. NetworkX has a very clean interface to common measures and metrics of graphs, as our examples below demonstrate.

‘Density’ is the ratio of actual edges in the network to all possible edges in the network, and ranges from 0 to 1. A ‘dense’ graph is one where the number of edges is close to the maximal number of edges; a ‘sparse’ graph is the opposite. For an undirected graph, the maximal number of edges is nodes × (nodes − 1) / 2. This potential is doubled to nodes × (nodes − 1) for a directed graph, since A → B and B → A are both possible. The density is thus the actual number of connections divided by the potential number. KBpedia is quite sparse.

print('Density:', nx.density(G))
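
As a rough check on this definition, density can be recomputed by hand from the node and edge counts. This is only a minimal sketch on a toy graph (the node names are illustrative, not KBpedia data); for a directed graph, the edge count is divided by n × (n − 1).

```python
import networkx as nx

# Toy directed graph standing in for the KBpedia structure (illustrative names only)
toy = nx.DiGraph([('Dog', 'Mammal'), ('Cat', 'Mammal'), ('Mammal', 'Animal')])

n = toy.number_of_nodes()
m = toy.number_of_edges()

# Directed graphs allow n * (n - 1) possible edges; undirected graphs allow half that
manual_density = m / (n * (n - 1))

print('Manual density:', manual_density)
print('NetworkX density:', nx.density(toy))
```

The two printed values should agree, which is a quick way to confirm which convention (directed or undirected) is being applied.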

‘Degree’ is a measure to find the most important nodes in a graph, since a node’s degree is the number of its edges. You can find the degree of an individual node, or rank the nodes to find those with the highest degree.
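
One way these two degree look-ups might be written is sketched below. The toy graph and its node names are assumptions for illustration; with the full KBpedia graph, G would take the place of toy.

```python
import networkx as nx

# Toy directed graph (illustrative names only, not the KBpedia data)
toy = nx.DiGraph([('Dog', 'Mammal'), ('Cat', 'Mammal'),
                  ('Mammal', 'Animal'), ('Bird', 'Animal')])

# Degree of a single node: in degree plus out degree for a DiGraph
node = 'Mammal'
print('Degree of', node, ':', toy.degree(node))
print('In degree:', toy.in_degree(node), 'Out degree:', toy.out_degree(node))

# Rank all nodes by degree and keep the top few
top_nodes = sorted(toy.degree(), key=lambda pair: pair[1], reverse=True)[:3]
print('Highest-degree nodes:', top_nodes)
```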


‘Average clustering’ is the mean of the clustering values across all nodes. A node is clustered if it has a relatively high number of actual links among its neighbors in relation to the potential links among them. A small-world network is one where the distance between random nodes grows in proportion to the natural log of the number of nodes in the graph. High average clustering, combined with short average path lengths, is an indicator of a small-world network.

print('Average clustering:', nx.average_clustering(G))

G_node = 'Mammal'
print('Clustering for node', G_node, ':', nx.clustering(G, G_node))

‘Path length’ is calculated as the number of hops between two end nodes in a network. An ‘average path length’ measures the shortest paths over a graph and then averages them. A small number indicates a shorter, more easily navigated graph on average, but there can be much variance.

print('Average shortest path length:', nx.average_shortest_path_length(G))

The next three measures throw an error, since KBpedia ‘is not strongly connected.’ ‘Eccentricity’ is the maximum shortest-path distance from a node to any other node in the graph, with the ‘diameter’ being the maximum eccentricity across all nodes and the ‘radius’ being the minimum.

print('Eccentricity:', nx.eccentricity(G))
print('Diameter:', nx.diameter(G))
print('Radius:', nx.radius(G))
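
One possible workaround, sketched here on a toy graph and not tested against KBpedia, is to compute these distance measures on an undirected copy of the graph (restricted to its largest connected piece), where every node can reach every other. Note that this changes the question being asked, since edge direction is ignored.

```python
import networkx as nx

# Toy directed hierarchy (illustrative names only); it is not strongly connected
toy = nx.DiGraph([('Dog', 'Mammal'), ('Cat', 'Mammal'), ('Mammal', 'Animal')])
print('Strongly connected?', nx.is_strongly_connected(toy))

# Ignore edge direction so that every node can reach every other node
undirected = toy.to_undirected()

# If the undirected graph were still in pieces, keep only the largest one
largest = max(nx.connected_components(undirected), key=len)
core = undirected.subgraph(largest)

print('Eccentricity:', nx.eccentricity(core))
print('Diameter:', nx.diameter(core))
print('Radius:', nx.radius(core))
```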

The algorithms that follow take longer to calculate or produce long listings. The first such measure we see is ‘centrality’, which in its simplest (degree) form in NetworkX is the fraction of the other nodes to which a given node is connected, with higher connectivity a proxy for importance. Centrality can be measured in many different ways; there are multiple options in NetworkX.

# Calculate different centrality measures
print('Centrality:', nx.degree_centrality(G))
print('Centrality (eigenvector):', nx.eigenvector_centrality(G))
print('In-degree centrality:', nx.in_degree_centrality(G))
print('Out-degree centrality:', nx.out_degree_centrality(G))

Here are some longer analysis routines (unfortunately, betweenness takes hours to calculate):

# Calculate different centrality measures
print('Betweenness:', nx.betweenness_centrality(G))
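
Where hours of run time are not practical, one option is to estimate betweenness from a sample of source nodes. The sketch below uses the k and seed parameters of nx.betweenness_centrality on a random stand-in graph; the graph size and the value of k are arbitrary assumptions, and the approach has not been timed against KBpedia here.

```python
import networkx as nx

# Random directed graph as a stand-in; the sizes here are arbitrary
toy = nx.gnp_random_graph(200, 0.05, seed=42, directed=True)

# Estimate betweenness from k sampled source nodes instead of all nodes
approx = nx.betweenness_centrality(toy, k=50, seed=42)

# Show the five nodes with the highest estimated betweenness
top_five = sorted(approx.items(), key=lambda pair: pair[1], reverse=True)[:5]
print('Approximate betweenness (top 5):', top_five)
```

Smaller values of k run faster but give noisier estimates, so the ranking of the top nodes is usually more trustworthy than the exact scores.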

Because KBpedia is a directed graph, some NetworkX measures are not applicable. Here are some of them:

  • nx.is_connected(G)
  • nx.connected_components(G).
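
NetworkX does provide directed counterparts to these two functions based on ‘weak’ connectivity, which ignores edge direction. A minimal sketch on a toy graph (node names are illustrative only):

```python
import networkx as nx

# Two separate directed fragments (illustrative names only)
toy = nx.DiGraph([('Dog', 'Mammal'), ('Cat', 'Mammal'), ('Oak', 'Plant')])

# Directed counterpart of nx.is_connected
print('Weakly connected?', nx.is_weakly_connected(toy))

# Directed counterpart of nx.connected_components
components = list(nx.weakly_connected_components(toy))
print('Number of weakly connected components:', len(components))
print('Largest component:', max(components, key=len))
```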


We earlier showed code for extracting a subgraph. Here is a generalized version of that function. Replace the ‘Bird’ reference concept with any other valid RC from KBpedia:

# Provide label for current KBpedia reference concept
rc = 'Bird'
# Get the subgraph of all nodes around node
sub = [ n[1] for n in G.edges(rc) ]
# Actually, need to add the 'rc' node too
sub.append(rc)
# Now create a new graph, which is the subgraph
sg = nx.Graph(G.subgraph(sub))
# Print the nodes of the new subgraph and edges
print('Subgraph nodes:', sg.nodes())
print('Subgraph edges:', sg.edges())
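
The same steps could also be wrapped in a small reusable function. This is only a sketch: the function name rc_subgraph and the toy graph are hypothetical and not part of cowpoke.

```python
import networkx as nx

def rc_subgraph(graph, rc):
    """Return an undirected subgraph of rc plus the nodes its out-edges point to."""
    neighbors = [target for _, target in graph.edges(rc)]
    neighbors.append(rc)  # include the reference concept itself
    return nx.Graph(graph.subgraph(neighbors))

# Toy directed graph (illustrative names only)
toy = nx.DiGraph([('Bird', 'Animal'), ('Bird', 'Vertebrate'), ('Dog', 'Mammal')])
sg = rc_subgraph(toy, 'Bird')
print('Subgraph nodes:', sorted(sg.nodes()))
print('Subgraph edge count:', sg.number_of_edges())
```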


There is a notable utility package called DeepGraphs (and its documentation) that appears to offer some nice partitioning and quick visualization options. I have not installed or tested it.

Full Network Exchange

So far, we have seen the use of networks in driving visualizations (CWPK #56) and, per above, as knowledge artifacts with their own unique characteristics and metrics. The next role we need to highlight for networks is as information providers and graph-based representations of structure and features to analytical applications and machine learners.

NetworkX can convert to and from other data formats:

All of these are attractive because PyTorch has direct routines for them.
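
As a sketch of a few such conversions, the example below turns a toy graph into a pandas edge list, a NumPy adjacency matrix, and a nested dictionary. The graph and its weights are made up, and these three formats are shown only as examples of what NetworkX supports.

```python
import networkx as nx

# Toy weighted directed graph (illustrative names and weights)
toy = nx.DiGraph()
toy.add_weighted_edges_from([('Dog', 'Mammal', 2), ('Mammal', 'Animal', 5)])

# Edge list as a pandas DataFrame with 'source', 'target' and attribute columns
edge_df = nx.to_pandas_edgelist(toy)
print(edge_df)

# Dense adjacency matrix as a NumPy array, rows and columns in the given node order
nodes = list(toy.nodes())
adjacency = nx.to_numpy_array(toy, nodelist=nodes, weight='weight')
print(adjacency)

# Nested dictionary form: {node: {neighbor: edge_attributes}}
print(nx.to_dict_of_dicts(toy))
```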

NetworkX can also read and write graphs in multiple formats, some of which include:
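
Two such formats, GraphML and a plain weighted edge list, are sketched below as a write-then-read round trip. The toy graph is illustrative and the files go to a temporary directory rather than to the KBpedia project folders.

```python
import os
import tempfile
import networkx as nx

# Toy weighted directed graph (illustrative names and weights)
toy = nx.DiGraph()
toy.add_edge('Dog', 'Mammal', weight=2)
toy.add_edge('Mammal', 'Animal', weight=5)

with tempfile.TemporaryDirectory() as tmp_dir:
    # GraphML keeps node names, direction and edge attributes
    graphml_path = os.path.join(tmp_dir, 'toy.graphml')
    nx.write_graphml(toy, graphml_path)
    from_graphml = nx.read_graphml(graphml_path)

    # A weighted edge list is a simpler plain-text alternative
    edgelist_path = os.path.join(tmp_dir, 'toy.edgelist')
    nx.write_weighted_edgelist(toy, edgelist_path)
    from_edgelist = nx.read_weighted_edgelist(edgelist_path, create_using=nx.DiGraph)

print('GraphML edges:', list(from_graphml.edges(data=True)))
print('Edge list edges:', list(from_edgelist.edges(data=True)))
```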

There are also standard NetworkX functions to convert node and edge labels to integers (such as networkx.relabel.convert_node_labels_to_integers), relabel nodes (networkx.relabel.relabel_nodes), set node attributes (networkx.classes.function.set_node_attributes), or make deep copies (networkx.Graph.to_directed).
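
A brief sketch of three of these helper functions follows. The toy graph and the ‘SuperType’ values are made up; the attribute name rc_label is likewise an arbitrary choice for illustration.

```python
import networkx as nx

# Toy directed graph (illustrative names only)
toy = nx.DiGraph([('Dog', 'Mammal'), ('Mammal', 'Animal')])

# Replace string labels with integers, keeping the old label as a node attribute
int_graph = nx.convert_node_labels_to_integers(toy, label_attribute='rc_label')
print('Integer nodes:', list(int_graph.nodes(data=True)))

# Attach an attribute to each node ('SuperType' values here are made up)
nx.set_node_attributes(toy, {'Dog': 'Animals', 'Mammal': 'Animals', 'Animal': 'Animals'},
                       name='SuperType')

# Rename selected nodes with an explicit mapping; returns a new graph by default
renamed = nx.relabel_nodes(toy, {'Dog': 'DomesticDog'})
print('Renamed nodes:', list(renamed.nodes(data=True)))
```

Integer node labels matter because many machine learning libraries index nodes by position, so keeping the original label as an attribute allows results to be mapped back to reference concepts.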

There are also certain packages that integrate well with NetworkX and PyTorch and related packages such as direct imports or exports to the Deep Graph Library (DGL) (see CWPK #68 and #69), or built-in converters or the DeepSNAP package may provide a direct bridge between NetworkX and PyTorch Geometric (PyG) (see CWPK #68 and #70).

However, these representations do NOT include the labeled information or annotations. Knowledge graphs, like KBpedia, have some unique aspects that are not fully captured by an existing package like NetworkX.

Fortunately, the previous extract-and-build routines at the heart of this Cooking with Python and KBpedia series are based around CSV files, the same basis as the pandas package. Via pandas we can capture the structure of KBpedia, plus its labels and annotations. Further, as we will see in the next installment, we can also capture full pages for most of these RCs in KBpedia from Wikipedia. This addition will greatly expand our context and feature basis for using KBpedia for machine learning.

For now, I present below two of these three inputs, extracted directly from the KBpedia knowledge graph.

KBpedia Structure

The first of two extraction files useful to all further installments in this Part VI provides the structure of KBpedia. This structure consists of the hierarchical relation between reference concepts using the subClassOf subsumption relation and the assignment of that RC to a typology (SuperType). I first presented this routine in CWPK #56 and it, indeed, captures the requisite structure of the graph:

### KEY CONFIG SETTINGS (see extract_deck in ###             
# 'kb_src'        : 'standard'                                        # Set in master_deck
# 'loop_list'     : kko_order_dict.values(),                          # Note 1   
# 'base'          : 'C:/1-PythonProjects/kbpedia/v300/build_ins/mappings/',              
# 'ext'           : '.csv',                                         
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv',

def graph_extractor(**extract_deck):
    print('Beginning graph structure extraction . . .')
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    class_loop = extract_deck.get('class_loop')
    base = extract_deck.get('base')
    ext = extract_deck.get('ext')
    # Note 2
    parent_set = ['kko.SocialSystems','kko.Products','kko.Methodeutic','kko.Eukaryotes',
                  # . . . remaining typologies are not shown in this listing
                 ]

    if loop != 'class_loop':
        print("Needs to be a 'class_loop'; returning program.")
        return
    header = ['target', 'source', 'weight', 'SuperType']
    out_file = extract_deck.get('out_file')
    cur_list = []
    with open(out_file, mode='w', encoding='utf8', newline='') as output:
        csv_out = csv.writer(output)
        # Write the column names that pandas and NetworkX later rely on
        csv_out.writerow(header)
        for value in loop_list:
            print('   . . . processing', value)
            s_set = []
            root = eval(value)
            s_set = root.descendants()
            frag = value.replace('kko.','')
            for s_item in s_set:
                child_set = list(s_item.subclasses())
                count = len(list(child_set))
                # Note 3
                if value not in parent_set:
                    for child_item in child_set:
                        s_rc = str(s_item)
                        child = str(child_item)
                        new_pair = s_rc + child
                        new_pair = str(new_pair)
                        # Remember the pair so later typologies can skip it
                        cur_list.append(new_pair)
                        s_rc = s_rc.replace('rc.','')
                        child = child.replace('rc.','')
                        row_out = (s_rc,child,count,frag)
                        csv_out.writerow(row_out)
                elif value in parent_set:
                    for child_item in child_set:
                        s_rc = str(s_item)
                        child = str(child_item)
                        new_pair = s_rc + child
                        new_pair = str(new_pair)
                        if new_pair not in cur_list:
                            cur_list.append(new_pair)
                            s_rc = s_rc.replace('rc.','')
                            child = child.replace('rc.','')
                            row_out = (s_rc,child,count,frag)
                            csv_out.writerow(row_out)
                        elif new_pair in cur_list:
                            # Pair already written for a more specific typology
                            continue
        print('Processing is complete . . .')

Note, again, the parent_set ordering of typology processing at the top of this function. This ordering processes the more distal (leaf) typologies first, and then ignores subsequent processing of identical structural relationships. This means that the graph structure is cleaner and all subsumption relations are “pushed down” to their most specific mention.
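
The ‘first mention wins’ idea behind this ordering can be sketched in isolation. The triples below are made up for illustration, and a set is used for the look-up here (the routine itself uses a list); this is a simplified stand-in, not the extraction code.

```python
# Hypothetical (typology, parent, child) triples, listed in processing order
rows = [
    ('Eukaryotes', 'Mammal', 'Dog'),
    ('LivingThings', 'Mammal', 'Dog'),      # same parent-child pair, seen later
    ('LivingThings', 'Animal', 'Mammal'),
]

seen_pairs = set()
kept_rows = []
for typology, parent, child in rows:
    pair = (parent, child)
    if pair in seen_pairs:
        continue  # already assigned to an earlier (more specific) typology
    seen_pairs.add(pair)
    kept_rows.append((parent, child, typology))

print(kept_rows)
```

Because the more specific typologies are processed first, each parent-child pair ends up attributed to the most specific typology in which it appears.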

You can inspect the actual structure file produced using this routine, which is also the general basis for reading into various machine learners:

import pandas as pd

df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/data/graph_specs.csv')
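
A few typical inspection calls are sketched below. To keep the example self-contained it reads a small in-memory stand-in that uses the same header as graph_specs.csv; the rows are made up.

```python
import io
import pandas as pd

# Stand-in for graph_specs.csv with the same header; the rows are made up
sample_csv = io.StringIO(
    'target,source,weight,SuperType\n'
    'Mammal,Dog,3,Eukaryotes\n'
    'Mammal,Cat,3,Eukaryotes\n'
    'Animal,Mammal,1,LivingThings\n'
)
sample_df = pd.read_csv(sample_csv)

print(sample_df.head())                        # first rows
print(sample_df.shape)                         # (rows, columns)
print(sample_df['SuperType'].value_counts())   # number of edges per typology
```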


KBpedia Annotations

We also need to bring in the annotation values. The annotation extraction routine was first presented and described in CWPK #33, and was subsequently generalized and brought into conformance with our configuration routines in CWPK #33. Note, for example, in the header definition, how we are able to handle either classes or properties. In this instance, and in all subsequent machine learning discussion, we concentrate on the labels and annotations for classes:

### KEY CONFIG SETTINGS (see extract_deck in ###                
# 'kb_src'        : 'extract'                                          # Set in master_deck
# 'descent_type'  : 'descent',
# 'loop'          : 'class_loop',
# 'loop_list'     : custom_dict.values(),                              # Single 'Generals' specified 
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/extractions/classes/Generals_annot_out.csv',
# 'render'        : 'r_label',

def annot_extractor(**extract_deck):
    print('Beginning annotation extraction . . .') 
    r_default = ''
    r_label = ''
    r_iri = ''
    render = extract_deck.get('render')
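    if render == 'r_default':
        pass    # render function call is not shown in this listing
    elif render == 'r_label':
        pass    # render function call is not shown in this listing
    elif render == 'r_iri':
        pass    # render function call is not shown in this listing
    else:
        print('You have assigned an incorrect render method--execution stopping.')
        return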
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    out_file = extract_deck.get('out_file')
    class_loop = extract_deck.get('class_loop')
    property_loop = extract_deck.get('property_loop')
    descent_type = extract_deck.get('descent_type')
    """ These are internal counters used in this module's methods """
    p_set = []
    a_ser = []
    x = 1
    cur_list = []
    with open(out_file, mode='w', encoding='utf8', newline='') as output:
        csv_out = csv.writer(output)                                       
        if loop == 'class_loop':
            header = ['id', 'prefLabel', 'subClassOf', 'altLabel',
                      'definition', 'editorialNote', 'isDefinedBy', 'superClassOf']
        else:
            header = ['id', 'prefLabel', 'subPropertyOf', 'domain', 'range',
                      'functional', 'altLabel', 'definition', 'editorialNote']
        csv_out.writerow(header)
        for value in loop_list:                                            
            print('   . . . processing', value)                                           
            root = eval(value) 
            if descent_type == 'descent':
                p_set = root.descendants()
            elif descent_type == 'single':
                a_set = root
                p_set.append(a_set)
            else:
                print('You have assigned an incorrect descent method--execution stopping.')
                return
            for p_item in p_set:
                if p_item not in cur_list:                                 
                    a_pref = p_item.prefLabel
                    a_pref = str(a_pref)[1:-1].strip('"\'')                
                    a_sub = p_item.is_a
                    for a_id, a in enumerate(a_sub):                        
                        a_item = str(a)
                        if a_id > 0:
                            a_item = a_sub + '||' + str(a)
                        a_sub  = a_item
                    if loop == 'property_loop':   
                        a_item = ''
                        a_dom = p_item.domain
                        for a_id, a in enumerate(a_dom):
                            a_item = str(a)
                            if a_id > 0:
                                a_item = a_dom + '||' + str(a)
                            a_dom  = a_item    
                        a_dom = a_item
                        a_rng = p_item.range
                        a_rng = str(a_rng)[1:-1]
                        a_func = ''
                    a_item = ''
                    a_alt = p_item.altLabel
                    for a_id, a in enumerate(a_alt):
                        a_item = str(a)
                        if a_id > 0:
                            a_item = a_alt + '||' + str(a)
                        a_alt  = a_item    
                    a_alt = a_item
                    a_def = p_item.definition
                    a_def = str(a_def)[2:-2]
                    a_note = p_item.editorialNote
                    a_note = str(a_note)[1:-1]
                    if loop == 'class_loop':                                  
                        a_isby = p_item.isDefinedBy
                        a_isby = str(a_isby)[2:-2]
                        a_isby = a_isby + '/'
                        a_item = ''
                        a_super = p_item.superClassOf
                        for a_id, a in enumerate(a_super):
                            a_item = str(a)
                            if a_id > 0:
                                a_item = a_super + '||' + str(a)
                            a_super = a_item    
                        a_super  = a_item
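                    if loop == 'class_loop':
                        row_out = (p_item,a_pref,a_sub,a_alt,a_def,a_note,a_isby,a_super)
                    else:
                        # Column order follows the property header defined above
                        row_out = (p_item,a_pref,a_sub,a_dom,a_rng,a_func,
                                   a_alt,a_def,a_note)
                    csv_out.writerow(row_out)
                    cur_list.append(p_item)
                    x = x + 1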
    print('Total unique IDs written to file:', x)  
    print('The annotation extraction for the', loop, 'is completed.') 

You can inspect this actual file of labels and annotations using this routine:

import pandas as pd

df = pd.read_csv('C:/1-PythonProjects/kbpedia/v300/extractions/classes/Generals_annot_out.csv')
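
Once both files are loaded, the labels and definitions can be attached to the structure with a pandas merge. The sketch below uses small made-up frames; in particular, it assumes that annotation ids carry an ‘rc.’ prefix that the structure file lacks, which should be checked against the actual files.

```python
import pandas as pd

# Made-up stand-ins for the structure and annotation files
structure = pd.DataFrame({
    'target': ['Mammal', 'Mammal'],
    'source': ['Dog', 'Cat'],
    'SuperType': ['Eukaryotes', 'Eukaryotes'],
})
annotations = pd.DataFrame({
    'id': ['rc.Dog', 'rc.Cat', 'rc.Mammal'],
    'prefLabel': ['dog', 'cat', 'mammal'],
    'definition': ['A domesticated canine.', 'A small domesticated feline.',
                   'A warm-blooded vertebrate.'],
})

# Assumption: annotation ids carry an 'rc.' prefix that the structure file lacks
annotations['key'] = annotations['id'].str.replace('rc.', '', regex=False)

# Attach labels and definitions to the child (source) node of every edge
merged = structure.merge(annotations[['key', 'prefLabel', 'definition']],
                         left_on='source', right_on='key', how='left')
print(merged[['source', 'target', 'prefLabel', 'definition']])
```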


We will add Wikipedia pages as a third source for informing our machine learning tests and experiments in our next installment.

Untested Potentials

One area in extended NetworkX capabilities that we do not test here is community structure using the Louvain Community Detection package.

Additional Documentation

Here are additional resources on network analysis and NetworkX:

NOTE: This article is part of the Cooking with Python and KBpedia series.
NOTE: This CWPK installment is available both as an online interactive file and as a direct download to use locally.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

A Wealth of Applications Sets the Stage for Pay Offs from KBpedia

With this installment of the Cooking with Python and KBpedia series we move into Part VI of seven parts, a part with the bulk of the analytical and machine learning (that is, “data science”) discussion, and the last part where significant code is developed and documented. Because of the complexity of these installments, we will also be reducing the number released per week for the next month or so. We also will not be able to post fully operational electronic notebooks to MyBinder since the supporting libraries strain the limits of that service. At the conclusion of this part, which itself has 11 installments, we have four installments to wrap up the series and provide a consistent roadmap to the entire project.

Knowledge graphs are unique information artifacts, and KBpedia is further unique in terms of its consistent and logical construction as well as its incorporation of significant text content via Wikipedia pages. These characteristics provide unique value for KBpedia, but it is also a combination not duplicated anywhere else in the data science ecosystem. One of the objectives, therefore, of this part of our CWPK series is the creation of some baseline knowledge representations useful to data science aims that capture these unique characteristics.

The combination of characteristics in KBpedia (or in any knowledge graph constructed in a similar manner) makes it a powerful resource in three areas of data science and machine learning. First, the nearly universal scope and degree of topic coverage with about 56,000 concepts, logically organized into typologies with a high degree of disjointness, means that accurate ‘slices’ or training sets may be extracted from KBpedia nearly instantaneously. Creating labeled training sets is one of the most time consuming and expensive activities in doing supervised machine learning. We can extract these nearly for free from KBpedia. Further, with its links to tens of millions of entities in its mapped knowledge bases such as Wikidata, literally tens of thousands of conceptual entities in KBpedia can be the retrieval points to nucleate training sets for fine-grained entity recognition.

Second, 80% of KBpedia’s concepts are mapped to Wikipedia articles. While many Wikipedia-based word embedding models exist, the ones in KBpedia are logically categorized and have rough equivalence in terms of scope and prominence, hopefully providing cleaner topic ‘signals’. To probe these assertions, we will create a unique KBpedia-based word embedding corpus that also leverages labels for items of structural importance, such as typology membership. We will use this corpus in many of our tests and as a general focus in our training sets.

And, third, perhaps the most important area, knowledge graphs offer unique structures and challenges for machine learning, especially innovations in geometric, heterogeneous methods for deep learning. The first generation of deep machine learning was designed for grid-patterned data and matrices through approaches such as deep neural networks, convolutional neural networks (CNN), or recurrent neural networks (RNN). The ‘deep’ appellation comes from having multiple calculated, intermediate layers of transformations between the grid inputs and outputs for the model. Graphs, on the other hand, are heterogeneous between nodes and edges. They may be directed (subsumptive) in nature. And, for knowledge graphs, they have much labeling and annotation, including varying degrees of attribute completeness. Language embedding, itself often a product of deep learning, enables the efficient incorporation of text. It is only in the past five years that concerted attention has been devoted to better capturing this feature richness for knowledge graphs.

The eleven installments in this part will look in more depth at networks and graphs, focus on how to create training sets and embeddings for the learners, discuss some natural language packages and uses, and then look in depth at ‘standard’ machine learners and deep learners. We will install the first generation of deep graph learners and then explore some on the cutting edge. We will test many use cases, but will also try to invoke classifiers across this spectrum so that we can draw some general conclusions.

The material below introduces and tees up these topics. We describe leading Python packages for data science, and how we have architected our own approach. We have picked a particular Python machine learning framework, PyTorch, to which we will then tie four different NLP and deep learning libraries. We devote two installments each to these four libraries. The use cases we document across these installments are in addition to the existing ones we have in Clojure posted online.

So, we think we have an interesting suite of benefits to cover in this part, some arising from being based on KBpedia and some arising from the nature of knowledge graphs. On the other hand, due to the relative immaturity of the field, we are still actively learning and innovating around the juncture of AI and knowledge graphs. Thus, one of the reasons we emphasize Python ‘ecosystems’ and ‘frameworks’ in this part is to be better prepared to incorporate those innovations and learnings to come.


One of the first prototypes of machine learning comes from the statistician Ronald Fisher in the 1930s regarding how to classify Iris species based on the attributes of their flowers. It was a multivariate data example using the method we today call linear discriminant analysis. This classic example is still taught. But many dozens of new algorithms and combined approaches have joined the machine learning field since then.

Figure 1 below is one way to characterize the field, with ML standing for machine learning and DL for deep learning, with this one oriented to sub-fields in which some Python package already exists:

Figure 1: Machine Learning Landscape (from S. Chen, “Machine Learning Algorithms For Beginners with Code Examples in Python”, June 2020)

There are many possible diagrams that one might prepare to show the machine learning landscape, including ones with a larger emphasis on text and knowledge graphs. Almost all schematics of the field show a basic split between supervised learning and unsupervised learning (sometimes with reinforcement learning as another main branch), with the main difference being that supervised approaches iterate to achieve statistical fit with pre-determined labels, whereas unsupervised approaches work on unlabeled data. Accurate labeling can be costly and time consuming. Note that ‘classification’ is a supervised notion, while ‘clustering’ is an unsupervised one.
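
The contrast between the two can be made concrete with a deliberately tiny sketch in plain Python. The measurements and labels are made up, and the two methods shown (a nearest-centroid classifier and a two-means clustering loop) are simple stand-ins chosen for brevity, not methods used elsewhere in this series.

```python
# Made-up 1-D measurements (for example, petal lengths) for two groups
points = [1.0, 1.2, 0.9, 4.8, 5.1, 5.0]
labels = ['a', 'a', 'a', 'b', 'b', 'b']   # only the supervised learner sees these

def mean(values):
    return sum(values) / len(values)

# Supervised: fit one centroid per known label, then classify a new point
centroids = {lab: mean([p for p, known in zip(points, labels) if known == lab])
             for lab in set(labels)}
new_point = 4.5
predicted = min(centroids, key=lambda lab: abs(centroids[lab] - new_point))
print('Predicted label:', predicted)

# Unsupervised: two-means clustering finds groups without using any labels
centers = [min(points), max(points)]
for _ in range(10):
    groups = [[], []]
    for p in points:
        nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
        groups[nearest].append(p)
    centers = [mean(g) for g in groups]
print('Cluster centers:', centers)
```

The classifier can name its answer because it was given labels; the clustering loop recovers the same two groups but has no names for them.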

We will include a ‘standard’ machine learning library in our proposed toolkit, the selection of which I discuss below. However, most of the evaluation time I spent in researching these installments was directed to the idea of knowledge representation and embeddings applicable to graphs. Graphs pose a number of differences and challenges to standard machine learning. They have only been a recent (5 yr) focus in machine learning, and that focus is itself rapidly changing.

All machine learners need to operate on their feature spaces in numerical representations. Text is a tricky form because language is difficult and complex, and any representation of its tokens that is usable by a computer must decide what to consider: parts of speech, the word itself, sentence construction, semantic meaning, context, adjacency, entity recognition or characterization? These may all figure into how one might represent text. Machine learning has brought us unsupervised methods for converting words, sentences, documents and, now, graphs to a reduced, numeric representation known as “embeddings.” The embedding method may capture one or more of these textual or structural aspects.

Much of the first interest in machine learning based on graphs was driven by these interests in embeddings for language text. Standard machine classifiers with deep learning using neural networks have given us word2vec, and more recently BERT and its dozens of variants have reinforced the usefulness of deep learning to create pre-trained text representations.

Indeed, embeddings do figure prominently in knowledge graph representation, but only as one among many useful features. Knowledge graphs with hierarchical (subsumption) relationships, as might be found in any taxonomy, become directed. Knowledge graphs are asymmetrical, and often multi-typed and sometimes multi-modal. There is heterogeneity among nodes and links or edges. Not all knowledge graphs are created equal and some of these aspects may not apply. Whether a richness of text description accompanies the nodes or edges is another wrinkle. None of the early CNN or RNN or simple neural net approaches match well with these structures.

The general category that appears to have emerged for this scope is geometric deep learning (GDL), which applies to all forms of graphs and manifolds. There are other nuances in this area, for example whether a static representation is the basis for analysis or one that is dynamic, essentially allowing learning parameters to be changed as the deep learning progresses through its layers. But GDL has the theoretical potential to address and incorporate all of the wrinkles associated with heterogeneous knowledge graphs.

So, this discussion helps define our desired scope. We want to be able to embrace Python packages that range from simple statistics to simple machine learning, throwing in natural language processing and creating embedding representations, that can then range all the way through deep learning to the cutting-edge aspects of geometric or graph deep learning.

Leading Python Data Science Packages

This background provides the necessary context for our investigations of Python packages, frameworks, or libraries that may fulfill the data science objectives of this part. Our new components often build upon and need to play nice with some of the other requisite packages introduced in earlier installments, including pandas (CWPK #55), NetworkX (CWPK #56), and PyViz (CWPK #55). NumPy has been installed, but not discussed.

We want to focus our evaluation of Python options in these areas:

  • Natural Language Processing, including embeddings
  • ‘Standard’ Machine Learning
  • Deep Learning and Abstraction Frameworks, and
  • Knowledge Graph Representation Learning.

The latter area may help us tie these various components together.

Natural Language Processing

It is not fair to say that natural language processing has become a ‘commodity’ in the data science space, but it is also true there is a wealth of capable, complete packages within Python. There are standard NLP requirements like text cleaning, tokenization, parts-of-speech identification, parsing, lemmatization, phrase identification, and so forth. We want these general text processing capabilities since they are often building blocks and sometimes needed in their own right. We also would like to add to this baseline such considerations as interoperability, creating embeddings, or other special functions.

The two leading NLP packages in Python appear to be:

  • NLTK – the natural language toolkit that is proven and has been a leader for twenty years
  • spaCy – a newer, but very impressive package oriented more to tasks, not function calls.

Other leading packages, with varying NLP scope, include:

  • flair – a very simple framework for state-of-the-art NLP that is based on PyTorch and works based on context
  • gensim – a semantic and topic modeling library; not general purpose, but with valuable capabilities
  • OpenNMT-py – an open source library for neural machine translation and neural sequence learning; provided for both the PyTorch and TensorFlow environments
  • Polyglot – a natural language pipeline that supports massive multilingual applications
  • Stanza – a neural network pipeline for text analytics; beyond standard functions, has multi-word token (MWT) expansion, morphological features, and dependency parsing; uses the Java CoreNLP from Stanford
  • TextBlob – a simplified text processor, which is an extension to NLTK.

Another key area is language embedding. Language embeddings are means to translate language into a numerical representation for use in downstream analysis, with great variety in what aspects of language are captured and how to craft them. The simplest and still widely used representation is the tf-idf (term frequency–inverse document frequency) statistical measure. A common variant after that was the vector space model. We also have latent (unsupervised) models such as LDA. A more easily calculated option is explicit semantic analysis (ESA). At the word level, two of the prominent options are word2vec and GloVe, which are used directly in spaCy. These have arisen from deep learning models. We also have similar approaches to represent topics (topicvec), sentences (sentence2vec), categories and paragraphs (Category2Vec), documents (doc2vec), graph nodes (node2vec), or entire languages (BERT and variants and GPT-3 and related methods). In all of these cases, the embedding consists of reducing the dimensionality of the input text, which is then represented in numeric form.
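As a rough illustration of the simplest of these representations, tf-idf can be sketched in a few lines of plain Python. The toy documents, the function name tfidf, and the particular weighting (raw term frequency times log(N/df), with no smoothing) are assumptions made for this sketch; library implementations such as the one in scikit-learn use smoothed variants and will produce different numbers.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (unsmoothed tf-idf)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: the number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# toy corpus, made up for illustration only
docs = [
    "mammals are animals",
    "birds are animals",
    "graphs are networks",
]
vectors = tfidf(docs)
```

A term that appears in every document (here, 'are') gets a weight of zero, while a term unique to one document gets the highest weight in that document, which is the basic intuition behind the measure.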

There are internal methods for creating embeddings in multiple machine learning libraries. Some packages are more dedicated, such as fastText, which is a library for learning of word embeddings and text classification created by Facebook’s AI Research (FAIR) lab. Another option is TextBrewer, which is an open-source knowledge distillation toolkit based on PyTorch and which uses (among others) BERT to provide text classification, reading comprehension, NER or sequence labeling.

Closely related to how we represent text are corpora and datasets that may be used either for reference or training purposes. These need to be assembled and tested as well as software packages. The availability of corpora to different packages is a useful evaluation criterion. But, the picking of specific corpora depends on the ultimate Python packages used and the task at hand. We will return to this topic in CWPK #63.

‘Standard’ Machine Learning

Of course, nearly all of the Python packages mentioned in this Part VI have some relation to machine learning in one form or another. I call out the ‘standard’ machine learning category separately because, like for NLP, I think it makes sense to have a general learning library not devoted to deep learning but providing a repository of classic learning methods.

There really is no general option that compares with scikit-learn. It features various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN data clustering, and is designed to interoperate with NumPy and SciPy. The project is extremely active with good documentation and examples.

We’ll return to scikit-learn below.

Deep Learning and Abstraction Frameworks

Deep learning is characterized by many options, methods and philosophies, all in a fast-changing area of knowledge. New methods need to be compared on numerous grounds from feature and training set selection to testing, parameter tuning, and performance comparisons. These realities have put a premium on libraries and frameworks that wrap methods in repeatable interfaces and provide abstract functions for setting up and managing various deep (and other) learning algorithms.

The space of deep learning thus embraces many individual methods and forms, often expressed through a governing ecosystem of other tools and packages. These demands lead to a confusing and overlapping and non-intersecting space of Python options that are hard to describe and comparatively evaluate. Here are some of the libraries and packages that fit within the deep and machine learning space, including abstraction frameworks:

  • Chainer is an open source deep learning framework written purely in Python on top of NumPy and CuPy Python libraries
  • Microsoft Cognitive Toolkit (CNTK) is an open-source toolkit for commercial-grade distributed deep learning; however, it has seen its last main release in favor of the interoperable approach, ONNX (see below)
  • Keras is an open-source library that provides a Python interface for artificial neural networks. Keras now acts as an interface for the TensorFlow library, though earlier versions could also run on top of Theano; it has a high-level library for working with datasets
  • PlaidML is a portable tensor compiler; it runs as a component under Keras
  • PyTorch is an open source machine learning library based on the Torch library with a very rich ecosystem of interfacing or contributing projects
  • TensorFlow is a well-known open source machine learning library developed by Google
  • Theano is a Python library and optimizing compiler for manipulating and evaluating mathematical expressions, especially matrix-valued ones; it is tightly integrated with NumPy, and uses it at the lowest level.

Keras is increasingly aligning with TensorFlow and some, like Chainer and CNTK, are being deprecated in favor of the two leading gorillas, PyTorch and TensorFlow. One approach to improve interoperability is the Open Neural Network Exchange (ONNX) with the repository available on GitHub. There are existing converters to ONNX for Keras, TensorFlow, PyTorch and scikit-learn.

A key development from deep learning of the past three years has been the usefulness of Transformers, a technique that marries encoders and decoders converging on the same representation. The technique is particularly helpful to sequential data and NLP, with state-of-the-art performance to date for:

  • next-sentence prediction
  • question answering
  • reading comprehension
  • sentiment analysis, and
  • paraphrasing.

Both BERT and GPT are pre-trained products that utilize this method. Both TensorFlow and PyTorch contain Transformer capabilities.

Knowledge Graph Representation Learning

As noted, most of my research for this Part VI has resided in the area of a subset of deep graph learning applicable to knowledge graphs. The leading deep learning libraries do not, in general, provide support for this area of representational learning, sometimes called knowledge representation learning (KRL) or knowledge graph embedding (KGE). Within this rather limited scope, most options also seem oriented to link prediction and knowledge graph completion (KGC), rather than the heterogeneous aspects with text and OWL2 orientation characteristic of KBpedia.

Various capabilities desired or tested for knowledge graph representational learning include:

  • low-dimensional vectors (embeddings) with semantic meaning
  • knowledge graph completion (KGC)
  • triple classification
  • entity recognition
  • entity disambiguation (linking)
  • relation extraction
  • recommendation systems
  • question answering, and
  • common sense reasoning.

Unsupervised graph relational learning is used for:

  • link prediction
  • graph reconstruction
  • visualization, or
  • clustering.

Supervised GRL is used for:

  • node classification (predict node labels), and
  • graph classification.

This kind of learning is a subset of the area known as geometric deep learning, deep graphs, or graph representation (or representational) learning. We thus see this rough hierarchy:

machine learning → deep learning → geometric deep learning → graph (R) learning → KG learning

In terms of specific packages or libraries, there is a wealth of options in this new field:

  • AmpliGraph is a suite of neural machine learning models for relational learning using supervised learning on knowledge graphs
  • DGL-KE is a high performance, easy-to-use, and scalable package for learning large-scale knowledge graph embeddings based on the Deep Graph Library, which is a library for GNN in PyTorch. DGL-KE can run on CPU machines, GPU machines, and clusters with a set of popular models, including TransE, TransR, RESCAL, DistMult, ComplEx, and RotatE
  • Graph Nets is DeepMind’s library for building graph networks in TensorFlow
  • KGCN is the novel knowledge graph convolutional network model that is part of KGLIB; it requires GRAKN
  • OpenKE is an efficient implementation based on PyTorch for knowledge representation learning
  • OWL2Vec* represents each OWL named entity (class, instance or property) by a vector, which then can feed downstream tasks; see GitHub
  • PyKEEN is a Python library for training and evaluating knowledge graph embeddings; see GitHub
  • Pykg2vec is a Python library for knowledge graph embedding and representation learning; see GitHub
  • PyTorch Geometric (PyG) is a geometric deep learning extension library for PyTorch with excellent documentation and an emphasis of providing wrappers to state-of-art models
  • RDF2Vec is an unsupervised technique that builds further on Word2Vec; RDF2Vec Light is a lightweight approach to KG embeddings
  • scikit-kge is a Python library to compute embeddings of knowledge graphs that ties directly into scikit-learn; an umbrella for the RESCAL, HolE, TransE, and ER-MLP algorithms; it has not been updated in five years
  • StellarGraph provides multiple variants of neural networks for both homogeneous and heterogeneous graphs, and relies on the TensorFlow, Keras, NetworkX, and scikit-learn libraries
  • TorchKGE provides knowledge graph embedding in PyTorch; it reportedly is faster than AmpliGraph and OpenKE.

One graph learning framework that caught my eye is KarateClub, an unsupervised machine learning extension library for NetworkX. I like the approach they are taking, but their library cannot yet handle directed graphs. I will be checking periodically on their announced intention to extend this framework to directed graphs in the near future.

Lastly, more broadly, there is the recently announced KGTK, which is a generalized toolkit with broader capabilities for large scale knowledge graph manipulation and analysis. KGTK also puts forward a standard KG file format, among other tools.

A Generalized Python Data Science Architecture

With what we already have in hand, plus the libraries and packages described above, we have a pretty good inventory of candidates to choose from in proceeding with our next installments. Like our investigations around graphics and visualization (see CWPK #55), the broad areas of data science, machine learning, and deep learning have been evolving into comprehensive ecosystems. Figure 2 below presents a representation of the Python components that make sense for the machine learning and application environment. As noted, our local Windows machines lack separate GPUs (graphical processing units), so the hardware is based on a standard CPU (which has an integrated GPU that cannot be separately targeted). We have already introduced and discussed some of the major Python packages and libraries, including pandas, NetworkX, and PyViz. Here is that representative data science architecture:

Representative Python Components

The defining architectural question for this Part VI is what general deep and machine learning framework we want (if any). I think using a framework makes sense over scripting together individual packages, though for some tests that still might be necessary. If I were to adopt a framework, I would also want one that has a broad set of tools in its ecosystem and common and simpler ways to define projects and manage the overall pipelines from data to results. As noted, the two major candidates appear to be TensorFlow and PyTorch.

TensorFlow has been around the longest, has, today, the strongest ecosystem, and reportedly is better for commercial deployments. Google, besides being the sponsor, uses TensorFlow in most of its ML projects and has shown a commitment to compete with the upstart PyTorch by significantly re-designing and enhancing TensorFlow 2.0.

On the other hand, I very much like the more ‘application’ orientation of PyTorch. Innovation has been fast and market share has been rising. The consensus from online reviews is that PyTorch, in comparison to TensorFlow:

  • runs dramatically faster on both CPU and GPU architectures
  • is easier to learn
  • produces faster graphs, and
  • is more amenable to third-party tools.

Though some of the intriguing packages for TensorFlow are apparently not available for PyTorch, including Graph Nets, Keras, PlaidML, and StellarGraph, PyTorch does have these other packages not yet mentioned that look potentially valuable down the road:

  • Captum – a unified and generic model interpretability library
  • Catalyst – a framework for deep learning R&D
  • DGL – the Deep Graph Library needed for DGL-KE discussed below
  • fastai – simplifies training fast and accurate neural nets using modern best practices
  • flair – a simple framework for state-of-the-art NLP that may complement or supplement spaCy
  • PyTorch Geometric – a geometric deep learning extension library, also discussed below
  • PyTorch-NLP – a library of basic utilities for PyTorch NLP that may supplement or replace spaCy or flair, and
  • skorch – a scikit-learn compatible neural network library that wraps PyTorch.

One disappointment is that neither of these two leading packages directly ingests RDFLib graph files, though with PyTorch and DGL you can import or export a NetworkX graph directly. pandas is also a possible data exchange format.

Consideration of all of these points has led us to select PyTorch as the initial data science framework. It is good to know, however, that a fairly comparable alternative also exists with TensorFlow and Keras.

Finally, with respect to Figure 2 above, we have no plans at present to use the Dask package for parallelizing analytic calculations.

Four Additional Key Packages

With the PyTorch decision made, at least for the present, we are now clear to deal with specific additional packages and libraries. I highlight four of these in this section. Each of these four is the focus of two separate installments as we work to complete this Part VI. One of these four is in natural language processing (spaCy), one in general machine learning (scikit-learn), and two in deep learning with an emphasis on graphs (DGL and DGL-KE, and PyG). These choices again tend to reinforce the idea of evaluating whole ecosystems, as opposed to single packages. Note, of course, that more specifics on these four packages will be presented in the forthcoming installments.


spaCy

I find spaCy to be very impressive as our baseline NLP system, with many potentially useful extensions or compatible packages including sense2vec, spacy-stanza, spacy-wordnet, torchtext, and gensim.

The major competitor is NLTK. The reputation of the NLTK package is stellar and it has proven itself for decades. It takes a more disaggregated approach, often favored by scholars and researchers, that enables users to build complex NLP functionality. It is therefore harder to use and configure, and is also less performant. The real differentiator, however, is the more object or application orientation of spaCy.

Though NLTK appears to have good NLP tools for processing data pipelines using text, most of these functions appear to be in spaCy and there are also the flair and PyTorch-NLP packages available in the PyTorch environment if needed. gensim looks to be a very useful enhancement to the environment because of the advanced text evaluation modes it offers, including sentiment. Not all of these will be tested during this CWPK series, but it will be good to have these general capabilities resident in cowpoke.


scikit-learn

We earlier signaled our intent to embrace scikit-learn, principally to provide basic machine learning support. scikit-learn provides a unified API to these basic tasks, including crafting pipelines and meta-functions to integrate the data flow. scikit-learn works on any numeric data stored as NumPy arrays or SciPy sparse matrices. Other types that are convertible to numeric arrays such as pandas DataFrames are also acceptable.

Some of the general ML methods that may be useful and applicable to specific circumstances (there are about 40 supervised ones in the package) include:

  • dimensionality reduction
  • model testing
  • preprocessing
  • scoring methods, and
  • principal component analysis (PCA).

A real test of this package will be ease of creating (and then being able to easily modify) data processing and analysis pipelines. Another test will be ingesting, using, and exporting data formats useful to the KBpedia knowledge graph. We know that scikit-learn doesn’t talk directly to NetworkX, though there may be recipes for the transfer; graphs are represented in scikit-learn as connectivity matrices. pandas can interface via common formats including CSV, Excel, JSON and SQL, or, with some processing, DataFrames. scikit-learn supports data formats from NumPy and SciPy, and it supports a datasets.load_files format that may be suitable for transferring many and longer text fields. One option that is intriguing is how to leverage the CSV flat-file orientation of our KG build and extract routines in cowpoke for data transfer and transformation.
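To make the connectivity-matrix idea concrete, here is a minimal sketch that reads edge rows from CSV text and builds a symmetric 0/1 matrix using only the standard library. The column names source and target, the sample rows, and the function name edges_to_matrix are assumptions for illustration and are not the layout of any cowpoke extract file.

```python
import csv
import io

def edges_to_matrix(csv_text):
    """Build a sorted node list and a symmetric 0/1 connectivity matrix."""
    reader = csv.DictReader(io.StringIO(csv_text))
    edges = [(row["source"], row["target"]) for row in reader]
    nodes = sorted({node for edge in edges for node in edge})
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        i, j = index[source], index[target]
        matrix[i][j] = 1
        matrix[j][i] = 1   # treat each edge as undirected
    return nodes, matrix

# hypothetical edge list, standing in for a CSV extract
sample = "source,target\nMammal,Animal\nBird,Animal\n"
nodes, matrix = edges_to_matrix(sample)
```

The nested list can be handed to numpy.array() to obtain the kind of numeric array that scikit-learn works on; for a directed treatment, the mirrored assignment would simply be dropped.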

I also want to keep an eye on the possible use of skorch to better integrate with the overall PyTorch environment, or to add perhaps needed and missing functionality or ease of development. There is much to explore with these various packages and environments.


DGL and DGL-KE

For our basic, ‘vanilla’, deep graph analysis package we have chosen the eponymous Deep Graph Library for basic graph neural network operations, which may run on CPU or GPU machines or clusters. The better interface relevant to KBpedia is through DGL-KE, a high performance, reportedly easy-to-use, and scalable package for learning large-scale knowledge graph embeddings that extends DGL. DGL-KE also comes configured with the popular models of TransE, TransR, RESCAL, DistMult, ComplEx, and RotatE.
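For readers new to these model names, the core idea of TransE is that a true triple (head, relation, tail) should satisfy head + relation ≈ tail in the embedding space, so a triple can be scored by the negative distance between head + relation and tail. The sketch below shows only that scoring idea in plain Python; it is not DGL-KE's API, and the two-dimensional vectors are hand-picked, hypothetical values rather than learned embeddings.

```python
import math

def transe_score(head, relation, tail):
    """Negative L2 distance between (head + relation) and tail; higher is more plausible."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

# hypothetical 2-dimensional embeddings, chosen by hand for illustration
mammal = [1.0, 0.0]
animal = [1.0, 1.0]
rock = [4.0, 4.0]
subclass_of = [0.0, 1.0]

good = transe_score(mammal, subclass_of, animal)   # head + relation lands exactly on tail
bad = transe_score(mammal, subclass_of, rock)      # head + relation lands far from tail
```

In training, the embeddings are adjusted so that observed triples score higher than corrupted ones; the other listed models differ mainly in how the relation transforms the head before the comparison.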

PyTorch Geometric

PyTorch Geometric (PyG) is closely tied to PyTorch, and most impressively has uniform wrappers to about 40 state-of-art graph neural net methods. The idea of ‘message passing’ in the approach means that heterogeneous features such as structure and text may be combined and made dynamic in their interactions with one another. Many of these intrigued me on paper, and now it will be exciting to test and have the capability to inspect these new methods as they arise. DeepSNAP may provide a direct bridge between NetworkX and PyTorch Geometric.
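The ‘message passing’ idea can be sketched without any deep learning library: in each round, every node collects the feature vectors of its neighbors and combines them with its own. The sketch below uses a plain average as the combining step; real graph neural network layers add learned weights and nonlinearities, and this is not PyG's interface. The node names, features, and the function name message_passing_round are made up for illustration.

```python
def message_passing_round(features, neighbors):
    """One round: each node averages its own vector with its neighbors' vectors."""
    updated = {}
    for node, own in features.items():
        # gather messages from neighbors, plus the node's own current state
        messages = [features[n] for n in neighbors.get(node, [])] + [own]
        updated[node] = [
            sum(message[d] for message in messages) / len(messages)
            for d in range(len(own))
        ]
    return updated

# toy graph a - b - c with hypothetical 2-dimensional features
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 0.0]}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
updated = message_passing_round(features, neighbors)
```

After one round, node c has picked up part of b's signal even though its own features started at zero; repeating the round lets information travel further across the graph.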

Possible Future Extensions

During the research on this Part VI I encountered a few leads that are either not ready for prime time or are off scope to the present CWPK series. A potentially powerful, but experimental approach that makes sense is to use SPARQL as the request-and-retrieval mechanism against the graph to feed the machine learners. RDFFrames provides an imperative Python API that gets internally translated to SPARQL, and it is integrated with the PyData machine learning software stack; see GitHub. Some methods above also use SPARQL. One of the benefits of a SPARQL approach, besides its sheer query and inferencing power, is the ability to keep the knowledge graph intact without data transform pipelines. It can also serve up results in very flexible formats. The relative immaturity of the approach and performance considerations may be difficult challenges to overcome.

I earlier mentioned KarateClub, a Python framework combining about 40 state-of-the-art unsupervised graph mining algorithms in the areas of node embedding, whole-graph embedding, and community detection. It builds on the packages of NetworkX, PyGSP, gensim, NumPy, and SciPy. Unfortunately, the package does not support directed graphs, though plans to do so have been stated. This project is worth monitoring.

A third intriguing area involves the use of quaternions based on Clifford algebras in machine learning codes. Charles Peirce, the intellectual guide for the design of KBpedia, was a mathematician of some renown in his own time, and studied and applauded William Kingdon Clifford and his emerging algebra as a contemporary in the 1870s, shortly before Clifford’s untimely death. Peirce scholars have often pointed to this influence in the development of Peirce’s own algebras. I am personally interested in probing this approach to learn a bit more of Peirce’s manifest interests.

Organization of This Part’s Installments

These selections and the emphasis on our four areas lead to these anticipated CWPK installments over the coming weeks:

  • CWPK #61 – NLP, Machine Learning and Analysis
  • CWPK #62 – Network and Graph Analysis
  • CWPK #63 – Staging Data Sci Resources and Preprocessing
  • CWPK #64 – Embedding, NLP Analysis, and Entity Recognition
  • CWPK #65 – scikit-learn Basics and Initial Analyses
  • CWPK #66 – scikit-learn Classifiers
  • CWPK #67 – Knowledge Graph Embedding Models
  • CWPK #68 – Setting Up and Configuring the Deep Graph Learners
  • CWPK #69 – DGL-KE Classifiers
  • CWPK #70 – State-of-Art PyG 2 Classifiers
  • CWPK #71 – A Comparison of Results.


NOTE: This article is part of the Cooking with Python and KBpedia series.
NOTE: This CWPK installment is available both as an online interactive file and as a direct download to use locally.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Finally Getting a Remote SPARQL Working Instance

Yesterday’s installment of Cooking with Python and KBpedia presented the first part of this two-part series on developing a SPARQL endpoint for KBpedia on a remote server. This concluding part picks up with step #7 in the stepwise approach I took to complete this task.

At the outset I thought it would progress rapidly: After all, is not SPARQL a proven query language with central importance to knowledge graphs? But, possibly because our focus in the series is Python, or perhaps for other reasons, I have found a dearth of examples to follow regarding setting up a Python SPARQL endpoint (there are some resources available related to REST APIs).

The first six steps in yesterday’s installment covered getting our environment set up on the remote Linux server, including installing the Web framework Flask and creating a virtual environment. We also presented the Web page form design and template for our SPARQL query form. This second part covers the steps of tying this template form into actual endpoint code, which proved to be simple in presentation but exceedingly difficult to formulate and debug. Once this working endpoint is in hand, I next cover the steps of giving the site an external URL address, starting and stopping the service on the server, and packaging the code for GitHub distribution. I conclude this two-part series with some lessons learned, some comments on the use and relevance of linked data, and pointers to additional documentation.

Step-wise Approach (con’t)

We pick up our step-wise approach here.

7. Tie SPARQL Form to Local Instance

So far, we have a local instance that works from the command line and an empty SPARQL form. We need to relate these two pieces together. In the last installment, I noted two SPARQL-related efforts, pyLDAPI (and its GNAF example) and adhs. I could not find working examples for either, but I did consult their code frequently while testing various options.

Thus, unlike many areas throughout this CWPK series, I really had no working examples from which to build or modify our current SPARQL endpoint needs. While the related efforts above and other examples could provide single functions or small snippets, possibly as use for guidance or some modification, it was pretty clear I was going to need to build up the code step-by-step, in a similar stepwise manner to what I was following for the entire endpoint. Fortunately, as described in step #6, I did have a starting point for the Web page template using the GNAF example.

From a code standpoint, the first area we need to address is to convert our example start-up stub, what was called in the CWPK #58 installment, to our main application for this endpoint. We choose to call it in keeping with its role. We will build on the earlier stub by adding most of our import and Flask-routing (‘@app.route("/")‘, for example) statements, as well as the initialization code for the endpoint functions. We will call out some specific aspects of this file as we build it.

The second coding area we need to address is how to tie the text areas in our Web form template to the actual Python code. We will put some of that code in the template and some of that code in governing getting and processing the SPARQL query. There is a useful pattern for how to relate templates to Python code via what might be entered into a text area from StackOverflow. Here is the code example that should be put within the governing template, using the important {{ url_for('submit') }}:

<form action="{{ url_for('submit') }}" method="post">
    <textarea name="text"></textarea>
    <input type="submit">
</form>

and here is the matching code that needs to go into the governing Python file:

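from flask import Flask, request, render_template

app = Flask(__name__)

@app.route('/')
def index():
    # serve the page that holds the form
    return render_template('form.html')

@app.route('/submit', methods=['POST'])
def submit():
    # echo back whatever was typed into the textarea named "text"
    return 'You entered: {}'.format(request.form['text'])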

Note that file names, form names and routes all need to be properly identified and matched. Also note that imports need to be complete. Further notice in the file listing below that we modify the return statement. We also repeat this form related to the SPARQL results text area.

One of the challenging needs in the code development was working with a remote instance, as opposed to local code. I was also now dealing with a Linux environment, not my local Windows one. After much trial-and-error, which I’m sure is quite familiar to professional developers working in a client-server framework, I learned some valuable (essential!) lessons:

  1. First, with my miniconda approach and its minimal starting Python basis, I needed to check every new package import required by the code against what was already in the remote instance. The conda list command is important here to check whether the package is already in the Python environment. If not, I would need to find the proper repository for the package and install it per the instructions in CWPK #58

  2. I needed to make sure that the permission (Linux chmod) and ownership (Linux chown) settings were properly set on the target directories for the remote instance such that I could use my SSH-based file transfer program (WinSCP in my case; Filezilla is another leading option). I simply do not do enough Linux work to be comfortable with remote editors. SSH transfer would enable me to work on the developing code in my local Windows instance

  3. I needed to get basic templates working early, since I needed Web page targets for where the outputs or traces of the running code would display

  4. I needed to restart the Apache2 server whenever there was a new code update to be tested. This resulted in a fairly set workflow of edit → upload → re-start Apache → call up remote Web template form → inspect trace or logs → rinse and repeat

  5. Be attentive to and properly set content types, since we are moving data and results from Web forms to code and back again. Content header information can be tricky, and one needs to use cURL or wget (or Postman, which is often referenced, but I did not use). One way to inspect headers and content types is in the output Web page templates, using this code:

     req = request.form
     print(req)

  6. In HTML forms, use the &lt; code for the left angle bracket symbol (used in SPARQL queries to denote a URI link); otherwise the link will not display on the Web page since this character is reserved

  7. Use the standard W3C validator when needing to check encodings and Web addresses

  8. Be extremely attentive to the use of tabs vs. white spaces in your Python code. Get in the habit of using spaces only, and not tabbing for indents. Editors are more forgiving in a Windows development environment; Linux ones are not.
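The angle bracket lesson in the list above can also be handled in code rather than by hand. As a small sketch, Python's standard html module converts the reserved characters in a query string to their entity codes before the text is placed in a page; the query and the example.org URI are placeholders, not part of the KBpedia endpoint.

```python
import html

# placeholder query with a hypothetical URI
query = "SELECT ?s WHERE { ?s a <http://example.org/Thing> }"

# escape reserved characters so the URI shows up on the Web page
safe = html.escape(query)
```

Here safe contains the URI wrapped in the entity codes for the left and right angle brackets, and html.unescape(safe) recovers the original query string.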

The reason I began assembling these lessons arose from the frustrations I had in early code development. Since I was getting small pieces of the functionality running directly in Python from the command line, some of which is shown in the prior two installments, my initial failures to import these routines in a code file (*.py) and get them to work had me pulling my hair out. I simply could not understand why routines that worked directly from the command line did not work once embedded into a code file.

One discovery is that Flask does not play well with the Python list() function. If one inspects prior SPARQL examples in this series (for example, CWPK #25), one can see that this construct is common with the standard query code. One adjustment, therefore, was to remove the list() call, and install a looping function for the query output. This applied to both RDFLib and owlready2.
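The adjustment described here, a loop in place of wrapping the query result in a list, might look roughly like the following. This is only a sketch: rows_to_text is a hypothetical helper name, and fake_rows stands in for the iterable of rows that a SPARQL query against RDFLib or owlready2 would return.

```python
def rows_to_text(rows):
    """Concatenate query result rows into one newline-separated string."""
    results = ''
    for row in rows:
        results = results + str(row) + '\n'
    return results

# stand-in for the rows a SPARQL query would return (hypothetical data)
fake_rows = [('rc.Mammal',), ('rc.Bird',)]
text = rows_to_text(fake_rows)
```

The resulting string can then be returned from a Flask route or passed into a template, which avoids handing a Python list object to the Web layer.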

Besides the lessons presented above, some of the hypotheses I tested to get things to work included the use of CDATA (which only applies to XML), pasting to or saving and retrieving from intermediate text files, changing content-type or mimetype, treatment of the Python multi-line convention ("""), possible use of JavaScript, and more. Probably the major issue I needed to overcome was turning on space and tab display in my local editor to remove their mixed use. This experience really brought home to me the fundamental adherence to indentation in the Python language.
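For anyone facing the same tab and space confusion, a short check can flag the offending lines before the file is uploaded. The sketch below follows the ‘spaces only’ advice and reports every line whose indentation contains a tab; the function name find_tab_indents and the sample source are made up for illustration. Python also ships a standard tabnanny module (python -m tabnanny) aimed at ambiguous indentation.

```python
def find_tab_indents(source):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    flagged = []
    for number, line in enumerate(source.splitlines(), start=1):
        # the indentation is whatever precedes the first non-blank character
        indent = line[:len(line) - len(line.lstrip(' \t'))]
        if '\t' in indent:
            flagged.append(number)
    return flagged

# sample source mixing spaces (line 2) with tabs (lines 3 and 4)
sample = "def f():\n    x = 1\n\t    y = 2\n\treturn x\n"
flagged = find_tab_indents(sample)
```

Running the check on a file's contents (for example, via open(path).read()) gives a list of line numbers to fix before restarting the server.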

Nonetheless, by following these guidelines and with eventual multiple tries, I was finally able to get a basic code block working, as documented under the next step.

8. Create and validate an external SPARQL query using SPARQLWrapper to this endpoint.

Since the approach that worked above got closer to the standard RDFLib approach, I decided to expand the query form to allow for external searches as well. Besides modifications to the Web page template, the use of external sources also invokes the SPARQLWrapper extension to RDFLib. Though its results presentation is a bit different, and we now have a requirement to also input and retrieve the URL of the external SPARQL endpoint, we were able to add this capability fairly easily.
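It may help to see what querying an external endpoint amounts to at the HTTP level. Under the SPARQL 1.1 Protocol, a query can be sent as a GET request with the query text URL-encoded in a parameter named query. The sketch below only builds such a URL with the standard library; it is not SPARQLWrapper's code, and the endpoint address and the function name build_sparql_get_url are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_sparql_get_url(endpoint_url, query):
    """Return a GET URL that carries the query in the 'query' parameter."""
    return endpoint_url + '?' + urlencode({'query': query})

endpoint = 'https://example.org/sparql'            # hypothetical endpoint
query = 'SELECT ?s WHERE { ?s ?p ?o } LIMIT 5'
url = build_sparql_get_url(endpoint, query)

# decoding the URL again recovers the original query text
decoded = parse_qs(urlsplit(url).query)['query'][0]
```

SPARQLWrapper takes care of this encoding, the choice of result format, and the parsing of what comes back, which is why the form only needs to collect the endpoint URL and the query text.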

The resulting code is actually quite simple, though the path to get there was anything but. I present below the eventual code file so developed, with code notes following the listing. You will see that, aside from the Flask code conventions and decorators, our code file is quite similar to others developed throughout cowpoke:

from flask import Flask, Response, request, render_template            # Note 1
from owlready2 import *
import rdflib
from rdflib import Graph
import json
from SPARQLWrapper import SPARQLWrapper, JSON, XML

# load knowledge graph files
main = '/var/data/kbpedia/kbpedia_reference_concepts.owl'              # Note 2
skos_file = '' 
kko_file = '/var/data/kbpedia/kko.owl'

# set up scopes and namespaces
world = World()                                                        # Note 2 
kb = world.get_ontology(main).load()
rc = kb.get_namespace('')
skos = world.get_ontology(skos_file).load()
kko = world.get_ontology(kko_file).load()

graph = world.as_rdflib_graph()

# set up Flask microservice
app = Flask(__name__)                                                  # Note 3

@app.route("/")
def sparql_form():
    return render_template('sparql_page.html')

# set up route for submitting query, receiving results 
@app.route('/submit', methods=['POST'])                                # Note 4
def submit():
#    if request.method == 'POST':
    q_submit = None
    results = ''
    if request.form['q_submit'] is None or len(request.form['q_submit']) < 5:
        return Response(
        'Your request to the SPARQL endpoint must contain a \'query\'.',
        mimetype = 'text/plain'
        )
    else:
        data = request.form['q_submit']                                # Note 5
        source = request.form['selectSource']
        format = request.form['selectFormat']
        q_url = request.values.get('q_url')
        try:                                                           # Note 6
            if source == 'kbpedia' and format == 'owlready':           # Note 7
                q_query = graph.query_owlready(data)                   # Note 8
                for row in q_query:
                    row = str(row)
                    results = results + row
                results = results.replace(']', ']\n')
            elif source == 'kbpedia' and format == 'rdflib':
                q_query = graph.query(data)                            # Note 8
                for row in q_query:
                    row = str(row)
                    results = results + row
                results = results.replace('))', '))\n')
            elif source == 'kbpedia' and format == 'xml':
                q_query = graph.query(data)
                for row in q_query:
                    row = str(row)
                    results = results + row
                results = q_query.serialize(format='xml')
                results = str(results)
                results = results.replace('<result>', '\n<result>')
            elif source == 'kbpedia' and format == 'json':
                q_query = graph.query(data)
                for row in q_query:
                    row = str(row)
                    results = results + row
                results = q_query.serialize(format='json')
                results = str(results)
                results = results.replace('}}, ', '}}, \n')
            elif source == 'kbpedia' and format == 'html':             #Note 9
                q_query = graph.query(data)
                for row in q_query:
                    row = str(row)
                    results = results + row
                results = q_query.serialize(format='csv')
                results = str(results)
                results = results.readlines()
#                table = '<html><table>'
                for row in results:
#                     row = str(row)                    
                     result = row[0]
#                    row = row.replace('\r\n', '')
#                    row = row.replace(',', '</td><td>')
#                    table += '<tr><td>' + row + '</td></tr>' + '\n'
#                table += '</table><br></html>' 
#                results = table
                return result
            elif source == 'kbpedia' and format == 'txt':
                q_query = graph.query(data)
                for row in q_query:
                    row = str(row)
                    results = results + row
                results = q_query.serialize(format='txt')
            elif source == 'kbpedia' and format == 'csv':
                q_query = graph.query(data)
                for row in q_query:
                    row = str(row)
                    results = results + row
                results = q_query.serialize(format='csv')
            elif source == 'external' and format == 'rdfxml':
                q_url = str(q_url)
                results = q_url
            elif source == 'external' and format == 'xml':
                sparql = SPARQLWrapper(q_url)
                data = data.replace('\r', '')
                results = sparql.query()
            elif source == 'external' and format == 'json':            #Note 10
                sparql = SPARQLWrapper(q_url)
                data = data.replace('\r', '')
#                data = data.replace("\n", "\n' + '")
#                data = '"' + data + '"'
                sparql.setReturnFormat(JSON)                           #Note 10
                results = sparql.queryAndConvert()
#                q_sparql = str(sparql)
#                results = q_sparql
            else:                                                      #Note 11
                results = ('This combination of Source + Format is not available. Here are the possible combinations:\n\n' + 
                           '    Kbpedia:   owlready2:    Formats:  owlready2\n' + 
                           '                  rdflib:                 rdflib\n' +
                           '                                             xml\n' +
                           '                                            json\n' +
                           '                                           *html\n' +
                           '                                            text\n' +
                           '                                             csv\n' +
                           '   External:  as entered:                rdf/xml\n' +
                           '                                           *json\n\n' +
                           '            * combo still buggy')
            if format == 'html':
                return Response(results, mimetype='text/html')         # Note 9, 12
            else:
                return Response(results, mimetype='text/plain')
        except Exception as e:                                         # Note 6
            return Response(
            'Error(s) found in query: ' + str(e),
            mimetype = 'text/plain'
            )

if __name__ == "__main__":
    app.run()

Here are some annotation notes related to this code, as keyed by note number above:

  1. There are many specific packages needed for this SPARQL application, as discussed in the main text. The major point to make here is that each of these packages needs to be loaded into the remote virtual environment, per the discussion in CWPK #58

  2. Like other cowpoke modules, these are pretty standard calls to the needed knowledge graphs and configuration settings

  3. These are the standard Flask calls, as discussed in the prior installment

  4. The main routine for the application is located here. We could have chosen to break this routine into multiple files and templates, but since this application is rather straightforward, we have placed all functionality into this one function block

  5. These are the calls that bring the assignments from the actual Web page (template) into the application

  6. We set up a standard try . . . except block, which allows an error, if it occurs, to exit gracefully with a possible error explanation

  7. We set up all execution options as a two-part condition. One part is whether the source is the internal KBpedia knowledge graph (which may use either the standard rdflib or owlready2 methods) or is external (which uses the sparqlwrapper method). The second part is which of eight format options might be used for the output, though not all are available to the source options; see further Note 11. Also, most of the routines have some minor code to display results line-by-line

  8. Here is where the graph query function differs by whether RDFLib or owlready2 is used

  9. As of the time of release of this installment, I am still getting errors in this HTML output routine. I welcome any suggestions for working code here

  10. As of the time of release of this installment, I am still getting errors in this JSON output routine. I have tried the standard SPARQLwrapper code, SPARQLwrapper2, and adding the JSON format to the initial sparql, all to no avail. It appears there may be some character or encoding issue in moving the query on the Web form to the function. The error also appears to occur in the line indicated. I welcome any suggestions for working code here

  11. This is where any of the two-part combos discussed in Note #7 that do not work get captured

  12. This if . . . else enables the HTML output option.
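Regarding the HTML output in Note 9, one possible direction is to build the table from the CSV text using only the standard library. This is a hedged sketch rather than tested endpoint code: it assumes the serialized CSV results are already available as a text string (RDFLib may return bytes from serialize, which would first need something like .decode('utf-8')), and the function name csv_to_html_table and the sample data are hypothetical.

```python
import csv
import html
import io


def csv_to_html_table(csv_text):
    """Convert CSV text (header row first) into a simple HTML table string."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return "<table></table>"
    header, body = rows[0], rows[1:]
    parts = ["<table>"]
    # Escape every cell so characters such as & and < display literally.
    parts.append(
        "<tr>" + "".join(f"<th>{html.escape(cell)}</th>" for cell in header) + "</tr>"
    )
    for row in body:
        parts.append(
            "<tr>" + "".join(f"<td>{html.escape(cell)}</td>" for cell in row) + "</tr>"
        )
    parts.append("</table>")
    return "\n".join(parts)


# Hypothetical CSV text standing in for serialized query results.
sample_csv = "x,label\r\nrc:Bat,Bat & kin\r\n"
table = csv_to_html_table(sample_csv)
print(table)
```

The resulting string could then be handed to a Response with the text/html mimetype. Using the csv module rather than splitting on commas keeps quoted labels that themselves contain commas intact.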

9. Set up an external URI to the localhost instance

With this working code instance now in place, it was time to expose the service through a standard external URI. (During development we used The URL we chose for the service is

We first needed to set up a subdomain pointing to the service via our DNS provider. While we generally provide SSL support for all of our Web sites (the secure protocol behind the https: Web prefix), we decided the minor use of this SPARQL site did not warrant keeping the certificates enabled and current. So, this site is configured for http: alone.

We first configured our Flask sites as described in CWPK #58. To get this site working under the new URL, I only needed to make two changes to the earlier configuration. This configuration file is 000-default.conf and is found on my server in the /etc/apache2/sites-enabled directory. Here are the two changes, called out by note:

<VirtualHost *:80>
ServerName #Note 1
DocumentRoot /var/www/html

WSGIDaemonProcess sparql python-path=/usr/bin/python-projects/miniconda3/envs/sparql/lib/python3.8/site-packages
WSGIScriptAlias / /var/www/html/sparql/ #Note 2
<Directory /var/www/html/sparql>
WSGIProcessGroup sparql
WSGIApplicationGroup %{GLOBAL}
Order deny,allow
Allow from all
</Directory>

ErrorLog ${APACHE_LOG_DIR}/error.log
CustomLog ${APACHE_LOG_DIR}/access.log combined
</VirtualHost>

The first change was to add the new domain under ServerName. The second change was to replace the /sparql alias with / under the WSGIScriptAlias directive.

10. Set up an automatic start/re-start cron job.

The last step under our endpoint process is to schedule a cron job on the remote server to start up the sparql virtual environment in the case of an unintended shut down or breaking of the Web site. This last task means we can let the endpoint run virtually unattended. First, let’s look at how simple a re-activation (xxx) script may look:


conda activate sparql

Note the standard bash script header on this file. Also note our standard activation statement. One can create this file and then place it in a logical, findable location. In our instance, we will put it where the same sparql scripts exist, namely in /var/www/html/sparql/.

We next need to make sure this script is executable by our cron job. So we navigate to the directory where this bash script is located and change its permissions:

 chmod +x 

Once these items are set, we are now able to add this bash script to our scheduled cron jobs. We find this specification and invoke our editor of that file by using:

 nano /etc/crontab 

Using the nano editor conventions (or those of your favored editor), we can now add our new cron job in a new entry between the asterisk (*) specifications:

 30 * * * * /bin/sh /var/www/html/sparql/ 
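To make the schedule concrete: the five leading cron fields are minute, hour, day of month, month, and day of week, so an entry beginning 30 * * * * fires at minute 30 of every hour. The small sketch below only illustrates that timing; the helper name next_run_at_minute and the sample dates are hypothetical and play no part in the cron setup itself.

```python
from datetime import datetime, timedelta


def next_run_at_minute(now, minute=30):
    """Return the next time strictly after `now` whose minute field equals `minute`."""
    candidate = now.replace(minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # This hour's slot has already passed, so move to the next hour.
        candidate += timedelta(hours=1)
    return candidate


# Arbitrary sample times, chosen only to show both cases.
late = next_run_at_minute(datetime(2020, 10, 1, 9, 45))
early = next_run_at_minute(datetime(2020, 10, 1, 9, 10))
print(late)
print(early)
```

In other words, if the environment goes down, the longest wait before the next re-activation attempt under this schedule is one hour.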

We have now completed all of our desired development steps for the KBpedia SPARQL endpoint. As of the release of today’s installment, the site is active.

Endpoint Packaging

I will package up this code as a separate project and repository on GitHub per the steps outlined in CWPK #46 under the MIT license, same as cowpoke. Since there are only a few files, we did not create a formal pip package. Here will be the package address:

Linked Data and Why Not Employed

My original plan was to have this SPARQL site offer linked data. Linked data is where the requesting user agent may be served either semantic data such as RDF in various formats or standard HTML if the requester is a browser. It is a useful approach for the semantic Web and has a series of requirements to qualify as ‘5-star’ linked open data.

From a technical standpoint, the nature of the requesting user agent is determined by the local Web server (Apache2 in our case), which then routes the request to produce either structured data or semi-structured HTML for displaying in a Web page through a process known as content negotiation (the term is sometimes shortened to ‘conneg’). In this manner, our item of interest can be denoted with a single URI, but the content version that gets served to the requesting agent may differ based on the nature of the agent or its request. In a data-oriented setting, for example, the requested data may be served up in a choice of formats to make it easier to consume on the receiving end.
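The selection logic behind content negotiation can be sketched in a few lines of Python. This is an illustration of the idea only, not how Apache2 implements it: it reads the quality values (q) from an HTTP Accept header and picks the client's most preferred media type among those the server can supply. The function names, the AVAILABLE tuple, and the sample headers are all assumptions, and wildcard types such as */* are deliberately ignored to keep the sketch short.

```python
def parse_accept(header):
    """Parse an HTTP Accept header into (media_type, q) pairs, highest q first."""
    entries = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_type = pieces[0].lower()
        if not media_type:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        entries.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(entries, key=lambda item: item[1], reverse=True)


def negotiate(header, available, default="text/html"):
    """Pick the first available media type in the client's preference order."""
    for media_type, q in parse_accept(header):
        if q > 0 and media_type in available:
            return media_type
    return default


# Hypothetical set of representations a linked data server might offer.
AVAILABLE = ("text/html", "application/rdf+xml", "text/turtle")

print(negotiate("text/turtle;q=0.9, application/rdf+xml", AVAILABLE))
print(negotiate("text/html,application/xhtml+xml,*/*;q=0.8", AVAILABLE))
```

The first call stands in for a data-oriented agent and the second for a typical browser; the same URI would yield RDF/XML for one and HTML for the other.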

As I noted in my initial investigations regarding Python (CWPK #58), there are not many options for linked data in Python compared to other languages such as Java or JavaScript. One of the reasons I initially targeted pyLDAPI was that it promised to provide linked data. (RDFLib-web used to provide an option, but it is no longer maintained and does not work on Python 3.) Unfortunately, I could find no working instances of the pyLDAPI code and, when inspecting the code base itself, I was concerned about the number of duplicated Flask templates required by this approach. Given the number and diversity of classes and properties in KBpedia, my initial review suggested pyLDAPI was not a tenable approach, even if I could figure out how to get the code working.

Given the current state of development, my suggestion is to go with an established triple store with linked data support if one wants to provide linked data. It does not appear that Python has a sufficiently mature option available to make linked data available at acceptable effort.

Lessons and Possible Enhancements

The last section summarized the relatively immature state of Python for SPARQL and linked data purposes. In order to get the limited SPARQL functionality working in this CWPK series, I have kept my efforts limited to the SPARQL ‘SELECT’ statement and have noted many gotchas and workarounds in the discussions over this and the prior two installments. Here are some additional lessons not already documented:

  1. Flask apparently does not like ‘return None’
  2. Our minimal conda installation can cause problems with ‘standard’ Python packages dropped from the miniconda3 distro. One of these is json, which I ultimately needed to obtain from conda install -c jmcmurray json.

Clearly, some corners were cut above and some aspects ignored. If one wanted to fully commercialize a Python endpoint for SPARQL based on the efforts in this and the previous CWPK installments, here are some useful additions:

  • Add the full suite of SPARQL commands to the endpoint (e.g., CONSTRUCT, ASK, DESCRIBE, plus other nuances)
  • Expand the number of output formats
  • Add further error trapping and feedback for poorly-formed queries, and
  • Make it easier to add linked data content negotiation.

Of course, these enhancements do not include more visual or user-interface assists for creating SPARQL queries in the first place. These are useful efforts in their own right.
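As one small example of the added error trapping and feedback listed above, an endpoint could detect the query form before running the query and return a clear message for forms it does not yet support. The sketch below is a heuristic based on regular expressions, not a SPARQL parser (it does not handle comments, for instance), and the names query_form and check_query are hypothetical.

```python
import re

# Skips any leading PREFIX/BASE declarations at the start of the query.
_PROLOGUE = re.compile(
    r"^\s*(?:(?:PREFIX\s+\S*\s*<[^>]*>|BASE\s*<[^>]*>)\s*)*", re.IGNORECASE
)
# Matches the query-form keyword that follows the prologue.
_FORM = re.compile(r"(SELECT|ASK|CONSTRUCT|DESCRIBE)\b", re.IGNORECASE)


def query_form(query):
    """Return the query form keyword in upper case, or None if not recognized."""
    rest = query[_PROLOGUE.match(query).end():]
    found = _FORM.match(rest)
    return found.group(1).upper() if found else None


def check_query(query, supported=("SELECT",)):
    """Return an error message for the user, or None if the query may proceed."""
    form = query_form(query)
    if form is None:
        return "No SELECT, ASK, CONSTRUCT or DESCRIBE keyword found."
    if form not in supported:
        return f"{form} queries are not supported by this endpoint."
    return None


# Illustrative queries using a placeholder namespace.
select_query = "PREFIX ex: <http://example.org/>\nSELECT ?x WHERE { ?x a ex:Thing }"
print(check_query(select_query))
print(check_query("ASK { ?s ?p ?o }"))
print(check_query("hello"))
```

Widening the supported tuple as new query forms are implemented would keep the feedback in step with the endpoint's actual capabilities.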

End of Part V

This installment marks the end of our Part V: Mapping, Stats, and Other Tools. We begin Part VI next week, covering natural language applications and machine learning involving KBpedia. We are also now 80% of the way through our entire CWPK series.

Additional Documentation

Here are related documents, some of which extend the uses discussed herein:

Flask Resources




NOTE: This article is part of the Cooking with Python and KBpedia series.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

What Should be Simple Proves Frustratingly Complex

Sometimes the installments in this Cooking with Python and KBpedia series come together fairly quickly, sometimes not. This installment has proven to be particularly difficult. Research has spread over days, and progress has been frustratingly slow. As a result, I spread the content of developing a remote SPARQL service across two parts.

At the outset I thought it would progress rapidly: After all, is not SPARQL a proven query language with central importance to knowledge graphs? But, possibly because our focus in the series is Python, or perhaps for other reasons, I have found a dearth of examples to follow regarding setting up a Python endpoint.

You will recall we first introduced SPARQL in CWPK #25 in conjunction with the RDFLib package. We showed the flexibility and robustness of this query language to retrieve and filter any and all structural aspects of a knowledge graph. Then, in installment CWPK #50 we expanded on this basis to describe how SPARQL can be an essential component for querying and retrieving data from external sources, principally Wikidata and DBpedia.

Almost all public SPARQL endpoints that presently exist (see this representative list, which is disappointingly small) are based on triple stores that come bundled with SPARQL endpoints. A few are instead based on endpoint wrappers written in Java, such as RDF4j or Jena, and a few on other languages such as C (Redland) or JavaScript. These options obviously do not meet our Python objectives.

As we saw in CWPK #25, RDFLib provides SPARQL query support and also has the related SPARQLwrapper package that enables one to pose queries to external SPARQL endpoints. (easysparql provides similar functionality.) However, our objective of turning a local or remote instance into a SPARQL-enabled endpoint accessible to outside parties is not so easily supported. A number of years back there were the well-regarded rdflib-web apps that ran within Flask; unfortunately, this code is out of date and does not run on Python 3. There was also the adhs package that saw limited development and has not been updated in five years. In my initial diligence for this series I also found the pyLDAPI package that looked promising. However, I have not been able to find a working version of this system, and I find the approach it takes to content negotiation for linked data to be cumbersome and tedious (see next installment).

So, based on the fragments indicated and found from these researches, I decided to tackle setting up a SPARQL endpoint largely on my own. Having established a toe-hold in our remote Linux server in the last installment, I decided to proceed by baby steps reflecting what I had already learned with our local instance to expose an endpoint on our remote server.

Step-wise Approach

We begin our process by setting up our environment, loading needed packages and KBpedia, testing them, and then proceeding to write some code to enable SPARQL queries and then to manage the application. Not knowing if all of these steps will work, I decide to approach these questions in a step-by-step manner.

1. Create a ‘sparql’ conda and Flask address

Note: I have always found the Linux vi editor to be difficult and hard to navigate, since I only use it on occasion. I now use nano as my editor replacement, since it presents key commands at the bottom of the screen useful to my occasional use, and is also part of the standard distro.

We follow the same steps that we worked out in CWPK #58 for setting up a conda virtual environment, which we will name ‘sparql’:

conda create -n sparql python=3

We get the echo to screen as the basic conda environment is created. Remember, this environment is found in the /usr/bin/python-projects/miniconda3/envs/sparql directory location. We then activate the environment:

conda activate sparql

We install some basic packages and then create our new sparql directory and the two standard stub files there:

conda install flask
conda install pip

then the two files, beginning with

from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello SPARQL!"

and then

import sys
sys.path.insert(0, "/var/www/html/sparql/")
from test_sparql import app as application

We then proceed to set up the Apache2 configurations, placed directly below our prior similar specification in the /etc/apache2/sites-enabled directory in the 000-default.conf file:

WSGIDaemonProcess sparql python-path=/usr/bin/python-projects/miniconda3/envs/sparql/lib/python3.8/site-packages
WSGIScriptAlias /sparql /var/www/html/sparql/
<Directory /var/www/html/sparql>
WSGIProcessGroup sparql
WSGIApplicationGroup %{GLOBAL}
Order deny,allow
Allow from all
</Directory>

You can then check whether the configuration is OK and re-start the server. Then, when we enter:

We see that the right message appears and our configuration is OK.

2. Install all needed Python packages

If you recall from the last installment, we used the minimal miniconda3 package installer for our remote Linux (Ubuntu) instance. This minimal footprint largely only installs conda and Python. That means we must install all of the needed additional packages for our current application.

We noted the pip installer before, but we are best off using one of the conda-related channels since they better check configuration dependencies. To expand our package availability from what is standard in the conda channel, we may need to add some additional channels to our base package. One of the most useful of these is conda-forge. To install it:

conda config --add channels conda-forge

It is best to install packages in bulk, since dependencies are checked at install time. One does this by listing the packages in the same command line. When doing so, you may encounter messages that one or more of the packages was not found. In these cases, you should go to the search box at, search for the package, and then note the channel in which the package is found. If that channel is not already part of your configuration, add it.

Many of the needed packages for our SPARQL implementation are found under the conda-forge channel. Here is how a bulk install may look:

conda install networkx owlready2 rdflib sparqlwrapper pandas --channel conda-forge

We also then need to install cowpoke using pip by using this command while in the sparql virtual environment:

pip install cowpoke

Every time we invoke the sparql virtual environment these packages will be available, which you can inspect using:

conda list

Also, if you want to share with others the package configuration of your conda environments, you may create the standard configuration file using this command:

conda env export > environment.yaml

The file will be written to the directory in which you invoke this command.

3. Install KBpedia ‘sandbox’ KGs

Clearly, besides the Python code, we also need the various knowledge graphs used by KBpedia. These graphs are the same *.owl (rdf/xml) files that we first discussed in CWPK #18. We will use the same ‘sandbox’ files from that installment.

Our first need is to decide where we want to store our KBpedia knowledge graphs. For the same reasons noted above, we choose to create the directory structure of /var/data/kbpedia. Once we create these directories, we need to set up the ownership and access properties for the files we will place there. So, we navigate to the parent directory data of our target kbpedia directory and issue two statements to set the ownership and access rights to this location:

sudo chown -R user-owner:user-group kbpedia
sudo chmod -R 775 kbpedia

The -R switch means that our settings get applied recursively to all files and directories in the target directory. The permissions level (775) means that user owners or groups may write to these files (general users may not).
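For readers less familiar with octal permission levels, each of the three digits encodes read (4), write (2), and execute (1) rights for the owner, the group, and all other users, in that order. The following sketch simply decodes a level such as 775; the function name describe_mode is hypothetical.

```python
def describe_mode(mode):
    """Decode a permission level such as 0o775 into rwx strings per class of user."""
    decoded = {}
    for position, who in enumerate(("owner", "group", "other")):
        # Each class of user takes three bits: owner highest, other lowest.
        bits = (mode >> (3 * (2 - position))) & 0o7
        decoded[who] = "".join(
            flag if bits & mask else "-"
            for flag, mask in (("r", 4), ("w", 2), ("x", 1))
        )
    return decoded


perms = describe_mode(0o775)
print(perms)
```

The missing w for other users is what prevents general users from writing to the files.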

These permission changes now allow us to transfer our local ‘sandbox’ files to this new directory. The two files that we need to transfer using our SSH or file transfer clients are:


Recall these are the RDF/XML conversions of the original *.n3 files. We now have the data available on the remote instance for our SPARQL purposes.

4. Verify access and use of KBpedia and owlready2

OK, so to see that some of this is working, I pick up on the file viewing code in CWPK #18 to see if we can load and view this stuff. I enter this code into a file and run it with python under the /var/www/html/sparql/ directory:

main = '/var/data/kbpedia/kko.owl'

with open(main) as fobj:
    for line in fobj:
        print(line)

Good; we see the kko.owl file scroll by.

So, the next test is to see if owlready2 is loaded properly and we can inspect the KBpedia knowledge graph.

Picking up from some of the first tests in CWPK #20, I create a script file locally and enter these instructions (note where the kko.owl file is now located):

main = '/var/data/kbpedia/kko.owl'
skos_file = ''

from owlready2 import *
kko = get_ontology(main).load()

skos = get_ontology(skos_file).load()


When in the sparql directory under /var/www/html/sparql, I call up Python (remember to have the sparql virtual environment active!), which gives me this command line feedback:

(sparql) root@ip-xxx-xx-x-xx:/var/www/html/sparql# python
Python 3.8.5 (default, Sep 4 2020, 07:30:14)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

and I paste the code block above at the cursor (>>>). I then hit Enter at the end of the code block, and we then see our kko classes get listed out.

Good, it appears we have the proper packages and directory locations. We can Ctrl-d (since we are on Linux) to exit the Python interactive session.

5. Create a ‘’ to verify a SPARQL query against the local version of the remote instance

So far, so good. We are now ready to test support for SPARQL. We again look to one of our prior installments, CWPK #25, to test whether SPARQL is working for us with all of the constituent KBpedia knowledge graphs. As we did with the prior question, we formulate a code block and invoke it interactively on the remote server with our python command. Here is the code (note that we have switched the definition of main to the full KBpedia reference concepts graph):

main = '/var/data/kbpedia/kbpedia_reference_concepts.owl'
skos_file = ''
kko_file = '/var/data/kbpedia/kko.owl'

from owlready2 import *
world = World()
kb = world.get_ontology(main).load()
rc = kb.get_namespace('')

skos = world.get_ontology(skos_file).load()

kko = world.get_ontology(kko_file).load()

import rdflib

graph = world.as_rdflib_graph()

form_1 = list(graph.query_owlready("""
PREFIX rc: <>
PREFIX skos: <>
?x rdfs:subClassOf rc:Mammal.
?x skos:prefLabel ?label.


Fantastic! This works, too, even to the level of giving us the owlready2 circular reference warnings we received when we first invoked CWPK #25!

Now, let’s also test if we can query using SPARQL to another remote endpoint from our remote instance using again more code from the CWPK #25 installment and also after importing the sparqlwrapper package:

main = '/var/data/kbpedia/kbpedia_reference_concepts.owl'
skos_file = ''
kko_file = '/var/data/kbpedia/kko.owl'

from owlready2 import *
world = World()
kb = world.get_ontology(main).load()
rc = kb.get_namespace('')

skos = world.get_ontology(skos_file).load()

kko = world.get_ontology(kko_file).load()

from SPARQLWrapper import SPARQLWrapper, JSON
from rdflib import Graph

sparql = SPARQLWrapper("")

PREFIX schema: <>
SELECT ?item ?itemLabel ?wikilink ?itemDescription ?subClass ?subClassLabel WHERE {
VALUES ?item { wd:Q25297630
?item wdt:P910 ?subClass.

SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
results = sparql.query().convert()

Most excellent! We have also confirmed we can use our remote server for remote endpoint queries.
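When the return format is set to JSON, the converted results arrive as a Python dictionary shaped according to the W3C SPARQL 1.1 Query Results JSON Format, with the variable names under head and the solutions under results.bindings. The sketch below flattens such a structure into simple rows. The payload is made up for illustration (it is not output from any real endpoint), and the helper name bindings_to_rows is hypothetical.

```python
import json

# A made-up payload in the standard SPARQL JSON results shape;
# the variable names and values are illustrative only.
sample_payload = """
{
  "head": {"vars": ["item", "itemLabel"]},
  "results": {"bindings": [
    {"item": {"type": "uri", "value": "http://example.org/item/1"},
     "itemLabel": {"type": "literal", "xml:lang": "en", "value": "First item"}},
    {"item": {"type": "uri", "value": "http://example.org/item/2"}}
  ]}
}
"""


def bindings_to_rows(results):
    """Flatten SPARQL JSON results into a list of dicts mapping variable to value."""
    variables = results["head"]["vars"]
    rows = []
    for binding in results["results"]["bindings"]:
        # A variable may be unbound in a given solution, hence the .get() calls.
        rows.append({var: binding.get(var, {}).get("value") for var in variables})
    return rows


rows = bindings_to_rows(json.loads(sample_payload))
for row in rows:
    print(row)
```

Unbound variables come through as None, which makes the rows easy to print line by line or to pass on to other formatting code.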

6. Create a Flask-based SPARQL input form for the local version

This progress is rewarding, but the task now becomes substantially harder. We need to set up interfaces that will allow these queries to be run from external sources to our remote instance. There are two ways we can tackle this requirement.

The first way, the subject of this particular question, is to set up a Web page form that any outside user may access from the Web to issue a SPARQL query via an editable input form. The second way, the subject of question #9, is to enable a remote query issued via sparqlwrapper and Python that goes directly to the endpoint and bypasses the need for a form.

Since we already have installed Flask and validated it in the last installment, our task under this present question is to set up the Web form (in the form of a template as used by Flask) in which we enter our SPARQL queries. Flask maps Web (HTTP) requests to Python functions, which we showed in the last installment where the /sparql URI fragment maps to the /var/www/html/sparql path and its function. Flask runs this code and then displays results to the browser using HTTP protocols, with the GET method being the most common, but all HTTP methods may be supported. The Python code invoked may call up templates (based on Jinja) that can then invoke HTML pages, forms, and various response functions.
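The idea of mapping requests to functions can be illustrated with a toy registry written in plain Python. To be clear, this is not Flask and not how Flask is implemented; the TinyRouter class and its methods are hypothetical, meant only to show how a decorator can associate a URL path and HTTP method with a function that produces the response.

```python
class TinyRouter:
    """A toy registry that maps (HTTP method, URL path) pairs to Python functions."""

    def __init__(self):
        self.routes = {}

    def route(self, path, methods=("GET",)):
        def decorator(func):
            # Register the function once for each allowed HTTP method.
            for method in methods:
                self.routes[(method, path)] = func
            return func
        return decorator

    def dispatch(self, method, path):
        handler = self.routes.get((method, path))
        if handler is None:
            # Simplification: any unmatched request is treated as not found.
            return "404 Not Found"
        return handler()


app = TinyRouter()


@app.route("/")
def hello():
    return "Hello SPARQL!"


@app.route("/submit", methods=("POST",))
def submit():
    return "query received"


print(app.dispatch("GET", "/"))
print(app.dispatch("POST", "/submit"))
print(app.dispatch("GET", "/submit"))
```

The decorators in a Flask code file play the same registering role, with Flask and the Web server handling the actual HTTP traffic.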

I noted earlier two SPARQL-related efforts, pyLDAPI and adhs. While neither appears to have a working example, both contain aspects that can inform this task and subsequent ones. A (non-working) implementation of pyLDAPI called GNAF, in particular, has a SPARQL Web page that looked to be useful as a starting template.

If you recall, Flask uses HTML-based templates as its ‘view’-related approach to the model-view-controller (MVC) design. Besides embedding standard HTML, these templates may also contain set Flask statements that relate the Web page to various model or controller commands. These templates should be placed into a set directory under the Flask directory structure. The templates can be nested within one another, useful, for example, when one wants a header and footer repeated across multiple pages, but for our instance I chose a single-page template.

In essence, I took the two main text areas from the starting GNAF template and embedded them in duplicate presentations of the header and footer from the KBpedia current Web page design. (You should know that the server hosting the subject SPARQL page is different from the physical server hosting the standard KBpedia Web site.) I took this approach because I was considering making a SPARQL query form a standard part of the main KBpedia site, which I implement at the conclusion of the next installment. Here is how the resulting Web page form looks:

KBpedia SPARQL Form
Figure 1: KBpedia SPARQL Form

Though located on a remote server different than the standard KBpedia Web site, we have designed the KBpedia SPARQL form to mimic the look of that standard site (1) with the same menu options, and both interact seamlessly. Sample SPARQL queries are provided both for the internal KBpedia knowledge graph and for external sites (2), including links (2) to additional query examples. These queries, whether samples or ones of your own crafting, can be pasted into the query entry box (3). Once pasted, you have the option to enter an external SPARQL endpoint URL (4) (if the query is external), to pick whether your query should be directed internally to KBpedia or externally (4), and to select amongst about 8 output formats (4), including standard RDF/XML, JSON, CSV, HTML, etc. Then, when you submit the query (4), the results appear in the final text box (5). If the results are helpful, you may copy them and paste them into a local file.

You can inspect this resulting SPARQL Web page at the following address (View Page Source to see the HTML):

You will note that, besides logo and menu items similar to the standard KBpedia site, this form has two text areas, one for entering the SPARQL query and one for viewing subsequent results. There are also some switches regarding input and output forms. It is these switches and the two text areas that relate most directly to the next question.

Tying this form to (which, of course was actually developed in conjunction with) its accompanying code was the most difficult coding effort I have undertaken with this CWPK series to date. I cover this coding development, along with the remaining questions and related topics, in our next installment.

NOTE: This article is part of the Cooking with Python and KBpedia series.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

The Beginning of Moving Our Environment into the Cloud

Today’s installment in our Cooking with Python and KBpedia series begins a three-part mini-series on moving our environment into the cloud. This first part begins with setting up a remote instance and Web page server using the microframework Flask. The next installment expands on this basis to begin adding a SPARQL endpoint to the remote instance. Then, in the concluding third part of this mini-series, we complete the endpoint and make the SPARQL service active.

We undertake this mini-series because we want to make KBpedia more open to SPARQL queries from anywhere on the Web and we will be migrating our main KBpedia Web site content to Python. The three installments in this mini-series pave the way for us to achieve these aims.

We begin today’s installment by looking at the approach and alternatives for achieving these aims, and then we proceed to outline the steps needed to set up a Python instance on a remote (cloud) instance and to begin serving Web pages.

Approach and Alternatives

We saw earlier in CWPK #25 and CWPK #50 the usefulness of the SPARQL query language to interact with knowledge graphs. Our first objective, therefore, was to establish a similar facility for KBpedia. Though we have an alternate choice to set up some form of RESTful API to KBpedia (see further Additional Documentation below), and perhaps that may make sense at some point in time, our preference is to use SPARQL given its robustness and many query examples as earlier installments document.

Further, we can foresee that our work with Python in KBpedia may warrant moving our standard KBpedia Web site to Python, away from Clojure and its current Bootstrap-based Web page format. Though Python is not generally known for its Web-page serving capabilities, some exploration of that area may indicate whether we should go in that direction or not. Lastly, given our intent to make querying of KBpedia more broadly available, we also wanted to adhere to linked data standards. This latter objective is perhaps the most problematic of our aims, as we will discuss in a couple of installments.

Typically, the easiest path when one wants to set up a SPARQL endpoint with linked data ‘conneg‘ (content negotiation) is to use an existing triple store that has these capabilities built in. We already use Virtuoso as a triple store, and there are a couple of Python installation guides already available for Virtuoso. Most of the other widely available triple stores have similar capabilities.

Were we not interested in general Web page serving, and were we working outside the bounds of the objectives of this CWPK series, use of a triple store is the path we would recommend. However, since our aims are different, we needed to turn our attention to Python-based solutions.

From the standpoint of Web-page serving, perhaps the best known and most widely installed Python option is Django, a fully featured, open-source Web framework. Django has a similar scope and suite of capabilities to its PHP counterpart, Drupal. These are comprehensive and complicated frameworks, well suited to massive database-backed sites or ones with e-commerce or other specialty applications. We have had much experience with Drupal, and frankly did not want to tackle the complexity of a full framework.

I was much more attracted to a simpler microframework. The two leading Python options are Flask and Bottle (though there is also Falcon, which does not seem as developed). I was frankly more impressed with the documentation, growth and acceptance shown by Flask, and there appeared to be more analogous installations. Flask is based on the Jinja template engine and the Werkzeug WSGI (Web Server Gateway Interface) utility library. It is fully based on Unicode.
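For readers new to WSGI, the interface itself is small: an application is a callable that receives a request environment and a start_response function, and returns an iterable of bytes. The following is a minimal sketch using only the standard library's wsgiref module. The names hello_app and call_app are hypothetical helpers for illustration and are not part of Flask or Werkzeug, which wrap this same interface in a friendlier API.

```python
from wsgiref.util import setup_testing_defaults


def hello_app(environ, start_response):
    """A bare WSGI application: the kind of callable a framework builds for us."""
    body = b"Hello KBpedia!"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(app, path="/"):
    """Invoke a WSGI app directly, roughly the way a server would."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a plausible fake request
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        # Record what the application reports back to the server
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


status, body = call_app(hello_app)
print(status, body.decode("utf-8"))
```

Apache's mod_wsgi module, discussed later in this installment, plays the role of call_app in production: it builds the environment from the incoming HTTP request and hands it to the application callable.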

Another factor that needs to be considered is support for RDFlib, the key package (along with related ones) that we will be using for the SPARQL efforts. I first discussed this package in CWPK #24, though it is featured in many installments.

Basic Installation

We will be setting up these new endpoints on our cloud server, which is a large EC2 instance on Amazon Web Services running Ubuntu 18.04 LTS. Of course, this is only one of many cloud services. As a result, I will not discuss all of the preliminary steps to first securing an instance, or setting up an SSH client to access it, nor any of the initial other start-up issues. For EC2 on AWS, there are many such tutorials. Two that I have encountered in doing the research for this installment include this one and this other one. There are multiple others, and ones applicable to other providers than AWS as well.

So, we begin from the point that an instance exists and you know how to log in and do basic steps with it. Even with this simplification, I began my considerations with these questions in mind:

  1. Do I need a package manager, and if so, which one?
  2. Where should I place my Python projects within the remote instance directory structure?
  3. How do I also include the pip package environment?
  4. Should I use virtual environments?

With regard to the first question, I was sure I wanted to maintain the conda package manager, but I was not convinced I needed the full GUI and set of packages that come with Anaconda. I wanted to keep consistency with the set-up environment we put in place for our local instance (see CWPK #9). However, since I had gained much experience with Anaconda and conda, I felt comfortable not installing the full 4 GB Anaconda distribution. This led me to adopt the minimal Miniconda package, which has a much smaller footprint. True, I will need to install all specific Python packages and work from the command line (terminal), but that is easy given the experience I have gained.

Second, in reviewing best practices information for directory structures on Ubuntu, I decided to create an analogous ‘python-projects’ master directory, similar to what I established for the local instance, under the standard user application directory. I thus created this master directory: /usr/bin/python-projects.

So, having decided to use Miniconda and knowing the master directory I wanted to use, I proceeded to set up my remote installation. I followed a useful source for this process, though importantly I updated all references to the current versions. The first step was to navigate to my master directory and then to download the 64-bit Linux installer onto my remote instance, followed by executing the installation script:

wget  * make sure you use updated version


The installation script first requires you to page through the license agreement (many lines!!) and then accept it. Upon acceptance, the Miniconda code is installed with some skeletal stubs in a new directory under your master directory, which in my case is /usr/bin/python-projects/miniconda3. The last step in the installation process asks where you want Miniconda installed. The default is to use root/miniconda3. Since I wanted to keep all of my Python project stuff isolated, I overrode this suggestion and used my standard location of /usr/bin/python-projects/miniconda3.

After all of the appropriate files are copied, I agreed to initialize Miniconda. This completes the basic Miniconda installation.

Important Note: You will need to sign out and then sign back into your SSH shell at this point before the changes become active!

After signing back in, I again navigated to my Python master directory and then installed the pip package manager since not all of the Python packages we use in this CWPK series and cowpoke are available through conda.

Virtual Environments and a Detour

In the standard use of a Linux installation, one uses the distro’s own installation and package management procedures. In the case of Ubuntu and related distros such as Debian, apt stands for ‘advanced package tool‘ and through commands such as apt-get is one means to install new capabilities to your remote instance. Other Linux distros may use yum or rpm or similar to install new packages.

Then, of course, within Python, one has the pip package installer and the conda one we are using. Further, within a Linux installation, how one may install packages and where they may be applicable depends on the user rights of the installer. That is one reason why it is important to have sudo (super user) rights as an administrator of the instance when one wants new packages to be available to all users.

These package managers may conflict in rights and, if not used properly, may work at cross purposes. For example, a standard AWS EC2 instance using Ubuntu comes packaged with its own Python version. This default is useful for the occasional app that needs Python, but does not conform to the segregation of packages that Python often requires using its ‘virtual environments‘.

On the face of it, the use of virtual environments seems to make sense for Python: we can keep projects separate with different dependencies and even different versions of Python. I took this viewpoint at face value when I began this transition to installing Python on a remote instance.

Given this, it is important to never run sudo pip install. We need to know where and to what our Python packages apply, and generic (Linux-type) install routines confuse Linux conventions with Python ones. We are best in being explicit. There are conditions, then, where the idea of a Python virtual environment makes sense. These circumstances include, among others, a Python shop that has many projects; multiple developers with different versions and different applications; a mix of Python applications between Python 2 and 3 or where specific version dependencies create unworking conditions; and so forth.

However, what I found in migrating to a remote instance is that virtual environments, combined with a remote instance and different operating system, added complexity that made finding and correcting problems difficult. After struggling for hours trying to get systems to work, not really knowing where the problem was occurring nor where to look for diagnostics, I learned some important things the hard way. I describe below some of these lessons.

An Unfortunate Bad Fork

So, I was convinced that a virtual environment made sense and set about following some online sources that documented this path. In general, these approaches used virtualenv (or venv), a pip-based environment manager, to set up this environment. Further, since I was using Ubuntu, AWS and Apache2, these aspects added to the constraints I needed to accommodate to pursue this path.

Important Note: For my configuration, this path is a dead end! If you have a similar configuration, do NOT follow this path. The section after this one presents the correct approach!

In implementing this path, I first installed pip:

sudo apt install -y python3-pip

I was now ready to tackle the fourth and last of my installation questions, namely to provide a virtual environment for my KBpedia-related Python projects. To do so, we first need to install the virtual environment package:

sudo apt install -y python3-venv

Then, we make sure we are in the base directory where we want this virtual environment to reside (in our case, /usr/bin/python-projects/). We also will name our virtual environment kbpedia. We first establish the virtual environment:

python3.6 -m venv kbpedia

which pre-populates a directory with some skeletal capabilities. Then we activate it:

source kbpedia/bin/activate

The virtual environment is now active, and you can work with it as if you were at the standard command prompt, though that prompt does change in form to something like (kbpedia) :/usr/bin/python-projects#. You work with this directory as you normally would, adding test code next in our case. When done working with this environment, type deactivate to return from the virtual environment.

The problem is, none of this worked for my circumstance, and likely never would. What I had neglected in taking this fork is that conda is both a package manager and a virtual environment manager. With the steps I had just taken, I had inadvertently set up multiple virtual environments, which were definitely in conflict.
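One way to spot this kind of overlap early is to ask Python itself which environment managers appear to be active. The following is a minimal sketch, not a definitive diagnostic: the function names describe_environment and looks_conflicted are hypothetical, and the check assumes the usual behavior that activating a venv exports VIRTUAL_ENV and changes sys.prefix, while conda activate exports CONDA_DEFAULT_ENV.

```python
import os
import sys


def describe_environment(environ=None, prefix=None, base_prefix=None):
    """Report which environment managers appear to be active.

    Arguments default to the live process values; they can be passed
    explicitly to explore other scenarios.
    """
    environ = os.environ if environ is None else environ
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return {
        # a venv changes sys.prefix but leaves sys.base_prefix alone
        "in_venv": prefix != base_prefix,
        # set by 'conda activate'
        "conda_env": environ.get("CONDA_DEFAULT_ENV"),
        # set by a venv's activate script
        "venv_path": environ.get("VIRTUAL_ENV"),
    }


def looks_conflicted(report):
    """True when both a venv and a conda environment seem active at once."""
    venv_active = report["in_venv"] or bool(report["venv_path"])
    return bool(report["conda_env"]) and venv_active


report = describe_environment()
print(report)
print("possible conflict:", looks_conflicted(report))
```

Running this inside the shell you intend to use for installs shows at a glance whether a venv has been layered on top of a conda environment.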

The Proper Installation

Once I realized that my choice of conda meant I had already committed to a virtual environment manager, I needed to back off all of the installs I had undertaken with the bad fork. That meant I needed to uninstall all packages installed under that environment, remove the venv environment manager, remove all symbolic links, and so forth. I also needed to remove all Apache2 updates I had installed for working with wsgi. I had no confidence whatever I had installed had registered properly.

The bad fork was a costly mistake, and it took me quite a bit of time to research and find the proper commands to undo the steps I had undertaken (which I do not document here). My intent was to get back to ‘bare iron’ for my remote installation so that I could pursue a clean install based on conda.

Installing the mod-wsgi Apache Module

After reclaiming the instance, my first step was to install the appropriate Apache2 modules to work with Python and wsgi. I began by installing the WSGI module to Apache2:

sudo apt-get install libapache2-mod-wsgi-py3

Which we then test to see if it was indeed installed properly:

apt-cache show libapache2-mod-wsgi-py3

Which displays:

Package: libapache2-mod-wsgi-py3
Architecture: amd64
Version: 4.5.17-1ubuntu1
Priority: optional
Section: universe/httpd
Source: mod-wsgi
Origin: Ubuntu
Maintainer: Ubuntu Developers
Original-Maintainer: Debian Python Modules Team
Bugs:
Installed-Size: 271
Provides: httpd-wsgi
Depends: apache2-api-20120211, apache2-bin (>= 2.4.16), python3 (>= 3.6), python3 (<< 3.7), libc6 (>= 2.14), libpython3.6 (>= 3.6.5)
Conflicts: libapache2-mod-wsgi
Filename: pool/universe/m/mod-wsgi/libapache2-mod-wsgi-py3_4.5.17-1ubuntu1_amd64.deb
Size: 88268
MD5sum: 540669f9c5cc6d7a9980255656dd1273
SHA1: 4130c072593fc7da07b2ff41a6eb7d8722afd9df
SHA256: 6e443114d228c17f307736ee9145d6e6fcef74ff8f9ec872c645b951028f898b
Homepage:
Description-en: Python 3 WSGI adapter module for Apache
 The mod_wsgi adapter is an Apache module that provides a WSGI (Web Server
 Gateway Interface, a standard interface between web server software and
 web applications written in Python) compliant interface for hosting Python
 based web applications within Apache. The adapter provides significantly
 better performance than using existing WSGI adapters for mod_python or CGI.
 .
 This package provides module for Python 3.X.
Description-md5: 9804c7965adca269cbc58c4a8eb236d8

And next, we check to see if the module is properly registered:

Our first test is:

sudo apachectl -t

and the second is:

sudo apachectl -M | grep wsgi

which gives us the correct answer:

wsgi_module (shared)

Setting Up the Conda Virtual Environment

We then proceed to create the ‘kbpedia’ virtual environment under conda:

conda create -n kbpedia python=3

Which we test with the Ubuntu path inquiry:

echo $PATH

which gives us the correct path registration (the first of the five listed):


and then we activate the ‘kbpedia’ environment:

conda activate kbpedia

Installing Needed Packages

Now that we are active in our virtual environment, we need to install our missing packages:

conda install flask
conda install pip

Installing Needed Files

For Flask to operate, we need two files. The first file is the basic calling application, which we place under the Web sites directory location, /var/www/html/kbpedia/ (we create the kbpedia directory). We call up the editor for this new file:


and enter the following simple Python program:

from flask import Flask
app = Flask(__name__)@app.route("/")
def hello():
    return "Hello KBpedia!"

It is important to know that Flask comes with its own minimal Web server, which is useful only for demo or test purposes. Because it cannot be secured, it should NOT be used in a production environment. Nonetheless, we can test this initial Flask specification by entering either of these two commands:

 wget -O- 


Hmmm. These commands seem not to work. Something must not be correct in our Python file format. To get better feedback, we invoke Python and then:

flask run

This gives us the standard traceback listing that we have seen previously with Python programs. We get an error message that the name ‘app’ is not recognized. As we inspect the code more closely, we can see that a line break in the example we were following was inadvertently lost, so that the decorator (denoted by the ‘@’ sign) ran onto the end of the previous line. We edit the file again so that it now appears as follows:

from flask import Flask
app = Flask(__name__)

# The decorator must sit on its own line, directly above the function it registers
@app.route("/")
def hello():
    return "Hello KBpedia!"

Great! That seems to fix the problem.

We next need to enter a second Python program that tells WSGI where to find this app. We call up the editor again in the same directory location:


(We also may use the *.wsgi file extension if we so choose; some examples prefer this second option.)

and enter this simple program into our editor:

import sys
sys.path.insert(0, "/var/www/html/kbpedia/")
from test_kbpedia import app as application

This program tells WSGI where to find the application, which we register via the sys package as the code states.
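The effect of that sys.path.insert call can be tried out locally without a server. The sketch below is illustrative only: a temporary directory stands in for /var/www/html/kbpedia/, and demo_kbpedia_app is a hypothetical stand-in module with a bare WSGI callable rather than the real Flask app.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Assumption: a temporary directory stands in for /var/www/html/kbpedia/
site_dir = Path(tempfile.mkdtemp())

# Hypothetical stand-in for the application module (the real one defines a Flask app)
(site_dir / "demo_kbpedia_app.py").write_text(
    "def app(environ, start_response):\n"
    "    start_response('200 OK', [('Content-Type', 'text/plain')])\n"
    "    return [b'Hello KBpedia!']\n"
)

# Same move as the WSGI file: put the directory first on the module search path
sys.path.insert(0, str(site_dir))
importlib.invalidate_caches()

# A plain import can now find the module. Its callable is exposed under the
# name 'application', which is the name mod_wsgi looks for by default.
from demo_kbpedia_app import app as application

print(application.__module__)
```

Without the sys.path.insert line the import would fail, since the directory holding the module is not otherwise on Python's search path.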

Lastly, to complete our needed chain of references, Apache2 needs instructions for where a ‘kbpedia’ reference in an incoming URI needs to point in order to find its corresponding resources. So, we go to the /etc/apache2/sites-enabled directory and then edit this file:

vi 000-default.conf

Under DocumentRoot in this file, enter these instructions:

WSGIDaemonProcess kbpedia python-path=/usr/bin/python-projects/miniconda3/envs/kbpedia/lib/python3.8/site-packages
WSGIScriptAlias /kbpedia /var/www/html/kbpedia/
<Directory /var/www/html/kbpedia>
    WSGIProcessGroup kbpedia
    WSGIApplicationGroup %{GLOBAL}
    Order deny,allow
    Allow from all
</Directory>

This provides the path for where to find the WSGI file and Python.
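The python-path value is version-specific: the python3.8 segment depends on which Python 3 release conda installed into the environment, so it may differ on your instance. A small sketch for finding the right value is to run the following inside the activated environment. It uses only the standard library, and the variable name site_packages is simply illustrative.

```python
import os
import sys
import sysconfig

# Directory where pure-Python packages are installed for this interpreter
site_packages = sysconfig.get_paths()["purelib"]

print("interpreter  :", sys.executable)
print("version      : {}.{}".format(*sys.version_info[:2]))
print("site-packages:", site_packages)
print("exists       :", os.path.isdir(site_packages))
```

The site-packages line is the directory to use for python-path, provided the interpreter line points inside the intended conda environment.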

We check to make sure all of these files are written correctly by entering this command:

sudo apachectl -t

When we first run this, while we get a ‘Syntax OK’ message, we also get a warning message that we need to “Set the ‘ServerName’ directive globally to suppress this message”. While there is no problem in continuing in this mode, in order to understand the suite of supporting files, we navigate to the directory /etc/apache2, where the ServerName is set, and edit this file:

vi apache2.conf

and enter this name:

ServerName localhost

Note that anytime we make a change to an Apache2 configuration file, we need to restart the server to make sure the new configuration is active. Here is the basic command:

sudo service apache2 restart

If there is no problem, the prompt is returned after entering the command.

We are now complete with our initial inputs. To test whether we have properly set up Flask and our Web page serving, we use the IP for our instance and enter this command in a browser:

And, the browser returns a Web page stating:

Hello KBpedia!

Fantastic! We now are serving Web pages from our remote instance.
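The same check can be scripted rather than done by hand in a browser. The following is a hedged sketch using the standard library's urllib. The function name page_contains is hypothetical, and so that the example runs offline, a fake opener stands in for the remote server; the URL shown is a placeholder, not the actual address of the instance.

```python
import io
import urllib.request


def page_contains(url, expected, opener=urllib.request.urlopen, timeout=10):
    """Fetch url and report whether the decoded body contains expected."""
    with opener(url, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
    return expected in body


def fake_opener(url, timeout=10):
    """Offline stand-in for the remote server (assumption for this demo)."""
    return io.BytesIO(b"Hello KBpedia!")


ok = page_contains("http://example.invalid/kbpedia", "Hello KBpedia!", opener=fake_opener)
print(ok)
```

Against a live instance, drop the opener argument and pass your own instance's address in place of the placeholder URL.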

Of course, this is only the simplest of examples, and we will need to invoke templates in order to use actual HTML and style sheets. These are topics we will undertake in the next installment. But, we have achieved a useful milestone in this three-part mini-series.

Some More conda Commands

If you want to install libraries into an environment, or you want to use that environment, you have to activate the environment first by:

conda activate kbpedia

After invoking this command, we can install desired packages into the target environment, for example:

conda install package-name

But sometimes we need to use a different channel. We can add that channel to our conda configuration:

conda config --add channels asmeurer

or simply name the channel when installing (say):

conda install -c conda-forge package-name

To see what packages are presently available to your conda environment, type:

conda list

And, to see what environments you have set up within conda, enter:

conda env list

which, in our current circumstance, gives us this result:

base                     /usr/bin/python-projects/miniconda3
kbpedia               *  /usr/bin/python-projects/miniconda3/envs/kbpedia

The environment shown with the asterisk (*) is the currently active one.
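If you want to pick out the active environment in a script, the listing is simple enough to parse. The sketch below is only an illustration: parse_env_list is a hypothetical helper, it assumes paths without spaces, and the sample text mirrors the listing for this installation. For anything more robust, conda can also emit machine-readable output with conda env list --json.

```python
def parse_env_list(text):
    """Parse `conda env list` text into (name, is_active, path) tuples."""
    envs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comment lines
        parts = line.split()
        is_active = "*" in parts
        fields = [p for p in parts if p != "*"]
        if len(fields) == 1:
            # an unnamed environment is listed by path only
            name, path = "", fields[0]
        else:
            name, path = fields[0], fields[-1]
        envs.append((name, is_active, path))
    return envs


sample = """\
base                     /usr/bin/python-projects/miniconda3
kbpedia               *  /usr/bin/python-projects/miniconda3/envs/kbpedia
"""

envs = parse_env_list(sample)
for name, is_active, path in envs:
    marker = "active" if is_active else "      "
    print(marker, name, path)
```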

Another useful command to know is to get full information on your currently active conda environment. To do so, type:

conda info

in our case, that produces the following output:

     active environment : kbpedia
    active env location : /usr/bin/python-projects/miniconda3/envs/kbpedia
            shell level : 1
       user config file : /root/.condarc
 populated config files :
          conda version : 4.8.4
    conda-build version : not installed
         python version :
       virtual packages : __glibc=2.27
       base environment : /usr/bin/python-projects/miniconda3  (writable)
           channel URLs :
          package cache : /usr/bin/python-projects/miniconda3/pkgs
       envs directories : /usr/bin/python-projects/miniconda3/envs
               platform : linux-64
             user-agent : conda/4.8.4 requests/2.24.0 CPython/3.8.3 Linux/4.15.0-115-generic ubuntu/16.04.1 glibc/2.27
                UID:GID : 0:0
             netrc file : None
           offline mode : False

Additional Documentation

Here are some additional resources touched upon in the prior discussions:

Getting Oriented

Flask on AWS

Each of these covers some of the first steps needed to get set up on AWS, which we skip over herein:


General Flask Resources


Here is a nice overview of RDFlib.

General Remote Instance Set-up

A video on setting up an EC2 instance and Putty; also deals with updating Python, Filezilla and crontab.

Other Web Page Resources


