Posted: September 24, 2020

Segregating the Structure and Looking for Orphans

We have progressed through these build portions of the Cooking with Python and KBpedia series to capture the bulk of the structure in KBpedia by defining its classes, properties, and the hierarchical relationships among them. We have, so to speak, tossed all of the components into the bin, and have mostly defined our knowledge structure’s scaffolding. But we still lack some structural definitions and analysis prior to beginning the testing for whether this structure is coherent or not. Today’s installment directly addresses these gaps.

You will note we still, as yet, have not done anything to annotate our concepts or predicates. That is OK, and we will hold off a bit longer, because annotations are the trappings that enable humans and language to interact with the knowledge graph. It is the structural aspects alone that set the logical framework of the knowledge graph. We will settle questions about this framework prior to adding labels, definitions, and alternative terms to KBpedia.

Say Goodbye to the Start-up

This is the last installment that will begin with our standard start-up routine. From here on, installments will begin with standard Python module import statements, since we will move the start-up routine into the cowpoke.__main__ import and remove the related comment below. We have also added the ‘extract’ switch, as we first described a couple of installments back:

Which environment? The specific load routine you should choose below depends on whether you are using the online MyBinder service or local files. The example below is based on using local files, which, given the complexity of the routines that are emerging, is probably your better choice. Make sure to modify the URIs for your local directories.
Getting ready for cowpoke: As I mentioned a few installments back, all of this code we are assembling will be released under the cowpoke package come installment CWPK #46, which is due to be posted in one week. Stay tuned!
from owlready2 import * 
from cowpoke.config import *
# from cowpoke.__main__ import *
import csv                                                
import types

world = World()

kb_src = master_deck.get('kb_src')                         # we get the build setting from config.py

if kb_src is None:
    kb_src = 'standard'
elif kb_src == 'extract':                                  # 'is' tests identity; '==' is needed for string comparison
    kb_src = 'standard'  
elif kb_src == 'full':
    kb_src = 'start'    

if kb_src == 'sandbox':
    kbpedia = 'C:/1-PythonProjects/kbpedia/sandbox/kbpedia_reference_concepts.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
elif kb_src == 'standard':
    kbpedia = 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kko.owl'
elif kb_src == 'start':
    kbpedia = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kbpedia_rc_stub.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kko.owl'
else:
    print('You have entered an inaccurate source parameter for the build.')
skos_file = 'http://www.w3.org/2004/02/skos/core' 

This next block will move to the cowpoke.__main__ import as well:

kb = world.get_ontology(kbpedia).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')               

skos = world.get_ontology(skos_file).load()
kb.imported_ontologies.append(skos)
core = world.get_namespace('http://www.w3.org/2004/02/skos/core#')

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)
kko = kb.get_namespace('http://kbpedia.org/ontologies/kko#')

You will need to Run (shift+enter) the routines above in order to test any of the subsequent methods.

Structure Utilities

This section describes a number of utilities we may apply to the structure of KBpedia. Most of these routines need only be run infrequently, and generally in preparation for testing the last structure items before initiating a formal, new build.

In the last installment, we developed the first two of these utilities, the dup_remover check and the set_union routine. These two join the routines below in the new utils module.

SuperTypes

In our prior build routines, we had some specific steps dealing with defining ‘SuperTypes’, that is, the root concepts to each of our typologies. With this new Python cowpoke design, these specifications have moved to the KBpedia Knowledge Ontology (KKO) upper ontology (see CWPK #38). If you choose to add a new upper-level typology, you will need to take these steps:

  1. Using an ontology editor, add the new upper level SuperType to its appropriate level under Generals in the KKO ontology;

  2. Add all required annotations (definition, prefLabel and altLabels) for that new concept in KKO;

  3. Add a new entry to the typol_dict dictionary list in config.py;

  4. Flesh out and complete a typology flat file for that new SuperType and place it into the appropriate directory used for your builds;

  5. Build the KBpedia structure (or whatever you may have named it) and test the structure (per this and the next installments); and

  6. Add the annotations to any new RCs in the typology (CWPK #44).

Note: Lower-level typologies may also be added to an existing KBpedia concept node (‘rc‘ namespace). In those cases, the new typology needs to be added explicitly to the class_struct_build process in CWPK #40, but no further changes need to be made to KKO since the parent typology is already hooked into the system.

Difference Analysis

The difference analysis (set_difference) code is mostly identical to the set_union routine from the prior installment, except for the difference calculation shown on the line with Note #6. It is best used to check the difference from only one or two other sets (typologies).

The basic run command for this utility is:

    set_difference(**build_deck)
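
Since the full routine lives in the utils module, here is a minimal free-standing sketch of the core difference step only (the file arguments are hypothetical stand-ins; the actual set_difference follows the looping framework shown for set_union in the prior installment):

def simple_set_difference(f_base, f_other):
    with open(f_base, 'r', encoding='utf8') as f:
        base_rows = f.readlines()[1:]                 # skip the header row
    with open(f_other, 'r', encoding='utf8') as f:
        other_rows = f.readlines()[1:]
    return list(set(base_rows) - set(other_rows))     # rows in the base file not in the other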

Disjoint Analysis

We first showed how to list disjoint classes in CWPK #17. Let’s take that basic command, and use it to extract our existing disjoint assignments to file, plus do a bit of output file cleanup. Since this is only rarely run (but helpful when done so!), we have not generalized it much:

def disjoint_status():
    output = list(world.disjoint_classes())
    disjoint_file = open('C:/1-PythonProjects/kbpedia/v300/build_ins/working/kbpedia_disjoint.csv', 'w', encoding='utf8')
    disjoint_file.write('id,disjoints\n')
    for element in output:
        element = str(element)
        element = element.replace('AllDisjoint([', '')
        element = element.replace('C:\\1-PythonProjects\\kbpedia\\sandbox\\', '')
        element = element.replace(' | ', ',')
        element = element.replace(' ', '')
        element = element.replace('])', '')
        element = element.replace(',ontology=get_ontology("http://kbpedia.org/ontologies/kko#"))', '')
        element = element.replace(']', '')
        disjoint_file.write(element)
        disjoint_file.write('\n')
    disjoint_file.close()

Mostly this routine just cleans up the output from the standard owlready2 ‘disjoint’ call. It was only cleaned up to the point of readability, since it will not be used in any roundtripping. The next couple of sub-sections address how we typically handle disjointedness assertions.

Disjoint assignments are some of the most important in KBpedia. We try to ensure that any truly non-overlapping typologies are declared as ‘disjoint’ from one another. Also, we try to scrutinize closely two typologies with only minimal overlap. These minor overlaps may be misassignments, or perhaps we can move or slightly reconfigure a concept to avoid the overlap, in which case we can declare the two compared typologies to be actually disjoint. We need some offline analysis to review these situations.
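
For reference, a disjointedness assertion in owlready2 takes a form like the following (the typology pair shown is merely illustrative, not a statement about the actual KBpedia assignments):

with kko:
    AllDisjoint([kko.Animals, kko.Plants])            # asserts the two classes share no members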

Typology Intersections

We already showed a set_intersection method in the previous installment. However, for disjoint analysis we want to run pairwise comparisons between all typologies and flag those that have no overlap or have minimal overlaps. With 72 items in the current typology list (excluding Generals, which is the catch-all combined parent), we thus have 2,556 pairs to test, since order is not important in a pair. The basic formula is n(n-1)/2. With this many comparisons, the process clearly needs to be automated.
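
We can quickly confirm that count with the same itertools method the routine below uses:

from itertools import combinations

print(len(list(combinations(range(72), 2))))          # n(n-1)/2 = 72 * 71 / 2 = 2556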

So, our basic approach is to begin with the first typology, compare it to all others, move to the second and compare, and so forth until we have exhausted the typology list. For each iteration, we collect the RCs from the first typology and the RCs from the second typology, convert them to sets, and then do a set intersection. We then print out the count of the intersections, plus the actual RCs that overlap between the two typology sets if the intersection falls below a set number of overlaps. Here is the basic routine, with notes explained after the code:

### KEY CONFIG SETTINGS (see build_deck in config.py) ###                
# 'kb_src'        : 'standard'
# count           : 20                                                    # Note 1
# out_file        : 'C:/1-PythonProjects/kbpedia/v300/targets/stats/kko_intersections.csv'

from itertools import combinations                                        # Note 2

def typol_intersects(**build_deck):
    kko_list = typol_dict.values()
    count = build_deck.get('count')
    out_file = build_deck.get('out_file')
    with open(out_file, 'w', encoding='utf8') as output:
        print('count,kko_1,kko_2,intersect RCs', file=output)
        for i in combinations(kko_list,2):                                # Note 3
            kko_1 = i[0]                                                  # Note 4
            kko_2 = i[1]                                                  # Note 4
            kko_1_frag = kko_1.replace('kko.', '')
            kko_1 = getattr(kko, kko_1_frag)                              # Note 5
            kko_2_frag = kko_2.replace('kko.', '')
            kko_2 = getattr(kko, kko_2_frag)     
            descent_1 = kko_1.descendants(include_self = False)           # Note 6
            descent_1 = set(descent_1)
            descent_2 = kko_2.descendants(include_self = False)
            descent_2 = set(descent_2)
            intersect = descent_1.intersection(descent_2)                 # Note 7
            num = len(intersect)
            if num <= count:                                              # Note 1
                print(num, kko_1, kko_2, intersect, sep=',', file=output)
            else: 
                print(num, kko_1, kko_2, sep=',', file=output)
    print('KKO typology intersection analysis is done.')
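
As with our other utilities, we run the routine by passing the build settings:

    typol_intersects(**build_deck)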

We pick up our settings, like other routines, from the (**build_deck), and we set a threshold of 20 overlaps or fewer (1) (you may change this to any value you wish) for printing out the intersecting RCs. If you’d like to inspect one output (calculated as of today’s installment; it may change), you can inspect the file by running this cell:

with open('C:/1-PythonProjects/kbpedia/sandbox/kko_intersections.csv', 'r') as f:
    print(f.read())

Each line in the output presents the intersection count, followed by the listing of the two typologies being compared, and a listing of the intersecting reference concepts (RCs) if they fall below the minimum.

The code takes advantage of a new module in this series, itertools (2), that has a number of very useful data analysis options. We are looking at the combinations method (3) that iterates for us over all of the unordered pairwise comparisons (2,556 in our case). We pull out the actual typology item by index from the pair (4), and, like before, evaluate that string to retrieve the actual typology class reference (5). Using the owlready2 built-in .descendants() function (6), we are able to get all of the RC descendant members for each of the typologies, convert them to sets, and then intersect them (7) with the efficient set intersection notation.

We want to do two things with this output. First, we want to make sure that all null intersections (count = 0) are included in our disjoint assignments in KBpedia. This is where we can quickly compare to the output from the earlier disjoint_status function. Second, for intersections with minimal overlap, we want to inspect those items and discover if we can revise scope or assignments for some RCs to make the pair disjoint. This latter step is a bit tricky (aside from any misassignments, which have now been flagged for correction) because we do not want to change our ideas of ‘natural’ classes merely to make a disjoint assertion. However, sometimes either the scope of the typology, or the scope of the shared RC, may be tweaked such that a defensible disjoint status may be asserted. When there are very few overlaps, for example, one approach that has sometimes made sense is to move a concept into a parent category above the two comparison child typologies. There are also circumstances where the overlap is real, and even if only with a few overlap items, the non-disjointedness should be maintained (and thus no changes should be made).

Some time and experience are likely required in this area. Disjoint assertions are some of the most powerful for inferencing and satisfiability testing of the knowledge graph. (I suspect I have spent more intellectual horsepower on the questions of disjoint typologies than any other in KBpedia.)

From the standpoint of the Python code used for this method, see the concluding section under Additional Documentation to check out some useful sources.

Branch and Orphan Check

A periodic check that is helpful is whether a given RC has a broken lineage to the root of its typology. Such breaks can not occur when the typology is a direct extraction from KBpedia without external modification. However, the use of external tools, general edits, or other modifications to a typology used for ingest can result in broken inheritance chains. In the case where more than one RC in a chain of RCs lacks a connection to the root, the disconnected fragment is called a ‘branch’. Where the disconnected fragment is a singleton RC, it is called an ‘orphan’.

Again, because this routine is infrequently needed, I mostly hardwired the settings below. You can move them back to the build_deck settings. Here is the routine, again with notes that follow the code listing:

### KEY CONFIG SETTINGS (see build_deck in config.py) ###                  
# 'kb_src'        : 'standard'                                        
# 'loop_list'     : kko.Generals.descendants()                             # Note 1                            
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/targets/stats/branches_orphans.csv'

def branch_orphan_check(**build_deck):
    print('Beginning branch and orphan checks . . .')                     
#    loop_list = build_deck.get('loop_list')                               # Note 1
    loop_list = kko.Generals.descendants()                                 # Note 1
    loop_list = set(loop_list)
    kko_list = list(typol_dict.values())
    item_list = []
    for i, item in enumerate(kko_list):                                    # Note 2                                    
        item_frag = item.replace('kko.','')
        kko_item = getattr(kko, item_frag)
        kko_list[i] = kko_item
    print('After:', kko_list)
    out_file = 'C:/1-PythonProjects/kbpedia/v300/targets/stats/branches_orphans.csv'
    with open(out_file, 'w', encoding='utf8') as output:
        print('rc', file=output)
        kko_list = set(kko_list)
        for loopval in loop_list:
            val = loopval
            print('   . . . evaluating', loopval, 'checking for branches and orphans . . .')  
            val_list = val.ancestors(include_self = False)
            val_list = set(val_list)
            intersect = val_list.intersection(kko_list)
            num = len(intersect)
            print(num)
            if num == 0:
                print('Unconnected RC:', val, file=output)    
    print('Branch and orphan analysis now complete.')

In this example, we set the overall loop basis to be all of the RCs in the system; that is, the .descendants of the Generals typology root. If driven from the build_deck, the value could be changed to a single typology using the custom_dict setting, but it may be just as easy to set it directly in this code.

While .descendants produces an array of class objects, evaluating all of the typologies requires us to loop over kko_list, whose values come from the typol_dict dictionary as strings. As we have seen before, we need to convert those strings into class object types (2), which also requires us to enumerate the list so we can substitute class values for the initial string values.

We then convert our two input lists to sets, and do an intersection as in prior routines when we run the routine. If the item does not have the typology root as an ancestor, we then know the item is an orphan or the top of a branch not connected to the root.

This kind of analysis is most useful when first constructing a new, initial typology. As disconnects get connected, the worth of this analysis declines.

branch_orphan_check(**build_deck)

Duplicates in the Parental Chain

Our last structural utility at this juncture is one that analyzes whether a given reference concept (RC) is only assigned once to its lowest logical occurrence in a parental inheritance chain. While there is nothing illogical about assigning a concept wherever it is subsumed by a parent, multiple assignments for a single RC in a given inheritance chain lead to unreadability and difficulties in maintaining the system.

For example, we know that a ‘polar bear’ is a ‘bear’, which is a ‘mammal’ that is part of ‘Eutheria’, all of which are ‘LivingThings’. There is nothing logically wrong with assigning the ‘polar bear’ concept to all of these other items. Inferencing would show ‘polar bear’ to be a subclass of all of these items. However, redundant assignments act to clog our listing, and makes it difficult to know when we see an occurrence whether it is at its terminal node location or not. We get cleaner ontologies that are easier to maintain by trying to adhere to the best practice of a single assignment to an inheritance chain, best placed at its lowest hierarchical level.

Redundant assignments, in my view, are all too common with most knowledge graphs. I like the analytical routine below since it helps me to pare down to the essence of the logic of the ontology structure. Code notes are discussed below the listing:

### KEY CONFIG SETTINGS (see build_deck in config.py) ###                  
# 'kb_src'        : 'standard'                                        
# 'loop_list'     : kko.AudioInfo.descendants()                                 # Note 1
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/targets/stats/parental_dups.csv'


def dups_parental_chain(**build_deck):
    print('Beginning duplicate RC placement analysis . . .')                     
    loop_list = kko.AudioInfo.descendants()                                # Note 1
    out_file = 'C:/1-PythonProjects/kbpedia/v300/targets/stats/parental_dups.csv'    
    with open(out_file, 'w', encoding='utf8') as output:
        print('count,rc,dups', file=output)
        for item in loop_list:                                            # Note 2
            rc = item
            rc_list = rc.ancestors(include_self = False)
            dup_keep = []
            for par_item in rc_list:
                parent = par_item
                par_list = parent.subclasses()
                for dup_item in par_list:
                    dup = dup_item
                    if rc == dup:
#                        dup_check = dup.ancestors(include_self = False)
#                        if(all(x in rc_list for x in dup_check)):
#                            print(rc, ',', parent, file=output)   
                        dup_keep.append(parent)                
            count = len(dup_keep)
            if count > 1:
                print(count, ',', rc, ',', dup_keep, file=output)
    print('Duplicate RC checking and analysis is complete.')
dups_parental_chain(**build_deck)
Beginning duplicate RC placement analysis . . .
Duplicate RC checking and analysis is complete.

On my local machine, this analysis takes about 3.5 minutes to run.

We directly assign the loop basis (1), here shown with a single typology; a full analysis would trace all of the RCs under the Generals root, the catch-all of the three main branches in KKO’s universal categories. Again, these can be tailored through settings from the build_deck. If you do so, make sure you make the .descendants assignment as well. The remaining parts of the routine should be somewhat familiar by now.

The routine basically works by first looping over all of the RCs in scope (2), grabbing all ancestors up to the owl.Thing root, looping over all of the ancestors and grabbing their immediate subclasses, and then checking to see if one of those subclasses is the starting RC. If so, that parent is recorded, and RCs recorded under more than one direct parent are written to file.

These listings perhaps could be reduced further in size with additional filtering. At this juncture, however, I believe it is best to inspect such structural changes manually. It is straightforward to check the RCs listed and remove any superfluous subsumption assignments.

I may add some more refinements to this routine later to flag any subclass assignments that occur in the same parental chain.

If our system passes the tests above, or at least to the extent that we, as knowledge graph managers, deem acceptable for a next release, then we are ready to begin our logic tests of the structure, the subject of our next installment.

Additional Documentation

Here are some useful links on the itertools module, as well as other pairwise considerations:

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.
NOTE: This CWPK installment is available either as an online interactive file or as a direct download to use locally. Make sure to pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on September 24, 2020 at 10:12 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2382/cwpk-42-other-structural-considerations/
Posted: September 23, 2020

Trying to Overcome Some Performance Problems and Extend into Property Structure

Up until the last installment in this Cooking with Python and KBpedia series, everything we did performed quickly in our interactive tests. However, we have encountered some build bottlenecks with our full build routines. If we use the single Generals typology (under which all other KBpedia typologies reside and which includes all class structure definitions), the full build requires 70 minutes! Worse, if we actually loop over all of the constituent typology files (and exclude the Generals typology), the full build requires 620 minutes! Wow, that is unacceptable. Fortunately, we do not need to loop over all typology files, but this poor performance demands some better ways to approach things.

So, as we continue our detour to a full structure build, I wanted to test some pretty quick options. I also thought some of these tests have use in their own right apart from these performance questions. Tests with broader usefulness we will add to a new utils module in cowpoke. Some of the tests we will look at include:

  • Add a memory add-in to Jupyter Notebook
  • Use sqlite3 datastore rather than entirely in-memory
  • Saving and reloading between passes
  • Removing duplicates in our input build files
  • Creating a unique union of class specifications across typologies, or
  • Some other ideas that we are deferring for now.

After we tweak the system based on these tests, we resume our structure building quest, now including KBpedia properties, to complete today’s CWPK installment.

Standard Start-up

We invoke our standard start-up functions. We have updated our ‘full’ switch to ‘start’, and have cleaned out some earlier initializations that were not actually needed:

Which environment? The specific load routine you should choose below depends on whether you are using the online MyBinder service or local files. The example below is based on using local files, which, given the complexity of the routines that are emerging, is probably your better choice. Make sure to modify the URIs for your local directories.
from owlready2 import * 
from cowpoke.config import *
# from cowpoke.__main__ import *
import csv                                                
import types

world = World()

kb_src = every_deck.get('kb_src')                         # we get the build setting from config.py
#kb_src = 'standard'                                      # we can also do quick tests with an override

if kb_src is None:
    kb_src = 'standard'
if kb_src == 'sandbox':
    kbpedia = 'C:/1-PythonProjects/kbpedia/sandbox/kbpedia_reference_concepts.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
elif kb_src == 'standard':
    kbpedia = 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kko.owl'
elif kb_src == 'start':
    kbpedia = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kbpedia_rc_stub.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kko.owl'
else:
    print('You have entered an inaccurate source parameter for the build.')
skos_file = 'http://www.w3.org/2004/02/skos/core' 
    

(Note: As noted earlier, when we move these kb_src build instructions to a module, we also will add another ‘extract’ option and add back in the cowpoke.__main__ import statement.)

kb = world.get_ontology(kbpedia).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')               

#skos = world.get_ontology(skos_file).load()
#kb.imported_ontologies.append(skos)
#core = world.get_namespace('http://www.w3.org/2004/02/skos/core#')

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)
kko = kb.get_namespace('http://kbpedia.org/ontologies/kko#')

Just to make sure we have loaded everything necessary, we can test for whether one of the superclasses for properties is there:

print(kko.eventuals)
kko.eventuals

Some Performance Improvement Tests

To cut to the bottom line, if you want to do a full build of KBpedia (as you may have configured it), the fastest approach is to use the Generals typology, followed by whatever change supplements you may have. (See this and later installments.) As we have noted, a full class build will take about 70 minutes on a conventional desktop using the Generals typology.

However, not all builds are full ones, and in trying to improve on performance we have derived a number of utility functions that may be useful in multiple areas. I detail the performance tests, the routines associated with them, and the code developed for them in this section. Note that most of these tests have been placed into the utils module of cowpoke.

Notebook Memory Monitor

I did some online investigations of memory settings for Python, which, as I previously reported, are apparently neither readily settable nor really required. I did not devote huge research time, but I was pleased to see that Python has a reputation of grabbing as much memory as it needs up to local limits, and also apparently releasing memory and doing fair garbage clean-up. There are some modules to expose more information and give more control, but my sense is that altering Python directly was not a productive path for the KBpedia project.

My next question related to whether there might be a Jupyter Notebook limitation, because that is where I was working out and documenting the developing routines. I came across reference to an extension of Notebook, nbresuse, that provides a memory use monitor in the notebook’s interface. According to instructions:

pip install nbresuse[resources]

which proceeded without a hitch.

I am running a test in the background on another notebook page, but here is how my screen presently looks:

Figure 1: The nbresuse Memory Display on Jupyter Notebook

When I first start an interactive session with KBpedia the memory demand is about 150 MB. Most processes demand about 500 MB, and rarely do I see a value over 1 GB, all well within my current limits (16 GB of RAM, with perhaps as much as 8 GB available). So, I have ruled out an internal notebook issue, though I have chosen to keep the extension installed because I like the memory use feedback.

Using a Persistent Datastore

One observation from owlready2’s developer, Jean-Baptiste Lamy, is that making the ontology persistent sometimes speeds up some operations for larger knowledge graphs. Normally, owlready2 tries to keep the entire graph in memory. One makes an ontology persistent by calling the native owlready2 datastore, sqlite3, and relating it (in an unencumbered sense, which is our circumstance) to the global (‘default_world‘) namespace:

default_world.set_backend(filename='C:/1-PythonProjects/kbpedia/v300/build_ins/working/kb_test.sqlite3')

This is a good command to remember and does indeed save the state, but I did not notice any improvements to the specific KBpedia load times.
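
If you do elect a persistent store, note that owlready2’s World.save() flushes any pending changes in the quadstore to the sqlite3 file; a minimal sketch (using the same file path as above) is:

default_world.set_backend(filename='C:/1-PythonProjects/kbpedia/v300/build_ins/working/kb_test.sqlite3')
# ... load and modify ontologies as usual ...
default_world.save()                                  # flush the quadstore to the sqlite3 file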

Saving Files Between Passes

The class_struct_build function presented in the prior CWPK installment splits nicely into three passes: define the new class; add parents to the class; then remove the redundant rdfs:subClassOf owl:Thing assignment. I decided to open and close files between passes to see if perhaps a poor memory garbage clean-up or other memory issue was at play.

We do see that larger memory models cause a slowdown in performance after some apparent internal limit is crossed, as witnessed when the very large typologies cross some performance threshold. Unfortunately, opening and closing files between passes had no notable effect on processing times.

Duplicates Removal

A simple cursory inspection of an extracted ontology file indicates multiple repeat rows (triple statements). If we are seeing longer than desired load times, why not reduce the overall total number of rows that need to be processed? Further, it would be nice, anyway, to have a general utility for reducing duplicate rows.

There are many ways one might approach such a problem with Python, but the method that appealed most to me is really simple: we define each ingested row (taken in its entirety as a complete triple statement) as a member of a list (newrows = []), and then check whether each subsequent row has already been ingested. That’s it!

We embed these simple commands in the looping framework we developed in the last installment:

def dup_remover(**build_deck):
    print('Beginning duplicate remover routine . . .')
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    base = build_deck.get('base')
    base_out = build_deck.get('base_out')
    ext = build_deck.get('ext')
    for loopval in loop_list:
        print('   . . . removing dups in', loopval)
        frag = loopval.replace('kko.','')
        in_file = (base + frag + ext)
        newrows = []                                            # set list to empty
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])
            for row in reader:
                if row not in newrows:                          # if row is new, then:
                    newrows.append(row)                         # add it!
        out_file = (base_out + frag + ext)    
        with open(out_file, 'w', encoding='utf8', newline='') as output:
            is_first_row = True
            writer = csv.DictWriter(output, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])                 
            writer.writerows(newrows)
    print('File dup removals now complete.')   

Again, if your environment is set up right (and pay attention to the new settings in config.py!), you can run:

dup_remover(**build_deck)

The dup_remover function takes about 1 hour to run on a conventional desktop cycling across all available typologies. The few largest typologies take the bulk of the time. More than half of the smallest typologies run in under a minute. This profile shows that below some memory threshold performance screams, but larger sets (as we have seen elsewhere) require much longer periods of time. Two of the largest typologies, Generals and Manifestations, each take about 8 minutes to run.

(For just occasional use, this looks acceptable to me. If it continues to be too lengthy, my next test would be to ingest the rows as set members. Members of a Python set are unique, are intended to be immutable when defined, and are hashed for greater retrieval speed. You can’t use this approach if maintaining row (or set member) order is important, but in our case it does not matter in what order our class structure triples are ingested. If I refactor this function, I will first try this approach.)
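
Should I get to that refactor, the core would look something like this hypothetical, untested variant:

def dup_remover_set(in_file, out_file):               # hypothetical set-based variant
    with open(in_file, 'r', encoding='utf8') as f:
        header = f.readline()
        unique_rows = set(f.readlines())              # set members are unique, so dups drop out
    with open(out_file, 'w', encoding='utf8') as f:
        f.write(header)
        f.writelines(unique_rows)                     # note: original row order is not preserved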

These runs across all KBpedia typologies show that nearly 12% of all rows across the files are duplicate ones. Because of the lag in performance at larger sizes, removal of duplicates probably makes best sense for the largest typologies, and ones you expect to use multiple times, in order to justify the upfront time to remove duplicates.

We will place this routine in the utils module.

Unions and Intersections Across Typologies

In the same vein as removing duplicates within a typology, as our example just above did, we can also look to remove duplicates across a group of typologies. By using the set notation just discussed, we can also do intersections or other set operations. These kinds of operations have applications beyond duplicate checking down the road.
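
As a quick reminder, once rows are held in sets, Python’s set operators make these operations nearly free:

a = {'row_1', 'row_2'}
b = {'row_2', 'row_3'}
print(a | b)                                          # union: {'row_1', 'row_2', 'row_3'}
print(a & b)                                          # intersection: {'row_2'}
print(a - b)                                          # difference: {'row_1'}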

It is also the case that I can do a cross-check against the descendants in the Generals typology (see CWPK #28 for a discussion of .descendants()). While I assume this typology contains all of the classes and parental definitions in KBpedia outside of KKO (and it should!), I can do a union across all non-Generals typologies and check whether they actually do.

So, with these arguments suggesting the worth of a general routine, we again pick up on our looping construct, and do both unions and intersections across an input deck of typologies. Because it is a bit simpler, we begin with unions:

def set_union(**build_deck):
    print('Beginning set union routine . . .')                            # Note 1
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    base = build_deck.get('base')
    short_base = build_deck.get('short_base')
    base_out = build_deck.get('base_out')
    ext = build_deck.get('ext')
    f_union = (short_base + 'union' + ext)
    filetemp = open(f_union, 'w+', encoding='utf8')                       # Note 2
    filetemp.truncate(0)
    filetemp.close()
    input_rows = []
    union_rows = []
    first_pass = 0                                                        # Note 3
    for loopval in loop_list:
        print('   . . . evaluating', loopval, 'using set union operations . . .')
        frag = loopval.replace('kko.','')
        f_input = (base + frag + ext)
        with open(f_input, 'r', encoding='utf8') as input_f:              # Note 4
            is_first_row = True
            reader = csv.DictReader(input_f, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])
            for row in reader:
                if row not in input_rows:                          
                    input_rows.append(row)
            if first_pass == 0:                                           # Note 3
                union_rows = input_rows
        with open(f_union, 'r', encoding='utf8', newline='') as union_f:  # Note 5
            is_first_row = True
            reader = csv.DictReader(union_f, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])
            for row in reader:
                if row not in union_rows:
                    if row not in input_rows:
                        union_rows.append(row)
                    union_rows = input_rows + union_rows                  # Note 6
        with open(f_union, 'w', encoding='utf8', newline='') as union_f:
            is_first_row = True
            u_writer = csv.DictWriter(union_f, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])                 
            u_writer.writerows(union_rows)                                # Note 5            
        first_pass = 1        
    print('Set union operation now complete.')        

Assuming your system is properly configured and you have run the start-up routines above, you can now Run the function (again passing the build settings from config.py):

set_union(**build_deck)

The beginning of this routine (1) is patterned after some of our prior routines. We do need to add creating an empty file or clearing out the prior one (‘union’) as we start the routine (2). We give it the mode of ‘w+’ because we may either be writing (creating) or reading it, depending on prior state. We also need to set a flag (3) so that we populate our first pass with the contents of the first file (since it is a union with itself).

We begin with the first file on our input list (4), and then loop over the next files in our list as new inputs to the routine. Each pass we continue to add to the ‘union’ file that is accumulating from prior passes (5). It is kind of amazing to think that all of this machinery is necessary to get to the simple union operation (6) at the core of the routine.

Here is now the intersection counterpart to that method:

def set_intersection(**build_deck):
    print('Beginning set intersection routine . . .')                     
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    base = build_deck.get('base')
    short_base = build_deck.get('short_base')
    base_out = build_deck.get('base_out')
    ext = build_deck.get('ext')
    f_intersection = (short_base + 'intersection' + ext)                  # Note 1
    filetemp = open(f_intersection, 'w+', encoding='utf8')                       
    filetemp.truncate(0)
    filetemp.close()
    first_pass = 0                                                        
    for loopval in loop_list:
        print('   . . . evaluating', loopval, 'using set intersection operations . . .')
        frag = loopval.replace('kko.','')
        f_input = (base + frag + ext)
        input_rows = set()                                                # Note 2
        intersection_rows = set()
        with open(f_input, 'r', encoding='utf8') as input_f:              
            input_rows = input_f.readlines()[1:]                          # Note 3
        with open(f_intersection, 'r', encoding='utf8', newline='') as intersection_f:
            if first_pass == 0:                                           
                intersection_rows = input_rows
            else:
                intersection_rows = intersection_f.readlines()[1:]
            intersection = list(set(intersection_rows) & set(input_rows)) # Note 2
        with open(f_intersection, 'w', encoding='utf8', newline='') as intersection_f:
            intersection_f.write('id,subClassOf,parent\n')
            for row in intersection:
                intersection_f.write('%s' % row)                          # Note 4                                     
        first_pass = 1        
    print('Set intersection operation now complete.') 

We have the same basic tee-up (1) as the prior routine, except we have changed our variable names from ‘union’ to ‘intersection’. I also wanted to use a set notation for dealing with intersections, so we needed to change our iteration basis (2) to sets, and the intersection algorithm also changed form. However, dealing with sets in the csv module reader proved to be too difficult for my skill set, since the row object of the csv module takes the form of a dictionary. So, I reverted to the standard reader and writer in Python (3), which enables us to read lines as a single list object. By going that route, however, we needed to start our iterator on the second row to skip the header (of ‘id‘, ‘subClassOf‘, ‘parent‘). Also, remember, Python indexing starts at 0, which is why the [1:] slice argument is added. Using the standard writer also means we need to iterate our write statement (4) over the set, with the older %s format allowing us to insert the row value as a string.

Again, assuming we have everything set up and configured properly, we can Run:

set_intersection(**build_deck)

Of course, the intersection of many datasets often results in empty (null) results. So, you are encouraged to use this utility with care and likely use the custom_dict specification in config.py for your specifications.

Transducers

One of the innovations in Clojure, our primary KBpedia development language, is transducers. The term is a portmanteau of ‘transform reducers’ and is a way to generalize and reduce the number of arguments in a tuple or iterated object. Transducers produce fast, processible data streams in a simple functional form, and can also be used to create a kind of domain-specific language (DSL) for functions. Either input streams or data vectors can be transformed in real time or offline to an internal representation. We believe transducers are a key source of the speed of our Clojure KBpedia build implementation.

Quick research suggests there are two leading options for transducers in Python. One was developed by Rich Hickey and Cognitect, Rich’s firm to support and promote Clojure, which he originated. Here are some references:

The second option embraces the transducer concept, but tries to develop it in a more ‘pythonic’ way:

I suspect I will return to this topic at a later point, possibly when some of the heavy lifting analytic tasks are underway. For now, I will skip doing anything immediately, even though there are likely noticeable performance benefits. I would rather continue the development of cowpoke without its influence, since transducers are in no way mainstream in the Python community. I will be better positioned to return to this topic after learning more.

Others

We could get fancier still with performance tests and optimizations, especially if we move to something like pandas or the vaex modules. Indeed, as we move forward with our installments, we will have occasion to pull in and document additional modules. For now, though, we have taken some steps to marginally improve our performance sufficient to continue on our KBpedia processing quest.

The Transition to Properties

I blithely assumed that once I got some of the memory and structure tests above behind us, I could readily return to my prior class_struct_builder routine, make some minor setting changes, and apply it to properties. My assumption was not accurate.

I ran into another perplexing morass of ontology namespaces, prefixes, file names, and ontology IRI names, all of which needed to be in sync. What I was doing to create the initial stub form worked, and new things could be brought in while the full ontology was in memory. But if the build routine needed to stop, as we just needed to do between loading classes and loading properties, the build would fail when started up again. The interim ontology version we were saving to file was not writing all of the information available to it in memory. Hahaha! Isn’t that how it sometimes works? Just when you are assuming smooth sailing, you hit a brick wall.

Needless to say, I got the issue worked out, with a summary of some of my findings on the owlready2 mailing list. Jean-Baptiste Lamy, the developer, is very responsive, and I assume some of the specific steps I needed to take in our use case may be generalized better in later versions of the software. Nonetheless, I needed to make those internal modifications and re-do the initial build steps in order to have the environment properly set to accept new property or class inputs. (In my years of experience with open-source software, one should expect occasional deadends or key parameters needing to be exposed, which will require workarounds. A responsive lead developer with an active project is therefore an important criterion in selecting ‘keystone‘ open-source software.)

After much experimentation, we were finally able to find the right combination of things in the morass. There are a couple of other places on the owlready2 mailing list where these issues are discussed. For now, the logjam has been broken and we can proceed with the property structure build routine.

Property Structure Ingest

Another blithe assumption that did not prove true was to be able to clone the class or typology build routines for properties. There is much that is different in the circumstances, which leads to a different (and simpler) code approach.

First, our extraction routines for properties only resulted in one structural file, not the many files that are characteristic of the classes and typologies. Second, all of our added properties tied directly into kko placeholders. The earlier steps of creating a new class, adding it to a parent, and then deleting the temporary class assignment could be simplified to a direct assignment to a property already in kko. This tremendously simplifies all property structure build steps.

We still need to be attentive to whether a given processing step uses strings or actual types, but nonetheless, our property build routines bear considerable resemblance to what we have done before.

Still, the shift from classes to properties, different sources for an interim build, and other specific changes suggested it was time to initiate a new practice of listing essential configuration settings as a header to certain key code blocks. We are now getting to the point where there are sufficient moving parts that proper settings before running a code block are essential. Running some code with wrong settings risks overwriting existing data without warning or backup. Again, always back up your current versions before running major routines.

This is the property_struct_builder routine, with code notes explained after the code block:

### KEY CONFIG SETTINGS (see build_deck in config.py) ###             # Note 1     
# 'kb_src'        : 'standard'                                        
# 'loop_list'     : custom_dict.values(),                            
# 'base'          : 'C:/1-PythonProjects/kbpedia/v300/build_ins/properties/'
# 'ext'           : '.csv',
# 'frag' for starting 'in_file' is specified in routine

def property_struct_builder(**build_deck):
    print('Beginning KBpedia property structure build . . .')
    kko_list = typol_dict.values()
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    class_loop = build_deck.get('property_loop')
    base = build_deck.get('base')
    ext = build_deck.get('ext')
    if loop != 'property_loop':                                       # '!=' is needed for string comparison
        print("Needs to be a 'property_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval)
        frag = 'struct_properties'                                    # Note 2
        in_file = (base + frag + ext)
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subPropertyOf', 'parent'])
            for row in reader:
                if is_first_row:
                    is_first_row = False                
                    continue
                r_id = row['id'] 
                r_parent = row['parent']
                value = r_parent.find('owl.')
                if value == 0:                                        # Note 3
                    continue
                value = r_id.find('rc.')
                if value == 0:
                    id_frag = r_id.replace('rc.', '')
                    parent_frag = r_parent.replace('kko.', '')
                    var2 = getattr(kko, parent_frag)                  # Note 4
                    with rc:                        
                        r_id = types.new_class(id_frag, (var2,))
    print('KBpedia property structure build is complete.')

IMPORTANT NOTE: To reiterate, I will preface some of the code examples in these CWPK installments with the operating configuration settings (1) shown at the top of the code listing. This is because simply running some routines based on prior settings in config.py may cause existing files in other directories to be corrupted or overwritten. Putting such notices up front, I hope, will be a better alerting technique than burying the settings within the general narrative. Plus, it makes it easier to get a routine up and running sooner.

Unlike for the class or typology builds, our earlier extractions of properties resulted in a single file, which makes our ingest process easier. We are able to set our file input variable of ‘frag’ to a single file variable (2). We also use a different string function, .find (3), to discover whether the object assignment is an existing property; it returns a location index number if found, but a ‘-1’ if not. (A boolean option to achieve the same end is the in operator.) And, like we have seen so many times to this point, we also need to invoke a method to evaluate a string value to its underlying type in the system (4).
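
A quick illustration of the two string tests just mentioned (the value is a hypothetical row entry):

r_parent = 'owl.topObjectProperty'                    # hypothetical row value
print(r_parent.find('owl.'))                          # 0: index where found; -1 if absent
print('owl.' in r_parent)                             # True: the boolean alternative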

This new routine allows us to now add properties to our baseline ‘rc’ ontology:

property_struct_builder(**build_deck)

And, if we like what we see, we can save it:

kb.save(file=r'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.owl', format="rdfxml") 

Well, our detour to deal with performance and other issues proved to be more difficult than when we first started that drive. As I look over the ground covered so far in this CWPK series, these last three installments have taken, on average, three times more time per installment than have all of the prior installments. Things have indeed felt stuck, but I also knew going in that closing the circle on the ’roundtrip’ was going to be one of the more demanding portions. And, so, it has proven. Still: Hooray! Making it fully around the bend is pretty sweet.

We lastly need to clean up a few loose ends on the structural side before we move on to adding annotations. Let’s finish up these structural specs in our next installment.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.
NOTE: This CWPK installment is available either as an online interactive file or as a direct download to use locally. Make sure to pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on September 23, 2020 at 9:42 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2381/cwpk-41-optimizations-and-property-structure-ingest/
Posted: September 22, 2020

We Build Up Our Ingest Routine to All Structure

Now that we have a template for structure builds in this Cooking with Python and KBpedia series, we continue to refine that template to generalize the routine and expand it to looping over multiple input files and to apply it to property structure as well. These are the topics we cover in this current installment, with a detour as I explain below.

In order to prep for today’s material, I encourage you to go back and look at the large routine we developed in the last installment. We can see three areas we need to address in order to generalize this routine:

  • First, last installment’s structure build routine (as designed) requires three passes to complete file ingest. Each one of those passes has a duplicate code section to convert our file input forms to required shorter versions. We would like to extract these duplicates as a helper function in order to lessen code complexity and improve readability
  • Second, we need a more generic way of specifying the input file or files to be processed by the routine, preferably including being able to loop over and process all of the files in a given input dictionary (as housed in config.py), and
  • Third, we would like to generalize the approach to dealing with class hierarchical structure to also deal with property ingest and hierarchical structure.

So, with these objectives in mind, let’s begin.

Adding a Helper Function

For reference, here is the code block in the prior installment that we repeat three times, and for which we would like to develop a helper function (BTW, this code block will not run here in isolation):

id = row['id']                                                 
parent = row['parent']                                         
id = id.replace('http://kbpedia.org/kko/rc/', 'rc.')          
id = id.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
id_frag = id.replace('rc.', '')
id_frag = id_frag.replace('kko.', '')
parent = parent.replace('http://kbpedia.org/kko/rc/', 'rc.') 
parent = parent.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
parent = parent.replace('owl:', 'owl.')
parent_frag = parent.replace('rc.', '')
parent_frag = parent_frag.replace('kko.', '')
parent_frag = parent_frag.replace('owl.', '')

We will call our helper function row_clean since its purpose is to convert the full IRIs of the CSV input rows to shorter forms required by owlready2 (sometimes object names with a namespace prefix, other times just with the shortened object name). We also need these to work on either the subject of the row (‘id’) or the object of the row (‘parent’ in this case). That leads to four combinations of 2 row objects by 2 shortened forms.

Note that the second argument (‘iss’) passed to the function below is a keyword argument, always shown with the equal sign in the function definition. Also note that if you assign the keyword argument a legitimate default value when it is defined (rather than the empty string shown), that value becomes the default assignment for that keyword and does not have to be supplied when the function is called. (NB: Indeed, many built-in Python functions have multiple arguments that are infrequently exposed. I have found it frequently helpful to do a dir() on functions to discover their broader capabilities.)

### Here is the helper function

def row_clean(value, iss=''):                                # arg values come from calling code
    if iss == 'i_id':                                        # check to see which replacement method
        value = value.replace('http://kbpedia.org/kko/rc/', 'rc.')           
        value = value.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        return value                                         # returns the calculated value to calling code
    if iss == 'i_id_frag':
        value = value.replace('http://kbpedia.org/kko/rc/', '')           
        value = value.replace('http://kbpedia.org/ontologies/kko#', '')
        return value
    if iss == 'i_parent':
        value = value.replace('http://kbpedia.org/kko/rc/', 'rc.')           
        value = value.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        value = value.replace('owl:', 'owl.')
        return value
    if iss == 'i_parent_frag':
        value = value.replace('http://kbpedia.org/kko/rc/', '')           
        value = value.replace('http://kbpedia.org/ontologies/kko#', '')
        value = value.replace('owl:', '')
        return value
        
### Here is the code we will put in the main calling routine:
        
# r_id = row['id']                                           # this is the version we will actually keep
r_id = 'http://kbpedia.org/kko/rc/AlarmSignal'               # temporary assignment just to test code
# r_parent = row['parent']
r_parent = 'http://kbpedia.org/kko/rc/SoundsByType'
id = row_clean(r_id, iss='i_id')                             # send the two arguments to helper function
id_frag = row_clean(r_id, iss='i_id_frag')
parent = row_clean(r_parent, iss='i_parent')
parent_frag = row_clean(r_parent, iss='i_parent_frag')

print('id:', id)                                             # temporary print to check if results OK
print('id_frag:', id_frag)
print('parent:', parent)
print('parent_frag:', parent_frag)

Because we have entered some direct test assignments, the code block above does Run (or shift+enter).

Note in the main calling routine code that to get our routine values we are calling the row_clean function and passing the required two arguments: the value for either the ‘id’ or ‘parent’ in that row, and whether we want prefixed or shortened fragments.

I strongly suspect there are better and shorter ways to remove this duplicate code, but this approach with a helper function, even in a less optimal form, still has cut the original code length in half (36 lines to 18 lines due to three duplicates). Expect to see a similar form to this in our code going forward. (NB: I am finding that looking for these duplicate code blocks is forcing me to learn function definitions and seek shorter but more expressive forms.)
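For what it is worth, one shorter, table-driven form (a hypothetical sketch only; I have not tested it against the full build files) would put the replacement pairs in a dictionary and loop over them:

replace_map = {
    'i_id'          : [('http://kbpedia.org/kko/rc/', 'rc.'),
                       ('http://kbpedia.org/ontologies/kko#', 'kko.')],
    'i_id_frag'     : [('http://kbpedia.org/kko/rc/', ''),
                       ('http://kbpedia.org/ontologies/kko#', '')],
    'i_parent'      : [('http://kbpedia.org/kko/rc/', 'rc.'),
                       ('http://kbpedia.org/ontologies/kko#', 'kko.'),
                       ('owl:', 'owl.')],
    'i_parent_frag' : [('http://kbpedia.org/kko/rc/', ''),
                       ('http://kbpedia.org/ontologies/kko#', ''),
                       ('owl:', '')],
    }

def row_clean(value, iss=''):
    for old, new in replace_map.get(iss, []):    # an unknown 'iss' returns the value unchanged
        value = value.replace(old, new)
    return value

# e.g., row_clean('http://kbpedia.org/kko/rc/AlarmSignal', iss='i_id') returns 'rc.AlarmSignal'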

Looping Over Files

If you recall our extraction steps of getting flat CSV files out of KBpedia in CWPK #28 to CWPK #35, we can end up with close to 100 extraction files. These splits encourage modularity and are easier to work on or substitute. Still, when it comes time to building KBpedia back up again after we complete a roundtrip, a complete build requires we process many files. We thus need looping routines across our build files to automate this process.

The first thought is to simply put groupings of files in individual directories and then point the routine at a directory and instruct it to loop over all files. If we have concerns that the directories may have more file types than we want to process with our current routine, we could also introduce some file name string checks to filter by name, fragment, or extension. These options would enable us to generalize a file looping routine to apply to many conditions.
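Were we to go this directory route, a minimal sketch (the in_dir location below is hypothetical) might use the standard glob module to filter by name fragment and extension:

import glob
import os

in_dir = 'C:/1-PythonProjects/kbpedia/v300/build_ins/typologies/'        # hypothetical directory
for in_file in glob.glob(os.path.join(in_dir, 'typol_*.csv')):           # filter by fragment and extension
    print('processing:', in_file)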

But I’ve decided to take a different approach. Since our extractions are driven by Python dictionaries, and we can direct those extractions to any directory prefix, we can re-use these same specifications for build processes. Should we later discover that a general file harvester makes sense, we can generalize at that time from this dictionary design. Also, by applying the same dictionary approach to extraction or building, we help reinforce our roundtripping mindset in how we name and process files.

So, we already have the unique names that distinguish our input classes (in the typol_dict dictionary in config.py) and our properties (in the prop_dict dictionary), and foresee using additional dictionaries going forward in this CWPK series. We need only enter a directory root and the appropriate dictionary to loop over the unique terms associated with our various building blocks. For classes, the typology listing is a great lookup.

We will take our generic class build template from the last installment, and put it into a function that loops over opening our file set, running the routine, and then saving to our desired output location. For now, to get the logic right, I will just set this up as a wrapper before actually plopping in the full build loop routine. (Note: we have to import a couple of modules because we have not yet fully set the environment for today’s installment):

from cowpoke.config import *
import csv                                                

def class_builder(**build_deck):
    print('Beginning KBpedia class structure build . . .')
    r_default = ''
    r_label = ''
    r_iri = ''
# probably want the run specification here (see CWPK #35 for render in struct_extractor)
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    class_loop = build_deck.get('class_loop')
    base = build_deck.get('base')
    ext = build_deck.get('ext')
    if loop != 'class_loop':
        print("Needs to be a 'class_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval)
        frag = loopval.replace('kko.','')
        in_file = (base + frag + ext)
        x = 1
        with open(in_file, mode='r', encoding='utf8') as input:                                           
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])                 
            for row in reader:
## Here is where we place the real class build routine                
                if x <= 2:
                    r_id = row['id']
                    r_parent = row['parent']
                    print(r_id, r_parent)
                    x = x + 1
        
class_builder(**build_deck)        

OK. We now know how to loop over our class build input files. Now, we can pick Kernel → Restart & Clear Output from the menu and confirm with the Restart and Clear All Outputs button (which should be a familiar red button to you if using Jupyter Notebook) to get ourselves to a clean starting place, and begin setting up our structure build environment.

Setting Up the Build Environment

As before with our extract routines, we now have a build_deck dictionary of build configuration settings in config.py. If you see some unfamiliar switches as we proceed through this build process, you may want to inspect that file. The settings are pretty close analogs to the same types of settings for our extractions, as specified in the run_deck dictionary. Most all of this code will migrate to the new build module.
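To make those settings concrete, here is roughly what such an entry in config.py might look like. This is illustrative only: the keys mirror the get() calls in the routines below, but the values (and the small typol_dict stand-in) are my assumptions, so treat your own config.py as authoritative:

typol_dict = {                                    # illustrative stand-in; the real one lives in config.py
    'AudioInfo' : 'kko.AudioInfo',
    'Generals'  : 'kko.Generals',
    }

build_deck = {
    'kb_src'    : 'standard',                     # which ontology files to load (see switch below)
    'loop'      : 'class_loop',                   # the type of build loop being run
    'loop_list' : typol_dict.values(),            # the typologies to iterate over
    'base'      : 'C:/1-PythonProjects/kbpedia/v300/build_ins/typologies/typol_',
    'ext'       : '.csv',                         # base + frag + ext composes each input file name
    }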

We begin by importing our necessary modules and setting our file settings for the build:

Which environment? The specific load routine you should choose below depends on whether you are using the online MyBinder service or local files. The example below is based on using local files, which given the complexity of the routines that are emerging, is probably your better choice. Make sure to modify the URIs for your local directories.
from owlready2 import * 
from cowpoke.config import *
# from cowpoke.__main__ import *
import csv                                                
import types

world = World()

kb_src = every_deck.get('kb_src')                         # we get the build setting from config.py
#kb_src = 'standard'                                      # we can also do quick tests with an override

if kb_src is None:
    kb_src = 'standard'
if kb_src == 'sandbox':
    kbpedia = 'C:/1-PythonProjects/kbpedia/sandbox/kbpedia_reference_concepts.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
elif kb_src == 'standard':
    kbpedia = 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kko.owl'
elif kb_src == 'start':
    kbpedia = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kbpedia_rc_stub.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kko.owl'
else:
    print('You have entered an inaccurate source parameter for the build.')
skos_file = 'http://www.w3.org/2004/02/skos/core' 
    

We load our ontologies into owlready2 and set our namespaces:

kb = world.get_ontology(kbpedia).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')               

#skos = world.get_ontology(skos_file).load()
#kb.imported_ontologies.append(skos)
#core = world.get_namespace('http://www.w3.org/2004/02/skos/core#')

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)
kko = kb.get_namespace('http://kbpedia.org/ontologies/kko#')

Since we’ve cleared memory and our workspace, we again add back in our new row_clean helper function:

def row_clean(value, iss=''):                                # arg values come from calling code
    if iss == 'i_id':                                        # check to see which replacement method
        value = value.replace('http://kbpedia.org/kko/rc/', 'rc.')           
        value = value.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        return value                                         # returns the calculated value to calling code
    if iss == 'i_id_frag':
        value = value.replace('http://kbpedia.org/kko/rc/', '')           
        value = value.replace('http://kbpedia.org/ontologies/kko#', '')
        return value
    if iss == 'i_parent':
        value = value.replace('http://kbpedia.org/kko/rc/', 'rc.')           
        value = value.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        value = value.replace('owl:', 'owl.')
        return value
    if iss == 'i_parent_frag':
        value = value.replace('http://kbpedia.org/kko/rc/', '')           
        value = value.replace('http://kbpedia.org/ontologies/kko#', '')
        value = value.replace('owl:', '')
        return value

Running the Complete Class Build

And then add our class build template to our new routine for iterating over all of our class input build files. CAUTION: processing all inputs to KBpedia, best done with the single assignment of the Generals typology (since all other typologies not already included in KKO are children of it), takes about 70 min on a conventional desktop.

You may notice that we made some slight changes to named variables in the draft template developed in the last installment:

  • src_file → in_file
  • csv_file → input

And, we have placed it into a defined function, class_struct_builder:

def class_struct_builder(**build_deck):                                    # Note 1
    print('Beginning KBpedia class structure build . . .')                 # Note 5
    kko_list = typol_dict.values()                                         # Note 2
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    class_loop = build_deck.get('class_loop')
    base = build_deck.get('base')
    ext = build_deck.get('ext')
    if loop != 'class_loop':
        print("Needs to be a 'class_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval)                              # Note 5
        frag = loopval.replace('kko.','')
        in_file = (base + frag + ext)
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])                 
            for row in reader:
                r_id = row['id'] 
                r_parent = row['parent']
                id = row_clean(r_id, iss='i_id')                           # Note 3
                id_frag = row_clean(r_id, iss='i_id_frag')
                parent = row_clean(r_parent, iss='i_parent')
                parent_frag = row_clean(r_parent, iss='i_parent_frag')
                if is_first_row:                                       
                    is_first_row = False
                    continue      
                with rc:                                                
                    kko_id = None
                    kko_frag = None
                    if parent_frag == 'Thing':                                                        
                        if id in kko_list:                                
                            kko_id = id
                            kko_frag = id_frag
                        else:    
                            id = types.new_class(id_frag, (Thing,))       
                if kko_id is not None:
                    with kko:                                                
                        kko_id = types.new_class(kko_frag, (Thing,))  
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])
            for row in reader:                                                
                r_id = row['id'] 
                r_parent = row['parent']
                id = row_clean(r_id, iss='i_id')
                id_frag = row_clean(r_id, iss='i_id_frag')
                parent = row_clean(r_parent, iss='i_parent')
                parent_frag = row_clean(r_parent, iss='i_parent_frag')
                if is_first_row:
                    is_first_row = False
                    continue          
                with rc:
                    kko_id = None                                   
                    kko_frag = None
                    kko_parent = None
                    kko_parent_frag = None
                    if parent_frag != 'Thing':
                        if id in kko_list:
                            continue
                        elif parent in kko_list:
                            kko_id = id
                            kko_frag = id_frag
                            kko_parent = parent
                            kko_parent_frag = parent_frag
                        else:   
                            var1 = getattr(rc, id_frag)               
                            var2 = getattr(rc, parent_frag)
                            if var2 is None:
                                continue
                            else:                                
                                var1.is_a.append(var2)
                if kko_parent is not None:
                    with kko:                
                        if kko_id in kko_list:                               
                            continue
                        else:
                            var1 = getattr(rc, kko_frag)
                            var2 = getattr(kko, kko_parent_frag)                     
                            var1.is_a.append(var2)
        with open(in_file, 'r', encoding='utf8') as input:                 # Note 4
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])
            for row in reader:                                              
                r_id = row['id'] 
                r_parent = row['parent']
                id = row_clean(r_id, iss='i_id')
                id_frag = row_clean(r_id, iss='i_id_frag')
                parent = row_clean(r_parent, iss='i_parent')
                parent_frag = row_clean(r_parent, iss='i_parent_frag')
                if is_first_row:
                    is_first_row = False
                    continue
                if parent_frag == 'Thing': 
# This is the new code section, replacing the commented out below          # Note 4                   
                    var1 = getattr(rc, id_frag)
                    var2 = getattr(owl, parent_frag)
                    try:
                        var1.is_a.remove(var2)
                    except Exception:
#                        var1 = getattr(kko, id_frag)
#                        print(var1)
#                        var1.is_a.remove(owl.Thing)
#                        print('Last step in removing Thing')
                        continue
#                    print(var1, var2)
#                    if id in thing_list:                                     
#                        continue
#                    else:
#                        if id in kko_list:                                    
#                            var1 = getattr(kko, id_frag)
#                            thing_list.add(id)
#                        else:                                                 
#                            var1 = getattr(rc, id_frag)
#                            var2 = getattr(owl, parent_frag)
#                            if var2 == None:
#                                print('Empty Thing:')
#                                print('var1:', var1, 'var2:', var2)                            
#                            try:
#                                var1.is_a.remove(var2)
#                            except ValueError:
#                                print('PROBLEM:')
#                                print('var1:', var1, 'var2:', var2)                
#                                if len(thing_list) == 0:
#                                    print('thing_list is empty.')
#                                else:
#                                    print(*thing_list)
#                                break
#                        print(var1, var2)
#                        thing_list.append(id)
#                        thing_list.add(id)
    out_file = 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/build_stop.csv'
    with open(out_file, 'w', encoding='utf8') as f:                        # the with statement closes the file
        print('KBpedia class structure build is complete.')
        f.write('KBpedia class structure build is complete.')              # Note 5

Our function call pulls up the same keyword argument passing that we discussed for the extraction routines earlier (1). The double asterisk (**build_deck) argument means to bring in any of that dictionary’s keyword values if referenced in the routine. We can readily pick up loop or lookup specifications by referencing a dictionary (2). The kko_list is a handy one since it gives us a basis for selecting between KKO objects and the reference concepts (RCs) in KBpedia. The revised routine above also brings in our new helper function (3).
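As a generic aside on that double-asterisk convention (a toy sketch, not cowpoke code):

def show_settings(**kwargs):
    for key, value in kwargs.items():        # ** packs the keyword arguments into a dict
        print(key, '=', value)

settings = {'loop': 'class_loop', 'ext': '.csv'}
show_settings(**settings)                    # ** unpacks the dict back into keyword arguments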

Pretty much the next portions of the routine are as described in the last installment, until we come up to Pass #3 (4), which is where we hit a major roadblock (coming up around the next bend in the road). We also added some print statements (5) that give feedback when the routine is running.

To run this file locally you will need to have the cowpoke project installed and know where to find your build_ins/typology directory. You also need to make sure your settings in config.py are properly set for your conditions. Assuming you have done so, you can invoke this routine (best with only a subset of your typology dictionary, assigned to, say, custom_dict):

class_struct_builder(**build_deck)

Realize everything has to be configured properly for this code to run. You will need to review earlier installments if you run into problems. Assuming you have gotten it to run to completion without error, you may want to then save it. We need to preface our ‘save’ statement with the ‘kb’ ontology identifier. I also have chosen to use the ‘working’ directory for saving these temporary results:

kb.save(file=r'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.owl', format='rdfxml') 

However, I ran into plenty of problems myself. Indeed, the large code block commented out above (4) caused me hours of fits trying to troubleshoot and get the routine to act as I wanted. This whole effort put up a roadblock in my plan, sufficient that I had to add another installment. I explain this detour next.

A Brief History of Going Crazy

If we set as an objective being able to specify multiple input files for a current build, a couple of issues immediately arise. Recall, we designed our typology extraction files to be self-contained, which means that every class used as an object must also be declared as its own class subject. To speed up our extractions, we do not keep track of the many objects needing such definitions. That means each encounter triggers another class definition. Multiple duplicate declarations do not cause a problem when loading the ontology, but when these files are used as specification inputs over multiple passes, some tricky problems arise.

One obvious contributor to the difficulty is the need to identify and separately keep track of (and sometimes differentially process) our ‘kko’ and ‘rc’ namespaces. We need to account for this distinction in every loop and every assignment or removal that we make to the ontology while building it in memory. That can all be trapped for during the class build cycle, which comprises the first two passes of the routine (first create the class, then add its parents), but it gets decidedly tricky when removing the excess owl:Thing declarations.

To appreciate this issue a bit, here is the basic statement for removing a ‘Bird’ class from a parent ‘Reptile’:

rc.Bird.is_a.remove(rc.Reptile)

Our inputs can not be strings, but within loops our variables often are strings, so they need to be converted to their class objects via calls such as var2 = getattr(rc, var2) before a method like var1.is_a.remove(var2) will work.

Unfortunately, when we make a rc.Bird.is_a.remove(rc.Reptile) request after the relationship has already been removed, the relationship is empty and owlready2 throws an error (as does Python when trying to remove an undeclared object). So, while we are able to extract without keeping track of duplicates, we eventually must keep track when it comes time to build. Thus, as each file is processed, we need to account for prior removals and make sure we do not make the request again.
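For reference, the bookkeeping pattern I was attempting looks roughly like this (a sketch only; as recounted below, I eventually abandoned it in favor of try/except):

thing_list = set()                        # accumulator of ids whose owl:Thing link is already removed

def already_removed(id):
    if id in thing_list:                  # skip a second removal request for the same id
        return True
    thing_list.add(id)                    # first encounter: record it and allow the removal
    return False

print(already_removed('rc.Bird'))         # False: first encounter, OK to remove
print(already_removed('rc.Bird'))         # True: subsequent encounters are skipped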

The later part of the code listing above (4) kept processing most of the files well, but not when too many were processed. I had the curious error of seeing the routine fail on the first entry of some files. It appeared to me perhaps the list accumulator I was using to keep track of prior removals was limited in size in some manner (it is not) or some counter or loop was not being cleared or initialized in the right location. If it ain’t perfect, it don’t run.

As a newbie with no prior experience to fall back on, here are some of the things I looked at and tested in trying to debug this Pass #3 owl:Thing deletion routine:

  • memory – was it a memory problem? Well, there are some performance issues that we take up in the next installment, but, no, Python seems to grab the memory it needs and does (apparently) a fair job of garbage cleanup. It was also not a problem with the notebook memory
  • loops – there are lots of ways to initiate loops or iterate over different structures from lists, sets, dictionaries, length and counters, etc. How loops are called and incremented differ by the iterator type chosen. I suspect this is where the issue still resides, because I continue to not have a native feel for:
    • sets v lists
    • clearing before loops
    • referencing the right loops
  • using the right fragment – the interplay of namespaces with scope is also not yet intuitive to me. Sometimes it is important to use the namespace prefixed reference to an object, other times not so. I am still learning about scope
  • not much worried about syntax because REPL was always running
  • list length limitations – I discussed this one above, as was able to eliminate it as the source
  • indentations – it is sometimes possible to put what one thinks is the closing statement to a routine at the wrong indentation, so that it runs, but is not affecting the correct code block. In my debugging efforts so far I often find this a source of the problem, especially when there is too much complexity or editing of the code. This is another reason to generalize duplicate code
  • code statement placement in order – in a similar way, counters and loop initializations can easily be placed into the wrong spots. The routine often may run, but still not do what you think it is, and
  • many others – I’m a newbie, right?

It was so frustrating trying to get this correct because I could get most everything working as I wanted, but then the routine would perhaps fail in the midst of processing a long list, or would complete but, upon inspection, had missed some items or treated them incorrectly.

What little I do know about such matters tells me to try to pinpoint and isolate the problem. When processing long lists, that means testing for possible error conditions and liberally sprinkling various print statements with different text and different echoing of current values to the screen. For example, in an else: condition of an if: statement, I might put a print like:

  print('In kko_list loop of None error trap:', var1, var2)

But pinpointing a problem does not indicate how to solve it, though it does help to narrow attention. I had done so in the routine above, but I was still erroring out on some files. Sometimes that would happen, and it was still unclear what the offending part might be. When Python errors like that, it provides an error message and traceback, but sometimes that information is cryptic. The failure point may occur any time after the last message to screen. Again, I was being pricked by needles in the haystack, but I still had not specifically found and removed them.

Error Trapping

I knew from my Python reading that it had a fairly good exception mechanism. Since print() statements were only taking me so far, I decided I needed to bite the bullet (for the needle pricks in my hand!) and start learning more about error trapping.

The basic approach for allowing a program to continue to run when an error condition is met is through the Python exception. It basically looks like this kind of routine:

statement1 = 10
statement2 = 0
try:
    non_zero = statement1 / statement2
except ZeroDivisionError:
    print('Oops, dividing by 0!')

I was exploring this more graceful way to treat errors when I realized, duh, that same approach also captured exactly what I was trying to accomplish with avoiding multiple deletions in the first place! That is, I could continue to ‘try’ to delete the next instance of the owl:Thing assignment, and if it had already been deleted (the very condition that throws the exception I was trying to fix!), I could exit gracefully and move on. Further, this would allow me to embed specific print() statements at the exact point of failure.

After this aHa! I changed the code as shown above (4). I suspect it is a slow way to process the huge numbers I have, but it works. I will continue to look for better means, but at least with this approach I was able to move on with the project.
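Distilled to a runnable toy (a plain Python list standing in for owlready2’s is_a collection), the principle is:

parents = ['owl.Thing']                  # stand-in for a class's is_a listing
try:
    parents.remove('owl.Thing')          # the first removal succeeds
    parents.remove('owl.Thing')          # a second attempt raises ValueError . . .
except ValueError:
    print('already removed; moving on')  # . . . which we catch so processing can continue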

Still, whether for this reason or others not yet contemplated, once we start processing huge numbers with multiple KBpedia build files, I am seeing performance much slower than what I would like. We address those topics in the next installment, which will also cause us to detour still further before we can get back on track to completing our property structure additions to the build.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on September 22, 2020 at 10:19 am in CWPK, KBpedia, Semantic Web Tools | Comments (3)
The URI link reference to this post is: https://www.mkbergman.com/2379/cwpk-40-looping-and-multiple-structure-file-ingest/
The URI to trackback this post is: https://www.mkbergman.com/2379/cwpk-40-looping-and-multiple-structure-file-ingest/trackback/
Posted:September 21, 2020

Builds Are a More Complicated Workflow than Extractions

In installment CWPK #37 to this Cooking with Python and KBpedia series, we looked at roundtripping information flows and their relation to versions and broad directory structure. From a strict order-of-execution basis, one of the first steps in the build process is to check and vet the input files to make sure they are clean and conformant. Let’s keep that thought in mind, and it is a topic we will address in CWPK #45. But we are going to skip over that topic for now so that we do not add complexity to figuring out the basis of what we need from a backbone standpoint of a build.

By “build” I have been using a single term to skirt around a number of issues (or options). We can “build” our system from scratch at all times, which is the simplest conceptual one in that we go from A to Z, processing every necessary point in between. But, sometimes a “build” is not a wholesale replacement, but an incremental one. It might be incremental because we are simply adding, say, a single new typology or doing other interim updates. The routines or methods we need to write should accommodate these real-world use cases.

We also have the question of vetting our builds to ensure they are syntactically and logically coherent. The actual process of completing a vetted, ‘successful’ build may require multiple iterations and successively more refined tweaks to the input specifications in order to get our knowledge graphs to a point of ‘acceptance’. These last iterating steps of successive refinements all follow the same “build” steps, but ones which involve fewer and fewer fixes until ‘acceptance’ when the knowledge graph is deemed ready for public release. In my own experience following these general build steps over nearly a decade now, there may be tens or more “builds” necessary to bring a new version to public release. These steps may not be needed in your own circumstance, but a useful generic build process should anticipate them.

Further, as we explore these issues in some depth, we will find weaknesses or problems in our earlier extraction flows. We still want to find generic patterns and routines, but as we add these ones related to “builds” that will also cause some reachback into the assumptions and approaches we earlier used for extractions (CWPK #29 to CWPK #35). This installment thus begins with input-output (I/O) considerations from this “build” perspective.

Basic I/O Considerations

It seems like I spend much time in this series on code and directory architecting. I’m sure one reason is that it is easier for me to write than code! But, seriously, it is also the case that thinking through use cases or trying to capture prior workflows is the best way I know to write responsive code. So, in the case of our first leg regarding extraction, we had a fairly simple task: we had information in our knowledge graphs that we needed to get out in a usable file format for external applications (and roundtripping). If our file names and directory locations were not exactly correct, no big deal: We can easily manipulate these flat-text files and move them to other places as needed.

That is not so with the builds. First, the external files used in a build can only be ingested if they can be read and “understood” by our knowledge graph. Second, as I touched on above, sometimes these builds may be total (or ‘full’ or ‘start’), sometimes they may be incremental (or ‘fix’). We want generic input-and-output routines to reflect these differences.

In a ‘full’ build scenario, we need to start with a core, bootstrap, skeletal ontology to which we add concepts (classes) and predicates (properties) until the structure is complete (or nearly so), and then to add annotations to all of that input. In a ‘fix’ build scenario, we need to start with an already constructed ‘full’ graph, and then make modifications (updates, deletes, additions) to it. How much cleaning or logical testing we may do may vary by these steps. We also need to invoke and write files to differing locations to embrace these options.

To make this complexity a bit simpler, like we have done before, we will make a couple of simplifying choices. First, we will use a directory where all build input files reside, which we call build_ins. This directory is the location where we first put the files extracted from a prior version to be used as the starting basis for the new version (see Figure 2 in CWPK #37). It is also the directory where we place our starting ontology files, the stubs, that bootstrap the locations for new properties and classes to be added. We also place our fixes inputs into this directory.

Second, the result of our various build steps will generally be placed into a single sub-directory, the targets directory. This directory is the source for all completed builds used for analysis and extractions for external uses and new builds. It is also the source of the knowledge graph input when we are in an incremental update or ‘fix’ mode, since we desire to modify the current build in-progress, not always start from scratch. The targets directory is also the appropriate location for logging, statistics, and working ‘scratchpad’ subdirectories while we are working on a given build.

To this structure I also add a sandbox directory for experiments, etc., that do not fall within a conventional build paradigm. The sandbox material can either be total scratch or copied manually to other locations if there is some other value.

Please see Figure 2 in CWPK #37 to see the complete enumeration of these directory structures.

Basic I/O Routines

Similar to what we did with the extraction side of the roundtrip, we will begin our structural builds (and the annotation ones two installments hence) in the interactive format of Jupyter Notebook. We will be able to progress cell-by-cell, Running these cells or invoking them with the shift+enter convention. After our cleaning routines in CWPK #45, we will then be able to embed these interactive routines into build and clean modules in CWPK #47 as part of the cowpoke package.

From the get-go with the build module we need to have a more flexible load routine for cowpoke that enables us to specify different sources and targets for the specific build, the inputs-outputs, or I/O. We had already discovered in the extraction routines that we needed to bring three ontologies into our project namespace: KKO, the reference concepts of KBpedia, and SKOS. We may also need to differentiate ‘start’ v ‘fix’ wrinkles in our builds. That leads to three different combinations of source and target for our basic “build” I/O: ‘standard’ (same as ‘fixes’), ‘start’, and our optional ‘sandbox’:

Which environment? The specific load routine you should choose below depends on whether you are using the online MyBinder service (the ‘raw’ version) or local files. The example below is based on using local files (though replace with your own local directory specification).
from owlready2 import * 
from cowpoke.config import *
# from cowpoke.__main__ import *
import csv                                                # we import all modules used in subsequent steps

world = World()

kko = []
kb = []
rc = []
core = []
skos = []
kb_src = build_deck.get('kb_src')                         # we get the build setting from config.py

if kb_src is None:
    kb_src = 'standard'
if kb_src == 'sandbox':
    kbpedia = 'C:/1-PythonProjects/kbpedia/sandbox/kbpedia_reference_concepts.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
elif kb_src == 'standard':
    kbpedia = 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kko.owl'
elif kb_src == 'start':
    kbpedia = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kbpedia_rc_stub.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kko.owl'
else:
    print('You have entered an inaccurate source parameter for the build.')
skos_file = 'http://www.w3.org/2004/02/skos/core' 
    

(NOTE: We later add an ‘extract’ option to the above to integrate this with our earlier extraction routines.)

As I covered in CWPK #21, two tricky areas in this project are related to scope. The first tricky area relates to the internal Python scope of LEGB, which stands for local → enclosed → global → built-in and means that objects declared on the right are available to the left, but not left to right for the arrows shown. Care is thus needed about how information gets passed between Python program components. So, yes, a bit of that trickiness is in play with this installment, but the broader issues pertain to the second tricky area.
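Before moving on, here is a tiny illustration of that LEGB ordering (generic Python, not cowpoke-specific):

x = 'global'                 # global scope

def outer():
    x = 'enclosed'           # enclosing scope
    def inner():
        x = 'local'          # the local name wins over enclosed, global, and built-in
        print(x)
    inner()

outer()                      # prints 'local'
print(x)                     # the global x is untouched; prints 'global'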

The second area is the interplay of imports, ontologies, and namespaces within owlready2, plus its own internal ‘world’ namespace.

I have struggled to get these distinctions right, and I’m still not sure I have all of the nuances down or correct. But, here are some things I have learned in cowpoke.

First, when loading an ontology, I give it a ‘world’ namespace assigned to the ‘World’ internal global namespace for owlready2. Since I am only doing cowpoke-related development in Python at a given time, I can afford to claim the entire space and perhaps lessen other naming problems. Maybe this is superfluous, but I have found it to be a recipe that works for me.

Second, when one imports an ontology into the working ontology (declaring the working ontology being step one), all ontologies available to the import are available to the working ontology. However, if one wants to modify or add items to these imported ontologies, each one needs to be explicitly declared, as is done for skos and kko in our current effort.

Third, it is essential to declare the namespaces for these imports under the current working ontology. Then, from that point forward, it is also essential to be cognizant that these separate namespaces need to be addressed explicitly. In the case of cowpoke and KBpedia, for example, we have classes from our governing upper ontology, KKO (also with namespace ‘kko‘) and the reference concepts of the full KBpedia (namespace ‘rc‘). More than one namespace in the working ontology does complicate matters quite a bit, but that is also the more realistic architecture and design approach. Part of the nature of semantic technologies is to promote interoperability among multiple knowledge graphs or ontologies, each of which will have at least one of its own namespaces. To do meaningful work across ontologies, it is important to understand these ontology ← → namespace distinctions.

This is how these assignments needed to work out for our build routines based on these considerations:

kb = world.get_ontology(kbpedia).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')                # need to make sure we set the namespace

skos = world.get_ontology(skos_file).load()
kb.imported_ontologies.append(skos)
core = world.get_namespace('http://www.w3.org/2004/02/skos/core#')

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)
kko = kb.get_namespace('http://kbpedia.org/ontologies/kko#')       # need to assign namespace to main onto ('kb')

Now that we have set up our initial build switches and defined our ontologies and related namespaces, we are ready to construct the code for our first build attempt. In this instance, we will be working with only a single class structure input file to the build, typol_AudioInfo.csv, which according to our ‘start’ build switch (see above) is found in the kbpedia/v300/build_ins/typologies/ directory under our project location.

The routine below needs to go through three different passes (at least as I have naively specified it!), and is fairly complicated. There are quite a few notes below the code listing explaining some of these steps. Also note we will be defining this code block as a function and the import types statement will be moved to the header in our eventual build module:

import types

src_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/typologies/typol_AudioInfo.csv'
kko_list = typol_dict.values()
with open(src_file, 'r', encoding='utf8') as csv_file:                 # Note 1
    is_first_row = True
    reader = csv.DictReader(csv_file, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])                 
    for row in reader:                                                 ## Note 2: Pass 1: register class
        id = row['id']                                                 # Note 3
        parent = row['parent']                                         # Note 3
        id = id.replace('http://kbpedia.org/kko/rc/', 'rc.')           # Note 4
        id = id.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        id_frag = id.replace('rc.', '')
        id_frag = id_frag.replace('kko.', '')
        parent = parent.replace('http://kbpedia.org/kko/rc/', 'rc.') 
        parent = parent.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        parent = parent.replace('owl:', 'owl.')
        parent_frag = parent.replace('rc.', '')
        parent_frag = parent_frag.replace('kko.', '')
        parent_frag = parent_frag.replace('owl.', '')
        if is_first_row:                                               # Note 5
            is_first_row = False
            continue      
        with rc:                                                       # Note 6
            kko_id = None
            kko_frag = None
            if parent_frag == 'Thing':                                 # Note 7                               
                if id in kko_list:                                     # Note 8
                    kko_id = id
                    kko_frag = id_frag
                else:    
                    id = types.new_class(id_frag, (Thing,))            # Note 6
        if kko_id is not None:                                         # Note 8
            with kko:                                                  # same form as Note 6
                kko_id = types.new_class(kko_frag, (Thing,))  
with open(src_file, 'r', encoding='utf8') as csv_file:
    is_first_row = True
    reader = csv.DictReader(csv_file, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])
    for row in reader:                                                 ## Note 2: Pass 2: assign parent
        id = row['id']
        parent = row['parent']
        id = id.replace('http://kbpedia.org/kko/rc/', 'rc.')           # Note 4
        id = id.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        id_frag = id.replace('rc.', '')
        id_frag = id_frag.replace('kko.', '')
        parent = parent.replace('http://kbpedia.org/kko/rc/', 'rc.') 
        parent = parent.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        parent = parent.replace('owl:', 'owl.')
        parent_frag = parent.replace('rc.', '')
        parent_frag = parent_frag.replace('kko.', '')
        parent_frag = parent_frag.replace('owl.', '')
        if is_first_row:
            is_first_row = False
            continue          
        with rc:
            kko_id = None                                              # Note 9
            kko_frag = None
            kko_parent = None
            kko_parent_frag = None
            if parent_frag != 'Thing':                                 # Note 10
                if parent in kko_list:
                    kko_id = id
                    kko_frag = id_frag
                    kko_parent = parent
                    kko_parent_frag = parent_frag
                else:   
                    var1 = getattr(rc, id_frag)                        # Note 11
                    var2 = getattr(rc, parent_frag)
                    if var2 is None:                                   # Note 12
                        continue
                    else:
                        var1.is_a.append(var2)                         # Note 13
        if kko_parent is not None:                                     # Note 14
            with kko:                
                if kko_id in kko_list:                                 # Note 15
                    continue
                else:
                    var1 = getattr(rc, kko_frag)                       # Note 16
                    var2 = getattr(kko, kko_parent_frag)
                    var1.is_a.append(var2)
thing_list = []                                                        # Note 17
with open(src_file, 'r', encoding='utf8') as csv_file:
    is_first_row = True
    reader = csv.DictReader(csv_file, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])
    for row in reader:                                                 ## Note 2: Pass 3: remove owl.Thing
        id = row['id']
        parent = row['parent']
        id = id.replace('http://kbpedia.org/kko/rc/', 'rc.')           # Note 4
        id = id.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        id_frag = id.replace('rc.', '')
        id_frag = id_frag.replace('kko.', '')
        parent = parent.replace('http://kbpedia.org/kko/rc/', 'rc.') 
        parent = parent.replace('http://kbpedia.org/ontologies/kko#', 'kko.')
        parent = parent.replace('owl:', 'owl.')
        parent_frag = parent.replace('rc.', '')
        parent_frag = parent_frag.replace('kko.', '')
        parent_frag = parent_frag.replace('owl.', '')
        if is_first_row:
            is_first_row = False
            continue
        if parent_frag == 'Thing':                                     # Note 18
            if id in thing_list:                                       # Note 17
                continue
            else:
                if id in kko_list:                                     # Note 19
                    var1 = getattr(kko, id_frag)
                    thing_list.append(id)
                else:                                                  # Note 19
                    var1 = getattr(rc, id_frag)
                    var1.is_a.remove(owl.Thing)
                    thing_list.append(id)

The code block above was the most challenging to date in this CWPK series. Some of the lessons from working this out are offered in CWPK #21. Here are the notes that correspond to some of the statements made in the code above:

  1. This is a fairly standard CSV processing routine. However, note the ‘fieldnames’ that are assigned, which give us a basis as the routine proceeds to pick out individual column values by row

  2. Each file processed requires three passes: Pass #1 – registers each new item in the source file as a bona fide owl:Class; Pass #2 – each new item, now properly registered to the system, is assigned its parent class; and Pass #3 – each of the new items has its direct assignment to owl:Class removed to provide a cleaner hierarchy layout

  3. We are assigning each row value to a local variable for processing during the loop

  4. In this, and in the lines to follow, we are reducing the class string and its parent string from potentially its full IRI string to prefix + Name. This gives us the flexibility to have different format input files. We will eventually pull this repeated code each loop out into its own function

  5. This is a standard approach in CSV file processing to skip the first header row in the file

  6. There are a few methods apparently possible in owlready2 for assigning a class, but this form of looping over the ontology using the ‘rc‘ namespace is the only version I was able to get to work successfully, with the assignment statement as shown in the second part of this method. Note the assignment to ‘Thing’ is in the form of a tuple, which is why there is a trailing comma

  7. Via this check, we only pick up the initial class declarations in our input file, and skip over all of the others that set actual direct parents (which we deal with in Pass #2)

  8. We check all of our input rows to see if the row class is already in our kko dictionary (kko_list, set above the routine) or not. If it is a kko.Class, we assign the row information to a new variable, which we then process outside of the ‘rc’ loop so as to not get the namespaces confused

  9. Initializing all of this loops variables to ‘None’

  10. Same processing checks as for Pass #1, except now we are checking on the parent values

  11. This is an owlready2 tip, and a critical one, for getting a class type value from a string input; without this, the class assignment method (Note 13) fails

  12. If var2 is not in the ‘rc‘ namespace (in other words, it is in ‘kko‘), we skip the parent assignment in the ‘rc‘ loop

  13. This is another owlready2 method for assigning a class to a parent class. In this loop given the checks performed, both parent and id are in the ‘rc‘ namespace

  14. As for Pass #1, we are now processing the ‘kko‘ namespace items outside of the ‘rc‘ namespace and in its own ‘kko‘ namespace

  15. We earlier picked up rows with parents in the ‘kko‘ namespace; via this call, we also exclude rows with a ‘kko‘ id as well, since our imported KKO ontology already has all kko class assignments set

  16. We use the same getattr lookup (Note #11) and parent class assignment method (Note #13), but now for ids in the ‘rc‘ namespace and parents in the ‘kko‘ namespace. However, the routine so far also results in a long listing of classes directly under the owl:Thing root (1) in an ontology editor such as Protégé:

Figure 1: Class Import with Duplicate owl:Thing Assignments
  17. We use a ‘thing_list’, and assign it as an empty list at the beginning of the Pass #3 routine, because we will be deleting class assignments to owl:Thing. There may be multiple declarations in our build file, but we may only delete the assignment once from the knowledge base. The lookup to ‘thing_list’ prevents us from erroring when trying to delete a second or more times

  18. We are selecting on ‘Thing’ because we want to unassign all of the temporary owl:Thing class assignments needed to provide placeholders in Pass #1 (Note: recall in our structure extractor routines in CWPK #28 we added an extra assignment to add an owl:Thing class definition so that all classes in the extracted files could be recognized and loaded by external ontology editors)

  19. We differentiate between ‘rc‘ and ‘kko‘ concepts because the kko ones are defined separately in the KKO ontology, used as one of our build stubs.

As you run this routine in real time from Jupyter Notebook, you can check what has been removed by inspecting:

list(thing_list)

We can now inspect this loading of an individual typology into our stub. We need to preface our ‘save’ statement with the ‘kb’ ontology identifier. I also have chosen to use the ‘working’ directory for saving these temporary results:

kb.save(file=r'C:/1-PythonProjects/kbpedia/v300/build_ins/working/kbpedia_reference_concepts.owl', format="rdfxml") 

So, phew! After much time and trial, I was able to get this code running successfully! Here is the output of the full routine:

Figure 2: Class Import with Proper Hierarchical Placement

We can see that our flat listing under the root is now gone (1) and all concepts are properly organized according to the proper structure in the KKO hierarchy (2).

We now have a template for looping over multiple typologies to contribute to a build as well as to bring in KBpedia’s property structure. These are the topics of our next installment.

Additional Documentation

There were dozens of small issues and problems that arose in working out the routine above. Here are some resources that were especially helpful in informing that effort:

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on September 21, 2020 at 10:16 am in CWPK, KBpedia, Semantic Web Tools | Comments (1)
The URI link reference to this post is: https://www.mkbergman.com/2378/cwpk-39-i-o-and-structural-ingest/
The URI to trackback this post is: https://www.mkbergman.com/2378/cwpk-39-i-o-and-structural-ingest/trackback/
Posted:September 17, 2020

This Installment Defines KBpedia’s ‘Bootstraps’

We begin the build process with this installment in the Cooking with Python and KBpedia series. We do so by creating the ‘bootstraps‘ of ‘core’ ontologies that are the targets as we ingest new classes, properties, and annotations for KBpedia. The general process we outline herein is appropriate to building any large knowledge graph. You may swap out your own starting ontology and semantic scaffoldings to apply this process to different knowledge graphs.

The idea of a ‘bootstrap’ in computer science means a core set of rudimentary instructions that is called at immediate initialization of a program. This bootstrapped core provides all of the parent instructions that are called by the subsequent applications that actually do the desired computer tasks. The bootstrap is the way those applications can perform basic binary operations like allocating registers, creating files, pushing or popping instructions to the stack, and other low-level functions.

In the case of KBpedia and our approach to the build process for knowledge graphs, the ‘bootstrap’ is the basic calls to the semantic languages such as RDF or OWL and the creation of a top-level parental set of classes and properties to which we connect the subsequent knowledge graph content. We call these starting bootstraps ‘stubs’.

These ‘stubs’ are created outside of the build process, generally using an ontology IDE like Protégé. In our case, we have already created the ‘stubs’ used in the various KBpedia build processes. As we create new versions, we must make some minor modifications to these ‘stubs’. However, in general, the stubs are rather static in nature and may only rarely need to be changed in a material manner. As you will see from inspection, these stubs are minimal in structure and rather easy to create on your own with your own favorite ontology editor.

The KBpedia build processes use one core ontology stub, the KBpedia Knowledge Ontology (KKO) and two supporting stubs for use in building the full KBpedia knowledge graph or individual typologies.

Overview of the Build Process

We set up a new directory structure with appropriate starting files as the first activity. The build then starts with a pre-ingest step of checking our input files for proper encoding and other ‘cleaning’ tests. Upon passing these checks, we are ready to continue with the build.

The build process begins by loading the stub. This loaded stub then becomes the target for all subsequent ingest steps.

The ingest process has two phases. In the first phase we ingest build files that specify the structural nature of the knowledge graph, in this case, KBpedia. This structural scaffolding consists of, first, class statements, and then object property or data property ‘is-a’ statements. In the case of classes, the binding predicate is the rdfs:subClassOf property. In the case of properties, it is the rdfs:subPropertyOf property.
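For orientation, here is roughly what a structural input row looks like in one of these build files. The column names come from the fieldnames used in the ingest code below; the exact value of the middle column is my assumption about the extraction format:

id,subClassOf,parent
http://kbpedia.org/kko/rc/AlarmSignal,rdfs:subClassOf,http://kbpedia.org/kko/rc/SoundsByType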

This phase sets the structure over which we can reason and infer with the knowledge graph. Thus, we also have the optional steps in this phase to check whether our ingests have been consistent and satisfiable. If the structural scaffolding meets these tests, we are ready for the second phase.

The second phase is to bring in the many annotations that we have gathered for the classes and properties. A description and preferred label are requirements for each item. These are best supplemented with alternative labels (synonyms in the broadest sense) and other properties. We can then load either mapping or additional annotation properties should we desire them.

These steps are not inviolate. Files that we know are clean can skip the pre-clean steps, for example. Or, we may already have a completed and vetted knowledge graph to which we only want to supplement some information. In other words, the build routines can also be used in different orders and with only partial input sets once we have a working system.

Steps to Prep

We will assume that you have already done your offline work to add to or modify your build input files. (As we proceed installment-by-installment during this build discussion we will provide a listing of required files as appropriate.) Depending on the given project, working on these offline build files may actually represent the bulk of your overall efforts. You might be querying outside sources to add to annotations, or changing or adding to your knowledge graph’s structure, or trying new top-level ontologies, etc., etc.

Once you deem this offline work to be complete, you need to do some prep to support the new build process (which in the simplest case are the extraction files we just discussed in this CWPK series). Your first task is to create a new skeletal directory structure under a new version parent, similar to what is shown in Figure 2 in the prior CWPK #37 installment. One way to avoid typing in all new directory names is to copy a prior version directory to the new version location and then delete the irrelevant files. (Further, if you know you may do this multiple times, you may then copy this shell structure for later use for subsequent versions.)
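Here is a small sketch of that copy-and-prune idea using only the standard library; the version directory names are hypothetical:

import os
import shutil

old_version = r'C:\1-PythonProjects\kbpedia\v250'            # hypothetical prior version
new_version = r'C:\1-PythonProjects\kbpedia\v310'            # hypothetical new version

# Copy the directory tree but ignore all files, leaving an empty skeleton
shutil.copytree(old_version, new_version,
                ignore=lambda src, names: [n for n in names
                        if os.path.isfile(os.path.join(src, n))])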

You then need to copy all of the stub files from the prior version to the new ‘stubs’ directory. Depending on what you have been doing locally, you may need to make further changes to match your working preferences.

Each stub file then needs to be brought into an ontology editor (Protégé, of course, in our case) and updated with the new version number, as this diagram indicates:

Figure 1: Making Version Changes to KKO

Note that every ontology has a base IRI, and you should update its reference or version number (http://kbpedia.org/kbpedia/v250 in our case) (1) in the ontology URI field. You then need to copy the text under your current owl:versionInfo annotation and paste it into a new owl:priorVersion (2) annotation. You may need to make some minor edits to cast that text in the past tense for the prior version. Then, last, you need to update the owl:versionInfo (3) annotation.

You may, of course, make other ontology metadata changes at this time.
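Though we make these edits in Protégé, the same annotations can also be touched programmatically. Here is a hedged sketch assuming owlready2’s ontology metadata interface and its predefined versionInfo and priorVersion annotations; the version text is hypothetical:

# Load the KKO stub and shuffle its version annotations (hypothetical text)
kko_onto = world.get_ontology(kko_file).load()

kko_onto.metadata.priorVersion = list(kko_onto.metadata.versionInfo)   # retire the old text
kko_onto.metadata.versionInfo = ['KKO version 3.00']                   # set the new version

kko_onto.save(file=kko_file, format='rdfxml')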

KKO: The Core Stub

The KKO stub is the core one for the build process. It is a standalone ontology in its own right, but it also serves as the top-level ontology for KBpedia.

KKO is also the most likely of the three stubs to need modification before a new run. Recall that KKO is organized under three main branches corresponding to the universal categories of Charles Sanders Peirce. Two of the branches, Monads and Particulars, do not participate in a KBpedia build. (Future releases of KKO may affect these branches, in which case the KKO stub should be updated.) But the third branch, Generals, is very much involved in a KBpedia build. All roots (parents) of KBpedia’s typologies tie in under the Generals branch.

You will need, then, to make changes to the Generals branch of KKO prior to starting a build if either of these conditions is met:

  1. You are dropping or removing any typologies or SuperTypes, or
  2. You are adding any typologies or SuperTypes.

If you are only modifying a typology, you need not change KKO. Loading the modified typology during the full build process will accomplish this modification.
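One quick way to judge whether KKO needs such changes is to inspect the current tie-ins. This small sketch lists the typology roots presently under the Generals branch, assuming the kko namespace from our start-up:

# List the current typology roots (SuperTypes) under the Generals branch
for typology_root in kko.Generals.subclasses():
    print(typology_root)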

As with the other two stubs, make sure you have updated your version references. As distributed with cowpoke as part of these CWPK installments, here is the KKO stub as used in this project (remember, to see the file, choose Run from the notebook menu or press shift+enter while the cell is highlighted):

Note: You may obtain the three ‘stub’ files used in this installment from https://github.com/Cognonto/CWPK/tree/master/sandbox/builds/stubs. Make sure to use the ones with the *.owl extension.
# Print the raw OWL (RDF/XML) contents of the KKO stub
with open(r'C:\1-PythonProjects\kbpedia\v300\build_ins\stubs\kko.owl', 'r', encoding='utf8') as f:
    print(f.read())

The KBpedia Stub

The KBpedia stub is the ‘umbrella’ above the entire project. It incorporates the KKO stub, and it is the general target for all subsequent build steps in the full-build process. When viewed in code view, as the file below shows, this ‘umbrella’ is rather sparse. However, if you look at it in, say, Protégé, you will also see all of KKO because it is imported.

Again, the KBpedia stub should have its version updated prior to a new version build:

# Print the raw OWL (RDF/XML) contents of the KBpedia stub
with open(r'C:\1-PythonProjects\kbpedia\v300\build_ins\stubs\kbpedia_rc_stub.owl', 'r', encoding='utf8') as f:
    print(f.read())

The Typology Stub

The typology stub is the simplest of the three. Its use is merely to provide a ‘header’ sufficient for loading an individual typology into an editor such as Protégé.

However, despite being listed last, it is the typology stub we will first work with in developing our build routines, because it is our simplest possible starting point. Again, assuming you have made your version updates, here is the file:

# Print the raw OWL (RDF/XML) contents of the typology stub
with open(r'C:\1-PythonProjects\kbpedia\v300\build_ins\stubs\typology_stub.owl', 'r', encoding='utf8') as f:
    print(f.read())

OK, so our stubs are now updated and set up. We are ready to begin some ingest coding . . . .

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure to pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.
