Segregating the Structure and Looking for Orphans
We have progressed through these build portions of the Cooking with Python and KBpedia series to capture the bulk of the structure in KBpedia by defining its classes, properties, and the hierarchical relationships among them. We have, so to speak, tossed all of the components into the bin, and have mostly defined our knowledge structure’s scaffolding. But we still lack some structural definitions and analysis prior to beginning the testing for whether this structure is coherent or not. Today’s installment directly addresses these gaps.
You will note we still, as yet, have not done anything to annotate our concepts or predicates. That is OK, and we will hold off for a bit further, because annotations are all trappings useful for humans and language to interact with the knowledge graph. It is the structural aspects alone that set the logical framework of the knowledge graph. We will settle questions about this prior to adding labels, definitions, and alternative terms to KBpedia.
Say Goodbye to the Start-up
This is the last installment that we will begin with our standard start-up routine. As needed, our installments will from here on begin with standard Python module import statements. We will be moving our start-up routine into
cowpoke.__main__ import and removing that comment below. We also have added the ‘extract’ switch, as we first described a couple of installments back:
from owlready2 import *
from cowpoke.config import *
# from cowpoke.__main__ import *
import csv
import types

world = World()

kb_src = master_deck.get('kb_src')                         # we get the build setting from config.py

# First map the alias settings onto one of the concrete sources below
# (string values are compared with ==, not 'is')
if kb_src is None:
    kb_src = 'standard'
elif kb_src == 'extract':
    kb_src = 'standard'
elif kb_src == 'full':
    kb_src = 'start'

# Then resolve the source to its actual files
if kb_src == 'sandbox':
    kbpedia = 'C:/1-PythonProjects/kbpedia/sandbox/kbpedia_reference_concepts.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
elif kb_src == 'standard':
    kbpedia = 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kko.owl'
elif kb_src == 'start':
    kbpedia = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kbpedia_rc_stub.owl'
    kko_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/stubs/kko.owl'
else:
    print('You have entered an inaccurate source parameter for the build.')
skos_file = 'http://www.w3.org/2004/02/skos/core'
This will move to
cowpoke.__main__ import as well:
kb = world.get_ontology(kbpedia).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')

skos = world.get_ontology(skos_file).load()
kb.imported_ontologies.append(skos)
core = world.get_namespace('http://www.w3.org/2004/02/skos/core#')

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)
kko = kb.get_namespace('http://kbpedia.org/ontologies/kko#')
You will need to Run (
shift+enter) the routines above in order to test any of the subsequent methods.
This section describes a number of utilities we may apply to the structure of KBpedia. Most of these routines need only be run infrequently, generally in preparation for testing the last structural items before initiating a formal, new build.
In the last installment, we developed the first two of these utilities, the
dup_remover check and the
set_union routine. These two join the routines below in the new
In our prior build routines, we had some specific steps dealing with defining ‘SuperTypes’, that is, the root concepts to each of our typologies. With this new Python cowpoke design, these specifications have moved to the KBpedia Knowledge Ontology (KKO) upper ontology (see CWPK #38). If you choose to add a new upper-level typology, you will need to take these steps:
Using an ontology editor, add the new upper level SuperType to its appropriate level under
Generals in the KKO ontology;
Add all required annotations (
altLabels) for that new concept in KKO;
Add a new entry to the
typol_dict dictionary list in
Flesh out and complete a typology flat file for that new SuperType and place it into the appropriate directory used for your builds;
Build the KBpedia structure (or whatever you may have named it) and test the structure (per this and the next installments); and
Add the annotations to any new RCs in the typology (CWPK #44).
Note: Lower-level typologies may also be added to an existing KBpedia concept node (‘
rc‘ namespace). In those cases, the new typology needs to be added explicitly to the
class_struct_build process in CWPK #40, but no further changes need to be made to KKO since the parent typology is already hooked into the system.
The difference analysis (
set_difference) code is mostly identical to the
set_union routine from the prior installment, except for the difference calculation shown on the line with Note #6. It is best used to check the difference from only one or two other sets (typologies).
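As a minimal sketch of the difference calculation on its own, assume the members of each typology have already been gathered into Python sets. The set names and members below are made up for illustration and are not actual KBpedia assignments:

```python
# Hypothetical stand-ins for the RCs gathered from three typologies.
base = {'rc.Mammal', 'rc.Bird', 'rc.Fish', 'rc.Fungus'}
other_1 = {'rc.Fish', 'rc.Shark'}
other_2 = {'rc.Fungus', 'rc.Yeast'}

# set.difference() accepts any number of other sets and keeps only the
# members of the first set that appear in none of them.
only_in_base = base.difference(other_1, other_2)

print(sorted(only_in_base))
```

Because each added comparison set can only shrink the result, the output tends to be easiest to interpret when the base is compared against just one or two others.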
The basic run command for this utility is:
We first showed how to list disjoint classes in CWPK #17. Let’s take that basic command, and use it to extract our existing disjoint assignments to file, plus do a bit of output file cleanup. Since this is only rarely run (but helpful when done so!), we have not generalized it much:
def disjoint_status():
    output = list(world.disjoint_classes())
    # the 'with' block closes the file for us when the loop is done
    with open('C:/1-PythonProjects/kbpedia/v300/build_ins/working/kbpedia_disjoint.csv', 'w', encoding='utf8') as disjoint_file:
        disjoint_file.write('id,disjoints\n')
        for element in output:
            # strip the owlready2 wrapper text from each disjoint assertion
            element = str(element)
            element = element.replace('AllDisjoint([', '')
            element = element.replace('C:\\1-PythonProjects\\kbpedia\\sandbox\\', '')
            element = element.replace(' | ', ',')
            element = element.replace(' ', '')
            element = element.replace('])', '')
            element = element.replace(',ontology=get_ontology("http://kbpedia.org/ontologies/kko#"))', '')
            element = element.replace(']', '')
            disjoint_file.write(element)
            disjoint_file.write('\n')
Mostly this routine just cleans up the output from the standard owlready2 ‘disjoint’ call. It was only cleaned up to the point of readability, since it will not be used in any roundtripping. The next couple of sub-sections address how we typically handle disjointedness assertions.
Disjoint assignments are some of the most important in KBpedia. We try to ensure that any truly non-overlapping typologies are declared as ‘disjoint’ from one another. Also, we try to scrutinize closely two typologies with only minimal overlap. These minor overlaps may be misassignments or perhaps we can move or slightly reconfigure the concept to avoid the overlap, in which case we can re-configure the two comparing typologies to be actually disjoint. We need some offline analysis to review these situations.
We already showed a
set_intersection method in the previous installment. However, for disjoint analysis we want to run pairwise comparisons between all typologies and flag those that have no overlap or have minimal overlaps. With 72 items in the current typology list (excluding
Generals, which is the catch-all combined parent), we thus have 2,556 options to test, since order is not important in the pair. The basic formula is
n(n-1)/2. With this many comparisons, the process clearly needs to be automated.
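As a quick check of that arithmetic, here is a small sketch that compares the closed-form count against a brute-force enumeration of the unordered pairs. It assumes Python 3.8 or later for math.comb, and the variable names are ours:

```python
from math import comb

n = 72  # typologies in the current list, excluding Generals

# Closed-form count of unordered pairs: n(n-1)/2
by_formula = n * (n - 1) // 2

# The same count obtained by enumerating every pair (i, j) with i < j
by_enumeration = sum(1 for i in range(n) for j in range(i + 1, n))

print(by_formula, by_enumeration, comb(n, 2))   # prints: 2556 2556 2556
```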
So, our basic approach is to begin with the first typology, compare it to all others, move to the second and compare, and so forth until we have exhausted the typology list. For each iteration, we will collect the RCs from the first typology and the RCs from the second typology, convert them to sets, and then do a set intersection. We then want to print out the count of the intersections, plus the actual RCs that overlap between the two typology sets if the count is at or below a set number of overlaps. Here is the basic routine, with notes explained after the code:
### KEY CONFIG SETTINGS (see build_deck in config.py) ###
# 'kb_src'        : 'standard'
# count           : 20                                                      # Note 1
# out_file        : 'C:/1-PythonProjects/kbpedia/v300/targets/stats/kko_intersections.csv'

from itertools import combinations                                         # Note 2

def typol_intersects(**build_deck):
    kko_list = typol_dict.values()
    count = build_deck.get('count')
    out_file = build_deck.get('out_file')
    with open(out_file, 'w', encoding='utf8') as output:
        print('count,kko_1,kko_2,intersect RCs', file=output)
        for i in combinations(kko_list,2):                                  # Note 3
            kko_1 = i[0]                                                    # Note 4
            kko_2 = i[1]                                                    # Note 4
            kko_1_frag = kko_1.replace('kko.', '')
            kko_1 = getattr(kko, kko_1_frag)                                # Note 5
            kko_2_frag = kko_2.replace('kko.', '')
            kko_2 = getattr(kko, kko_2_frag)
            descent_1 = kko_1.descendants(include_self = False)             # Note 6
            descent_1 = set(descent_1)
            descent_2 = kko_2.descendants(include_self = False)
            descent_2 = set(descent_2)
            intersect = descent_1.intersection(descent_2)                   # Note 7
            num = len(intersect)
            if num <= count:                                                # Note 1
                print(num, kko_1, kko_2, intersect, sep=',', file=output)
            else:
                print(num, kko_1, kko_2, sep=',', file=output)
    print('KKO typology intersection analysis is done.')
We pick up our settings, like other routines, from the
(**build_deck), and we set a threshold of 20 or fewer overlaps (1) for printing out the intersecting RCs (you may change this to any value you wish). If you’d like to inspect one output (calculated as of today’s installment; it may change), you can do so by running this cell:
with open('C:/1-PythonProjects/kbpedia/sandbox/kko_intersections.csv', 'r') as f:
    print(f.read())
Each line in the output presents the intersection count, followed by the two typologies being compared, and then a listing of the intersecting reference concepts (RCs) if their count is at or below the threshold.
The code takes advantage of a new module in this series,
itertools (2), that has a number of very useful data analysis options. We are looking at the
combinations method (3) that iterates for us all of the unordered pairwise comparisons (2,556 in our case). We pull out the actual typology item by index from the tandem (4), and, like before, evaluate that string to retrieve the actual typology class reference (5). Using the owlready2 built-in function, we are able to get all of the RC descendant members for each of the typologies, convert them to sets, and then intersect them (7) with the efficient set intersection notation.
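To see the pairwise pattern in isolation, here is a minimal sketch that runs on plain Python sets rather than on owlready2 classes. The typology names, their members, and the threshold value are hypothetical and chosen only to keep the example small:

```python
from itertools import combinations

# Hypothetical typologies and members, for illustration only.
typologies = {
    'Animals': {'Bear', 'PolarBear', 'Salmon'},
    'FoodDrink': {'Salmon', 'Bread'},
    'AudioInfo': {'Song', 'Podcast'},
}
threshold = 1  # list the shared members only when the overlap is this small

rows = []
# combinations() yields each unordered pair once; unpacking the tuple
# takes the place of indexing into it.
for name_1, name_2 in combinations(typologies, 2):
    shared = typologies[name_1] & typologies[name_2]   # set intersection
    if len(shared) <= threshold:
        rows.append((len(shared), name_1, name_2, sorted(shared)))
    else:
        rows.append((len(shared), name_1, name_2))

for row in rows:
    print(row)
```

The & operator is equivalent to the .intersection() call used in the routine; which one to use is a matter of taste.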
We want to do two things with this output. First, we want to make sure that all null intersections (count = 0) are included in our disjoint assignments in KBpedia. This is where we can quickly compare to the output from the earlier
disjoint_status function. Second, for intersections with minimal overlap, we want to inspect those items and discover if we can revise scope or assignments for some RCs to make the pair disjoint. This latter step is a bit tricky (aside from any misassignments, which have now been flagged for correction) because we do not want to change our ideas of ‘natural’ classes merely to make a disjoint assertion. However, sometimes either the scope of the typology, or the scope of the shared RC, may be tweaked such that a defensible disjoint status may be asserted. When there are very few overlaps, for example, one approach that has sometimes made sense is to move a concept into a parent category above the two comparison child typologies. There are also circumstances where the overlap is real, and even if only with a few overlap items, the non-disjointedness should be maintained (and thus no changes should be made).
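The first of these two checks can be sketched as follows. This is only an illustration under stated assumptions: the rows and the declared pairs are hypothetical, and in practice both would need to be parsed from the two output files:

```python
# Hypothetical rows from the intersection analysis: (count, typology_1, typology_2)
intersections = [
    (0, 'Animals', 'AudioInfo'),
    (1, 'Animals', 'FoodDrink'),
    (0, 'AudioInfo', 'FoodDrink'),
]

# Hypothetical pairs already asserted as disjoint. A frozenset makes the
# order within a pair irrelevant when testing membership.
declared = {frozenset(('AudioInfo', 'Animals'))}

# Pairs with no overlap that are not yet covered by a disjoint assertion
missing = [
    (t1, t2)
    for count, t1, t2 in intersections
    if count == 0 and frozenset((t1, t2)) not in declared
]

print(missing)
```

Each pair reported here is a candidate for a new disjoint assertion, subject to the judgment calls described above.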
Some time and experience is likely required in this area. Disjoint assertions are some of the most powerful for inferencing and satisfiability testing of the knowledge graph. (I suspect I have spent more intellectual horsepower on the questions of disjoint typologies than any other in KBpedia.)
From the standpoint of the Python code used for this method, see the concluding section under Additional Documentation to check out some useful sources.
Branch and Orphan Check
A periodic check that is helpful is whether a given RC has a broken lineage to the root of its typology. Such broken links cannot occur when the typology is a direct extraction from KBpedia without external modification. However, the use of external tools, general edits, or other modifications to a typology used for ingest can result in broken inheritance chains. In the case where more than one RC in a chain of RCs lacks a connection to the root, the disconnected fragment is called a ‘branch’. Where the disconnected fragment is a singleton RC, it is called an ‘orphan’.
Again, because this routine is infrequently needed, I mostly hardwired the formal settings below. You can move them back to the
build_deck settings. Here is the routine, again with notes that follow the code listing:
### KEY CONFIG SETTINGS (see build_deck in config.py) ###
# 'kb_src'        : 'standard'
# 'loop_list'     : kko.Generals.descendants()                             # Note 1
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/targets/stats/branches_orphans.csv'

def branch_orphan_check(**build_deck):
    print('Beginning branch and orphan checks . . .')
    # loop_list = build_deck.get('loop_list')                              # Note 1
    loop_list = kko.Generals.descendants()                                  # Note 1
    loop_list = set(loop_list)
    kko_list = list(typol_dict.values())
    item_list = []
    for i, item in enumerate(kko_list):                                     # Note 2
        item_frag = item.replace('kko.','')
        kko_item = getattr(kko, item_frag)
        kko_list[i] = kko_item
    print('After:', kko_list)
    out_file = 'C:/1-PythonProjects/kbpedia/v300/targets/stats/branches_orphans.csv'
    with open(out_file, 'w', encoding='utf8') as output:
        print('rc', file=output)
        kko_list = set(kko_list)
        for loopval in loop_list:
            val = loopval
            print(' . . . evaluating', loopval, 'checking for branches and orphans . . .')
            val_list = val.ancestors(include_self = False)
            val_list = set(val_list)
            intersect = val_list.intersection(kko_list)
            num = len(intersect)
            print(num)
            if num == 0:
                print('Unconnected RC:', val, file=output)
    print('Branch and orphan analysis now complete.')
In this example, we set the overall loop basis to be all of the RCs in the system; that is, the
.descendants of the
Generals typology root. If to be driven from the
build_deck, the value could be changed to a single typology using the
custom_dict setting, but it may be just as easy to set it directly in this code.
While .descendants produces a collection of class objects, evaluating all of the typologies requires us to loop over
kko_list, which is a 2-tuple dictionary with the key values as strings. As we have seen before, we need to convert those strings into class object types (2), which also requires us to
enumerate the list, which allows us to replace the initial string values with class values.
We then convert our two input lists to sets, and do an intersection as in prior routines when we run the routine. If the item does not have the typology root as an ancestor, we then know the item is an orphan or the top of a branch not connected to the root.
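The same test can be tried on a toy hierarchy without loading KBpedia. In this sketch the parents mapping, the node names, and the ancestors helper are all hypothetical stand-ins for what owlready2 provides, and the typology roots are skipped explicitly:

```python
# Hypothetical child -> direct parents mapping; names are illustrative only.
parents = {
    'Animals': set(),            # typology root
    'Mammal': {'Animals'},
    'Bear': {'Mammal'},
    'Lichen': set(),             # singleton with no link to a root
    'Moss': set(),               # top of a disconnected fragment
    'PeatMoss': {'Moss'},
}
roots = {'Animals'}

def ancestors(node):
    """Collect every ancestor of node by walking the parent links upward."""
    found = set()
    stack = list(parents[node])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found

# An item is unconnected when none of its ancestors is a typology root.
unconnected = sorted(
    node for node in parents
    if node not in roots and ancestors(node).isdisjoint(roots)
)

# An orphan is an unconnected singleton: no parents and no children.
has_children = {p for ps in parents.values() for p in ps}
orphans = [n for n in unconnected if not parents[n] and n not in has_children]
branch_members = [n for n in unconnected if n not in orphans]

print('Unconnected:', unconnected)
print('Orphans:', orphans)
print('Branch members:', branch_members)
```

Note that every member of a disconnected branch is flagged, not just its top node, since none of them can reach a root.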
This kind of analysis is most useful when first constructing a new, initial typology. As disconnects get connected, the worth of this analysis declines.
Duplicates in the Parental Chain
Our last structural utility at this juncture is one that analyzes whether a given reference concept (RC) is only assigned once to its lowest logical occurrence in a parental inheritance chain. While there is nothing illogical about assigning a concept wherever it is subsumed by a parent, multiple assignments for a single RC in a given inheritance chain lead to unreadability and difficulties in maintaining the system.
For example, we know that a ‘polar bear’ is a ‘bear’, which is a ‘mammal’ that is part of ‘Eutheria’, all of which are ‘LivingThings’. There is nothing logically wrong with assigning the ‘polar bear’ concept to all of these other items. Inferencing would show ‘polar bear’ to be a subclass of all of these items. However, redundant assignments clog our listings, and make it difficult to know, when we see an occurrence, whether it is at its terminal node location or not. We get cleaner ontologies that are easier to maintain by trying to adhere to the best practice of a single assignment to an inheritance chain, best placed at its lowest hierarchical level.
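To make the polar bear case concrete, here is a small sketch that flags a direct assignment as redundant when another direct parent already implies it. The hierarchy follows the chain described above, the redundant ‘Mammal’ assignment is added on purpose, and the function names are our own rather than anything in cowpoke or owlready2:

```python
# Hypothetical direct (asserted) parents; 'PolarBear' is redundantly
# assigned to 'Mammal' in addition to its lowest parent, 'Bear'.
direct_parents = {
    'LivingThings': set(),
    'Eutheria': {'LivingThings'},
    'Mammal': {'Eutheria'},
    'Bear': {'Mammal'},
    'PolarBear': {'Bear', 'Mammal'},
}

def ancestors(node):
    """Collect every ancestor of node by walking the parent links upward."""
    found = set()
    stack = list(direct_parents[node])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(direct_parents[current])
    return found

def redundant_parents(node):
    """Direct parents that are already implied by another direct parent."""
    direct = direct_parents[node]
    return {
        p for p in direct
        if any(p in ancestors(q) for q in direct if q != p)
    }

print(redundant_parents('PolarBear'))
```

Here ‘Mammal’ is reported because ‘Bear’ already has it as an ancestor, so removing that one assignment leaves the inferred hierarchy unchanged.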
Redundant assignments, in my view, are all too common with most knowledge graphs. I like the analytical routine below since it helps me to pare down to the essence of the logic of the ontology structure. Code notes are discussed below the listing:
### KEY CONFIG SETTINGS (see build_deck in config.py) ###
# 'kb_src'        : 'standard'
# 'loop_list'     : kko.ProtistsFungus.descendants()                       # Note 1
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/targets/stats/parental_dups.csv'

def dups_parental_chain(**build_deck):
    print('Beginning duplicate RC placement analysis . . .')
    loop_list = kko.AudioInfo.descendants()                                 # Note 1
    out_file = 'C:/1-PythonProjects/kbpedia/v300/targets/stats/parental_dups.csv'
    with open(out_file, 'w', encoding='utf8') as output:
        print('count,rc,dups', file=output)
        for item in loop_list:                                              # Note 2
            rc = item
            rc_list = rc.ancestors(include_self = False)
            dup_keep = []
            for par_item in rc_list:
                parent = par_item
                par_list = parent.subclasses()
                for dup_item in par_list:
                    dup = dup_item
                    if rc == dup:
                        # dup_check = dup.ancestors(include_self = False)
                        # if(all(x in rc_list for x in dup_check)):
                        #     print(rc, ',', parent, file=output)
                        dup_keep.append(parent)
            count = len(dup_keep)
            if count > 1:
                print(count, ',', rc, ',', dup_keep, file=output)
    print('Duplicate RC checking and analysis is complete.')
Beginning duplicate RC placement analysis . . .
Duplicate RC checking and analysis is complete.
On my local machine, this analysis takes about 3.5 minutes to run.
We directly set the routine to trace all of the RCs under the
Generals root (1), one of the three in the KKO’s universal categories. Again, these can be tailored through settings from the
build_deck. If you do so, make sure you make the
.descendants assignment as well. The remaining parts of the routine should be somewhat familiar by now.
The routine basically works by first looping over all of the RCs in the system (2), grabbing all ancestors up to the
owl.Thing root, looping over all of the ancestors and grabbing their immediate subclasses, and then checking to see if one of the subclasses is the starting RC. If so, that is recorded, and RCs with more than one subclass instance are written to file.
These listings perhaps could be reduced further in size with further filtering. However, it is best I believe, at this juncture, to manually inspect such structural changes. It is straightforward to manually check the RCs listed, and remove any superfluous subsumption assignments.
I may add some more refinements to this routine later to flag any subclass assignments that occur in the same parental chain.
If our system passes the tests above, or at least to the extent that we, as knowledge graph managers, deem acceptable for a next release, then we are ready to begin our logic tests of the structure, the subject of our next installment.
Here are some useful links on the
itertools module, as well as other pairwise considerations: