Posted: October 5, 2020

Here is the Master Listing of Extraction and Build Steps

We are near the end of this major part in our Cooking with Python and KBpedia series in which we cover how to build KBpedia from a series of flat-text (CSV) input files. Though these CSV files may have been modified substantially offline (see, in part, CWPK #36), they are initially generated in an extraction loop, which we covered in CWPK #28-35. We have looked at these various steps in an incremental fashion, building up our code base function by function. This approach is perhaps good from a teaching perspective, but makes it kind of murky how all of the pieces fit together.

In this installment, I will list all of the steps — in sequence — for proceeding from the initial flat file extractions, to offline modifications of those files, and then the steps to build KBpedia again from the resulting new inputs. Since how all of these steps proceed depends critically on configuration settings prior to executing a given step, I also try to capture the main configuration settings appropriate to each step. The steps outlined here cover a full extract-build ‘roundtrip’ cycle. In the next installment, we will address some of the considerations that go into doing incremental or partial extractions or builds.

Please note that the actual functions in our code modules may be modified slightly from what we presented in our interactive notebook files. These minor changes, when made, are needed to cover gaps or slight errors uncovered during full extraction and build runs. As an example, my initial passes at class annotation extraction overlooked the kko.superClassOf and rdfs.isDefinedBy properties. Some issues in CSV extraction and build settings were also discovered that led to excess quoting of strings. The “official” code, then, is what is contained in the cowpoke modules, and not necessarily exactly what is in the notebook pages.

Therefore, of the many installments in this CWPK series, this present one is perhaps one of the most important for you to keep and reference. We will have occasion to summarize other steps in our series, but this installment is the most comprehensive view of the extract-and-build ‘roundtrip’ cycle.

Summary of Extraction and Build Steps

Here are the basic steps in a complete roundtrip from extracting to building the knowledge graph anew:

  1. Startup

  2. Extraction

  • Structure Extraction of Classes
  • Structure Extraction of Properties
  • Annotation Extraction of Classes
  • Annotation Extraction of Properties
  • Extraction of Mappings

  3. Offline Development and Manipulation

  4. Clean and Test Build Input Files

  5. Build

  • Build Class Structure
  • Build Property Structure
  • Build Class Annotations
  • Build Property Annotations
  • Ingest of Mappings

  6. Test Build

Each phase must begin with the extraction or building of classes and properties, because these resources need to be adequately registered to the knowledge graph before other steps can reference them. Once done, however, there is no ordering requirement for whether mapping or annotation proceeds next. Since annotation changes are always likely in every new version or build, I have listed them before mapping, but that is only a matter of preference.

Each of these steps is described below, plus some key configuration settings as appropriate. We begin with our first step, startup:

1. Startup

from cowpoke.__main__ import *
from cowpoke.config import *

We recap the entire breakdown and build process here, beginning with structure extraction, first for classes and then for properties:

2. Extraction

The purpose of a full extraction is to retrieve all assertions in KBpedia aside from those in the upper (also called top-level) KBpedia Knowledge Ontology, or KKO.

A. Structure Extraction of Classes

We begin with the (mostly) hierarchical typologies and their linkage into KKO and with one another. Since all of the reference concepts in KBpedia are subsumed by the top-level category of Generals, we can specify it alone as a means to retrieve all of the RCs in KBpedia:
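For orientation, here is a minimal sketch of what such a single-entry loop_list source might look like in config.py. The dictionary name matches the configuration comments below, but the key and value shown are illustrative assumptions, not the literal cowpoke entries:

# Illustrative only: a one-entry dictionary whose value names the typology
# root to walk; the real custom_dict in config.py may differ in its details
custom_dict = {'Generals' : 'kko.Generals'}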

### KEY CONFIG SETTINGS (see extract_deck in config.py) ###
# 'krb_src'       : 'extract'                                          # Set in master_deck
# 'descent_type'  : 'descent',
# 'loop'          : 'class_loop',
# 'loop_list'     : custom_dict.values(),                              # Single 'Generals' specified       
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/extractions/classes/Generals_struct_out.csv',
# 'render'        : 'r_iri',

def struct2_extractor(**extract_deck):
    print('Beginning structure extraction . . .')
# 1 - render method goes here    
    r_default = ''
    r_label = ''
    r_iri = ''
    render = extract_deck.get('render')
    if render == 'r_default':
        set_render_func(default_render_func)
    elif render == 'r_label':
        set_render_func(render_using_label)
    elif render == 'r_iri':
        set_render_func(render_using_iri)
    else:
        print('You have assigned an incorrect render method--execution stopping.')
        return
# 2 - note about custom extractions
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    out_file = extract_deck.get('out_file')
    class_loop = extract_deck.get('class_loop')
    property_loop = extract_deck.get('property_loop')
    descent_type = extract_deck.get('descent_type')
    x = 1
    cur_list = []
    a_set = []
    s_set = []
    new_class = 'owl:Thing'
# 5 - what gets passed to 'output'
    with open(out_file, mode='w', encoding='utf8', newline='') as output:
        csv_out = csv.writer(output)
        if loop == 'class_loop':                                             
            header = ['id', 'subClassOf', 'parent']
            p_item = 'rdfs:subClassOf'
        else:
            header = ['id', 'subPropertyOf', 'parent']
            p_item = 'rdfs:subPropertyOf'
        csv_out.writerow(header)       
# 3 - what gets passed to 'loop_list' 
        for value in loop_list:
            print('   . . . processing', value)                                           
            root = eval(value)
# 4 - descendant or single here
            if descent_type == 'descent':
                a_set = root.descendants()
                a_set = set(a_set)
                s_set = a_set.union(s_set)
            elif descent_type == 'single':
                a_set = root
                s_set.append(a_set)
            else:
                print('You have assigned an incorrect descent method--execution stopping.')
                return                         
        print('   . . . processing consolidated set.')
        for s_item in s_set:
            o_set = s_item.is_a
            for o_item in o_set:
                row_out = (s_item,p_item,o_item)
                csv_out.writerow(row_out)
                if loop == 'class_loop':
                    if s_item not in cur_list:                
                        row_out = (s_item,p_item,new_class)
                        csv_out.writerow(row_out)
                cur_list.append(s_item)
                x = x + 1
    print('Total unique IDs written to file:', x)
    print('The structure extraction for the ', loop, 'is completed.')
struct2_extractor(**extract_deck)

B. Structure Extraction of Properties

See above with the following changes/notes:

### KEY CONFIG SETTINGS (see extract_deck in config.py) ###
# 'krb_src' : 'extract' # Set in master_deck
# 'descent_type' : 'descent',
# 'loop' : 'property_loop',
# 'loop_list' : prop_dict.values(),
# 'out_file' : 'C:/1-PythonProjects/kbpedia/v300/extractions/properties/prop_struct_out.csv',
# 'render' : 'r_default',

C. Annotation Extraction of Classes

Annotations require a different method, though with a similar composition to the prior ones. It was during testing of the full extract-build roundtrip that I realized our initial class annotation extraction routine was missing the rdfs.isDefinedBy and kko.superClassOf properties. The code in extract.py has been updated to reflect these changes.

Again, we first begin with classes. Note: by convention, I have shifted a couple of structural properties (subClassOf and superClassOf) into these annotation extractions:

### KEY CONFIG SETTINGS (see extract_deck in config.py) ###                
# 'krb_src'       : 'extract'                                          # Set in master_deck
# 'descent_type'  : 'descent',
# 'loop'          : 'class_loop',
# 'loop_list'     : custom_dict.values(),                              # Single 'Generals' specified 
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/extractions/classes/Generals_annot_out.csv',
# 'render'        : 'r_label',

def annot2_extractor(**extract_deck):
    print('Beginning annotation extraction . . .') 
    r_default = ''
    r_label = ''
    r_iri = ''
    render = extract_deck.get('render')
    if render == 'r_default':
        set_render_func(default_render_func)
    elif render == 'r_label':
        set_render_func(render_using_label)
    elif render == 'r_iri':
        set_render_func(render_using_iri)
    else:
        print('You have assigned an incorrect render method--execution stopping.')
        return    
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    out_file = extract_deck.get('out_file')
    class_loop = extract_deck.get('class_loop')
    property_loop = extract_deck.get('property_loop')
    descent_type = extract_deck.get('descent_type')
    """ These are internal counters used in this module's methods """
    p_set = []
    a_set = []
    x = 1
    cur_list = []
    with open(out_file, mode='w', encoding='utf8', newline='') as output:
        csv_out = csv.writer(output)                                       
        if loop == 'class_loop':                                             
            header = ['id', 'prefLabel', 'subClassOf', 'altLabel', 
                      'definition', 'editorialNote', 'isDefinedBy', 'superClassOf']
        else:
            header = ['id', 'prefLabel', 'subPropertyOf', 'domain', 'range', 
                      'functional', 'altLabel', 'definition', 'editorialNote']
        csv_out.writerow(header)    
        for value in loop_list:                                            
            print('   . . . processing', value)                                           
            root = eval(value) 
            if descent_type == 'descent':
                p_set = root.descendants()
            elif descent_type == 'single':
                a_set = root
                p_set.append(a_set)
            else:
                print('You have assigned an incorrect descent method--execution stopping.')
                return    
            for p_item in p_set:
                if p_item not in cur_list:                                 
                    a_pref = p_item.prefLabel
                    a_pref = str(a_pref)[1:-1].strip('"\'')                
                    a_sub = p_item.is_a
                    for a_id, a in enumerate(a_sub):                        
                        a_item = str(a)
                        if a_id > 0:
                            a_item = a_sub + '||' + str(a)
                        a_sub  = a_item
                    if loop == 'property_loop':   
                        a_item = ''
                        a_dom = p_item.domain
                        for a_id, a in enumerate(a_dom):
                            a_item = str(a)
                            if a_id > 0:
                                a_item = a_dom + '||' + str(a)
                            a_dom  = a_item    
                        a_dom = a_item
                        a_rng = p_item.range
                        a_rng = str(a_rng)[1:-1]
                        a_func = ''
                    a_item = ''
                    a_alt = p_item.altLabel
                    for a_id, a in enumerate(a_alt):
                        a_item = str(a)
                        if a_id > 0:
                            a_item = a_alt + '||' + str(a)
                        a_alt  = a_item    
                    a_alt = a_item
                    a_def = p_item.definition
                    a_def = str(a_def)[2:-2]
                    a_note = p_item.editorialNote
                    a_note = str(a_note)[1:-1]
                    if loop == 'class_loop':                                  
                        a_isby = p_item.isDefinedBy
                        a_isby = str(a_isby)[2:-2]
                        a_isby = a_isby + '/'
                        a_item = ''
                        a_super = p_item.superClassOf
                        for a_id, a in enumerate(a_super):
                            a_item = str(a)
                            if a_id > 0:
                                a_item = a_super + '||' + str(a)
                            a_super = a_item    
                        a_super  = a_item
                    if loop == 'class_loop':                                  
                        row_out = (p_item,a_pref,a_sub,a_alt,a_def,a_note,a_isby,a_super)
                    else:
                        row_out = (p_item,a_pref,a_sub,a_dom,a_rng,a_func,
                                   a_alt,a_def,a_note)
                    csv_out.writerow(row_out)                               
                    cur_list.append(p_item)
                    x = x + 1
    print('Total unique IDs written to file:', x)  
    print('The annotation extraction for the', loop, 'is completed.')
annot2_extractor(**extract_deck)
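The short diagnostic below simply prints Python's default csv ‘excel’ dialect settings, which proved helpful when chasing down the excess-quoting issue noted at the outset of this installment: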
d = csv.get_dialect('excel')
print("Delimiter: ", d.delimiter)
print("Doublequote: ", d.doublequote)
print("Escapechar: ", d.escapechar)
print("lineterminator: ", repr(d.lineterminator))
print("quotechar: ", d.quotechar)
print("Quoting: ", d.quoting)
print("skipinitialspace: ", d.skipinitialspace)
print("strict: ", d.strict)

D. Annotation Extraction of Properties

See above with the following changes/notes:

### KEY CONFIG SETTINGS (see extract_deck in config.py) ###                
# 'krb_src' : 'extract' # Set in master_deck
# 'descent_type' : 'descent',
# 'loop' : 'property_loop',
# 'loop_list' : prop_dict.values(),
# 'out_file' : 'C:/1-PythonProjects/kbpedia/v300/extractions/properties/prop_annot_out.csv',
# 'render' : 'r_default',

E. Extraction of Mappings

Mappings to external sources are an integral part of KBpedia, as is likely the case for any similar, large-scale knowledge graph. As such, extraction of existing mappings is also a logical step in the overall extraction process.

Though we will not address mappings until CWPK #49, those steps belong here in the overall set of procedures for the extract-build roundtrip process.

3. Offline Development and Manipulation

The above extraction steps can capture changes made over time with an ontology editing tool such as Protégé. Once the knowledge graph has reached a state of readiness in Protégé and more major changes are desired, it is sometimes easier to work with flat files in bulk. I discussed some of my own steps using spreadsheets in CWPK #36, and I will also walk through some refactorings using bulk files in our next installment, CWPK #48. That case study will help us see at least a few of the circumstances that warrant bulk refactoring. Major additions or changes to the typologies are also an occasion for such bulk activities.

At any rate, this step in the overall roundtripping process is where such modifications are made before rebuilding the knowledge graph anew.

4. Clean and Test Build Input Files

We covered these topics in CWPK #45. If you recall, cleaning and testing of input files occurs at this logical point, but we delayed discussing it in detail until we had covered the overall build process steps. This is why that installment's number appears a bit out of sequence.

5. Build

The start of the build cycle is to have all structure, annotation, and mapping files in proper shape and vetted for encoding and quality.

(Note: where ‘Generals’ is specified, keep the initial capitalization, since it is also generated as such from the extraction routines and is consistent with typology naming.)

A. Build Class Structure

We start with the knowledge graph classes and their subsumption relationships, as specified in one or more class structure CSV input files. In this case, we are doing a full build, so we begin with the KKO and RC stubs, plus run our Generals typology since it is inclusive:

### KEY CONFIG SETTINGS (see build_deck in config.py) ###             # Option 1: from Generals
# 'kb_src'        : 'start'                                           # Set in master_deck; only step with 'start'
# 'loop_list'     : custom_dict.values(),                             # Single 'Generals' specified 
# 'loop'          : 'class_loop',
# 'base'          : 'C:/1-PythonProjects/kbpedia/v300/build_ins/classes/',              
# 'ext'           : '_struct_out.csv',                                # Note change           
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.csv',

### KEY CONFIG SETTINGS (see build_deck in config.py) ###             # Option 2: from all typologies
# 'kb_src'        : 'start'                                           # Set in master_deck; only step with 'start'
# 'loop_list'     : typol_dict.values(),                               
# 'loop'          : 'class_loop',
# 'base'          : 'C:/1-PythonProjects/kbpedia/v300/build_ins/classes/',              
# 'ext'           : '.csv',                                           # Note change           
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.csv',

from cowpoke.build import *

def class2_struct_builder(**build_deck):                                  
    print('Beginning KBpedia class structure build . . .')               
    kko_list = typol_dict.values()                                      
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    base = build_deck.get('base')
    ext = build_deck.get('ext')
    out_file = build_deck.get('out_file')
    if loop != 'class_loop':
        print("Needs to be a 'class_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval)                           
        frag = loopval.replace('kko.','')
        in_file = (base + frag + ext)
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])                 
            for row in reader:
                r_id = row['id'] 
                r_parent = row['parent']
                id = row_clean(r_id, iss='i_id')                         
                id_frag = row_clean(r_id, iss='i_id_frag')
                parent = row_clean(r_parent, iss='i_parent')
                parent_frag = row_clean(r_parent, iss='i_parent_frag')
                if is_first_row:                                       
                    is_first_row = False
                    continue      
                with rc:                                                
                    kko_id = None
                    kko_frag = None
                    if parent_frag == 'Thing':                                                        
                        if id in kko_list:                                
                            kko_id = id
                            kko_frag = id_frag
                        else:    
                            id = types.new_class(id_frag, (Thing,))       
                if kko_id != None:                                         
                    with kko:                                                
                        kko_id = types.new_class(kko_frag, (Thing,))  
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])
            for row in reader:                                                
                r_id = row['id'] 
                r_parent = row['parent']
                id = row_clean(r_id, iss='i_id')
                id_frag = row_clean(r_id, iss='i_id_frag')
                parent = row_clean(r_parent, iss='i_parent')
                parent_frag = row_clean(r_parent, iss='i_parent_frag')
                if is_first_row:
                    is_first_row = False
                    continue          
                with rc:
                    kko_id = None                                   
                    kko_frag = None
                    kko_parent = None
                    kko_parent_frag = None
                    if parent_frag != 'Thing':
                        if id in kko_list:
                            continue
                        elif parent in kko_list:
                            kko_id = id
                            kko_frag = id_frag
                            kko_parent = parent
                            kko_parent_frag = parent_frag
                        else:   
                            var1 = getattr(rc, id_frag)               
                            var2 = getattr(rc, parent_frag)
                            if var2 == None:                            
                                continue
                            else:
                                print(var1, var2)
                                var1.is_a.append(var2)
                if kko_parent != None:                                         
                    with kko:                
                        if kko_id in kko_list:                               
                            continue
                        else:
                            var1 = getattr(rc, kko_frag)
                            var2 = getattr(kko, kko_parent_frag)                     
                            var1.is_a.append(var2)
        with open(in_file, 'r', encoding='utf8') as input:                
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])
            for row in reader:                                              
                r_id = row['id'] 
                r_parent = row['parent']
                id = row_clean(r_id, iss='i_id')
                id_frag = row_clean(r_id, iss='i_id_frag')
                parent = row_clean(r_parent, iss='i_parent')
                parent_frag = row_clean(r_parent, iss='i_parent_frag')
                if is_first_row:
                    is_first_row = False
                    continue
                if parent_frag == 'Thing':               
                    var1 = getattr(rc, id_frag)
                    var2 = getattr(owl, parent_frag)
                    try:
                        var1.is_a.remove(var2)
                    except Exception:
                        continue
    kb.save(out_file, format="rdfxml")      
    print('KBpedia class structure build is complete.')
class2_struct_builder(**build_deck)

B. Build Property Structure

After classes, we then add property structure to the system. Note, however, that we now switch to our normal ‘standard’ kb source:

### KEY CONFIG SETTINGS (see build_deck in config.py) ###             
# 'kb_src'        : 'standard'                                        # Set in master_deck
# 'loop_list'     : prop_dict.values(),                             
# 'loop'          : 'property_loop',
# 'base'          : 'C:/1-PythonProjects/kbpedia/v300/build_ins/properties/',              
# 'ext'           : '_struct_out.csv',                                         
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.csv',
# 'frag'          : set in code block; see below

def prop2_struct_builder(**build_deck):
    print('Beginning KBpedia property structure build . . .')
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    base = build_deck.get('base')
    ext = build_deck.get('ext')
    out_file = build_deck.get('out_file')
    if loop != 'property_loop':
        print("Needs to be a 'property_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval)
        frag = 'prop'                                    
        in_file = (base + frag + ext)
        print(in_file)
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subPropertyOf', 'parent'])
            for row in reader:
                if is_first_row:
                    is_first_row = False                
                    continue
                r_id = row['id']
                r_parent = row['parent']
                value = r_parent.find('owl.')
                if value == 0:                                        
                    continue
                value = r_id.find('rc.')
                if value == 0:
                    id_frag = r_id.replace('rc.', '')
                    parent_frag = r_parent.replace('kko.', '')
                    var2 = getattr(kko, parent_frag)                 
                    with rc:                        
                        r_id = types.new_class(id_frag, (var2,))
    kb.save(out_file, format="rdfxml")
    print(kbpedia)
    print(out_file)
    print('KBpedia property structure build is complete.')   
prop2_struct_builder(**build_deck)

C. Build Class Annotations

With the subsumption structure built, we next load our annotations, beginning with the class ones:

### KEY CONFIG SETTINGS (see build_deck in config.py) ###                  
# 'kb_src'        : 'standard'                                        
# 'loop_list'     : file_dict.values(),                           # see 'in_file'
# 'loop'          : 'class_loop',
# 'in_file'       : 'C:/1-PythonProjects/kbpedia/v300/build_ins/classes/Generals_annot_out.csv',
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.csv',

def class2_annot_build(**build_deck):
    print('Beginning KBpedia class annotation build . . .')
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    class_loop = build_deck.get('class_loop')
    out_file = build_deck.get('out_file')
    if loop != 'class_loop':
        print("Needs to be a 'class_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval) 
        in_file = loopval
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'prefLabel', 'subClassOf', 
                                   'altLabel', 'definition', 'editorialNote', 'isDefinedBy', 'superClassOf'])                 
            for row in reader:
                r_id = row['id']
                id = getattr(rc, r_id)
                if id == None:
                    print(r_id)
                    continue
                r_pref = row['prefLabel']
                r_alt = row['altLabel']
                r_def = row['definition']
                r_note = row['editorialNote']
                r_isby = row['isDefinedBy']
                r_super = row['superClassOf']
                if is_first_row:                                       
                    is_first_row = False
                    continue      
                id.prefLabel.append(r_pref)
                i_alt = r_alt.split('||')
                if i_alt != ['']: 
                    for item in i_alt:
                        id.altLabel.append(item)
                id.definition.append(r_def)        
                i_note = r_note.split('||')
                if i_note != ['']:   
                    for item in i_note:
                        id.editorialNote.append(item)
                id.isDefinedBy.append(r_isby)
                i_super = r_super.split('||')
                if i_super != ['']:   
                    for item in i_super:
                        item = 'http://kbpedia.org/kko/rc/' + item
#                        Code block to be used if objectProperty; 5.5 hr load
#                        item = getattr(rc, item)
#                        if item == None:
#                            print('Failed assignment:', r_id, item)
#                            continue
#                        else:                                
                        id.superClassOf.append(item)
    kb.save(out_file, format="rdfxml") 
    print('KBpedia class annotation build is complete.')   
class2_annot_build(**build_deck)

D. Build Property Annotations

And then the property annotations:

### KEY CONFIG SETTINGS (see build_deck in config.py) ###                  
# 'kb_src'        : 'standard'                                        
# 'loop_list'     : file_dict.values(),                           # see 'in_file'
# 'loop'          : 'property_loop',
# 'in_file'       : 'C:/1-PythonProjects/kbpedia/v300/build_ins/properties/prop_annot_out.csv',
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.csv',

def prop2_annot_build(**build_deck):
    print('Beginning KBpedia property annotation build . . .')
    xsd = kb.get_namespace('http://www.w3.org/2001/XMLSchema#')
    wgs84 = kb.get_namespace('http://www.opengis.net/def/crs/OGC/1.3/CRS84')    
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    out_file = build_deck.get('out_file')
    x = 1
    if loop != 'property_loop':
        print("Needs to be a 'property_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval) 
        in_file = loopval
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'prefLabel', 'subPropertyOf', 'domain',  
                                   'range', 'functional', 'altLabel', 'definition', 'editorialNote'])                 
            for row in reader:
                r_id = row['id']                
                r_pref = row['prefLabel']
                r_dom = row['domain']
                r_rng = row['range']
                r_alt = row['altLabel']
                r_def = row['definition']
                r_note = row['editorialNote']
                r_id = r_id.replace('rc.', '')
                id = getattr(rc, r_id)
                if id == None:
                    continue
                if is_first_row:                                       
                    is_first_row = False
                    continue
                id.prefLabel.append(r_pref)
                i_dom = r_dom.split('||')
                if i_dom != ['']: 
                    for item in i_dom:
                        if 'kko.' in item:
                            item = item.replace('kko.', '')
                            item = getattr(kko, item)
                            id.domain.append(item) 
                        elif 'owl.' in item:
                            item = item.replace('owl.', '')
                            item = getattr(owl, item)
                            id.domain.append(item)
                        elif item == ['']:
                            continue    
                        elif item != '':
                            item = getattr(rc, item)
                            if item == None:
                                continue
                            else:
                                id.domain.append(item) 
                        else:
                            print('No domain assignment:', 'Item no:', x, item)
                            continue                             
                if 'owl.' in r_rng:
                    r_rng = r_rng.replace('owl.', '')
                    r_rng = getattr(owl, r_rng)
                    id.range.append(r_rng)
                elif 'string' in r_rng:    
                    id.range = [str]
                elif 'decimal' in r_rng:
                    id.range = [float]
                elif 'anyuri' in r_rng:
                    id.range = [normstr]
                elif 'boolean' in r_rng:    
                    id.range = [bool]
                elif 'datetime' in r_rng:    
                    id.range = [datetime.datetime]   
                elif 'date' in r_rng:    
                    id.range = [datetime.date]      
                elif 'time' in r_rng:    
                    id.range = [datetime.time] 
                elif 'wgs84.' in r_rng:
                    r_rng = r_rng.replace('wgs84.', '')
                    r_rng = getattr(wgs84, r_rng)
                    id.range.append(r_rng)        
                elif r_rng == ['']:
                    print('r_rng = empty:', r_rng)
                else:
                    print('r_rng = else:', r_rng, id)
#                    id.range.append(r_rng)
                i_alt = r_alt.split('||')    
                if i_alt != ['']: 
                    for item in i_alt:
                        id.altLabel.append(item)
                id.definition.append(r_def)        
                i_note = r_note.split('||')
                if i_note != ['']:   
                    for item in i_note:
                        id.editorialNote.append(item)
                x = x + 1        
    kb.save(out_file, format="rdfxml") 
    print('KBpedia property annotation build is complete.')
prop2_annot_build(**build_deck)
Beginning KBpedia property annotation build . . .
. . . processing C:/1-PythonProjects/kbpedia/v300/build_ins/properties/prop_annot_out.csv
r_rng = else: xsd.anyURI rc.release_notes
r_rng = else: xsd.anyURI rc.schema_version
r_rng = else: xsd.anyURI rc.unit_code
r_rng = else: xsd.anyURI rc.property_id
r_rng = else: xsd.anyURI rc.ticket_token
r_rng = else: xsd.anyURI rc.role_name
r_rng = else: xsd.anyURI rc.feature_list
r_rng = else: xsd.hexBinary rc.associated_media
r_rng = else: xsd.hexBinary rc.encoding
r_rng = else: xsd.hexBinary rc.encodings
r_rng = else: xsd.hexBinary rc.photo
r_rng = else: xsd.hexBinary rc.photos
r_rng = else: xsd.hexBinary rc.primary_image_of_page
r_rng = else: xsd.hexBinary rc.thumbnail
r_rng = else: xsd.anyURI rc.code_repository
r_rng = else: xsd.anyURI rc.content_url
r_rng = else: xsd.anyURI rc.discussion_url
r_rng = else: xsd.anyURI rc.download_url
r_rng = else: xsd.anyURI rc.embed_url
r_rng = else: xsd.anyURI rc.install_url
r_rng = else: xsd.anyURI rc.map
r_rng = else: xsd.anyURI rc.maps
r_rng = else: xsd.anyURI rc.payment_url
r_rng = else: xsd.anyURI rc.reply_to_url
r_rng = else: xsd.anyURI rc.service_url
r_rng = else: xsd.anyURI rc.significant_link
r_rng = else: xsd.anyURI rc.significant_links
r_rng = else: xsd.anyURI rc.target_url
r_rng = else: xsd.anyURI rc.thumbnail_url
r_rng = else: xsd.anyURI rc.tracking_url
r_rng = else: xsd.anyURI rc.url
r_rng = else: xsd.anyURI rc.related_link
r_rng = else: xsd.anyURI rc.genre_schema
r_rng = else: xsd.anyURI rc.same_as
r_rng = else: xsd.anyURI rc.action_platform
r_rng = else: xsd.anyURI rc.fees_and_commissions_specification
r_rng = else: xsd.anyURI rc.requirements
r_rng = else: xsd.anyURI rc.software_requirements
r_rng = else: xsd.anyURI rc.storage_requirements
r_rng = else: xsd.anyURI rc.artform
r_rng = else: xsd.anyURI rc.artwork_surface
r_rng = else: xsd.anyURI rc.course_mode
r_rng = else: xsd.anyURI rc.encoding_format
r_rng = else: xsd.anyURI rc.file_format_schema
r_rng = else: xsd.anyURI rc.named_position
r_rng = else: xsd.anyURI rc.surface
r_rng = else: wgs84 rc.geo_midpoint
r_rng = else: xsd.anyURI rc.memory_requirements
r_rng = else: wgs84 rc.aerodrome_reference_point
r_rng = else: wgs84 rc.coordinate_location
r_rng = else: wgs84 rc.coordinates_of_easternmost_point
r_rng = else: wgs84 rc.coordinates_of_northernmost_point
r_rng = else: wgs84 rc.coordinates_of_southernmost_point
r_rng = else: wgs84 rc.coordinates_of_the_point_of_view
r_rng = else: wgs84 rc.coordinates_of_westernmost_point
r_rng = else: wgs84 rc.geo
r_rng = else: xsd.anyURI rc.additional_type
r_rng = else: xsd.anyURI rc.application_category
r_rng = else: xsd.anyURI rc.application_sub_category
r_rng = else: xsd.anyURI rc.art_medium
r_rng = else: xsd.anyURI rc.sport_schema
KBpedia property annotation build is complete.

E. Ingest of Mappings

Mappings to external sources are an integral part of KBpedia, as is likely the case for any similar, large-scale knowledge graph. As such, ingest of new or revised mappings is also a logical step in the overall build process, and occurs at this point in the sequence.

Though we will not address mappings until CWPK #49, those steps belong here in the overall set of procedures for the extract-build roundtrip process.

6. Test Build

We then conduct our series of logic tests (CWPK #43). This portion of the process may actually be the longest of all, given that it may take multiple iterations to pass all of these tests. However, in other circumstances, the build tests may also go quite quickly if relatively few changes were made between versions.
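As a rough illustration of what a programmatic coherence check can look like (the full CWPK #43 tests rely on the cowpoke utility functions plus an external reasoner in Protégé; the owlready2 calls below are just one hedged way to run such a pass and are not the literal test script):

from owlready2 import sync_reasoner_pellet, default_world

# Run the Pellet reasoner over the loaded knowledge graph ('kb' as used in
# the build steps above) and report any unsatisfiable classes it finds
with kb:
    sync_reasoner_pellet(infer_property_values=False)
print('Inconsistent classes:', list(default_world.inconsistent_classes()))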

Wrap Up

Of course, these steps could be embedded in an overall ‘complete’ extract and build routine, but I have not done so.
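Were one to do so, a hypothetical wrapper (not part of cowpoke) might simply chain the functions summarized above, with the important caveat that the extract_deck and build_deck settings in config.py still need to be switched between class and property runs as described in each step:

def full_roundtrip():
    # Hypothetical driver only; each call runs whichever loop (class or
    # property) the current extract_deck or build_deck settings specify,
    # so in practice config.py is edited between calls as summarized above
    struct2_extractor(**extract_deck)      # step 2A or 2B: structure extraction
    annot2_extractor(**extract_deck)       # step 2C or 2D: annotation extraction
    # ... steps 3 and 4: offline modification, then cleaning and testing of inputs ...
    class2_struct_builder(**build_deck)    # step 5A: build class structure
    prop2_struct_builder(**build_deck)     # step 5B: build property structure
    class2_annot_build(**build_deck)       # step 5C: build class annotations
    prop2_annot_build(**build_deck)        # step 5D: build property annotations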

Before we conclude this major part in our CWPK series, we next proceed to show how all of the steps may be combined to achieve a rather large re-factoring of all of KBpedia.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site. The cowpoke Python code listing covering the series is also available from GitHub.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted: October 1, 2020

Getting Serious about the Code and Deciding to Say Adios

To this point, we have accumulated a growing roster of methods for extracting and building KBpedia, and utilities to support those processes. With this growing maturity from our Cooking with Python and KBpedia series, it is time for us to put in place a formal testing regime for cowpoke and to take the steps to register it as a formal Python package.

The idea of unit testing is to assemble simple tests of single code functions that may be exercised whenever changes in our code base warrant it. These simple programs evaluate against a known results set to determine whether the routine still performs as expected. Unit tests are not a blanket approval of a method, but a way to ascertain whether certain key functions perform as expected. Unit testing is viewed by many as the foundation for integrated tests, the combination of which is one of the most important improvements in software development of the past 30 years.

As with so many other areas, there is a diversity of modules available to aid the testing process in Python. The unittest module is a part of Python’s standard library, and is our basis as well. But we will layer on to that a series of modules that will enable us to guide and develop our unit tests directly through the Spyder IDE.

I am most assuredly an amateur programmer. As I’ve stated before, I have never been paid a dime for writing a line of code. (And, now after more than halfway through this series, you can probably see why!) But since there is a widespread view that unit testing is a best practice, from Day One in my plan for this CWPK series I had slotted in one or two installments to learn and implement some unit tests. I began this particular installment with a high expectation, and indeed wrote most of this intro before I sat down to focus on learning and implementing tests. Yet I reached a conclusion quite contrary to my expectations. I’m writing this last sentence here just as I wrap up this investigation, with a slight taste of ashes that reminds me of our various experiences with the somewhat related area of agile programming. For my purposes and personality, there is just too much process, diversion, and paint-by-the-numbers to make unit testing a formal part of my workflow. I think I can see applications in large team development with mission-critical interdependencies, but my major realization is that I am already doing comprehensive, integrated testing. Unit tests are a diversion and a productivity loss, as I presently see them, in the case of knowledge graph roundtripping.

However, that being said, we still have the imperative to package up our CWPK code, which we have named cowpoke, as a standard Python package that we can readily make available through the common channels of GitHub and pip. We conclude this installment with our efforts in these areas, which now means you have complete and unfettered access to all of the code we have prepared to date through these CWPK installments.

Installing the Environment

To enable Spyder as our unittest interface, I began by installing a package extension specific to that task:

  conda install -c spyder-ide spyder-unittest

The unittest operations in Spyder also require the pytest module, which is already part of my base installation, but we make sure anyway:

  conda install pytest

You will want to set up a ‘tests’ folder under your project and write your test files, often multiple ones, to this directory for the package. As you install, you may be asked to grant some permissions, and here is where you configure the tool to point to your project.

You should then logout and restart your computer, and return to your project to continue. The system will also install a separate .pytest_cache directory under your project.

I found, like Python packages in general, the install and addition of the testing modules to be smooth and easy. A new pane gets created (upper right by default) in Spyder, and test run options appear under the Spyder Run menu item.

Anatomy of a Unit Test

By definition, a unit test is limited to a single “unit,” often used synonymously with a discrete function or algorithm. Ted Kaminski nicely summarizes the standard guidance as to what constitutes a good unit test:

  1. Tests should only test one thing
  2. Each test should be independent and self-contained
  3. Refactoring should not break tests
  4. Try to achieve maximal coverage with tests.

A commitment to unit tests encourages more public methods and greater piecing apart of routines. The general form of a unit test looks like:

  fixtures

  def test_test():
      setup
      assert test
      teardown

The pytest module uses ‘fixtures’ as a way to set up input templates of state or connectivity needed as inputs to the function. The unit test function is named, by convention, with a test_ prefix that informs the module a test is available. Though your production routines may favor shorter or more cryptic variable and function names, within the unit test environment best practice is to use longer and descriptive labels, since the tests and how they are being reported occur in a separate testing panel removed in both code and space from the subject routine.

Each test goes through an initial setup portion and then concludes with a teardown, where the temporary test structures are released when the test is done. The actual tests are done against assertions that have pre-determined ‘correct’ results, so that the test can evaluate to pass or fail. Multiple assertions may be evaluated in a given unit test, so more than one pass-fail may be returned. Like unit tests across tools and languages, results that pass are often shown in green on the screen, fails in red.
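To make this concrete, here is a small, hypothetical pytest example. The clean_label helper and its expected behavior are invented purely for illustration; tmp_path is a built-in pytest fixture that supplies a throwaway directory and handles its own teardown:

import csv

def clean_label(text):
    # Hypothetical helper under test: strip surrounding whitespace and quotes
    return text.strip().strip('"\'')

def test_clean_label_strips_quotes():
    assert clean_label(' "Mammal" ') == 'Mammal'

def test_csv_header_roundtrip(tmp_path):
    # Setup: write a one-row CSV to the temporary directory from the fixture
    out_file = tmp_path / 'sample.csv'
    with open(out_file, 'w', encoding='utf8', newline='') as f:
        csv.writer(f).writerow(['id', 'prefLabel'])
    # Assert: reading the file back yields the same header row
    with open(out_file, 'r', encoding='utf8') as f:
        assert next(csv.reader(f)) == ['id', 'prefLabel']
    # Teardown: pytest removes tmp_path automatically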

Determining Where Unit Tests Are Applicable

I began my unit test efforts in earnest by first assembling an inventory of cowpoke’s defined functions to date:

extract.py: annot_extractor, struct_extractor, typol_extractor

build.py: row_clean, class_struct_builder, prop_struct_builder, class_annot_builder, prop_annot_builder

utils.py: dup_remover, set_union, set_difference, set_intersection, typol_intersects, disjoint_status, branch_orphan_check, dup_parental_chain

I then began to lay out my plan of attack on paper. When I research such matters I note sources that seem to have good code examples and I will mark them for later consultation, but my initial investigations are spent more on finding clear coding approaches and constructs and generalities or patterns for how to set up things. One of the first observations is that all of my roundtripping routines involved quite a bit of I/O and configuration. I was therefore looking especially for guidance around the idea of ‘fixtures’ or ‘parameters’ with pytest. A second observation is that most of my utils.py routines are used infrequently, sometimes no more frequently than once every build or three. These were not heavily used routines.

Most of the unit test examples I came across were toy cases, such as adding or multiplying a couple of numbers or concatenating some strings. I tried to focus my investigations on use of CSV files, since that is such a central construct in our knowledge graph approach. I started to see hints that perhaps unit tests are not a good idea for file and I/O purposes. A quote from the user Dunes on StackOverflow seemed to best capture the sense I was gaining from my research: “Unit tests that access the file system are generally not a good idea. This is because the test should be self contained, by making your test data external to the test it’s no longer immediately obvious which test the csv file belongs to or even if it’s still in use.”

Hmmm. I could see that, good idea or not, what I was going to have to do to set up my tests and get them “mocked” up for all of the I/O and data staging I would need was not a trivial matter. It was also perhaps the case that my general roundtripping routines, with their many steps and loops, were already too complex for unit testing. It was beginning to dawn on me that to design my unit tests properly, I would need to further piece apart my existing routines into more atomic functions. Wow, I really did not like that idea, since it would kick me all of the way back to Square One and force me to re-factor all of my code to date. And I had been making such great progress!

I could see that unit testing was not going to be some minor ‘adder’ to improve best practices, but more akin to a whole change in philosophy and approach. At minimum, it was looking like I would need to double the size of my code base, learn a bunch of new stuff needed by the test machinery, and change my design and architecture, all for tests of isolated functions that told me nothing about application-wide behavior and seemed only to test what I already knew to be true. Ouch! This unit test stuff was not looking to be a good deal.

Calling Time Out and Testing Premises

We had similar realizations about the use of agile development in the past. While we are a boutique development shop that tends to work on smaller, bespoke projects, we have also been subcontractors on much larger teams with enterprise-scale budgets and project management. It is sometimes exciting, often lucrative, and too frequently exasperating to work on big, multi-team projects. We understand the discipline needed for larger-scale projects and can see the merit (if lightly applied) of agile approaches. But too often agile is just another way to kill innovation and productivity through too many meetings and process.

I had taken as a given that unit testing was an unalloyed good. But, here I was, barely hours into a concerted investigation, and I was seeing serious red flags. Because I had initially not questioned the premise, I had not specifically looked into criticisms or critics of unit testing. The truth is, I had just taken it all as a given and had not inspected my testing assumptions. I believe in my bones in the merit of tested and vetted information products, but perhaps unit testing was not a way to go in our circumstance. What was indeed best and true here?

So, I shifted my investigations from ‘how to do’ to ‘whether to do’ and discovered more criticism and naysayers than I had imagined. Some of this criticism was now a dozen or more years old. Some of the criticism is empirical, some philosophical or nuanced.

There is apparently a steep learning curve to master unit testing and to make it an integral part of the development process. My initial investigations had flagged that prospect in spades. Unit testing sets up its own incentive objective, which can be a good thing, but, if not done with the right balance or awareness, can result in mindless code proliferation or developing to the incentive. More public and smaller methods result, which are harder to maintain over time:

Figure 1: Declining Usefulness of Unit Tests (from W. Platz, "The Eroding Agile Test Pyramid", Feb 20, 2019)

Integrated testing can also be made more difficult due to the code fragmentation.

Respected innovators like Donald Knuth have called unit testing “a waste of time.” Past enthusiasts like David Heinemeier Hansson, the developer of Ruby on Rails, now argue that integrated testing is the proper focus. Kaminski, noted above, has also been critical. There have been many others critical of the approach.

A couple of articles by James Coplien on Why Most Unit Testing is a Waste and its segue in 2014 were lightning rods on the topic. There is also a more profane, but still thoughtful, take on the question. Even commercial proponents propose additional steps and tools to improve the unit testing experience and results. There appears to be some growing realization that there are boundaries to unit testing and the need for better definitions of where unit tests may be essential or relevant.

Framing Testing in a Different Light

This more open-minded investigation of the question of unit testing has changed my perspective. My impression is that there is a place and likely best practices and methods for doing unit testing. However, an excessive insistence on unit testing may actually be counter-productive by distorting incentives and leading to code proliferation and fragmentation. Paradoxically, this may make the code base harder to maintain and make it more difficult to discover integrated or system issues. One area that concerns me is in RESTful or Web-based distributed development where APIs and interfaces are prominent, but hard to mock up. The lack of examples useful to my needs is another concern.

More fundamentally, this exercise has caused me to think of testing in a new light. I remain convinced that testing and reliability are paramount, but that has meaning only in relation to the ultimate deliverables or purposes, not the individual pieceparts. The objective is the purpose of the software, not unit testing per se.

A roundtripping objective, my governing purpose, is, actually, a system test of the highest order. We need to be able to break down and manipulate a knowledge artifact, re-build it again, and be able to inspect and use it in process-heavy external environments. Being able to load and inspect and apply logic tests in a totally different Protégé environment is a demanding system test for whether our code base has been accurate in the entire cycle of transformations. I’m already doing loads of testing, and relevant, too. My realization was that the entire basis of my CWPK series was to create an artifact, test it for coherence, modify it, and then test it for coherence again. Such roundtripping is indeed a demanding task.

I am glad I began with the premise of instituting some unit tests in the cowpoke project. It has caused me to think more clearly about why test in the first place, and that achieving end goals should take precedence over adhering to any particular method or process. There is no end to the learning, is there?

The conclusion about the immediate objective was to put unit testing off to the side. If I can completely break down and then re-build a knowledge graph, there is no shame in not doing unit testing.

Setting Up the GitHub Repository

We have already created the basic directory structure for a Python package, as first outlined in CWPK #33 onward. It is now time to formalize this structure, create a GitHub repository, and add additional packaging requirements suitable for listing cowpoke for pip distribution.

Here are the steps I undertook:

  1. Went to the directory where the cowpoke code is stored under my local Python projects
  2. Using Git, created a new repository at this location
  3. Committed all existing Python files in that directory to the new repository
  4. Added the additional files needed for pip as detailed in the next section
  5. Created an empty cowpoke repository under our main branch (Cognonto) in GitHub
  6. Using TortoiseGit under my local file system, ‘pushed’ the local Git repository to GitHub.

It is important that the directory created under GitHub be completely empty. This means at time of creation that I did NOT add a README.md Markdown file. That file is created under the next set of steps and is ‘pushed’ to this new directory.

Upon completion of the next steps, I then ‘pushed’ my local files to GitHub. I did so by picking TortoiseGit when in the root of my local cowpoke directory, and then I entered the HTTPS link for the empty directory on GitHub as the remote URI location. That link is found under the green ‘Code’ button at the upper right of the GitHub cowpoke directory. For reference, this link is:

https://github.com/Cognonto/cowpoke.git

I will speak more about the use of GitHub at the conclusion of this CWPK series. The bottom-line trick I have discovered, however, is to make sure local or remote is ‘clean’ prior to cloning from the other, and then to ‘pull’ changes from the destination repository before ‘pushing’ from the source one.
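For readers who prefer the command line to TortoiseGit, the equivalent sequence is roughly the following standard Git commands (the remote URL is the one given above; the local path is whatever directory holds your cowpoke code):

cd /path/to/local/cowpoke                 # the local cowpoke package directory
git init                                  # step 2: create the local repository
git add .                                 # step 3: stage the existing Python files
git commit -m "Initial cowpoke commit"
git remote add origin https://github.com/Cognonto/cowpoke.git
git push -u origin master                 # step 6: push to the empty GitHub repository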

Download cowpoke

From your standpoint as a user, you can obtain the cowpoke code from GitHub by essentially reversing this process. The steps you should follow are:

  1. If using Windows, make sure TortoiseGit is installed on your local machine. Search for instructions on the Web if you do not have this application installed
  2. Go to the cowpoke GitHub location indicated above
  3. Create a new cowpoke directory under your Python packages wherever you have them stored locally (should be under xxx/main-python-directory/Lib/site-packages)
  4. Create a new Git repository at that same location; leave blank
  5. ‘Pull’ the repository from GitHub using the cowpoke GitHub location indicated above as your remote specification.

Creating the cowpoke Package

It is not necessary to have a pip package for cowpoke, since it is possible (if you have the GitPython package installed) to obtain the code directly from GitHub:

pip install gitpython

import git
git.Git("/xxx/main-python-directory/Lib/site-packages").clone("git://github.com/Cognonto/cowpoke.git")

However, it is easier to treat cowpoke as a standard Python package, which we created by following the guidelines for the Python package installer (pip).

First, I did a test installation at test.pypi.org using this step-by-step guide. There are a few required files that each package must contain, including notably:

setup.py          # definitions of the package and dependencies
LICENSE           # the license for the package
README.md         # the readme description file
code files

All of these requirements and the steps to follow are outlined in the guide.
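To give a flavor of what setup.py entails, here is a minimal sketch along the lines of that guide's setuptools template (the metadata values are illustrative placeholders, not the exact cowpoke entries):

import setuptools

with open('README.md', 'r', encoding='utf8') as fh:
    long_description = fh.read()

setuptools.setup(
    name='cowpoke',
    version='1.0.0',                      # placeholder; PyPI requires a new version for each upload
    author='Michael Bergman',
    description='Extraction and build routines for the KBpedia knowledge graph',
    long_description=long_description,
    long_description_content_type='text/markdown',
    url='https://github.com/Cognonto/cowpoke',
    packages=setuptools.find_packages(),
    license='MIT',                        # the license ultimately chosen (see below)
    python_requires='>=3.6',
)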

Windows is a little tricky. I had a hard time using the Apache 2 license, so fell back to the MIT one. Also, the acceptance of tokens, as suggested by the guide, proved problematic, possibly due to the lack of a $HOME directory on my Windows machine. I used my straight login and password for the test site instead, and that worked fine. One must also have the setup.py working just right, or the test will fail with an error. (You can run python setup.py install to check your package locally.) Also, the instructions kept insisting I use ‘python3’, but my local configuration points python directly to version 3, so including the numeral kept Python from running properly; using the simple python did the trick for my environment.

Nonetheless, after making these changes, I was able to successfully complete the test install.

This test exercise means the package file structure is now suitable for the actual formal package upload. There is a separate guide for the formal site. Note that the formal package registry is a separate site (https://pypi.org/) with its own login and password, distinct from the test site. Per the test site instructions, I had already installed the twine upload assistant package. So, after logging into the PyPI site, we begin the upload process with:

python -m twine upload dist/*

I am then prompted for my PyPI login and password. The material is then uploaded with progress bars, and upon acceptance we get a message about where to find our new cowpoke package:

https://pypi.org/project/cowpoke/

Now, it is important to know that one cannot update this information without incrementing the version number. So, it is essential that the input information be accurate and complete, which means the test upload is a very important step.

Going forward, it is now possible for you to install cowpoke directly into your Python project by using:

pip install cowpoke
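
If you want to confirm the installation took in your environment, a quick check is:

pip show cowpoke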

Lastly, please notice I have updated the first notice banner at the conclusion of these installments to indicate where to find the cowpoke Python code.

Additional Documentation

Here are some sources on the general question of testing and unit testing in Python:

Here are some sources on how to create a repository on GitHub and create a pip package:

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site. The cowpoke Python code listing covering the series is also available from GitHub.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on October 1, 2020 at 11:17 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2388/cwpk-46-creating-the-cowpoke-package-and-unit-tests/
The URI to trackback this post is: https://www.mkbergman.com/2388/cwpk-46-creating-the-cowpoke-package-and-unit-tests/trackback/
Posted:September 30, 2020

Out of Sequence, But Reducing ‘Garbage’ Always Makes Sense

We have noted in previous installments in this Cooking with Python and KBpedia series how important consistent UTF-8 encoding is to roundtripping with our files. One way to enforce this is to consistently read and write files with UTF-8 specified, as discussed in CWPK #31. But, what if we have obtained external information? How can we ensure it is in the proper encoding or has wrong character assignments fixed? If we are going to perform such checks, what other consistency tests might we want to include? In this installment, we add some pre-build routines to test and clean our files for proper ingest.

As I noted in CWPK #39, cleaning comes before the build steps in the actual build process. But we wanted to have an understanding of broader information flows throughout the build or use scenarios before formulating the cleaning routines. That is both because they are not always operationally applied, and because working out the build steps was aided by not having to carry around extra routines. Now that we have the ingest and build steps fairly well outlined, it is an easier matter to see where and how cleaning steps best fit into this flow.

At the outset, we know we want to work with clean files when building KBpedia. Do we want such checks to run in every build, or optionally? Do we want to run checks against single files or against entire directories or projects? Further, are we not likely to want to add more checks over time as our experience with the build process and problems encountered increase? Lastly, we can see down the road (CWPK #48) to where we also only want to make incremental changes to an existing knowledge graph, as opposed to building one from scratch or de novo. How might that affect cleaning requirements or placement of methods?

Design Considerations

In thinking about these questions, we decided to take this general approach to testing and vetting clean files:

  1. Once vetted, files will remain clean (insofar as the tests run) until next edited. It may not make sense to check all files automatically at the beginning of a build. This point suggests we should have a separate set of cleaning routines from the overall build process. We may later want to include that into an overall complete build routine, but we can do so later as part of a make file approach rather than including cleaning as a mandatory part of all builds.

  2. Once we have assembled our files for a new build, we should assume that all files are unvetted. As build iterations proceed, we only need to vet those files that have been modified. When initially testing a new build, it probably makes sense for us to be able to loop over all of the input files in a given directory (corresponding to most of the subdirectories under kbpedia > version > build; see prior CWPK #37 installment). These points suggest we want the option to configure our clean routines for either all files in a subdirectory or a list of files. To keep configuration complexity lower, we will stipulate that if a list of files is used, they should all be in the same subdirectory.

  3. Our biggest cleaning concern is that we have clean, UTF-8 text (encodings) in all of our input files. However, since we need to run this single test anyway, we ought to test for other consistency concerns as well. Here are the additional tests that look useful in our initial module development:

    • Have new fields (columns) been added to our CSV files?
    • Are our input files missing already defined fields?
    • Are we missing required fields (prefLabel and definition)?
    • Are our fields properly constructed (CamelCase with initial cap for classes, initial lowercase for properties, and URI encoding for IRIs)?
  4. If we do have encoding issues, and given the manual effort required to fix them, can we include some form of encoding ‘fix’ routine? It turns out there is a Python package for such a routine, that we will test in this installment and include if deemed useful.

These considerations are what have guided the design of the cowpoke clean module. Also, as we noted in CWPK #9, our design is limited to Python 3.x. Python 2 has not been accommodated in cowpoke.
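
To illustrate the directory-or-file-list option from design point 2 above, here is a minimal sketch (not the actual clean module code) of how a cleaning routine might accept either an entire subdirectory or an explicit list of file names; the paths and names shown are hypothetical:

import os

def gather_clean_files(src, extension='.csv'):
    # return the files to clean, whether src is a subdirectory or a list of file names
    if isinstance(src, str) and os.path.isdir(src):
        return [os.path.join(src, f) for f in os.listdir(src) if f.endswith(extension)]
    else:
        return list(src)

# either form works:
# gather_clean_files(r'C:/1-PythonProjects/kbpedia/v300/build_ins/classes')
# gather_clean_files(['Generals_annot_out.csv', 'typol_Animals.csv'])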

A Brief Detour for URIs

KBpedia is a knowledge graph based on semantic technologies that incorporates seven major public and online knowledge bases: Wikipedia, Wikidata, DBpedia, schema.org, GeoNames, UNSPSC, and OpenCyc. A common aspect of all of these sources is that information is referenced by a Web string that ‘identifies’ the item at hand and that, when clicked, also takes us to the source of that item. In the early days of the Web this identifier mostly pertained to Web pages and was known as a Uniform Resource Locator, or URL. They were the underlined blue links of the Web’s early days.

But there are other protocols for discovering resources on the Internet besides the Web protocols of HTTP and HTTPS. There is Gopher, FTP, email, and others. Also, as information began to proliferate from Web pages to data items within databases and these other sources, the idea of a ‘locator’ was generalized to an ‘identifier’ for cases where the item is a data record and not a page. This generalization is known as a URI or, when the item is a ‘name’ within other schemes or protocols, a URN. Here, for example, is the URI address of the English Wikipedia main page:

  https://en.wikipedia.org/wiki/Main_Page

Note that white space is not allowed in this string, and is replaced with underscores in this example.

The characters allowed in constructing one of these addresses were limited mostly to ASCII, with some characters like the forward slash (‘/’) reserved because they have a defined role in constructing an address. If one wanted to include non-allowed characters in a URI address, they needed to be percent encoded. Here, for example, is the English Wikipedia address for its article on the Côte d’Azur Observatory:

  https://en.wikipedia.org/wiki/C%C3%B4te_d%27Azur_Observatory

This format is clearly hard to read. Most Web browsers, for example, decode these strings when you look at the address within the browser, so it appears as this:

  https://en.wikipedia.org/wiki/Côte_d'Azur_Observatory

And, in fact, if you submit the string as exactly shown above, encoders at Wikipedia would accept this input string.
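
Python’s standard library can translate between these two forms. Here is a small illustration (not part of cowpoke) using urllib.parse:

from urllib.parse import quote, unquote

iri = "https://en.wikipedia.org/wiki/Côte_d'Azur_Observatory"

# percent-encode the non-ASCII and reserved characters (':' and '/' are kept as address constructors)
print(quote(iri, safe=':/'))

# and decode a percent-encoded address back into its readable form
print(unquote('https://en.wikipedia.org/wiki/C%C3%B4te_d%27Azur_Observatory'))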

The Internationalized Resource Identifier (IRI) was proposed and then adopted on the Web as a way of bringing in a wider range of acceptable characters useful to international languages. Mostly what we see in browsers today is the IRI version of these addresses, even if not initially formulated as such.

Sources like Wikipedia and Wikidata restrict their addresses to URIs. A source like DBpedia, on the other hand, supports IRIs. Wikipedia also has a discussion on how to fix these links.

The challenge in these different address formats is that if encoding gets screwed up, IRI versions of addresses can also get screwed up. That might be marginally acceptable when we are encoding something like a definition or comment (an annotation), but absolutely breaks the data record if it occurs to that record’s identifying address: Any change or alteration of the exact characters in the address means we can no longer access that data item.

Non-percent encoded Wikipedia addresses and DBpedia addresses are two problem areas. We also have tried to limit KBpedia’s identifiers to the ASCII version of these international characters. For example, the KBpedia item for Côte-d’Or shows as the address:

  http://kbpedia.org/kko/rc/CoteDOr

We still have a readable label, but one with encoding traps removed.

I provide this detour to highlight that we also need to give special attention in our clean module to how Web addresses are coming in to the system and being treated. We obviously want to maintain the original addresses as supplied by the respective external sources. We also want to test and make sure these have not been improperly encoded. And we also want to test that our canonical subset of characters used in KBpedia is being uniformly applied to our own internal addresses.

Encoding Issues and ftfy

Despite it being design point #4 above, let’s first tackle the question of whether encoding fixes may be employed. I move it up the list because it is also the best way to illustrate why encoding issues are at the top of our concerns. First, let’s look at 20 selected records from KBpedia annotations that contain a diversity of language and symbol encodings.

Getting the files: The three files mentioned below are part of the formal cowpoke release, which does not come until CWPK #46. For now, you can obtain these files from https://github.com/Cognonto/CWPK/tree/master/sandbox/builds/working.

These three files are part of the cowpoke distribution. This first file is the starting set of 20 selected records (remember Run or shift+enter to run the cell):

with open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\annotations_orig.csv', 'r', encoding='utf8') as f:
    print(f.read())

However, here is that same file when directly imported into Excel and then saved (notice we had to change the encoding to get the file to load in Python):

with open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\annotations_excel.csv', 'r', encoding='cp1252') as f:
    print(f.read())

Wow, did that file ever get screwed up! (You will obviously need to change the file locations to match your local configuration.) In fact, there are ways to open CSV files properly in Excel by first firing up the application and then using the File → Open dialogs, but the form above occurs in English MS Excel when you open the file directly, make a couple of changes, and then save. If you do not have a backup, you would be in a world of hurt.

So, how might we fix this file, or can we? The first thing to attempt is to load the file with the Python encoding set to UTF-8. Indeed, in many cases, that is sufficient to restore the proper character displays. One thing that is impressive in the migration to Python 3.6 and later is tremendously more forgiving behavior around UTF-8. That is apparently because of the uniform application now of UTF-8 across Python, plus encoding tests that occur earlier when opening files than occurred with prior versions of Python.

But in instances where this does not work, the next alternative is to use ftfy (fixes text for you). The first thing we need to do is to import the module, which is already part of our conda distribution (see CWPK #9):

import ftfy

Then, we can apply ftfy methods (of which there are many useful ones!) to see if we can resurrect that encoding-corrupted file from Excel:

import io

with io.open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\annotations_excel.csv', encoding='utf-8', mode='r', errors='ignore',) as f:
    lines = f.readlines()
    print(lines)
    fixed_lines = [ftfy.fix_text(line) for line in lines]
    print(fixed_lines)
# so you may inspect the results, but we will also write it to file:
    with io.open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\annotations_fixed.csv', encoding='utf-8', mode='w',) as out:
        print(fixed_lines, file=out)

I have to say this is pretty darn impressive! We have recovered nearly all of the original formats. Now, it is the case there are some stoppers in the file, which is why we needed to use the more flexible io method of opening the file so that we could ignore the errors. Each of the glitches that remain in the file still needs to be fixed manually. But we can also pass ‘replace’ instead of ‘ignore’ as the errors argument to insert a known replacement character that makes these glitches quicker to find. Overall, this is a much reduced level of effort to fix the file than without ftfy. We have moved from a potentially catastrophic situation to one that is an irritant to fix. That is progress!
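
As a minimal sketch of that option, each unreadable byte then shows up as the Unicode replacement character, which is easy to search for; the file path follows the earlier examples:

import io

with io.open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\annotations_excel.csv',
             encoding='utf-8', mode='r', errors='replace') as f:
    for num, line in enumerate(f, start=1):
        if '\ufffd' in line:                          # U+FFFD marks each spot the decoder could not read
            print('possible glitch at line', num)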

Just to confirm (and for which one could do file compares to see specific differences to also help in the manual corrections), here is our now ‘fixed’ output file:

with open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\annotations_fixed.csv', 'r', encoding='utf-8') as f:
    print(f.read())

We can also inspect our files as to what encoding we think it has. Again, we use an added package, chardet in this case, to test any suspect file. Here is the general form:

import chardet

with open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\annotations_fixed.csv', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

Note that one of the arguments passes the first 10,000 bytes to the method as the basis for estimating the encoding type. Since the routine is quick, there is really no reason to lower this amount, and higher does not seem to provide any better statistics.

Again, a gratifying aspect of the improvements to Python since version 3.6 or so has been a more uniform approach to UTF-8. We also see we have some tools at our disposal, namely ftfy, that can help us dig out of holes that prior encoding mistakes may have dug. In our early years when encoding mismatches were more frequent, we also developed a Clojure routine for fixing bad characters (or at least converting them to a more readable form). It is likely this routine is no longer needed with Python’s improved handling of UTF-8. However, if this is a problem for your own input files, you can import the unicodedata module from the Python standard library to convert accented (diacritic) characters to ones based on ASCII. Here is the basic form of that procedure:

import unicodedata

def remove_diacrits(input_str):
    input_str = unicodedata.normalize('NFD', input_str).encode('ascii', 'ignore')\
           .decode('utf-8')
    return str(input_str)

s = remove_diacrits("Protégé")

print(s)
Protege

You can embed that routine in a CSV read that also deals with entire rows at a time, similar to some of the other procedures noted here.
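
Here is one way such an embedding might look, as a minimal sketch; the input and output file names are hypothetical:

import csv
import unicodedata

def remove_diacrits(input_str):
    return unicodedata.normalize('NFD', input_str).encode('ascii', 'ignore').decode('utf-8')

with open('annotations_in.csv', 'r', encoding='utf8', newline='') as f_in, \
     open('annotations_ascii.csv', 'w', encoding='utf8', newline='') as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    for row in reader:
        writer.writerow([remove_diacrits(cell) for cell in row])   # clean every cell in the row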

However, the best advice, as we have reiterated, is to make sure that files are written and opened in UTF-8. But, it is good to know if we encounter encoding issues in the wild, that both Python and some of its great packages stand ready to help rectify matters (or at least partially so, with less pain). We have also seen how encoding problems can often be a source of garbage input data.

Flat File Checks

Though Python routines could be written for the next points below, they may be easier to deal with directly in a spreadsheet. This is OK, since we are also at that point in our roundtripping where we are dealing directly with CSV files anyway.

To work directly with the sheet, highlight the file’s entire set of rows and columns that are intended for eventual ingest during a build. Give that block a logical name in the upper-left text box entry directly above the sheet, such as ‘Match’ or ‘Big’. You can continue to invoke that block name to re-highlight your subject block. From there, you can readily sort on the specific input column of interest in order to inspect the entire row of values.

Here is my checklist for such flat file inspection:

  1. Does any item in the ‘id’ column lack a URI fragment identifier? If so, provide using the class and property URI naming conventions in KBpedia (CamelCase in both instances, upper initial case for classes, lower initial case for properties, with only alphanumerics and underscore as allowable characters). Before adding a new ‘id’, make sure it is initially specified in one of the class or property struct input files

  2. Does any item in the ‘prefLabel’ column lack a preferred label? If so, add one; this field is mandatory

  3. Does any item in the ‘definition’ column lack an entry? If so, add one. Though this field is not mandatory, it is highly encouraged

  4. Check a few rows. Does any column entry have leading or trailing white spaces? If so, use the spreadsheet TRIM function

  5. Check a few rows. Do any of the files with a ‘definition’ column show the full text spread over more than one cell? If so, you have an upstream CSV processing issue that is splitting entries at the comma or some other character that should be escaped. The best fix, if intermediate processing has not occurred, is to re-extract the file with correct CSV settings. If not, you may need to concatenate multiple cells in a row in order to re-construct the full string

  6. Check entries for wrong or misspecified namespaces or prefixes. Make sure fragments end with the appropriate characters (‘#’ or ‘/’ if used in a URI construction)

  7. Check columns where multiple entries may reside using the double-pipe (‘||’) convention, and ensure these decomposable strings are being constructed properly.
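
Though I prefer doing these inspections in the spreadsheet, a few of the simpler checks lend themselves to a quick script. Here is a hypothetical sketch (not part of cowpoke) that flags missing mandatory fields and stray leading or trailing white space, using the field names from the conventions above:

import csv

def flag_rows(in_file):
    with open(in_file, 'r', encoding='utf8', newline='') as f:
        reader = csv.DictReader(f)
        for num, row in enumerate(reader, start=2):            # row 1 is the header
            if not (row.get('prefLabel') or '').strip():
                print(num, 'missing mandatory prefLabel')
            if not (row.get('definition') or '').strip():
                print(num, 'missing definition (encouraged)')
            for field, value in row.items():
                if value and value != value.strip():
                    print(num, field, 'has leading or trailing white space')

# flag_rows(r'C:/1-PythonProjects/kbpedia/v300/build_ins/classes/Generals_annot_out.csv')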

One of the reasons I am resistant to a complete build routine cascading through all of these steps at once is that problems in intermediate processing files propagate through all subsequent steps. That not only screws up much stuff, but it is harder to trace where the problem first arose. This is an instance where I prefer a ‘semi-automatic’ approach, with editorial inspection required between essential build steps.

Other Cleaning Routines

Fortunately, in our case, we are extracting fairly simple CSV files (though often with some long text entries for definitions) and ingesting in basically the same format. As long as we are attentive to how we modify the intermediate flat files, there is not too much further room for error.

However, there are many sources of external data that may eventually warrant incorporation in some manner into your knowledge graph. These external sources may pose a larger set of cleaning and wrangling challenges. Date and time formats, for example, can be particularly challenging.

Hadley Wickham, the noted R programmer and developer of many fine graphics programs, wrote a paper, Tidy Data, that is an excellent starting primer on wrangling flat files. In the case of our KBpedia knowledge graph and its supporting CSV files, about the only guideline he proposes that we consciously violate is that we sometimes combine many-to-one data items in a single column (notably for altLabels, but a few others as well). According to Wickham, we should put each individual value on its own row. I have not done so in order to keep the listings more compact and the row count manageable. Nonetheless, his general guidance is excellent. Another useful guide is Wrangling Messy CSV Files by Detecting Row and Type Patterns.

There are also many additional packages in Python that may assist in dealing with ‘dirty’ input files. Depending on the specific problems you may encounter, some quick Web searches should turn up some useful avenues to pursue.

Lastly, in both our utils.py and other modules going forward, we will have occasion to develop some bespoke cleaning and formatting routines as our particular topic warrants.

Additional Documentation

Here is some additional documentation related to today’s CWPK installment:

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on September 30, 2020 at 9:57 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2387/cwpk-45-cleaning-and-file-pre-checks/
The URI to trackback this post is: https://www.mkbergman.com/2387/cwpk-45-cleaning-and-file-pre-checks/trackback/
Posted:September 29, 2020

More Fields, But Less Complexity

We now tackle the ingest of annotations for classes and properties in this installment of the Cooking with Python and KBpedia series. In prior installments we built the structural aspects of KBpedia. We now add the labels, definitions, and other assignments to them.

As with the extraction routines, we will split these efforts into class annotations and then property annotations. Our actual load routines are fairly straightforward, and we have no real logic concerns in how these annotations get added. The most complex wrinkle we will need to address is those annotation fields, altLabels and notes in particular, where we have potentially many assignments for a single reference concept (RC) or property. As we saw with the extraction routines, for these items we will need to set up additional internal loops to segregate and assign the items for loading based on our standard double-pipe (‘||’) delimiter.

The two functions we develop in this installment, class_annot_build and prop_annot_build, will be added to the build.py module.

Start-up

Since we are in an active part of the build cycle, we want to continue with our main knowledge graph in-progress for our load routine, so please make sure that kb_src is set to ‘standard’ in your config.py configuration. We then invoke our standard start-up:

from cowpoke.__main__ import *
from cowpoke.config import *

Loading Class Annotations

Class annotations consist of potentially the item’s prefLabel, altLabels, definition, and editorialNote. The first item is mandatory; the next two should be provided to adhere to best practices. The last is optional. There are, of course, other standard annotations possible. Should your own conventions require or encourage them, you will likely need to modify the procedure below to account for that fact.

As with these methods before, we provide a header showing ‘typical’ configuration settings (in config.py), and then proceed with a method that loops through all of the rows in the input file. Here is the basic class annotation build procedure. There are no new wrinkles in this routine from what has been seen previously:

### KEY CONFIG SETTINGS (see build_deck in config.py) ###                  
# 'kb_src'        : 'standard'                                        
# 'loop_list'     : file_dict.values(),                           # see 'in_file'
# 'loop'          : 'class_loop',
# 'in_file'       : 'C:/1-PythonProjects/kbpedia/v300/build_ins/classes/Generals_annot_out.csv',
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/target/ontologies/kbpedia_reference_concepts_test.csv',


def class_annot_build(**build_deck):
    print('Beginning KBpedia class annotation build . . .')
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    class_loop = build_deck.get('class_loop')
#    r_id = ''
#    r_pref = ''
#    r_def = ''
#    r_alt = ''
#    r_note = ''
    if loop != 'class_loop':
        print("Needs to be a 'class_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval) 
        in_file = loopval
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',',
                                    fieldnames=['id', 'prefLabel', 'altLabel', 'definition',
                                                'editorialNote'])   # fields assumed to match the class annotation extraction file
            for row in reader:
                if is_first_row:                                       # skip the header row
                    is_first_row = False
                    continue
                r_id_frag = row['id']
                id = getattr(rc, r_id_frag)
                if id is None:                                         # flag any id not registered to the graph
                    print(r_id_frag)
                    continue
                r_pref = row['prefLabel']
                r_alt = row['altLabel']
                r_def = row['definition']
                r_note = row['editorialNote']
                id.prefLabel.append(r_pref)
                id.definition.append(r_def)
                i_alt = r_alt.split('||')
                if i_alt != ['']: 
                    for item in i_alt:
                        id.altLabel.append(item)
                i_note = r_note.split('||')
                if i_note != ['']:   
                    for item in i_note:
                        id.editorialNote.append(item)
    print('KBpedia class annotation build is complete.')               
class_annot_build(**build_deck)
kb.save(file=r'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts_test.owl', format='rdfxml') 

BTW, when we commit this method to our build.py module, we will add the save routine at the end.
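
As a quick spot check before committing the code, you can inspect a few of the loaded annotations directly; the concept used here is simply an illustrative example:

print(rc.Automobile.prefLabel)
print(rc.Automobile.altLabel)
print(rc.Automobile.definition)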

Loading Property Annotations

We now turn our attention to annotations of properties:

### KEY CONFIG SETTINGS (see build_deck in config.py) ###                  
# 'kb_src'        : 'standard'                                        
# 'loop_list'     : prop_dict.values(),                           # see 'in_file'
# 'loop'          : 'property_loop',
# 'in_file'       : 'C:/1-PythonProjects/kbpedia/v300/build_ins/properties/prop_annot_out.csv',
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/target/ontologies/kbpedia_reference_concepts_test.csv',

def prop_annot_build(**build_deck):
    print('Beginning KBpedia property annotation build . . .')
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    out_file = build_deck.get('out_file')
    if loop != 'property_loop':
        print("Needs to be a 'property_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval) 
        in_file = loopval
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'prefLabel', 'subPropertyOf', 'domain',  
                                   'range', 'functional', 'altLabel', 'definition', 'editorialNote'])                 
            for row in reader:
                if is_first_row:                                       # skip the header row
                    is_first_row = False
                    continue
                r_id = row['id']
                r_pref = row['prefLabel']
                r_dom = row['domain']
                r_rng = row['range']
                r_alt = row['altLabel']
                r_def = row['definition']
                r_note = row['editorialNote']
                r_id = r_id.replace('rc.', '')
                id = getattr(rc, r_id)
                if id is None:                                         # flag any property not registered to the graph
                    print(r_id)
                    continue
                id.prefLabel.append(r_pref)
                i_dom = r_dom.split('||')
                if i_dom != ['']: 
                    for item in i_dom:
                        id.domain.append(item)
                if 'owl.' in r_rng:
                    r_rng = r_rng.replace('owl.', '')
                    r_rng = getattr(owl, r_rng)
                    id.range.append(r_rng)
                elif r_rng == '':                                      # no range supplied for this row
                    pass
                else:                                                  # non-owl ranges are left unassigned for now
                    pass
#                    id.range.append(r_rng)
                i_alt = r_alt.split('||')    
                if i_alt != ['']: 
                    for item in i_alt:
                        id.altLabel.append(item)
                id.definition.append(r_def)        
                i_note = r_note.split('||')
                if i_note != ['']:   
                    for item in i_note:
                        id.editorialNote.append(item)
    print('KBpedia property annotation build is complete.') 
prop_annot_build(**build_deck)

Hmmm. One of the things we notice in this routine is that our domain and range assignments were not adequately picked up in our earlier KBpedia version 2.50 build routines (the ones undertaken in Clojure before this CWPK series). As a result, we cannot adequately test range and will need to address this oversight before our series is over.

As before, we will add our ‘save’ routine as well when we commit the method to the build.py module.

kb.save(file=r'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts_test.owl', format='rdfxml') 

We now have all of the building blocks to create our extract-build roundtrip. We summarize the formal steps and configuration settings in CWPK #47. But, first, we need to return to cleaning our input files and instituting some unit tests.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on September 29, 2020 at 9:24 am in CWPK, KBpedia, Semantic Web Tools | Comments (6)
The URI link reference to this post is: https://www.mkbergman.com/2385/cwpk-44-annotation-ingest/
The URI to trackback this post is: https://www.mkbergman.com/2385/cwpk-44-annotation-ingest/trackback/
Posted:September 28, 2020

Two Key Concepts: Consistency and Satisfiability

The last structural step in a build is to test the knowledge graph for logic, the topic of today’s Cooking with Python and KBpedia installment. We first introduced the concepts of consistency and satisfiability in CWPK #26. Axioms are assertions in an ontology, as informed by its base language; that is, the aggregate of the triple statements in a knowledge graph. Consistency is where no stated axiom entails a contradiction, either in semantic or syntactic terms. A consistent knowledge graph is one where its model has an interpretation under which all formulas in the theory are true. Satisfiability means that it is possible to find an interpretation (model) that makes the axiom true.

Satisfiability is a test of a class to discover whether there is an interpretation of it that is non-empty. This is tested against all of the logical axioms in the current knowledge graph, most effectively driven by disjoint and functional assertions. Consistency is an ontology-wide measure that tests whether there is a model that meets all axioms. I often use the term incoherent to refer to an ontology that has unsatisfiable assertions.
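
As a minimal toy sketch of these notions in owlready2 (the ontology IRI and class names here are hypothetical, not part of KBpedia), a class declared as a subclass of two disjoint classes becomes unsatisfiable and is reparented under owl.Nothing by the reasoner:

from owlready2 import *

toy = get_ontology('http://example.org/toy.owl')      # hypothetical scratch ontology
with toy:
    class Cat(Thing): pass
    class Vehicle(Thing): pass
    AllDisjoint([Cat, Vehicle])
    class JaguarBoth(Cat, Vehicle): pass               # subclass of two disjoint classes

sync_reasoner_pellet(toy)                              # requires Java, as with the KBpedia runs below
print(list(toy.inconsistent_classes()))                # JaguarBoth is reported as unsatisfiable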

The Sattler, Stevens, and Lord reference shown under the first link under Additional Documentation below offers this helpful shorthand:

  • Unsatisfiable: How ever hard you try, you will never find an individual which fits an unsatisfiable concept
  • Incoherent: Sooner or later, you are going to contradict yourself, and
  • Inconsistent: At least, one of the things you have said makes no sense.

In the Protégé IDE, unsatisfiable classes are shown in red in the inferred class hierarchy and are made subclasses of Nothing, meaning they can never have instances. If the ontology is inconsistent, a new window warns about the inconsistency and offers guidance on how to fix it.

The two reasoners available to us, via either owlready2 or Protégé, are HermiT and Pellet. HermiT is better at identifying inconsistencies, while Pellet is better at identifying unsatisfiable classes. We will use both in our structural logic tests.

However, before we get into those logic topics, we need to load up our system with our new start-up routines.

Our New Startup Sequence

As we discussed in the last installment, we no longer will post the specific start-up steps. At the same time that we are moving our prior functions into modules, discussed next, we have moved those steps to the cowpoke package proper. Here is our new start-up instruction:

from cowpoke.__main__ import *
from cowpoke.config import *

Please review your configuration settings in config.py to make sure you are using the appropriate input files and you know where to write out results. Assuming you have just finished your initial structural build steps, as discussed in the past few installments, you should likely be using the kb_src = 'standard' setting.

Summary of the Added Modules

Here are the steps we took to add the two new modules of build and utils to the cowpoke package:

  1. Added these import statements to __init__.py:

  from cowpoke.build import *
  from cowpoke.utils import *

  2. Added what had been our standard start-up expressions to __main__.py

  3. Created two new files using Spyder for the cowpoke project, build.py and utils.py, and added our standard file header to them

  4. Moved the various functions defined in recent installments into their appropriate new file, and ensured each was added in appropriate format to define a function def

  5. Tested the routines and made sure all functions were now appropriately disclosed and operational.

The build.py module contains these functions, covered in CWPK #40-41:

  • row_clean – a helper function to shorten resource IRI strings to internal formats
  • class_struct_builder – the function to process class input files into KBpedia’s internal representation
  • property_struct_builder – the function to process property input files into KBpedia’s internal representation.

The utils.py module contains these functions, covered in CWPK #41-42:

  • dup_remover – a function to remove duplicate rows in input files
  • set_union – a function to determine the union between two or more class input files
  • set_difference – a function to determine the difference between two (or more, though not recommended) class input files
  • set_intersection – a function to determine the intersection between two or more class input files
  • typol_intersects – a comprehensive function that calculates the pairwise intersection among all KBpedia typologies
  • disjoint_status – a function to extract the disjoint assertions from KBpedia
  • branch_orphan_check – a function to identify classes that are not properly connected with the KBpedia structure
  • dups_parental_chain – a helper function to identify classes that have more than one direct superclass assignment across the KBpedia structure, used to inform how to reduce redundant class hierarchy declarations.

Logic Testing of the Structure

Prior to logic testing, I suggest you review CWPK #26 again for useful background information. You may also want to refer to the sources listed below under Additional Documentation.

Use of owlready2

While it is true that owlready2 embeds basic logic calls to either the HermiT or Pellet reasoners, the amount of information forthcoming from these tools is likely insufficient to meet the needs of your logic tests. First, let’s invoke the HermiT reasoner, calling up our kb ontology:

sync_reasoner(kb)

Unfortunately, with our set-up as is, HermiT errors out on us. This is because the reasoner will not accept a file address for our imported KKO upper ontology. We could change that reference in our stored knowledge graph, but we will skip for now since we can obtain similar information from the Pellet reasoner.

So, we invoke the Pellet alternative (note the analysis will take about three or so minutes to run):

sync_reasoner_pellet(kb)

For test purposes, I had temporarily assigned JaguarCat as a subclass of JaguarVehicle, which is a common assignment error where a name might refer to two different things, in this case animals and automobiles, that are disjoint. As we noted above, this subclass assignment violates our disjoint assertions and thus is shown under the owl.Nothing category.
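
For reference, here is a minimal sketch of how such a temporary test assertion can be added, and later removed, with owlready2; this is for experimentation only and should not be left in your working graph:

# temporarily make the (disjoint) vehicle class a parent of the cat class
rc.JaguarCat.is_a.append(rc.JaguarVehicle)

# ... run sync_reasoner_pellet(kb) and inspect the results under owl.Nothing ...

# then back the test assertion out again
rc.JaguarCat.is_a.remove(rc.JaguarVehicle)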

If we add the temporary file switch to this call, however, we will write this information to the temporary file shown in the listing, plus more importantly get some traceback on where the problem may be occurring. This is the most detailed message available:

sync_reasoner_pellet(kb, keep_tmp_file=1)
* Owlready2 * Running Pellet...
java -Xmx2000M -cp C:\1-PythonProjects\Python\lib\site-packages\owlready2\pellet\antlr-3.2.jar;C:\1-PythonProjects\Python\lib\site-packages\owlready2\pellet\antlr-runtime-3.2.jar;C:\1-PythonProjects\Python\lib\site-packages\owlready2\pellet\aterm-java-1.6.jar;C:\1-PythonProjects\Python\lib\site-packages\owlready2\pellet\commons-codec-1.6.jar;C:\1-PythonProjects\Python\lib\site-packages\owlready2\pellet\httpclient-4.2.3.jar;C:\1-PythonProjects\Python\lib\site-packages\owlready2\pellet\httpcore-4.2.2.jar;C:\1-PythonProjects\Python\lib\site-packages\owlready2\pellet\jcl-over-slf4j-1.6.4.jar;C:\1-PythonProjects\Python\lib\site-packages\owlready2\pellet\jena-arq-2.10.0.jar;C:\1-PythonProjects\Python\lib\site-packages\owlready2\pellet\jena-core-2.10.0.jar;C:\1-PythonProjects\Python\lib\site-packages\owlready2\pellet\jena-iri-0.9.5.jar;C:\1-PythonProjects\Python\lib\site-packages\owlready2\pellet\jena-tdb-0.10.0.jar;C:\1-PythonProjects\Python\lib\site-packages\owlready2\pellet\jgrapht-jdk1.5.jar;C:\1-PythonProjects\Python\lib\site-packages\owlready2\pellet\log4j-1.2.16.jar;C:\1-PythonProjects\Python\lib\site-packages\owlready2\pellet\owlapi-distribution-3.4.3-bin.jar;C:\1-PythonProjects\Python\lib\site-packages\owlready2\pellet\pellet-2.3.1.jar;C:\1-PythonProjects\Python\lib\site-packages\owlready2\pellet\slf4j-api-1.6.4.jar;C:\1-PythonProjects\Python\lib\site-packages\owlready2\pellet\slf4j-log4j12-1.6.4.jar;C:\1-PythonProjects\Python\lib\site-packages\owlready2\pellet\xercesImpl-2.10.0.jar;C:\1-PythonProjects\Python\lib\site-packages\owlready2\pellet\xml-apis-1.4.01.jar pellet.Pellet realize --loader Jena --input-format N-Triples --ignore-imports C:\Users\mike\AppData\Local\Temp\tmpp4n32vj4
* Owlready2 * Pellet took 187.1356818675995 seconds
* Owlready * Equivalenting: kko.Generals kko.SuperTypes
* Owlready * Equivalenting: kko.SuperTypes kko.Generals
* Owlready * Equivalenting: rc.JaguarCat rc.JaguarVehicle
* Owlready * Equivalenting: rc.JaguarCat owl.Nothing
* Owlready * Equivalenting: rc.JaguarVehicle rc.JaguarCat
* Owlready * Equivalenting: rc.JaguarVehicle owl.Nothing
* Owlready * Equivalenting: owl.Nothing rc.JaguarCat
* Owlready * Equivalenting: owl.Nothing rc.JaguarVehicle
* Owlready * Reparenting rc.BiologicalLivingObject: {rc.FiniteSpatialThing, rc.OrganicMaterial, rc.NaturalTangibleStuff, rc.BiologicalMatter, rc.TemporallyContinuousThing} => {rc.BiologicalMatter, rc.FiniteSpatialThing, rc.OrganicMaterial, rc.TemporallyContinuousThing}
* Owlready * Reparenting rc.Animal: {rc.PerceptualAgent-Embodied, rc.AnimalBLO, rc.Organism, rc.Heterotroph} => {rc.PerceptualAgent-Embodied, rc.AnimalBLO, rc.Heterotroph}
* Owlready * Reparenting rc.Vertebrate: {rc.SentientAnimal, rc.MulticellularOrganism, rc.ChordataPhylum} => {rc.SentientAnimal, rc.ChordataPhylum}
* Owlready * Reparenting rc.SolidTangibleThing: {rc.ContainerIndependentShapedThing, rc.FiniteSpatialThing} => {rc.ContainerIndependentShapedThing}
* Owlready * Reparenting rc.Automobile: {rc.SinglePurposeDevice, rc.PassengerMotorVehicle, rc.WheeledTransportationDevice, rc.RoadVehicle, rc.TransportationDevice} => {rc.SinglePurposeDevice, rc.PassengerMotorVehicle, rc.RoadVehicle, rc.WheeledTransportationDevice}
* Owlready * Reparenting rc.AutomobileTypeByBrand: {rc.Automobile, rc.FacetInstanceCollection, rc.VehiclesByBrand} => {rc.Automobile, rc.VehiclesByBrand}
* Owlready * Reparenting rc.DeviceTypeByFunction: {rc.FacetInstanceCollection, rc.PhysicalDevice} => {rc.PhysicalDevice}
* Owlready * Reparenting rc.TransportationDevice: {rc.Conveyance, rc.HumanlyOccupiedSpatialObject, rc.Equipment, rc.DeviceTypeByFunction} => {rc.Conveyance, rc.HumanlyOccupiedSpatialObject, rc.Equipment}
* Owlready * Reparenting rc.LandTransportationDevice: {rc.TransportationProduct, rc.TransportationDevice} => {rc.TransportationDevice}
* Owlready * Reparenting rc.DeviceTypeByPowerSource: {rc.FacetInstanceCollection, rc.PhysicalDevice} => {rc.PhysicalDevice}
* Owlready * (NB: only changes on entities loaded in Python are shown, other changes are done but not listed)

Notice this longer version (as is true for the logs written to file) also flags some of our cyclical references.

Once the run completes, we can also call up the two classes (in this instance, not for what you have locally) that are unsatisfiable:

list(kb.inconsistent_classes())
[rc.JaguarCat, owl.Nothing, rc.JaguarVehicle]

Use of owlready2’s reasoners also enables a couple of additional methods that can be helpful, especially in cases such as the analysis of parental chains that we undertook last installment. Here are two additional calls that are useful:

kb.get_parents_of(rc.Automobile)
[rc.PassengerMotorVehicle,
rc.RoadVehicle,
rc.SinglePurposeDevice,
rc.TransportationDevice,
rc.WheeledTransportationDevice]
kb.get_children_of(rc.Automobile)
[rc.HondaCar,
rc.LuxuryCar,
rc.AlfaRomeoCar,
rc.Automobile-GasolineEngine,
rc.AutomobileTypeByBrand,
rc.GermanCar,
rc.AutoSteeringSystemType,
rc.AutomobileTypeByBodyStyle,
rc.AutomobileTypeByConventionalSizeClassification,
rc.AutomobileTypeByModel,
rc.AutonomousCar,
rc.GMAutomobile,
rc.DemonstrationCar,
rc.ElectricCar,
rc.JapaneseCar,
rc.HumberCar,
rc.SaabCar,
rc.NashCar,
rc.NewCar,
rc.OffRoadAutomobile,
rc.PoliceCar,
rc.RentalCar,
rc.UsedAutomobile,
rc.VauxhallCar]

You can also invoke data or property value tests with Pellet, with or without debugging:

sync_reasoner_pellet(infer_property_values=True, debug=1)
sync_reasoner_pellet(infer_property_values=True, infer_data_property_values=True)

It is clear that reasoner support in owlready2 is a dynamic thing, with more capabilities being added periodically to new releases. At this juncture, however, for our purposes, we’d like to have a bit more capability and explanation tracing as we complete our structure logic tests. For these purposes, let’s switch to Protégé.

Reasoning with Protégé

At this point, I think using Protégé directly is the better choice for concerted logic testing. To do so, you will likely need to take two steps:

  1. Using the File → Check for plugins … option in Protégé, make sure that Pellet is checked and installed on your system
  2. Offline, increase the memory allocated to Protégé to up to 80% of your free memory. The settings are found in the first lines of either run.bat or Protege.l4j.ini (remember, this series is based on Windows 10) in your Protégé startup directory. The two values are Xms6000M and Xmx6000M (showing my own increased settings for a machine with 16 GB of RAM); you may need to do an online search if you want to understand these settings better.

Then, to operate your reasoners once you have started up and loaded KBpedia (or your current knowledge graph) with Protégé, go to Reasoner (1) on the main menu, then pick your reasoner at the bottom of that menu. In this case, we are starting up with HermiT (2):

Figure 1: Starting Up HermiT in Protégé

Truth is, I have tended to work more with Pellet over the years. My impression is that HermiT is largely consistent with what I have seen in Pellet, and HermiT does load in Protégé with the file assignment of KKO that was not accepted by owlready2.

So, on that basis, I log off and re-load and now choose the Pellet option. When we Reasoner → Start reasoner, and then after loading, go to the classes tab and then pick the Class hierarchy (inferred) (1) (note the yellow background and red text), we see the two temporary assignments now showing under owl:Nothing (2):

Figure 2: Pellet Results in Protégé

In the case of an ‘inconsistent ontology’ a more detailed screen appears (not shown, since we have not rigged KBpedia to display such) that helps track back the possible causes.

Our own internal build routines with Clojure and the OWLAPI have a more detailed output and better tracing of possible unsatisfiable issues. I have not provided such routines in this CWPK series because they are not absolutely necessary for our ‘roundtripping’ objectives, and accomplishing such in Python is likely way beyond my limited programming skills. This general area of decomposing structural builds from a logical perspective remains a pretty weak one with available tools.

OOPS! Scanner

Another very useful utility for checking possible problems is the OOPS! (OntOlogy Pitfall Scanner) online tool. You may copy your ontology to its online form (not recommended for something the size of KBpedia) or point the tool to a URI where you have stored the file. If you are using the utility frequently, there is also a REST API to the system.

It presently provides 33 pitfall tests in areas such as structure, function, usability, consistency, and completeness. OOPS! classifies pitfalls it finds into minor, important or critical designations:

Figure 3: Analysis with OOPS!

OOPS! will catch issues that you would never identify on your own. Of course, you are not obligated to fix any of the issues, but some will likely seem appropriate. It is probably a good idea to run your knowledge graph against OOPS! at least once each major development cycle.

Some Logic Fix Guidelines

Of course, there may be many logic issues that arise in a knowledge graph. However, since we have largely restricted our scope to structure integrity and disjointedness, here are some general points drawn from experience of how to interpret and correct these kinds of issues.

  1. An owl.Nothing assignment with KBpedia likely is due to a misassigned disjoint assertion, since there has been much testing in this area

  2. The first and likeliest fix is to remove the offending disjoint assertion

  3. If there are multiple overlaps, look to the higher tier concepts, since they may be causative for a cascade of classes below them

  4. A large number of overlaps, with some diversity among them, perhaps indicates a wrong disjoint assertion between typologies

  5. To reclaim what intuitively (or abductively) feels like it should be a disjoint assertion between two typologies, consider cleaving one of the two typologies to better segregate the perceived distinctions

  6. Some conflicts may be resolved by moving the offending concept higher in the hierarchy, since more general typologies have fewer disjoint assertions

  7. Manually drawing Venn diagrams is one technique for helping to think through interactions and overlaps

  8. When introducing a new typology, or somehow shifting or re-organizing others, try to take only incremental steps. Very large structure changes are hard to diagnose and tease out; it seems to require fewer iterations to get to a clean build by taking more and smaller steps

  9. Assign domain and range to all objectProperties and dataProperties, but also be relaxed in the assignments to account for the diversity of data characterizations in the wild. As perhaps cleaning or vetting routines get added, these assignments may be tightened

  10. Ultimately, all such choices are ones of design, understandability, and defensibility. In difficult or edge cases, it is often necessary to study and learn more, and sometimes re-do boundaries of offending concepts in order to segregate the problem areas.

This material completes the structure build portions of our present cycle. We can next turn our attention to loading up the annotations in our knowledge graph to complete the build cycle.

Additional Documentation

Here are some supplementary references that may help to explain these concepts further:

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted by AI3's author, Mike Bergman Posted on September 28, 2020 at 9:11 am in CWPK, KBpedia, Semantic Web Tools | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/2384/cwpk-43-logic-testing-of-the-knowledge-graph-structure/
The URI to trackback this post is: https://www.mkbergman.com/2384/cwpk-43-logic-testing-of-the-knowledge-graph-structure/trackback/