Posted:October 6, 2020

CWPK #48: Case Study: A Sweeping Refactoring

Bringing Home the Lessons to Date with KBpedia v 3.00

Today’s installment in our Cooking with Python and KBpedia series is a doozy. Not only are we wrapping up perhaps the most important part of our series — building KBpedia from scratch — but we are also applying the full roundtrip software in our cowpoke Python package to a major re-factoring of KBpedia itself. This re-factoring will lead to the next release of KBpedia v. 3.00.

This re-factoring and new release was NOT part of the original plan for this CWPK series. Today’s current efforts were the result of issues we have discovered in the current version 2.50 of KBpedia, the version with which we began this series. The very process we have gone through in developing the cowpoke software to date has surfaced these problems. The problems have been there and perhaps part of KBpedia for some time, but our prior build routines were such that these issues were not apparent. By virtue of different steps and different purposes, we have now seen these things, and now have the extract and build procedures to address them.

It turns out the seven or so problems so identified provide a ‘perfect‘ (in the sense of ‘storm‘) case study for why a roundtrip capability makes sense and how it may be applied. Without further ado, let’s begin.

Summary of the Problem Issues

The cowpoke Python package as we have used to date has surfaced seven types of issues with KBpedia, v. 250, the basis with which we started this CWPK series. Our starting build files for this series are ones extracted from the current public v 250 version. About half of the issues are in the KBpedia knowledge graph, but had remained hidden given the nuances of our prior Clojure build routines. The other half of the issues relate to our new use of Python and owlready2.

These seven issues, with some background explanation, are:

Remove hyphens – in our prior build routines with Clojure, that language has a style that favors dashes (or hyphens) when conjoining words in a label identifier. Python is not hyphen-friendly. While we have not seen issues when working directly with the owlready2 package, there are some Python functions that burp with hyphenated KBpedia identifiers:

print(rc.Chemistry-Topic)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-5566a42d12b4> in <module>
----> 1 print(rc.Chemistry-Topic)

NameError: name 'rc' is not defined

Across class and property identifiers, there are thousands of such hyphenated IDs in KBpedia. We could retain and trap for these instances, but with provisions for current users of KBpedia, we have decided to conform more to the Python underscore style. That would shield other amateurs (such as me) from having to trap for it in any new Python code written to support cowpoke or its knowledge graphs.

Ensure the kko.superClassOf property is moved to an AnnotationProperty. When we want to use the concept of superclass as an object property, we can now use the built-in owlready2 superclass.
Remove OpenCyc href’s – part of KBpedia’s heritage comes from the open-source version of the Cyc ontology, including many initial concept definitions. OCyc distribution and support was ceased in 2017, though the ontology is still referenceable online. Given the receding usefulness of OCyc, we want to remove all of the internal URI references in definitions within KBpedia.
Remove duplicates – one nice aspect of the owlready2 engine is its identification of circular references, but gracefully proceeding with only a warning. Our new build routines have surfaced about ten of these circularities in KBpedia v 250. Two of these, Person ←→ HomoSapiens, and a second, God ←→ Diety, are intended design decisions by us as editors of KBpedia. The other instances, however, are unintended, and ones we want to resolve. We need to remove these.
Remove the SuperType concept and move it to an annotation property – besides being one of the duplicates (see (4) above), our adoption of Charles Sanders Peirce‘s universal categories in the KBpedia Knowledge Ontology (KKO) has supplanted the ‘SuperType’ nomenclature with Generals.
Correct domain and range assignments – our internal specifications had nearly complete domain and range assignments for version 2.50, but apparently during processing were not properly loaded. The fact they were not completely assigned in the public release was missed, and needs to be corrected.
Remove trailing spaces in the prefLabels for properties – the preferred labels for virtually all of the properties in version 250 had trailing spaces, which never were apparent in listings or user interfaces, but did become evident once the labels were parsed for roundtripping.

The latter four problems ((4), (5), (6), (7) ) were prior to cowpoke, having been issues in KBpedia v 250 at time of release. Processing steps and other different aspects of how the builds are handled in Python made these issues much more evident.

The Plan of Attack

Some of these issues make sense to address prior to others. In looking across these needed changes, here is what emerged as the logical plan of attack:

A. Make KKO changes first ((2), (4), and (5))

Since the build process always involves a pre-built KKO knowledge graph, it is the logical first focus if any changes involve it. Three of the seven issues do so, and efforts can not proceed until these are changed. With respect to (5), we will retain the idea of ‘SuperType’ as the root node of an typology, and designate the 80 or such KKO concepts that operate as such with an annotation property. To prevent confusion with Generals, we will also remove the SuperType concept.

B. Make bulk, flat-file changes ((1), (3), (6), (7))

This step in the plan confirms why it is important to have a design with roundtripping and the ability to make bulk changes to input files via spreadsheets. Mass changes involving many hundreds or thousands of records are not feasible with a manually edited knowledge graph. (I know, not necessarily common, but it does arise as this case shows.) It also makes it hard, if not close to impossible, to make substantial modifications or additions to an existing knowledge graph in order to tailor it for your own domain purposes, the reason why we began this CWPK series in the first place. Addressing the four (1), (3), (6), and (7) problem areas will take the longest share of time to create the new version.

One of these options (3) will require us to develop a new, separate routine (see below).

C. Propagate changes to other input files ((1), (2), (4))

With regard to replacing hyphens with underscores (1), this problem not only occurs when a property or class is declared, but all subsequent references to it. To make a global search-and-replace replacement of underscores for hyphens means all build files must be checked and processed. Any time changes are made to key input files (i.e., the ones of the struct variety), it is important to check appropriate other input files for consistency. We also need to maintain a mapping between the two ID forms so changed such that older URIs continue to point to the correct resources.

D. Re-build

Once all input files are modified and checked, we are ready to start the re-build.

General Processing Notes

The basic build process we are following is what was listed in the last installment, CWPK #47, applied in relation to our plan of attack.

I am recording notable observations from the pursuit of these steps. I am also logging time to provide a sense of overall set-up and processing times. There are, however, three areas that warrant separate discussion after this overall section.

As I progress through various steps, I tend to do two things. First, after a major step in the runs I bring up the interim build of KBpedia in Protégé and check to see if the assignments are being made properly. Depending on the nature of the step at-hand, I will look at different things. Second, especially in the early iterations of a build, I may backup my target ontology. Either I do this by stipulating a different output file in the routine, or create a physical file backup directly. Either way, I do this at these early phases to prevent having to go back to Square One with a particular build if the new build step proves a failure. With annotations, for example, revisions are added to what is already in the knowledge graph, as opposed to replacing the existing entries. This may not be the outcome you want.

The changes needed to KKO (A) above are straightforward to implement. We bring KKO into Protégé and make the changes. Only when the KKO baseline meets our requirements do we begin the formal build process.

The hyphen changes (1) were rather simple to do, but affected much in the four input files (two structural, two annotations for classes and properties). Though some identifiers had more than one hyphen, there were more than 7 K replacements for classes, and more than 13 K replacements for properties, for a total exceeding 20 K replacements across all build files (this amount will go up as we subsequently bring in the mappings to external sources as well; see next installment). I began with the structure files, since they have fewer fields and there were some open tasks on processing specific annotations.

This is a good example of a bulk move with a spreadsheet (see CWPK #36). Since there are fields such as alternative labels or definitions for which hyphens or dashes are fine, we do not want to do a global search-and-replace for underscores. Using the spreadsheet, the answer is to highlight the columns of interest (while using the menu-based search and replace) and only replace within the highlighted selection. If you make a mistake, Undo.

At the same time, I typically assign a name to the major block on the spreadsheet and then sort on various fields (columns) to check for things like open entries, strange characters (that often appear at the top or bottom of sorts), fields that improperly split in earlier steps (look for long ones), or other patterns to which your eye rapidly finds. If I EVER find an error, I try to fix it right then and there. It slows first iterations, but, over time, always fixing problems as discovered leads to cleaner and cleaner inputs.

Starting with the initial class backbone file (Generals_struct_out.csv) and routine (class_struct_builder), after getting the setting configurations set, I begin the run. It fails. This is actually to be expected, since it is an occasion worthy of celebration when a given large routine runs to completion without error on its first try!

On failures, one of the nice things about Python is a helpful ‘traceback’ on where the error occurred. Since we are processing tens of thousands of items at this class build point, we need to pinpoint in the code where the fail was occurring and add some print statements, especially ones that repeat to screen what items are currently going through the processing loop at the point of fail. Then, when you run again, you can see where in your input file the error likely occurs. Then, go back to the input file, and make the correction there.

Depending on the scope of your prior changes, these start-and-stop iterations of run-fail-inspect-correct may occur multiple times. You will eventually work your way through the input file if you are unlucky. But, you perhaps may not even notice any of this if you are lucky! (Of course, these matters are really not so much a matter of luck, since outcomes are improved by attention to detail.)

After a couple of iterations of minor corrections, first the classes and then the properties load properly with all sub- relationships intact. Pretty cool! I can smell the finish line.

In the shift to annotations, I basically wanted to load what had previously been tested and ingested without problems, and then concentrate on the new areas. The class annotation uploads went smoothly (only one hiccup for a mis-labeled resource). Great, so I can now take a quick detour to get rid of the superfluous links to OCyc (3) before facing the final step of bringing in the property annotations.

Another Cleaning Task

Before we can complete the third of our build steps involving the class_annot_builder function, we set for ourselves the removal of the open-source Cyc (OCyc) internal links in definitions. These all have the form of:

  \<\a href="http://sw.opencyc.org/concept/Mx4rvVjk55wpEbGdrcN5Y29ycA">IndependentCountry\<\/a>

My desire is to remove all of the href link markup, but leave the label text between the <\a\> tags. I know I can use regular expressions to recognize a sub-string like the above, but I am no more than a toddler when it comes to formulating regex. Like many other areas in Python, I begin a search for modules that may make this task a bit easier.

I soon discovered there are multiple approaches, and my quick diligence suggests either the beautifulsoup or bleach modules may be best suited. I make the task slightly more complicated by wanting to limit the removal to OCyc links only, and to leave all other href’s.

I chose beautifulsoup because it is a widely used and respected library for Web scraping and many data processing tasks. I also realized this was a one-off occasion, so while I did write a routine, I chose not to include it in the utils module. I also excised the ‘definitions’ column from our input files, made the changes to it, and then uploaded the changes. In this manner, I was able to sidestep some of the general file manipulation requirements that a more commonly used utility would demand. Here is the resulting code:

import csv
from bs4 import BeautifulSoup                                    # Part of the Anaconda distro

in_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/working/def_old.csv'
out_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/working/def_new.csv'

output = open(out_file, 'w+', encoding='utf8', newline='')
x = 0
with open(in_file, 'r', encoding='utf8') as f:
    reader = csv.reader(f)
    for row in reader:
        line = str(row)
        soup = BeautifulSoup(line)                               # You can feed bs4 with lines, docs, etc.
        tags = soup.select('a[href^="http://sw.opencyc.org/"]')  # The key for selecting out the OCyc stuff
        if tags != []:
            for item in tags:                                    # Some entries have no tags, others a few
                item.unwrap()                                    # The main method for getting the text within tags
                item_text = soup.get_text()                      # The text after tags removed
        else:
            item_text = line
        item_text = item_text.replace("['","")                   # A bunch of 'hacky' cleanup of the output
        item_text = item_text.replace("']","")
        item_text = item_text.replace('["', '')
        item_text = item_text.replace('"]', '')
        item_text = item_text.replace("', '", ",")
        print(item_text)
        print(item_text, file=output)
        x = x + 1
    print(x, 'total items processed.')
    output.close()
    print('Definition modifications are complete.')

Figuring out this routine took more time than I planned. Part of the reason is that the ‘definitions’ in KBpedia are the longest and most complicated strings, with many clauses and formatting and quoted sections. So I had quoting conflicts that caused some of the 58 K entries to skip or combine with other lines. I wanted to make sure the correspondence was kept accurate. Another issue was figuring out the exact beautifulsoup syntax for identifying the specific OCyc links (with variable internal references) and extracting out the internal text for the link.

Nonetheless, beautifulsoup is a powerful utility, and I am glad I spent some time learning how to get to first twitch with it.

Updates to Domain and Range

Since the earlier version (2.50) of KBpedia did not have proper loads of domain and range, once I re-established those specifications I foresaw that ingest of these fields might be a problem. The reasons for this supposition are the variety of data types that one might encounter, plus we were dealing with object and data properties, which have a bit more structure and stronger semantics, as well as annotations, which pose different issues in language checks and longer strings.

I was not surprised, then, when this step proved to be the most challenging of the update.

First, indeed, there were more domain and range options as this revised routine indicates (compare to the smaller version in CWPK #47:

### KEY CONFIG SETTINGS (see build_deck in config.py) ###                  
# 'kb_src'        : 'standard'                                        
# 'loop_list'     : file_dict.values(),                           # see 'in_file'
# 'loop'          : 'property_loop',
# 'in_file'       : 'C:/1-PythonProjects/kbpedia/v300/build_ins/properties/prop_annot_out.csv',
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/target/ontologies/kbpedia_reference_concepts.csv',

def prop_annot_build(**build_deck):
    print('Beginning KBpedia property annotation build . . .')
    xsd = kb.get_namespace('http://w3.org/2001/XMLSchema#')
    wgs84 = kb.get_namespace('http://www.opengis.net/def/crs/OGC/1.3/CRS84')    
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    out_file = build_deck.get('out_file')
    x = 1
    if loop is not 'property_loop':
        print("Needs to be a 'property_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval) 
        in_file = loopval
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'prefLabel', 'subPropertyOf', 'domain',  
                                   'range', 'functional', 'altLabel', 'definition', 'editorialNote'])                 
            for row in reader:
                r_id = row['id']                
                r_pref = row['prefLabel']
                r_dom = row['domain']
                r_rng = row['range']
                r_alt = row['altLabel']
                r_def = row['definition']
                r_note = row['editorialNote']
                r_id = r_id.replace('rc.', '')
                id = getattr(rc, r_id)
                if id == None:
                    continue
                if is_first_row:                                       
                    is_first_row = False
                    continue
                id.prefLabel.append(r_pref)
                i_dom = r_dom.split('||')
                if i_dom != ['']: 
                    for item in i_dom:                            # We need to accommodate different namespaces           
                        if 'kko.' in item:
                            item = item.replace('kko.', '')
                            item = getattr(kko, item)
                            id.domain.append(item) 
                        elif 'owl.' in item:
                            item = item.replace('owl.', '')
                            item = getattr(owl, item)
                            id.domain.append(item)
                        elif item == ['']:
                            continue    
                        elif item != '':
                            item = getattr(rc, item)
                            if item == None:
                                continue
                            else:
                                id.domain.append(item) 
                        else:
                            print('No domain assignment:', 'Item no:', x, item)
                            continue                             
                if 'owl.' in r_rng:                               # A tremendous number of range options   
                    r_rng = r_rng.replace('owl.', '')             # xsd datatypes are only partially supported
                    r_rng = getattr(owl, r_rng)
                    id.range.append(r_rng)
                elif 'string' in r_rng:    
                    id.range = [str]
                elif 'decimal' in r_rng:
                    id.range = [float]
                elif 'anyuri' in r_rng:
                    id.range = [normstr]
                elif 'boolean' in r_rng:    
                    id.range = [bool]
                elif 'datetime' in r_rng:    
                    id.range = [datetime.datetime]   
                elif 'date' in r_rng:    
                    id.range = [datetime.date]      
                elif 'time' in r_rng:    
                    id.range = [datetime.time] 
                elif 'wgs84.' in r_rng:
                    r_rng = r_rng.replace('wgs84.', '')
                    r_rng = getattr(wgs84, r_rng)
                    id.range.append(r_rng)        
                elif r_rng == ['']:
                    print('r_rng = empty:', r_rng)
                else:
                    print('r_rng = else:', r_rng, id)
#                    id.range.append(r_rng)
                i_alt = r_alt.split('||')    
                if i_alt != ['']: 
                    for item in i_alt:
                        id.altLabel.append(item)
                id.definition.append(r_def)        
                i_note = r_note.split('||')
                if i_note != ['']:   
                    for item in i_note:
                        id.editorialNote.append(item)
                x = x + 1        
    kb.save(out_file, format="rdfxml") 
    print('KBpedia property annotation build is complete.')

Second, a number of the range types — xsd.anyURI, xsd.hexBinary, and wgs84 — are not supported internally by owlready2, and there is no facility to add them directly to the system. I have made outreach to the responsive developer of owlready2, Jean-Baptiste Lamy, to see whether we can fill this gap before we go live with KBpedia v. 300. (Update: Within two weeks, Jean-Baptiste responded with a fix and new definition capabilities.) Meanwhile, there are relatively few instances of this gap, so we are in pretty good shape to move forward as is. Only a handful of resources are affected by these gaps, out of a total of 58 K.

URI Changes

The changing of an identifier for a knowledge graph resource is not encouraged. Most semantic technology advice is simply to pick permanent or persistent URIs. There is thus little discussion or guidance as to what is best practice when an individual resource ID does need to change. Our change from hyphens to underscores (1) is one such example of when an ID needs to change.

The best point of intervention is at the Web server, since our premise for knowledge graphs is Web-accessible information obtained via (generally) HTTP. While we could provide internal knowledge graph representations to capture the mapping between old and new URIs, an external request in the old form still needs to get a completion response for the new form. The best way to achieve that is via content negotiation by the server.

Under circumstances where prior versions of your knowledge graph were in broad use, the recommended approach would be to follow the guidelines of the W3C (the standards-setting body for semantic technologies) for how to publish a semantic Web vocabulary. This guidance is further supplemented with recipes for how to publish linked data under the rubric of ‘cool URIs‘. Following these guidances is much easier than updating URIs in place.

However, because of decisions yet to be documented to not implement linked data (see CWPK #60 when it is published in about three weeks), the approach we will be taking is much simpler. We will generate a mapping (correspondence) file between the older, retired URIs (the ones with the hyphens) with the new URIs (the ones with the underscores). We will announce this correspondence file at time of v 300 release, which we have earmarked to occur at the conclusion of this CWPK series. The responsibility for URI updates, if needed, will be placed on existing KBpedia users. This decision violates the recommended best practice of never changing URIs, but we deem it manageable based on our current user base and their willingness to make those modifications directly. Posting this correspondence file will be one of the last steps before KBpedia v 300 goes fully ‘live’.

So, we completed the full build, but kept a copy of the one-step-last-removed to return to if (when) we get a better processing of range.

The effort was greater than I anticipated. Actual processing time for a full re-build across all steps was about 90 min. There was perhaps another 8-12 hrs in working through developing the code and solving (or mostly so) the edge cases.

This is the first time I have done this re-build process with Python, but it is a process I have used and learned to improve for nearly a decade. I’m pretty pleased about the build process itself, but am absolutely thrilled with the learning that has taken place to give me tools at-hand. I’m feeling really positive about how this CWPK series is unfolding.

Part IV Conclusion

This brings to a close Part IV in our CWPK series. When I first laid out the plan for this series, I told myself that eventual public release of the series and its installments depended on being able to fully ’roundtrip’ KBpedia. I was somewhat confident setting out that this milestone could be achieved. Today, I know it to be so, and so now can begin the next steps of releasing the installments and their accompanying Jupyter Notebook pages. Successfully achieving the roundtrip milestone in this objective means we began to publish the start of our CWPK series on July 27, 2020. Woohoo!

In terms of the overall plan, we are about 2/3 of the way through the entire anticipated series. We next tackle the remaining steps in completing a full, public release of the knowledge graph. Then, we use the completed KBpedia v 300 to put the knowledge graph through its paces, doing some analysis, some graphing, and some machine learning. As of this moment in time, we have a target of 75 total installments in this Cooking with Python and KBpedia series, which we hope to wrap up by mid-November or so. Please keep with us for the journey!

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site. The cowpoke Python code listing covering the series is also available from GitHub.

NOTE: This CWPK installment is available both as an online interactive file

or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.

I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.

Posted:October 5, 2020

CWPK #47: Summary of the Extract-Build Roundtrip

Here is the Master Listing of Extraction and Build Steps

We are near the end of this major part in our Cooking with Python and KBpedia series in which we cover how to build KBpedia from a series of flat-text (CSV) input files. Though these CSV files may have been modified substantially offline (see, in part, CWPK #36), they are initially generated in an extraction loop, which we covered in CWPK #28-35. We have looked at these various steps in an incremental fashion, building up our code base function by function. This approach is perhaps good from a teaching perspective, but makes it kind of murky how all of the pieces fit together.

In this installment, I will list all of the steps — in sequence — for proceeding from the initial flat file extractions, to offline modifications of those files, and then the steps to build KBpedia again from the resulting new inputs. Since how all of these steps proceed depends critically on configuration settings prior to executing a given step, I also try to capture the main configuration settings appropriate to each step. The steps outlined here cover a full extract-build ‘roundtrip‘ cycle. In the next installment, we will address some of the considerations that go into doing incremental or partial extractions or builds.

Please note that the actual functions in our code modules may be modified slightly from what we presented in our interactive notebook files. These minor changes, when made, are needed to cover gaps or slight errors uncovered during full build and extraction sets. As an example, my initial passes of class structure extractions overlooked the kko.superClasses and rdfs.isDefinedyBy properties. Some issues in CSV extraction and build settings were also discovered that led to excess quoting of strings. The “official” code, then, is what is contained in the cowpoke modules, and not necessarily exactly what is in the notebook pages.

Therefore, of the many installments in this CWPK series, this present one is perhaps one of the most important for you to keep and reference. We will have occasion to summarize other steps in our series, but this installment is the most comprehensive view of the extract-and-build ’roundtrip’ cycle.

Summary of Extraction and Build Steps

Here are the basic steps in a complete roundtrip from extracting to building the knowledge graph anew:

Startup
Extraction

Structure Extraction of Classes
Structure Extraction of Properties
Annotation Extraction of Classes
Annotation Extraction of Properties
Extraction of Mappings

Offline Development and Manipulation
Clean and Test Build Input Files
Build

Build Class Structure
Build Property Structure
Build Class Annotations
Build Property Annotations
Ingest of Mappings

Test Build

The order of extraction and building of classes and properties must begin each phase because we need to have these resources adequately registered to the knowledge graph. Once done, however, there is no ordering requirement for whether mapping or annotation proceeds next. Since annotation changes are always likely in every new version or build, I have listed them before mapping, but that is only a matter of preference.

Each of these steps is described below, plus some key configuration settings as appropriate. We begin with our first step, startup:

1. Startup

from cowpoke.__main__ import *
from cowpoke.config import *

We will re-cap the entire breakdown and build process here. We first begin with structure extraction, first classes and then properties:

2. Extraction

The purpose of a full extraction is to retrieve all assertions in KBpedia aside from those in the upper (also called top-level) KBpedia Knowledge Ontology, or KKO.

A. Structure Extraction of Classes

We begin with the (mostly) hierarchical typologies and their linkage into KKO and with one another. Since all of the reference concepts in KBpedia are subsumed by the top-level category of Generals, we can specify it alone as a means to retrieve all of the RCs in KBpedia:

### KEY CONFIG SETTINGS (see extract_deck in config.py) ###
# 'krb_src'       : 'extract'                                          # Set in master_deck
# 'descent_type'  : 'descent',
# 'loop'          : 'class_loop',
# 'loop_list'     : custom_dict.values(),                              # Single 'Generals' specified       
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/extractions/classes/Generals_struct_out.csv',
# 'render'        : 'r_iri',

def struct2_extractor(**extract_deck):
    print('Beginning structure extraction . . .')
# 1 - render method goes here    
    r_default = ''
    r_label = ''
    r_iri = ''
    render = extract_deck.get('render')
    if render == 'r_default':
        set_render_func(default_render_func)
    elif render == 'r_label':
        set_render_func(render_using_label)
    elif render == 'r_iri':
        set_render_func(render_using_iri)
    else:
        print('You have assigned an incorrect render method--execution stopping.')
        return
# 2 - note about custom extractions
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    out_file = extract_deck.get('out_file')
    class_loop = extract_deck.get('class_loop')
    property_loop = extract_deck.get('property_loop')
    descent_type = extract_deck.get('descent_type')
    x = 1
    cur_list = []
    a_set = []
    s_set = []
    new_class = 'owl:Thing'
# 5 - what gets passed to 'output'
    with open(out_file, mode='w', encoding='utf8', newline='') as output:
        csv_out = csv.writer(output)
        if loop == 'class_loop':                                             
            header = ['id', 'subClassOf', 'parent']
            p_item = 'rdfs:subClassOf'
        else:
            header = ['id', 'subPropertyOf', 'parent']
            p_item = 'rdfs:subPropertyOf'
        csv_out.writerow(header)       
# 3 - what gets passed to 'loop_list' 
        for value in loop_list:
            print('   . . . processing', value)                                           
            root = eval(value)
# 4 - descendant or single here
            if descent_type == 'descent':
                a_set = root.descendants()
                a_set = set(a_set)
                s_set = a_set.union(s_set)
            elif descent_type == 'single':
                a_set = root
                s_set.append(a_set)
            else:
                print('You have assigned an incorrect descent method--execution stopping.')
                return                         
        print('   . . . processing consolidated set.')
        for s_item in s_set:
            o_set = s_item.is_a
            for o_item in o_set:
                row_out = (s_item,p_item,o_item)
                csv_out.writerow(row_out)
                if loop == 'class_loop':
                    if s_item not in cur_list:                
                        row_out = (s_item,p_item,new_class)
                        csv_out.writerow(row_out)
                cur_list.append(s_item)
                x = x + 1
    print('Total unique IDs written to file:', x)
    print('The structure extraction for the ', loop, 'is completed.')

struct2_extractor(**extract_deck)

B. Structure Extraction of Properties

See above with the following changes/notes:

### KEY CONFIG SETTINGS (see extract_deck in config.py) ###
# 'krb_src'       : 'extract'                                          # Set in master_deck
# 'descent_type'  : 'descent',
# 'loop'          : 'property_loop',
# 'loop_list'     : prop_dict.values(),
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/extractions/properties/prop_struct_out.csv',
# 'render'        : 'r_default',

C. Annotation Extraction of Classes

Annotations require a different method, though with a similar composition to the prior ones. It was during testing of the full extract-build roundtrip that I realized our initial class annotation extraction routine was missing for the rdfs.isDefinedBy and kko.superClassOf properties. The code in extract.py has been updated to reflect these changes.

Again, we first begin with classes. Note: by convention, I have shifted a couple structural:

### KEY CONFIG SETTINGS (see extract_deck in config.py) ###                
# 'krb_src'       : 'extract'                                          # Set in master_deck
# 'descent_type'  : 'descent',
# 'loop'          : 'class_loop',
# 'loop_list'     : custom_dict.values(),                              # Single 'Generals' specified 
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/extractions/classes/Generals_annot_out.csv',
# 'render'        : 'r_label',

def annot2_extractor(**extract_deck):
    print('Beginning annotation extraction . . .') 
    r_default = ''
    r_label = ''
    r_iri = ''
    render = extract_deck.get('render')
    if render == 'r_default':
        set_render_func(default_render_func)
    elif render == 'r_label':
        set_render_func(render_using_label)
    elif render == 'r_iri':
        set_render_func(render_using_iri)
    else:
        print('You have assigned an incorrect render method--execution stopping.')
        return    
    loop_list = extract_deck.get('loop_list')
    loop = extract_deck.get('loop')
    out_file = extract_deck.get('out_file')
    class_loop = extract_deck.get('class_loop')
    property_loop = extract_deck.get('property_loop')
    descent_type = extract_deck.get('descent_type')
    """ These are internal counters used in this module's methods """
    p_set = []
    a_ser = []
    x = 1
    cur_list = []
    with open(out_file, mode='w', encoding='utf8', newline='') as output:
        csv_out = csv.writer(output)                                       
        if loop == 'class_loop':                                             
            header = ['id', 'prefLabel', 'subClassOf', 'altLabel', 
                      'definition', 'editorialNote', 'isDefinedBy', 'superClassOf']
        else:
            header = ['id', 'prefLabel', 'subPropertyOf', 'domain', 'range', 
                      'functional', 'altLabel', 'definition', 'editorialNote']
        csv_out.writerow(header)    
        for value in loop_list:                                            
            print('   . . . processing', value)                                           
            root = eval(value) 
            if descent_type == 'descent':
                p_set = root.descendants()
            elif descent_type == 'single':
                a_set = root
                p_set.append(a_set)
            else:
                print('You have assigned an incorrect descent method--execution stopping.')
                return    
            for p_item in p_set:
                if p_item not in cur_list:                                 
                    a_pref = p_item.prefLabel
                    a_pref = str(a_pref)[1:-1].strip('"\'')                
                    a_sub = p_item.is_a
                    for a_id, a in enumerate(a_sub):                        
                        a_item = str(a)
                        if a_id > 0:
                            a_item = a_sub + '||' + str(a)
                        a_sub  = a_item
                    if loop == 'property_loop':   
                        a_item = ''
                        a_dom = p_item.domain
                        for a_id, a in enumerate(a_dom):
                            a_item = str(a)
                            if a_id > 0:
                                a_item = a_dom + '||' + str(a)
                            a_dom  = a_item    
                        a_dom = a_item
                        a_rng = p_item.range
                        a_rng = str(a_rng)[1:-1]
                        a_func = ''
                    a_item = ''
                    a_alt = p_item.altLabel
                    for a_id, a in enumerate(a_alt):
                        a_item = str(a)
                        if a_id > 0:
                            a_item = a_alt + '||' + str(a)
                        a_alt  = a_item    
                    a_alt = a_item
                    a_def = p_item.definition
                    a_def = str(a_def)[2:-2]
                    a_note = p_item.editorialNote
                    a_note = str(a_note)[1:-1]
                    if loop == 'class_loop':                                  
                        a_isby = p_item.isDefinedBy
                        a_isby = str(a_isby)[2:-2]
                        a_isby = a_isby + '/'
                        a_item = ''
                        a_super = p_item.superClassOf
                        for a_id, a in enumerate(a_super):
                            a_item = str(a)
                            if a_id > 0:
                                a_item = a_super + '||' + str(a)
                            a_super = a_item    
                        a_super  = a_item
                    if loop == 'class_loop':                                  
                        row_out = (p_item,a_pref,a_sub,a_alt,a_def,a_note,a_isby,a_super)
                    else:
                        row_out = (p_item,a_pref,a_sub,a_dom,a_rng,a_func,
                                   a_alt,a_def,a_note)
                    csv_out.writerow(row_out)                               
                    cur_list.append(p_item)
                    x = x + 1
    print('Total unique IDs written to file:', x)  
    print('The annotation extraction for the', loop, 'is completed.')

annot2_extractor(**extract_deck)

d=csv.get_dialect('excel')
print("Delimiter: ", d.delimiter)
print("Doublequote: ", d.doublequote)
print("Escapechar: ", d.escapechar)
print("lineterminator: ", repr(d.lineterminator))
print("quotechar: ", d.quotechar)
print("Quoting: ", d.quoting)
print("skipinitialspace: ", d.skipinitialspace)
print("strict: ", d.strict)

D. Annotation Extraction of Properties

See above with the following changes/notes:

### KEY CONFIG SETTINGS (see extract_deck in config.py) ###                
# 'krb_src'       : 'extract'                                          # Set in master_deck
# 'descent_type'  : 'descent',
# 'loop'          : 'property_loop',
# 'loop_list'     : prop_dict.values(),                              
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/extractions/properties/prop_annot_out.csv',
# 'render'        : 'r_default',

E. Extraction of Mappings

Mappings to external sources is an integral part of KBpedia, as is likely the case for any similar, large-scale knowledge graph. As such, extractions of existing mappings is also a logical step in the overall extraction process.

Though we will not address mappings until CWPK #49, those steps belong here in the overall set of procedures for the extract-build roundtrip process.

3. Offline Development and Manipulation

The above extraction steps can capture changes over time that have been made with an ontology editing tool such as Protégé. Once that knowledge graph is at a state of readiness after using Protégé, and more major changes are desired to your knowledge graph, it is sometimes easier to work with flat files in bulk. I discussed some of my own steps using spreadsheets in CWPK #36, and I will also walk through some refactorings using bulk files in our next installment, CWPK #48. That case study will help us see at least a few of the circumstances that warrant bulk refactoring. Major additions or changes to the typologies is also an occasion for such bulk activities.

At any rate, this step in the overall roundtripping process is where such modifications are made before rebuilding the knowledge graph anew.

4. Clean and Test Build Input Files

We covered these topics in CWPK #45. If you recall, cleaning and testing of input files occurs at this logical point, but we delayed discussing it in detail until we had covered the overall build process steps. This is why this sequence number for this installment appears a bit out of order.

5. Build

The start of the build cycle is to have all structure, annotation, and mapping files in proper shape and vetted for encoding and quality.

(Note: where ‘Generals’ is specified, keep the initial capitalization, since it is also generated as such from the extraction routines and is consistent with typology naming.)

A. Build Class Structure

We start with the knowledge graph classes and their subsumption relationships, as specified in one or more class structure CSV input files. In this case, we are doing a full build, so we begin with the KKO and RC stubs, plus run our Generals typology since it is inclusive:

### KEY CONFIG SETTINGS (see build_deck in config.py) ###             # Option 1: from Generals
# 'kb_src'        : 'start'                                           # Set in master_deck; only step with 'start'
# 'loop_list'     : custom_dict.values(),                             # Single 'Generals' specified 
# 'loop'          : 'class_loop',
# 'base'          : 'C:/1-PythonProjects/kbpedia/v300/build_ins/classes/',              
# 'ext'           : '_struct_out.csv',                                # Note change           
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.csv',

### KEY CONFIG SETTINGS (see build_deck in config.py) ###             # Option 2: from all typologies
# 'kb_src'        : 'start'                                           # Set in master_deck; only step with 'start'
# 'loop_list'     : typol_dict.values(),                               
# 'loop'          : 'class_loop',
# 'base'          : 'C:/1-PythonProjects/kbpedia/v300/build_ins/classes/',              
# 'ext'           : '.csv',                                           # Note change           
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.csv',

from cowpoke.build import *

def class2_struct_builder(**build_deck):                                  
    print('Beginning KBpedia class structure build . . .')               
    kko_list = typol_dict.values()                                      
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    base = build_deck.get('base')
    ext = build_deck.get('ext')
    out_file = build_deck.get('out_file')
    if loop is not 'class_loop':
        print("Needs to be a 'class_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval)                           
        frag = loopval.replace('kko.','')
        in_file = (base + frag + ext)
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])                 
            for row in reader:
                r_id = row['id'] 
                r_parent = row['parent']
                id = row_clean(r_id, iss='i_id')                         
                id_frag = row_clean(r_id, iss='i_id_frag')
                parent = row_clean(r_parent, iss='i_parent')
                parent_frag = row_clean(r_parent, iss='i_parent_frag')
                if is_first_row:                                       
                    is_first_row = False
                    continue      
                with rc:                                                
                    kko_id = None
                    kko_frag = None
                    if parent_frag == 'Thing':                                                        
                        if id in kko_list:                                
                            kko_id = id
                            kko_frag = id_frag
                        else:    
                            id = types.new_class(id_frag, (Thing,))       
                if kko_id != None:                                         
                    with kko:                                                
                        kko_id = types.new_class(kko_frag, (Thing,))  
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])
            for row in reader:                                                
                r_id = row['id'] 
                r_parent = row['parent']
                id = row_clean(r_id, iss='i_id')
                id_frag = row_clean(r_id, iss='i_id_frag')
                parent = row_clean(r_parent, iss='i_parent')
                parent_frag = row_clean(r_parent, iss='i_parent_frag')
                if is_first_row:
                    is_first_row = False
                    continue          
                with rc:
                    kko_id = None                                   
                    kko_frag = None
                    kko_parent = None
                    kko_parent_frag = None
                    if parent_frag is not 'Thing':
                        if id in kko_list:
                            continue
                        elif parent in kko_list:
                            kko_id = id
                            kko_frag = id_frag
                            kko_parent = parent
                            kko_parent_frag = parent_frag
                        else:   
                            var1 = getattr(rc, id_frag)               
                            var2 = getattr(rc, parent_frag)
                            if var2 == None:                            
                                continue
                            else:
                                print(var1, var2)
                                var1.is_a.append(var2)
                if kko_parent != None:                                         
                    with kko:                
                        if kko_id in kko_list:                               
                            continue
                        else:
                            var1 = getattr(rc, kko_frag)
                            var2 = getattr(kko, kko_parent_frag)                     
                            var1.is_a.append(var2)
        with open(in_file, 'r', encoding='utf8') as input:                
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])
            for row in reader:                                              
                r_id = row['id'] 
                r_parent = row['parent']
                id = row_clean(r_id, iss='i_id')
                id_frag = row_clean(r_id, iss='i_id_frag')
                parent = row_clean(r_parent, iss='i_parent')
                parent_frag = row_clean(r_parent, iss='i_parent_frag')
                if is_first_row:
                    is_first_row = False
                    continue
                if parent_frag == 'Thing':               
                    var1 = getattr(rc, id_frag)
                    var2 = getattr(owl, parent_frag)
                    try:
                        var1.is_a.remove(var2)
                    except Exception:
                        continue
    kb.save(out_file, format="rdfxml")      
    print('KBpedia class structure build is complete.')

class2_struct_builder(**build_deck)

B. Build Property Structure

After classes, when then add property structure to the system. Note, however, that we now switch to our normal ‘standard’ kb source:

### KEY CONFIG SETTINGS (see build_deck in config.py) ###             
# 'kb_src'        : 'standard'                                        # Set in master_deck
# 'loop_list'     : prop_dict.values(),                             
# 'loop'          : 'property_loop',
# 'base'          : 'C:/1-PythonProjects/kbpedia/v300/build_ins/properties/',              
# 'ext'           : '_struct_out.csv',                                         
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.csv',
# 'frag'          : set in code block; see below

def prop2_struct_builder(**build_deck):
    print('Beginning KBpedia property structure build . . .')
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    base = build_deck.get('base')
    ext = build_deck.get('ext')
    out_file = build_deck.get('out_file')
    if loop is not 'property_loop':
        print("Needs to be a 'property_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval)
        frag = 'prop'                                    
        in_file = (base + frag + ext)
        print(in_file)
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subPropertyOf', 'parent'])
            for row in reader:
                if is_first_row:
                    is_first_row = False                
                    continue
                r_id = row['id']
                r_parent = row['parent']
                value = r_parent.find('owl.')
                if value == 0:                                        
                    continue
                value = r_id.find('rc.')
                if value == 0:
                    id_frag = r_id.replace('rc.', '')
                    parent_frag = r_parent.replace('kko.', '')
                    var2 = getattr(kko, parent_frag)                 
                    with rc:                        
                        r_id = types.new_class(id_frag, (var2,))
    kb.save(out_file, format="rdfxml")
    print(kbpedia)
    print(out_file)
    print('KBpedia property structure build is complete.')

prop2_struct_builder(**build_deck)

C. Build Class Annotations

With the subsumption structure built, we next load our annotations, beginning with the class ones:

### KEY CONFIG SETTINGS (see build_deck in config.py) ###                  
# 'kb_src'        : 'standard'                                        
# 'loop_list'     : file_dict.values(),                           # see 'in_file'
# 'loop'          : 'class_loop',
# 'in_file'       : 'C:/1-PythonProjects/kbpedia/v300/build_ins/classes/Generals_annot_out.csv',
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/target/ontologies/kbpedia_reference_concepts.csv',

def class2_annot_build(**build_deck):
    print('Beginning KBpedia class annotation build . . .')
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    class_loop = build_deck.get('class_loop')
    out_file = build_deck.get('out_file')
    if loop is not 'class_loop':
        print("Needs to be a 'class_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval) 
        in_file = loopval
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'prefLabel', 'subClassOf', 
                                   'altLabel', 'definition', 'editorialNote', 'isDefinedBy', 'superClassOf'])                 
            for row in reader:
                r_id = row['id']
                id = getattr(rc, r_id)
                if id == None:
                    print(r_id)
                    continue
                r_pref = row['prefLabel']
                r_alt = row['altLabel']
                r_def = row['definition']
                r_note = row['editorialNote']
                r_isby = row['isDefinedBy']
                r_super = row['superClassOf']
                if is_first_row:                                       
                    is_first_row = False
                    continue      
                id.prefLabel.append(r_pref)
                i_alt = r_alt.split('||')
                if i_alt != ['']: 
                    for item in i_alt:
                        id.altLabel.append(item)
                id.definition.append(r_def)        
                i_note = r_note.split('||')
                if i_note != ['']:   
                    for item in i_note:
                        id.editorialNote.append(item)
                id.isDefinedBy.append(r_isby)
                i_super = r_super.split('||')
                if i_super != ['']:   
                    for item in i_super:
                        item = 'http://kbpedia.org/kko/rc/' + item
#                        Code block to be used if objectProperty; 5.5 hr load
#                        item = getattr(rc, item)
#                        if item == None:
#                            print('Failed assignment:', r_id, item)
#                            continue
#                        else:                                
                        id.superClassOf.append(item)
    kb.save(out_file, format="rdfxml") 
    print('KBpedia class annotation build is complete.')

class2_annot_build(**build_deck)

D. Build Property Annotations

And then the property annotations:

### KEY CONFIG SETTINGS (see build_deck in config.py) ###                  
# 'kb_src'        : 'standard'                                        
# 'loop_list'     : file_dict.values(),                           # see 'in_file'
# 'loop'          : 'property_loop',
# 'in_file'       : 'C:/1-PythonProjects/kbpedia/v300/build_ins/properties/prop_annot_out.csv',
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/target/ontologies/kbpedia_reference_concepts.csv',

def prop2_annot_build(**build_deck):
    print('Beginning KBpedia property annotation build . . .')
    xsd = kb.get_namespace('http://w3.org/2001/XMLSchema#')
    wgs84 = kb.get_namespace('http://www.opengis.net/def/crs/OGC/1.3/CRS84')    
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    out_file = build_deck.get('out_file')
    x = 1
    if loop is not 'property_loop':
        print("Needs to be a 'property_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval) 
        in_file = loopval
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'prefLabel', 'subPropertyOf', 'domain',  
                                   'range', 'functional', 'altLabel', 'definition', 'editorialNote'])                 
            for row in reader:
                r_id = row['id']                
                r_pref = row['prefLabel']
                r_dom = row['domain']
                r_rng = row['range']
                r_alt = row['altLabel']
                r_def = row['definition']
                r_note = row['editorialNote']
                r_id = r_id.replace('rc.', '')
                id = getattr(rc, r_id)
                if id == None:
                    continue
                if is_first_row:                                       
                    is_first_row = False
                    continue
                id.prefLabel.append(r_pref)
                i_dom = r_dom.split('||')
                if i_dom != ['']: 
                    for item in i_dom:
                        if 'kko.' in item:
                            item = item.replace('kko.', '')
                            item = getattr(kko, item)
                            id.domain.append(item) 
                        elif 'owl.' in item:
                            item = item.replace('owl.', '')
                            item = getattr(owl, item)
                            id.domain.append(item)
                        elif item == ['']:
                            continue    
                        elif item != '':
                            item = getattr(rc, item)
                            if item == None:
                                continue
                            else:
                                id.domain.append(item) 
                        else:
                            print('No domain assignment:', 'Item no:', x, item)
                            continue                             
                if 'owl.' in r_rng:
                    r_rng = r_rng.replace('owl.', '')
                    r_rng = getattr(owl, r_rng)
                    id.range.append(r_rng)
                elif 'string' in r_rng:    
                    id.range = [str]
                elif 'decimal' in r_rng:
                    id.range = [float]
                elif 'anyuri' in r_rng:
                    id.range = [normstr]
                elif 'boolean' in r_rng:    
                    id.range = [bool]
                elif 'datetime' in r_rng:    
                    id.range = [datetime.datetime]   
                elif 'date' in r_rng:    
                    id.range = [datetime.date]      
                elif 'time' in r_rng:    
                    id.range = [datetime.time] 
                elif 'wgs84.' in r_rng:
                    r_rng = r_rng.replace('wgs84.', '')
                    r_rng = getattr(wgs84, r_rng)
                    id.range.append(r_rng)        
                elif r_rng == ['']:
                    print('r_rng = empty:', r_rng)
                else:
                    print('r_rng = else:', r_rng, id)
#                    id.range.append(r_rng)
                i_alt = r_alt.split('||')    
                if i_alt != ['']: 
                    for item in i_alt:
                        id.altLabel.append(item)
                id.definition.append(r_def)        
                i_note = r_note.split('||')
                if i_note != ['']:   
                    for item in i_note:
                        id.editorialNote.append(item)
                x = x + 1        
    kb.save(out_file, format="rdfxml") 
    print('KBpedia property annotation build is complete.')

prop2_annot_build(**build_deck)

Beginning KBpedia property annotation build . . .
   . . . processing C:/1-PythonProjects/kbpedia/v300/build_ins/properties/prop_annot_out.csv
r_rng = else: xsd.anyURI rc.release_notes
r_rng = else: xsd.anyURI rc.schema_version
r_rng = else: xsd.anyURI rc.unit_code
r_rng = else: xsd.anyURI rc.property_id
r_rng = else: xsd.anyURI rc.ticket_token
r_rng = else: xsd.anyURI rc.role_name
r_rng = else: xsd.anyURI rc.feature_list
r_rng = else: xsd.hexBinary rc.associated_media
r_rng = else: xsd.hexBinary rc.encoding
r_rng = else: xsd.hexBinary rc.encodings
r_rng = else: xsd.hexBinary rc.photo
r_rng = else: xsd.hexBinary rc.photos
r_rng = else: xsd.hexBinary rc.primary_image_of_page
r_rng = else: xsd.hexBinary rc.thumbnail
r_rng = else: xsd.anyURI rc.code_repository
r_rng = else: xsd.anyURI rc.content_url
r_rng = else: xsd.anyURI rc.discussion_url
r_rng = else: xsd.anyURI rc.download_url
r_rng = else: xsd.anyURI rc.embed_url
r_rng = else: xsd.anyURI rc.install_url
r_rng = else: xsd.anyURI rc.map
r_rng = else: xsd.anyURI rc.maps
r_rng = else: xsd.anyURI rc.payment_url
r_rng = else: xsd.anyURI rc.reply_to_url
r_rng = else: xsd.anyURI rc.service_url
r_rng = else: xsd.anyURI rc.significant_link
r_rng = else: xsd.anyURI rc.significant_links
r_rng = else: xsd.anyURI rc.target_url
r_rng = else: xsd.anyURI rc.thumbnail_url
r_rng = else: xsd.anyURI rc.tracking_url
r_rng = else: xsd.anyURI rc.url
r_rng = else: xsd.anyURI rc.related_link
r_rng = else: xsd.anyURI rc.genre_schema
r_rng = else: xsd.anyURI rc.same_as
r_rng = else: xsd.anyURI rc.action_platform
r_rng = else: xsd.anyURI rc.fees_and_commissions_specification
r_rng = else: xsd.anyURI rc.requirements
r_rng = else: xsd.anyURI rc.software_requirements
r_rng = else: xsd.anyURI rc.storage_requirements
r_rng = else: xsd.anyURI rc.artform
r_rng = else: xsd.anyURI rc.artwork_surface
r_rng = else: xsd.anyURI rc.course_mode
r_rng = else: xsd.anyURI rc.encoding_format
r_rng = else: xsd.anyURI rc.file_format_schema
r_rng = else: xsd.anyURI rc.named_position
r_rng = else: xsd.anyURI rc.surface
r_rng = else: wgs84 rc.geo_midpoint
r_rng = else: xsd.anyURI rc.memory_requirements
r_rng = else: wgs84 rc.aerodrome_reference_point
r_rng = else: wgs84 rc.coordinate_location
r_rng = else: wgs84 rc.coordinates_of_easternmost_point
r_rng = else: wgs84 rc.coordinates_of_northernmost_point
r_rng = else: wgs84 rc.coordinates_of_southernmost_point
r_rng = else: wgs84 rc.coordinates_of_the_point_of_view
r_rng = else: wgs84 rc.coordinates_of_westernmost_point
r_rng = else: wgs84 rc.geo
r_rng = else: xsd.anyURI rc.additional_type
r_rng = else: xsd.anyURI rc.application_category
r_rng = else: xsd.anyURI rc.application_sub_category
r_rng = else: xsd.anyURI rc.art_medium
r_rng = else: xsd.anyURI rc.sport_schema
KBpedia property annotation build is complete.

E. Ingest of Mappings

Mappings to external sources are an integral part of KBpedia, as is likely the case for any similar, large-scale knowledge graph. As such, ingest of new or revised mappings is also a logical step in the overall build process, and occurs at this point in the sequence.

Though we will not address mappings until CWPK #49, those steps belong here in the overall set of procedures for the extract-build roundtrip process.

6. Test Build

We then conduct our series of logic tests (CWPK #43). This portion of the process may actually be the longest of all, given that it may take multiple iterations to pass all of these tests. However, in other circumstances, the build tests may also go quite quickly if relatively few changes were made between versions.

Wrap Up

Of course, these steps could be embedded in an overall ‘complete’ extract and build routine, but I have not done so.

Before we conclude this major part in our CWPK series, we next proceed to show how all of the steps may be combined to achieve a rather large re-factoring of all of KBpedia.

NOTE: This CWPK installment is available both as an online interactive file

Posted:October 1, 2020

CWPK #46: Creating the cowpoke Package and Unit Tests

Getting Serious about the Code and Deciding to Say Adios

To this point, we have accumulated a growing roster of methods for extracting and building KBpedia, and utilities to support those processes. With this growing maturity from our Cooking with Python and KBpedia series, it is time for us to put in place a formal testing regime for cowpoke and to take the steps to register it as a formal Python package.

The idea of unit testing is to assemble simple tests of single code functions that may be exercised whenever we deem changes in our code base warrants. These simple programs evaluate against a known results set to determine whether the routine still performs as expected. Unit tests are not a blanket approval of a method, but a way to ascertain whether certain key functions perform as expected. Unit testing is viewed by many as the foundation for integrated tests, the combination of which are one of the most important improvements in software development of the past 30 years.

As with so many other areas, there is a diversity of modules available to aid the testing process in Python. The unittest module is a part of Python’s standard library, and is our basis as well. But we will layer on to that a series of modules that will enable us to guide and develop our unit tests directly through the Spyder IDE.

I am most assuredly an amateur programmer. As I’ve stated before, I have never been paid a dime for writing a line of code. (And, now after more than halfway through this series, you can probably see why!) But since there is a widespread view that unit testing is a best practice, from Day One in my plan for this CWPK series I had slotted in one or two installments to learn and implement some unit tests. I began this particular installment with a high expectation, and indeed wrote most of this intro before I sat down to focus on learning and implementing tests. Yet I reached a conclusion quite contrary to my expectations. I’m writing this last sentence here just as I wrap up this investigation, with a slight taste of ashes that reminds me of our various experiences with the somewhat related area of agile programming. For my purposes and personality, there is just too much process, diversion, and paint-by-the-numbers to make unit testing a formal part of my workflow. I think I can see applications in large team development with mission-critical interdependencies, but my major realization is that I am already doing comprehensive, integrated testing. Unit tests are a diversion and a productivity loss, as I presently see them, in the case of knowledge graph roundtripping.

However, that being said, we still have the imperative to package up our CWPK code, which we have named cowpoke, as a standard Python package that we can readily make available through the common channels of GitHub and pip. We conclude this installment with our efforts in these areas, which now means you have complete and unfettered access to all of the code we have prepared to date through these CWPK installments.

Installing the Environment

To enable Spyder as our unittest interface, I began by installing a package extension specific to that task:

  conda install -c spyder-ide spyder-unittest

The unittest operations in Spyder also requires the pytest module, which is already part of my base installation, but we make sure anyway:

  conda install pytest

You will want to set up a folder under your project for ‘tests’ and to write your test files, often multiples, to this directory for the package. As you install, you may be asked to grant some permissions, and here is where you will configure to point to your project.

You should then logout and restart your computer, and return to your project to continue. The system will also install a separate .pytest_cache directory under your project.

I found, like Python packages in general, the install and addition of the testing modules to be smooth and easy. A new pane gets created (upper right by default) in Spyder, and test run options appear under the Spyder Run menu item.

Anatomy of a Unit Test

By definition, a unit test is limited to a single “unit” often used synonymously as a discrete function or algorithm. Ted Kaminsky nicely summarizes the standard guidance as to what constitutes a good unit test:

Tests should only test one thing
Each test should be independent and self-contained
Refactoring should not break tests
Try to achieve maximal coverage with tests.

A commitment to unit tests encourages more public methods and greater piecing apart of routines. The general form of a unit test looks like:

  fixtures
  def test_test
      setup
      assert
      test
      teardown

The pytest module uses ‘fixtures’ as a way to set up input templates of state or connectivity needed as inputs to the function. The unit test function is named, by convention, with a test_ prefix that informs the module a test is available. Though your production routines may favor shorter or more cryptic variable and function names, within the unit test environment best practice is to use longer and descriptive labels, since the tests and how they are being reported occur in a separate testing panel removed in both code and space from the subject routine.

Each test goes through an initial setup portion and then concludes with a teardown, where the temporary test structures are released when the test is done. The actual tests are done against assertions that have pre-determined ‘correct’ results, so that the test can evaluate to pass or fail. Multiple assertions may be evaluated in a given unit test, so more than one pass-fail may be returned. Like unit tests across tools and languages, results that pass are often shown in green on the screen, fails in red.

Determining Where Units Tests Are Applicable

I began my unit test efforts in earnest by first assembling an inventory of cowpoke‘s defined functions to date:

`extract.py`	`build.py`	`utils.py`
`annot_extractor struct_extractor typol_extractor`	`row_clean class_struct_builder prop_struct_builder class_annot_builder prop_annot_builder`	`dup_remover set_union set_difference set_intersection typol_intersects disjoint_status branch_orphan_check dup_parental_chain`

I then began to lay out my plan of attack on paper. When I research such matters I note sources that seem to have good code examples and I will mark them for later consultation, but my initial investigations are spent more on finding clear coding approaches and constructs and generalities or patterns for how to set up things. One of the first observations is that all of my roundtripping routines involved quite a bit of I/O and configuration. I was therefore looking especially for guidance around the idea of ‘fixtures‘ or ‘parameters’ with pytest. A second observation is that most of my utils.py routines are used infrequently, sometimes no more frequently than once every build or three. These were not heavily used routines.

Most of the unit test examples I came across were toy cases, such as adding or multiplying a couple of numbers or concatenating some strings. I tried to focus my investigations on use of CSV files, since that is such a central construct in our knowledge graph approach. I started to see hints that perhaps unit tests are not a good idea for file and I/O purposes. A quote from the user Dunes on StackOverflow seemed to best capture the sense I was gaining from my research: “Unit tests that access the file system are generally not a good idea. This is because the test should be self contained, by making your test data external to the test it’s no longer immediately obvious which test the csv file belongs to or even if it’s still in use.”

Hmmm. I could see that, good idea or not, what I was going to have to do to set up my tests and get them “mocked” up for all of the I/O and data staging I would need was not a trivial matter. It was also perhaps the case that my general roundtripping routines, with their many steps and loops, were already too complex for unit testing. It was beginning to dawn on me that to design my unit tests properly, I would need to further piece apart my existing routines into more atomic functions. Wow, I really did not like that idea, since it would kick me all of the way back to Square One and force me to re-factor all of my code to date. And I had been making such great progress!

I could see that unit testing was not going to be some minor ‘adder’ to improve best practices, but more akin to a whole change in philosophy and approach. It was at minimum looking that I would need to double the size of my code base, learn a bunch of whole new stuff needed by the test machinery, change my design and architecture, and for examples of isolated functions that told me nothing about application-wide behavior and seemed to only test what I already knew to be true. Ouch! This unit test stuff was not looking to be a good deal.

Calling Time Out and Testing Premises

We had similar realizations about the use of agile development in the past. While we are a boutique development shop that tends to work on smaller, bespoke projects, we have also been subcontractors on much larger teams with enterprise-scale budgets and project management. It is sometimes exciting, often lucrative, and too frequently exasperating to work on big, multi-team projects. We understand the discipline needed for larger-scale projects and can see the merit (if lightly applied) of agile approaches. But too often agile is just another way to kill innovation and productivity through too many meetings and process.

I had taken as a given that unit testing was an unalloyed good. But, here I was, barely hours into a concerted investigation, and I was seeing serious red flags. Because I had initially not questioned the premise, I had not specifically looked into criticisms or critics of unit testing. The truth is, I had just taken it all as a given and had not inspected my testing assumptions. I believe in my bones in the merit of tested and vetted information products, but perhaps unit testing was not a way to go in our circumstance. What was indeed best and true here?

So, I shifted my investigations from ‘how to do’ to ‘whether to do’ and discovered more criticism and naysayers than I had imagined. Some of this criticism was now a dozen or more years old. Some of the criticism is empirical, some philosophical or nuanced.

There is apparently a steep learning curve to master unit testing and making it an integral part of the development process. My initial investigations had flagged that prospect in spades. Unit testing sets up its own incentive objective, which can be a good thing, but if not done with the right balance or awareness, can result in mindless code proliferation or developing to the incentive. More public and smaller methods result, that are hard to maintain over time:

Figure 1: Declining Usefulness of Unit Tests (from W. Platz, “The Eroding Agile Test Pyramid”, Feb 20, 2109)

Integrated testing can also be made more difficult due to the code fragmentation.

Respected innovators like Donald Knuth have called unit testing “a waste of time.” Past enthusiasts like David Heinemeier Hansson, the developer of Ruby on Rails, now argue that integrated testing is the proper focus. Kaminski, noted above, has also been critical. There have been many others critical of the approach.

A couple of articles by James Coplien on Why Most Unit Testing is a Waste and its seque in 2014 were lightning rods on the topic. There is a more profane approach to the question, but still thoughtful. Even commercial proponents propose additional steps and tools to improve the unit testing experience and results. There appears to be some growing realization that there are boundaries to unit testing and the need for better definitions of where unit tests may be essential or relevant.

Framing Testing in a Different Light

This more open-minded investigation of the question of unit testing has changed my perspective. My impression is that there is a place and likely best practices and methods for doing unit testing. However, an excessive insistence on unit testing may actually be counter-productive by distorting incentives and leading to code proliferation and fragmentation. Paradoxically, this may make the code base harder to maintain and make it more difficult to discover integrated or system issues. One area that concerns me is in RESTful or Web-based distributed development where APIs and interfaces are prominent, but hard to mock up. The lack of examples useful to my needs is another concern.

More fundamentally, this exercise has caused me to think of testing in a new light. I remain convinced that testing and reliability are paramount, but that has meaning only in relation to the ultimate deliverables or purposes, not the individual pieceparts. The objective is the purpose of the software, not unit testing per se.

A roundtripping objective, my governing purpose, is, actually, a system test of the highest order. We need to be able to break down and manipulate a knowledge artifact, re-build it again, and be able to inspect and use it in process-heavy external environments. Being able to load and inspect and apply logic tests in a totally different Protégé environment is a demanding system test for whether our code base has been accurate in the entire cycle of transformations. I’m already doing loads of testing, and relevant, too. My realization was that the entire basis of my CWPK series was to create an artifact, test it for coherence, modify it, and then test it for coherence again. Such roundtripping is indeed a demanding task.

I am glad I began with the premise of instituting some unit tests in the cowpoke project. It has caused me to think more clearly about why test in the first place, and that achieving end goals should take precedence over adhering to any particular method or process. There is no end to the learning, is there?

The conclusion about the immediate objective was to put unit testing off to the side. If I can completely break down and then re-build a knowledge graph, there is no shame in not doing unit testing.

Setting Up the GitHub Repository

We have already created the basic directory structure for a Python package, as first outlined in CWPK #33 onward. It is now time to formalize this structure, create a GitHub repository, and add additional packaging requirements suitable for listing cowpoke for pip distribution.

Here are the steps I undertook:

Went to the directory where the cowpoke code is stored under my local Python projects
Using Git, created a new repository at this location
Committed all existing Python files in that directory to the new repository
Added the additional files needed for pip as detailed in the next section
Created an empty cowpoke repository under our main branch (Cognonto) in GitHub
Using TortoiseGit under my local file system, ‘pushed’ the local Git repository to GitHub.

It is important that the directory created under GitHub be completely empty. This means at time of creation that I did NOT add a README.md Markdown file. That file is created under the next set of steps and is ‘pushed’ to this new directory.

Upon completion of the next steps, I then ‘pushed’ my local files to GitHub. I did so by picking TortoiseGit when in the root of my local cowpoke directory, and then I entered the HTTPS link for the empty directory on GitHub as the remote URI location. That link is found under the green ‘Code’ button at the upper right of the GitHub cowpoke directory. For reference, this link is:

https://github.com/Cognonto/cowpoke.git

I will speak more about the use of GitHub at the conclusion of this CWPK series. The bottom-line trick I have discovered, however, is to make sure local or remote is ‘clean’ prior to cloning from the other, and then to ‘pull’ changes from the destination repository before ‘pushing’ from the source one.

Download cowpoke

From your standpoint as a user, you can obtain the cowpoke code from GitHub by essentially reversing this process. The steps you should follow are:

If using Windows, make sure TortoiseGit is installed on your local machine. Search for instructions on the Web if you do not have this application installed
Go to the cowpoke GitHub location indicated above
Create a new cowpoke directory under your Python packages wherever you have them stored locally (should be under xxx/main-python-directory/Lib/site-packages
Create a new Git repository at that same location; leave blank
‘Pull’ the repository from GitHub using the cowpoke GitHub location indicated above as your remote specification.

Creating the cowpoke Package

It is not necessary to have a pip package for cowpoke, since it is possible (if you have the GitPython package installed) to obtain the code directly from GitHub:

pip install gitpython

import git
git.Git("/xxx/main-python-directory/Lib/site-packages").clone("git://github.com/Cognonto/cowpoke.git")

However, it is easier to treat cowpoke as a standard Python package, and we created one and did so by following guidelines for the PyPi installer package (pip).

First, I did a test installation at test.pypi.org using this step-by-step guide. There are a few required files that each package must contain, including notably:

setup.py                   # definitions of the package and dependencies
LICENSE                    # the license for the package
README.md                  # the readme description file
code files

All of these requirements and the steps to follow are outlined in the guide.

Windows is a little tricky. I had a hard time using the Apache 2 license, so fell back to the MIT one. Also, the acceptance of tokens, as suggested by the guide, proved problematic, possibly due to lack of a $HOME directory on my Windows machine. I used my straight login and password names for the test site instead, and that worked fine. One must also have the setup.py working just right, or the test will fail with an error. (You can run python setup.py -install to check your pip packages locally.) Also, the instructions kept insisting I use ‘python3‘, but my local configuration sets Python directly to version 3, so the numeral was not causing Python to run properly; using the simple python did the trick for my environment.

Nonetheless, after making these changes, I was able to successfully complete the test install.

This test exercise means the package file structure is now suitable for the actual formal package upload. There is a separate guide for the formal site. Note that the formal package registry has a separate site (https://pypi.org/) with its own login and password than the test site. Per the test site instructions, I had already installed the twine install assistance package. So, after logging into the PyPI support, we begin the upload process with:

python -m twine upload dist/*

I am then prompted for my PyPI login and password. The material is then uploaded with progress bars, and upon acceptance we get a message about where to find our new cowpoke package:

https://pypi.org/project/cowpoke/

Now, it is important to know that one can not update this information without incrementing the version number. So, it is essential that the input information be accurate and complete, which means the test upload is a very important step.

Going forward, it is now possible for you to install cowpoke directly into your Python project by using:

pip install cowpoke

Lastly, please notice I have updated the first notice banner at the conclusion of these installments to indicate where to find the cowpoke Python code.

Additional Documentation

Here are some sources on the general question of testing and unit testing in Python:

Some guidelines for testing, especially the lead-in
A tutorial on PyUnit (unittest)
Understanding Unit Testing
The unittest plugin for Spyder
How to run and debug unittest in Spyder.

Here are some sources on how to create a repository on GitHub and create a pip package:

NOTE: This CWPK installment is available both as an online interactive file

Posted:September 30, 2020

CWPK #45: Cleaning and File Pre-checks

Out of Sequence, But Reducing ‘Garbage’ Always Makes Sense

We have noted in previous installments in this Cooking with Python and KBpedia series how important consistent UTF-8 encoding is to roundtripping with our files. One of the ways we can enforce this importance is to consistently read and write files with UTF-8 specified, as discussed in CWPK #31. But, what if we have obtained external information? How can we ensure it is in the proper encoding or has wrong character assignments fixed? If we are going to perform such checks, what other consistency tests might we want to include? In this installment, we add some pre-build routines to test and clean our files for proper ingest.

As I noted in CWPK #39, cleaning comes before the build steps in the actual build process. But we wanted to have an understanding of broader information flows throughout the build or use scenarios before formulating the cleaning routines. That is both because they are not always operationally applied, and because working out the build steps was aided by not having to carry around extra routines. Now that we have the ingest and build steps fairly well outlined, it is an easier matter to see where and how cleaning steps best fit into this flow.

At the outset, we know we want to work with clean files when building KBpedia. Do we want such checks to run in every build, or optionally? Do we want to run checks against single files or against entire directories or projects? Further, are we not likely to want to add more checks over time as our experience with the build process and problems encountered increase? Lastly, we can see down the road (CWPK #48) to where we also only want to make incremental changes to an existing knowledge graph, as opposed to building one from scratch or de novo. How might that affect cleaning requirements or placement of methods?

Design Considerations

In thinking about these questions, we decided to take this general approach to testing and vetting clean files:

Once vetted, files will remain clean (insofar as the tests run) until next edited. It may not make sense to check all files automatically at the beginning of a build. This point suggests we should have a separate set of cleaning routines from the overall build process. We may later want to include that into an overall complete build routine, but we can do so later as part of a make file approach rather than including cleaning as a mandatory part of all builds.
Once we have assembled our files for a new build, we should assume that all files are unvetted. As build iterations proceed, we only need to vet those files that have been modified. When initially testing a new build, it probably makes sense for us to be able to loop over all of the input files in a given directory (corresponding to most of the subdirectories under kbpedia > version > build; see prior CWPK #37 installment). These points suggest we want the option to configure our clean routines for either all files in a subdirectory or a list of files. To keep configuration complexity lower, we will stipulate that if a list of files is used, they should all be in the same subdirectory.
Our biggest cleaning concern is that we have clean, UTF-8 text (encodings) in all of our input files. However, if we need to run this single test, we ought to test for other consistency concerns, as well. Here are the additional tests that look useful in our initial module development:
- Have new fields (columns) been added to our CSV files?
- Are our input files missing already defined fields?
- Are we missing required fields (prefLabel and definition)?
- Are our fields properly constructed (CamelCase with initial cap for classes, initial lowercase for properties, and URI encoding for IRIs)?
If we do have encoding issues, and given the manual effort required to fix them, can we include some form of encoding ‘fix’ routine? It turns out there is a Python package for such a routine, that we will test in this installment and include if deemed useful.

These considerations are what have guided the design of the cowpoke clean module. Also, as we noted in CWPK #9, our design is limited to Python 3x. Python 2 has not been accommodated in cowpoke.

A Brief Detour for URIs

KBpedia is a knowledge graph based on semantic technologies and which incorporates seven major public and online knowledge bases: Wikipedia, Wikidata, DBpedia, schema.org, GeoNames, UNSPSC, and OpenCyc. A common aspect of all of these sources is that reference to information is a Web string that ‘identifies’ the item at hand that, when clicked, also takes us to the source of that item. In the early days of the Web this identifier mostly pertained to Web pages and was known as a Universal Resource Locator, or URL. They were the underlined blue links of the Web’s early days.

But, there are other protocols for discovering resources on the Internet beside the Web protocols of HTTP and HTTPS. There is Gopher, FTP, email, and others. Also, as information began to proliferate from Web pages to data items within databases and these other sources, the idea of a ‘locator’ was generalized to include ‘identifiers’ when it is a data item and not a page. This generalization is known as a URI, or if a ‘name’ within other schema or protocols, known as a URN. Here, for example, is the URI address of the English Wikipedia main page:

  https://en.wikipedia.org/wiki/Main_Page

Note that white space is not allowed in this string, and is replaced with underscores in this example.

The allowed characters that could be used in constructing one of these addresses were limited to mostly ASCII characters, with some characters like the forward-slash (‘/’) forbidden because they are a defined constructor of an address. If one wanted to include non-allowed characters in a URI address, it needed to be percent encoded. Here, for example, is the English Wikipedia address for its article on the Côte d’Azur Observatory:

  https://en.wikipedia.org/wiki/C%C3%B4te_d%27Azur_Observatory

This format is clearly hard to read. Most Web browsers, for example, decode these strings when you look at the address within the browser, so it appears as this:

  https://en.wikipedia.org/wiki/Côte_d'Azur_Observatory

And, in fact, if you submit the string as exactly shown above, encoders at Wikipedia would accept this input string.

The Internationalized Resource Identifier (IRI was proposed and then adopted on the Web as a way of bringing in a wider range of acceptable characters useful to international languages. Mostly what we see in browsers today is the IRI version of these addresses, even if not initially formulated as such.

Sources like Wikipedia and Wikidata restrict their addresses to URIs. A source like DBpedia, on the other hand, supports IRIs. Wikipedia also has a discussion on how to fix these links.

The challenge in these different address formats is that if encoding gets screwed up, IRI versions of addresses can also get screwed up. That might be marginally acceptable when we are encoding something like a definition or comment (an annotation), but absolutely breaks the data record if it occurs to that record’s identifying address: Any change or alteration of the exact characters in the address means we can no longer access that data item.

Non-percent encoded Wikipedia addresses and DBpedia addresses are two problem areas. We also have tried to limit KBpedia’s identifiers to the ASCII version of these international characters. For example, the KBpedia item for Côte-d’Or shows as the address:

  http://kbpedia.org/kko/rc/CoteDOr

We still have a readable label, but one with encoding traps removed.

I provide this detour to highlight that we also need to give special attention in our clean module to how Web addresses are coming in to the system and being treated. We obviously want to maintain the original addresses as supplied by the respective external sources. We also want to test and make sure these have not been improperly encoded. And we also want to test that our canonical subset of characters used in KBpedia is being uniformly applied to our own internal addresses.

Encoding Issues and ftfy

Despite it being design point #4 above, let’s first tackle the question of whether encoding fixes may be employed. I move it up the list because it is also the best way to illustrate why encoding issues are at the top of our concerns. First, let’s look at 20 selected records from KBpedia annotations that contain a diversity of language and symbol encodings.

Getting the files: The three mentioned files below are part of the the formal cowpoke release, which does not come until CWPK #46. For now, you can obtain these mentioned files from https://github.com/Cognonto/CWPK/tree/master/sandbox/builds/working.

These three files are part of the cowpoke distribution. This first file is the starting set of 20 selected records (remember Run or shift+enter to run the cell):

with open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\annotations_orig.csv', 'r', encoding='utf8') as f:
    print(f.read())

However, here is that same file when directly imported into Excel and then saved (notice we had to change the encoding to get the file to load in Python):

with open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\annotations_excel.csv', 'r', encoding='cp1252') as f:
    print(f.read())

Wow, did that file ever get screwed up! (You will obviously need to change the file locations to match your local configuration.) In fact, there are ways to open CSV files properly in Excel by first firing up the application and then using the File → Open dialogs, but the form above occurs in English MS Excel when you open the file directly, make a couple of changes, and then save. If you do not have a backup, you would be in a world of hurt.

So, how might we fix this file, or can we? The first thing to attempt is to load the file with the Python encoding set to UTF-8. Indeed, in many cases, that is sufficient to restore the proper character displays. One thing that is impressive in the migration to Python 3.6 and later is tremendously more forgiving behavior around UTF-8. That is apparently because of the uniform application now of UTF-8 across Python, plus encoding tests that occur earlier when opening files than occurred with prior versions of Python.

But in instances where this does not work, the next alternative is to use ftfy (fixes text for you). The first thing we need to do is to import the module, which is already part of our conda distribution (see CWPK #9):

import ftfy

Then, we can apply ftfy methods (of which there are many useful ones!) to see if we can resurrect that encoding-corrupted file from Excel:

import io

with io.open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\annotations_excel.csv', encoding='utf-8', mode='r', errors='ignore',) as f:
    lines = f.readlines()
    print(lines)
    fixed_lines = [ftfy.fix_text(line) for line in lines]
    print(fixed_lines)
# so you may inspect the results, but we will also write it to file:
    with io.open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\annotations_fixed.csv', encoding='utf-8', mode='w',) as out:
        print(fixed_lines, file=out)

I have to say this is pretty darn impressive! We have recovered nearly all of the original formats. Now, it is the case there are some stoppers in the file, which is why we needed to incorporate the more flexible io method of opening the file to be able to ignore the errors. Each of the glitches that occur in the file still need to be manually fixed. But, we can also use the ‘replace’ as a different argument to ‘error’ to insert some known characters to more quickly find these glitches. Overall, this is a much reduced level of effort to fix the file than without ftfy. We have moved from a potentially catastrophic situation to one that is an irritant to fix. That is progress!

Just to confirm (and for which one could do file compares to see specific differences to also help in the manual corrections), here is our now ‘fixed’ output file:

with open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\annotations_fixed.csv', 'r', encoding='utf-8') as f:
    print(f.read())

We can also inspect our files as to what encoding we think it has. Again, we use an added package, chardet in this case, to test any suspect file. Here is the general form:

import chardet

with open(r'C:\1-PythonProjects\kbpedia\v300\builds\working\annotations_fixed.csv', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

Note that one of the arguments is to pass the first 10,000 characters to the method as the basis for estimating the encoding type. Since the routine is quick, there is really no reason to lower this amount, and higher does not seem to provide any better statistics.

Again, a gratifying aspect of the improvements to Python since version 3.6 or so has been a more uniform approach to UTF-8. We also see we have some tools at our disposal, namely ftfy, that can help us dig out of holes that prior encoding mistakes may have dug. In our early years when encoding mismatches were more frequent, we also developed a Clojure routine for fixing bad characters (or at least converting them to a more readable form). It is likely this routine is no longer needed with Python’s improved handling of UTF-8. However, if this is a problem for your own input files, you can import the unicodedata module for the Python standard library to convert accented (diacritic) characters to ones based on ASCII. Here is the basic form of that procedure:

import unicodedata

def remove_diacrits(input_str):
    input_str = unicodedata.normalize('NFD', input_str).encode('ascii', 'ignore')\
           .decode('utf-8')
    return str(input_str)

s = remove_diacrits("Protégé")

print(s)

Protege

You can embed that routine in a CSV read that also deals with entire rows at a time, similar to some of the other procedures noted here.

However, the best advice, as we have reiterated, is to make sure that files are written and opened in UTF-8. But, it is good to know if we encounter encoding issues in the wild, that both Python and some of its great packages stand ready to help rectify matters (or at least partially so, with less pain). We have also seen how encoding problems can often be a source of garbage input data.

Flat File Checks

Though Python routines could be written for the next points below, they may be easier to deal with directly in a spreadsheet. This is OK, since we are also at that point in our roundtripping where we are dealing directly with CSV files anyway.

To work directly with the sheet, highlight the file’s entire set of rows and columns that are intended for eventual ingest during a build. Give that block a logic name in the upper left text box entry directly above the sheet, such as ‘Match’ or ‘Big’. You can continue to invoke that block name to re-highlight your subject block. From there, are can readily sort for the specific input column of interest in order to inspect the entire row of values.

Here is my checklist for such flat file inspection:

Does any item in the ‘id’ column lack a URI fragment identifier? If so, provide using the class and property URI naming conventions in KBpedia (CamelCase in both instances, upper initial case for classes, lower initial case for properties, with only alphanumerics and underscore as allowable characters). Before adding a new ‘id’, make sure it is initially specified in one of the class or property struct input files
Does any item in the ‘prefLabel’ column lack a preferred label? If so, add one; this field is mandatory
Does any item in the ‘definition’ column lack an entry? If so, add one. Though this field is not mandatory, it is highly encouraged
Check a few rows. Does any column entry have leading or trailing white spaces? If so, use the spreadsheet TRIM function
Check a few rows. Do any of the files with a ‘definition’ column show its full text spread over more than one cell? If so, you have an upstream CSV processing issue that is splitting entries at the common or some other character that should be escaped. The best fix, if intermediate processing has not occurred, is to re-extract the file with correct CSV settings. If not, you may need to concatenate multiple cells in a row in order to re-construct the full string
URI fragment identifier? If so, provide using the class and property URI naming conventions in KBpedia (CamelCase in both instances, upper initial case for classes, lower initial case for properties, with only alphanumerics and underscore as allowable characters). Before adding a new ‘id’, make sure it is initially specified in one of the class or property struct input files
Check entries for wrong or misspecified namespaces or prefixes. Make sure fragments end with the appropriate characters (‘#’ or ‘/’ if used in a URI construction)
Check columns where multiple entries may reside using the double-pipe (‘||’) convention, and ensure these decomposable strings are being constructed properly.

One of the reasons I am resistant to a complete build routine cascading through all of these steps at once is that problems in intermediate processing files propagate through all subsequent steps. That not only screws up much stuff, but it is harder to trace where the problem first arose. This is an instance where I prefer a ‘semi-automatic’ approach, with editorial inspection required between essential build steps.

Other Cleaning Routines

Fortunately, in our case, we are extracting fairly simple CSV files (though often with some long text entries for definitions) and ingesting in basically the same format. As long as we are attentive to how we modify the intermediate flat files, there is not too much further room for error.

However, there are many sources of external data that may eventually warrant incorporation in some manner into your knowledge graph. These external sources may pose a larger set of cleaning and wrangling challenges. Date and time formats, for example, can be particularly challenging.

Hadley Wickham, the noted R programmer and developer of many fine graphics programs, wrote a paper, Tidy Data, that is an excellent starting primer on wrangling flat files. In the case of our KBpedia knowledge graph and its supporting CSV, about the only guideline that he proposes that we consciously violate is to combine many-to-one data items sometimes in a single column (notable for altLabels, but a few others as well). According to Wickham, we should put each individual value on its own row. I have not done so to keep the listings more compact and the row count manageable. Nonetheless, his general guidance is excellent. Another useful guide is Wrangling Messy CSV Files by Detecting Row and TypePatterns.

There are also many additional packages in Python that may assist in dealing with ‘dirty’ input files. Depending on the specific problems you may encounter, some quick Web searches should turn up some useful avenues to pursue.

Lastly, in both our utils.py and other modules going forward, we will have occasion to develop some bespoke cleaning and formatting routines as our particular topic warrants.

Additional Documentation

Here is some additional documentation related to todays CWPK installment:

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.

NOTE: This CWPK installment is available both as an online interactive file

Posted:September 29, 2020

CWPK #44: Annotation Ingest

More Fields, But Less Complexity

We now tackle the ingest of annotations for classes and properties in this installment of the Cooking with Python and KBpedia series. In prior installments we built the structural aspects of KBpedia. We now add the labels, definitions, and other assignments to them.

As with the extraction routines, we will split these efforts into class annotations and then property annotations. Our actual load routines are fairly straightforward, and we have no real logic concerns in how these annotations get added. The most complex wrinkle we will need to address are those annotation fields, altLabels and notes in particular, where we have potentially many assignments for a single reference concept (RC) or property. Like we saw with the extraction routines, for these items we will need to set up additional internal loops to segregate and assign the items for loading based on our standard double-pipe (‘||’) delimiter.

The two functions we develop in this installment, class_annot_builder and prop_annot_builder will be added to the build.py module.

Start-up

Since we are in an active part of the build cycle, we want to continue with our main knowledge graph in-progress for our load routine, so please make sure that kb_src is set to ‘standard’ in your config.py configuration. We then invoke our standard start-up:

from cowpoke.__main__ import *
from cowpoke.config import *

Loading Class Annotations

Class annotations consist of potentially the item’s prefLabel, altLabels, definition, and editorialNote. The first item is mandatory, the next two should be provided to adhere to best practices. The last is optional. There are, of course, other standard annotations possible. Should your own conventions require or encourage them, you will likely need to modify the procedure below to account for that fact.

As with these methods before, we provide a header showing ‘typical’ configuration settings (in config.py), and then proceed with a method that loops through all of the rows in the input file. Here is the basic class annotation build procedure. There are no new wrinkles in this routine from what has been seen previously:

### KEY CONFIG SETTINGS (see build_deck in config.py) ###                  
# 'kb_src'        : 'standard'                                        
# 'loop_list'     : file_dict.values(),                           # see 'in_file'
# 'loop'          : 'class_loop',
# 'in_file'       : 'C:/1-PythonProjects/kbpedia/v300/build_ins/classes/Generals_annot_out.csv',
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/target/ontologies/kbpedia_reference_concepts_test.csv',


def class_annot_build(**build_deck):
    print('Beginning KBpedia class annotation build . . .')
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    class_loop = build_deck.get('class_loop')
#    r_id = ''
#    r_pref = ''
#    r_def = ''
#    r_alt = ''
#    r_note = ''
    if loop is not 'class_loop':
        print("Needs to be a 'class_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval) 
        in_file = loopval
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=[C])                 
            for row in reader:
                r_id_frag = row['id']
                id = getattr(rc, r_id_frag)
                if id == None:
                    print(r_id_frag)
                    continue
                r_pref = row['prefLabel']
                r_alt = row['altLabel']
                r_def = row['definition']
                r_note = row['editorialNote']
                if is_first_row:                                       
                    is_first_row = False
                    continue      
                id.prefLabel.append(r_pref)
                id.definition.append(r_def)
                i_alt = r_alt.split('||')
                if i_alt != ['']: 
                    for item in i_alt:
                        id.altLabel.append(item)
                i_note = r_note.split('||')
                if i_note != ['']:   
                    for item in i_note:
                        id.editorialNote.append(item)
    print('KBpedia class annotation build is complete.')

class_annot_build(**build_deck)

kb.save(file=r'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts_test.owl', format='rdfxml')

BTW, when we commit this method to our build.py module, we will add the save routine at the end.

Loading Property Annotations

We now turn our attention to annotations of properties:

### KEY CONFIG SETTINGS (see build_deck in config.py) ###                  
# 'kb_src'        : 'standard'                                        
# 'loop_list'     : prop_dict.values(),                           # see 'in_file'
# 'loop'          : 'class_loop',
# 'in_file'       : 'C:/1-PythonProjects/kbpedia/v300/build_ins/properties/prop_annot_out.csv',
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/target/ontologies/kbpedia_reference_concepts_test.csv',

def prop_annot_build(**build_deck):
    print('Beginning KBpedia property annotation build . . .')
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    out_file = build_deck.get('out_file')
    if loop is not 'property_loop':
        print("Needs to be a 'property_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval) 
        in_file = loopval
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'prefLabel', 'subPropertyOf', 'domain',  
                                   'range', 'functional', 'altLabel', 'definition', 'editorialNote'])                 
            for row in reader:
                r_id = row['id']                
                r_pref = row['prefLabel']
                r_dom = row['domain']
                r_rng = row['range']
                r_alt = row['altLabel']
                r_def = row['definition']
                r_note = row['editorialNote']
                r_id = r_id.replace('rc.', '')
                id = getattr(rc, r_id)
                if id == None:
                    print(r_id)
                    continue
                if is_first_row:                                       
                    is_first_row = False
                    continue
                id.prefLabel.append(r_pref)
                i_dom = r_dom.split('||')
                if i_dom != ['']: 
                    for item in i_dom:
                        id.domain.append(item)
                if 'owl.' in r_rng:
                    r_rng = r_rng.replace('owl.', '')
                    r_rng = getattr(owl, r_rng)
                    id.range.append(r_rng)
                elif r_rng == ['']:
                    continue
                else:
#                    id.range.append(r_rng)
                i_alt = r_alt.split('||')    
                if i_alt != ['']: 
                    for item in i_alt:
                        id.altLabel.append(item)
                id.definition.append(r_def)        
                i_note = r_note.split('||')
                if i_note != ['']:   
                    for item in i_note:
                        id.editorialNote.append(item)
    print('KBpedia property annotation build is complete.')

prop_annot_build(**build_deck)

Hmmm. One of the things we notice in this routine is that our domain and range assignments have not been adequately picked up in our earlier KBpedia version 2.50 build routines (the ones undertaken in Clojure before this CWPK series). As a result, we can not adequately test range and will need to address this oversight before our series is over.

As before, we will add our ‘save’ routine as well when we commit the method to the build.py module.

kb.save(file=r'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts_test.owl', format='rdfxml')

We now have all of the building blocks to create our extract-build roundtrip. We summarize the formal steps and configuration settings in CWPK #47. But, first, we need to return to cleaning our input files and instituting some unit tests.

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.

NOTE: This CWPK installment is available both as an online interactive file

Main Links

Search

Bringing Home the Lessons to Date with KBpedia v 3.00

Summary of the Problem Issues

The Plan of Attack

General Processing Notes

Another Cleaning Task

Updates to Domain and Range

URI Changes

Part IV Conclusion

Here is the Master Listing of Extraction and Build Steps

Summary of Extraction and Build Steps

1. Startup

2. Extraction

A. Structure Extraction of Classes

B. Structure Extraction of Properties

C. Annotation Extraction of Classes

D. Annotation Extraction of Properties

E. Extraction of Mappings

3. Offline Development and Manipulation

4. Clean and Test Build Input Files

5. Build

A. Build Class Structure

B. Build Property Structure

C. Build Class Annotations

D. Build Property Annotations

E. Ingest of Mappings

6. Test Build

Wrap Up

Getting Serious about the Code and Deciding to Say Adios

Installing the Environment

Anatomy of a Unit Test

Determining Where Units Tests Are Applicable

Calling Time Out and Testing Premises

Framing Testing in a Different Light

Setting Up the GitHub Repository

Download cowpoke

Creating the cowpoke Package

Additional Documentation

Out of Sequence, But Reducing ‘Garbage’ Always Makes Sense

Design Considerations

A Brief Detour for URIs

Encoding Issues and ftfy

Flat File Checks

Other Cleaning Routines

Additional Documentation

More Fields, But Less Complexity

Start-up

Loading Class Annotations

Loading Property Annotations