Bringing Home the Lessons to Date with KBpedia v 3.00
Today’s installment in our Cooking with Python and KBpedia series is a doozy. Not only are we wrapping up perhaps the most important part of our series — building KBpedia from scratch — but we are also applying the full roundtrip software in our cowpoke Python package to a major re-factoring of KBpedia itself. This re-factoring will lead to the next release of KBpedia v. 3.00.
This re-factoring and new release were NOT part of the original plan for this CWPK series. Today’s efforts are the result of issues we have discovered in the current version 2.50 of KBpedia, the version with which we began this series. The very process we have gone through in developing the cowpoke software to date has surfaced these problems. The problems have perhaps been part of KBpedia for some time, but our prior build routines were such that these issues were not apparent. By virtue of different steps and different purposes, we have now seen these things, and now have the extract and build procedures to address them.
It turns out the seven or so problems so identified provide a ‘perfect’ (in the sense of ‘storm’) case study for why a roundtrip capability makes sense and how it may be applied. Without further ado, let’s begin.
Summary of the Problem Issues
The cowpoke Python package as we have used it to date has surfaced seven types of issues with KBpedia v 2.50, the basis with which we started this CWPK series. Our starting build files for this series are ones extracted from the current public v 2.50 version. About half of the issues are in the KBpedia knowledge graph, but had remained hidden given the nuances of our prior Clojure build routines. The other half of the issues relate to our new use of Python and owlready2.
These seven issues, with some background explanation, are:
1. Remove hyphens – in our prior build routines with Clojure, that language has a style that favors dashes (or hyphens) when conjoining words in a label identifier. Python is not hyphen-friendly. While we have not seen issues when working directly with the owlready2 package, there are some Python functions that burp with hyphenated KBpedia identifiers:

```
NameError                                 Traceback (most recent call last)
<ipython-input-1-5566a42d12b4> in <module>
----> 1 print(rc.Chemistry-Topic)

NameError: name 'rc' is not defined
```

2. Move kko.superClassOf to an annotation property – the kko.superClassOf property is moved to an AnnotationProperty. When we want to use the concept of superclass as an object property, we can now use the built-in owlready2 superclass.

3. Remove OpenCyc href’s – part of KBpedia’s heritage comes from the open-source version of the Cyc ontology (OCyc), including many initial concept definitions. OCyc distribution and support ceased in 2017, though the ontology is still referenceable online. Given the receding usefulness of OCyc, we want to remove all of the internal URI references in definitions within KBpedia.

4. Remove duplicates – one nice aspect of the owlready2 engine is its identification of circular references, while gracefully proceeding with only a warning. Our new build routines have surfaced about ten of these circularities in KBpedia v 2.50. Two of these, HomoSapiens and Diety, are intended design decisions by us as editors of KBpedia. The other instances, however, are unintended, and ones we want to resolve. We need to remove these.

5. Remove the SuperType concept and move it to an annotation property – besides being one of the duplicates (see (4) above), our adoption of Charles Sanders Peirce’s universal categories in the KBpedia Knowledge Ontology (KKO) has supplanted the ‘SuperType’ nomenclature with Generals.

6. Complete domain and range assignments – our internal specifications had nearly complete domain and range assignments for version 2.50, but apparently they were not properly loaded during processing. The fact they were not completely assigned in the public release was missed, and needs to be corrected.

7. Remove trailing spaces in the prefLabels for properties – the preferred labels for virtually all of the properties in version 2.50 had trailing spaces, which never were apparent in listings or user interfaces, but did become evident once the labels were parsed for roundtripping.
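To see why hyphens trip up Python in issue (1), here is a minimal sketch using a plain object as a stand-in for the rc namespace (owlready2 is not needed for the illustration): an expression like rc.Chemistry-Topic parses as subtraction, so hyphenated names are only reachable indirectly via getattr().

```python
# Minimal sketch of the hyphen problem: a plain object stands in for the
# 'rc' namespace (owlready2 is not needed for the illustration).
class Namespace:
    pass

rc = Namespace()
setattr(rc, 'Chemistry-Topic', 'a hyphenated class')   # legal as a stored attribute name
setattr(rc, 'Chemistry_Topic', 'an underscore class')

# rc.Chemistry-Topic would parse as (rc.Chemistry - Topic) and raise an error,
# so the hyphenated attribute is only reachable indirectly:
print(getattr(rc, 'Chemistry-Topic'))

# The underscore form supports normal dotted access:
print(rc.Chemistry_Topic)
```

This is why the underscore form is the practical choice for identifiers that will be used in dotted attribute access.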
The latter four problems ((4), (5), (6), (7)) predate cowpoke, having been issues in KBpedia v 2.50 at the time of release. Processing steps and other differences in how the builds are handled in Python made these issues much more evident.
The Plan of Attack
Some of these issues make sense to address prior to others. In looking across these needed changes, here is what emerged as the logical plan of attack:
A. Make KKO changes first ((2), (4), and (5))
Since the build process always involves a pre-built KKO knowledge graph, it is the logical first focus if any changes involve it. Three of the seven issues do so, and efforts cannot proceed until these are changed. With respect to (5), we will retain the idea of ‘SuperType’ as the root node of a typology, and designate the 80 or so KKO concepts that operate as such with an annotation property. To prevent confusion with Generals, we will also remove the SuperType concept.
B. Make bulk, flat-file changes ((1), (3), (6), (7))
This step in the plan confirms why it is important to have a design with roundtripping and the ability to make bulk changes to input files via spreadsheets. Mass changes involving many hundreds or thousands of records are not feasible with a manually edited knowledge graph. (I know, such changes are not necessarily common, but they do arise, as this case shows.) Manual editing also makes it hard, if not close to impossible, to make substantial modifications or additions to an existing knowledge graph in order to tailor it for your own domain purposes, the reason why we began this CWPK series in the first place. Addressing the four problem areas (1), (3), (6), and (7) will take the longest share of time to create the new version.
One of these issues (3) will require us to develop a new, separate routine (see below).
C. Propagate changes to other input files ((1), (2), (4))
With regard to replacing hyphens with underscores (1), this problem occurs not only when a property or class is declared, but in all subsequent references to it. To make a global search-and-replace of underscores for hyphens means all build files must be checked and processed. Any time changes are made to key input files (i.e., the ones of the struct variety), it is important to check the other appropriate input files for consistency. We also need to maintain a mapping between the old and new ID forms so that older URIs continue to point to the correct resources.
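One way to generate such a correspondence is sketched below. This is a hypothetical illustration: the base URI, identifiers, and file name are assumptions, and in practice the changed identifiers would be harvested from the build files themselves.

```python
import csv

# Hypothetical sketch of generating the old-to-new URI correspondence file.
# The base URI, identifiers, and file name are illustrative assumptions.
base = 'http://kbpedia.org/kko/rc/'
changed_ids = ['Chemistry-Topic', 'Science-Topic']   # in practice, harvested from the build files

with open('uri_mapping.csv', 'w', encoding='utf8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['old_uri', 'new_uri'])
    for old_id in changed_ids:
        # The new URI simply swaps hyphens for underscores in the fragment
        writer.writerow([base + old_id, base + old_id.replace('-', '_')])
```

A simple two-column CSV like this is easy to consume later, whether by a Web server redirect rule or by downstream users updating their own data.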
Once all input files are modified and checked, we are ready to start the re-build.
General Processing Notes
The basic build process we are following is what was listed in the last installment, CWPK #47, applied in relation to our plan of attack.
I am recording notable observations from the pursuit of these steps. I am also logging time to provide a sense of overall set-up and processing times. There are, however, three areas that warrant separate discussion after this overall section.
As I progress through various steps, I tend to do two things. First, after a major step in the runs I bring up the interim build of KBpedia in Protégé and check to see if the assignments are being made properly. Depending on the nature of the step at hand, I will look at different things. Second, especially in the early iterations of a build, I may back up my target ontology. I do this either by stipulating a different output file in the routine or by creating a physical file backup directly. Either way, I do this at these early phases to prevent having to go back to Square One with a particular build if the new build step proves a failure. With annotations, for example, revisions are added to what is already in the knowledge graph, as opposed to replacing the existing entries. This may not be the outcome you want.
The changes needed to KKO (A) above are straightforward to implement. We bring KKO into Protégé and make the changes. Only when the KKO baseline meets our requirements do we begin the formal build process.
The hyphen changes (1) were rather simple to do, but affected much in the four input files (two structural files, and two annotation files for classes and properties). Though some identifiers had more than one hyphen, there were more than 7 K replacements for classes and more than 13 K replacements for properties, for a total exceeding 20 K replacements across all build files (this amount will go up as we subsequently bring in the mappings to external sources as well; see the next installment). I began with the structure files, since they have fewer fields and there were some open tasks on processing specific annotations.
This is a good example of a bulk move with a spreadsheet (see CWPK #36). Since there are fields such as alternative labels or definitions for which hyphens or dashes are fine, we do not want to do a global search-and-replace for underscores. Using the spreadsheet, the answer is to highlight the columns of interest (while using the menu-based search and replace) and only replace within the highlighted selection. If you make a mistake, Undo.
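The same selective replacement can also be scripted. Here is a minimal sketch using Python’s csv module; the column names are illustrative, not the actual build-file headers. Only the identifier columns get the hyphen-to-underscore swap, while free-text fields such as definitions are left alone.

```python
import csv
import io

# Scripted equivalent of the selective spreadsheet replace: swap hyphens for
# underscores only in identifier columns, leaving free-text fields untouched.
# The column names here are illustrative, not the actual build-file headers.
id_columns = {'id', 'subClassOf'}

src = io.StringIO(
    'id,subClassOf,definition\n'
    'Chemistry-Topic,Science-Topic,"A chem-related topic."\n'
)
rows = []
for row in csv.DictReader(src):
    for col in id_columns:
        row[col] = row[col].replace('-', '_')
    rows.append(row)

print(rows[0]['id'])           # identifier now uses underscores
print(rows[0]['definition'])   # hyphen in free text preserved
```

The point is the same as highlighting columns in the spreadsheet: scope the replacement to where it is safe.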
At the same time, I typically assign a name to the major block on the spreadsheet and then sort on various fields (columns) to check for things like open entries, strange characters (which often appear at the top or bottom of sorts), fields that improperly split in earlier steps (look for long ones), or other patterns that your eye rapidly finds. If I EVER find an error, I try to fix it right then and there. It slows first iterations, but, over time, always fixing problems as discovered leads to cleaner and cleaner inputs.
Starting with the initial class backbone file (Generals_struct_out.csv) and routine (class_struct_builder), after getting the configuration settings in place, I begin the run. It fails. This is actually to be expected, since it is an occasion worthy of celebration when a given large routine runs to completion without error on its first try!
On failures, one of the nice things about Python is its helpful ‘traceback’ showing where the error occurred. Since we are processing tens of thousands of items at this class build point, we need to pinpoint in the code where the failure occurred and add some print statements, especially ones that echo to screen the items currently going through the processing loop at the point of failure. Then, when you run again, you can see where in your input file the error likely occurs. Then, go back to the input file and make the correction there.
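This pinpointing pattern can be sketched as follows, with process_row as a hypothetical stand-in for whatever build step is being run. The echo ties a failure back to a specific input row so the correction can be made in the source file.

```python
import csv
import io

# Sketch of the fail-pinpointing pattern: echo the current item so a caught
# error (or traceback) can be tied back to a specific input row.
# 'process_row' is a hypothetical stand-in for the actual build step.
def process_row(row):
    if row['id'] == '':
        raise ValueError('empty identifier')
    return row['id'].replace('-', '_')

src = io.StringIO('id,label\nChemistry-Topic,ok\n,missing id\nGood-Row,ok\n')
failed = []
for i, row in enumerate(csv.DictReader(src), start=1):
    try:
        process_row(row)
    except Exception as e:
        print('Row', i, 'failed:', dict(row), '->', e)  # pinpoints the bad input line
        failed.append(i)

print(failed)   # row numbers needing correction in the input file
```

In a real run you would wrap (or instrument) the loop body of the build routine the same way, rather than a toy function.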
Depending on the scope of your prior changes, these start-and-stop iterations of run-fail-inspect-correct may occur multiple times. You will eventually work your way through the input file if you are unlucky. But, you perhaps may not even notice any of this if you are lucky! (Of course, these matters are really not so much a matter of luck, since outcomes are improved by attention to detail.)
After a couple of iterations of minor corrections, first the classes and then the properties load properly with all sub-class and sub-property relationships intact. Pretty cool! I can smell the finish line.
In the shift to annotations, I basically wanted to load what had previously been tested and ingested without problems, and then concentrate on the new areas. The class annotation uploads went smoothly (only one hiccup for a mis-labeled resource). Great, so I can now take a quick detour to get rid of the superfluous links to OCyc (3) before facing the final step of bringing in the property annotations.
Another Cleaning Task
Before we can complete the third of our build steps involving the class_annot_builder function, we set for ourselves the removal of the open-source Cyc (OCyc) internal links in definitions. These all take the form of an anchor tag whose href begins with http://sw.opencyc.org/. My desire is to remove all of the href link markup, but leave the label text between the <a> tags. I know I can use regular expressions to recognize a sub-string like this, but I am no more than a toddler when it comes to formulating regex. Like many other areas in Python, I begin a search for modules that may make this task a bit easier.
I soon discovered there are multiple approaches, and my quick diligence suggests either the beautifulsoup or bleach modules may be best suited. I make the task slightly more complicated by wanting to limit the removal to OCyc links only, and to leave all other href’s. I chose beautifulsoup because it is a widely used and respected library for Web scraping and many data processing tasks. I also realized this was a one-off occasion, so while I did write a routine, I chose not to include it in the utils module. I also excised the ‘definitions’ column from our input files, made the changes to it, and then uploaded the changes. In this manner, I was able to sidestep some of the general file manipulation requirements that a more commonly used utility would demand. Here is the resulting code:
```python
import csv
from bs4 import BeautifulSoup                 # Part of the Anaconda distro

in_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/working/def_old.csv'
out_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/working/def_new.csv'
output = open(out_file, 'w+', encoding='utf8', newline='')
x = 0
with open(in_file, 'r', encoding='utf8') as f:
    reader = csv.reader(f)
    for row in reader:
        line = str(row)
        soup = BeautifulSoup(line)            # You can feed bs4 with lines, docs, etc.
        tags = soup.select('a[href^="http://sw.opencyc.org/"]')  # The key for selecting out the OCyc stuff
        if tags != []:                        # Some entries have no tags, others a few
            for item in tags:
                item.unwrap()                 # The main method for getting the text within tags
            item_text = soup.get_text()       # The text after tags removed
        else:
            item_text = line
        item_text = item_text.replace("['", "")   # A bunch of 'hacky' cleanup of the output
        item_text = item_text.replace("']", "")
        item_text = item_text.replace('["', '')
        item_text = item_text.replace('"]', '')
        item_text = item_text.replace("', '", ",")
        print(item_text)
        print(item_text, file=output)
        x = x + 1
print(x, 'total items processed.')
output.close()
print('Definition modifications are complete.')
```
Figuring out this routine took more time than I planned. Part of the reason is that the ‘definitions’ in KBpedia are the longest and most complicated strings, with many clauses, formatting, and quoted sections. So I had quoting conflicts that caused some of the 58 K entries to skip or combine with other lines. I wanted to make sure the correspondence was kept accurate. Another issue was figuring out the exact beautifulsoup syntax for identifying the specific OCyc links (with variable internal references) and extracting out the internal text for the link. beautifulsoup is a powerful utility, and I am glad I spent some time learning how to get to first twitch with it.
Updates to Domain and Range
Since the earlier version (2.50) of KBpedia did not have proper loads of domain and range, once I re-established those specifications I foresaw that ingest of these fields might be a problem. The reasons for this supposition are the variety of data types one might encounter; plus, we were dealing with object and data properties, which have a bit more structure and stronger semantics, as well as annotations, which pose different issues in language checks and longer strings.
I was not surprised, then, when this step proved to be the most challenging of the update.
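The heart of the ingest challenge is mapping the xsd range strings in the build files onto the Python types that owlready2 accepts as property ranges. Here is a hedged, table-driven sketch of that idea; the names and the exact key set are illustrative (the real routine also has to handle owl, anyURI, and wgs84 cases), and note that substring checks must test ‘datetime’ before ‘date’ and ‘time’.

```python
import datetime

# Hedged sketch of mapping xsd range strings onto the Python types that
# owlready2 accepts as property ranges. Illustrative only; the real build
# routine also handles owl, anyURI, and wgs84 cases.
XSD_TO_PYTHON = {
    'string':   str,
    'decimal':  float,
    'boolean':  bool,
    'datetime': datetime.datetime,   # must be checked before 'date' and 'time'
    'date':     datetime.date,
    'time':     datetime.time,
}

def python_range(r_rng):
    """Return the Python type for an xsd range string, or None if unmapped."""
    r = r_rng.lower()
    for key, py_type in XSD_TO_PYTHON.items():   # dict preserves insertion order
        if key in r:
            return py_type
    return None

print(python_range('xsd:dateTime'))
```

A table like this keeps the type decisions in one place, rather than scattering them through a long elif chain.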
First, indeed, there were more domain and range options, as this revised routine indicates (compare to the smaller version in CWPK #47):
```python
### KEY CONFIG SETTINGS (see build_deck in config.py) ###
# 'kb_src'    : 'standard'
# 'loop_list' : file_dict.values(),                   # see 'in_file'
# 'loop'      : 'property_loop',
# 'in_file'   : 'C:/1-PythonProjects/kbpedia/v300/build_ins/properties/prop_annot_out.csv',
# 'out_file'  : 'C:/1-PythonProjects/kbpedia/v300/target/ontologies/kbpedia_reference_concepts.csv',

def prop_annot_build(**build_deck):
    print('Beginning KBpedia property annotation build . . .')
    xsd = kb.get_namespace('http://w3.org/2001/XMLSchema#')
    wgs84 = kb.get_namespace('http://www.opengis.net/def/crs/OGC/1.3/CRS84')
    loop_list = build_deck.get('loop_list')
    loop = build_deck.get('loop')
    out_file = build_deck.get('out_file')
    x = 1
    if loop is not 'property_loop':
        print("Needs to be a 'property_loop'; returning program.")
        return
    for loopval in loop_list:
        print('   . . . processing', loopval)
        in_file = loopval
        with open(in_file, 'r', encoding='utf8') as input:
            is_first_row = True
            reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'prefLabel', 'subPropertyOf',
                     'domain', 'range', 'functional', 'altLabel', 'definition', 'editorialNote'])
            for row in reader:
                r_id = row['id']
                r_pref = row['prefLabel']
                r_dom = row['domain']
                r_rng = row['range']
                r_alt = row['altLabel']
                r_def = row['definition']
                r_note = row['editorialNote']
                r_id = r_id.replace('rc.', '')
                id = getattr(rc, r_id)
                if id == None:
                    continue
                if is_first_row:
                    is_first_row = False
                    continue
                id.prefLabel.append(r_pref)
                i_dom = r_dom.split('||')
                if i_dom != ['']:
                    for item in i_dom:                 # We need to accommodate different namespaces
                        if 'kko.' in item:
                            item = item.replace('kko.', '')
                            item = getattr(kko, item)
                            id.domain.append(item)
                        elif 'owl.' in item:
                            item = item.replace('owl.', '')
                            item = getattr(owl, item)
                            id.domain.append(item)
                        elif item == ['']:
                            continue
                        elif item != '':
                            item = getattr(rc, item)
                            if item == None:
                                continue
                            else:
                                id.domain.append(item)
                        else:
                            print('No domain assignment:', 'Item no:', x, item)
                            continue
                if 'owl.' in r_rng:                    # A tremendous number of range options
                    r_rng = r_rng.replace('owl.', '')  # xsd datatypes are only partially supported
                    r_rng = getattr(owl, r_rng)
                    id.range.append(r_rng)
                elif 'string' in r_rng:
                    id.range = [str]
                elif 'decimal' in r_rng:
                    id.range = [float]
                elif 'anyuri' in r_rng:
                    id.range = [normstr]
                elif 'boolean' in r_rng:
                    id.range = [bool]
                elif 'datetime' in r_rng:
                    id.range = [datetime.datetime]
                elif 'date' in r_rng:
                    id.range = [datetime.date]
                elif 'time' in r_rng:
                    id.range = [datetime.time]
                elif 'wgs84.' in r_rng:
                    r_rng = r_rng.replace('wgs84.', '')
                    r_rng = getattr(wgs84, r_rng)
                    id.range.append(r_rng)
                elif r_rng == ['']:
                    print('r_rng = empty:', r_rng)
                else:
                    print('r_rng = else:', r_rng, id)
#                    id.range.append(r_rng)
                i_alt = r_alt.split('||')
                if i_alt != ['']:
                    for item in i_alt:
                        id.altLabel.append(item)
                id.definition.append(r_def)
                i_note = r_note.split('||')
                if i_note != ['']:
                    for item in i_note:
                        id.editorialNote.append(item)
                x = x + 1
    kb.save(out_file, format="rdfxml")
    print('KBpedia property annotation build is complete.')
```
Second, a number of the range types (such as wgs84) are not supported internally by owlready2, and there is no facility to add them directly to the system. I have made outreach to the responsive developer of owlready2, Jean-Baptiste Lamy, to see whether we can fill this gap before we go live with KBpedia v 3.00. (Update: Within two weeks, Jean-Baptiste responded with a fix and new definition capabilities.) Meanwhile, there are relatively few instances of this gap, so we are in pretty good shape to move forward as is. Only a handful of resources are affected by these gaps, out of a total of 58 K.
The changing of an identifier for a knowledge graph resource is not encouraged. Most semantic technology advice is simply to pick permanent or persistent URIs. There is thus little discussion or guidance as to what is best practice when an individual resource ID does need to change. Our change from hyphens to underscores (1) is one such example of when an ID needs to change.
The best point of intervention is at the Web server, since our premise for knowledge graphs is Web-accessible information obtained via (generally) HTTP. While we could provide internal knowledge graph representations to capture the mapping between old and new URIs, an external request in the old form still needs to get a completion response for the new form. The best way to achieve that is via content negotiation by the server.
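As a hedged illustration of the server-side idea, the sketch below answers a request for an old (hyphenated) URI with a 301 permanent redirect to its new (underscore) form. The paths and the mapping dictionary are hypothetical; in practice the mapping would be loaded from the published correspondence file and implemented in the Web server configuration itself.

```python
# Hedged sketch of the server-side resolution idea: requests for an old
# (hyphenated) URI receive a 301 permanent redirect to the new (underscore)
# form. Paths are hypothetical; in practice the mapping would be loaded from
# the published correspondence file and applied in the Web server config.
OLD_TO_NEW = {
    '/kko/rc/Chemistry-Topic': '/kko/rc/Chemistry_Topic',
}

def resolve(path):
    """Return an (HTTP status, location) pair for a requested path."""
    if path in OLD_TO_NEW:
        return 301, OLD_TO_NEW[path]   # permanent redirect to the new URI
    return 200, path                   # unchanged URIs are served directly

print(resolve('/kko/rc/Chemistry-Topic'))
```

A 301 (rather than 302) is the appropriate status here, since the identifier change is permanent and clients should update their stored links.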
Under circumstances where prior versions of your knowledge graph were in broad use, the recommended approach would be to follow the guidelines of the W3C (the standards-setting body for semantic technologies) for how to publish a semantic Web vocabulary. This guidance is further supplemented with recipes for how to publish linked data under the rubric of ‘cool URIs’. Following this guidance is much easier than updating URIs in place.
However, because of decisions yet to be documented to not implement linked data (see CWPK #60 when it is published in about three weeks), the approach we will be taking is much simpler. We will generate a mapping (correspondence) file between the older, retired URIs (the ones with the hyphens) and the new URIs (the ones with the underscores). We will announce this correspondence file at the time of the v 3.00 release, which we have earmarked to occur at the conclusion of this CWPK series. The responsibility for URI updates, if needed, will be placed on existing KBpedia users. This decision violates the recommended best practice of never changing URIs, but we deem it manageable based on our current user base and their willingness to make those modifications directly. Posting this correspondence file will be one of the last steps before KBpedia v 3.00 goes fully ‘live’.
So, we completed the full build, but kept a copy of the build one step removed to return to if (when) we get a better processing of the range types that owlready2 does not yet support.
The effort was greater than I anticipated. Actual processing time for a full re-build across all steps was about 90 min. There was perhaps another 8-12 hrs in working through developing the code and solving (or mostly so) the edge cases.
This is the first time I have done this re-build process with Python, but it is a process I have used and learned to improve for nearly a decade. I’m pretty pleased about the build process itself, but am absolutely thrilled with the learning that has taken place to give me tools at-hand. I’m feeling really positive about how this CWPK series is unfolding.
Part IV Conclusion
This brings to a close Part IV in our CWPK series. When I first laid out the plan for this series, I told myself that eventual public release of the series and its installments depended on being able to fully ‘roundtrip’ KBpedia. I was somewhat confident setting out that this milestone could be achieved. Today, I know it to be so, and can now begin the next steps of releasing the installments and their accompanying Jupyter Notebook pages. Successfully achieving the roundtrip milestone meant we could begin publishing our CWPK series on July 27, 2020. Woohoo!
In terms of the overall plan, we are about two-thirds of the way through the entire anticipated series. We next tackle the remaining steps in completing a full, public release of the knowledge graph. Then, we use the completed KBpedia v 3.00 to put the knowledge graph through its paces, doing some analysis, some graphing, and some machine learning. As of this moment, we have a target of 75 total installments in this Cooking with Python and KBpedia series, which we hope to wrap up by mid-November or so. Please stay with us for the journey!