Making the Transition to Methods and Modules
With this installment, we transition to our third major part in our Cooking with Python and KBpedia series. We have evaluated and decided upon our alternatives, then installed and configured them while gaining some exposure, and now are transitioning to applying those tools to developing our first methods. This transition will culminate with us packaging our first module in the KBpedia system, in the process beginning to undertake bulk modifications. These bulk capabilities are at the heart of adopting and then extending KBpedia for your own domain purposes. Of course, in still later installments, we will probe more advanced methods and capabilities, but this current part will help us move in that direction by setting the Python groundwork. Besides this intro article, this third major part is almost entirely devoted to Python code and code management.
When I begin posting that code, you will note that I change the standard blue message box at the conclusion of each installment. Yes, I’m a newbie, though with some exposure to programming best practices, but I am still most decidedly an amateur. One of the fun things in working with Python is the multiplicity of packages or modules or styles available to you as a programmer (amateur or not). There are great books and great online resources, which I often cite as we move forward, but I have found interactive coding to be an absolute blast with Jupyter Notebook. One can literally search and find Python coding options immediately on the Web and then test them directly in the notebook. I love the immediate testing, the tactile sense of interacting with the important code blocks. Knowing this, it is helpful to always bring forward the same environment and domain each time I work with the system. That means I am always working with information of relevance and testing routines of importance. I also like the ability to really do Knuth’s literate programming with the interspersing of comment and context.
So, as we kick off this new part, I wanted to start with a largely narrative introduction. I know where I want to go with this series, but since I am documenting as I go, I really don’t know for sure the path to get to objectives. I thought, therefore, that how I think about things and problems could be a logic trace for your own way to think about things. I think thinking in programmatic terms is more dynamic than report writing or project planning, my two main activities for decades. Coding is a faster, more consuming sport.
Why the Idea of ‘Roundtripping’?
Our experience over the past decade has brought three main lessons to the fore. First, knowledge — and, therefore, knowledge graphs to represent it — is dynamic and must be updated and maintained on a constant basis. A static knowledge graph is probably better than none at all, but is a far cry from the usefulness that can be gained from making knowledge currency an objective of a knowledge graph (and its supporting knowledge bases).
Second, while in expression they may be complex, knowledge systems are fundamentally simple and understandable. The complexity of a knowledge system arises from simple rules interacting in exponentially many ways. Implications are deductive, predictions are inductive, and new knowledge arises from abductive ways to interact with these systems. We should be able to break down our knowledge systems into fundamentally simple structures, modify those simple structures, and then build them back up again to their non-linear dynamics.
And, third, we have multiple ways we need to interact with knowledge graphs and bases, and multiple best-of-breed tools to do so. Sometimes we want to build and maintain a knowledge structure as something unto itself, with logic and integrity checks and tests and expansions of capabilities. Other actions might have us staging data to third-party applications for AI or machine learning. We also may need to make bulk modifications for specific application purposes or to tailor the graph specifically to our current domain. The different tools that might support these and other activities are best served when something akin to a common data interchange format is found. In our case, that is CSV in UTF-8 format, often expressed as N3 semantic triples.
Once the idea of a common exchange framework emerges, the sense of ‘roundtripping’ becomes obvious. We use one tool for one purpose, export its information in a format with semantics sufficient for another tool to ingest it, make changes to it, and export it back again in a format with semantics readable by the first tool. Actually, in practice, good roundtripping more resembles a hub-and-spoke design, with the common representation framework at the hub and common to the spokes.
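To make the interchange idea concrete, here is a minimal sketch of a round trip through CSV in UTF-8, using only the Python standard library. The triples and their prefixed identifiers are hypothetical stand-ins, not actual KBpedia assertions, and an in-memory buffer stands in for the files a real build would use.

```python
import csv
import io

# Hypothetical subject-predicate-object rows; the IRIs are illustrative only
triples = [
    ("kko:Mammal", "rdfs:subClassOf", "kko:Animal"),
    ("kko:Mammal", "skos:prefLabel", "mammal"),
]

# Export to the common interchange format: CSV rows of triples
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(("s", "p", "o"))          # header row
writer.writerows(triples)

# Re-ingest the same rows with another tool's reader, closing the round trip
buffer.seek(0)
reader = csv.reader(buffer)
header = next(reader)                     # skip the header
roundtripped = [tuple(row) for row in reader]

assert roundtripped == triples            # nothing lost in transit
```

In a hub-and-spoke arrangement, each tool only needs to read and write this one hub format rather than every other tool's native representation.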
In our design of the KBpedia processing system moving forward, then, we will want to break down, or ‘decompose’ working, fully-specified and logically tested knowledge graphs into component parts that we can work with and modify offline, so to speak. We may work with and modify these component parts quite extensively in this ‘offline’ mode. We could, for example, swap out entire modules for specific domains with our own favored representations of that domain. We may also want to isolate all of our language strings to translate the knowledge graph to other languages. Or we may want to prune areas while we expand the specificity in others. We may even make changes in big chunks to the grounded upper structure of our knowledge graph because our design is inherently malleable. A huge source of ossification in knowledge graphs is this inability to be decomposed into re-processible building blocks.
A Mindset of Patterns
These big design considerations have a complementary part at the implementation level of the code. The same drivers of hierarchy and generalizability that govern a modular architecture also govern code design, or so it seems to me. Maybe it is because of this pattern of break-down and build-up of the specification components of KBpedia that I also see repeatability in code steps. We start with a file and its context. We process that file to extract its mapped semantics and data. We manipulate that storehouse of assertions to support many applications. We are continually learning and adding to the storehouse. We make bulk moves and bulk changes to our underlying data. We are constantly opening and writing to files, and representing our information as two-dimensional arrays of records (rows) and characteristics (columns). We need to monitor changes and log errors and events to file while processing. We need to find our stored procedures and save our work so we may readily find it again.
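The repeated steps above can be sketched as one generic pattern: open a source, walk its rows as records with named characteristics, and log rather than halt on problems. This is only an illustration of the pattern, assuming hypothetical field names; an in-memory string stands in for an input file.

```python
import csv
import io
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("cwpk")

# Stand-in for an extraction file: rows are records, columns characteristics
raw = "id,prefLabel\nC1,mammal\nC2,\n"

records = []
for row in csv.DictReader(io.StringIO(raw)):
    if not row["prefLabel"]:              # log the problem and keep processing
        log.warning("missing prefLabel for %s", row["id"])
        continue
    records.append(row)

print(len(records))  # prints 1: only the complete record is kept
```

The same skeleton recurs whether the loop body is extracting labels, mapping properties, or staging data for another tool; only the per-record logic changes.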
The idea of patterns and the power it brings to questions of scoping, abstraction, and design is substantial. I agree with the dictum that if you do something three times you should generalize and code it. My guess is that the search for the better algorithm and design is a key motivator to the professional programmer. For my purposes, however, this mindset is really just one of trying to think through generic activities that a given code block is intended to address, and then assess if more than three applications of this block (or parts of it) are likely across the intended code base. Once so stated, it is pretty obvious that ‘generalizability’ is very much a function of current use and context, so one dynamic aspect of programming is the continual refactoring of prior code to make it generalizable. When stated in words that way, it sounds perhaps a little crazy. But, in practice, generalizability of code leads to further simplicity, maintainability, and (hopefully) efficiency.
Python has many wonderful features to support patterns. One may, for example, adopt a ‘functional’ programming style in working with Python, even though the language was not initially designed to be functional. Whatever existing programming style one adopts, Python supports extending its functionality within that style.
Any information passed to those routines should also be abstracted to logical names within input records. Automation only occurs through generalization. Like the simplicity argument made above, simple machines like automatons are easier to orchestrate and manage, even if their outcomes appear chaotic. So, what I think we would like to do in the totally abstract is have a limited number of functional method primitives to which we pass generic instructions and information using a relatively small subset of named objects. Again, this is one of the key strengths of Python: the objectification of the language linked to nameable spaces.
High-level Build Overview
In its most general terms, we build KBpedia from three (actually, four; I cheat, and will explain in a bit) pieces. The first is the structural scaffolding of concepts and their ‘is-a’ hierarchical relationships. The second is the properties of the instances that the concepts represent, and how we understand, qualify, and quantify those things. The third piece is the way we label or describe or point to or indicate those things.
From these components we can build, and in the process logically test, the entire KBpedia from scratch. Since that is now working in our internal implementations with Clojure, that is a de minimis capability we want to capture in Python. While the build process begins with these input files and adds to the core starting point (the ‘bootstrap’ as best understood), we do not have that Python build code as we start out. Further, in a strange way, we never did have such a starting point for KBpedia in Clojure anyway. The code base for KBpedia we inherited from the previous-generation UMBEL. And, UMBEL had some historical methods for building its knowledge graph directly from OpenCyc. The modular build routines had never been re-factored into the core routines of either UMBEL or KBpedia!
Fundamentally, this is not a big deal, since our modular approach and additions and modifications present no conceptual or implementation challenges. Still, the fact remains that our Clojure build routines do not begin at the root build premise. The easier way to bootstrap into a complete code base for roundtripping, then, is to first extract away the logical pieces from the coherent full KBpedia, until there is nothing left but the ‘core’ of the ontology. This core, of course, is the KBpedia Knowledge Ontology, or KKO. For the bootstrapping process to work, we begin with a KKO-specified core, and then extract or add pieces to it. We extract when we are capturing changes to the ontology graph that might have been made while in production or development using something like the Protégé IDE. We build when we are submitting our modifications to the ‘core’ and its existing components while testing for consistency and satisfiability.
Thus, while we may be tackling specific tasks a little backwards by dealing with extraction first, in the spirit of roundtripping these are merely questions of where one breaks into the system. For this CWPK series, that starts with extraction.
By the way, what was that reference to the fourth piece? Well, it is mapping KBpedia to external sources to facilitate retrieval and integration. We will cover that topic as well toward the end of our series. We are able to defer this topic since the mapping question is a bit of a secondary orbit from the central question of building and modifying KBpedia (or its derivatives).
A Caution and Some Basic Intuitions
My caution is just to reiterate that the Python code to come is one approach, among certainly many options, most of which I am sure would be easier to understand or better performing than what I am offering. Yet there is much to be said about getting ‘first twitch’ from these Jupyter Notebook installments and being able to test and extend these notions on your own.
And, what are these notions? Given the functional richness of the Python landscape it is only fair that I share some of my prejudices and intuitions about the specific methods put forth in the remaining code. Here are a few:
- I like the idea of ‘generators’. Much of what we deal with in these scripts and KBpedia itself can be expressed in the ‘generator’ style of efficiently looping over specific sets or iterating things
- A ‘set’ notation is at the heart of W3C standards (though sometimes masked as such) and the Python built-in set manipulation methods seem to be a powerful way of manipulating and comparing very large datasets. The set notation includes terminology such as intersection, union, difference, disjoint, subset, update, etc.
- And, our view of CSV files as a central standard likely means we need to investigate and compare and choose among multiple CSV options in Python.
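The three intuitions above can be combined in a short sketch: a generator that lazily yields CSV rows, whose results feed Python's built-in set operations for comparing two versions of a graph fragment. The triples are hypothetical stand-ins, and in-memory strings stand in for files on disk.

```python
import csv
import io

# A generator that lazily yields rows from a CSV source, one at a time;
# in practice the source would be an open file handle, not a string buffer
def rows(source):
    yield from csv.reader(source)

# Two hypothetical snapshots of the same graph fragment, as triple sets
graph_a = {tuple(r) for r in rows(io.StringIO("C1,subClassOf,C0\nC2,subClassOf,C0\n"))}
graph_b = {tuple(r) for r in rows(io.StringIO("C2,subClassOf,C0\nC3,subClassOf,C0\n"))}

# Built-in set operations compare large triple collections directly
shared = graph_a & graph_b      # intersection: assertions in both versions
added  = graph_b - graph_a      # difference: assertions new to graph_b
merged = graph_a | graph_b      # union: all assertions combined
```

Because generators yield one row at a time, the same pattern scales from these toy inputs to the bulk files of a full KBpedia build without holding everything in memory at once.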
Once we get these basic coding methods in place it is time to turn our efforts into a standard Python module. Our transition will be aided by working the Spyder IDE into our code-development workflow toward the end of this third part.