It’s Time for Ontologies to Put on Their Big Boy Pants
Many of us have been in the semantic Web game for more than a decade. My own first exposure in the early 2000s was spent trying to figure out what the difference was between XML and RDF. (Fortunately, that confusion has long since passed.) We also grappled with the then-new concept of ontologies, now more easily understood as knowledge graphs. In this process many lessons have been learned, but also much promise has yet to be realized.
One of the most important lessons is that the semantic Web is best seen not as an end unto itself. Rather, it, and the semantic technologies that underly it, is really just a means to get at important, longstanding challenges in data interoperability and artificial intelligence. Our work with knowledge graphs needs to be viewed through this lens of what we can do with these technologies to address real problems, not solely for technology’s sake.
It is with this spirit in mind that we are working on our next release of KBpedia, the knowledge structure that knits together six major public knowledge bases for the purpose of speeding machine learning and providing a scaffolding for data interoperability. This pending new release will expand KBpedia in important ways. It will provide a next step on the path to realizing the promise of knowledge graphs.
I will be sharing a series of articles to lay the groundwork for this release, as well as then, after release, to explain what some of it means. This first article begins by discussing the state-of-the-art in semantic knowledge graphs, what they currently do, and what they (often) currently don’t. I grade each of three major areas related to knowledge graphs, in declining order of achievement. My basic report is that we have gotten many things right — witness the growth and credibility of knowledge graphs across all current search services and intelligent agents — but it is time for knowledge graphs to grow out of knickers and don big boy pants.
The Current Knowledge Graph Reader: Concepts and Entities
We watch our children first learn the names of things as they begin mastering language. The learning focus is on nouns, and building a vocabulary about the things that populate the tangible world. By the time we begin putting together our first sentences, lampooned in such early books such as Dick and Jane and the dog Spot, our nouns are getting increasingly numerous and rich, though our verbs remain simple. Early language acquisition, as with the world itself, is much more populated by different kinds of objects than different kinds of actions. Our initial verbs tend to be fewer in number and much less varied than the differences of form and circumstance we can see from objects. Most knowledge graphs have an orientation to things and concepts, the nouns of the knowledge space, much like a Dick and Jane reader. Entities and concepts have occupied my own work on the UMBEL and KBpedia ontologies over the past decade. It is clear similar emphasis has occurred in public knowledge bases, as well. Nouns and categorizing things have been the major focus of efforts to date.
For example, major knowledge base constituents of KBpedia, such as Wikidata, Wikipedia or GeoNames, have millions of concepts or entities within them, but fewer than a few thousand predicates (approx. 2500 useful in Wikidata and 750 or so in DBpedia and schema.org). Further, reasoners that we apply over these graphs have not been expanded to deal with rich predicates. Reasoners mostly rely on inference over subsumption hierarchies, disjointedness, and property conditions like cardinality and range. Mapping predicates are mostly related to subsumption and equivalence, with the latter commonly misused .
Yet, even within the bounds of nouns, we unfortunately have not done well in identifying context. Disambiguation is made difficult without context. Though context may be partially described by nouns related to perception, situations, states and roles, we ultimately require an understanding of events, actions and relations. Until these latter factors are better captured and understood, our ability to establish context remains limited.
The semantic technology languages of RDF and OWL give us the tools to handle these constructs, at least within the limits of first-order logic, but we have mostly spent the past 15 years mastering kindergarten-level basics. To illustrate how basic this is, try to understand how different knowledge graphs treat entities (are they individuals, instances, particulars, or including events or concepts?) versus concepts (are they classes, types, generals, including or not abstractions?). There is certainly not uniformity of treatment of these basic noun grammars. Poor mappings and the inability to capture context further drag down this grade.
Only the Simplest of Relations
I’ve already noted the paucity of relations in (most) current knowledge graphs. But a limited vocabulary is not the only challenge.
There is no general nor coherent theory expressed in how to handle relations in use within the semantic Web. We have expressions that characterize individual things, what we, in our own work, term attributes. We have expressions that name or describe things, including annotations or metadata, what we term denotatives. We have expressions that point to or indicate things, what we term indexicals. And, we have expressions that characterize relations between external objects or actions an agent might take, what we term external relations. These are some of our terms for these relations — which we will describe in detail in the second and third parts of this series — but it is unlikely you will find most or all of these distinctions in any knowledge graph. This lack is a reflection of the inattention to relations.
Modeling relations and predicates needs to capture a worldview of how things are connected, preferably based on some coherent, underlying rationale. Similar to how we categorize the things and entities in our world, we also need to make ontological choices (in the classic sense of the Greek ontos, or the nature of being) as to what a predicate is and how predicates may be classified and organized. Not much is discussed about this topic in the knowledge graph literature, let alone put into practice.
The semantic Web has no well-known or accepted ontology of relations or properties. True, OWL offers the distinction of annotation, object and datatype properties, and also allows property characteristics such as transitivity, domain, range, cardinality, inversion, reflexivity, disjunction and the like to be expressed, but it is a rare ontology that uses any or many of these constructs. The
subProperty expression is used, but only in limited instances and rarely (none, to my knowledge) in a systematic schema. For example, it is readily obvious that some broader predicate such as
animalAction could be split into
voluntaryAction, and then into specific actions such as
walking, and so on, but schema with these kinds of logical property subsumptions are not evident. Structurally, OWL can be used to reason over actions and relations in a similar means as we reason over entities and types, but our common ontologies have yet to do so. Yet creating such schema are within grasp, since we have language structures such as VerbNet and other resources we could put to the task.
We want to establish such a schema so as to be able to reason and organize (categorize) actions and relations. We further want such a schema to segregate out intrinsic relations (attributes) from relations between things, or from descriptions about or indexes to things. This greater understanding is exactly what is needed to reason over relations. It is also what is called for in being able to relate parsed tokens to a semantic grammar. Relation and fact extraction from text further requires this form of schema. Without these broader understandings, we can not adequately capture situations and context, necessary for disambiguating the things in our world.
Though the splits and names may not be exactly as I would have preferred, we nonetheless have sufficient syntax and primitives in OWL by which we can develop such a schema of relations. However, since virtually nothing has been done in this regard over the 15 years of the semantic Web, I have to decrement its grade accordingly.
Oh, and Then There’s the Problem with Data
Besides machine learning, my personal motivations and strongly held beliefs in semantic technologies have been driven by the role they can play in data interoperability. By this term I mean the ability to bring information together from two or more sources so as to effectively analyze and make decisions over the combined information. The first challenge in data interoperability is to ensure that when we talk about things in two or more settings, we understand whether we are talking about the same or different things. To date, this has been a primary use of semantic technologies, though equivalence distinctions remain problematic . We can now relate information in unstructured, semi-structured and structured formats to a common basis. Ontologies are getting mature for capturing nouns. That portion of data interoperability, as noted above, gets a grade of B-.
But there are two additional factors in play with data interoperability. The first is to ensure we are understanding situations and contexts, what received a grade of C above. The remaining factor is actually relating the values associated with the entities or things at hand. In this regard, our track record to date has been abysmal.
As Kingsley Idehen is wont to explain, the linked data model of the semantic Web can be seen to conform to the EAV (entity-attribute-value) data model. We can do pretty well about entities (E), so long as we agree what constitutes an entity and we can accept some mis-assignments. No one really agrees as to what constitutes an attribute (A) (a true attribute, a property, or something other). And while we all intuitively know what constitutes a value, there is no agreement as to data types, units, or ways to relate values in different formats to one another. Though the semantic Web knows how to pump out data using the EAV model, there’s actually very little guidance on how we ingest and conform values across sources. Without this factor, there is no data interoperability. The semantic Web may know how to port relational data to a semantic model, but it still does not how to reconcile values. The ABox, in descriptive logic terms, is barely being tapped .
We fortunately have a rich reservoir of past logical, semantic and philosophical writings to draw upon in relation to all of these factors. We also have many formalized measuring systems and crosswalks between many of them. We are also seeing a renewed effort surrounding more uniform ways to characterize the essential characteristics of data, namely quantities, units, dimensions and datatypes (QUDT) . Better models for data interoperability and resolving these areas exist. Yet, however, insufficient time and effort has yet been expended to bring these resources together into a logical, computable schema. Until all of these factors are brought together with focus, actual data interoperability based on semantic technologies will remain limited.
Why Important to AI?
Relations identification and contextual understanding are at the heart of current challenges in artificial intelligence applications related to knowledge and text. Without these perspectives, it is harder to do sentiment analysis, “fact” (or assertion) extraction, reasoning over relations, reasoning over attributes, context analysis, or disambiguation. We need to learn how to speak the King’s English in these matters, and graduate beyond kindergarten readers.
Deep learning and inclusion of both supervised and unsupervised machine learning is best served when the feature (variable) pool is rich and logically coherent, and when the output targets are accurately defined. “Garbage in, garbage out” applies to artificial intelligence learning in the very same ways it applies to any kind of computational activity. We want coherence, clarity and accuracy in our training sets and corpuses no less than we want it in our analysis and characterizations of the world.
“Dirty” training bases with embedded error can be trained to do no better than their inputs. If we want to train our knowledge applications with Dick and Jane reader inputs, too often in error to begin with, we will not get beyond the most basic of knowledge levels. We can not make the transition to more sophisticated levels without a more sophisticated understanding of the symbolic means for communicating knowledge: that is, human language. Predicate understanding expressed through predicate representations are necessary for predicate logic.
To be sure, progress has been made in the first decade and one-half of the semantic Web. We have learned many best practices and have started to get pretty good in capturing nouns and their types. But what results is a stilted, halting conversation. To begin to become fluent, our knowledge bases must be able to capture and represent verbs, actions and events.
The Anticipated Series
Part II of this series will discuss what the ontological role is of events, and how that relates to a broader model of actions, activities and situations. This foundation will enable a discussion in Parts III and IV of the actual relations model in KBpedia, and how it is expressed in the KBpedia Knowledge Ontology (KKO). A summary of the KBpedia grammar will be provided in Part V. These next parts will set the context for our release of KBpedia v 150, incorporating these new representations, to coincide at the same time.
After this release of KBpedia, the series will continue to discuss such topics as what is real and reality, and some speculations as to practical applications arising from the new relations capabilities in KBpedia. Some of the topics to be discussed in concluding parts will be semantic parsers and natural language understanding, robotics as a driving force in expanded knowledge graphs, and best practices for constructing capable ontologies.
Throughout this series I will repeatedly harken to the teachings of Charles Sanders Perice, and how his insights in logic and sign-making help inform the ontological choices that we are making. We have been formulating our thoughts in this area for years, and Peirce provides important guidance for how to crack some very hard nuts. I hope we can help the evolving state of knowledge graphs grow up a bit, in the process establishing a more complete, coherent and logical basis for constructing knowledge structures useful for advances in artificial intelligence.