Posted: September 21, 2015

Practical and Reusable Designs to Make Knowledge Bases Computable

Wikipedia is a common denominator in question answering and commercial natural language applications that leverage artificial intelligence. Witness Siri, Watson, Cortana and Google Now, among others. DBpedia is a structured data representation of Wikipedia that makes much of this content machine readable. Wikidata is a multilingual repository of 18 million structured entities now feeding the Wikipedia ecosystem. The availability of these sources is remaking and accelerating the role of knowledge bases in powering the next generation of artificial intelligence applications. But much, much more is possible.

All of these noted knowledge bases lack a comprehensive and coherent knowledge structure. They are not computable, nor can they be reasoned over or used for inference. While they are valuable resources for structured data and content, the vast potential in these storehouses remains locked up. Yet the relevance of these sources to drive an artificial intelligence platform geared to data and content is profound.

And what makes this potential profound? Well, properly structured, knowledge bases can provide the features and generation of positive and negative training sets useful to machine learning. Coherent organization of the knowledge graph within the KB’s domain enables various forms of reasoning and inference, further useful to making fine-grained recognizers, extractors and classifiers applicable to external knowledge. As I have pointed out before with regard to knowledge-based artificial intelligence (or KBAI) [1], these capabilities can work to extract still more accurate structure and knowledge from the knowledge base, creating a virtuous circle of still further improvements.

In all fairness, the Wikipedia ecosystem was not designed to be a computable one. But the free and open access to content in the Wikipedia ecosystem has sparked an explosion of academic and commercial interest in using this knowledge, often in DBpedia machine-readable form. Yet, despite this interest and more than 500 research papers in areas leveraging Wikipedia for AI and NLP purposes [2], the efforts remain piecemeal and unconnected. Yes, there is valuable structure and content within the knowledge bases; yes, they are being exploited both for high-value bespoke applications and for small research projects; but, across the board, these sources are not being used or leveraged in anything approaching a systematic nature. Each distinct project requires anew its own preparations and staging.

And it is not only Wikipedia that is neglected as a general resource for AI and semantic technology applications. One is hard-pressed to identify any large-scale knowledge base, available in electronic form, that is being sufficiently and systematically exploited for AI or semantic technology purposes [3]. This gap is really rather perplexing. Why the huge disconnect between potential and reality? Could this gap also be related to why the semantic technology community continues to bemoan the lack of “killer apps” in the space? Is there something possibly even more fundamental going on here?

I think there is.

We have spent eight years so far on the development and refinement of UMBEL [4]. It was intended initially to be a bridge between unruly Web content and reasoning capabilities in Cyc to enable information interoperability on the Web [5]; an objective it still retains. Naturally, Wikipedia was the first target for mapping to UMBEL [6]. Through our stumbling and bumbling and just serendipity, we have learned much about the specifics of Wikipedia [6], aspects of knowledge bases in general, and the interface of these resources to semantic technologies and artificial intelligence. The potential marriage between Cyc, UMBEL and Wikipedia has emerged as a first-class objective in its own right.

What we have learned is that it is not any single thing, but multiple things, that is preventing knowledge bases from living up to their potential as resources for artificial intelligence. As I trace some of the sources of our learning below, note that it is a combination of conceptual issues, architectural issues, and terminological issues that need to be overcome in order to see our way to a simpler and more responsive approach.

The Learning Process Began with UMBEL’s SuperTypes

Shortly after the initial release of UMBEL we undertook an effort in 2009 to split it into a number (initially 33) of mostly disjoint “super types” [7]. This logical segmentation was done for practical reasons of efficiency and modularity. It forced us to consider what is a “concept” and what is an “entity”, among other logical splits. It caused us to inspect the entire UMBEL knowledge space, and to organize and arrange and inspect the various parts and roles of the space.

We began to distinguish “attributes” and “relations” as separate from “concepts” and “entities”. Within the clustering of “entities” we could also see that some things were distinct individuals or entity instances, while other terms represented “types” or classes of like entities. At that time, “named entity” was a more important term of art than is practiced today. In looking at this idea we noted [7]:

The intuition surrounding “named entity” and nameable “things” was that they were discrete and disjoint. A rock is not a person and is not a chemical or an event. … some classes of things could also be treated as more-or-less distinct nameable “things”: beetles are not the same as frogs and are not the same as rocks. While some of these “things” might be a true individual with a discrete name, such as Kermit the Frog, or The Rock at Northwestern University, most instances of such things are unnamed. . . . The “nameability” (or logical categorization) of things is perhaps best kept separate from other epistemological issues of distinguishing sets, collections, or classes from individuals, members or instances.

Because we were mapping Cyc and Wikipedia using UMBEL as the intermediary, we noticed that some things were characterized as a class in one system, while being an instance in the other [8]. In essence, we were learning the vocabulary of knowledge bases, and beginning to see that terminology was by no means consistent across systems or viewpoints.

This reminds me of my experience as an undergraduate, learning plant taxonomy. We had to learn literally hundreds of strange terms such as glabrous or hirsute or pinnate, all terms of art for how to describe leaves, their shapes, their hairiness, fruits and flowers and such. What happens, though, when one learns the terminology of a domain is that one’s eyes are opened to see and distinguish more. What had previously been for me a field of view merely of various shades of green and shrubs and trees, emerged as distinct species of plants and individual variants that I could clearly discern and identify. As I learned nuanced distinctions I began to be able to see with greater clarity. In a similar way, the naming and distinguishing of things in our UMBEL SuperTypes was opening up our eyes to finer and more nuanced distinctions in the knowledge base. All of this striving was in order to be able to map the millions and millions of things within Wikipedia to a different, coherent structure provided by Cyc and UMBEL.

ABox – TBox and Architectural Basics

One of the clearest distinctions that emerged was the split between the TBox and the ABox in the knowledge base, the difference between schema and instances. Through the years I have written many articles on this subject [9]. It is fundamental to understand the differences in representation and work between these two key portions of a knowledge base.

Instances are the specific individual things in the KB that are relevant to the domain. Instances can be many or few, as in the millions within Wikipedia, accounting for more than 90% of its total articles. Instances are characterized by various types of structured data, provided as key attribute-value pairs, and which may be explained through long or short text descriptions, may have multiple aliases and synonyms, may be related to other instances via type or kind or other relations, and may be described in multiple languages. This is the predominant form of content within most knowledge bases, perhaps best exemplified by Wikipedia.

The TBox, on the other hand, needs to be a coherent structural description of the domain, which expresses itself as a knowledge graph with meaningful and consistent connections across its concepts. Somewhat irrespective of the number of instances (the ABox) in the knowledge base, the TBox is relatively constant in size given a desired level of descriptive scope for the domain. (In other words, the logical model of the domain is mostly independent from the number of instances in the domain.)

For a reference structure such as UMBEL, then, the size of its ontology (TBox) can be much smaller and defined with focus, while still being able to refer to and incorporate millions of instances, as is the case for Wikipedia (or virtually any large knowledge base). Two critical aspects for the TBox thus emerge. First, it must be a coherent and reasonable “brain” for capturing the desired dynamics and relationships of the domain. And, second, it must provide a robust, flexible, and expandable means for incorporating instance records. This latter “bridging” purpose is the topic of the next sub-section.
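The independence of TBox size from instance count can be illustrated with a minimal sketch. The type names, instance assignments, and helper functions below are hypothetical and chosen only for illustration; they are not drawn from UMBEL itself.

```python
# Hypothetical TBox: a small, fixed hierarchy of types (child -> parent).
TBOX = {
    "Animal": "Thing",
    "Bird": "Animal",
    "Toucan": "Bird",
    "Person": "Thing",
    "Astronaut": "Person",
}

# Hypothetical ABox: instance records, each asserted against one type.
ABOX = {
    "Pretty Bird": "Toucan",
    "Kermit the Frog": "Animal",
    "Neil Armstrong": "Astronaut",
}


def ancestors(type_name):
    """Walk up the TBox from a type to the root, returning every type passed."""
    chain = [type_name]
    while chain[-1] in TBOX:
        chain.append(TBOX[chain[-1]])
    return chain


def instances_of(type_name):
    """Return instances whose asserted type is type_name or any subtype of it."""
    return sorted(name for name, t in ABOX.items() if type_name in ancestors(t))


# Adding an instance grows the ABox only; the TBox stays the same size.
tbox_size_before = len(TBOX)
ABOX["Sally Ride"] = "Astronaut"
tbox_size_after = len(TBOX)
```

Each instance is asserted once, against its most specific type, and membership in the broader types follows from the hierarchy. That is the sense in which a compact TBox can serve millions of instance records.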

The TBox-ABox segregation, and how it should work logically and pragmatically, requires much study and focus. It is easy to read the words and even sometimes to write them, but it has taken us many years of practice and much thought and experimentation to derive workable information architectures for realizing and exploiting this segregation.

I have previously spelled out seven benefits from the TBox-ABox split [10], but there is another one that arises from working through the practical aspects of this segregation. Namely, an effective ABox-TBox split compels an understanding of the roles and architecture of the TBox. It is the realization of this benefit that is the central basis for the insights provided in this article.

We’ll be spelling out more of these specifics in the sections below. These understandings help us define the composition and architecture of the TBox. In the case of the current development version of UMBEL [11], here are the broad portions of the TBox:

Distribution of Types in the UMBEL TBox


Structures (types) for organizing the entities in the domain constitute nearly 90% of the work of the TBox. This reflects the extreme importance of entity types to the “bridging” function of the TBox to the ABox.

Probing the Concept of ‘Entities’

Most of the instances in the ABox are entities, but what is an “entity”? Unfortunately, that is not universally agreed. In our parlance, an “entity” and related terms are:

  • The basic, real things in our domain of interest: entities
  • The way we characterize and describe those individual things: attributes
  • The way we describe connections between two or more of those things: relations, and
  • Aggregations or collections or classes of similar entities, which also share some essence: entity types.

We no longer use the term named entities, though nouns with proper names are almost always entities. By definition, entities cannot be topics or types, and entities are not data types. Some of the earlier typologies by others, such as Sekine [12], also mix the ideas of attributes and entities; we do not. Lastly, by definition, entity types have the same attribute “slots” as all type members, even if no data is assigned in many or most cases. The glossary presents a complete compilation of terms and acronyms used.

The label “entity” can also refer to what is known as the root node in some systems such as SUMO [13]. In the OWL language and RDF data model we use, the root node is known as “thing”. Clearly, our use of the term “entity” is quite different from SUMO’s and resides at a subsidiary place in the overall TBox hierarchy. In this case, and frankly for most semantic matches, equivalences should be judged with care, with context the crucial deciding factor.

Nonetheless, most practitioners do not use “entity” in a root manner. Some of the first uses were in the Message Understanding Conferences, especially MUC-6 and MUC-7 in 1995 and 1997, where competitions for finding “named entities” were begun, as well as the practice of in-line tagging [14]. However, even the original MUC conferences conflated proper names and quantities of interest under “named entities.” For example, MUC-6 defined person, organization, and location names, all of which are indeed entity types, but also included dates, times, percentages, and monetary amounts, which we define as attribute types.

It did not take long for various groups and researchers to want more entity types, more distinctions. BBN categories, proposed in 2002, were used for question answering and consisted of 29 types and 64 subtypes [15]. Sekine put forward and refined over many years his Extended Entity Types, which grew to about 200 types [12]. But some of these accepted ‘named entities’ are also written in lower case, with examples such as rocks (‘gneiss’) or common animals or plants (‘daisy’) or chemicals (‘ozone’) or minerals (‘mica’) or drugs (‘aspirin’) or foods (‘sushi’) or whatever. Some deference was given to the idea of Kripke’s “rigid designators” as providing guidance for how to identify entities; rigid designators include proper names as well as certain natural kind terms such as biological species and substances. Because of these blurrings, the nomenclature of “named entities” began to fade away.

But within a few years the demand shifted to “fine-grained entity” recognition, and the scope and number of types continued to creep up. Here are some additional data points to what has already been mentioned:

  • DBpedia Ontology: 738 types [16]
  • 636 types [17]
  • YAGO: 505 types; see also HYENA [18]
  • Lee et al.: 147 types [19]
  • FIGER: 112 types [20]
  • Gillick: 86 types [21]
  • OpenCalais: 42 types [22]
  • GeoNames: 654 “feature codes” [23]
  • Nadeau: ~100 types [24].

Lastly, the new version of UMBEL has 25,000 entity types, in keeping with this growth trend and for the “bridging” reasons discussed below.

We can plot this out over time on log scale to see that the proposed entity types have been growing exponentially:

Growth in Recognition of Entity Types
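The claim of exponential growth can be checked with a simple least-squares fit on a log scale. The sketch below uses type counts stated in this article; the years are taken from the citation dates, which is an assumption about when each count applies, so the resulting slope is only indicative.

```python
import math

# (year, number of entity types). Counts come from the text; the years are
# citation dates, which is an assumption about when each count applies.
DATA = [
    (1995, 7),      # MUC-6: three name types plus four quantity types
    (2002, 29),     # BBN categories
    (2006, 147),    # Lee et al.
    (2007, 100),    # Nadeau (approximate)
    (2012, 112),    # FIGER
    (2013, 505),    # YAGO
    (2014, 86),     # Gillick
    (2015, 25000),  # UMBEL development version
]


def fit_log_linear(points):
    """Least-squares fit of log10(count) = slope * year + intercept."""
    xs = [year for year, _ in points]
    ys = [math.log10(count) for _, count in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_log_linear(DATA)
# A constant positive slope on a log scale corresponds to exponential growth;
# 10 ** slope is the implied year-over-year multiplier.
annual_growth_factor = 10 ** slope
```

A positive slope here means the counts multiply by a roughly constant factor each year, which is what a straight upward line on a log-scale plot expresses.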


This growth in entity types comes from wanting to describe and organize things with more precision. No longer do we want to talk broadly about people, but we want to talk about astronauts or explorers. We don’t just simply want to talk about products, but categories of manufactured goods like cameras or sub-types like SLR cameras or further sub-types like digital SLR cameras or even specific models like the Canon EOS 7D Mark II (skipping over even more intermediate sub-types). With sufficient instances, it is possible to train recognizers for these different entity types.

What is appropriate for a given domain, enterprise or particular task may vary the depth and scope of what entity types should be considered, which we can also refer to as context. For example, the toucan has sometimes been used as an example of how to refer to or name a thing on the Web [25]. When we inspect what might be a definitive description of “toucan” on Wikipedia, we see that the term more broadly represents the family of Ramphastidae, which contains five genera and forty different species. The picture we display is but of one of those forty species, that of the keel-billed toucan (Ramphastos sulfuratus). Viewing the images of the full list of toucan species shows just how physically divergent these various “toucans” are from one another. Across all species, average sizes vary by more than a factor of three with great variation in bill sizes, coloration and range. Further, if I assert that the picture of the toucan is actually that of my pet keel-billed toucan, Pretty Bird, then we can also understand that this representation is for a specific individual bird, and not the physical keel-billed toucan species as a whole.

The point is not a lesson on toucans, but an affirmation that distinctions between what we think we may be describing occur over multiple levels. Just as there are no self-evident criteria as to what constitutes an “entity type”, there is also not a self-evident and fully defining set of criteria as to what the physical “toucan” bird should represent. The meaning of what we call a “toucan” bird is not embodied in its label or even its name, but in the accompanying referential information that places the given referent into a context that can be communicated and understood. A URI points to (“refers to”) something that causes us to conjure up an understanding of that thing, be it a general description of a toucan, a picture of a toucan, an understanding of a species of toucan, or a specific toucan bird. Our understanding or interpretation results from the context and surrounding information accompanying the reference.

Both in terms of historical usage and trying to provide a framework for how to relate these various understandings of entities and types, we can thus see this kind of relationship:

Evolving Sophistication of Entity Types


What we see is an entities typology that provides the “bridging” interface between specific entity records and the UMBEL reasoning layer. This entities typology is built around UMBEL’s existing SuperTypes. The typology is the result of the classification of things according to their shared attributes and essences. The idea is that the world is divided into real, discontinuous and immutable ‘kinds’. Expressed another way, in statistics, typology is a composite measure that involves the classification of observations in terms of their attributes on multiple variables. In the context of a global KB such as Wikipedia, about 25,000 entity types are sufficient to provide a home for the millions of individual articles in the system.

Each SuperType related to entities has its own typology, and is generally expressed as a taxonomy of 3-4 levels, though there are some cases where the depth is much greater (9-10 levels) [26]. There are two flexible aspects to this design. First, because the types are “natural” and nested [27], coarse entity schema can readily find a correspondence. Second, if external records have need for more specificity and depth, that can be accommodated through a mapped bridging hierarchy as well. In other words, the typology can expand and contract like a squeezebox to map a range of specificities.
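The “squeezebox” behavior can be sketched as a nested typology in which any fine-grained type can be contracted to a coarser ancestor on request. The product hierarchy below follows the camera example given earlier; the dictionary layout and the function names `path_from_root` and `at_depth` are hypothetical, not part of UMBEL.

```python
# Hypothetical slice of a product typology (child -> parent), after the
# camera example in the text.
PARENT = {
    "Camera": "Product",
    "SLR camera": "Camera",
    "Digital SLR camera": "SLR camera",
}


def path_from_root(type_name):
    """Return the chain of types from the typology root down to type_name."""
    path = [type_name]
    while path[0] in PARENT:
        path.insert(0, PARENT[path[0]])
    return path


def at_depth(type_name, max_depth):
    """Contract a type to its ancestor at max_depth (root is depth 0).

    Types already at or above max_depth are returned unchanged.
    """
    path = path_from_root(type_name)
    return path[min(max_depth, len(path) - 1)]
```

An external schema that only knows about cameras would call `at_depth("Digital SLR camera", 1)` and receive `"Camera"`, while a more detailed schema can keep the full depth. Expanding the squeezebox amounts to adding deeper child types under an existing node.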

The internal process to create these typologies also has the beneficial effect of testing placements in the knowledge graph and identifying gaps in the structure as informed by fragments and orphans. The typologies should optimally be fully connected in order to completely fulfill their bridging function.
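One way to test for fragments and orphans is to confirm that every type can reach the typology root by following parent links. The following is a minimal sketch under that assumption; the type names, the root label "Entity", and the deliberately broken entries are all hypothetical.

```python
# Hypothetical typology edges (child -> parent). "Gneiss" points at a parent
# that was never defined, and "Sushi" has no parent at all.
PARENT = {
    "Bird": "Animal",
    "Toucan": "Bird",
    "Animal": "Entity",
    "Gneiss": "Rock",
}
ALL_TYPES = {"Entity", "Animal", "Bird", "Toucan", "Gneiss", "Sushi"}
ROOT = "Entity"


def reaches_root(type_name):
    """True if following parent links from type_name arrives at ROOT."""
    seen = set()
    current = type_name
    while current != ROOT:
        if current in seen or current not in PARENT:
            return False  # cycle, or the chain ends before the root
        seen.add(current)
        current = PARENT[current]
    return True


def disconnected_types():
    """Types that are orphans or sit in a fragment detached from the root."""
    return sorted(t for t in ALL_TYPES if not reaches_root(t))
```

In this toy data the check flags "Gneiss" (its chain ends in an undefined parent, a fragment) and "Sushi" (no parent at all, an orphan). Each flagged type marks a gap to repair before the typology can fully serve its bridging function.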

Extending the Mindset to Attributes and Relations

As with our defined terminology above [28], we can apply this same mindset to the characterizations (attributes) of entities and the relations between entities and TBox concepts or topics. Numerically, these two categories are much less prevalent than entity types. But, the construction and use of the typologies are roughly the same.

Since we are using RDF and OWL as our standard model and languages, one might ask why we are not relying on the distinction of object and datatype properties for these splits. Relations, it is true, by definition need to be object properties, since both subject and object need to be identified things. But attributes, in some cases such as rating systems or specific designators, may also refer to controlled vocabularies, which can (and, under best practice, should) be represented as object properties. So, while most attributes are built around datatype properties, not all are. Relations and attributes are a better cleaving, since we can use relations as patterns for fact extractions and the organization of attributes gives us a cross-cutting means to understand the characteristics of things independent of entity type. These all become valuable potential features for machine learning, in addition to the standard text structure.
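The difference between the two ways of splitting properties can be made concrete with a small sketch. The property names and the `Property` record below are hypothetical; the only rule taken from the text is that relations must be object properties while attributes may be either kind.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Property:
    name: str
    role: str       # "relation" or "attribute" (the split argued for here)
    owl_kind: str   # "object" or "datatype" (the OWL property split)


# Hypothetical properties; names are illustrative only.
PROPERTIES = [
    Property("locatedIn", role="relation", owl_kind="object"),
    Property("heightInMeters", role="attribute", owl_kind="datatype"),
    # An attribute whose values come from a controlled vocabulary.
    Property("filmRating", role="attribute", owl_kind="object"),
]


def is_consistent(prop):
    """Relations must be object properties; attributes may be either kind."""
    if prop.role == "relation":
        return prop.owl_kind == "object"
    return prop.role == "attribute" and prop.owl_kind in {"object", "datatype"}


attribute_names = [p.name for p in PROPERTIES if p.role == "attribute"]
object_attribute_names = [
    p.name for p in PROPERTIES if p.role == "attribute" and p.owl_kind == "object"
]
```

Filtering on `owl_kind` alone would group `filmRating` with the relations. Filtering on `role` keeps it with the other attributes, which is the cleaving preferred here.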

Though, today, UMBEL is considerably more sophisticated in its entities typologies, we already have a start on an attributes typology by virtue of the prior work on the Attributes Ontology [29], which will be revised to conform to this newer typology model. We also have a start on a relations typology based on existing OWL and RDF predicates used in UMBEL, plus many candidates from the Activities SuperType. As with the entities typology, relation types and attribute types may also have hierarchy, enabling reasoning and the squeezebox design. As with the entities typology, the objective is to have a fully connected hierarchy, of perhaps no more than 3-4 levels depth, with no fragments and no orphans.

A Different Role for Annotations

Annotations about how we label things and how we assign metadata to things reside at a different layer than what has been discussed to this point. Annotations cannot be reasoned over, but they can and do play pivotal roles. Annotations are an important means for tagging, matching and slicing-and-dicing the information space. Metadata can perform those roles, but also may be used to analyze provenance and reasoning, if the annotations are mapped to object or datatype properties.

Labels are the means to broaden the correspondence of real-world reference to match the true referents or objects in the knowledge base. This enables the referents to remain abstract; that is, not tied to any given label or string. In best practice we recommend annotations reflect all of the various ways a given object may be identified (synonyms, acronyms, slang, jargon, all by language type). These considerations improve the means for tagging, matching, and slicing-and-dicing, even if the annotations are not able to be reasoned over.
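A minimal sketch of this labeling practice: each abstract referent carries many labels, grouped by language, and an index built from them lets any of those strings resolve back to the referent. The identifiers (such as `ex:Toucan`) and the particular label choices are hypothetical.

```python
from collections import defaultdict

# Hypothetical annotations: an abstract referent ID mapped to labels by
# language. IDs and label choices are illustrative only.
LABELS = {
    "ex:Toucan": {"en": ["toucan", "Ramphastidae"], "es": ["tucán"]},
    "ex:KeelBilledToucan": {
        "en": ["keel-billed toucan", "Ramphastos sulfuratus"],
    },
}


def build_index(labels):
    """Map each lowercased label string to the referents it can denote."""
    index = defaultdict(set)
    for referent, by_language in labels.items():
        for strings in by_language.values():
            for s in strings:
                index[s.lower()].add(referent)
    return index


INDEX = build_index(LABELS)


def lookup(text):
    """Return candidate referents for a surface string, ignoring case."""
    return sorted(INDEX.get(text.lower(), set()))
```

The referent itself never depends on any one string, so adding slang, acronyms, or another language only extends `LABELS` and leaves the knowledge structure untouched.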

As a mental note for the simple design that follows, imagine a transparent overlay, not shown, upon which best-practice annotations reside.

A Simple Design Brings it All Together

The insights provided here have taken much time to discover; they have arisen from a practical drive to make knowledge bases computable and useful to artificial intelligence applications. Here is how we now see the basics of a knowledge base, properly configured to be computable, and open to integration with external records:

Boiling KBs Down to Basics


At the broadest perspective, we can organize our knowledge-base platform into a “brain” or organizer/reasoner, and the instances or specific things or entities within our domain of interest. We can decompose a KB to become computable by providing various type groupings for our desired instance mappings, and an upper reasoning layer. An interface layer of “types”, organized into three broad groupings, provides the interface, or “bridging” layer, between the TBox and ABox. We thus have an architectural design segregating:

  • Topics and upper level — the general organization and “brains” of the domain
  • Entity types — categorizations of the actual things in the space
  • Relation types — the ways that different things are related to, or act upon, one another
  • Attribute types — a structured organization of the ways that individual entities can be described
  • Instances — the individual entities of the domain, and
  • Properties — the source grist for annotations, relation types and attribute types.

Becoming familiar with this terminology helps to show how the conventional understanding of these terms and structure has led to overlooking key insights and focusing (sometimes) on the wrong matters. That is in part why so much of the simple logic of this design has escaped the attention of most practitioners. For us, personally, at Structured Dynamics, it had eluded us for years, and we were actively seeking it.

Irrespective of terminology, the recognition of the role of types and their bridging function to actual instance data (records) is central to the computability of the structure. It also enables integration with any form of data record or record stores. The ability to understand relation types leads to improved relation extraction, a key means to mine entities and connections from general content and to extend assertions in the knowledge base. Entity types provide a flexible means for any entity to connect with the computable structure. And, the attribute types provide an orthogonal and inferential means to slice the information space by how objects get characterized.

Because of this architecture, the reference sources guiding its construction, its typologies, its ability to generate features and training sets, and its computability, we believe this overall design is suited to provide an array of AI and enterprise services:

Machine Intelligence Apps and Services
  • Entity recognizers
  • Relation extractors
  • Event extractors
  • Phrase identification
  • Classifiers
  • Q & A systems
  • Cognitive computing
  • Semantic publishing
  • Knowledge base mappings
  • Sub-graph extraction
  • Ontology development
  • Ontology mappers
  • Entity dictionaries
  • Entity linkers
  • Data conversion and mapping
  • Master data management
  • KB improvements
  • Attribute “slot filling”
  • Disambiguators
  • Duplicates removal
  • Inference and reasoning
  • Sentiment analysis
  • Semantic relatedness
  • Recommendation systems
  • Bespoke analysis
  • Bespoke platforms

By cutting through the clutter — conceptually and terminologically — it has been possible to derive a practical and repeatable design to make KBs computable. Being able to generate features and positive and negative training sets, almost at will, is proving to be an effective approach to machine learning at mass-produced prices.
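As one illustration of how typed instances might yield training sets, the sketch below draws positives from a target type and negatives from types declared disjoint with it. Using disjointness assertions to select negatives is an assumption made for this example, as are the SuperType names, the instance assignments, and the `training_sets` helper.

```python
# Hypothetical instance-to-SuperType assignments.
INSTANCE_TYPE = {
    "Neil Armstrong": "Persons",
    "Sally Ride": "Persons",
    "Canon EOS 7D Mark II": "Products",
    "Pretty Bird": "Animals",
}

# Hypothetical disjointness assertions between SuperTypes.
DISJOINT = {
    frozenset({"Persons", "Products"}),
    frozenset({"Persons", "Animals"}),
    frozenset({"Products", "Animals"}),
}


def training_sets(target_type):
    """Split instances into positives and negatives for one target type.

    Positives are instances of the target type; negatives are instances of
    types asserted to be disjoint with it. Anything else is left out.
    """
    positives, negatives = [], []
    for name, type_name in sorted(INSTANCE_TYPE.items()):
        if type_name == target_type:
            positives.append(name)
        elif frozenset({type_name, target_type}) in DISJOINT:
            negatives.append(name)
    return positives, negatives
```

Because the selection depends only on type assignments and the structure among types, the same routine can be rerun for any type in the typology without hand labeling.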

[1] See M. K. Bergman, 2014. “Knowledge-based Artificial Intelligence,” from AI3:::Adaptive Information blog, November 17, 2014. For additional writings in the series, see
[2] See Fabian M. Suchanek and Gerhard Weikum, 2014. “Knowledge Bases in the Age of Big Data Analytics,” Proceedings of the VLDB Endowment 7, no. 13 (2014) and M.K. Bergman, “SWEETpedia,” listing of Wikipedia research articles, on AI3:::Adaptive Information blog, January 25, 2010; the listing as of its last update included 246 articles. Also, see Wikipedia’s own “Wikipedia in Academic Studies.”
[3] A possible exception to this observation is the biomedical community through its Open Biological and Biomedical Ontologies (OBO) initiative.
[4] See M.K. Bergman, 2007. “Announcing UMBEL: A Lightweight Subject Structure for the Web,” AI3:::Adaptive Information blog, July 12, 2007. Also see
[5] See M.K. Bergman, 2008. “Basing UMBEL’s Backbone on OpenCyc,” AI3:::Adaptive Information blog, April 2, 2008.
[6] See M.K. Bergman, 2015. “Shaping Wikipedia into a Computable Knowledge Base,” AI3:::Adaptive Information blog, March 31, 2015.
[7] M.K. Bergman, 2009. “‘SuperTypes’ and Logical Segmentation of Instances,” AI3:::Adaptive Information blog, September 2, 2009.
[8] This possible use of an item as both a class and an instance through “punning” is a desirable feature of OWL 2, which is the language basis for UMBEL. You can learn more on this subject in M.K. Bergman, 2010. “Metamodeling in Domain Ontologies,” AI3:::Adaptive Information blog, September 20, 2010.
[9] For a listing of these, see the Google query One of the 40 articles with the most relevant commentary to this article is M.K. Bergman, 2014. “Big Structure and Data Interoperability,” AI3:::Adaptive Information blog, August 14, 2014.
[10] M.K. Bergman, 2009. “Making Linked Data Reasonable using Description Logics, Part 1,” AI3:::Adaptive Information blog, February 11, 2009.
[11] The current development version of UMBEL is v 1.30. It is due for release before the end of 2015.
[12] See the Sekine Extended Entity Types; the listing also includes attributes info at bottom of source page.
[14] N. Chinchor, 1997. “Overview of MUC-7,” MUC-7 Proceedings, 1997.
[15] Ada Brunstein, 2002. “Annotation Guidelines for Answer Types”. LDC Catalog, Linguistic Data Consortium. Aug 3, 2002.
[16] Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann, 2009. “DBpedia-A Crystallization Point for the Web of Data.” Web Semantics: science, services and agents on the world wide web 7, no. 3 (2009): 154-165; 170 classes in this paper. That has grown to more than 700; see and
[17] The listing is under some dynamic growth. This is the official count as of September 8, 2015, from Current updates are available from Github.
[18] Joanna Biega, Erdal Kuzey, and Fabian M. Suchanek, 2013. “Inside YAGO2: A Transparent Information Extraction Architecture,” in Proceedings of the 22nd international conference on World Wide Web, pp. 325-328. International World Wide Web Conferences Steering Committee, 2013. Also see Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol, Gerhard Weikum, 2012. “HYENA: Hierarchical Type Classification for Entity Names,” in Proceedings of the 24th International Conference on Computational Linguistics, Coling 2012, Mumbai, India, 2012.
[19] Changki Lee, Yi-Gyu Hwang, Hyo-Jung Oh, Soojong Lim, Jeong Heo, Chung-Hee Lee, Hyeon-Jin Kim, Ji-Hyun Wang, and Myung-Gil Jang, 2006. “Fine-grained Named Entity Recognition using Conditional Random Fields for Question Answering,” in Information Retrieval Technology, pp. 581-587. Springer Berlin Heidelberg, 2006.
[20] Xiao Ling and Daniel S. Weld, 2012. “Fine-Grained Entity Recognition,” in AAAI. 2012.
[21] Dan Gillick, Nevena Lazic, Kuzman Ganchev, Jesse Kirchner, and David Huynh, 2014. “Context-Dependent Fine-Grained Entity Type Tagging.” arXiv preprint arXiv:1412.1820 (2014).
[24] David Nadeau, 2007. “Semi-supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision.” PhD Thesis, School of Information Technology and Engineering, University of Ottawa, 2007.
[25] M.K. Bergman, 2012. “Give Me a Sign: What Do Things Mean on the Semantic Web?,” AI3:::Adaptive Information blog, January 24, 2012.
[26] A good example of description and use of typologies is in the archaeology description on Wikipedia.
[27] M.K. Bergman, 2015. “‘Natural’ Classes in the Knowledge Web,” AI3:::Adaptive Information blog, July 13, 2015.
[28] Also see my Glossary for definitions of specific terminology used in this article.
[29] M.K. Bergman, 2015. “Conceptual and Practical Distinctions in the Attributes Ontology,” AI3:::Adaptive Information blog, March 3, 2015.
Posted: September 17, 2015

AI3 Pulse

In keeping with the expanding topics of knowledge-based artificial intelligence (KBAI), I have done a thorough update of my older Acronyms and Glossary page on this blog.

Besides correcting some errors and updating the listings, the major changes were to bring in an earlier post on a semantic technologies glossary and to greatly expand the glossary to include knowledge base and artificial intelligence topics.

Please let me know if you see any errors. I also welcome suggestions for new entries that should be added to the list.

Posted by AI3's author, Mike Bergman Posted on September 17, 2015 at 12:04 pm in Pulse, Site-related | Comments (0)
Posted: July 24, 2015

AI3 Pulse

According to the Digital Service Cloud, Cisco forecasts that the number of connected devices worldwide (that is, the Internet of Things, or IoT) will double from 25 billion in 2015 to 50 billion in 2020. IDC claims the global IoT market will grow from $1.9 trillion in 2013 to $7.1 trillion by 2020.

Though not the only factor, a group of academic researchers first made the connection between semantic technologies and IoT a couple of years back [1]. IoT devices are quite diverse; they measure different parameters, using different conventions and units of measure. Though competing proprietary protocols keep getting proposed, it is likely that open source standards will be one of the ways to get this data to interoperate.

According to the authors, “providing interoperability among the ‘Things’ on the IoT is one of the most fundamental requirements to support object addressing, tracking, and discovery as well as information representation, storage, and exchange.” They also said that “applying semantic technologies to IoT promotes interoperability among IoT resources, information models, data providers and consumers, and facilitates effective data access and integration, resource discovery, semantic reasoning, and knowledge extraction.”

Yet the authors caution that the use of semantic technologies alone will not be sufficient to guarantee interoperability. They note the importance for stakeholders to agree upon shared ontologies and descriptions of the ‘things’ (entities) and their access services in the IoT. Open standards matched with consensus ontologies and vocabularies are both required.

I could not agree more that the open standards of semantic technologies, which are already designed for interoperability, will be one of the keys to combining results across multiple devices. This belief is one of the reasons behind Structured Dynamics‘ recent work with the Attributes Ontology of UMBEL [2], which we see as another one of the key enablers for IoT.

[1] Payam Barnaghi, Wei Wang, Cory Henson, and Kerry Taylor, 2012. “Semantics for the Internet of Things: Early Progress and Back to the Future,” International Journal on Semantic Web and Information Systems (IJSWIS) 8, no. 1 (2012): 1-21.
[2] See, for example, M.K. Bergman, 2015. “Conceptual and Practical Distinctions in the Attributes Ontology,” AI3:::Adaptive Information blog, March 3, 2015.
Posted:July 13, 2015

Gleaning Clues from Aristotle to Charles S. Peirce

[Image: All Natural Label]

We have recently talked much of the use of knowledge bases in areas such as artificial intelligence and knowledge supervision. The idea is to leverage the knowledge codified in these knowledge bases, Wikipedia being the most prominent exemplar, to guide feature selection or the creation of positive and negative training sets to be used by machine learners.

The pivotal piece of information that enables knowledge bases to perform in this way is a coherent knowledge graph of concepts and entity types. As I have discussed many times, the native category structure of Wikipedia (and all other commonly used KBs) leaves much to be desired. It is one of the major reasons we are re-organizing KB content using the UMBEL reference knowledge graph [1]. The ultimate requirement for the governing knowledge graph (ontology) is that it be logical, consistent and coherent. It is from this logical structure that we can provide the rich semsets [2] for semantic matches, make inferences, understand relatedness, and make disjointedness assertions. In the context of knowledge-based artificial intelligence (KBAI) applications [3], the disjointedness assertions are especially important to aiding the creation of negative training sets based on knowledge supervision.

Coherent and logical graphs first require natural groupings or classes of concepts and entity types by which to characterize the domain at hand, situated with respect to one another with testable relations. Entity types are further characterized by a similar graph of descriptive attributes. Concepts and entity types thus represent the nodes in the graph, with relations being the connecting infrastructure.

Going back at least to Aristotle, how to properly define and bound categories and concepts has been a topic of much philosophical discussion. If the nodes in our knowledge graph are not scoped and defined in a consistent way, then it is virtually impossible to construct a logical and coherent way to reason over this structure. This inconsistency is the root reason why Wikipedia, for example, cannot presently act as a computable knowledge graph.

This article thus describes how Structured Dynamics informs its graph-construction efforts built around the notion of “natural classes.” Our use and notion of “natural classes” hews closely to how we understand the great American logician, Charles S. Peirce, came to define the concept [3]. Natural classes were a key underpinning to Peirce’s own efforts to provide a uniform classification system related to inquiry and the sciences.

Humanity’s Constant Effort to Define and Organize Our World

Aristotle set the foundational basis for understanding what we now call natural kinds and categories. The universal desire by all of us to be able to understand and describe our world has meant that philosophers have argued these splits and their bases ever since. In very broad terms we have realists, who believe things have independent order in the natural world and can be described as such; we have nominalists, who believe that humans provide the basis for how things are organized in part by how we name them; and we have idealists, or anti-realists, who believe “natural” classes are generalized ones that conform to human ideals of how the world is organized, but are not independently real [4]. These categories, too, shade into one another, such that these beliefs become strains in various degrees for how any one individual might be defined. The realist strain, also closely tied to the sciences and the scientific method, is what most guides the logic basis behind semantic technologies and SD’s view of how to organize the world.

Aristotle believed that the world could be characterized into categories, that categories were hierarchical in nature, and what defined a particular class or category was its essence, or the attributes that uniquely define what a given thing is. A mammal has the essences of being hairy and warm-blooded and of bearing live young. These essences distinguish mammals from other types of animals such as birds or reptiles or fishes or insects. Essential properties are different from accidental or artificial distinctions, such as whether a man has a beard or not or whether he is gray- or red-haired or of a certain age or country. A natural classification system is one that is based on these real differences and not artificial or single ones. Hierarchies arise from the shared generalizations of such essences amongst categories or classes. Under the Aristotelian approach, classification is the testing and logical clustering of such essences into more general or more specific categories of shared attributes. Because these essences are inherent to nature, natural clusterings are an expression of true relationships in the real world.

By the age of the Enlightenment, these long-held philosophies began to be questioned by some. Descartes famously grounded the perception of the world in innate ideas in the human mind. This philosophy built upon that of William of Ockham, who maintained the world is populated by individuals, and no such things as universals exist. In various guises, thinkers from Locke to Hume questioned a solely realistic organization of concepts in the world [5]. While there may be “natural kinds”, categorization is also an expression of the innate drive by humans to name and organize their world.

Relatedness of shared attributes can create ontological structures that enable inference and a host of graph analytics techniques for understanding meaning and connections. For such a structure to be coherent, the nodes (classes) of the structure should be as natural as possible, with uniformly applied relations defining the structure.

Thus, leaving behind metaphysical arguments, and relying solely on what is pragmatic, effectively built ontologies compel the use of a realistic viewpoint for how classes should be bounded and organized. Science and technology are producing knowledge in unprecedented amounts, and realism is the best approach for testing the trueness of new assertions. We think realism is the most efficacious approach to ontology design. One of the reasons that semantics are so important is that language used to capture the diversity of the real world must be able to be meaningfully related. Being explicit about the philosophy in how we construct ontologies helps decide sometimes sticky design questions.

Unnatural Classifications Instruct What is Natural

These points are not academic. The central failing, for example, of Wikipedia has been its category structure [7]. Categories have strayed from a natural classification scheme, and many categories are “artificial” in that they are compound or distinguished by a single attribute. “Compound” (or artificial) categories (such as Films directed by Pedro Almodóvar or Ambassadors of the United States to Mexico) are not “natural” categories, and including them in a logical evaluation only acts to confuse attributes with classification. To be sure, such existing categories should be decomposed into their attribute and concept components, but should not be included in constructing a schema of the domain.

“Artificial” categories may be identified in the Wikipedia category structure by both syntactical and heuristic signals. One syntactical rule is to look for the head of a title; one heuristic signal is to select out any category with prepositions. Across all rules, “compound” categories actually account for most of what is removed in order to produce “cleaned” categories.
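As a rough illustration of the preposition heuristic only, here is a minimal sketch in Python. The preposition list, the function names and the two single-word sample categories are assumptions made for this example; the actual rule set used to produce the “cleaned” categories is not reproduced here.

```python
import re

# Assumption: a small, illustrative preposition list, not the article's rule set.
PREPOSITIONS = frozenset({"of", "in", "by", "to", "from", "for", "at", "on", "with"})

def looks_compound(title: str) -> bool:
    """Return True if any word in the category title is a listed preposition."""
    words = re.findall(r"\w+", title.lower())
    return any(word in PREPOSITIONS for word in words)

def split_categories(titles):
    """Partition titles into (clean, compound) lists using the heuristic above."""
    clean, compound = [], []
    for title in titles:
        (compound if looks_compound(title) else clean).append(title)
    return clean, compound

# The first two titles come from the text; the last two are hypothetical.
categories = [
    "Films directed by Pedro Almodóvar",
    "Ambassadors of the United States to Mexico",
    "Mammals",
    "Chairs",
]
clean, compound = split_categories(categories)
print("clean:", clean)
print("compound:", compound)
```

A single signal like this will misfire on some titles (a proper name containing “of”, say), which is one reason to combine several syntactical and heuristic rules rather than rely on any one of them.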

We can combine these thoughts to show what a “cleaned” version of the Wikipedia category structure might look like. The 12/15/10 column in the table below reflects the approach used for determining the candidates for SuperTypes in the UMBEL ontology, last analyzed in 2010 [8]. The second column is from a current effort mapping Wikipedia to Cyc [9]:

12/15/10 3/1/15
Total Categories 100% 100%
Administrative Categories 14% 15%
Orphaned Categories 10% 20%
Working Categories 76% 66%
“Artificial” Categories 44% 34%
Single Head 23%
Plural Head 24%
“Clean” Categories 33% 46%

Two implications can be drawn from this table. First, without cleaning, there is considerable “noise” in the Wikipedia category structure, equivalent to about half to two-thirds of all categories. Without cleaning these categories, any analysis or classification that ensues is fighting unnecessary noise and has likely introduced substantial assignment errors.

Second, the power that comes from a coherent schema of categories and concepts, especially inference and graph analysis, cannot be applied to a structure that is not constructed along realistic lines. We can expand on this observation by bringing in our best logician on information, semeiosis and categories, Charles S. Peirce.

Peirce’s Refined Arguments of a Natural Class

Peirce was the first, by my reading, to look at the question of “natural classes” in sufficient depth to provide design guidance; these classes are sometimes contraposed against what are called “artificial classes” (we tend to use the term “compound” classes instead). A “natural class” is a set with members that share the same set of attributes, though with different values (such as differences in age or hair color for humans, for example). Some of those attributes are also more essential to define the “type” of that class (such as humans being warm-blooded with live births and hair and use of symbolic languages). Artificial classes tend to only add one or a few shared attributes, and do not reflect the essence of the type [6].

The most comprehensive treatment of Peirce’s thinking on natural classes was provided by Menno Hulswit in 1997. He first explains the genesis of Peirce’s thinking [6]:

“The idea that things belong to natural kinds seems to involve a commitment to essentialism: what makes a thing a member of a particular natural kind is that it possesses a certain essential property (or a cluster of essential properties), a property both necessary and sufficient for a thing to belong to that kind.”

“According to Mill, every thing in the world belongs to some natural class or real kind. Mill made a distinction between natural classes and non-natural or artificial classes (Mill did not use the latter term). The main difference is that the things that compose a natural class have innumerous properties in common, whereas the things that belong to an artificial class resemble one another in but a few respects.”

“Accordingly, a natural or real class is defined as a class ‘of which all the members owe their existence to a common final cause’ (CP 1.204), or as a class the ‘existence of whose members is due to a common and peculiar final cause’ (CP1.211). The final cause is described in this context as ‘a common cause by virtue of which those things that have the essential characters of the class are enabled to exist’ (CP 1.204).”

“Peirce concluded from these observations that the objects that belong to the same natural class, need not have all the characters that seem to belong to the class. After thus having criticized Mill, Peirce gave the following definition of natural class (or real kind):

“Any class which, in addition to its defining character, has another that is of permanent interest and is common and peculiar to its members, is destined to be conserved in that ultimate conception of the universe at which we aim, and is accordingly to be called ‘real.’ (CP6.384; 1901)”

“. . . natural classification of artificial objects is a classification according to the purpose for which they were made.”

“The problem of natural kinds is important because it is inextricably linked to several philosophical notions, such as induction, universals, scientific realism, explanation, causation, and natural law.”

This background sets up Hulswit’s interpretation of how Peirce’s views on natural classification then evolved [6]:

“Peirce’s approach was broadly Aristotelian inasmuch as natural classification always concerns the form of things (which is that by virtue of which things are what they are) and not their matter. This entails that Peirce borrowed Aristotle’s idea that the form was identical to the intrinsic final cause. Therefore it was obvious that natural classification concerns the final causes of the things. From the natural sciences, Peirce had learned that the forms of chemical substances and biological species are the expression of a particular internal structure. He recognized that it was precisely this internal structure that was the final cause by virtue of which the members of the natural class exist.”

“Accordingly, Peirce’s view may be summarized as follows: Things belong to the same natural class on account of a metaphysical essence and a number of class characters. The metaphysical essence is a general principle by virtue of which the members of the class have a tendency to behave in a specific way; this is what Peirce meant by final cause. This finality may be expressed in some sort of microstructure. The class characters which by themselves are neither necessary nor sufficient conditions for membership of a class, are nevertheless concomitant. In the case of a chair, the metaphysical essence is the purpose for which chairs are made, while its having chair-legs is a class character. The fuzziness of boundary lines between natural classes is due to the fuzziness of the class characters. Natural classes, though very real, are not existing entities; their reality is of the nature of possibility, not of actuality. The primary instances of natural classes are the objects of scientific taxonomy, such as elementary particles in physics, gold in chemistry, and species in biology, but also artificial objects and social classes.”

“By denying that final causes are static, unchangeable entities, Peirce avoided the problems attached to classical essentialism. On the other hand, by eliminating arbitrariness, Peirce also avoided pluralistic anarchism. Though Peircean natural classes only come into being as a result of the abstractive and selective activities of the people who classify, they reflect objectively real general principles. Thus, there is not the slightest sense in which they are arbitrary: “there are artificial classifications in profusion, but [there is] only one natural classification” (CP 1.275; 1902).”

Importantly, note that “natural kinds” or “natural classes” are not limited to things only found in nature. Peirce’s semiotics (theory of signs) also recognizes “natural” distinctions in arenas such as social classes, the sciences, and man-made products [6]. Again, the key discriminators are the essences of things that distinguish them from other things, and the degree of sharing of attributes provides the basis for understanding relationships and hierarchies.

Natural Classes Can be Tested, Reasoned Over and Are Mutable

Though all of this sounds somewhat abstract and philosophical, these distinctions are not merely metaphysical. The ability to organize our representations of the world into natural classes also carries with it the ability to organize that world, reason over it, draw inferences from it, and truth test it. Indeed, as we may discover through knowledge acquisition or the scientific method, this world representation is itself mutable. Our understanding of species relationships, for example, has changed markedly, especially most recently, as the basis for our classifications shifts from morphology to DNA. Einstein’s challenges to Newtonian physics similarly changed the “natural” way by which we need to organize our understanding of the world.

When we conjoin ideas such as Shannon’s theory of information [10] with Peirce’s sophisticated and nuanced theory of signs [11], other insights begin to emerge about how the natural classification of things (“information”) can produce leveraged benefits. In linking these concepts together, de Tienne has provided some explanations for how Peirce’s view of information relates to information theory and efficient information messaging and processing [12]:

“For a propositional term to be a predicate, it must have ‘informed breadth’, that is, it must be predicable of real things, ‘with logical truth on the whole in a supposed state of information.’ . . . . For a propositional term to be a subject, it must have ‘informed depth’, that is, it must have real characters that can be predicated of it also ‘with logical truth on the whole in a supposed state of information’.”

“Peirce indeed shows that induction, by enlarging the breadth of predicate terms, actually increases the depth of subject terms—by boldly generalizing the attribution of a character from selected objects to their collection—while hypothesis, by enlarging the depth of subject terms, actually increases the breadth of predicate terms—by boldly enlarging their attribution to new individuals. Both types of amplicative inferences thus generate information.”

“. . . information is not a mere sum of quantities, but a product, and this distinction harbors a profound insight. When Peirce began defining, in 1865, information as the multiplication of two logical quantities, breadth and depth (or connotation and denotation, or comprehension and extension), it was in recognition of the fact that information was itself a higher-order logical quantity not reducible to either multiplier or multiplicand. Unlike addition, multiplication changes dimensionality—at least when it is not reduced, as is often the case in schoolbooks, to a mere additive repetition. Information belongs to a different logical dimension, and this entails that, experientially, it manifests itself on a higher plane as well. Attributing a predicate to a subject within a judgment of experience is to acknowledge that the two multiplied ingredients, one the fruit of denotation, the other of connotation, in their very multiplication or copulative conjunction, engender a new kind of logical entity, one that is not merely a fruit or effect of their union, but one whose anticipation actually caused the union.”

The essence of knowledge is that it is ever-growing and expandable. New insights bring new relations and new truths. The structures we use to represent this knowledge must themselves adapt and reflect the best of our current, testable understandings. Keeping in mind the need for all of our classes to be “natural” — that is, consistent with testable, knowable truth — is a key building block in how we should organize our knowledge graphs. Similar inspection can be applied to the relations used in the knowledge graph [13], but I will leave that discussion to another day.

Though hardly simple, the re-classification of Wikipedia’s content into a structure based on “natural classes” will bring heretofore unseen capabilities in coherence and computability to the knowledge base. Similar benefits can be obtained from any knowledge base that is presently characterized by an unnatural structure.

We now have both tests and guidelines (granted, still being discerned from Peirce’s writings and their logic) for what constitutes a “natural class”. “Natural classes” are testable; we not only know one when we see it, we can systematize their use. Classifying a class as a “natural” one does entail aspects of judgment and world view. But, so long as the logics and perspectives behind these decisions are consistent, I believe we can create computable knowledge graphs that cohere following these tests and guidelines.

Some may question whether any given structure is more “natural” than another one. But, through such guideposts as coherence, inference, testability and truthfulness, these structural arrangements are testable propositions. As Peirce, I think, would admonish us, failure to meet these tests is grounds for re-jiggering our structures and classes. In the end, coherence and computability become the hurdles that our knowledge graphs must clear in order to be reliable structures.

[1] For the latest release of UMBEL and its knowledge graph and associated links, see M.K. Bergman, 2015. “UMBEL version 1.20 Released,” in AI3:::Adaptive Information blog, April 21, 2015.
[2] A semset is the use of a series of alternate labels and terms to describe a concept or entity. These alternatives include true synonyms, but may also be more expansive and include jargon, slang, acronyms or alternative terms that usage suggests refers to the same concept. See further
[3] I first discussed Charles S. Peirce at length in M.K. Bergman, 2012. “Give Me a Sign: What Do Things Mean on the Semantic Web,” in AI3:::Adaptive Information blog, January 24, 2012.
[4] See, for example, John Michael Steiner, 2011. “An Anti-Realist Theory of Natural Kinds.” PhD dissertation, University of Calgary, September 2011, 245 pp.
[5] Michael R. Ayers, 1981. “Locke versus Aristotle on Natural Kinds.” The Journal of Philosophy (1981): 247-272.
[6] Menno Hulswit, 1997. “Peirce’s Teleological Approach to Natural Classes,” in Transactions of the Charles S. Peirce Society (1997): 722-772.
[7] See M.K. Bergman, 2015. “Shaping Wikipedia into a Computable Knowledge Base,” in AI3:::Adaptive Information blog, March 31, 2015.
[9] Aleksander Smywinski-Pohl, Krzysztof Wróbel, Michael K. Bergman and Bartosz Ziółko, 2015. “An Interoperable Framework for Web Knowledge Bases,” manuscript in preparation.
[10] Claude E. Shannon, 1948. “A Mathematical Theory of Communication”, Bell System Technical Journal, 27: 379–423, 623-656, July, October, 1948. See
[11] Charles Sanders Peirce, 1894. “What is in a Sign?”, see
[12] André de Tienne, 2006. “Peirce’s Logic of Information.” Seminario del Grupo de Estudios Peirceanos, Universidad de Navarra 28 (2006).
[13] See, as one example, this discussion for the need for consistent and foundational relationship types, Giancarlo Guizzardi and Gerd Wagner, 2008. “What’s in a Relationship: An Ontological Analysis” In Conceptual Modeling-ER 2008, pp. 83-97. Springer Berlin Heidelberg, 2008.
Posted:June 23, 2015

Providing the Method behind Knowledge-based Artificial Intelligence

[Image: Openness; courtesy of Magelia WebStore]

One of the central engines behind artificial intelligence is machine learning. ML involves various ways that data is used to train or teach machines to classify, predict or perform complicated tasks, such as I captured in an earlier diagram. The methods used in machine learning may be statistical, based on rules, or recognizing or discovering patterns.

The name machine learning raises the question: to learn what? In the context of images, audio, video or sensory perception, machine learning is trained for the recognition of patterns, which can be layered into learning manifolds called deep learning. In my realm (that is, knowledge bases and semantics), machine learning can be applied to topic or entity clustering or classification; entity, attribute or relation identification and extraction; disambiguation; mapping and linking multiple sources; language translation; duplicates removal; reasoning; semantic relatedness; phrase identification; recommendation systems; and question answering. Significant results can be obtained in these areas without the need for deep learning, though deep learning can be, and is being, usefully applied in areas like machine translation or artificial writing.

Machine learning can be either supervised or unsupervised. In supervised learning, positive and (often) negative training examples are presented to the learning algorithm in order to create a model to produce the desired results for the given context. No training examples are presented in unsupervised learning; rather, the model is derived from patterns discovered in the absence of training examples, sometimes described as finding hidden patterns in unlabeled data. Supervised methods are generally more accurate than unsupervised methods, and nearly universally so in the realm of content information and knowledge.

There is effort and expense associated with creating positive or negative training examples (sets). This effort can span from the maximum of ones completely constructed manually to ones that are semi-automatic (semi-supervised) or to ones informed by knowledge bases (weakly supervised or distant supervised [1], [2]). Creation of manual training sets may consume as much as 80% of overall efforts in some cases, and is always a time-consuming task whenever employed. The accuracy of the eventual models is only as good as the trueness of the input training sets, with traditionally the best results coming from manually determined training sets; the best of those are known as “gold standards.” The field of machine learning is thus broad and multiple methods span these spectra of effort and accuracy.

The Spectrum of Machine Learning

To date, the state-of-the-art in machine learning for natural language processing and semantics, my realm, has been in distant supervision using knowledge bases like Freebase or Wikipedia to extract training sets for supervised learning [1]. Relatively clean positive and negative training sets may be created with much reduced effort over manually created ones. This is the current “sweet spot” in the accuracy v. effort trade-off for machine learning in my realm.

However, as employed to date, distant supervision has mostly been a case-by-case, problem-by-problem approach, and most often applied to entity or relation extraction. Yes, knowledge bases may be inspected and manipulated to create the positive and negative training examples needed, but this effort has heretofore not been systematic in approach nor purposefully applied across a range of ML applications. How to structure and use knowledge bases across a range of machine learning applications with maximum accuracy and minimum effort, what we call knowledge supervision, is the focus of this article. The methods of knowledge supervision are how we make operational the objectives of knowledge-based artificial intelligence. This article is thus one of the foundations to my recent series on KBAI [3].

Features and Training Sets

Features and training sets, in relation to the specific machine learning approaches that are applied, are the major determinants to how successful the learning actually is. We already touched upon the trade-offs in effort and accuracy associated with training sets, and will provide further detail on this question below. But features also pose trade-offs and require similar skill in selection and use.

In machine learning, a feature is a measurable property of the system being analyzed. A feature is equivalent to what is known as an explanatory variable in statistics. A feature, stated another way, is a measurable aspect of the system that provides predictive value for the state of the system.

Features with high explanatory power independent of other features are favored, because each added feature adds a computational cost of some manner. Many features are correlated with one another; in these cases it is helpful to find the strongest signals and exclude the other correlates. Too many features also make tuning and refinement more difficult, what has sometimes been called the curse of dimensionality. Overfitting is also often a problem, which limits the ability of the model to generalize to other data.
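One simple way to picture “find the strongest signals and exclude the other correlates” is a greedy filter over pairwise correlations. The sketch below is only that, a sketch: the feature names, the toy values and the 0.9 threshold are all invented for illustration, and real feature selection would normally use a proper ML toolkit and held-out data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal; treat it as uncorrelated
    return cov / sqrt(var_x * var_y)

def prune_correlated(features, target, threshold=0.9):
    """Rank features by |correlation with target|, then keep each one only if it
    is not highly correlated with a feature that has already been kept."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    kept = []
    for name in ranked:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy data: two near-duplicate length features and one independent one.
features = {
    "length_cm": [10.0, 20.0, 30.0, 40.0, 50.0],
    "length_in": [3.9, 7.9, 11.8, 15.7, 19.7],
    "weight_kg": [5.0, 3.0, 6.0, 2.0, 7.0],
}
target = [1.0, 2.1, 2.9, 4.2, 5.0]
kept = prune_correlated(features, target)
print(kept)
```

Only one of the two length features survives, alongside the weakly related weight feature, which is the behavior the paragraph above describes.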

Yet with too few features, there is inadequate explanatory power to achieve the analysis objectives.

Though it is hard to find discussion of best practices in feature extraction, striking this balance is an art [4]. Multiple learners might also be needed in order to capture the smallest, independent (non-correlated) feature set with the highest explanatory power [5].

When knowledge bases are used in distant supervision, only a portion of their structure or content is used as features. Still other distant supervision efforts may be geared to other needs and use a different set of features. Indeed, broadly considered, knowledge bases (potentially) have a rich diversity of possible features arising from:

  • text, and its content, syntax, semantics and morphology
  • use vectors of co-occurring terms or concepts
  • categories
  • conventions
  • synonyms
  • linkages
  • mappings
  • relations
  • attributes
  • content placement within its knowledge graph, and
  • disjointednesses.

Understanding the feature potential of knowledge bases is the first step toward more purposeful knowledge supervision. At Structured Dynamics we stage the structured information as RDF triples and OWL ontologies, which we can select and manipulate via APIs and SPARQL. We also stage the graph structure and text in Lucene, which gives us powerful faceted search and other advanced NLP manipulations and analyses. These same features may also be utilized to extend the features set available from the knowledge base through such actions as new entity, attribute, or relation extractions; fine-grained entity typing [6]; creation of word vectors or tensors; results of graph analytics; forward or backward chaining; efficient processing structures; etc.
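To make the idea of selecting features from staged triples a little more concrete, here is a toy sketch in plain Python. It is not SPARQL, and it is not a description of how Structured Dynamics' platform works; the in-memory list, the `match`, `superclasses` and `instances_of` helpers, and every `ex:` identifier are assumptions for illustration. It shows a triple pattern with wildcards plus a small subclass inference of the kind a coherent graph makes possible.

```python
# Toy in-memory triple list; all "ex:" names are hypothetical.
TRIPLES = [
    ("ex:Lassie", "rdf:type", "ex:Dog"),
    ("ex:Dog", "rdfs:subClassOf", "ex:Mammal"),
    ("ex:Mammal", "rdfs:subClassOf", "ex:Animal"),
    ("ex:Lassie", "rdfs:label", "Lassie"),
    ("ex:Tweety", "rdf:type", "ex:Bird"),
    ("ex:Bird", "rdfs:subClassOf", "ex:Animal"),
]

def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

def superclasses(triples, cls):
    """Collect every ancestor of cls by following rdfs:subClassOf links."""
    found, frontier = set(), [cls]
    while frontier:
        current = frontier.pop()
        for _, _, parent in match(triples, s=current, p="rdfs:subClassOf"):
            if parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found

def instances_of(triples, cls):
    """Subjects typed as cls either directly or through any subclass."""
    result = set()
    for subject, _, direct_type in match(triples, p="rdf:type"):
        if direct_type == cls or cls in superclasses(triples, direct_type):
            result.add(subject)
    return result

print(sorted(instances_of(TRIPLES, "ex:Mammal")))
print(sorted(instances_of(TRIPLES, "ex:Animal")))
```

In a real deployment the same selection would be expressed as a query against a triple store; the point here is only that type, label and placement in the graph are all addressable as features.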

Because all features are selectable via either structured SPARQL query or faceted search, it is also possible to more automatically extract positive and negative training sets. Attention to proper coverage and testing of disjointedness assertions is another purposeful step useful to knowledge supervision, since it aids identification of negative examples for the training.
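A minimal sketch of how disjointedness assertions can yield negative examples follows. The entities, types and disjointness pairs are invented for illustration, and the function names are hypothetical; the only idea taken from the text is that members of a type asserted to be disjoint with the target type are safe negatives, while entities whose type has no such assertion are left out rather than guessed at.

```python
# Hypothetical toy data: each entity's asserted type, plus disjointness assertions.
ENTITY_TYPES = {
    "Lassie": "Dog",
    "Rin Tin Tin": "Dog",
    "Tweety": "Bird",
    "Eiffel Tower": "Building",
    "Fido": "Pet",  # no disjointness asserted between Pet and Dog
}
DISJOINT = {
    frozenset({"Dog", "Bird"}),
    frozenset({"Dog", "Building"}),
    frozenset({"Bird", "Building"}),
}

def are_disjoint(type_a, type_b):
    """True only when the two types are explicitly asserted to be disjoint."""
    return frozenset({type_a, type_b}) in DISJOINT

def training_sets(target_type):
    """Positives are entities of the target type; negatives are entities whose
    type is asserted disjoint with it. Everything else is left unlabeled."""
    positives = sorted(e for e, t in ENTITY_TYPES.items() if t == target_type)
    negatives = sorted(e for e, t in ENTITY_TYPES.items()
                       if are_disjoint(t, target_type))
    return positives, negatives

positives, negatives = training_sets("Dog")
print("positives:", positives)
print("negatives:", negatives)
```

Leaving “Fido” unlabeled is the design choice worth noting: without a disjointedness assertion there is no basis for calling it a negative, which is why coverage of those assertions matters so much.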

Whatever the combination of ML method, feature set, or positive or negative training sets, the ultimate precision and accuracy of knowledge supervision requires the utmost degree of true results in both positive and negative training sets. Training to inaccurate information merely perpetuates inaccurate information. As anyone who has worked extensively with source knowledge bases such as Freebase, DBpedia or Wikipedia may attest, assignment errors and incomplete typing and characterizations are all too common. Further, none provide disjointedness assertions.

Thus, the system should be self-learning, with results so characterized as to be fed automatically to further testing and refinement. As better quality and more features are added to the system, we produce what we have described before [3] as the virtuous circle of KBAI.

Features and training sets may thus be based on the syntax, morphology, semantics (meaning of the data) or relationships (connections) of the source data in the knowledge base. Continuous testing and the application of the system’s own machine learners create a virtuous feedback loop in which the accuracy of the overall system is constantly and incrementally improved.
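One common way to realize this kind of feedback is self-training, in which only confidently classified items are promoted into the training sets on each pass. The sketch below swaps in a deliberately simple one-dimensional nearest-centroid learner purely for illustration; the function `self_train`, the `margin` threshold and the numeric data are all assumptions and not the learners used in practice.

```python
def centroid(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

def self_train(pos, neg, unlabeled, margin=1.0, max_rounds=10):
    """Grow the labeled sets by promoting only confidently classified items each round."""
    pos, neg, pool = list(pos), list(neg), list(unlabeled)
    for _ in range(max_rounds):
        # Centroids are fixed at the start of each round
        c_pos, c_neg = centroid(pos), centroid(neg)
        remaining, promoted = [], 0
        for x in pool:
            d_pos, d_neg = abs(x - c_pos), abs(x - c_neg)
            if d_neg - d_pos >= margin:    # clearly closer to the positives
                pos.append(x)
                promoted += 1
            elif d_pos - d_neg >= margin:  # clearly closer to the negatives
                neg.append(x)
                promoted += 1
            else:                          # ambiguous: leave for later review
                remaining.append(x)
        pool = remaining
        if promoted == 0:  # nothing new learned, so stop
            break
    return pos, neg, pool

# Hypothetical one-dimensional feature values, invented for illustration
pos, neg, undecided = self_train([9.0, 10.0], [1.0, 2.0], [8.0, 3.0, 5.5, 6.5])
```

Items left in `undecided` are the natural candidates for manual inspection, which is where the continuous testing described above would focus.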

Manifest Applications for Knowledge Supervision

The artificial intelligence applications to which knowledge supervision may be applied are manifest. Here is a brief listing of some of those areas as evidenced by distant supervision applied to machine learning in academic research, or others not yet exploited:

  • entity identification (recognition) and extraction
  • attribute identification and extraction (“slot filling”)
  • relation identification and extraction
  • event identification and extraction
  • entity classifiers
  • phrase (n-gram) identification
  • entity linkers
  • mappers
  • topic clusterers
  • topic classifiers
  • disambiguators
  • duplicates removal
  • semantic relatedness
  • inference and reasoning
  • sub-graph extraction
  • ontology matchers
  • ontology mappers
  • sentiment analysis
  • question answering
  • recommendation systems
  • language translation
  • multi-language versions
  • artificial writing, and
  • ongoing knowledge base improvements and extensions.

These areas are listed in rough order from the simpler to the more complex analyses. Most distant supervision efforts to date have concentrated on information extraction, the first items shown on the list. But all of these are amenable to knowledge supervision with ML. Since 2009, many of the insights regarding these potentials have arisen from the Knowledge Base Population initiative of the Text Analysis Conference [7].

Mapping and linkage are essential areas on this list since they add greatly to the available feature set and provide the bases for greater information interoperability. This is the current emphasis of Structured Dynamics.

Knowledge Supervision is Purposeful and Systematic

Knowledge supervision is the purposeful structuring and use of knowledge bases to provide features and training sets for multiple kinds of machine learners, which in combination can be applied to multiple artificial intelligence outcomes. Knowledge supervision is the method by which knowledge-based artificial intelligence, or KBAI, is achieved.

None of this is free, of course. Much purposeful work is necessary to configure and stage the data structures and systems that support the broad application of knowledge supervision. And other questions and challenges related to KBAI also remain. Yet, as Pedro Domingos has stated [4]:

“And the organizations that make the most of machine learning are those that have in place an infrastructure that makes experimenting with many different learners, data sources and learning problems easy and efficient, and where there is a close collaboration between machine learning experts and application domain ones.”

Having the mindset and applying the methods of knowledge supervision produces an efficient, repeatable, improvable infrastructure for active learning about the enterprise’s information assets.

As noted, we are just at the beginnings of knowledge supervision, and best practices and guidelines are still in the formative stages. We also have open questions and challenges in how features can be effectively selected; how KB-trained classifiers can be adapted to the wild; how we can best select and combine existing machine learners to provide an ML infrastructure; where and how deep learning should most effectively be applied; and how other emerging insights in computational linguistics can be combined with knowledge supervision [8].

But we can already see that a purposeful mindset coupled with appropriate metadata and structured RDF data is a necessary grounding to the system. We can see broad patterns across the areas of information extraction involving concepts, entities, relations, attributes and events that can share infrastructure and methods. We realize that linkage and mapping are key enabling portions of the system. Continuous improvement and the codification of self-learning are the means by which our systems will get more accurate.

So, to the what of knowledge-based artificial intelligence, we can now add some broad understandings of the how, based on knowledge supervision. None of these ideas is unique or new in itself. But the central role of knowledge bases in KBAI and knowledge supervision represents an important advance of artificial intelligence to deal with real-world challenges in content and information.

[1] Distant supervision was earlier or alternatively called self-supervision, indirect supervision or weak supervision. It is a method to use knowledge bases to label entities automatically in text, which is then used to extract features and train a machine learning classifier. The knowledge bases provide coherent positive training examples and avoid the high cost and effort of manual labelling. The method is generally more effective than unsupervised learning, though with similar reduced upfront effort. Large knowledge bases such as Wikipedia or Freebase are often used as the KB basis.
The first acknowledged use of distant supervision was Craven and Kumlien in 1999 (Mark Craven and Johan Kumlien, 1999. “Constructing Biological Knowledge Bases by Extracting Information from Text Sources,” in ISMB, vol. 1999, pp. 77-86; source of the weak supervision term); the first use of the formal term distant supervision was in Mintz et al. in 2009 (Mike Mintz, Steven Bills, Rion Snow, Dan Jurafsky, 2009. “Distant Supervision for Relation Extraction without Labeled Data,” in Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 1003–1011, Suntec, Singapore, 2-7 August 2009). Since then, the field has been a very active area of research; see next reference.
[2] See M. K. Bergman, 2015. “Forty Seminal Distant Supervision Articles,” from AI3:::Adaptive Information blog, November 17, 2014, as supplemented by [3].
[3] See M. K. Bergman, 2014. “Knowledge-based Artificial Intelligence,” from AI3:::Adaptive Information blog, November 17, 2014.
[4] Pedro Domingos, 2012. “A Few Useful Things to Know About Machine Learning.” Communications of the ACM 55, no. 10 (2012): 78-87.
[5] There is a rich literature providing guidance on feature selection and feature extraction. Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the available features. It is also possible to apply methods, the best known and simplest being principal component analysis, among many, to reduce feature size (dimensionality) with acceptable loss in accuracy.
[6] As a good introduction and overview, see Xiao Ling and Daniel S. Weld, 2012. “Fine-Grained Entity Recognition,” in Proceedings of the 26th AAAI Conference on Artificial Intelligence, 2012. You can also search on the topic in Google Scholar.
[7] TAC is organized by the National Institute of Standards and Technology (NIST). Initiated in 2008, TAC grew out of NIST’s Document Understanding Conference (DUC) for text summarization, and the Question Answering Track of the Text Retrieval Conference (TREC). TAC is overseen by representatives from government, industry, and academia. The Knowledge Base Population tracks of TAC were started in 2009 and continue today.
[8] See, for example, Percy Liang and Christopher Potts, 2015. “Bringing Machine Learning and Compositional Semantics Together.” Annu. Rev. Linguist. 1, no. 1 (2015): 355-376.