Posted: March 14, 2016

A New Era in Artificial Intelligence Will Open Pandora’s Box

Here’s a prediction: the new emphasis on artificial intelligence and robotics will occasion some new looks at knowledge representation. Prior to the past few years, many knowledge representation (KR) projects have been more in the way of prototypes or games. But, now that we are seeing real robotics and knowledge-based AI activities take off, some of the prior warts and problems of leading KR approaches are starting to become evident.

For example, for years major upper-level ontologies have tended to emphasize dichotomous splits in how to “model” the world, including:

  • abstract-physical — a split between what is fictional or conceptual and what is tangibly real
  • occurrent-continuant — a split between a “snapshot” view of the world and its entities versus a “spanning” view that is explicit about changes in things over time
  • perdurant-endurant — a split for how to regard the identity of individuals, either as a sequence of individuals distinguished by temporal parts (for example, childhood or adulthood) or as the individual enduring over time
  • dependent-independent — a split between accidents (which depend on some other entity) and substances (which are independent)
  • particulars-universals — a split between individuals in space and time that cannot be attributed to other entities versus abstract universals such as properties that may be assigned to anything
  • determinate-indeterminate.

Since the mid-1980s, description logics have also tended to govern most KR languages, and are the basis of the semantic Web data model and languages of RDF and OWL. (However, common logic and its dialects are also used as a more complete representation of first-order logic.) The trade-off in KR language design is one of expressiveness versus complexity.
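
To make the trade-off concrete, here is a minimal sketch in Python using the rdflib library, with a hypothetical example vocabulary: a bare RDF triple states a simple fact cheaply, while an OWL restriction makes a stronger, more expressive claim that a reasoner can exploit at greater computational cost.

```python
from rdflib import Graph, Namespace, BNode
from rdflib.namespace import RDF, RDFS, OWL

EX = Namespace("http://example.org/kr/")   # hypothetical vocabulary
g = Graph()
g.bind("ex", EX)

# Plain RDF: a simple, cheap-to-process factual assertion
g.add((EX.Kermit, RDF.type, EX.Frog))

# OWL: a more expressive axiom -- every Frog has some aquatic habitat.
# Richer semantics, but reasoning over such axioms costs more.
restriction = BNode()
g.add((restriction, RDF.type, OWL.Restriction))
g.add((restriction, OWL.onProperty, EX.hasHabitat))
g.add((restriction, OWL.someValuesFrom, EX.AquaticHabitat))
g.add((EX.Frog, RDFS.subClassOf, restriction))

print(g.serialize(format="turtle"))
```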

Cyc was developed as one means to address a gap in standard KR approaches: how to capture and model common sense. Conceptual graphs, formally a part of common logic, were developed to handle n-ary relationships and the questions of sign processes (semiosis), fallibility and processes of pragmatic learning.

Zhou offers a new take on an old strategy to KR, which is to use set theory as the underlying formalism [1]. This first paper deals with the representation itself; a later paper is planned on reasoning.

We do not live in a dichotomous world. And, I personally find Charles Peirce’s semeiosis to be a more compelling grounding for what a KR design should look like. But as Zhou points out, and is evident in current AI advances, robotics and the need for efficient, effective reasoning are testing today’s standards in knowledge representation as never before. I suspect we are in for a period of ferment and innovation as we work to get our KR languages up to task.


[1] Yi Zhou, 2016. “A Set Theoretic Approach for Knowledge Representation: the Representation Part,” arXiv:1603.03511, 14 Mar 2016.
Posted: September 21, 2015

Practical and Reusable Designs to Make Knowledge Bases Computable

Wikipedia is a common denominator in question answering and commercial natural language applications that leverage artificial intelligence. Witness Siri, Watson, Cortana and Google Now, among others. DBpedia is a structured data representation of Wikipedia that makes much of this content machine readable. Wikidata is a multilingual repository of 18 million structured entities now feeding the Wikipedia ecosystem. The availability of these sources is remaking and accelerating the role of knowledge bases in powering the next generation of artificial intelligence applications. But much, much more is possible.

All of these noted knowledge bases lack a comprehensive and coherent knowledge structure. They are not computable, nor can they be reasoned over or used for inference. While they are valuable resources for structured data and content, the vast potential in these storehouses remains locked up. Yet the relevance of these sources to drive an artificial intelligence platform geared to data and content is profound.

And what makes this potential profound? Well, properly structured, knowledge bases can provide the features, and generate the positive and negative training sets, useful to machine learning. Coherent organization of the knowledge graph within the KB’s domain enables various forms of reasoning and inference, further useful to making fine-grained recognizers, extractors and classifiers applicable to external knowledge. As I have pointed out before with regard to knowledge-based artificial intelligence (or KBAI) [1], these capabilities can work to extract still more accurate structure and knowledge from the knowledge base, creating a virtuous circle of still further improvements.

In all fairness, the Wikipedia ecosystem was not designed to be a computable one. But the free and open access to content in the Wikipedia ecosystem has sparked an explosion of academic and commercial interest in using this knowledge, often in DBpedia machine-readable form. Yet, despite this interest and more than 500 research papers in areas leveraging Wikipedia for AI and NLP purposes [2], the efforts remain piecemeal and unconnected. Yes, there is valuable structure and content within the knowledge bases; yes, they are being exploited both for high-value bespoke applications and for small research projects; but, across the board, these sources are not being used or leveraged in anything approaching a systematic nature. Each distinct project requires anew its own preparations and staging.

And it is not only Wikipedia that is neglected as a general resource for AI and semantic technology applications. One is hard-pressed to identify any large-scale knowledge base, available in electronic form, that is being sufficiently and systematically exploited for AI or semantic technology purposes [3]. This gap is really rather perplexing. Why the huge disconnect between potential and reality? Could this gap also somehow be related to why the semantic technology community continues to bemoan the lack of “killer apps” in the space? Is there something possibly even more fundamental going on here?

I think there is.

We have spent eight years so far on the development and refinement of UMBEL [4]. It was intended initially to be a bridge between unruly Web content and reasoning capabilities in Cyc to enable information interoperability on the Web [5], an objective it still retains. Naturally, Wikipedia was the first target for mapping to UMBEL [6]. Through our stumbling and bumbling and just serendipity, we have learned much about the specifics of Wikipedia [6], aspects of knowledge bases in general, and the interface of these resources to semantic technologies and artificial intelligence. The potential marriage between Cyc, UMBEL and Wikipedia has emerged as a first-class objective in its own right.

What we have learned is that it is not any single thing, but multiple things, that are preventing knowledge bases from living up to their potential as resources for artificial intelligence. As I trace some of the sources of our learning below, note that it is a combination of conceptual issues, architectural issues, and terminological issues that need to be overcome in order to see our way to a simpler and more responsive approach.

The Learning Process Began with UMBEL’s SuperTypes

Shortly after the initial release of UMBEL we undertook an effort in 2009 to split it into a number (initially 33) of mostly disjoint “super types” [7]. This logical segmentation was done for practical reasons of efficiency and modularity. It forced us to consider what is a “concept” and what is an “entity”, among other logical splits. It caused us to inspect the entire UMBEL knowledge space, and to organize and arrange and inspect the various parts and roles of the space.

We began to distinguish “attributes” and “relations” as separate from “concepts” and “entities”. Within the clustering of “entities” we could also see that some things were distinct individuals or entity instances, while other terms represented “types” or classes of like entities. At that time, “named entity” was a more important term of art than is practiced today. In looking at this idea we noted [7]:

The intuition surrounding “named entity” and nameable “things” was that they were discrete and disjoint. A rock is not a person and is not a chemical or an event. … some classes of things could also be treated as more-or-less distinct nameable “things”: beetles are not the same as frogs and are not the same as rocks. While some of these “things” might be a true individual with a discrete name, such as Kermit the Frog, or The Rock at Northwestern University, most instances of such things are unnamed. . . . The “nameability” (or logical categorization) of things is perhaps best kept separate from other epistemological issues of distinguishing sets, collections, or classes from individuals, members or instances.

Because we were mapping Cyc and Wikipedia using UMBEL as the intermediary, we noticed that some things were characterized as a class in one system, while being an instance in the other [8]. In essence, we were learning the vocabulary of knowledge bases, and beginning to see that terminology was by no means consistent across systems or viewpoints.

This reminds me of my experience as an undergraduate, learning plant taxonomy. We had to learn literally hundreds of strange terms such as glabrous or hirsute or pinnate, all terms of art for how to describe leaves, their shapes, their hairiness, fruits and flowers and such. What happens, though, when one learns the terminology of a domain is that one’s eyes are opened to see and distinguish more. What had previously been for me a field of view merely of various shades of green and shrubs and trees emerged as distinct species of plants and individual variants that I could clearly discern and identify. As I learned nuanced distinctions I began to be able to see with greater clarity. In a similar way, the naming and distinguishing of things in our UMBEL SuperTypes was opening up our eyes to finer and more nuanced distinctions in the knowledge base. All of this striving was in order to be able to map the millions and millions of things within Wikipedia to a different, coherent structure provided by Cyc and UMBEL.

ABox – TBox and Architectural Basics

One of the clearest distinctions that emerged was the split between the TBox and the ABox in the knowledge base, the difference between schema and instances. Through the years I have written many articles on this subject [9]. It is fundamental to understand the differences in representation and work between these two key portions of a knowledge base.

Instances are the specific individual things in the KB that are relevant to the domain. Instances can be many or few, as in the millions within Wikipedia, accounting for more than 90% of its total articles. Instances are characterized by various types of structured data, provided as key attribute-value pairs; they may be explained through long or short text descriptions, may have multiple aliases and synonyms, may be related to other instances via type or kind or other relations, and may be described in multiple languages. This is the predominant form of content within most knowledge bases, perhaps best exemplified by Wikipedia.

The TBox, on the other hand, needs to be a coherent structural description of the domain, which expresses itself as a knowledge graph with meaningful and consistent connections across its concepts. Somewhat irrespective of the number of instances (the ABox) in the knowledge base, the TBox is relatively constant in size given a desired level of descriptive scope for the domain. (In other words, the logical model of the domain is mostly independent from the number of instances in the domain.)

For a reference structure such as UMBEL, then, the size of its ontology (TBox) can be much smaller and defined with focus, while still being able to refer to and incorporate millions of instances, as is the case for Wikipedia (or virtually any large knowledge base). Two critical aspects for the TBox thus emerge. First, it must be a coherent and reasonable “brain” for capturing the desired dynamics and relationships of the domain. And, second, it must provide a robust, flexible, and expandable means for incorporating instance records. This latter “bridging” purpose is the topic of the next sub-section.
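
A minimal sketch of this split, in Python with rdflib and a hypothetical namespace, may help: the TBox holds the small, relatively constant schema, while the ABox holds instance records expressed as attribute-value pairs that hook into the TBox types.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS, OWL, XSD

EX = Namespace("http://example.org/kb/")   # hypothetical namespace
kb = Graph()
kb.bind("ex", EX)

# --- TBox: a small, coherent schema (the "brain") ---
kb.add((EX.Bird, RDF.type, OWL.Class))
kb.add((EX.Toucan, RDF.type, OWL.Class))
kb.add((EX.Toucan, RDFS.subClassOf, EX.Bird))

# --- ABox: instance records, as attribute-value pairs ---
kb.add((EX.PrettyBird, RDF.type, EX.Toucan))
kb.add((EX.PrettyBird, RDFS.label, Literal("Pretty Bird", lang="en")))
kb.add((EX.PrettyBird, EX.billLengthCm, Literal(14, datatype=XSD.integer)))

# The TBox stays roughly this size no matter how many instance
# records are mapped in; only the ABox grows.
print(len(kb), "triples in the combined TBox + ABox")
```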

The TBox-ABox segregation, and how it should work logically and pragmatically, requires much study and focus. It is easy to read the words and even sometimes to write them, but it has taken us many years of practice and much thought and experimentation to derive workable information architectures for realizing and exploiting this segregation.

I have previously spelled out seven benefits from the TBox-ABox split [10], but there is another one that arises from working through the practical aspects of this segregation. Namely, an effective ABox-TBox split compels an understanding of the roles and architecture of the TBox. It is the realization of this benefit that is the central basis for the insights provided in this article.

We’ll be spelling out more of these specifics in the sections below. These understandings help us define the composition and architecture of the TBox. In the case of the current development version of UMBEL [11], here are the broad portions of the TBox:



Distribution of Types in the UMBEL TBox

Structures (types) for organizing the entities in the domain constitute nearly 90% of the work of the TBox. This reflects the extreme importance of entity types to the “bridging” function of the TBox to the ABox.

Probing the Concept of ‘Entities’

Most of the instances in the ABox are entities, but what is an “entity”? Unfortunately, that is not universally agreed. In our parlance, an “entity” and related terms are:

  • The basic, real things in our domain of interest: entities
  • The way we characterize and describe those individual things: attributes
  • The way we describe connections between two or more of those things: relations, and
  • Aggregations or collections or classes of similar entities, which also share some essence: entity types.

We no longer use the term named entities, though nouns with proper names are almost always entities. By definition, entities cannot be topics or types, and entities are not data types. Some of the earlier typologies by others, such as Sekine [12], also mix the ideas of attributes and entities; we do not. Lastly, by definition, entity types have the same attribute “slots” as all type members, even if no data is assigned in many or most cases. The glossary presents a complete compilation of terms and acronyms used.
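
To make the four terms concrete, here is a short hedged sketch (rdflib, with hypothetical URIs) that distinguishes an entity, its attributes, a relation to another entity, and the entity type that aggregates similar instances.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, OWL, XSD

EX = Namespace("http://example.org/kb/")   # hypothetical namespace
g = Graph()
g.bind("ex", EX)

# Entity type: a class of similar things sharing some essence
g.add((EX.Astronaut, RDF.type, OWL.Class))

# Entity: a basic, real thing in the domain
g.add((EX.NeilArmstrong, RDF.type, EX.Astronaut))

# Attribute: how we characterize the individual thing
g.add((EX.NeilArmstrong, EX.birthYear, Literal(1930, datatype=XSD.gYear)))

# Relation: a connection between two identified things
g.add((EX.NeilArmstrong, EX.memberOf, EX.Apollo11))

print(g.serialize(format="turtle"))
```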

The label “entity” can also refer to what is known as the root node in some systems such as SUMO [13]. In the OWL language and RDF data model we use, the root node is known as “thing”. Clearly, our use of the term “entity” is much different from SUMO’s and resides at a subsidiary place in the overall TBox hierarchy. In this case, and frankly for most semantic matches, equivalences should be judged with care, with context the crucial deciding factor.

Nonetheless, most practitioners do not use “entity” in a root manner. Some of the first uses were in the Message Understanding Conferences, especially MUC-6 and MUC-7 in 1995 and 1997, where competitions for finding “named entities” were begun, as well as the practice of in-line tagging [14]. However, even the original MUC conferences conflated proper names and quantities of interest under “named entities.” For example, MUC-6 defined person, organization, and location names, all of which are indeed entity types, but also included dates, times, percentages, and monetary amounts, which we define as attribute types.

It did not take long for various groups and researchers to want more entity types, more distinctions. BBN categories, proposed in 2002, were used for question answering and consisted of 29 types and 64 subtypes [15]. Sekine put forward and refined over many years his Extended Entity Types, which grew to about 200 types [12]. But some of these accepted ‘named entities’ are also written in lower case, with examples such as rocks (‘gneiss’) or common animals or plants (‘daisy’) or chemicals (‘ozone’) or minerals (‘mica’) or drugs (‘aspirin’) or foods (‘sushi’) or whatever. Some deference was given to the idea of Kripke’s “rigid designators” as providing guidance for how to identify entities; rigid designators include proper names as well as certain natural kind terms such as biological species and substances. Because of these blurrings, the nomenclature of “named entities” began to fade away.

But it took only a few years before the demand shifted to “fine-grained entity” recognition, and the scope and number of types continued to creep up. Here are some additional data points to what has already been mentioned:

  • DBpedia Ontology: 738 types [16]
  • schema.org: 636 types [17]
  • YAGO: 505 types; see also HYENA [18]
  • Lee et al.: 147 types [19]
  • FIGER: 112 types [20]
  • Gillick: 86 types [21]
  • OpenCalais: 42 types [22]
  • GeoNames: 654 “feature codes” [23]
  • Nadeau: ~100 types [24].

Lastly, the new version of UMBEL has 25,000 entity types, in keeping with this growth trend and for the “bridging” reasons discussed below.

We can plot this out over time on log scale to see that the proposed entity types have been growing exponentially:



Growth in Recognition of Entity Types

This growth in entity types comes from wanting to describe and organize things with more precision. No longer do we want to talk broadly about people, but we want to talk about astronauts or explorers. We don’t just simply want to talk about products, but categories of manufactured goods like cameras or sub-types like SLR cameras or further sub-types like digital SLR cameras or even specific models like the Canon EOS 7D Mark II (skipping over even more intermediate sub-types). With sufficient instances, it is possible to train recognizers for these different entity types.

The appropriate depth and scope of the entity types to be considered may vary by domain, enterprise or particular task, which we can also refer to as context. For example, the toucan has sometimes been used as an example of how to refer to or name a thing on the Web [25]. When we inspect what might be a definitive description of “toucan” on Wikipedia, we see that the term more broadly represents the family of Ramphastidae, which contains five genera and forty different species. The picture we display is of but one of those forty species, that of the keel-billed toucan (Ramphastos sulfuratus). Viewing the images of the full list of toucan species shows just how physically divergent these various “toucans” are from one another. Across all species, average sizes vary by more than a factor of three with great variation in bill sizes, coloration and range. Further, if I assert that the picture of the toucan is actually that of my pet keel-billed toucan, Pretty Bird, then we can also understand that this representation is for a specific individual bird, and not the physical keel-billed toucan species as a whole.

The point is not a lesson on toucans, but an affirmation that distinctions between what we think we may be describing occur over multiple levels. Just as there are no self-evident criteria as to what constitutes an “entity type”, there is also no self-evident and fully defining set of criteria as to what the physical “toucan” bird should represent. The meaning of what we call a “toucan” bird is not embodied in its label or even its name, but in the accompanying referential information that places the given referent into a context that can be communicated and understood. A URI points to (“refers to”) something that causes us to conjure up an understanding of that thing, be it a general description of a toucan, a picture of a toucan, an understanding of a species of toucan, or a specific toucan bird. Our understanding or interpretation results from the context and surrounding information accompanying the reference.

Both in terms of historical usage and trying to provide a framework for how to relate these various understandings of entities and types, we can thus see this kind of relationship:



Evolving Sophistication of Entity Types

What we see is an entities typology that provides the “bridging” interface between specific entity records and the UMBEL reasoning layer. This entities typology is built around UMBEL’s existing SuperTypes. The typology is the result of the classification of things according to their shared attributes and essences. The idea is that the world is divided into real, discontinuous and immutable ‘kinds’. Expressed another way, in statistics, typology is a composite measure that involves the classification of observations in terms of their attributes on multiple variables. In the context of a global KB such as Wikipedia, about 25,000 entity types are sufficient to provide a home for the millions of individual articles in the system.

Each SuperType related to entities has its own typology, and is generally expressed as a taxonomy of 3-4 levels, though there are some cases where the depth is much greater (9-10 levels) [26]. There are two flexible aspects to this design. First, because the types are “natural” and nested [27], coarse entity schema can readily find a correspondence. Second, if external records have need for more specificity and depth, that can be accommodated through a mapped bridging hierarchy as well. In other words, the typology can expand and contract like a squeezebox to map a range of specificities.
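
A hedged sketch of this squeezebox behavior, with a hypothetical typology in rdflib: a record typed at a coarse level and a record typed at a fine level both resolve to the same parent type through a transitive subclass query, so external schemas of differing specificity can find a correspondence.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/typology/")   # hypothetical typology
g = Graph()
g.bind("ex", EX)

# A nested typology, a few levels deep
for child, parent in [
    (EX.Camera, EX.Product),
    (EX.SLRCamera, EX.Camera),
    (EX.DigitalSLRCamera, EX.SLRCamera),
]:
    g.add((child, RDFS.subClassOf, parent))

# External records mapped at different levels of specificity
g.add((EX.CanonEOS7DMarkII, RDF.type, EX.DigitalSLRCamera))  # fine-grained
g.add((EX.SomeGadget, RDF.type, EX.Product))                 # coarse

# Both resolve to the Product level via a transitive subclass path
q = """
SELECT ?thing WHERE {
  ?thing a/rdfs:subClassOf* ex:Product .
}
"""
for row in g.query(q, initNs={"ex": EX, "rdfs": RDFS}):
    print(row[0])
```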

The internal process to create these typologies also has the beneficial effect of testing placements in the knowledge graph and identifying gaps in the structure as informed by fragments and orphans. The typologies should optimally be fully connected in order to completely fulfill their bridging function.

Extending the Mindset to Attributes and Relations

As with our defined terminology above [28], we can apply this same mindset to the characterizations (attributes) of entities and the relations between entities and TBox concepts or topics. Numerically, these two categories are much less prevalent than entity types. But, the construction and use of the typologies are roughly the same.

Since we are using RDF and OWL as our standard model and languages, one might ask why we are not relying on the distinction of object and datatype properties for these splits. Relations, it is true, by definition need to be object properties, since both subject and object need to be identified things. But attributes, in some cases such as rating systems or specific designators, may also refer to controlled vocabularies, which can (and, under best practice, should) be represented as object properties. So, while most attributes are built around datatype properties, not all are. Relations and attributes are a better cleaving, since we can use relations as patterns for fact extractions, and the organization of attributes gives us a cross-cutting means to understand the characteristics of things independent of entity type. These all become valuable potential features for machine learning, in addition to the standard text structure.
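
A minimal sketch of that point (rdflib, hypothetical properties): a relation is declared as an object property, a literal-valued attribute as a datatype property, and an attribute that draws on a controlled vocabulary is itself modeled as an object property.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, OWL, XSD

EX = Namespace("http://example.org/kb/")   # hypothetical namespace
g = Graph()
g.bind("ex", EX)

# Relation: always an object property (one identified thing to another)
g.add((EX.directedBy, RDF.type, OWL.ObjectProperty))

# Attribute with a literal value: a datatype property
g.add((EX.runtimeMinutes, RDF.type, OWL.DatatypeProperty))

# Attribute drawing on a controlled vocabulary: also an object property
g.add((EX.contentRating, RDF.type, OWL.ObjectProperty))
g.add((EX.RatedPG13, RDF.type, EX.RatingValue))

# Example usage on an instance
g.add((EX.SomeFilm, EX.directedBy, EX.SomeDirector))
g.add((EX.SomeFilm, EX.runtimeMinutes, Literal(121, datatype=XSD.integer)))
g.add((EX.SomeFilm, EX.contentRating, EX.RatedPG13))
```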

Though, today, UMBEL is considerably more sophisticated in its entities typologies, we already have a start on an attributes typology by virtue of the prior work on the Attributes Ontology [29], which will be revised to conform to this newer typology model. We also have a start on a relations typology based on existing OWL and RDF predicates used in UMBEL, plus many candidates from the Activities SuperType. As with the entities typology, relation types and attribute types may also have hierarchy, enabling reasoning and the squeezebox design. As with the entities typology, the objective is to have a fully connected hierarchy, of perhaps no more than 3-4 levels depth, with no fragments and no orphans.

A Different Role for Annotations

Annotations about how we label things and how we assign metadata to things reside at a different layer than what has been discussed to this point. Annotations cannot be reasoned over, but they can and do play pivotal roles. Annotations are an important means for tagging, matching and slicing-and-dicing the information space. Metadata can perform those roles, but also may be used to analyze provenance and reasoning, if the annotations are mapped to object or datatype properties.

Labels are the means to broaden the correspondence of real-world reference to match the true referents or objects in the knowledge base. This enables the referents to remain abstract; that is, not tied to any given label or string. In best practice we recommend annotations reflect all of the various ways a given object may be identified (synonyms, acronyms, slang, jargon, all by language type). These considerations improve the means for tagging, matching, and slicing-and-dicing, even if the annotations are not able to be reasoned over.
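
A small sketch of this labeling practice, using the SKOS annotation properties on a hypothetical entity: one preferred label plus the synonyms and language variants that broaden matching, none of which participate in reasoning. (The alternate names shown are illustrative.)

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import SKOS

EX = Namespace("http://example.org/kb/")   # hypothetical namespace
g = Graph()
g.bind("ex", EX)
g.bind("skos", SKOS)

bird = EX.KeelBilledToucan

# Annotations: not reasoned over, but critical for tagging and matching
g.add((bird, SKOS.prefLabel, Literal("keel-billed toucan", lang="en")))
g.add((bird, SKOS.altLabel, Literal("sulfur-breasted toucan", lang="en")))
g.add((bird, SKOS.altLabel, Literal("rainbow-billed toucan", lang="en")))
g.add((bird, SKOS.prefLabel, Literal("tucán pico iris", lang="es")))

print(g.serialize(format="turtle"))
```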

As a mental note for the simple design that follows, imagine a transparent overlay, not shown, upon which best-practice annotations reside.

A Simple Design Brings it All Together

The insights provided here have taken much time to discover; they have arisen from a practical drive to make knowledge bases computable and useful to artificial intelligence applications. Here is how we now see the basics of a knowledge base, properly configured to be computable, and open to integration with external records:

Boiling KBs Down to Basics

From the broadest perspective, we can organize our knowledge-base platform into a “brain” or organizer/reasoner, and the instances or specific things or entities within our domain of interest. We can decompose a KB to become computable by providing various type groupings for our desired instance mappings, and an upper reasoning layer. A layer of “types”, organized into three broad groupings, provides the interface, or “bridging” layer, between the TBox and ABox. We thus have an architectural design (illustrated with a brief query sketch after the list) segregating:

  • Topics and upper level — the general organization and “brains” of the domain
  • Entity types — categorizations of the actual things in the space
  • Relation types — the ways that different things are related to, or act upon, one another
  • Attribute types — a structured organization of the ways that individual entities can be described
  • Instances — the individual entities of the domain, and
  • Properties — the source grist for annotations, relation types and attribute types.
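
As a hedged illustration of how these type layers make a KB computable, the following sketch (rdflib, hypothetical data) counts ABox instances per upper-level entity type by walking the bridging layer, the kind of simple aggregate question an unconnected content store cannot answer.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/kb/")   # hypothetical namespace
g = Graph()
g.bind("ex", EX)

# TBox: entity types arranged under an upper level
g.add((EX.Person, RDFS.subClassOf, EX.Entity))
g.add((EX.Astronaut, RDFS.subClassOf, EX.Person))
g.add((EX.Place, RDFS.subClassOf, EX.Entity))
g.add((EX.City, RDFS.subClassOf, EX.Place))

# ABox: instances connected to the structure through their entity types
g.add((EX.NeilArmstrong, RDF.type, EX.Astronaut))
g.add((EX.SallyRide, RDF.type, EX.Astronaut))
g.add((EX.Chicago, RDF.type, EX.City))

# A computable question: how many instances fall under each upper type?
q = """
SELECT ?type (COUNT(?i) AS ?n) WHERE {
  ?i a ?leaf .
  ?leaf rdfs:subClassOf* ?type .
  ?type rdfs:subClassOf ex:Entity .
}
GROUP BY ?type
"""
for type_, n in g.query(q, initNs={"ex": EX, "rdfs": RDFS}):
    print(type_, n)
```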

Becoming familiar with this terminology helps to show how the conventional understanding of these terms and structure has led to overlooking key insights and focusing (sometimes) on the wrong matters. That is in part why so much of the simple logic of this design has escaped the attention of most practitioners. For us at Structured Dynamics, it eluded us for years, even though we were actively seeking it.

Irrespective of terminology, the recognition of the role of types and their bridging function to actual instance data (records) is central to the computability of the structure. It also enables integration with any form of data record or record stores. The ability to understand relation types leads to improved relation extraction, a key means to mine entities and connections from general content and to extend assertions in the knowledge base. Entity types provide a flexible means for any entity to connect with the computable structure. And, the attribute types provide an orthogonal and inferential means to slice the information space by how objects get characterized.

Because of this architecture, the reference sources guiding its construction, its typologies, its ability to generate features and training sets, and its computability, we believe this overall design is suited to provide an array of AI and enterprise services:

Machine Intelligence Apps and Services
  • Entity recognizers
  • Relation extractors
  • Event extractors
  • Phrase identification
  • Classifiers
  • Q & A systems
  • Cognitive computing
  • Semantic publishing
  • Knowledge base mappings
  • Sub-graph extraction
  • Ontology development
  • Ontology mappers
  • Entity dictionaries
  • Entity linkers
  • Data conversion and mapping
  • Master data management
  • KB improvements
  • Attribute “slot filling”
  • Disambiguators
  • Duplicates removal
  • Inference and reasoning
  • Sentiment analysis
  • Semantic relatedness
  • Recommendation systems
  • Bespoke analysis
  • Bespoke platforms

By cutting through the clutter — conceptually and terminologically — it has been possible to derive a practical and repeatable design to make KBs computable. Being able to generate features and positive and negative training sets, almost at will, is proving to be an effective approach to machine learning at mass-produced prices.


[1] See M. K. Bergman, 2014. “Knowledge-based Artificial Intelligence,” from AI3:::Adaptive Information blog, November 17, 2014. For additional writings in the series, see https://www.mkbergman.com/category/kbai/.
[2] See Fabian M. Suchanek and Gerhard Weikum, 2014. “Knowledge Bases in the Age of Big Data Analytics,” Proceedings of the VLDB Endowment 7, no. 13 (2014) and M.K. Bergman, “SWEETpedia,” listing of Wikipedia research articles, on AI3:::Adaptive Information blog, January 25, 2010; the listing as of its last update included 246 articles. Also, see Wikipedia’s own “Wikipedia in Academic Studies.”
[3] A possible exception to this observation is the biomedical community through its Open Biological and Biomedical Ontologies (OBO) initiative.
[4] See M.K. Bergman, 2007. “Announcing UMBEL: A Lightweight Subject Structure for the Web,” AI3:::Adaptive Information blog, July 12, 2007. Also see http://umbel.org.
[5] See M.K. Bergman, 2008. “Basing UMBEL’s Backbone on OpenCyc,” AI3:::Adaptive Information blog, April 2, 2008.
[6] See M.K. Bergman, 2015. “Shaping Wikipedia into a Computable Knowledge Base,” AI3:::Adaptive Information blog, March 31, 2015.
[7] M.K. Bergman, 2009. ‘SuperTypes’ and Logical Segmentation of Instances, AI3:::Adaptive Information blog, September 2, 2009.
[8] This possible use of an item as both a class and an instance through “punning” is a desirable feature of OWL 2, which is the language basis for UMBEL. You can learn more on this subject in M.K. Bergman, 2010. “Metamodeling in Domain Ontologies,” AI3:::Adaptive Information blog, September 20, 2010.
[9] For a listing of these, see the Google query https://www.google.com/search?q=tbox+abox+site%3Amkbergman.com. One of the 40 articles with the most relevant commentary to this article is M.K. Bergman, 2014. “Big Structure and Data Interoperability,” AI3:::Adaptive Information blog, August 14, 2014.
[10] M.K. Bergman, 2009. ” Making Linked Data Reasonable using Description Logics, Part 1,” AI3:::Adaptive Information blog, February 11, 2009.
[11] The current development version of UMBEL is v 1.30. It is due for release before the end of 2015.
[12] See the Sekine Extended Entity Types; the listing also includes attributes info at bottom of source page.
[14] N. Chinchor, 1997. “Overview of MUC-7,” MUC-7 Proceedings, 1997.
[15] Ada Brunstein, 2002. “Annotation Guidelines for Answer Types”. LDC Catalog, Linguistic Data Consortium. Aug 3, 2002.
[16] Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann, 2009. “DBpedia-A Crystallization Point for the Web of Data.” Web Semantics: science, services and agents on the world wide web 7, no. 3 (2009): 154-165; 170 classes in this paper. That has grown to more than 700; see http://mappings.dbpedia.org/server/ontology/classes/ and http://wiki.dbpedia.org/services-resources/datasets/dataset-2015-04/dataset-2015-04-statistics.
[17] The listing is under some dynamic growth. This is the official count as of September 8, 2015, from http://schema.org/docs/full.html. Current updates are available from Github.
[18] Joanna Biega, Erdal Kuzey, and Fabian M. Suchanek, 2013. “Inside YAGO2: A Transparent Information Extraction Architecture,” in Proceedings of the 22nd international conference on World Wide Web, pp. 325-328. International World Wide Web Conferences Steering Committee, 2013. Also see Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol, Gerhard Weikum, 2012. “HYENA: Hierarchical Type Classification for Entity Names,” in Proceedings of the 24th International Conference on Computational Linguistics, Coling 2012, Mumbai, India, 2012.
[19] Changki Lee, Yi-Gyu Hwang, Hyo-Jung Oh, Soojong Lim, Jeong Heo, Chung-Hee Lee, Hyeon-Jin Kim, Ji-Hyun Wang, and Myung-Gil Jang, 2006. “Fine-grained Named Entity Recognition using Conditional Random Fields for Question Answering,” in Information Retrieval Technology, pp. 581-587. Springer Berlin Heidelberg, 2006.
[20] Xiao Ling and Daniel S. Weld, 2012. “Fine-Grained Entity Recognition,” in AAAI. 2012.
[21] Dan Gillick, Nevena Lazic, Kuzman Ganchev, Jesse Kirchner, and David Huynh, 2014. “Context-Dependent Fine-Grained Entity Type Tagging.” arXiv preprint arXiv:1412.1820 (2014).
[24] David Nadeau, 2007. “Semi-supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision.” PhD Thesis, School of Information Technology and Engineering, University of Ottawa, 2007.
[25] M.K. Bergman, 2012. ” Give Me a Sign: What Do Things Mean on the Semantic Web?,” AI3:::Adaptive Information blog, January 24, 2012.
[26] A good example of description and use of typologies is in the archaeology description on Wikipedia.
[27] M.K. Bergman, 2015. “‘Natural’ Classes in the Knowledge Web“, AI3:::Adaptive Information blog, July 13, 2015.
[28] Also see my Glossary for definitions of specific terminology used in this article.
[29] M.K. Bergman, 2015. “Conceptual and Practical Distinctions in the Attributes Ontology“, AI3:::Adaptive Information blog, March 3, 2015.
Posted: June 8, 2015

What Began as Data Integration Implies So Much More

Oh, it was probably two or three years ago that one of our clients asked us to look into single-source authoring, or more broadly what has come to be known as COPE (create once, publish everywhere), as made prominent by Daniel Jacobson of NPR, now Netflix. We also looked closely at the question of formats and workflows that might increase efficiencies or lower costs in the quest to grab and publish content.

Then, of course, about the same time, it was becoming apparent that standard desktop and laptop screens were being augmented with smartphones and tablets. Smaller screen aspects require a different interface layout and interaction, but writing for specific devices was a losing proposition. Responsive Web design and grid layout templates that could bridge different device aspects have now come to the fore.

Though it has been true for some time that different publishing venues — from the Web to paper documents or PDFs — have posed a challenge, these other requirements point to a broader imperative. I have intuitively felt there is a consistent thread at the core of these emerging device, use and publishing demands, but the common element has heretofore eluded me.

For years — decades, actually — I have been focused on the idea of data interoperability. My first quest was to find a model that could integrate text stories and documents with structured data from conventional databases and spreadsheets. My next quest was to find a framework that could relate context and meaning across multiple perspectives and world views. Though it took a while, and only began to really take shape about a decade ago, I came to focus on RDF and general semantic Web principles for providing this model.

Data integration through open, semantic Web standards has been a real beacon for how I have pursued this quest. The ideal of being able to relate disparate information from multiple sources and viewpoints to each other has been a driving motivation in my professional interests. In analyzing the benefits of a more connected world of information I could see efficiencies, reduced costs, more global understandings, and insights from previously hidden connections.

Yet here is the funny thing. I began to realize that other drivers for how to improve knowledge worker efficiencies or to deploy results to different devices and venues share the same justifications as data integration. Might there not be some common bases and logic underlying the interoperability imperative? Is not data interoperability but a part of a broader mindset? Are there some universal principles to derive from an inspection of interoperability, broadly construed?

In this article I try to follow these questions to some logical ends. This investigation raises questions and tests from the global — that is, information interoperability — to the local and practical in terms of notions such as create once, use everywhere, and have it staged for relating and interoperability. I think we see that the same motivators and arguments for relating information apply to the efficient ways to organize and publish that information. I think we also see that the idea of interoperability is systemic. Fortunately, meaningful interoperability can be achieved across-the-board today with application of the right mindsets and approaches. Below, I also try to set the predicates for how these benefits might be realized by exploring some first principles of interoperability.

What is Interoperability?

So, what is interoperability and why is it important?

So-called enterprise information integration and interoperability seem to sprout from the same basic reality. Information gets created and codified across multiple organizations, formats, storage systems and locations. Each source of this information gets created with its own scope, perspective, language, characteristics and world view. Even in the same organization, information gets generated and characterized according to its local circumstances.

In the wild, and even within single organizations, information gets captured, represented, and characterized according to multiple formats and viewpoints. Without bridges between sources that make explicit the differences in format and interpretation, we end up with what — in fact — is today’s reality of information stovepipes. The reality of our digital information being in isolated silos and moats results in duplicate efforts, inefficiencies, and lost understandings. Despite all of the years and resources thrown at information generation, use and consumption, our digital assets are unexploited to a shocking extent. The overarching cause for this dereliction of fiscal stewardship is the lack of interoperability.

By the idea of interoperability we are getting at the concept of working together. Together means things are connected in some manner. Working means we can mesh the information across sources to do more things, or do them better or more cheaply. Interoperability does not necessarily imply integration, since our sources can reside in distributed locations and formats. What is important is not the physical location — or, indeed, even format and representations — but that we have bridges across sources that enable the source information to work together.

In working backwards from this observation, then, we need certain capabilities to fulfill these interoperability objectives. We need to be able to ingest multiple encodings, serializations and formats. Because we need to work with this information, and tools for doing so are diverse, we also need the ability to export information in multiple encodings, serializations and formats. Human circumstance means we need to ingest and encode this information in multiple human languages. Some of our information is more structured, and describes relationships between things or the attributes or characterizations of particular types of things. Since all of this source information has context and provenance, we need to capture these aspects as well in order to ascertain the meaning and trustworthiness of the information.

This set of requirements is a lot of work, which can most efficiently be done against one or a few canonical representations of the input information. From a data integration perspective, then, the core system to support, store and manage this information should be based on only a few central data representations and models, with many connectors for ingesting native information in the wild and tools to support the core representations:
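
A hedged sketch of that hub-and-spoke idea with rdflib (the sample data is illustrative, and bundled JSON-LD support assumes rdflib 6 or later): content arriving in different serializations is parsed into one canonical RDF graph, then re-exported in whatever format a downstream tool needs.

```python
from rdflib import Graph

# Canonical hub: one RDF graph, regardless of input serialization
hub = Graph()

# Ingest spoke 1: Turtle from one source (illustrative sample data)
hub.parse(data="""
@prefix ex: <http://example.org/> .
ex:report1 ex:title "Quarterly sales report" .
""", format="turtle")

# Ingest spoke 2: JSON-LD from another source
hub.parse(data="""
{ "@id": "http://example.org/report2",
  "http://example.org/title": "Marketing plan" }
""", format="json-ld")

# Export spokes: serialize the canonical graph for different consumers
print(hub.serialize(format="turtle"))   # for semantic tooling
print(hub.serialize(format="xml"))      # RDF/XML for legacy tools
print(hub.serialize(format="nt"))       # N-Triples for bulk loading
```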



A Data Flow Perspective on Interoperability

In our approach at Structured Dynamics we have chosen the Resource Description Framework (RDF) as the structured data model at the core of the system [1], supported by the Lucene text engine for full-text search and efficient facet searching. Because all of the information is given unique Web identifiers (URIs), and the whole system resides on the Web accessible via the HTTP protocol, our information may reside anywhere the Internet connects.

This gives us a data model and a uniform way to represent the input data across structured, semi-structured and unstructured sources. Further, we have a structure that can capture the relations or attributes (including metadata and provenance) of the input information. However, one more step is required to achieve data interoperability: an understanding of the context and meaning of the source information.

To achieve the next layer in the data interoperability pyramid [2] it is thus necessary to employ semantic technologies. The structure of the RDF data model has an inherent expressiveness to capture meaning and context. To this foundation we must add a coherent view of the concepts and entity types in our domain of interest, which also enables us to capture the entities within this system and their characteristics and relationships to other entities and concepts. These properties applied to the classes and instances in our domain of interest can be expressed as a knowledge graph, which provides the logical schema and inferential framework for our domain. This stack of semantic building blocks gets formally expressed as ontologies (the technical term for a working graph) that should putatively provide a coherent representation of the domain at hand.

We can visualize this semantic stack as follows:



A Semantics Perspective of Interoperability

We have been using the spoke-and-hub diagram above for data flows for some years and have used the semantic stack representation before, too. I believe in my bones the importance of data interoperability to competitive advantage for enterprises, and therefore its business worth as a focus of my company’s technology. But, once so considered, some more fundamental questions emerge. What makes data interoperability a worthwhile objective? Can an understanding of those objectives bring us to a more fundamental understanding of the benefits? Does a grounding in those more fundamental benefits suggest any change in our development priorities?

Drivers of Interoperability

I think we can boil the drivers of interoperability down to four. These are:

  • Efficiency — literally trillions are spent globally each year in the research, creation, re-use, publishing, storing and browsing of information [3]. Yet relevant information is hard to find, and sometimes obscure information is overlooked. The lack of reuse of prior good content because it is not discoverable is unconscionable given today’s technologies. The base productivity of information use is low;
  • Cost — missed information or lack of awareness of relevant information leads to increased time, increased direct costs (labor and material), and increased indirect costs. Awareness, understanding and re-use of existing information would save millions or more for brand-name firms [3] annually if these interoperability gaps were overcome;
  • Insight — drawing connections between previously unconnected things and enabling discovery are essential inputs to innovation, itself the overall driver of productivity (and, therefore, wealth) gains. The reinforcing leverage of interoperability resides in its ability to bring new understandings and insights; and
  • Capture — simply being able to include the 80% of extant information contained in text is a huge first step to interoperability, but grounding the system in the inherent connectedness of the Web means that all kinds of fields + streams, APIs, mappings, DBs, datasets, Web content, on-the-fly discoveries, and device sensors through the Internet of things (IoT) can be captured to contribute to our insights.

To be sure, data interoperability is focused on insight. But data interoperability also brings efficiency and cost reductions. As we add other aspects of interoperability — say, responsive design for mobile — we may see comparatively fewer benefits in insight, but more in efficiency, cost, and, even, capture. Anything done to increase benefits from any of these drivers contributes to the net benefits and rationale for interoperability.

Principles of Interoperability

The general goodness arising from interoperability suggests it is important to understand the first principles underlying the concept. By understanding these principles, we can also tease out the fundamental areas deserving attention and improvement in our interoperability developments and efforts. These principles help us cut through the crap in order to see what is important and deserves attention.

I think the first of the first principles for interoperability is reusability. Once we have put the effort into the creation of new valuable data or content, we want to be able to use and apply that knowledge in all applicable venues. Some of this reuse might be in chunking or splitting the source information into parts that can be used and deployed for many purposes. Some of this reuse might be in repurposing the source data and content for different presentations, expressions or devices. These considerations imply the importance of storing, characterizing, structuring and retrieving information in one or a few canonical ways.

Interoperable content and forms should also aspire to an ideal of “onceness”. The ideal is that the efforts to gather, create or analyze information be done as few times as possible. This ideal clearly ties into the principle of reusability because that must be in place to minimize duplication and overlooking what exists. The reason to focus on onceness is that it forces an explication of the workflows and bottlenecks inherent to our current work practices. These are critical areas to attack since, unattended, such inefficiencies provide the “death by a thousand cuts” to interoperability. Onceness is at the center of such compelling ideas as COPE and the role of APIs in a flexible architecture (see below) to promote interoperability.

A respect for workflows is also a first principle, expressed in two different ways. The first way is that existing workflows cannot be unduly disrupted when introducing interoperability improvements. While workflows can be improved or streamlined over time — and should — initial introduction and acceptance of new tools and practices must fit with existing ways of doing tasks in order to see adoption. Jarring changes to existing work practices are mostly resisted. The second way that workflows are a first principle is in the importance of being aware of, explicitly modeling, and then codifying how we do tasks. This becomes the “language” of our work, and helps define the tooling points or points of interaction as we merge activities from multiple disciplines in our domain. These workflow understandings also help us identify useful points for APIs in our overall interoperability architecture.

These considerations provide the rationale for assigning metadata [4] that characterize our information objects and structure, based on controlled vocabularies and relationships as established by domain and administrative ontologies [5]. In the broadest interoperability perspective, these vocabularies and the tagging of information objects with them are a first principle for ensuring how we can find and transition states of information. These vocabularies need not be complex or elaborate, but they need to be constant and consistent across the entire content lifecycle. There are backbone aspects to these vocabularies that capture the overall information workflow, as well as very specific steps for individual tasks. As a complement to such administrative ontologies, domain ontologies provide the context and meaning (semantics) for what our information is about.
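
A small hedged sketch of such tagging, using Dublin Core terms plus a hypothetical administrative vocabulary for workflow stage: consistent, lightweight metadata applied at each step is what lets later stages find the object and pick up the baton.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import DCTERMS, XSD

ADMIN = Namespace("http://example.org/admin/")   # hypothetical admin ontology
EX = Namespace("http://example.org/content/")    # hypothetical content IDs

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("admin", ADMIN)

doc = EX.whitepaper42

# Descriptive metadata (standard Dublin Core terms)
g.add((doc, DCTERMS.title, Literal("Interoperability White Paper", lang="en")))
g.add((doc, DCTERMS.creator, Literal("Structured Dynamics")))
g.add((doc, DCTERMS.created, Literal("2015-06-08", datatype=XSD.date)))

# Administrative metadata: where the object sits in the workflow
g.add((doc, ADMIN.workflowStage, ADMIN.EditorialReview))
g.add((doc, ADMIN.approvedForPublication, Literal(False)))

print(g.serialize(format="turtle"))
```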

The common grounding of data model and semantics means we can connect our sources of information.  The properties that define the relationships between things determine the structure of our knowledge graph. Seeking commonalities for how our information sources relate to one another helps provide a coherent graph for drawing inferences. How we describe our entities with attributes provides a second type of property. Attribute profiles are also a good signal for testing entity relatedness. Properties — either relations or attributes — provide another filter to draw insight from available information.

If the above sounds like a dynamic and fluid environment, you would be right. Ultimately, interoperability is a knowledge challenge in a technology environment that is rapidly changing. New facts, perspectives, devices and circumstances are constantly arising. For these very reasons an interoperability framework must embrace the open world assumption [6], wherein the underlying logic structure and its vocabulary and data can be grown and extended at will. We are seeing some breakaway from conventional closed-world thinking of relational databases with NoSQL and graph databases, but a coherent logic based on description logics, such as is found with open standard semantic technologies like RDF and OWL and SPARQL, is even more responsive.
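
The open-world point can be made concrete with a tiny sketch (all names hypothetical): under RDF and SPARQL, new facts and even new vocabulary can be added at any time without schema migrations, and the absence of a fact simply means unknown rather than false.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/kb/")   # hypothetical namespace
g = Graph()
g.bind("ex", EX)

# Initial knowledge
g.add((EX.Acme, RDF.type, EX.Company))

# Open world: nothing about Acme's headquarters is asserted yet.
# The query returns no rows -- "unknown", not "false".
q = "SELECT ?hq WHERE { ex:Acme ex:headquarters ?hq . }"
print(list(g.query(q, initNs={"ex": EX})))   # []

# Later, data and vocabulary are extended at will -- no migration needed
g.add((EX.headquarters, RDF.type, RDF.Property))
g.add((EX.Acme, EX.headquarters, EX.Springfield))
print(list(g.query(q, initNs={"ex": EX})))   # now one row
```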

Though perhaps not quite at the level of a first principle, I also think interoperability improvements should be easy to use, easy to share, and easy to learn. Tooling is clearly implied in this, but also it is important we be able to develop a language and framing for what constitutes interoperability. We need to be able to talk about and inspect the question of interoperability in order to discover insights and gain efficiencies.

Aspects of Interoperability

The thing about interoperability is that it extends over all aspects of the information lifecycle, from capturing and creating information, to characterizing and vetting it, to analyzing it, or publishing or distributing it. Eventually, information and content already developed becomes input to new plans or requirements. These aspects extend across multiple individuals and departments and even organizations, with portions of the lifecycle governed (or not) by their own set of tools and practices. We can envision this overall interoperability workflow something like the following [7]:



A Generalized Workflow Perspective of Interoperability

Overall, only pieces of this cycle are represented in most daily workflows. Actually, in daily work, parts of this workflow are much more detailed and involved than what this simplistic overview implies. Editorial review and approvals, or database administration and management, or citation gathering or reference checking, or data cleaning, or ontology creation and management, or ETL activities, or hundreds of other specific tasks, sit astride this general backbone.

Besides showing that interoperability is a systemic activity for any organization (or should be), we can also derive a couple of other insights from this figure. First, we can see that some form of canonical representation and management is central to interoperability. As noted, this need not be a central storage system, but can be distributed using Web identifiers (URIs) and protocols (HTTP). Second, we characterize and tag our information objects using ontologies, both from structural and administrative viewpoints, but also by domain and meaning. Characterizing our information by a common semantics of meaning enables us to combine and analyze our information.

A third insight is that a global schema specific to workflows and information interoperability is the key for linking and combining activities at any point within the cycle.  A common vocabulary for stages and interoperability tasks, included as a best practice for our standard tagging efforts, provides the conventions for how batons can get passed between activities at any stage in this cycle. The challenge of making this insight operational is one more of practice and governance than of technology. Inspecting and characterizing our information workflows with a common vocabulary and understanding needs to be a purposeful activity in its own right, backed with appropriate management attention and incentives.

A final insight is that such a perspective on interoperability is a bit of a fractal. As we get more specific in our workflows and activities, we can apply these same insights in order to help those new, more specific workflows become interoperable. We can learn where to plug into this structure. And, we can learn how our specific activities through the application of explicit metadata and tags with canonical representations can work to interact well with other aspects of the content lifecycle.

Interoperability can be achieved today with the right mindsets and approaches. Fortunately, because of the open world first principle, this challenge can be tackled in an incremental, piecemeal manner. While the overall framework provides guidance for where comprehensive efforts across the organization may go, we can also cleave off only parts of this cycle for immediate attention, following a “pay as you benefit” approach [8]. A global schema and a consistent approach to workflows and information characterizations can help ensure the baton is properly passed as we extend our interoperability guidance to other reaches of the enterprise.

General Architecture and a Sample Path

We can provide a similar high-level view for what an enterprise information architecture supporting interoperability might look like. We can broadly layer this architecture into content acquisition, representation and repository, and content consumption:



An Architectural Perspective of Interoperability

Content of all forms — structured, semi-structured and unstructured — is brought into the system and tagged or mapped into the governing domain or administrative schema. Text content is marked up with reduced versions of HTML (such as RASH [9] or Markdown [10]) in order to retain the author’s voice and intent in areas such as emphasis, titles or section headers; the structure of the content is also characterized by patterned areas such as abstracts, body and references. All structured data is characterized according to the RDF data model, with vocabularies as provided by OWL in some cases.

We already have an exemplar repository in the Open Semantic Framework [11] that shows the way (along with other possible riffs on this theme) for how just a few common representations and conventions can work to distribute both schema and information (data) across a potentially distributed network. Further, by not stopping at the water’s edge of data interoperability, we can also embrace further, structural characterization of our content. Adding this wrinkle enables us to efficiently support a variety of venues for content consumption simultaneously.

This architecture is quite consistent with what is known as WOA (for Web-oriented architecture) [12]. Like the Internet itself, WOA has the advantage of being scalable and distributed, all (mostly) based on open standards. The interfaces between architectural components are also provided through mostly RESTful application programming interfaces (APIs), which extend interoperability to outside systems and provide flexibility for swapping in new features or functionality as new components or developments arise. Under this design, all components and engines become in effect “black boxes”, with information exchange via standard vocabularies and formats using APIs as the interface for interoperability.
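
As a hedged sketch of that black-box-behind-an-API pattern, here is a minimal HTTP endpoint in Python with Flask (the route and the in-memory data are hypothetical): callers see only a standard JSON payload at an agreed interface, never the component’s internals, so the engine behind it can be swapped freely.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical stand-in for the component's internal engine
ENTITIES = {
    "toucan": {"type": "Bird", "label": "keel-billed toucan"},
}

@app.route("/api/entity/<name>")
def get_entity(name):
    # The caller sees only a standard JSON payload, never the internals
    record = ENTITIES.get(name)
    if record is None:
        return jsonify({"error": "not found"}), 404
    return jsonify({"id": name, **record})

if __name__ == "__main__":
    app.run(port=8080)
```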

A Global Context for Interoperability

Though data interoperability is a large and central piece, I hope I have demonstrated that interoperability is a much broader and far-reaching concept. We can see that “global interoperability” extends into all aspects of the information lifecycle. By expanding our viewpoint of what constitutes interoperability, we have discovered some more general principles and mindsets that can promise efficiencies, lower costs and greater insights across the enterprise.

An explicit attention to workflows and common vocabularies for those flows and the information objects they govern is a key to a more general understanding of interoperability and the realization of its benefits. Putting this kind of infrastructure in place is also a prerequisite to greater tooling and automation in processing information.

We can already put in place chains of tooling and workflows governed by these common vocabularies and canonical representations to achieve a degree of this interoperability. We do not need to tackle the whole enchilada at once or mount some form of “big bang” initiative. We can start piecemeal, and expand as we benefit. The biggest gaps remain the codification of workflows in relation to the overall information lifecycle, and the application of taggers to provide the workflow and structure metadata at each stage in the cycle. Again, these are not so much matters of technology or tooling as of policy and information governance.

What I have outlined here provides the basic scaffolding for how such an infrastructure to promote interoperability may evolve. We know how we do our current tasks; we need to understand and codify those workflows. Then, we need to express our processing of information at any point along the content lifecycle. A number of years back I discussed climbing the data interoperability pyramid [2]. We have made much progress over the past five years and stand ready to take our emphasis on interoperability to the next level.

To be sure there is much additional tooling still needed, mostly in the form of mappers and taggers. But the basic principles, core concepts and backbone tools for supporting greater interoperability are known and relatively easy to put in place. Embracing the mindset and inculcating this process into our general information management routines is the next challenge. Working to obtain the ideal is doable today.


[1] See M. K. Bergman, 2009. “Advantages and Myths of RDF,” from AI3:::Adaptive Information blog, April 8, 2009.
[2] See M. K. Bergman, 2006. “Climbing the Data Federation Pyramid,” from AI3:::Adaptive Information blog, May 25, 2006.
[3] See M. K. Bergman, 2005. “Untapped Assets: The $3 Trillion Value of U.S. Enterprise Documents,” from AI3:::Adaptive Information blog, July 20, 2005.
[4] See M. K. Bergman, 2010. “I Have Yet to Metadata I Didn’t Like,” from AI3:::Adaptive Information blog, August 16, 2010.
[5] See M. K. Bergman, 2011. “An Ontologies Architecture for Ontology-driven Apps,” from AI3:::Adaptive Information blog, December 5, 2011.
[6] See M. K. Bergman, 2009. “The Open World Assumption: Elephant in the Room,” from AI3:::Adaptive Information blog, December 21, 2009.
[7] Some sources that helped form my thoughts on the information lifecycle include Backbone Media and Piktochart.
[8] See M. K. Bergman, 2010. “‘Pay as You Benefit’: A New Enterprise IT Strategy,” from AI3:::Adaptive Information blog, July 12, 2010.
[9] See Silvio Peroni, 2015. “RASH: Research Articles in Simplified HTML,” March 15, 2015.
[10] Many Markdown options exist for a reduced subset of HTML; one in this vein is Scholarly Markdown.
[11] The Open Semantic Framework has its own Web site (http://opensemanticframework.org/), supported by a wiki of more than 500 supporting technical articles (http://wiki.opensemanticframework.org/index.php/Main_Page).
[12] See M. K. Bergman, 2009. “A Generalized Web-oriented Architecture (WOA) for Structured Data,” from AI3:::Adaptive Information blog, May 3, 2009.
Posted:August 18, 2014

A Critical Fit with the Semantic Web and AI

In the first parts of this series we introduced the idea of Big Structure, and the fact that it resides at the nexus of the semantic Web, artificial intelligence, natural language processing, knowledge bases, and Big Data. In this article, we look specifically at the work that Big Structure promotes in data interoperability as a way to clarify the roles these various aspects play.

By its nature, data integration (the first step in data interoperability) means that data is being combined across two or more datasets. Such integration surfaces all of the myriad aspects of semantic heterogeneities, exactly the kinds of issues that the semantic Web and semantic technologies were designed to address. But resolving semantic differences cannot be accomplished by semantic technologies alone. While semantics can address the basis of differences in meaning and context, resolution of those differences or deciding between differing interpretations (that is, ambiguity) also requires many of the tools of artificial intelligence or natural language processing (NLP).

By decomposing this space into its various sources of semantic heterogeneities — as well as the work required in order to provide for such functions as search, disambiguation, mapping and transformations — we can begin to understand how all of these components can work together to help achieve data interoperability. This understanding, in turn, is essential for designing the stack and software architecture — and its accompanying information architecture — that can best achieve these interoperability objectives.

So, this current article lays out this conceptual framework of components and roles. Later articles in this series will address the specific questions of software and information architectural design.

Data Interoperability in Relation to Semantics

Semantic technologies give us the basis for understanding differences in meaning across sources, specifically geared to address differences in real world usage and context. These semantic tools are essential for providing common bases for relating structured data across various sources and contexts. These same semantic tools are also the basis by which we can determine what unstructured content “means”, thus providing the structured data tags that also enable us to relate documents to conventional data sources (from databases, spreadsheets, tables and the like). These semantic technologies are thus the key enablers for making information — unstructured, semi-structured and structured — understandable to both humans and machines across sources. Such understandings are then a key basis for powering the artificial intelligence applications that are now emerging to make our lives more productive and less routine.

For nearly a decade I have used an initial schema by Pluempitiwiriyawej and Hammer to elucidate the sources of possible semantic differences between content. Over the years I have added language and encoding differences to this schema. Most recently, I have updated this schema to specifically call out semantic heterogeneities due either to conceptual differences between sources (largely arising from schema differences) or to value and attribute differences amongst actual data. I have further added examples for what each of these categories of semantic heterogeneities means [1].

This table of more than 40 sources of semantic heterogeneities clearly shows the possible impediments to getting data to interoperate across sources:

| Class | Category | Subcategory | Examples | Type [2] [4] |
| --- | --- | --- | --- | --- |
| LANGUAGE | Encoding | Ingest Encoding Mismatch | For example, ANSI v UTF-8 [3] | Concept |
| | | Ingest Encoding Lacking | Mis-recognition of tokens because not being parsed with the proper encoding [3] | Concept |
| | | Query Encoding Mismatch | For example, ANSI v UTF-8 in search [3] | Concept |
| | | Query Encoding Lacking | Mis-recognition of search tokens because not being parsed with the proper encoding [3] | Concept |
| | Languages | Script Mismatch | Variations in how parsers handle, say, stemming, white spaces or hyphens | Concept |
| | | Parsing / Morphological Analysis Errors (many) | Arabic languages (right-to-left) v Romance languages (left-to-right) | Concept |
| | | Syntactical Errors (many) | Ambiguous sentence references, such as I’m glad I’m a man, and so is Lola (Lola by Ray Davies and the Kinks) | Concept |
| | | Semantics Errors (many) | River bank v money bank v billiards bank shot | Concept |
| CONCEPTUAL | Naming | Case Sensitivity | Uppercase v lower case v Camel case | Concept |
| | | Synonyms | United States v USA v America v Uncle Sam v Great Satan | Concept |
| | | Acronyms | United States v USA v US | Concept |
| | | Homonyms | Such as when the same name refers to more than one concept, such as Name referring to a person v Name referring to a book | Concept |
| | | Misspellings | As stated | Concept |
| | Generalization / Specialization | | When single items in one schema are related to multiple items in another schema, or vice versa. For example, one schema may refer to “phone” but the other schema has multiple elements such as “home phone,” “work phone” and “cell phone” | Concept |
| | Aggregation | Intra-aggregation | When the same population is divided differently (such as, Census v Federal regions for states, England v Great Britain v United Kingdom, or full person names v first-middle-last) | Concept |
| | | Inter-aggregation | May occur when sums or counts are included as set members | Concept |
| | | Internal Path Discrepancy | Can arise from different source-target retrieval paths in two different schemas (for example, hierarchical structures where the elements are different levels of remove) | Concept |
| | Missing Item | Content Discrepancy | Differences in set enumerations or including items or not (say, US territories) in a listing of US states | Concept |
| | | Missing Content | Differences in scope coverage between two or more datasets for the same concept | Concept |
| | | Attribute List Discrepancy | Differences in attribute completeness between two or more datasets | Attribute |
| | | Missing Attribute | Differences in scope coverage between two or more datasets for the same attribute | Attribute |
| | Item Equivalence | | When two types (classes or sets) are asserted as being the same when the scope and reference are not (for example, Berlin the city v Berlin the official city-state) | Concept |
| | | | When two individuals are asserted as being the same when they are actually distinct (for example, John Kennedy the president v John Kennedy the aircraft carrier) | Attribute |
| | Type Mismatch | | When the same item is characterized by different types, such as a person being typed as an animal v human being v person | Attribute |
| | Constraint Mismatch | | When attributes referring to the same thing have different cardinalities or disjointedness assertions | Attribute |
| DOMAIN | Schematic Discrepancy | Element-value to Element-label Mapping | One of four errors that may occur when attribute names (say, Hair v Fur) may refer to the same attribute, or when the same attribute names (say, Hair v Hair) may refer to different attribute scopes (say, Hair v Fur), or where values for these attributes may be the same but refer to different actual attributes, or where values may differ but be for the same attribute and putative value. Many of the other semantic heterogeneities herein also contribute to schema discrepancies | Attribute |
| | | Attribute-value to Element-label Mapping | | Attribute |
| | | Element-value to Attribute-label Mapping | | Attribute |
| | | Attribute-value to Attribute-label Mapping | | Attribute |
| | Scale or Units | Measurement Type | Differences, say, in the metric v English measurement systems, or currencies | Attribute |
| | | Units | Differences, say, in meters v centimeters v millimeters | Attribute |
| | | Precision | For example, a value of 4.1 inches in one dataset v 4.106 in another dataset | Attribute |
| | Data Representation | Primitive Data Type | Confusion often arises in the use of literals v URIs v object types | Attribute |
| | | Data Format | Delimiting decimals by period v commas; various date formats; using exponents or aggregate units (such as thousands or millions) | Attribute |
| DATA | Naming | Case Sensitivity | Uppercase v lower case v Camel case | Attribute |
| | | Synonyms | For example, centimeters v cm | Attribute |
| | | Acronyms | For example, currency symbols v currency names | Attribute |
| | | Homonyms | Such as when the same name refers to more than one attribute, such as Name referring to a person v Name referring to a book | Attribute |
| | | Misspellings | As stated | Attribute |
| | ID Mismatch or Missing ID | | URIs can be a particular problem here, due to actual mismatches but also use of name spaces or not and truncated URIs | Attribute |
| | Missing Data | | A common problem, more acute with closed world approaches than with open world ones | Attribute |
| | Element Ordering | | Set members can be ordered or unordered, and if ordered, the sequences of individual members or values can differ | Attribute |
Sources of Semantic Heterogeneities

Ultimately, since we express all of our content and information with human language, we need to start there to understand the first sources of semantic differences. Like the differences in human language, we also have differences in world views and experience. These differences are often conceptual in nature and get at what we might call differences in real-world perspectives and experiences. From there, we encounter differences in our specific realms of expertise or concern, or the applicable domain(s) for our information and knowledge. Lastly, we assign data and values to our observations and characterizations in order to specify and quantify them. But the attributes of data are subject to the same semantic vagaries as concepts, in addition to their own specific challenges in units and measures and how they are expressed.

From the conceptual to actual data, then, we see differences in perspective, vocabularies, measures and conventions. Only by systematically understanding these sources of heterogeneity — and then explicitly addressing them — can we begin to try to put disparate information on a common footing. Only by reconciling these differences can we begin to get data to interoperate.
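By way of a small, illustrative sketch (the mapping tables below are stand-ins, not reference standards), explicitly reconciling a few of the data-level heterogeneities in the table, such as attribute case, unit synonyms and unit conversions, looks something like this:

```python
# A small sketch of explicitly reconciling a few data-level heterogeneities
# from the table above: attribute case, unit synonyms (centimeters v cm), and
# unit differences. The mapping tables are illustrative stand-ins for the
# reference structures discussed later.
UNIT_SYNONYMS = {"centimeters": "cm", "centimetre": "cm", "inches": "in"}   # illustrative
TO_CM = {"cm": 1.0, "in": 2.54}                                             # exact conversions

def normalize_length(value: float, unit: str) -> float:
    """Bring a length value into a canonical unit (cm), resolving unit synonyms."""
    unit = UNIT_SYNONYMS.get(unit.strip().lower(), unit.strip().lower())
    return value * TO_CM[unit]

records = [("Body Length", 4.1, "inches"), ("body length", 10.4, "CM")]
for attr, val, unit in records:
    print(attr.lower(), round(normalize_length(val, unit), 2), "cm")
```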

Some of these differences and heterogeneities are intrinsic to the nature of the data at hand. Even for the same putative topics, data from French researchers will be expressed in a different language and with different measurements (metric) than will data from English researchers. Some of these heterogeneities also arise from the basis and connections asserted between datasets, as misuse of the sameAs predicate shows in many linked data applications [5].

Fortunately, in many areas we are transitioning by social convention to overcome many of these sources of semantic heterogeneity. A mere twenty years ago, our information technology systems expressed and stored data in a multitude of formats and systems. The Internet and Web protocols have done much to overcome these sources of differences, what I’ve termed elsewhere as climbing the data federation pyramid [6]. Semantic Web approaches, where data items are assigned unique URIs, are another means of making integration easier. And, whether or not all agree it is culturally desirable, we are also seeing English become the lingua franca of research and data.

The point of the table above is not to throw up our hands and say there is just too much complexity in data integration. Rather, by systematically decomposing the sources of semantic heterogeneity, we can anticipate and accommodate those sources not yet being addressed by cultural or technological conventions. While there is a large number of categories of semantic heterogeneity, these categories are also patterned and can be anticipated and corrected. These patterned sources inform us about what kind of work must be done to overcome semantic differences where they still reside.

Work Components in Data Interoperability

The description logics that underlie the semantic Web already do a fair job of architecting this concept-attribute split in semantics. The concept split is known as the TBox (for terminological knowledge, the basis for the T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split, of instances, is known as the ABox (for assertions, the basis for the A in ABox) and describes the attributes of instances (individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts [7].
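A minimal rdflib sketch, using hypothetical example names, may help make the TBox-ABox split concrete:

```python
# A minimal rdflib sketch of the TBox/ABox split described above: the TBox
# declares the concepts and their relationships; the ABox asserts the
# instances, their class memberships, and their attribute values. Names under
# the EX namespace are hypothetical.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS, OWL, XSD

EX = Namespace("http://example.org/zoo/")
g = Graph()
g.bind("ex", EX)

# TBox: terminological (schema) statements
g.add((EX.Mammal, RDF.type, OWL.Class))
g.add((EX.Whale, RDF.type, OWL.Class))
g.add((EX.Whale, RDFS.subClassOf, EX.Mammal))
g.add((EX.bodyLengthCm, RDF.type, OWL.DatatypeProperty))

# ABox: assertional (instance) statements
g.add((EX.moby, RDF.type, EX.Whale))
g.add((EX.moby, EX.bodyLengthCm, Literal(1490, datatype=XSD.integer)))

print(g.serialize(format="turtle"))
```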

The semantic Web is a standards-based effort by the W3C (World Wide Web Consortium); many of its accomplishments have arisen around ontology and TBox-related efforts. Data integration has putatively been tackled from the perspective of linked data, but that methodology so far is short on attributes and property-mapping linkages between datasets and schema. There are as yet no reference vocabularies or schema for attributes [8]. Many of the existing linked data linkages are based on erroneous owl:sameAs assertions. It is fair to say that attribute and ABox-level semantics and interoperability have received scarce attention, even though the logic underpinnings exist for progress to be made.

This lack on the attributes or ABox side of things is a major gap in the work requirements for data interoperability, as we see from the table below. The TBox development and understanding is quite good; and a number of reference ontologies are available upon which to ground conceptual mappings [9]. But the ABox third is largely missing grounding references. And the specialty work tasks, representing about the last third, are in need of better tooling and effectiveness.

For both the TBox and the ABox, we are able to describe and model concepts (classes) and instances (individuals), and we are pretty good at modeling relationships (predicates) between concepts and individuals. We are also able to ground concepts and their relationships through a number of reference concept ontologies [9]. But our understanding of attributes (the descriptive properties of instances) remains poor and ungrounded. Best practices — let alone general practices — still remain to be discovered.

TBox (concepts)

  • Definitions of the concepts and properties (relationships) of the controlled vocabulary
  • Declarations of concept axioms or roles
  • Inferencing of relationships, be they transitive, symmetric, functional or inverse to another property
  • Equivalence testing as to whether two classes or properties are equivalent to one another
  • Subsumption, which is checking whether one concept is more general than another
  • Satisfiability, which is the problem of checking whether a concept has been defined (is not an empty concept)
  • Classification, which places a new concept in the proper place in a taxonomic hierarchy of concepts
  • Logical implication, which is whether a generic relationship is a logical consequence of the declarations in the TBox
  • Infer property assertions implicit through the transitive property

Specialty Work Tasks

  • Mappings are the core of interoperability in that concepts and attributes get matched across schema and datasets
  • Transformations are the means to bring disparate data into common grounds, the second leg of interoperability
  • Entailments, which are whether other propositions are implied by the stated condition
  • Instance checking, which verifies whether a given individual is an instance of (belongs to) a specified concept
  • Knowledge base consistency, which is to verify whether all concepts admit at least one individual
  • Realization, which is to find the most specific concept for an individual object
  • Retrieval, which is to find the individuals that are instances of a given concept
  • Identity relations, which is to determine the equivalence or relatedness of instances in different datasets
  • Disambiguation, which is resolving references to the proper instance

ABox (data)

  • Membership assertions, either as concepts or as roles
  • Attributes assertions
  • Linkages assertions that capture the above but also assert the external sources for these assignments
  • Consistency checking of instances
  • Satisfiability checks, which are that the conditions of instance membership are met
Work Tasks for a Data Interoperability Framework
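To illustrate two of these work tasks, retrieval and instance checking, here is a small sketch using a SPARQL property path over a toy graph; a production system would rely on a proper DL reasoner, but the intent is the same:

```python
# Illustrative sketch of two ABox work tasks from the list above, retrieval
# and instance checking, using a SPARQL 1.1 property path over a toy graph
# (hypothetical EX names).
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/zoo/")
g = Graph()
g.add((EX.Whale, RDFS.subClassOf, EX.Mammal))
g.add((EX.moby, RDF.type, EX.Whale))
g.add((EX.rex, RDF.type, EX.Dog))

# Retrieval: find the individuals that are (directly or transitively) instances of Mammal
query = """
SELECT ?ind WHERE {
  ?ind a/rdfs:subClassOf* <http://example.org/zoo/Mammal> .
}
"""
mammals = {row.ind for row in g.query(query, initNs={"rdfs": RDFS})}
print(mammals)

# Instance checking: is a given individual a member of the concept?
print(EX.moby in mammals, EX.rex in mammals)   # True False
```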

Across the knowledge base (that is, the combination of the TBox and the ABox), the semantic Web has improved its search capabilities by formally integrating with conventional text search engines, such as Solr. Instance and consistency checking are pretty straightforward to do, but are often neglected steps in most non-commercial semantic installations. Critical areas such as mappings, transformations and identity evaluation remain weak work areas. This figure helps show these major areas and their work splits:

Work Splits in Data Interoperability

Work Splits Between the Semantic Web and AI

As we discussed in an earlier article on the recent and rapid advances of artificial intelligence [10], the combination of knowledge bases and the semantic Web with AI machine learning (ML) and NLP techniques will show rapid improvements in data interoperability. The two stumbling blocks of not having a framework and architecture for interoperability, plus the lack of attribute groundings, have been the controlling constraints. Now that these factors are known and they are being purposefully addressed, we should see rapid improvements, similar to other areas in AI.

This re-embedding of the semantic Web in artificial intelligence, coupled with the conscious attention to provide reference groundings for data interoperability, should do much to address what are current, labor-intensive stumbling blocks in the knowledge management workflow.

Putting Some Grown-up Pants on the Semantic Web

The semantic Web clearly needs to play a central role in data integration and interoperability. Fortunately, as we have seen in other areas [11], semantic technologies lend themselves to generic functional software that can be designed for re-use in most any knowledge domain, chiefly by changing the data and ontologies guiding them. This means that reference libraries of groundings, mappings and transformations can be built over time and reused across enterprises and projects. The use of functional programming languages and DSLs will also align well with the data, schema and ontologies at the heart of knowledge management functions. These prospects parallel the emergence of knowledge-based AI (KBAI), which marries electronic Web knowledge bases with improvements in machine-learning algorithms.

The time for these initiatives is now. The complete lack of distributed data interoperability is no longer tolerable. High costs due to unacceptable manual efforts and too many failed projects plague the data interoperability efforts of the past. Data interoperability is no longer a luxury, but a necessity for enterprises needing to compete in a data-intensive environment. At scale, point-to-point integration efforts become ineffective; a form of reusable and transferable master data management (MDM) needs to emerge for the realities of Big Data, and one that is based on the open and standard protocols of the Web.

Much tooling and better workflows and user interfaces will need to emerge. But the critical aspects are the ones we are addressing now: information and software architectures; reference groundings and attributes; and education about these very real prospects near at hand. The challenge of data interoperability in cooperation with its artificial intelligence cousin is where the semantic Web will finally put on its Big Boy pants.


[1] See Charnyote Pluempitiwiriyawej and Joachim Hammer, 2000. A Classification Scheme for Semantic and Schematic Heterogeneities in XML Data Sources, Technical Report TR00-004, University of Florida, Gainesville, FL, 36 pp., September 2000. See https://cise.ufl.edu/tr/DOC/REP-2000-396.pdf. I first cited this report and extended it to cover languages (see [3]) in M.K. Bergman, 2006. Sources and Classification of Semantic Heterogeneities, AI3:::Adaptive Information blog, June 6, 2006. See https://www.mkbergman.com/232/sources-and-classification-of-semantic-heterogeneities/. This most recent version added the examples and expanded the listing a bit further, to where it is no longer faithful to the original 2000 paper.
[2] Concept is the shorthand used for the schema or classes or TBox. Attribute is the shorthand used for instance data or entities and their ABox. I segregate class-relation properties (predicates) from instance-describing properties (attributes). This distinction is not used in standard TBox-ABox splits; its rationale will be described in a further article.
[3] See M.K. Bergman, 2006. Tutorial: Internet Languages, Character Sets and Encodings, BrightPlanet Corporation Technical Documentation, March 2006, 13 pp. See https://www.mkbergman.com/wp-content/themes/ai3v2/files/2006Posts/InternationalizationTutorial060323.pdf.
[4] See [7]. Also the TBox portion, or classes (concepts), is the basis of the ontologies. The ontologies establish the structure used for governing the conceptual relationships for that domain and in reference to external (Web) ontologies. The ABox portion, or instances (named entities), represents the specific, individual things that are the members of those classes. Named entities are the notable objects, persons, places, events, organizations and things of the world. Each named entity is related to one or more classes (concepts) to which it is a member. Named entities do not set the structure of the domain, but populate that structure. The ABox and TBox play different roles in the use and organization of the information and structure.
[5] M.K. Bergman 2009. When Linked Data Rules Fail, AI3:::Adaptive Information blog, November 16, 2009. See https://www.mkbergman.com/846/when-linked-data-rules-fail/.
[6] M.K. Bergman 2006. Climbing the Data Federation Pyramid, AI3:::Adaptive Information blog, May 25, 2006. See https://www.mkbergman.com/229/climbing-the-data-federation-pyramid/.
[7] M.K. Bergman 2008. Thinking ‘Inside the Box’ with Description Logics, AI3:::Adaptive Information blog, November 10, 2008. See https://www.mkbergman.com/466/thinking-inside-the-box-with-description-logics/.
[8] See the thread on the W3C semantic web mailing list beginning at http://lists.w3.org/Archives/Public/semantic-web/2014Jul/0129.html.
[9] Examples of upper-level ontologies include UMBEL, the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc and BFO (Basic Formal Ontology). Most of the content in their upper levels is more akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget’s Thesaurus) than to “generic common knowledge.” Most all of them have both a hierarchical and networked structure, though their actual subject structure relating to concrete things is generally pretty weak. See further the Wikipedia entry on upper ontologies.
[10] M.K. Bergman 2014. Spring Dawns on Artificial Intelligence, AI3:::Adaptive Information blog, June 2, 2014. See https://www.mkbergman.com/1731/spring-dawns-on-artificial-intelligence/.
[11] M.K. Bergman 2011. Ontology-driven Apps Using Generic Applications, AI3:::Adaptive Information blog, March 7, 2011. See https://www.mkbergman.com/948/ontology-driven-apps-using-generic-applications/.
Posted:July 23, 2014

Envisioning A New Adaptive Infrastructure for Data Interoperability

In Part I of this two-part series, Fred Giasson and I looked back over a decade of working within the semantic Web and found it partially successful, but ultimately the wrong question to be asking moving forward. The inadequacies of the semantic Web to date reside in its lack of attention to practical data interoperability across organizational or community boundaries. An emphasis on linked data has created an illusion that questions of data integration are being effectively addressed. They are not.

Linked data is hard to publish and not the only useful form for consuming data; linked data quality is often unreliable; the linking predicates for relating disparate data sources to one another may be inadequate or wrong; and, there are no reference groundings for relating data values across datasets. Neither the semantic Web nor linked data has developed the practices, tooling or experience to actually interoperate data across the Web. These criticisms are not meant to condemn linked data — it is, after all, the early years. Where it is compliant and from authoritative information sources, linked data can be a gold standard in data publishing. But, linked data is neither necessary nor essential, and may even be a diversion if it sucks the air from the room for what is more broadly useful.

This table summarizes the state-of-art in the semantic Web for frameworks and guidance in how to interoperate data:

| Category | Related Terms | Status in the Semantic Web | Notes |
| --- | --- | --- | --- |
| Classes | sets, concepts, topics, types, kinds | Mature, but broader scope coverage desirable; equivalent linkages between datasets often mis-applied; more realistic proximate linkages in flux, with no bases to reason over them | [1] |
| Instances | individuals, entities, members, records, things | Current basis for linked data; many linkage properties mis-applied | [2] |
| Relation Properties | relations, predicates | Equivalent linkages between datasets often mis-applied; more realistic proximate linkages in flux, with no bases to reason over them | [3] |
| Descriptive Properties | attributes, descriptors | Save for a couple of minor exceptions, no basis for mapping attributes across datasets | [4] |
| Values | data | Basic QUDT ontologies could contribute here | [5] |

We can relate the standard subject – predicate – object triple statement in RDF to this table, using the Category column. Classes and Instances relate to the subject, Relation and Descriptive Properties relate to the predicate, and Values relate to the object [6] in an RDF triple. The concepts and class schema of different information sources (their “aboutness”) can reasonably be made to interoperate. In terms of the description logics that underlie the logic bases of W3C ontologies, the focus and early accomplishments of the semantic Web have been on this “terminological box” or T-Box [7]. Tooling to make the mappings more productive and means to test the coherence and completeness of the results still remain as priority efforts, but the conceptual basis and best practices have progressed pretty well.
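Mapping the table's categories onto a single RDF statement (with illustrative example IRIs) makes the correspondence plain:

```python
# Illustrative only: triples with hypothetical example IRIs, annotated with
# the categories from the table above.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import XSD

EX = Namespace("http://example.org/")
g = Graph()
g.bind("ex", EX)

g.add((
    EX.moby,                               # subject   -> an Instance (member of some Class)
    EX.bodyLengthCm,                       # predicate -> a Descriptive Property (attribute)
    Literal(1490, datatype=XSD.integer),   # object    -> a Value
))
g.add((EX.moby, EX.memberOf, EX.podAlpha)) # predicate -> a Relation Property; object is another instance

print(g.serialize(format="turtle"))
```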

In contrast, nearly lacking in focus and tooling has been the flip side of that description logics coin: the A-Box [7], or assertional and instance (data) level of the equation. Both the T-Box and A-Box are necessary to provide a knowledge base. Today, there are virtually no vocabularies, no tooling, no history, no best practices and no “grounding” for actual A-Box data integration within the semantic Web. Without such guidance, the semantic Web is silent on the questions of data interoperability. As David Karger explained in his keynote address at ISWC in 2013 [8], “we’ve got our heads in the clouds while people are stuck in the dirt.”

Yet these are not fatal flaws of the semantic Web, nor are they permanent. Careful inspection of current circumstances, combined with purposeful action, suggests:

  1. Data integration can be solved
  2. Leveraging background knowledge is a key enabler
  3. Interoperability requires reference structures, what we are calling Big Structure.

The Prism of Data Interoperability

Why do we keep pointing to the question of data interoperability? Consider these facts:

  • 80% of all available information is in text or documents (unstructured)
  • 40% of standard IT project expenses are devoted to data integration in one form or another, due to the manual effort needed for data migration and mapping
  • Information volumes are now doubling in fewer than two years
  • Other trends including smartphones and sensors are further accelerating information growth
  • Effective business intelligence requires the use of quality, integrated data.

The abiding, costly, frustrating and energy-sucking demands of data integration have been a constant within enterprises for more than three decades. The same challenges reside for the Web. The Internet of Things will further demand better interoperability frameworks and guidelines. Current data integration tooling relies little upon semantics and no leading alternative is based principally around semantic approaches [9].

The data integration market is considered to include enterprise data integration and extract, transform and load (ETL) vendors. Gartner estimates tool sales for this market to be about $2 billion annually, with a growth rate faster than most IT areas [10]. But data integration also touches upon broader areas such as enterprise application integration (EAI), federated search and query, and master data management (MDM), among others. Given that data integration is also 40% of standard IT project costs, new approaches are needed to finally unblock the costly logjam of enterprise information integration. Most analysts see firms that are actively pursuing data integration innovations as forward-thinking and more competitive.

Data integration is combining information from multiple sources and providing users a uniform view of it. Data interoperability is being able to exchange and work upon (inter-operate) information across system and organizational boundaries. The ability to integrate data precedes the ability to interoperate it. For example, I may have three datasets of mammals that I want to consolidate and describe in similar terms with common units of measurement. That is an example of data integration. I may then want to relate this mammal knowledge base with a more general perspective of the animal kingdom. That is an example of data interoperability. Data integration usually occurs within a single organization or enterprise or institutional offering (as would be, say, Wikipedia). Data interoperability additionally needs to define meanings and communicate them in common ways across organizational, domain or community boundaries.
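The mammal example can be sketched in a few lines of Python (all names, figures and mappings are illustrative): integration consolidates the datasets into a uniform view with common units, and interoperability then relates that view to a broader reference schema:

```python
# Sketch of the mammal example: integration consolidates three source records
# into one shape with a common unit; interoperability then relates the
# consolidated data to a broader (here, illustrative) animal-kingdom schema.
datasets = [
    {"name": "Blue whale", "length": 98.0, "unit": "ft"},
    {"name": "Bottlenose dolphin", "length": 2.5, "unit": "m"},
    {"name": "House mouse", "length": 8.0, "unit": "cm"},
]
TO_M = {"ft": 0.3048, "m": 1.0, "cm": 0.01}

# Integration: a uniform view with a common unit of measure
integrated = [{"name": d["name"], "length_m": d["length"] * TO_M[d["unit"]]} for d in datasets]

# Interoperability: relate the integrated records to a broader reference schema
MAMMAL_TO_ORDER = {"Blue whale": "Cetacea", "Bottlenose dolphin": "Cetacea", "House mouse": "Rodentia"}
for rec in integrated:
    rec["order"] = MAMMAL_TO_ORDER.get(rec["name"], "Unknown")
    print(rec)
```

The integration step yields the uniform view; the interoperability step is where shared reference structures (here, the order names) come into play across boundaries.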

These are natural applications for the semantic Web. Why, then, has there not been more practical use of the semantic Web for these purposes?

That is an interesting question that we only partially addressed in Part I of this series. All aspects of data have semantics: what the data is about, what is its context, how it relates to other data, and what its values are and what they mean. The semantic Web is closely allied with natural language processing, an essential for bringing the 80% of unstructured data into the equation. Semantic Web ontologies are useful structures for how to relate real-world data into common, reference forms. The open world logic of the semantic Web is the right perspective for knowledge functions under the real-world conditions of constantly expanding information and understandings.

While these requirements suggest an integral role for the semantic Web, it is also clear that the semantic Web has not yet made these contributions. One explanation may be that semantic Web advocates, let alone the linked data tribe, have not seen data integration — as traditionally defined — as their central remit. Another possibility is that trying to solve data interoperability through the primary lens of the semantic Web is the wrong focus. In any case, meeting the challenge of data interoperability clearly requires a much broader context.

Embedding Data Interoperability Into a Broader Context

The semantic Web, in our view, is properly understood as a sub-domain of artificial intelligence. Semantic technologies mesh smoothly with natural language tasks and objectives. But, as we noted in a recent review article, artificial intelligence is itself undergoing a renaissance [11]. These advances are coming about because of the use of knowledge-based AI (KBAI), which combines knowledge bases with machine learning and other AI approaches. Natural language and spoken interfaces combined with background knowledge and a few machine-learning utilities are what underlie Apple’s Siri, for example.

The realization that the semantic Web is useful but insufficient and that AI is benefitting from the leveraging of background knowledge and knowledge bases caused us to “decompose” the data-interoperability information space. Because artificial intelligence is a key player here, we also wanted to capture all of the main sub-domains of AI and their relationships to one another:

Artificial Intelligence Domains

Two core observations emerge from standing back and looking at these questions. First, many of AI’s main sub-domains have a role to play with respect to data integration and interoperability:

AI Domains Related to Data Interoperability

This places semantic Web technologies as a co-participant with natural language processing, knowledge mining, pattern recognizers, KR languages, reasoners, and machine learning as domains related to data interoperability.

And, second, generalizing the understanding of knowledge bases and other guiding structures in this space, such as ontologies, highlights the potential importance of Big Structure. Virtually every one of the domains displayed above would be aided by leveraging background knowledge.

Grounding Data Interoperability in Big Structure

As our previous AI review showed [11], reference knowledge bases — Wikipedia in the forefront — have been a tremendous boon to moving forward on many AI challenges. Our own experience with UMBEL has also shown how reference ontologies can help align and provide common grounding for mapping different information domains into one another [12]. Vetted, gold-standard reference structures provide a fixity of coherent touchpoints for orienting different concepts and domains (and, we believe, data) to one another.

In the data integration context, master data models (and management, or MDM) attempt to provide common reference terms and objects to aid the integration effort. Like other areas in conventional data integration, very few examples of MDM tools based on semantic technologies exist.

This use of reference structures and the importance of knowledge bases to help solve hard computational tasks suggests there may be a general principle at work. If ontologies can help orient domain concepts, why can’t they also be used to orient instance data and their attributes? In fact, must these structures always be ontologies? Are not other common reference structures such as taxonomies, vocabularies, reference entity sets, or other typologies potentially useful to data integration?

By standing back in this manner and asking these broader questions we can see a host of structures like reference concepts, reference attributes, reference places, reference identifiers, and the like, playing the roles of providing common groundings for integration and interoperation. Through the AI experience, we can also see that subsequent use of these reference structures — be they full knowledge bases or more limited structures like taxonomies or typologies — can further improve information extraction and organization. The virtuous circle of knowledge structures improving AI algorithms, which can then further improve the knowledge structures, has been a real Aha! moment for the artificial intelligence community. We should see rapid iterations of this virtuous circle in the months to come.

These perspectives can help lead to purposeful designs and approaches for attacking such next-generation problems as data interoperability. The semantic Web can not solve this alone because additional AI capabilities need to be brought to bear. Conventional data integration approaches that lack semantic Big Structure groundings — let alone the use of AI techniques — have years of history of high cost and disappointing results. No conventional enterprise knowledge management problem appears sheltered from this whirlwind of knowledge-backed AI.

At Structured Dynamics, Fred Giasson and I have been discussing “Big Structure” for some time. However, it was only in researching this article that I came across the first public use of this phrase in the context of AI and big data. In May, Dr. Jiawei Han, a leading researcher in data mining, gave a lecture at Yahoo! Labs entitled Big Data Needs Big Structure. In it, he defines “Big Structure” as a type of information network. The correlation with ontologies and knowledge structures is obvious.

An Emerging Development Agenda

The intellectual foundations already exist to move aggressively on a focused development agenda to improve the infrastructure of data interoperability. This emerging agenda needs to look to new reference structures, better tooling, the use of functional languages and practices, and user interfaces and workflows that improve the mappings that are the heart of interoperability.

Big Structure, such as UMBEL for referencing what data is about, is the present exemplar for going forward. Excellent reference and domain ontologies for common domains already exist. Mapping predicates have been developed for these purposes. Though creation of the maps is still laborious, tooling improvements (see below) should speed up that process as well.

What is needed next are reference structures to help guide attribute mappings, data value mappings, and transformations into usable common attribute quantities and types. I will discuss in a later post our more detailed thoughts on what a reference gold-standard attribute ontology should look like. This new Big Structure should also be helpful in guiding conversion, transformation and “lifting” utilities that may be used to bring attribute values from heterogeneous sources into a common basis. As mappings are completed, these too can become standard references as the bootstrapping continues.
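As a rough sketch of what such grounding-driven “lifting” might look like (the reference table here is hypothetical, standing in for the gold-standard attribute ontology discussed above):

```python
# A sketch of how a reference attribute grounding might drive "lifting":
# source-specific attribute names are mapped to a canonical attribute with a
# canonical unit, and values are transformed on the way in. The reference
# table itself is illustrative.
ATTRIBUTE_REFERENCE = {
    # source attribute      canonical attribute   convert-to-canonical
    "height_in":           ("heightCm",           lambda v: v * 2.54),
    "height_cm":           ("heightCm",           lambda v: v),
    "weight_lbs":          ("massKg",             lambda v: v * 0.45359237),
}

def lift(record: dict) -> dict:
    """Transform a source record's attributes into their canonical groundings."""
    lifted = {}
    for attr, value in record.items():
        canonical, convert = ATTRIBUTE_REFERENCE.get(attr, (attr, lambda v: v))
        lifted[canonical] = convert(value)
    return lifted

print(lift({"height_in": 72, "weight_lbs": 180}))   # {'heightCm': 182.88, 'massKg': 81.65...}
```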

Mappings for data integration across the scales, scope and growth of data volumes on the Web and within enterprises can no longer be handled manually. Semi-automated tooling must be developed and refined that operates over large volumes with acceptable performance. Constant efforts to reduce the data volumes requiring manual curation are essential; AI approaches should be incorporated into the virtuous iterations to reduce these efforts. Meanwhile, attentiveness to productive user interfaces and efficient workflows are also essential to improve throughput.
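A toy sketch of the semi-automated pattern (score candidate mappings automatically, route only the uncertain ones to a curator) might look like this; the threshold and labels are illustrative, and a real tool would use semantics, not just string similarity:

```python
# Toy sketch of semi-automated mapping: candidate matches between source and
# target attribute labels are scored by simple string similarity, and only
# low-confidence pairs are routed to manual curation.
from difflib import SequenceMatcher

source_attrs = ["home phone", "cell phone", "surname"]
target_attrs = ["homePhone", "mobilePhone", "familyName", "workPhone"]

def best_match(label: str, candidates: list) -> tuple:
    scored = [(c, SequenceMatcher(None, label.lower(), c.lower()).ratio()) for c in candidates]
    return max(scored, key=lambda pair: pair[1])

for attr in source_attrs:
    target, score = best_match(attr, target_attrs)
    decision = "auto-accept" if score >= 0.8 else "send to curator"
    print(f"{attr!r} -> {target!r}  score={score:.2f}  ({decision})")
```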

Further, by working off of standards-based Big Structures, this tooling can be made more-or-less generic, with ready application to different domains and different data. Because this tooling will often work in enterprises behind firewalls, standard enterprise capabilities (security, access, preservation, availability) should also be added to this infrastructure.

These Big Structures and tools should themselves be created and maintained via functional programming languages and DSLs specifically geared to the circumstances at hand. We want languages suited to RDF and AI purposes with superior performance across large mapped datasets and unstructured text. But we also want languages that are easier to use and maintain by knowledge workers themselves. Partitioning strategies may also need to be employed to ensure acceptable real-time feedback to users responsible for data integration mappings.

A New Adaptive Infrastructure for Data Interoperability

Structured Dynamics’ review exercise, now documented in this two-part series, affirms the semantic Web needs to become re-embedded in artificial intelligence, backed by knowledge bases, which are themselves creatures of the semantic Web. Coupling artificial intelligence with knowledge bases will do much to improve the most labor-intensive stumbling blocks in the data integration workflow: mappings and transformations. Through a purposeful approach of developing reference structures for attributes and data values, we will begin to see marked improvements in the efficiency and lower costs of data integration. In turn, what is learned by using these approaches for mastering MDM will teach the semantic Web much.

An approach using semantic technologies and artificial intelligence tools will begin to solve the data integration puzzle. By leveraging background knowledge, we will begin to extend into data interoperability. Purposeful attention to tooling and workflows geared to improve the mapping speed and efficiency by users will enable us to increase the stable of reference structures — that is, Big Structure — available for the next integration challenges. As this roster of Big Structures increases, they can be shared, allowing more generic issues of data integration to be overcome, freeing domains and enterprises to target what is unique.

Achieving this vision will not occur overnight. But, based on a decade of semantic Web experience and the insights being gained from today’s knowledge-based AI advances, the way forward looks pretty clear. We are entering a fundamental new era of knowledge-based computation. We welcome challenging case examples that will help us move this vision forward.

NOTE: This Part II concludes the series with Part I, A Decade in the Trenches of the Semantic Web

[1] Using semantic ontologies can and has worked well for many domains and applications, such as the biomedical OBO ontologies, IBM’s Watson, Google’s Knowledge Graph, and hundreds in more specific domains. Combined with concept reference structures like UMBEL, both building blocks and exemplars exist for how to interoperate across what different domains are about.
[2] For examples of issues, see M. K. Bergman, 2009. When Linked Data Rules Fail, AI3:::Adaptive Information blog, November 16, 2009.
[3] Some of these options are overviewed by M. K. Bergman, 2010. The Nature of Connectedness on the Web, AI3:::Adaptive Information blog, November 22, 2010.
[4] See the thread on the W3C semantic web mailing list beginning at http://lists.w3.org/Archives/Public/semantic-web/2014Jul/0129.html.
[6] The object may also refer to another class or instance, in which case the relation property takes the form of an ObjectProperty and the “value” is the URI referring to that object.
[7] See, for example, M. K. Bergman, 2009. Making Linked Data Reasonable Using Description Logics, Part 2, AI3:::Adaptive Information blog, February 15, 2009.
[9] Info-Tech Research Group, 2011. Vendor Landscape Plus: Data Integration Tools, 72 pp.
[10] Gartner estimates that the data integration tool market was slightly over $2 billion at the end of 2012, an increase of 7.4% from 2011. This market is seeing an above-average growth rate of the overall enterprise software market, as data integration continues to be considered a strategic priority by organizations. See Eric Thoo, Ted Friedman, Mark A. Beyer, 2013. Magic Quadrant for Data Integration Tools, research Report G00248961 from Gartner, Inc., 17 July 2013; see: http://www.gartner.com/technology/reprints.do?id=1-1HBEFSF&ct=130717&st=sb
[11] See M. K. Bergman, 2014. Spring Dawns on Artificial Intelligence, AI3:::Adaptive Information blog, June 2, 2014.
[12] See M. K. Bergman, 2011. In Search of ‘Gold Standards’ for the Semantic Web, AI3:::Adaptive Information blog, February 28, 2011.