Posted: August 18, 2014

Big Structure and Data Interoperability

A Critical Fit with the Semantic Web and AI

In the first parts of this series we introduced the idea of Big Structure, and the fact that it resides at the nexus of the semantic Web, artificial intelligence, natural language processing, knowledge bases, and Big Data. In this article, we look specifically at the work that Big Structure promotes in data interoperability as a way to clarify the roles these various aspects play.

By its nature, data integration (the first step in data interoperability) means that data is being combined across two or more datasets. Such integration surfaces all of the myriad aspects of semantic heterogeneities, exactly the kinds of issues that the semantic Web and semantic technologies were designed to address. But resolving semantic differences cannot be accomplished by semantic technologies alone. While semantics can address the basis of differences in meaning and context, resolving those differences or deciding between differing interpretations (that is, ambiguity) also requires many of the tools of artificial intelligence or natural language processing (NLP).

By decomposing this space into its various sources of semantic heterogeneities, as well as the work required to provide such functions as search, disambiguation, mapping and transformations, we can begin to understand how all of these components can work together to help achieve data interoperability. This understanding, in turn, is essential to understanding the stack and software architecture, and its accompanying information architecture, needed to best achieve these interoperability objectives.

So, this current article lays out this conceptual framework of components and roles. Later articles in this series will address the specific questions of software and information architectural design.

Data Interoperability in Relation to Semantics

Semantic technologies give us the basis for understanding differences in meaning across sources, specifically geared to address differences in real world usage and context. These semantic tools are essential for providing common bases for relating structured data across various sources and contexts. These same semantic tools are also the basis by which we can determine what unstructured content “means”, thus providing the structured data tags that also enable us to relate documents to conventional data sources (from databases, spreadsheets, tables and the like). These semantic technologies are thus the key enablers for making information — unstructured, semi-structured and structured — understandable to both humans and machines across sources. Such understandings are then a key basis for powering the artificial intelligence applications that are now emerging to make our lives more productive and less routine.

For nearly a decade I have used an initial schema by Pluempitiwiriyawej and Hammer to elucidate the sources of possible semantic differences between content. Over the years I have added language and encoding differences to this schema. Most recently, I have updated this schema to specifically call out semantic heterogeneities due to either conceptual differences between sources (largely arising from schema differences) or value and attribute differences amongst actual data. I have further added examples of what each of these categories of semantic heterogeneities means [1].

This table of more than 40 sources of semantic heterogeneities clearly shows the possible impediments to getting data to interoperate across sources:

Class	Category	Subcategory	Examples	Type [2] [4]
LANGUAGE	Encoding	Ingest Encoding Mismatch	For example, ANSI v UTF-8 [3]	Concept
		Ingest Encoding Lacking	Mis-recognition of tokens because not being parsed with the proper encoding [3]	Concept
		Query Encoding Mismatch	For example, ANSI v UTF-8 in search [3]	Concept
		Query Encoding Lacking	Mis-recognition of search tokens because not being parsed with the proper encoding [3]	Concept
	Languages	Script Mismatch	Variations in how parsers handle, say, stemming, white spaces or hyphens	Concept
		Parsing / Morphological Analysis Errors (many)	Arabic languages (right-to-left) v Romance languages (left-to-right)	Concept
		Syntactical Errors (many)	Ambiguous sentence references, such as I’m glad I’m a man, and so is Lola (Lola by Ray Davies and the Kinks)	Concept
		Semantics Errors (many)	River bank v money bank v billiards bank shot	Concept
CONCEPTUAL	Naming	Case Sensitivity	Uppercase v lower case v Camel case	Concept
		Synonyms	United States v USA v America v Uncle Sam v Great Satan	Concept
		Acronyms	United States v USA v US	Concept
		Homonyms	Such as when the same name refers to more than one concept, such as Name referring to a person v Name referring to a book	Concept
		Misspellings	As stated	Concept
	Generalization / Specialization		When single items in one schema are related to multiple items in another schema, or vice versa. For example, one schema may refer to “phone” but the other schema has multiple elements such as “home phone,” “work phone” and “cell phone”	Concept
	Aggregation	Intra-aggregation	When the same population is divided differently (such as, Census v Federal regions for states, England v Great Britain v United Kingdom, or full person names v first-middle-last)	Concept
	Aggregation	Inter-aggregation	May occur when sums or counts are included as set members	Concept
	Internal Path Discrepancy		Can arise from different source-target retrieval paths in two different schemas (for example, hierarchical structures where the elements are different levels of remove)	Concept
	Missing Item	Content Discrepancy	Differences in set enumerations or including items or not (say, US territories) in a listing of US states	Concept
		Missing Content	Differences in scope coverage between two or more datasets for the same concept	Concept
		Attribute List Discrepancy	Differences in attribute completeness between two or more datasets	Attribute
		Missing Attribute	Differences in scope coverage between two or more datasets for the same attribute	Attribute
	Item Equivalence		When two types (classes or sets) are asserted as being the same when the scope and reference are not (for example, Berlin the city v Berlin the official city-state)	Concept
	Item Equivalence		When two individuals are asserted as being the same when they are actually distinct (for example, John Kennedy the president v John Kennedy the aircraft carrier)	Attribute
	Type Mismatch		When the same item is characterized by different types, such as a person being typed as an animal v human being v person	Attribute
	Constraint Mismatch		When attributes referring to the same thing have different cardinalities or disjointedness assertions	Attribute
DOMAIN	Schematic Discrepancy	Element-value to Element-label Mapping	One of four errors that may occur when attribute names (say, Hair v Fur) may refer to the same attribute, or when same attribute names (say, Hair v Hair) may refer to different attribute scopes (say, Hair v Fur) or where values for these attributes may be the same but refer to different actual attributes or where values may differ but be for the same attribute and putative value. Many of the other semantic heterogeneities herein also contribute to schema discrepancies	Attribute
		Attribute-value to Element-label Mapping		Attribute
		Element-value to Attribute-label Mapping		Attribute
		Attribute-value to Attribute-label Mapping		Attribute
	Scale or Units	Measurement Type	Differences, say, in the metric v English measurement systems, or currencies	Attribute
	Scale or Units	Units	Differences, say, in meters v centimeters v millimeters	Attribute
	Precision		For example, a value of 4.1 inches in one dataset v 4.106 in another dataset	Attribute
	Data Representation	Primitive Data Type	Confusion often arises in the use of literals v URIs v object types	Attribute
	Data Representation	Data Format	Delimiting decimals by period v commas; various date formats; using exponents or aggregate units (such as thousands or millions)	Attribute
DATA	Naming	Case Sensitivity	Uppercase v lower case v Camel case	Attribute
		Synonyms	For example, centimeters v cm	Attribute
		Acronyms	For example, currency symbols v currency names	Attribute
		Homonyms	Such as when the same name refers to more than one attribute, such as Name referring to a person v Name referring to a book	Attribute
		Misspellings	As stated	Attribute
	ID Mismatch or Missing ID		URIs can be a particular problem here, due to actual mismatches but also use of name spaces or not and truncated URIs	Attribute
	Missing Data		A common problem, more acute with closed world approaches than with open world ones	Attribute
	Element Ordering		Set members can be ordered or unordered, and if ordered, the sequences of individual members or values can differ	Attribute

Sources of Semantic Heterogeneities

Ultimately, since we express all of our content and information with human language, we need to start there to understand the first sources in semantic differences. Like the differences in human language, we also have differences in world views and experience. These differences are often conceptual in nature and get at what we might call differences in real world perspectives and experiences. From there, we encounter differences in our specific realms of expertise or concern, or the applicable domain(s) for our information and knowledge. Then, lastly, we give our observations and characterizations data and values in order to specify and quantify our observations. But the attributes of data are subject to the same semantic vagaries as concepts, in addition to their own specific challenges in units and measures and how they are expressed.

From the conceptual to actual data, then, we see differences in perspective, vocabularies, measures and conventions. Only by systematically understanding these sources of heterogeneity — and then explicitly addressing them — can we begin to try to put disparate information on a common footing. Only by reconciling these differences can we begin to get data to interoperate.

Some of these differences and heterogeneities are intrinsic to the nature of the data at hand. Even for the same putative topics, data from French researchers will be expressed in a different language and with different measurements (metric) than will data from English researchers. Some of these heterogeneities also arise from the basis and connections asserted between datasets, as misuse of the sameAs predicate shows in many linked data applications [5].

Fortunately, in many areas we are transitioning by social convention to overcome many of these sources of semantic heterogeneity. A mere twenty years ago, our information technology systems expressed and stored data in a multitude of formats and systems. The Internet and Web protocols have done much to overcome these sources of differences, what I’ve termed elsewhere as climbing the data federation pyramid [6]. Semantic Web approaches where data items are assigned unique URIs are another means of making integration easier. And, whether or not everyone agrees that it is culturally desirable, we are also seeing English become the lingua franca of research and data.

The point of the table above is not to throw up our hands and say there is just too much complexity in data integration. Rather, by systematically decomposing the sources of semantic heterogeneity, we can anticipate and accommodate those sources not yet being addressed by cultural or technological conventions. While there is a large number of categories of semantic heterogeneity, these categories are also patterned and can be anticipated and corrected. These patterned sources inform us about what kind of work must be done to overcome semantic differences where they still reside.
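
As one illustration of how a patterned heterogeneity can be anticipated and corrected, the naming categories in the table (case sensitivity, synonyms and acronyms) can be handled with a lookup over synonym sets. The following is a minimal Python sketch under stated assumptions: the `SYNSETS` entries and the function names `normalize` and `canonical_name` are hypothetical and exist only for this example.

```python
import unicodedata
from typing import Optional

# Hypothetical synonym sets keyed by a canonical label; the entries reuse
# examples from the table above and are illustrative only.
SYNSETS = {
    "United States": {"united states", "usa", "us", "america"},
    "centimeter": {"centimeter", "centimeters", "centimetre", "cm"},
}

# Invert the synsets once so that each lookup is a single dictionary access.
_LOOKUP = {variant: canonical
           for canonical, variants in SYNSETS.items()
           for variant in variants}


def normalize(label: str) -> str:
    """Fold Unicode form and case, trim whitespace, and drop periods."""
    text = unicodedata.normalize("NFKC", label).casefold().strip()
    return text.replace(".", "")


def canonical_name(label: str) -> Optional[str]:
    """Return the canonical label for a known variant, or None if unknown."""
    return _LOOKUP.get(normalize(label))
```

With these assumed entries, `canonical_name("U.S.A.")` and `canonical_name("america")` both resolve to the same canonical label, while an unlisted variant such as "Uncle Sam" returns `None` and would need to be added to a synonym set or resolved by other means.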

Work Components in Data Interoperability

The description logics that underlie the semantic Web already do a fair job of architecting this concept-attribute split in semantics. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts [7].

The semantic Web is a standards-based effort by the W3C (World Wide Web Consortium); many of its accomplishments have arisen around ontology and TBox-related efforts. Data integration has putatively been tackled from the perspective of linked data, but that methodology so far is short on attributes and property-mapping linkages between datasets and schema. There are as yet no reference vocabularies or schema for attributes [8]. Many of the existing linked data linkages are based on erroneous owl:sameAs assertions. It is fair to say that attribute and ABox-level semantics and interoperability have received scarce attention, even though the logic underpinnings exist for progress to be made.
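
To see why a single erroneous owl:sameAs assertion matters, recall that the predicate is symmetric and transitive, so identity propagates across every dataset that is linked through it. The sketch below is a simplified Python illustration, not an OWL reasoner: the function `same_as_clusters` and all of the identifiers are hypothetical, and the clusters are computed with a plain union-find.

```python
from collections import defaultdict


def same_as_clusters(assertions):
    """Group identifiers into the clusters implied by sameAs assertions.

    Because sameAs is symmetric and transitive, the clusters are the
    connected components of the assertion graph (found via union-find).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in assertions:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = defaultdict(set)
    for item in parent:
        clusters[find(item)].add(item)
    return list(clusters.values())


# Hypothetical identifiers; the third assertion is the erroneous one.
assertions = [
    ("ds1:JFK_president", "ds2:John_F_Kennedy"),
    ("ds3:USS_John_F_Kennedy", "ds4:CV-67"),
    ("ds2:John_F_Kennedy", "ds3:USS_John_F_Kennedy"),
]
clusters = same_as_clusters(assertions)
```

Without the third assertion there are two clusters, one for the president and one for the aircraft carrier. With it, all four identifiers collapse into a single cluster, so every attribute of the ship would be treated as an attribute of the person.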

This lack on the attributes or ABox side of things is a major gap in the work requirements for data interoperability, as we see from the table below. TBox development and understanding is quite good, and a number of reference ontologies are available upon which to ground conceptual mappings [9]. But the ABox third is largely missing grounding references. And the specialty work tasks, representing about the last third, need better effectiveness and tooling.

For both the TBox and the ABox we are able to describe and model concepts (classes), instances (individuals), and are pretty good at being able to model relationships (predicates) between concepts and individuals. We also are able to ground concepts and their relationships through a number of reference concept ontologies [9]. But our understanding of attributes (the descriptive properties of instances) remains poor and ungrounded. Best practices — let alone general practices — still remain to be discovered.

TBox (concepts)	Specialty Work Tasks	ABox (data)
Definitions of the concepts and properties (relationships) of the controlled vocabulary; Declarations of concept axioms or roles; Inferencing of relationships, be they transitive, symmetric, functional or inverse to another property; Equivalence testing as to whether two classes or properties are equivalent to one another; Subsumption, which is checking whether one concept is more general than another; Satisfiability, which is the problem of checking whether a concept has been defined (is not an empty concept); Classification, which places a new concept in the proper place in a taxonomic hierarchy of concepts; Logical implication, which is whether a generic relationship is a logical consequence of the declarations in the TBox; Infer property assertions implicit through the transitive property	Mappings are the core of interoperability in that concepts and attributes get matched across schema and datasets; Transformations are the means to bring disparate data into common grounds, the second leg of interoperability; Entailments, which are whether other propositions are implied by the stated condition; Instance checking, which verifies whether a given individual is an instance of (belongs to) a specified concept; Knowledge base consistency, which is to verify whether all concepts admit at least one individual; Realization, which is to find the most specific concept for an individual object; Retrieval, which is to find the individuals that are instances of a given concept; Identity relations, which is to determine the equivalence or relatedness of instances in different datasets; Disambiguation, which is resolving references to the proper instance	Membership assertions, either as concepts or as roles; Attributes assertions; Linkages assertions that capture the above but also assert the external sources for these assignments; Consistency checking of instances; Satisfiability checks, which are that the conditions of instance membership are met

Work Tasks for a Data Interoperability Framework
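
Several of the tasks in the table above (subsumption, instance checking, realization and retrieval) can be illustrated with a toy knowledge base. The following is a minimal Python sketch, not a description logic reasoner: the concept names, the individuals and the single-parent hierarchy are all assumptions made for illustration.

```python
# TBox: hypothetical concept hierarchy stored as child -> parent (subClassOf).
# Each concept has at most one parent, which keeps the sketch simple.
TBOX = {
    "Person": "Animal",
    "Animal": "Thing",
    "City": "Place",
    "Place": "Thing",
}

# ABox: hypothetical membership assertions, individual -> asserted concepts.
ABOX = {
    "lola": {"Person"},
    "berlin": {"City", "Place"},
}


def ancestors(concept):
    """Return the concept followed by everything that subsumes it."""
    chain = [concept]
    while concept in TBOX:
        concept = TBOX[concept]
        chain.append(concept)
    return chain


def subsumes(general, specific):
    """Subsumption: is `general` at least as general as `specific`?"""
    return general in ancestors(specific)


def is_instance(individual, concept):
    """Instance checking: does the individual belong to the concept?"""
    return any(subsumes(concept, c) for c in ABOX.get(individual, ()))


def realize(individual):
    """Realization: the most specific asserted concept(s) for an individual."""
    asserted = ABOX.get(individual, set())
    return {c for c in asserted
            if not any(o != c and subsumes(c, o) for o in asserted)}


def retrieve(concept):
    """Retrieval: all individuals that are instances of the concept."""
    return {i for i in ABOX if is_instance(i, concept)}
```

Here `realize("berlin")` keeps only `City`, because `Place` subsumes it, and `retrieve("Thing")` returns both individuals. Keeping the TBox and ABox in separate structures mirrors the split described above: the hierarchy can change without touching the instance assertions, and vice versa.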

Across the knowledge base (that is, the combination of the TBox and the ABox), the semantic Web has improved its search capabilities by formally integrating with conventional text search engines, such as Solr. Instance and consistency checking are pretty straightforward to do, but are often neglected steps in most non-commercial semantic installations. Critical areas such as mappings, transformations and identity evaluation remain weak work areas. This figure helps show these major areas and their work splits:

Work Splits Between the Semantic Web and AI

As we discussed earlier regarding the recent and rapid advances in artificial intelligence [10], the combination of knowledge bases and the semantic Web with AI machine learning (ML) and NLP techniques will show rapid improvements in data interoperability. Two stumbling blocks have been controlling: the lack of a framework and architecture for interoperability, and the lack of attribute groundings. Now that these factors are known and are being purposefully addressed, we should see rapid improvements, similar to other areas in AI.

This re-embedding of the semantic Web in artificial intelligence, coupled with the conscious attention to provide reference groundings for data interoperability, should do much to address what are current, labor-intensive stumbling blocks in the knowledge management workflow.

Putting Some Grown-up Pants on the Semantic Web

The semantic Web clearly needs to play a central role in data integration and interoperability. Fortunately, as we have seen in other areas [11], semantic technologies lend themselves to generic functional software that can be designed for re-use in almost any knowledge domain, chiefly by changing the data and ontologies guiding them. This means that reference libraries of groundings, mappings and transformations can be built over time and reused across enterprises and projects. Use of functional programming languages will also align well with the data and schema in knowledge management functions and ontologies and DSLs. These prospects parallel the emergence of knowledge-based AI (KBAI), which marries electronic Web knowledge bases with improvements in machine-learning algorithms.

The time for these initiatives is now. The complete lack of distributed data interoperability is no longer tolerable. High costs due to unacceptable manual efforts and too many failed projects plague the data interoperability efforts of the past. Data interoperability is no longer a luxury, but a necessity for enterprises needing to compete in a data-intensive environment. At scale, point-to-point integration efforts become ineffective; a form of reusable and transferable master data management (MDM) needs to emerge for the realities of Big Data, and one that is based on the open and standard protocols of the Web.

Much tooling and better workflows and user interfaces will need to emerge. But the critical aspects are the ones we are addressing now: information and software architectures; reference groundings and attributes; and education about these very real prospects near at hand. The challenge of data interoperability in cooperation with its artificial intelligence cousin is where the semantic Web will finally put on its Big Boy pants.

[1] See Charnyote Pluempitiwiriyawej and Joachim Hammer, 2000. A Classification Scheme for Semantic and Schematic Heterogeneities in XML Data Sources, Technical Report TR00-004, University of Florida, Gainesville, FL, 36 pp., September 2000. See https://cise.ufl.edu/tr/DOC/REP-2000-396.pdf. I first cited this report and extended it to cover languages (see [3]) in M.K. Bergman 2006. Sources and Classification of Semantic Heterogeneities, AI3:::Adaptive Information blog, June 6, 2006. See https://www.mkbergman.com/232/sources-and-classification-of-semantic-heterogeneities/. This most recent version added the examples and expanded the listing a bit further, to where it is no longer faithful to the original 2000 paper.

[2] Concept is the shorthand used for the schema or classes or TBox. Attribute is the shorthand used for instance data or entities and their ABox. I segregate class-relation properties (predicates) from instance-describing properties (attributes). This distinction is not used in standard TBox-ABox splits; its rationale will be described in a further article.

[3] See M.K. Bergman, 2006. Tutorial: Internet Languages, Character Sets and Encodings, BrightPlanet Corporation Technical Documentation, March 2006, 13 pp. See https://www.mkbergman.com/wp-content/themes/ai3v2/files/2006Posts/InternationalizationTutorial060323.pdf.

[4] See [7]. Also the TBox portion, or classes (concepts), is the basis of the ontologies. The ontologies establish the structure used for governing the conceptual relationships for that domain and in reference to external (Web) ontologies. The ABox portion, or instances (named entities), represents the specific, individual things that are the members of those classes. Named entities are the notable objects, persons, places, events, organizations and things of the world. Each named entity is related to one or more classes (concepts) to which it is a member. Named entities do not set the structure of the domain, but populate that structure. The ABox and TBox play different roles in the use and organization of the information and structure.

[5] M.K. Bergman 2009. When Linked Data Rules Fail, AI3:::Adaptive Information blog, November 16, 2009. See https://www.mkbergman.com/846/when-linked-data-rules-fail/.

[6] M.K. Bergman 2006. Climbing the Data Federation Pyramid, AI3:::Adaptive Information blog, May 25, 2006. See https://www.mkbergman.com/229/climbing-the-data-federation-pyramid/.

[7] M.K. Bergman 2008. Thinking ‘Inside the Box’ with Description Logics, AI3:::Adaptive Information blog, November 10, 2008. See https://www.mkbergman.com/466/thinking-inside-the-box-with-description-logics/.

[8] See the thread on the W3C semantic web mailing list beginning at http://lists.w3.org/Archives/Public/semantic-web/2014Jul/0129.html.

[9] Examples of upper-level ontologies include UMBEL, the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc and BFO (Basic Formal Ontology). Most of the content in their upper levels is more akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget’s Thesaurus) than to “generic common knowledge.” Almost all of them have both a hierarchical and networked structure, though their actual subject structure relating to concrete things is generally pretty weak. See further the Wikipedia entry on upper ontologies.

[10] M.K. Bergman 2014. Spring Dawns on Artificial Intelligence, AI3:::Adaptive Information blog, June 2, 2014. See https://www.mkbergman.com/1731/spring-dawns-on-artificial-intelligence/.

[11] M.K. Bergman 2011. Ontology-driven Apps Using Generic Applications, AI3:::Adaptive Information blog, March 7, 2011. See https://www.mkbergman.com/948/ontology-driven-apps-using-generic-applications/.

Posted: August 12, 2014

What is Big Structure?

Defining the Guideposts for Big Data

In our recent two-part series we described a decade of experience working in the semantic Web (Part I) and our view that Big Structure, which resides at the nexus of the semantic Web, knowledge bases and artificial intelligence, was a key component of making sense of Big Data going forward (Part II). We are at a time when multiple advances are conjoining to create new opportunities and excitement.

Data without context and relationships is meaningless. The idea of Big Data is powerful, but it is often presented as either a “good thing” in and of itself, or a mantra for something that is rather undefined. There is no doubt that with the Internet and the Web we are now able to generate and access data at unprecedented scale. There is also no question that tracking mechanisms and cheap storage — and simpler, large-scale databases and Web services — mean that we can also capture data and structure of natures previously unseen. Everyone knows the remarkable growth in exabytes and more.

The prospect of data everywhere — some useful with important context and some not — has clearly captured the current discussion. Heck, if we claim Big Data, we even make more in wage or consulting charge-out fees. Who can argue with that?

Well, actually, anyone interested in meaningful data or cross-dataset interoperability can argue with that. Big Data is great, except it means little if we cannot combine that data across multiple sources for potentially multiple purposes. (Remember, one of the “V’s” of Big Data is variability.) Once the question of what data means gets brought to the fore, it is time for context and relationships. Structure in an information context means that which situates or describes data in an interpretable way. Big Data needs a Big Structure complement to make sense of it all.

What is Big Structure?

Big Structure is data relationships and context that can be combined into a coherent framework to enable dataset interoperability and understanding. By necessity, Big Structure implies that the meaning of data can be understood and its values can be brought to common bases such that analysis, testing and validation can be applied across values. Big Structure is not a monolithic thing, but the combination of multiple things that give data meaning and context. As such, Big Structure is often a re-purposing of existing structural assets, plus other special sauce, organized for the aim of data interoperability.

The components of Big Structure can be identified and characterized. These components can be assessed for usefulness and authoritativeness, and then incorporated into broader structures that ultimately bring the topics of what the data is about and the values of that data into alignment. Thus, Big Structure is also a mindset and approach to selecting and combining structures such that broad dataset interoperability can be achieved.

Big Structure is actually a continuum or family of concept and data relationships, any one of which is also a contributor to helping to map and interoperate data. Ultimately, the components of Big Structure get combined into reference graph structures that place the concepts and actual data values of the Big Data into context. There are certain ways to use and organize existing structures to achieve these Big Structure objectives; some of these ways are described in this article.

Once the components of Big Structure are combined into these reference graphs we then can also use network or graph analysis to understand the relationships amongst the constituent data items. This recursive nature of graph reference structures to organize the constituent data and then to use those graphs to analyze the data is one of the hallmark characteristics of Big Structure.
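
One simple form of the graph analysis described above is finding how two data items relate through the reference concepts they share. The following Python sketch assumes a tiny, made-up reference graph; the dataset names, concept names and helper functions are hypothetical, and breadth-first search stands in for whatever network analysis a real system would use.

```python
from collections import deque

# Hypothetical reference graph: undirected edges between data items
# (lower case) and the reference concepts (capitalized) they map to.
EDGES = [
    ("sales_2013.csv", "Revenue"),
    ("crm_accounts", "Customer"),
    ("Revenue", "FinancialMeasure"),
    ("Customer", "Organization"),
    ("invoices_db", "Revenue"),
    ("invoices_db", "Customer"),
]


def build_graph(edges):
    """Build an adjacency map from an undirected edge list."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


graph = build_graph(EDGES)
path = shortest_path(graph, "sales_2013.csv", "crm_accounts")
```

In this assumed graph the two datasets are connected only through `invoices_db`, which maps to both `Revenue` and `Customer`. The same graph that organizes the data thus also tells us which dataset bridges the other two.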

Big Structure thus involves the need to identify and then organize constituent forms of structure into coherent reference frameworks. Concepts in contributing datasets are then mapped to these structures, and the attributes and values of the underlying data are also transformed into canonical representations. It is these mappings and transformations that provide the interoperability of Big Structure. Big Structure therefore continues to evolve by adding more and more reference structures, all coherently organized.
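
The two steps named above, mapping and transformation, can be sketched in a few lines. This Python example is illustrative only and rests on assumptions: the dataset and attribute names in `MAPPINGS` are invented, meters are chosen arbitrarily as the canonical unit for length, and `parse_number` handles only a decimal period or comma (not thousands separators).

```python
# Hypothetical mappings from (dataset, local attribute) to a reference
# attribute plus the unit in which the local values are recorded.
MAPPINGS = {
    ("survey_fr", "longueur"): ("length", "cm"),
    ("survey_uk", "len_in"): ("length", "in"),
    ("survey_de", "laenge"): ("length", "mm"),
}

# Conversion factors into the canonical unit for each reference attribute
# (meters for length, an assumption). The inch factor is exact by definition.
TO_CANONICAL = {
    "length": {"m": 1.0, "cm": 0.01, "mm": 0.001, "in": 0.0254},
}


def parse_number(text: str) -> float:
    """Accept either a period or a comma as the decimal separator."""
    return float(text.strip().replace(",", "."))


def to_canonical(dataset: str, attribute: str, raw_value: str):
    """Map a local attribute to its reference attribute and canonical value."""
    reference, unit = MAPPINGS[(dataset, attribute)]
    value = parse_number(raw_value) * TO_CANONICAL[reference][unit]
    return reference, round(value, 6)
```

For instance, `to_canonical("survey_fr", "longueur", "10,43")` and `to_canonical("survey_uk", "len_in", "4.106")` both yield a `length` in meters, so the two values can be compared directly. Keeping the mappings and conversion factors as data rather than code is what would let such tables be reused across projects.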

Contributors to Big Structure

Big Structure is a family of canonical reference structures that help guide mapping and interoperability. The table below lists some of the possible contributors to Big Structure [1], roughly in descending order as to the degree of structure and its contribution to interoperability. The table provides both definitions and use descriptions for each component, plus optionally some notes regarding coverage and use:

Structure Type	Definition	Use	Note
Reference ontologies	Major grounding structures for orienting and interoperating concepts or data	The reference concepts for orienting all data and domain information	[2]
Reference attributes	Major grounding structures for interoperating data and data characterizations	The reference relationships amongst data descriptions and characteristics, which also provides the means for transformations between heterogeneous representations	[3]
Data model (RDF)	A self-consistent means for describing the structure of data and their relationships	The “canonical” data model at the heart of the system; provides a single interoperability point; RDF is the canonical model used by Structured Dynamics for its Big Structures	[4]
Domain attributes	The data descriptions and characteristics for the constituent datasets in the applicable domain(s)	The reference attributes specific to the domain(s) at hand (which are generally more specific than general reference attributes)
Domain ontologies	The formal conceptualization of a domain, using a shared vocabulary to denote the types, properties and interrelationships of those concepts	The reference concepts and their relationships specific to the domain(s) at hand; generally are mapped to the reference ontologies	[5]
Concept maps	A diagram that depicts suggested relationships between concepts	Structurally similar to a domain ontology; a few related terms shown in Note	[6]
Schema	The structure of a database that defines the objects and relationships in that database	Organizing framework for relational databases (and their tables)	[7]
Mappings	The process of creating data element correspondences between two distinct data models or schema	Mapping predicates are used to relate concepts or attributes from two different datasets or knowledge bases to one another. Mappings are often a precursor to various transformations to bring data into a common representation	[8]
Taxonomies	A particular classification of related concepts, often of a hierarchical nature	Hierarchical relationships are expressed in narrower or broader terms (or subClassOf); may also include “see also” relationships	[9]
Facets	Clearly defined, mutually exclusive, and collectively exhaustive aspects, properties or characteristics of a class or specific subject	Facets can provide alternative ways for classifying objects beyond a single taxonomy
Categories	Grouping objects based on similar properties	A category may be viewed as equivalent to a concept	[10]
Tables	A collection of related data held in a structured format, generally a two-dimensional layout of rows (records) and columns (fields)	Simplest and most common data presentation format
Synsets	A group of data elements or terms that are considered semantically equivalent for the purposes of information retrieval	Also known as a “semset” in the parlance of UMBEL
Metadata	Data providing information about one or more aspects of the source data, thus “data about data”	It is the description of what data is about rather than the values and attributes of the actual data
Thesauri	A form of controlled vocabulary that seeks to dictate semantic manifestations of metadata in the indexing of content objects	A thesaurus is composed of a list of words (or terms), a vocabulary for relating these words (or terms) to one another, often hierarchically, and a set of rules on how to use these aspects
Gazetteers	A listing of similar entity types with associated structural data (such as countries and population or standard codes)	Often used in relation to people or place entity types, though any class of entities may have a gazetteer
Controlled vocabularies	The use of predefined, authorized terms as preselected by the sponsor to enforce consistency in terminology	Applied to specific domains or sub-domains, with single controlled vocabularies per official language used
Reference lists	Authoritative listings of similar objects, each uniquely identified by name or code	May be as simple as a comprehensive list of countries with associated ISO codes	[11]
Dictionaries	A repository of information about data such as meaning, relationships to other data, origin, usage, or format	In our context, can range from the meaning associated with standard word dictionaries to the more formal data dictionary
Glossaries	An alphabetical list of terms in a particular domain with the definitions for those terms	Definition is the only structured information provided
Nested lists	Related concepts or entities organized by some form of hierarchical relationship (narrower, broader, subClassOf, etc.)	Akin to a simple taxonomy
Ordered lists	A finite, ordered collection of values for a given type	May also be additional information linked to the listing
Clusters	A set of objects grouped according to some basis of similarity (type, attributes, or characteristics)	Basis for how the objects got clustered is not always obvious
Unordered lists	A container of similar items or entities, with no implied order or sequence	Also known as a “bag” or “collection”	[12]
Values	The actual data; a normal form or a type member	Basic QUDT ontologies could contribute here

An alternate way to look at these contributor structures is to characterize them with respect to degree of structure and degree of contributing to interoperability:

Structure v Interoperability

In general, as might be expected, the greater the degree of structure, the greater its potential contribution to interoperability. The components in the upper right quadrant represent the most structured and interoperable ones. These also conform most to the use of W3C standards for the RDF data model and the OWL ontology languages. Expressions of structure are codified and standardized. Use of best practices also ensures completeness and suitability as reference groundings for interoperability.

The lower left portions of the quadrant represent the least structure and interoperability. However, as standard reference means for characterizing and describing data, even structures in this quadrant can contribute to meeting Big Structure requirements. Tagging of documents (unstructured data) occurs in this less-sophisticated lower left quadrant, but it gives equal footing to 80% of the content that generally resides in text form. (The interoperability system is further enhanced when the basis of the tags is derived from the “semsets” of the reference and domain ontologies, another example of a best practice.)
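As a rough illustration of tagging documents from semsets, the sketch below assigns concept tags to a piece of text whenever one of a concept's equivalent labels appears in it. The concept identifiers (`ex:Mammal`, `ex:Earthquake`), the labels, and the `tag_document` helper are all hypothetical and chosen only for the example; a real system would draw its semsets from the reference and domain ontologies and would need proper NLP for disambiguation.

```python
import re

# Hypothetical semsets: each reference concept maps to the labels treated
# as equivalent for tagging purposes. Identifiers and labels are illustrative.
SEMSETS = {
    "ex:Mammal": {"mammal", "mammals", "mammalia"},
    "ex:Earthquake": {"earthquake", "quake", "seismic event"},
}


def tag_document(text, semsets=SEMSETS):
    """Return the sorted concept identifiers whose labels occur in the text."""
    normalized = text.lower()
    tags = set()
    for concept, labels in semsets.items():
        for label in labels:
            # Match whole words or phrases only, so "quake" does not match "quaker".
            if re.search(r"\b" + re.escape(label) + r"\b", normalized):
                tags.add(concept)
                break
    return sorted(tags)


doc = "A quake measuring 4.1 struck the region overnight."
print(tag_document(doc))
```

Because the tags are concept identifiers rather than raw strings, two documents that use different wording ("quake" versus "seismic event") end up with the same tag, which is what lets them be related to one another and to structured sources.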

All of the listed components can thus contribute to Big Structure. However, the completeness of that structure and its usefulness for interoperability increases as one progresses along the blue arrow of the Big Structure continuum. Data interoperability arises from the continued efforts to drive Big Structure to the upper right of this quadrant. As noted, Big Structure is a mindset and process rather than some finite state. As more concepts and attributes get grounded in standard references, the degree of Big Structure (and, thus, data interoperability) continues to increase.

The Foundation of Reference Groundings

In both semantics and artificial intelligence — and certainly in the realm of data interoperability — there is always the problem of symbol grounding. In the conceptual realm, symbol grounding means that when we use a term or phrase we are referring to the same thing; that is, the referent is the same. In the data value realm, symbol grounding means that when we refer to an object or a number — say, the number 4.1 — we are also referring to the same metric. 4.1 inches is not the same as 4.1 centimeters or 4.1 on the Richter scale, and object names for set member types also have the same challenges of ambiguous semantics as do all other things referred to by language.
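A minimal sketch of what grounding a data value might look like in practice: every length is converted to one reference unit before values are compared. The `TO_METERS` table and the `ground_length` function are assumptions made for this illustration, not part of any reference ontology; the conversion factors themselves are the standard definitions.

```python
# Conversion factors from a source unit to the reference unit (meters).
# The table and function names are hypothetical, for illustration only.
TO_METERS = {
    "inch": 0.0254,       # exact by international definition
    "centimeter": 0.01,
    "meter": 1.0,
}


def ground_length(value, unit):
    """Express a length in meters, refusing units that are not lengths."""
    if unit not in TO_METERS:
        raise ValueError(f"no length grounding for unit {unit!r}")
    return value * TO_METERS[unit]


# The same number, 4.1, refers to different quantities once it is grounded.
print(ground_length(4.1, "inch"))
print(ground_length(4.1, "centimeter"))
```

A value such as 4.1 on the Richter scale is deliberately rejected here: it is not a length at all, so no conversion factor can ground it in meters. It needs a different reference quantity.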

The variability V in Big Data or the 40-some dimensions of potential semantic heterogeneity [13] are explicit recognitions of the symbol grounding challenge. Assuming we can determine context (itself an important consideration not further discussed here), fixity of reference is essential to these groundings. Context and groundings are the ways by which we remove ambiguity in what we measure and record.

Like dictionaries for human languages, or stars and constellations for navigators, or agreed standards in measurement, or the Greenwich meridian for timekeepers, fixed references are needed to orient and “ground” each new dataset over which we attempt to integrate. Without such fixities of reference, everything floats in reference to other things, the cursed “rubber ruler” phenomenon.

Thus, we can express our Big Structure components from a foundational perspective as well. In Structured Dynamics’ view of the world, the foundation for data interoperability is grounded in reference structures or ontologies that provide the fixity of reference for concepts and data and their attributes. Upon these foundations are then constructed the domain views of concepts and attributes, which become the target for mapping other references and Big Structures:

Foundations to Big Structure

The mappings, transformations and domain and reference ontologies are themselves written in the OWL languages of the W3C and the standards of the RDF data model. At this most expressive end of Big Structure, the representations are in the form of graphs. Network and graph analytics will expand still further business intelligence prospects. The use of these standards with common and testable logic is another means to ensure coherency and interoperability of the Big Structure that results.

Note a key aspect of the grounding foundation is missing: one or more reference ontologies for attributes. Though many examples exist on the concept side, little has been done to explicitly address the questions of data value interoperability. This major gap is a current emphasis of Structured Dynamics, with much more to be said over the coming weeks. Also expect an open source reference ontology for attributes in the near future.

The thing is that we are learning how to make the various parts of this interoperability stack work. We are leveraging existing structural assets of all kinds to establish the semantics and infrastructure for domain interoperability. We know how to match and map these existing structural assets to the reference frameworks that are the foundation to interoperability.

A Vision of Interoperability

The real world is one of heterogeneous datasets, multiple schema and differing viewpoints. Even within single enterprises (including those that formerly expressed little need or interest to interoperate with the broader world), data integration and interoperability have been a real challenge. Big Data itself is not solving these problems. Quite the opposite. Big Data trends are turning data interoperability molehills into mountain-high competitive threats.

Like any well-built structure, data interoperability requires a solid foundation. That foundation must reside in exemplar reference ontologies upon which to ground the semantics and exchange standards for data. Using the canonical RDF data model makes this task practical. Existing information structures of various types across the enterprise and the Web all can and should play a role in establishing reference structures. The accretion of reference structures will lead to still further interoperability and the ability to incorporate more datasets. Currently expensive practices in, say, master data management (MDM) can begin to transition to a new paradigm. It is easy to envision working from a library of existing reference standards for use across enterprises. This kind of incremental expansion of interoperability leads to still more interoperable data in a virtuous cycle of innovation and lower budgets.

As our computing continues to get more virtual and cloud-like, physical hardware and software architectures must give way to information architectures (in the true sense of interoperability). We have no choice but to treat the architecting of information as a first-order challenge. The totally cool thing about the data integration challenge is that the architecture can be readily varied and tested to achieve a working foundation. Much empirical information exists about how to do it and what to do next. The chief challenge has been to recognize that data interoperability, and its dependence on Big Structure, is a first-order concern (and opportunity). The intersection of Big Structure with Big Data, and with graph and AI algorithms, should create new approaches to chew across the data integration environment. I expect progress to be rapid.

[1] There are at least 40 terms or concepts across these various disciplines, most related to Web and general knowledge content, that have organizational or classificatory aspects that — loosely defined — could be called an “ontology” framework or approach. See M.K. Bergman, 2007. An Intrepid Guide to Ontologies, AI3:::Adaptive Information blog, May 16, 2007.

[2] UMBEL and other upper level ontologies are examples here. In the case of UMBEL, that Big Structure is used as a scaffolding of reference concepts for linking external (unrelated) structures, to help interoperate data between two unrelated systems. Such a Big Structure can also be used for other tasks, such as helping machine learning techniques categorize and disambiguate pieces of data by leveraging such a structure of types.

[3] Unfortunately, no reference structures for attributes yet exist. For a discussion of this status, see the thread on the W3C semantic web mailing list beginning at http://lists.w3.org/Archives/Public/semantic-web/2014Jul/0129.html.

[4] Data models encompass a rather broad span. The RDF discussion represents a more formal end of the data model spectrum, wherein there is complete logic, syntax and serialization discussions, more involved than most data models.

[5] Domain ontologies represent the most closely-aligned view of the domain and its relationships of all of the component structures listed.

[6] Concept maps are very closely related to ontologies, and may include topic maps, mind maps and other graph-like structures of concepts.

[7] Schema may apply to many realms, but in the IT and software context schema mostly refers to database schema related to relational databases. These are often expressed in UML diagrams or XML schema.

[8] Mappings and transformations are a huge area of diverse structure and different serializations and specifications. Fortunately, the task of mapping external structure to RDF removes the many-to-many issues with most transformation approaches.

[9] Taxonomies mask entire sub-categories of directories, folksonomies, subject trees, and more. The key aspect is that relevant concepts are expressed in a graph relationship manner to other concepts, often in a hierarchical fashion.

[10] Categories also include the general classification process.

[11] I would consider a canonical reference listing of country names and codes to be a part of Big Structure, since it acts as a controlled vocabulary.

[12] This is a key area for including unstructured documents, since tags are a primary means of adding metadata to a document. When the pool of tags is based on the governing reference and domain ontologies, then interoperability is further promoted.

[13] M.K. Bergman, 2006. Sources and Classification of Semantic Heterogeneities, AI3:::Adaptive Information blog, June 6, 2006.

Posted:July 23, 2014

Big Structure: At The Nexus of Knowledge Bases, the Semantic Web and Artificial Intelligence

Envisioning A New Adaptive Infrastructure for Data Interoperability

In Part I of this two-part series, Fred Giasson and I looked back over a decade of working within the semantic Web and found it partially successful but really the wrong question moving forward. The inadequacies of the semantic Web to date reside in its lack of attention to practical data interoperability across organizational or community boundaries. An emphasis on linked data has created an illusion that questions of data integration are being effectively addressed. They are not.

Linked data is hard to publish and not the only useful form for consuming data; linked data quality is often unreliable; the linking predicates for relating disparate data sources to one another may be inadequate or wrong; and, there are no reference groundings for relating data values across datasets. Neither the semantic Web nor linked data has developed the practices, tooling or experience to actually interoperate data across the Web. These criticisms are not meant to condemn linked data — it is, after all, the early years. Where it is compliant and from authoritative information sources, linked data can be a gold standard in data publishing. But, linked data is neither necessary nor essential, and may even be a diversion if it sucks the air from the room for what is more broadly useful.

This table summarizes the state-of-art in the semantic Web for frameworks and guidance in how to interoperate data:

Category	Related Terms	Status in the Semantic Web	Notes
Classes	sets, concepts, topics, types, kinds	Mature, but broader scope coverage desirable; equivalent linkages between datasets often mis-applied; more realistic proximate linkages in flux, with no bases to reason over them	[1]
Instances	individuals, entities, members, records, things	Current basis for linked data; many linkage properties mis-applied	[2]
Relation Properties	relations, predicates	Equivalent linkages between datasets often mis-applied; more realistic proximate linkages in flux, with no bases to reason over them.	[3]
Descriptive Properties	attributes, descriptors	Save for a couple of minor exceptions, no basis for mapping attributes across datasets	[4]
Values	data	Basic QUDT ontologies could contribute here	[5]

We can relate the standard subject – predicate – object triple statement in RDF to this table, using the Category column. Classes and Instances relate to the subjects, Relation and Descriptive Properties relate to the predicate, and Values relate to the object [6] in an RDF triple. The concepts and class schema of different information sources (their “aboutness”) can reasonably be made to interoperate. In terms of the description logics that underlie the logic bases of W3C ontologies, the focus and early accomplishments of the semantic Web have been on this “terminological box” or T-Box [7]. Tooling to make the mappings more productive and means to test the coherence and completeness of the results still remain as priority efforts, but the conceptual basis and best practices have progressed pretty well.
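To make the correspondence between the table's categories and the triple positions concrete, here is a small sketch using plain Python tuples rather than an RDF library. The `ex:` identifiers, the `Literal` and `Triple` types, and the `property_kind` helper are hypothetical; only `rdf:type`, `rdfs:subClassOf` and `xsd:decimal` are real terms from the W3C vocabularies. It assumes the simple rule from note [6]: if the object is a data value the property is descriptive, and if the object refers to another class or instance the property is a relation.

```python
from typing import NamedTuple, Union


class Literal(NamedTuple):
    """A data value with its datatype, as opposed to a reference to a thing."""
    value: object
    datatype: str


class Triple(NamedTuple):
    subject: str                # a class or an instance
    predicate: str              # a relation or descriptive property
    obj: Union[str, Literal]    # another class or instance, or a value


TRIPLES = [
    Triple("ex:Mammal", "rdfs:subClassOf", "ex:Animal"),            # class to class
    Triple("ex:rex", "rdf:type", "ex:Mammal"),                      # instance to class
    Triple("ex:rex", "ex:livesIn", "ex:Montana"),                   # instance to instance
    Triple("ex:rex", "ex:weightKg", Literal(31.5, "xsd:decimal")),  # instance to value
]


def property_kind(triple):
    """Classify the predicate by what the object position holds."""
    return "descriptive" if isinstance(triple.obj, Literal) else "relation"


for t in TRIPLES:
    print(t.subject, t.predicate, property_kind(t))
```

In this toy data only the last statement carries an actual data value, which is the part of the triple where, as the table indicates, mapping guidance is weakest.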

In contrast, nearly lacking in focus and tooling has been the flip side of that description logics coin: the A-Box [7], or assertional and instance (data) level of the equation. Both the T-Box and A-Box are necessary to provide a knowledge base. Today, there are virtually no vocabularies, no tooling, no history, no best practices and no “grounding” for actual A-Box data integration within the semantic Web. Without such guidance, the semantic Web is silent on the questions of data interoperability. As David Karger explained in his keynote address at ESWC in 2013 [8], “we’ve got our heads in the clouds while people are stuck in the dirt.”

Yet these are not fatal flaws of the semantic Web, nor are they permanent. Careful inspection of current circumstances, combined with purposeful action, suggests:

Data integration can be solved
Leveraging background knowledge is a key enabler
Interoperability requires reference structures, what we are calling Big Structure.

The Prism of Data Interoperability

Why do we keep pointing to the question of data interoperability? Consider these facts:

80% of all available information is in text or documents (unstructured)
40% of standard IT project expenses are devoted to data integration in one form or another, due to the manual effort needed for data migration and mapping
Information volumes are now doubling in fewer than two years
Other trends including smartphones and sensors are further accelerating information growth
Effective business intelligence requires the use of quality, integrated data.

The abiding, costly, frustrating and energy-sucking demands of data integration have been a constant within enterprises for more than three decades. The same challenges reside for the Web. The Internet of Things will further demand better interoperability frameworks and guidelines. Current data integration tooling relies little upon semantics and no leading alternative is based principally around semantic approaches [9].

The data integration market is considered to include enterprise data integration and extract, transform and load (ETL) vendors. Gartner estimates tool sales for this market to be about $2 billion annually, with a growth rate faster than most IT areas [10]. But data integration also touches upon broader areas such as enterprise application integration (EAI), federated search and query, and master data management (MDM), among others. Given that data integration is also 40% of standard IT project costs, new approaches are needed to finally unblock the costly logjam of enterprise information integration. Most analysts see firms that are actively pursuing data integration innovations as forward-thinking and more competitive.

Data integration is combining information from multiple sources and providing users a uniform view of it. Data interoperability is being able to exchange and work upon (inter-operate) information across system and organizational boundaries. The ability to integrate data precedes the ability to interoperate it. For example, I may have three datasets of mammals that I want to consolidate and describe in similar terms with common units of measurement. That is an example of data integration. I may then want to relate this mammal knowledge base with a more general perspective of the animal kingdom. That is an example of data interoperability. Data integration usually occurs within a single organization or enterprise or institutional offering (as would be, say, Wikipedia). Data interoperability additionally needs to define meanings and communicate them in common ways across organizational, domain or community boundaries.
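The mammal example can be sketched roughly as follows: three sources describe the same kind of thing with different field names and units, and a per-source mapping brings each record into common terms. The source names, field names and the `integrate` helper are all invented for this illustration; the pound-to-kilogram factor is the standard definition.

```python
# Hypothetical mappings, one per source:
# source field -> (common field, factor to the common unit, or None for text).
FIELD_MAPS = {
    "source_a": {"name": ("common_name", None), "mass_kg": ("weight_kg", 1.0)},
    "source_b": {"species": ("common_name", None), "weight_lb": ("weight_kg", 0.45359237)},
    "source_c": {"label": ("common_name", None), "mass_g": ("weight_kg", 0.001)},
}


def integrate(source, record):
    """Rewrite one source record using the common field names and units."""
    mapping = FIELD_MAPS[source]
    common = {}
    for field, value in record.items():
        target, factor = mapping[field]
        common[target] = value if factor is None else value * factor
    return common


records = [
    ("source_a", {"name": "gray wolf", "mass_kg": 40.0}),
    ("source_b", {"species": "red fox", "weight_lb": 14.0}),
    ("source_c", {"label": "common shrew", "mass_g": 10.0}),
]

integrated = [integrate(source, record) for source, record in records]
for row in integrated:
    print(row)
```

Note that the mappings here are written by hand for each source. Interoperability, as distinct from integration, would require the common terms (`common_name`, `weight_kg`) to be grounded in shared reference structures so that other organizations could map to them as well.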

These are natural applications for the semantic Web. Why, then, has there not been more practical use of the semantic Web for these purposes?

That is an interesting question that we only partially addressed in Part I of this series. All aspects of data have semantics: what the data is about, what its context is, how it relates to other data, and what its values are and what they mean. The semantic Web is closely allied with natural language processing, an essential for bringing the 80% of unstructured data into the equation. Semantic Web ontologies are useful structures for how to relate real-world data into common, reference forms. The open world logic of the semantic Web is the right perspective for knowledge functions under the real-world conditions of constantly expanding information and understandings.

While these requirements suggest an integral role for the semantic Web, it is also clear that the semantic Web has not yet made these contributions. One explanation may be that semantic Web advocates, let alone the linked data tribe, have not seen data integration — as traditionally defined — as their central remit. Another possibility is that trying to solve data interoperability through the primary lens of the semantic Web is the wrong focus. In any case, meeting the challenge of data interoperability clearly requires a much broader context.

Embedding Data Interoperability Into a Broader Context

The semantic Web, in our view, is properly understood as a sub-domain of artificial intelligence. Semantic technologies mesh smoothly with natural language tasks and objectives. But, as we noted in a recent review article, artificial intelligence is itself undergoing a renaissance [11]. These advances are coming about because of the use of knowledge-based AI (KBAI), which combines knowledge bases with machine learning and other AI approaches. Natural language and spoken interfaces combined with background knowledge and a few machine-language utilities are what underlie Apple’s Siri, for example.

The realization that the semantic Web is useful but insufficient and that AI is benefitting from the leveraging of background knowledge and knowledge bases caused us to “decompose” the data-interoperability information space. Because artificial intelligence is a key player here, we also wanted to capture all of the main sub-domains of AI and their relationships to one another:

Artificial Intelligence Domains

Two core observations emerge from standing back and looking at these questions. First, many of AI’s main sub-domains have a role to play with respect to data integration and interoperability:

AI and Data Interoperability

AI Domains Related to Data Interoperability

This places semantic Web technologies as a co-participant with natural language processing, knowledge mining, pattern recognizers, KR languages, reasoners, and machine learning as domains related to data interoperability.

And, second, generalizing the understanding of knowledge bases and other guiding structures in this space, such as ontologies, highlights the potential importance of Big Structure. Virtually every one of the domains displayed above would be aided by leveraging background knowledge.

Grounding Data Interoperability in Big Structure

As our previous AI review showed [11], reference knowledge bases — Wikipedia in the forefront — have been a tremendous boon to moving forward on many AI challenges. Our own experience with UMBEL has also shown how reference ontologies can help align and provide common grounding for mapping different information domains into one another [12]. Vetted, gold-standard reference structures provide a fixity of coherent touchpoints for orienting different concepts and domains (and, we believe, data) to one another.

In the data integration context, master data models (and management, or MDM) attempt to provide common reference terms and objects to aid the integration effort. Like other areas in conventional data integration, very few examples of MDM tools based on semantic technologies exist.

This use of reference structures and the importance of knowledge bases to help solve hard computational tasks suggests there may be a general principle at work. If ontologies can help orient domain concepts, why can’t they also be used to orient instance data and their attributes? In fact, must these structures always be ontologies? Are not other common reference structures such as taxonomies, vocabularies, reference entity sets, or other typologies potentially useful to data integration?

By standing back in this manner and asking these broader questions we can see a host of structures like reference concepts, reference attributes, reference places, reference identifiers, and the like, playing the roles of providing common groundings for integration and interoperation. Through the AI experience, we can also see that subsequent use of these reference structures — be they full knowledge bases or more limited structures like taxonomies or typologies — can further improve information extraction and organization. The virtuous circle of knowledge structures improving AI algorithms, which can then further improve the knowledge structures, has been a real Aha! moment for the artificial intelligence community. We should see rapid iterations of this virtuous circle in the months to come.

These perspectives can help lead to purposeful designs and approaches for attacking such next-generation problems as data interoperability. The semantic Web can not solve this alone because additional AI capabilities need to be brought to bear. Conventional data integration approaches that lack semantic Big Structure groundings — let alone the use of AI techniques — have years of history of high cost and disappointing results. No conventional enterprise knowledge management problem appears sheltered from this whirlwind of knowledge-backed AI.

At Structured Dynamics, Fred Giasson and I have been discussing “Big Structure” for some time. However, it was only in researching this article that I came across the first public use of this phrase in the context of AI and big data. In May, Dr. Jiawei Han, a leading researcher in data mining, gave a lecture at Yahoo! Labs entitled, Big Data Needs Big Structure. In it, he defines “Big Structure as a type information network.” The correlation with ontologies and knowledge structures is obvious.

An Emerging Development Agenda

The intellectual foundations already exist to move aggressively on a focused development agenda to improve the infrastructure of data interoperability. This emerging agenda needs to look to new reference structures, better tooling, the use of functional languages and practices, and user interfaces and workflows that improve the mappings that are the heart of interoperability.

Big Structure, such as UMBEL for referencing what data is about, is the present exemplar for going forward. Excellent reference and domain ontologies for common domains already exist. Mapping predicates have been developed for these purposes. Though creation of the maps is still laborious, tooling improvements (see below) should speed up that process as well.

What is next needed are reference structures to help guide attributes mappings, data value mappings, and transformations into usable common attribute quantities and types. I will discuss in a later post our more detailed thoughts of what a reference gold-standard attribute ontology should look like. This new Big Structure should also be helpful in guiding conversion, transformation and “lifting” utilities that may be used to bring attribute values from heterogeneous sources into a common basis. As mappings are completed, these too can become standard references as the bootstrapping continues.

Mappings for data integration across the scales, scope and growth of data volumes on the Web and within enterprises can no longer be handled manually. Semi-automated tooling must be developed and refined that operates over large volumes with acceptable performance. Constant efforts to reduce the data volumes requiring manual curation are essential; AI approaches should be incorporated into the virtuous iterations to reduce these efforts. Meanwhile, attentiveness to productive user interfaces and efficient workflows is also essential to improve throughput.

Further, by working off of standards-based Big Structures, this tooling can be made more-or-less generic, with ready application to different domains and different data. Because this tooling will often work in enterprises behind firewalls, standard enterprise capabilities (security, access, preservation, availability) should also be added to this infrastructure.

These Big Structures and tools should themselves be created and maintained via functional programming languages and DSLs specifically geared to the circumstances at hand. We want languages suited to RDF and AI purposes with superior performance across large mapped datasets and unstructured text. But we also want languages that are easier to use and maintain by knowledge workers themselves. Partitioning strategies may also need to be employed to ensure acceptable real-time feedback to users responsible for data integration mappings.

A New Adaptive Infrastructure for Data Interoperability

Structured Dynamics’ review exercise, now documented in this two-part series, affirms the semantic Web needs to become re-embedded in artificial intelligence, backed by knowledge bases, which are themselves creatures of the semantic Web. Coupling artificial intelligence with knowledge bases will do much to improve the most labor-intensive stumbling blocks in the data integration workflow: mappings and transformations. Through a purposeful approach of developing reference structures for attributes and data values, we will begin to see marked improvements in the efficiency and lower costs of data integration. In turn, what is learned by using these approaches for mastering MDM will teach the semantic Web much.

An approach using semantic technologies and artificial intelligence tools will begin to solve the data integration puzzle. By leveraging background knowledge, we will begin to extend into data interoperability. Purposeful attention to tooling and workflows geared to improve the mapping speed and efficiency by users will enable us to increase the stable of reference structures — that is, Big Structure — available for the next integration challenges. As this roster of Big Structures increases, they can be shared, allowing more generic issues of data integration to be overcome, freeing domains and enterprises to target what is unique.

Achieving this vision will not occur overnight. But, based on a decade of semantic Web experience and the insights being gained from today’s knowledge-based AI advances, the way forward looks pretty clear. We are entering a fundamental new era of knowledge-based computation. We welcome challenging case examples that will help us move this vision forward.

NOTE: This Part II concludes the series that began with Part I, A Decade in the Trenches of the Semantic Web

[1] Using semantic ontologies can work and has worked well for many domains and applications, such as the biomedical OBO ontologies, IBM’s Watson, Google’s Knowledge Graph, and hundreds in more specific domains. Combined with concept reference structures like UMBEL, both building blocks and exemplars exist for how to interoperate across what different domains are about.

[2] For examples of issues, see M. K. Bergman, 2009. When Linked Data Rules Fail, AI3:::Adaptive Information blog, November 16, 2009.

[3] Some of these options are overviewed by M. K. Bergman, 2010. The Nature of Connectedness on the Web, AI3:::Adaptive Information blog, November 22, 2010.

[4] See the thread on the W3C semantic web mailing list beginning at http://lists.w3.org/Archives/Public/semantic-web/2014Jul/0129.html.

[5] See QUDT – Quantities, Units, Dimensions and Data Types Ontologies, Retrieved July 22, 2014.

[6] The object may also refer to another class or instance, in which case the relation property takes the form of an ObjectProperty and the “value” is the URI referring to that object.

[7] See, for example, M. K. Bergman, 2009. Making Linked Data Reasonable Using Description Logics, Part 2, AI3:::Adaptive Information blog, February 15, 2009.

[8] See David Karger, 2013. Keynote at the European Semantic Web Conference Part 1: The State of End User Information Management, June 5, 2013.

[9] Info-Tech Research Group, 2011. Vendor Landscape Plus: Data Integration Tools, 72 pp.

[10] Gartner estimates that the data integration tool market was slightly over $2 billion at the end of 2012, an increase of 7.4% from 2011. This market is seeing an above-average growth rate of the overall enterprise software market, as data integration continues to be considered a strategic priority by organizations. See Eric Thoo, Ted Friedman, Mark A. Beyer, 2013. Magic Quadrant for Data Integration Tools, research Report G00248961 from Gartner, Inc., 17 July 2013; see: http://www.gartner.com/technology/reprints.do?id=1-1HBEFSF&ct=130717&st=sb

[11] See M. K. Bergman, 2014. Spring Dawns on Artificial Intelligence, AI3:::Adaptive Information blog, June 2, 2014.

[12] See M. K. Bergman, 2011. In Search of ‘Gold Standards’ for the Semantic Web, AI3:::Adaptive Information blog, February 28, 2011.

Posted:July 16, 2014

A Decade in the Trenches of the Semantic Web

Are We Losing the War? Was it Even the Right One?

Cinephiles will readily recognize Akira Kurosawa’s Rashomon film of 1951. And, in the 1960s, one of the most popular book series was Lawrence Durrell’s The Alexandria Quartet. Both, each in its own way, tried to get at the question of what is truth by telling the same story from the perspective of different protagonists. Whether you saw this movie or read these books you know the punchline: the truth was very different depending on the point of view and experience (including self-interest and delusion) of each protagonist. All of us recognize this phenomenon of the blind men’s view of the elephant.

I have been making my living and working full time on the semantic Web and semantic technologies now for a full decade. So has my partner at Structured Dynamics, Fred Giasson. Others have certainly worked longer in this field. The original semantic Web article appeared in Scientific American in 2001 [1], and the foundational Resource Description Framework data model dates from 1999. Fred and I have our own views of what has gone on in the trenches of the semantic Web over this period. We thought a decade was a good point to look back, share what we’ve experienced, and discover where to point our next offensive thrusts.

What Has Gone Well?

The vision of the semantic Web in the Scientific American article painted a picture of globally interconnected data leveraged by agents or bots designed to make our lives easier and more automated. However, by the time that I got directly involved, nearly five years after standards first started to be published, Tim Berners-Lee and many leading proponents of RDF were beginning to shift focus to linked data. The agents, and automation, and ontologies of the initial vision were being downplayed in favor of effective means to publish and consume data based on RDF. In many ways, linked data resembled a re-branding.

This break had been coming for a while, memorably captured by a 2008 ISWC session led by Peter F. Patel-Schneider [2]. This internal division of viewpoint likely caused effort to be split that would have been better spent in proselytizing and improving tools. It also diverted attention into internal squabbles. While many others have pointed to the tactical mistake of using an XML serialization for early versions of RDF as a key factor in slowing initial adoption, a factor I agree was at play, my own suspicion is that the philosophical split taking place in the community was the heavier burden.

Whatever the cause, many of the hopes of the heady days of the initial vision have not been attained over the past fifteen years, though there have been notable successes.

The biomedical community has been the shining exemplar for data interoperability across an entire discipline, with earth sciences, ecology and other science-based domains also showing interoperability success [3]. Families of ontologies accompanied by tooling and best practices have characterized many of these efforts. Sadly, though, most other domains have not followed suit, and commercial interoperability is nearly non-existent.

Almost all of the remaining success has resided in single-institution data integration and knowledge representation initiatives. IBM’s Watson and Apple’s Siri are two amazing capabilities run and managed by single institutions, as is Google’s Knowledge Graph. Also, some individual commercial and government enterprises, willing to pay for support from semantic technology experts, have shown success in data integration, using RDF, SKOS and OWL.

We have seen the close kinship between natural language, text, and Q & A with the semantic Web, also demonstrated by Siri and more recent offshoots. We have seen a trend toward pairing great-performing open source text engines, notably Solr, with RDF and triple stores. Recommendation systems have shown some success. Linked data publishing has also had some notable examples, including the first of the lot, DBpedia, with certain institutional publishers (such as the Library of Congress, Eurostat, The Getty, Europeana, OpenGLAM [galleries, archives, libraries, and museums]) showing leadership and the commitment of significant vocabularies to linked data form.

On the standards front, early experience led to new and better versions of the SPARQL query language (SPARQL 1.1 was greatly improved in the last decade and appears to be one capability that sells triple stores), RDF 1.1 and OWL 2. Certain open source tools have become prominent, including Protégé, Virtuoso (open source) and Jena (among unnamed others, of course). At least in the early part of this history, tool development was rapid and flourishing, though the innovation pace has dropped substantially according to my tracking database Sweet Tools.

What Has Disappointed?

My biggest disappointments have been, first, the complete lack of distributed data interoperability, and, second, the lack or inability of commercial enterprises to embrace and adopt semantic technologies on their own. The near absence of discussion about instance records and their attributes helps frame the current maturity of the semantic Web. Namely, it has yet to crack the real nuts of data integration and interoperability across organizations. Again, with the exception of the biomedical community, neither in the linked data realm nor in the broader semantic Web, can we point to information based on semantic Web principles being widely shared between systems and organizations.

Some in the linked data community have explicitly acknowledged this. The abstract for the upcoming COLD 2014 workshop, for example, states [4]:

. . . applications that consume Linked Data are not yet widespread. Reasons may include a lack of suitable methods for a number of open problems, including the seamless integration of Linked Data from multiple sources, dynamic discovery of available data and data sources, provenance and information quality assessment, application development environments, and appropriate end user interfaces.

We have written about many issues with linked data, ranging from the use of improper mapping predicates; to the difficulty in publishing; and to dereferencing URIs on the Web since they are sparse and not always properly implemented [5]. But ultimately, most linked data is just instance data that can be represented in simpler attribute-value form. By shunning a knowledge representation language (namely, OWL) at the processing end, we have put too much burden on what are really just instance records. Linked data does not get the balance of labor right. It ignores the reality that data consumers want actionable information over being able to click from data item to data item, with overall quality reduced to the lowest common denominator. If a publisher has the interest and capability to publish quality linked data, great! It should become part of the data ingest pool and the data becomes easy to consume. But to insist on linked data across the board creates unnecessary barriers. Linked data growth has not nearly kept pace with broader structured data growth on the Web [6].
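To make the attribute-value point concrete, here is a minimal Python sketch of how instance triples fold into plain attribute-value records. The example.org identifiers, the sample data, and the helper name triples_to_records are all invented for illustration; this is not drawn from any linked data toolkit.

```python
from collections import defaultdict

# Hypothetical instance triples (subject, predicate, object); the
# example.org identifiers are invented for this illustration.
TRIPLES = [
    ("http://example.org/id/berlin", "name", "Berlin"),
    ("http://example.org/id/berlin", "population", 3400000),
    ("http://example.org/id/berlin", "country", "http://example.org/id/germany"),
    ("http://example.org/id/germany", "name", "Germany"),
]


def triples_to_records(triples):
    """Group triples by subject into attribute-value records.

    Repeated predicates for one subject are collected into a list so
    that no value is silently overwritten.
    """
    records = defaultdict(dict)
    for subject, predicate, obj in triples:
        record = records[subject]
        if predicate in record:
            existing = record[predicate]
            if isinstance(existing, list):
                existing.append(obj)
            else:
                record[predicate] = [existing, obj]
        else:
            record[predicate] = obj
    return dict(records)


RECORDS = triples_to_records(TRIPLES)
for subject, attributes in RECORDS.items():
    print(subject, attributes)
```

Each subject becomes one record keyed by its identifier, which is roughly the shape most data consumers already work with. What the records do not carry is the schema knowledge needed to relate them, and that is the work we argue belongs at the processing end.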

At the enterprise level, the semantic technology stack is hard to grasp and understand for newcomers. RDF and OWL awareness and understanding are nearly nil in companies without prior semantic Web experience, or 99.9% of all companies. This is not a failure of the enterprises; it is the failure of us, the advocates and suppliers. While we (Structured Dynamics) have developed and continue to refine the turnkey Open Semantic Framework stack, and have spent more effort than most in documenting and explicating its use, the systems are still too complicated. We combine complicated content management systems as user front-ends to a complicated semantic technology stack that needs to be driven by a complicated (to develop) ontology. And we think we are doing some of the best technology transfer around!

Moreover, while these systems are good at integrating concepts and schema, they are virtually silent on the question of actual data integration. It is shocking to say, but the semantic Web has no vocabularies or tools sufficient to enable data items for the same entity from two different datasets to be combined or reconciled [7]. These issues can be solved within the individual enterprise, but again the system breaks when distributed interoperability is the desire. General Web-based inconsistencies, such as in HTML coding or mime types, impose hurdles on distributed interoperability. These are some of the reasons why we see the successes in the context (generally) of single institutions, as opposed to anything that is truly yet Web-wide.

These points, as is often the case with software-oriented technologies, come down to a disappointing state of tooling. Markets drive developer interest, and market share has been disappointing; thus, fewer tools. Tool interest comes from commercial engagements, and not generally grants, the major source of semantic Web funding, particularly in the European Union. Pragmatic tools that solve real problems in user adoption are rarely a sufficient basis for getting a Ph.D.

The weaknesses in tooling extend from basic installation, to configuration, unit and integrated tests, data conversion and lifting, and, especially, all things ontology. Weaknesses in ontology tooling include (critically) mapping, consistency and coherency checking, authoring, managing, version control, re-factoring, optimization, and workflows. All of these issues are solvable; they are standard software challenges. But it is hard to conquer markets largely with the wrong army pursuing the wrong objectives in response to the wrong incentives.

Yet, despite the weaknesses in tooling, we believe we have been fairly effective in transferring technology to our clients. It takes more documentation and more training and, often, accompanying tool development or improvement in the workflow areas critical to the project. But clients need to be told this as well. In these still early stages, successful clients are going to have to expend more staff effort. With reasonable commitment, it is demonstrable that an enterprise can take over and manage a large-scale semantic engagement on its own. Still, for semantic technologies to have greater market penetration, it will be necessary to lower those commitments.

How Has the Environment Changed?

Of course, over the period of this history, the environment as a whole has changed markedly. The Web today is almost unrecognizable from the Web of 15 years ago. If one assumes that Web technologies tend to have a five year or so period of turnover, we have gone through at least two to three generations of change on the Web since the initial vision for the semantic Web.

The most systemic changes in this period have been cloud computing and the adoption of the smartphone. These, plus the network of workstations approach to data centers, have radically changed what is desirable in a large-scale, distributed architecture. APIs have become RESTful and database infrastructures have become flatter and more distributed. These architectures and their supporting infrastructure — such as virtual servers, MapReduce variants, and many applications — have in turn opened the door to performant management of large volumes of flat (key-value or graph) data, or big data.

On the Web side, JavaScript, just a few years older than the semantic Web, is now dominant in Web pages and taking on server-side roles (such as through Node.js). In turn, JSON has now grown in popularity as a form of data representation and transfer and is being adapted to the semantic Web (along with codifying CSV). Mobile, too, affects the Web side because of the need for multiple-platform deployments, touchscreen use, and different user interface paradigms and layout designs. The app ecosystem around smartphones has become a huge source for change and innovation.

Extremely germane to the semantic Web (indeed, to artificial intelligence overall) has been the emergence of knowledge-based AI (KBAI). The marrying of electronic Web knowledge bases (such as Wikipedia, or internal ones like the Google search index or its Knowledge Graph) with improvements in machine-learning algorithms is systematically mowing down what used to be called the Grand Challenges of computing. Sensors are also now entering the picture, from our phones to our homes and our cars, exposing the higher-order requirement for data integration combined with semantics. NLP kits have improved in terms of accuracy and execution speed; many semantic tasks such as tagging or categorizing or questioning already perform at acceptable levels for most projects.

On the tooling side, nearly all building blocks for what needs to be done next are available in open source, with some platform areas quite functional (including OSF, of course). We have also been successful in finding clients that agree to open source the development work we do for them, since they are benefiting from the open source development that went on before them.

What Did We Set Out to Achieve?

When Structured Dynamics entered the picture, there were already many tools available and core languages had been released. Our view of the world at that time led us to adopt two priorities for what we thought might be a five year or so plan. We have achieved the objectives we set for ourselves then, though it has taken us a couple of years longer to realize.

One priority was to develop a reference structure for concepts to serve as a “grounding” basis for relating datasets, vocabularies, schema, taxonomies, or ontologies. We achieved this with our first commercial release (v 1.00) of UMBEL in February 2011. Subsequent to that we have progressed to v 1.05. In the coming months we will see two further major updates that have been under active effort for about eight months.

The other priority was to create a turnkey foundation for a semantic enterprise. This, too, has been achieved, with many more releases. The Open Semantic Framework (OSF) is now in version 3.00, backed by a 500-article training documentation and technical wiki. Support tooling now includes automated installation, testing, and data transfer and synchronization.

Because our corporate objectives were largely achieved it was time to look at lessons learned and set new directions. This article, in part, is a result of that process.

How Did Our Priorities Evolve Over the Decade?

I thought it would be helpful to use the content of this AI3 blog to track how concerns and priorities changed for me and Structured Dynamics over this history. Since I started my blog quite soon after my entry into the semantic Web, the record of my perspectives is contemporaneous and rather complete.

The 50 articles below trace my evolution in knowledge and skills, as well as a progression from structured data to the semantic Web. They represent about 11% of all articles in my chronological archive; they were selected as being the most germane to the question of evolution of the semantic Web.

After early ramp up, most of the formative discussion below occurred in the early years. Posts have declined most recently as implementation has taken over. Note most of the links below have PDFs available from their main pages.

2014

Innovation, Information, Growth and Wealth – information fuels innovation that creates wealth
Spring Dawns on Artificial Intelligence – massive trends are waking artificial intelligence (AI) from its dark winter

2013

Seven Arguments for Semantic Technologies – a re-cap and summary of prior writings

2012

The Age of the Graph – the ubiquitous and fundamental roles of graph structures
The Rationale for Semantic Technologies – most refined arguments to date
What is Structure? – structure is information, and information is structure
The Trouble with Memes – the role of Shannon’s information theory to adaptive information
Give Me a Sign: What Do Things Mean on the Semantic Web? – nature of meaning and representation; influence of Charles S Peirce

2011

Making the Argument for Semantic Technologies – five unique advantages for the enterprise
In the Midst of an Evolutionary Explosion – artificial intelligence (AI) finally begins to flex its muscles
Leveraging Intangible Assets Using Semantic Technologies – in a knowledge economy, the value of intangible assets exceeds tangible ones
Democratizing Information with Semantics – putting the IT function into the hands of users, the knowledge workers
Ontology-Driven Apps Using Generic Applications – the technology is here to stand software engineering on its head
Seeking a Semantic Web Sweet Spot – making the argument for reference structures
Making Connections Real – role of knowledge bases in guiding connections
Declining IT Innovation in the Enterprise – innovation is shifting to the consumer sector

2010

What is a Reference Concept? – information interoperability requires some fixed reference points
The Nature of Connectedness on the Web – the reality is most connections are proximate
A Reference Guide to Ontology Best Practices – as stated
I Have Yet to Metadata I Didn’t Like – the real world issue is not how to publish data, but how to consume and curate it
An Executive Intro to Ontologies – first recommended resource for learning about ontologies
‘Pay as You Benefit’: A New Enterprise IT Strategy – using incremental, low-risk and open approaches to adopting semantics
Changing IT for Good – how to transition the enterprise to semantic technologies
Seven Pillars of the Open Semantic Enterprise – fundamental overview of the semantic enterprise; one of my most cited articles

2009

The Open World Assumption: Elephant in the Room – the fundamental importance of open world reasoning to knowledge applications
Ontology-driven Applications Using Adaptive Ontologies – using ontologies as the central governing structures for semantic technologies
When Linked Data Rules Fail – questioning the limits of linked data as often practiced
The Law of Linked Data – quantifies the benefits from interconnecting data
Fresh Perspectives on the Semantic Enterprise – first description of how semantic technologies can fit within the enterprise
Structure the World – an integrative view of how native data forms can integrate into the semantic Web
structWSF: A Framework for Collaboration Networks – how architectural design can promote collaboration and distributed semantic Web
Advantages and Myths of RDF – my definitive piece on RDF (Resource Description Framework)
Making Linked Data Reasonable using Description Logics, Part 3 – generalizing how native data forms can interact with the semantic Web
Making Linked Data Reasonable using Description Logics, Part 2 – how to split work between the TBox, ABox and specialty services
Back to the Future with Description Logics – first description of the importance of keeping schema (TBox) separate from instance records (ABox)

2008

Thinking Inside the Box with Description Logics – beginning to explicate the logic underneath W3C semantic technologies
WOA: A New Enterprise Partner for Linked Data – the importance of architecture for emerging solutions
When is Content Coherent? – the essential metric of ‘coherence’ to semantic vocabularies
The Semantic Web and Industry Standards – early understandings of the Semantic Web

2007

A Data Model of Web Data Models: Part I – good structural overview, now 7 years old!
What is the Structured Web? – explication of the structured Web as an intermediate way point to the semantic Web
Announcing UMBEL: A Lightweight Subject Structure for the Web – first announcement for the UMBEL reference concept structure
Where are the Road Signs for the Structured Web? – first identification that the semantic Web is missing a system of reference concepts
Structure Paves the Way to the Semantic Web – invited guest editorial in IEEE Intelligent Systems
There’s Not Yet Enough Backbone – a suitable subject structure for organizing knowledge is needed
Structurizing the Web with RDF – first rather comprehensive piece on the benefits of RDF data model
Did You Blink? The Structured Web Just Arrived – April 2007; first external piece on DBpedia

2006

Sources and Classification of Semantic Heterogeneities – still one of the best primers on heterogeneous data
Climbing the Data Federation Pyramid – puts Internet and semantic Web into context

2005

Open Source and the ‘Business Ecosystem’ – the importance of keystone influencers and partnerships to build technology ecosystems

The early years of this history were concentrated on gathering background information and getting educated. The release of DBpedia in 2007 showed how knowledge bases would become essential to the semantic Web. We also identified that a lack of shared reference concepts was making it difficult to “ground” different semantic Web datasets or schema to one another. Another key theme was the diversity of native data structures on the Web, but also how all of them could be readily represented in RDF.

By 2008 we began to study the logical underpinnings to the semantic Web as we were coming to understand how it should be practiced. We also began studying Web-oriented architectures as key design guidance going forward. These themes continued into 2009, though now informed by clients and applications, which were expanding our understanding of requirements (and, sometimes, shortcomings) in the enterprise marketplace. The importance of an open world approach to the basic open nature of knowledge management was cementing a clarity of the role and fit of semantic solutions in the overall information space. The general community shift to linked data was beginning to surface worries.

2010 marked a shift for us to become more of a popularizer of semantic technologies in the enterprise, useful to attract and inform prospects. The central role of ontologies as the guiding structures (either as codified knowledge structures or as instruction sets for the platform) for OSF opened realizations that generic functional software could be designed that can be re-used in most any knowledge domain by simply changing the data and ontologies guiding them. This increased our efforts in ontology tooling and training, now geared more to the knowledge worker. The importance of groundings for aligning schema and data caused us to work hard on UMBEL in 2011 to get it to a commercial release state.

All of these efforts were converging on design thoughts about the nature of information and how it is signified and communicated. The bases of an overall philosophy regarding our work emerged around the teachings of Charles S Peirce and Claude Shannon. Semantics and groundings were clearly essential to convey accurate messages. Simple forms, so long as they are correct, are always preferred over complex ones because message transmittal is more efficient and less subject to losses (inaccuracies). How these structures could be represented in graphs affirmed the structural correctness of the design approach. The now obvious re-awakening of artificial intelligence helps to put the semantic Web in context: a key subpart, but still a subset, of artificial intelligence. The percentage of formative articles directly related to the semantic Web drops considerably over these last couple of years, as the emphasis continues to shift to tech transfer.

What Else Did We Learn?

Not all lessons learned warranted an article on their own. So, we have also reflected on what other lessons we learned over this decade. The overall theme is: Simpler is better.

Distributed data interoperability across the Web is a fundamental weakness. There are no magic tricks to integrate data. Data mapping and integration will always require massaging. Each data integration activity needs its own solution. However, it can be greatly helped with ontologies and with better tooling.

In keeping with the lesson of grounding, a reference ontology for attributes is missing. It is needed as a bridge across disparate datasets describing similar entities or with different attributes for the same entities. It is also a means to reduce the pairwise combinatorial issue of integrating multiple datasets. And, whatever is done in the data integration area, an open world approach will be essential given the nature of knowledge information.
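The pairwise combinatorial issue can be shown with simple arithmetic. The sketch below rests on an assumption not spelled out above: one bidirectional mapping per pair of datasets in the direct case, and one mapping per dataset to the shared reference structure in the grounded case. The function names are illustrative only.

```python
def pairwise_mappings(n_datasets: int) -> int:
    """Mappings needed when every dataset is mapped directly to every other.

    Assumes one bidirectional mapping per pair: n * (n - 1) / 2.
    """
    return n_datasets * (n_datasets - 1) // 2


def hub_mappings(n_datasets: int) -> int:
    """Mappings needed when each dataset is mapped once to a shared reference."""
    return n_datasets


for n in (3, 10, 50):
    print(f"{n} datasets: {pairwise_mappings(n)} pairwise vs {hub_mappings(n)} via a reference")
```

Under these assumptions, direct mapping grows quadratically with the number of datasets while mapping through a reference grows only linearly, so the difference becomes material once more than a handful of sources are involved.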

There is good design and best practice for distributed architectures. The larger these installations become, the more important it is to use a lightweight, loosely-coupled design. RESTful Web services and their interfaces are key. Simpler services with fewer functions can be designed to complement one another and increase throughput effectiveness.

Functional programming languages align well with the data and schema in knowledge management functions. Ontologies, as structures, also fit well with functional languages. The ability to create DSLs should continue to improve bringing the knowledge management function directly into the hands of its users, the knowledge workers.

In a broader sense, alluded to above, the semantic Web is but a set of concepts. There are multiple ways to use it. It can be leveraged without requiring “core” semantic Web tools such as triple stores. Solr can act as a semantic store because semantics, NLP and search are naturally married. But, the semantic Web, in turn, needs to become re-embedded in artificial intelligence, now backed by knowledge bases, which are themselves creatures of the semantic Web.

Design needs to move away from linked data or the semantic Web as the goals. The building blocks are there, though perhaps not yet combined or expressed well. The real improvements now to the overall knowledge function will result from knowledge bases, artificial intelligence, and the semantic Web working together. That is the next frontier.

Overall, we perhaps have been in the wrong war for the wrong reasons. Linked data is certainly not an end and mostly appears to represent work, rather than innovation. The semantic Web is no longer the right war, either, because improvements there will not come so much from arguing semantic languages and paradigms. Learning how to master distributed data integration will teach the semantic Web much, and coupling artificial intelligence with knowledge bases will do much to improve the most labor-intensive stumbling blocks in the knowledge management workflow: mappings and transformations. Further, these same bases will extend the reach into analytical and statistical realms.

The semantic Web has always been an infrastructure play to us. On that basis, it will be hard to ever judge market penetration or dominance. So, maybe in terms of a vision from 15 years ago the growth of the semantic Web has been disappointing. But, for Fred and me, we are finally seeing the landscape clearly and in perspective, even if from a viewpoint that may be different from others’. From our vantage point, we are at the exciting cusp of a new, broader synthesis.

NOTE: This is Part I of a two-part series. Part II will appear shortly.

[1] Tim Berners-Lee, James Hendler, and Ora Lassila, “The Semantic Web,” in Scientific American 284(5): pp 34-43, 2001. See http://www.scientificamerican.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21&catID=2.

[2] For those with a spare 90 minutes or so, you may also want to view this panel session and debate that took place on “An OWL 2 Far?” at ISWC ’08 in Karlsruhe, Germany, on October 28, 2008. The panel was chaired by Peter F. Patel-Schneider (Bell Labs, Alcatel-Lucent) with the panel members of Stefan Decker (DERI Galway), Michel Dumontier (Carleton University), Tim Finin (University of Maryland) and Ian Horrocks (University of Oxford), with much audience participation. See http://videolectures.net/iswc08_panel_schneider_owl/

[3] Open Biomedical Ontologies (OBO) is an effort to create controlled vocabularies for shared use across different biological and medical domains. As of 2006, OBO formed part of the resources of the U.S. National Center for Biomedical Ontology (NCBO). As of the date of this article, there were 376 ontologies listed on the NCBO’s BioOntology site. Both OBO and BioOntology provide tools and best practices.

[4] Fifth International Workshop on Consuming Linked Data (COLD 2014), co-located with the 13th International Semantic Web Conference (ISWC) in Riva del Garda, Italy, October 19-20.

[5] See https://www.mkbergman.com/category/linked-data/.

[6] See https://www.mkbergman.com/1713/fyn-v-ft/.

[7] See the thread on the W3C semantic web mailing list beginning at http://lists.w3.org/Archives/Public/semantic-web/2014Jul/0129.html.

Posted:July 10, 2014

50 Ontology Mapping and Alignment Tools

More Than 20 Are Currently Active and Often in Open Source

I have been periodically tracking ontology tools for some time now (also as contained on the Open Semantic Framework wiki). Recent work caused me to update the listing in the ontology matching/mapping/alignment area. Ontology alignment is important once one attempts to integrate across multiple knowledge bases. Steady progress in better performance (precision and recall) has been occurring, though efforts may have plateaued somewhat. Shvaiko and Euzenat have a good report on the state of the art in ontology alignment.
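For readers less familiar with how matching performance is scored, here is a minimal sketch of precision, recall and F-measure for a proposed alignment judged against a reference alignment. The correspondences and the function name are made up for illustration, and this is not the OAEI evaluation code; real evaluations also handle relation types and confidence values, which are ignored here.

```python
def evaluate_alignment(proposed, reference):
    """Return (precision, recall, f_measure) for a proposed alignment.

    Both arguments are iterables of (source_entity, target_entity) pairs.
    """
    proposed, reference = set(proposed), set(reference)
    correct = proposed & reference
    precision = len(correct) / len(proposed) if proposed else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical reference (gold standard) and tool output.
REFERENCE = {
    ("a:Person", "b:Human"),
    ("a:City", "b:Town"),
    ("a:Org", "b:Organization"),
    ("a:Author", "b:Writer"),
}
PROPOSED = {
    ("a:Person", "b:Human"),
    ("a:City", "b:Town"),
    ("a:Org", "b:Place"),  # an incorrect correspondence
}

precision, recall, f_measure = evaluate_alignment(PROPOSED, REFERENCE)
print(f"precision={precision:.3f} recall={recall:.3f} f-measure={f_measure:.3f}")
```

Precision penalizes wrong correspondences and recall penalizes missed ones. The two usually trade off against one another, which is why the F-measure is often reported as a single summary figure.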

There has been a formalized activity on ontology alignment going back to 2003. This OAEI (Ontology Alignment Evaluation Initiative) has evolved to include formal tests and datasets, and annual evaluations and bake-offs. Over the years, various tools have come and gone, and some have evolved through multiple versions. Some are provided in source or with online demos; others are research efforts with no testable code.

As far as I know, no one has kept a current and comprehensive listing of these tools and their active status (though the Ontology Matching site does have an outdated list). Please accept the listing below as one attempt to redress this gap.

I welcome submissions of new (unlisted) tools, particularly those that are still active and available for download. There are surely gaps in what is listed below. Also, expect some new tools and updated results to be forthcoming from OAEI 2014 as reported at the Ontology Mapping workshop at ISWC in October.

Besides the tapering improvement in performance, other notable trends in ontology matching include ways to optimize multiple scoring methods and using background knowledge to help guide alignments.
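As a rough illustration of what combining multiple scoring methods can look like, the sketch below blends a character-level similarity (Python's standard difflib.SequenceMatcher) with a token-overlap (Jaccard) score over entity labels. The weights, the threshold, the labels and all function names are assumptions chosen for the example. It does not reproduce the method of any tool listed below, and it leaves out background knowledge entirely.

```python
import re
from difflib import SequenceMatcher


def string_score(a: str, b: str) -> float:
    """Character-level similarity of two labels, case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def tokens(label: str) -> set:
    """Split a label such as 'PostalAddress' or 'postal_address' into word tokens."""
    return {t.lower() for t in re.findall(r"[A-Za-z][a-z]*|\d+", label)}


def token_score(a: str, b: str) -> float:
    """Jaccard overlap of the word tokens in two labels."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def combined_score(a: str, b: str, weights=(0.5, 0.5)) -> float:
    """Weighted blend of the two scores; the equal weights are an arbitrary choice."""
    return weights[0] * string_score(a, b) + weights[1] * token_score(a, b)


def match(source_labels, target_labels, threshold=0.6):
    """Pair each source label with its best-scoring target, if above the threshold."""
    matches = []
    if not target_labels:
        return matches
    for source in source_labels:
        best = max(target_labels, key=lambda target: combined_score(source, target))
        score = combined_score(source, best)
        if score >= threshold:
            matches.append((source, best, round(score, 3)))
    return matches


# Hypothetical class labels from two small ontologies.
for pair in match(["PostalAddress", "Person", "Vehicle"],
                  ["postal_address", "person", "Organization"]):
    print(pair)
```

In practice the weights and threshold would be tuned against a reference alignment, and further scorers (structural, instance-based, or knowledge-base lookups) could be added to the blend in the same way.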

Active, Often with Code

The Alignment API is an API and implementation for expressing and sharing ontology alignments. The set of correspondences between entities (e.g., classes, objects, properties) in ontologies is called an alignment. The API provides a format for expressing alignments in a uniform way. The goal of this format is to be able to share on the web the available alignments. The format is expressed in RDF, so it is freely extensible. The Alignment API itself is a Java description of tools for accessing the common format. It defines four main interfaces (Alignment, Cell, Relation and Evaluator)
AgreementMakerLight is an automated and efficient ontology matching system derived from AgreementMaker
Blooms is a tool for ontology matching. It utilizes information from the Wikipedia category hierarchy and from the web to identify subclass relationships between entities. See also its Wiki page
CODI (Combinatorial Optimization for Data Integration) leverages terminological structure for ontology matching. The current implementation produces mappings between concepts, properties, and individuals. CODI is based on the syntax and semantics of Markov logic and transforms the alignment problem to a maximum-a-posteriori optimization problem
COMA++ is a schema and ontology matching tool with a comprehensive infrastructure. Its graphical interface supports a variety of interactions
Falcon-AO (Finding, aligning and learning ontologies) is an automatic ontology matching tool that includes the three elementary matchers of String, V-Doc and GMO. In addition, it integrates a partitioner PBM to cope with large-scale ontologies* hMAFRA (Harmonize Mapping Framework) is a set of tools supporting semantic mapping definition and data reconciliation between ontologies. The targeted formats are XSD, RDFS and KAON
GOMMA is a generic infrastructure for managing and analyzing life science ontologies and their evolution. The component-based infrastructure utilizes a generic repository to uniformly and efficiently manage many versions of ontologies and different kinds of mappings. Different functional components focus on matching life science ontologies, detecting and analyzing evolutionary changes and patterns in these ontologies
HerTUDA is a simple, fast ontology matching tool, based on syntactic string comparison and filtering of irrelevant mappings. Despite its simplicity, it outperforms many state-of-the-art ontology matching tools
Karma is an information integration tool to integrate data from databases, spreadsheets, delimited text files, XML, JSON, KML and Web APIs. Users integrate information according to an ontology of their choice using a graphical user interface that automates much of the process. Karma learns to recognize the mapping of data to ontology classes and then uses the ontology to propose a model that ties together these classes
KitAMO is a tool for evaluating ontology alignment strategies and their combinations. It supports the study, evaluation and comparison of alignment strategies and their combinations based on their performance and the quality of their alignments on test cases. Based on the SAMBO project
The linked open data enhancer (LODE) framework is a set of integrated tools that allow digital humanists, librarians, and information scientists to connect their data collections to the linked open data cloud. It can be applied to any domain with RDF datasets
LogMap is a highly scalable ontology matching system with ‘built-in’ reasoning and diagnosis capabilities. LogMap can deal with semantically rich ontologies containing tens (and even hundreds) of thousands of classes
MapOnto is a research project aiming at discovering semantic mappings between different data models, e.g., database schemas, conceptual schemas, and ontologies. So far, it has developed tools for discovering semantic mappings between database schemas and ontologies as well as between different database schemas. The Protege plug-in is still available, but appears to be for older versions
MatchIT automates and facilitates schema matching and semantic mapping between different Web vocabularies. MatchIT runs as a stand-alone or plug-in Eclipse application and can be integrated with popular third party applications. MatchIT uses Adaptive Lexicon™ as an ontology-driven dictionary and thesaurus of English language terminology to quantify and rank the semantic similarity of concepts. It apparently is not available in open source
OntoM is one component of the OntoBuilder, which is a comprehensive ontology building and managing framework. OntoM provides a choice of mapping and scoring methods for matching schema
The Ontology Mapping Tool (OMT) is an Eclipse plug-in that is part of the Web Service Modeling Toolkit (WSMT), designed to offer support for the semi-automatic creation of ontology mappings. OMT offers a set of features such as multiple ontology perspectives, mapping contexts, suggestions, bottom-up and top-down mapping strategies
Optima is a state of the art general purpose tool for performing ontology alignment. It automatically identifies and matches relevant concepts between ontologies. The tool is supported by an intuitive user interface that facilitates the visualization and analysis of ontologies in N3, RDF and OWL and the alignment results. It is an open source ontology alignment framework. Optima is also available as a plugin to the Protégé ontology editor
PARIS is a system for the automatic alignment of RDF ontologies. PARIS aligns not only instances, but also relations and classes. Alignments at the instance level cross-fertilize with alignments at the schema level
S-Match takes any two tree-like structures (such as database schemas, classifications, lightweight ontologies) and returns a set of correspondences between those tree nodes which semantically correspond to one another
ServOMap is an ontology matching tool based on information retrieval techniques, relying on the ServO system. To run it, please follow the directions described at http://oaei.ontologymatching.org/2012/seals-eval.html
The Silk framework is a tool for discovering relationships between data items within different Linked Data sources. Data publishers can use Silk to set RDF links from their data sources to other data sources on the Web. While designed for mapping instance data, it can also be used for schema
YAM++ ((not) Yet Another Matcher) is a flexible and self-configuring ontology matching system for discovering semantic correspondences between entities (i.e., classes, object properties and data properties) of ontologies. The YAM++ 2013 version is a significant improvement over previous versions. See also the 2013 results. The code is apparently not available.
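
To make the common idea behind these tools a bit more concrete, here is a minimal Python sketch of what a matcher produces and how its output can be scored. It is only an illustration under stated assumptions: the `Cell` class loosely echoes the interface names mentioned for the Alignment API but is not that (Java) API, the function names `match_labels` and `precision_recall` are hypothetical, and the string comparison uses Python's standard `difflib` as a stand-in for the syntactic matchers described above rather than the method of any listed tool.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Cell:
    """One correspondence: two entities, a relation, and a confidence (hypothetical)."""
    entity1: str
    entity2: str
    relation: str
    measure: float


def normalize(label: str) -> str:
    # Lowercase and keep only letters and digits, so that
    # "Postal_Code" and "postal code" compare as equal.
    return "".join(ch for ch in label.lower() if ch.isalnum())


def similarity(a: str, b: str) -> float:
    # Syntactic similarity in [0, 1] from the standard library.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_labels(source, target, threshold=0.85):
    """For each source label, keep its best target if the score passes the threshold."""
    cells = []
    if not target:
        return cells
    for s in source:
        best = max(target, key=lambda t: similarity(s, t))
        score = similarity(s, best)
        if score >= threshold:  # filter out weak candidates
            cells.append(Cell(s, best, "=", round(score, 3)))
    return cells


def precision_recall(found, reference):
    """Score found cells against a reference set of (entity1, entity2) pairs."""
    found_pairs = {(c.entity1, c.entity2) for c in found}
    correct = found_pairs & reference
    precision = len(correct) / len(found_pairs) if found_pairs else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    source = ["Person", "Postal_Code", "Organisation"]
    target = ["person", "postal code", "Organization", "Vehicle"]
    for cell in match_labels(source, target):
        print(cell)
```

The threshold is the main design choice here: a higher value trades recall for precision, which is the same balance the OAEI evaluations report for real matchers.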

Not Apparently in Active Use

ASMOV (Automated Semantic Mapping of Ontologies with Validation) is an automatic ontology matching tool which has been designed in order to facilitate the integration of heterogeneous systems, using their data source ontologies
The AMW (ATLAS Model Weaver) is a tool for establishing relationships (i.e., links) between models. The links are stored in a model, called a weaving model
Chimaera is a software system that supports users in creating and maintaining distributed ontologies on the web. Two major functions it supports are merging multiple ontologies together and diagnosing individual or multiple ontologies
ConcepTool is a system to model, analyse, verify, validate, share, combine, and reuse domain knowledge bases and ontologies, reasoning about their implication
CMS (CROSI Mapping System) is a structure matching system that capitalizes on the rich semantics of the OWL constructs found in source ontologies and on its modular architecture that allows the system to consult external linguistic resources
ConRef is a service discovery system which uses ontology mapping techniques to support different user vocabularies
DRAGO reasons across multiple distributed ontologies interrelated by pairwise semantic mappings, with a vision of peer-to-peer mapping of many distributed ontologies on the Web. It is implemented as an extension to the open source Pellet OWL reasoner
DSSim is an agent-based ontology matching framework; neither application nor source code appears to be available
FOAM is the Framework for ontology alignment and mapping. It is based on heuristics (similarity) of the individual entities (concepts, relations, and instances)
HMatch is a tool for dynamically matching distributed ontologies at different levels of depth. In particular, four different matching models are defined to span from surface to intensive matching, with the goal of providing a wide spectrum of metrics suited for dealing with many different matching scenarios that can be encountered in comparing concept descriptions of real ontologies
IF-Map is an Information Flow based ontology mapping method. It is based on the theoretical grounds of logic of distributed systems and provides an automated streamlined process for generating mappings between ontologies of the same domain
LILY is a system matching heterogeneous ontologies. LILY extracts a semantic subgraph for each entity, then it uses both linguistic and structural information in semantic subgraphs to generate initial alignments. The system is presently in a demo version only
MAFRA Toolkit – the Ontology MApping FRAmework Toolkit allows users to create semantic relations between two (source and target) ontologies, and apply such relations in translating source ontology instances into target ontology instances
Malasco is an ontology matching system for matching large-scale OWL ontologies. It can use different partitioning algorithms and existing matching tools
myOntology is used to produce the theoretical foundations and deployable technology for the Wiki-based, collaborative and community-driven development and maintenance of ontologies, instance data and mappings
OLA stands for OWL Lite Alignment. This is the name of a method for computing alignments between two OWL (not necessarily Lite) ontologies
OntoEngine is a step toward allowing agents to communicate even though they use different formal languages (i.e., different ontologies). It translates data from a “source” ontology to a “target”
OntoMerge serves as a semi-automated nexus for agents and humans to find ways of coping with notational differences between ontologies with overlapping subject areas
The OWL-CTXMATCH application is a Java 5-compliant implementation of the OWL-CTXMATCH algorithm. Besides the Java platform, it requires additional libraries and an external data source, WordNet 2.0
OLA/OLA2 (OWL-Lite Alignment) matches ontologies written in OWL. It relies on a similarity combining all the knowledge used in entity descriptions. It also deals with one-to-many relationships and circularity in entity descriptions through a fixpoint algorithm
OWLS-MX is a hybrid semantic Web service matchmaker. OWLS-MX 1.0 utilizes both description logic reasoning, and token based IR similarity measures. It applies different filters to retrieve OWL-S services that are most relevant to a given query
Potluck is a Web-based user interface that lets casual users—those without programming skills and data modeling expertise—mash up data themselves. Potluck is novel in its use of drag and drop for merging fields, its integration and extension of the faceted browsing paradigm for focusing on subsets of data to align, and its application of simultaneous editing for cleaning up data syntactically. Potluck also lets the user construct rich visualizations of data in-place as the user aligns and cleans up the data
PRIOR+ is a generic and automatic ontology mapping tool, based on propagation theory, information retrieval techniques and an artificial intelligence model. The approach utilizes both linguistic and structural information of ontologies, and measures the profile similarity and structure similarity of different elements of ontologies in a vector space model (VSM)
RiMOM (Risk Minimization based Ontology Mapping) integrates different alignment strategies: edit-distance based strategy, vector-similarity based strategy, path-similarity based strategy, background-knowledge based strategy, and three similarity-propagation based strategies
SAMBO is a system that assists a user in aligning and merging two ontologies in OWL format. The user performs an alignment process with the help of alignment suggestions proposed by the system. The system carries out the actual merging and derives the logical consequences of the merge operations
semMF is a flexible framework for calculating semantic similarity between objects that are represented as arbitrary RDF graphs. The framework allows taxonomic and non-taxonomic concept matching techniques to be applied to selected object properties
Snoggle is a graphical, SWRL-based ontology mapper. Snoggle attempts to solve the ontology mapping problem by providing a graphical user interface (similar to that of Microsoft Visio) to guide the process of ontology vocabulary alignment. In Snoggle, user-defined mappings can be serialized into rules, which are expressed using SWRL
Terminator is a tool for creating term to ontology resource mappings (documentation in Finnish)
Vine is a tool that allows users to perform fast mappings of terms across ontologies. It performs smart searches, can search using regular expressions, requires a minimum number of clicks to perform mappings, can be plugged into an arbitrary mapping framework, is non-intrusive with mappings stored in an external file, has export to text files, and adds metadata to any mapping. See also http://sourceforge.net/projects/vine/.
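
Several of the entries above describe combining different similarity strategies, such as edit distance on labels and vector similarity on descriptions. The Python sketch below shows one simple way such a combination could look. It is an assumption-laden illustration, not the method of RiMOM, PRIOR+ or any other listed tool: the function names, the plain term-frequency vectors and the equal default weights are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def edit_similarity(a: str, b: str) -> float:
    """1 minus the Levenshtein distance, normalized by the longer label."""
    a, b = a.lower(), b.lower()
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine between term-frequency vectors of two whitespace-tokenized texts."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def combined_score(label_a, desc_a, label_b, desc_b, w_label=0.5, w_desc=0.5):
    """Weighted sum of label and description similarity (weights are assumptions)."""
    return (w_label * edit_similarity(label_a, label_b)
            + w_desc * cosine_similarity(desc_a, desc_b))


if __name__ == "__main__":
    score = combined_score(
        "Author", "a person who wrote a work",
        "Writer", "a person who wrote a text",
    )
    print(round(score, 3))
```

In this sketch the description similarity can rescue a pair whose labels differ, which is the motivation for mixing strategies; how the weights are tuned is exactly the optimization question that the more recent tools try to automate.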

Author: Mike Bergman
