Posted: August 9, 2016

Continued Visibility for the Award-winning Web Portal

Laszlo Pinter, the individual who hired us as the technical contractor for the Peg community portal (www.mypeg.ca), recently gave a talk on the project at a TEDx conference in Winnipeg. Peg is the well-being indicator system for the community of Winnipeg. Laszlo’s talk is a 15-minute, high-level overview of the project and its rationale and role.

Peg helps identify and track indicators that relate to the economic, environmental, cultural and social well-being of the people of Winnipeg. There are scores of connected datasets underneath Peg that relate all information from stories to videos and indicator data to one another using semantic technologies. I first wrote about Peg when it was released at the end of 2013.

In 2014, Peg won the international Community Indicators Consortium Impact Award. The Peg Web site is a joint project of the United Way of Winnipeg (UWW) and the International Institute for Sustainable Development (IISD). Our company, Structured Dynamics, was the lead developer for the project, which is also based on SD’s Open Semantic Framework (OSF) platform.

Congratulations to the Peg team for the well-deserved visibility!

Posted: June 8, 2015

What Began as Data Integration Implies So Much More

Oh, it was probably two or three years ago that one of our clients asked us to look into single-source authoring, or more broadly what has come to be known as COPE (create once, publish everywhere), as made prominent by Daniel Jacobson of NPR, now Netflix. We also looked closely at the question of formats and workflows that might increase efficiencies or lower costs in the quest to grab and publish content.

Then, of course, about the same time, it was becoming apparent that standard desktop and laptop screens were being augmented with smartphones and tablets. Smaller screen aspects require a different interface layout and interaction; but, writing for specific devices was a losing proposition. Responsive Web design and grid layout templates that could bridge different device aspects have now come to the fore.

Though it has been true for some time that different publishing venues — from the Web to paper documents or PDFs — have posed a challenge, these other requirements point to a broader imperative. I have intuitively felt there is a consistent thread at the core of these emerging device, use and publishing demands, but the common element has heretofore eluded me.

For years (decades, actually) I have been focused on the idea of data interoperability. My first quest was to find a model that could integrate text stories and documents with structured data from conventional databases and spreadsheets. My next quest was to find a framework that could relate context and meaning across multiple perspectives and world views. It took a while, and the approach only began to take real shape about a decade ago, when I began to focus on RDF and general semantic Web principles for providing this model.

Data integration through open, semantic Web standards has been a real beacon for how I have pursued this quest. The ideal of being able to relate disparate information from multiple sources and viewpoints to each other has been a driving motivation in my professional interests. In analyzing the benefits of a more connected world of information I could see efficiencies, reduced costs, more global understandings, and insights from previously hidden connections.

Yet here is the funny thing. I began to realize that other drivers for how to improve knowledge worker efficiencies or to deploy results to different devices and venues share the same justifications as data integration. Might there not be some common bases and logic underlying the interoperability imperative? Is not data interoperability but a part of a broader mindset? Are there some universal principles to derive from an inspection of interoperability, broadly construed?

In this article I try to follow these questions to some logical ends. This investigation raises questions and tests from the global (that is, information interoperability) to the local and practical in terms of notions such as create once, use everywhere, and have it staged for relating and interoperability. I think we see that the same motivators and arguments for relating information apply to the efficient ways to organize and publish that information. I think we also see that the idea of interoperability is systemic. Fortunately, meaningful interoperability can be achieved across the board today with application of the right mindsets and approaches. Below, I also try to set the predicates for how these benefits might be realized by exploring some first principles of interoperability.

What is Interoperability?

So, what is interoperability and why is it important?

So-called enterprise information integration and interoperability seem to sprout from the same basic reality. Information gets created and codified across multiple organizations, formats, storage systems and locations. Each source of this information gets created with its own scope, perspective, language, characteristics and world view. Even in the same organization, information gets generated and characterized according to its local circumstances.

In the wild, and even within single organizations, information gets captured, represented, and characterized according to multiple formats and viewpoints. Without bridges between sources that make explicit the differences in format and interpretation, we end up with what — in fact — is today’s reality of information stovepipes. The reality of our digital information being in isolated silos and moats results in duplicate efforts, inefficiencies, and lost understandings. Despite all of the years and resources thrown at information generation, use and consumption, our digital assets are unexploited to a shocking extent. The overarching cause for this dereliction of fiscal stewardship is the lack of interoperability.

By the idea of interoperability we are getting at the concept of working together. Together means things are connected in some manner. Working means we can mesh the information across sources to do more things, or do them better or more cheaply. Interoperability does not necessarily imply integration, since our sources can reside in distributed locations and formats. What is important is not the physical location — or, indeed, even format and representations — but that we have bridges across sources that enable the source information to work together.

In working backwards from this observation, then, we need certain capabilities to fulfill these interoperability objectives. We need to be able to ingest multiple encodings, serializations and formats. Because we need to work with this information, and tools for doing so are diverse, we also need the ability to export information in multiple encodings, serializations and formats. Human circumstance means we need to ingest and encode this information in multiple human languages. Some of our information is more structured, and describes relationships between things or the attributes or characterizations of particular types of things. Since all of this source information has context and provenance, we need to capture these aspects as well in order to ascertain the meaning and trustworthiness of the information.

This set of requirements is a lot of work, which can most efficiently be done against one or a few canonical representations of the input information. From a data integration perspective, then, the core system to support, store and manage this information should be based on only a few central data representations and models, with many connectors for ingesting native information in the wild and tools to support the core representations:



[Figure: A Data Flow Perspective on Interoperability]

In our approach at Structured Dynamics we have chosen the Resource Description Framework (RDF) as the structured data model at the core of the system [1], supported by the Lucene text engine for full-text search and efficient facet searching. Because all of the information is given unique Web identifiers (URIs), and the whole system resides on the Web accessible via the HTTP protocol, our information may reside anywhere the Internet connects.

This gives us a data model and a uniform way to represent the input data across structured, semi-structured and unstructured sources. Further, we have a structure that can capture the relations or attributes (including metadata and provenance) of the input information. However, one more step is required to achieve data interoperability: an understanding of the context and meaning of the source information.
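The hub-and-spoke idea of many connectors feeding one canonical representation can be sketched in a few lines. This is only a minimal illustration using the Python standard library, not the Open Semantic Framework's API: the `http://example.org/` namespace, the function names and the sample records are all assumptions made for the example, and the exporter only approximates N-Triples literal syntax.

```python
import csv
import io
import json

BASE = "http://example.org/"  # hypothetical namespace, for illustration only


def from_csv(text, id_field):
    """Connector: CSV rows -> (subject, predicate, object) triples."""
    triples = set()
    for row in csv.DictReader(io.StringIO(text)):
        subject = BASE + "id/" + row[id_field]
        for name, value in row.items():
            if name != id_field and value:
                triples.add((subject, BASE + "prop/" + name, value))
    return triples


def from_json(text, id_field):
    """Connector: a JSON list of objects -> the same triple form."""
    triples = set()
    for record in json.loads(text):
        subject = BASE + "id/" + str(record[id_field])
        for name, value in record.items():
            if name != id_field:
                triples.add((subject, BASE + "prop/" + name, str(value)))
    return triples


def to_ntriples(triples):
    """Exporter: one N-Triples-style line per statement."""
    lines = []
    for s, p, o in sorted(triples):
        literal = json.dumps(o)  # JSON quoting stands in for literal escaping
        lines.append(f"<{s}> <{p}> {literal} .")
    return "\n".join(lines)


# Two native sources describing the same thing, merged in the canonical form.
csv_source = "id,name,city\nw1,Peg,Winnipeg\n"
json_source = '[{"id": "w1", "theme": "well-being"}]'
store = from_csv(csv_source, "id") | from_json(json_source, "id")
print(to_ntriples(store))
```

Because both connectors mint the same identifier for record `w1`, statements from the two sources land on one subject without any source-specific merge logic. Adding another input format means writing one more connector, not touching the core.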

To achieve the next layer in the data interoperability pyramid [2] it is thus necessary to employ semantic technologies. The structure of the RDF data model has an inherent expressiveness to capture meaning and context. To this foundation we must add a coherent view of the concepts and entity types in our domain of interest, which also enables us to capture the entities within this system and their characteristics and relationships to other entities and concepts. These properties applied to the classes and instances in our domain of interest can be expressed as a knowledge graph, which provides the logical schema and inferential framework for our domain. This stack of semantic building blocks gets formally expressed as ontologies (the technical term for a working graph) that should putatively provide a coherent representation of the domain at hand.
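To make the idea of a logical schema with an inferential framework a little more concrete, here is a toy sketch of one kind of inference: deriving the classes an instance belongs to by following subclass links. The `ex:` names are invented for illustration, and a real OWL reasoner does far more than this transitive walk.

```python
SUBCLASS_OF = "rdfs:subClassOf"
TYPE = "rdf:type"

# Hypothetical mini knowledge graph; every ex: name is an assumption.
graph = {
    ("ex:Park", SUBCLASS_OF, "ex:PublicSpace"),
    ("ex:PublicSpace", SUBCLASS_OF, "ex:Place"),
    ("ex:RiversidePark", TYPE, "ex:Park"),
}


def superclasses(graph, cls):
    """All classes reachable from cls via subClassOf (transitive)."""
    found, frontier = set(), [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in graph:
            if s == current and p == SUBCLASS_OF and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def inferred_types(graph, instance):
    """Asserted types plus every superclass of those types."""
    asserted = {o for s, p, o in graph if s == instance and p == TYPE}
    inferred = set(asserted)
    for cls in asserted:
        inferred |= superclasses(graph, cls)
    return inferred


types = inferred_types(graph, "ex:RiversidePark")
print(sorted(types))
```

Only one type is asserted for the instance, yet a query for all instances of `ex:Place` can now find it. That is the practical payoff of putting a coherent schema on top of the data model.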

We can visualize this semantic stack as follows:



[Figure: A Semantics Perspective of Interoperability]

We have been using the spoke-and-hub diagram above for data flows for some years and have used the semantic stack representation before, too. I believe in my bones in the importance of data interoperability to competitive advantage for enterprises, and therefore its business worth as a focus of my company’s technology. But, once so considered, some more fundamental questions emerge. What makes data interoperability a worthwhile objective? Can an understanding of those objectives reveal more fundamental benefits? Does a grounding in more fundamental benefits suggest any change in our development priorities?

Drivers of Interoperability

I think we can boil the drivers of interoperability down to four. These are:

  • Efficiency — literally trillions are spent globally each year in the research, creation, re-use, publishing, storing and browsing of information [3]. Yet relevant information is hard to find, and sometimes obscure information is overlooked. The lack of reuse of prior good content because it is not discoverable is unconscionable given today’s technologies. The base productivity of information use is low;
  • Cost — missed information or lack of awareness of relevant information leads to increased time, increased direct costs (labor and material), and increased indirect costs. Awareness, understanding and re-use of existing information would save millions or more for brand-name firms [3] annually if these interoperability gaps were overcome;
  • Insight — drawing connections between previously unconnected things and enabling discovery are essential inputs to innovation, itself the overall driver of productivity (and, therefore, wealth) gains. The reinforcing leverage of interoperability resides in its ability to bring new understandings and insights; and
  • Capture — simply being able to include the 80% of extant information contained in text is a huge first step to interoperability, but grounding the system in the inherent connectedness of the Web means that all kinds of fields + streams, APIs, mappings, DBs, datasets, Web content, on-the-fly discoveries, and device sensors through the Internet of things (IoT) can be captured to contribute to our insights.

To be sure, data interoperability is focused on insight. But data interoperability also brings efficiency and cost reductions. As we add other aspects of interoperability — say, responsive design for mobile — we may see comparatively fewer benefits in insight, but more in efficiency, cost, and, even, capture. Anything done to increase benefits from any of these drivers contributes to the net benefits and rationale for interoperability.

Principles of Interoperability

The general goodness arising from interoperability suggests it is important to understand the first principles underlying the concept. By understanding these principles, we can also tease out the fundamental areas deserving attention and improvement in our interoperability developments and efforts. These principles help us cut through the crap in order to see what is important and deserves attention.

I think the first of the first principles for interoperability is reusability. Once we have put the effort into the creation of new valuable data or content, we want to be able to use and apply that knowledge in all applicable venues. Some of this reuse might be in chunking or splitting the source information into parts that can be used and deployed for many purposes. Some of this reuse might be in repurposing the source data and content for different presentations, expressions or devices. These considerations imply the importance of storing, characterizing, structuring and retrieving information in one or a few canonical ways.

Interoperable content and forms should also aspire to an ideal of “onceness”. The ideal is that the efforts to gather, create or analyze information be done as few times as possible. This ideal clearly ties into the principle of reusability because that must be in place to minimize duplication and avoid overlooking what exists. The reason to focus on onceness is that it forces an explication of the workflows and bottlenecks inherent to our current work practices. These are critical areas to attack since, unattended, such inefficiencies provide the “death by a thousand cuts” to interoperability. Onceness is at the center of such compelling ideas as COPE and the role of APIs in a flexible architecture (see below) to promote interoperability.

A respect for workflows is also a first principle, expressed in two different ways. The first way is that existing workflows cannot be unduly disrupted when introducing interoperability improvements. While workflows can (and should) be improved or streamlined over time, initial introduction and acceptance of new tools and practices must fit with existing ways of doing tasks in order to see adoption. Jarring changes to existing work practices are mostly resisted. The second way that workflows are a first principle is in the importance of being aware of, explicitly modeling, and then codifying how we do tasks. This becomes the “language” of our work, and helps define the tooling points or points of interaction as we merge activities from multiple disciplines in our domain. These workflow understandings also help us identify useful points for APIs in our overall interoperability architecture.

These considerations provide the rationale for assigning metadata [4] that characterize our information objects and structure, based on controlled vocabularies and relationships as established by domain and administrative ontologies [5]. In the broadest interoperability perspective, these vocabularies and the tagging of information objects with them are a first principle for ensuring how we can find and transition states of information. These vocabularies need not be complex or elaborate, but they need to be constant and consistent across the entire content lifecycle. There are backbone aspects to these vocabularies that capture the overall information workflow, as well as very specific steps for individual tasks. As a complement to such administrative ontologies, domain ontologies provide the context and meaning (semantics) for what our information is about.
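A small sketch may help show what a constant, consistent administrative vocabulary buys us. The stage names and allowed hand-offs below are hypothetical, chosen only to illustrate the idea. The point is that every information object carries a stage tag from one controlled list, and that transitions between states are checked against that same list.

```python
from dataclasses import dataclass, field

# Hypothetical administrative vocabulary: stage -> stages it may hand off to.
WORKFLOW = {
    "create": {"characterize"},
    "characterize": {"analyze", "publish"},
    "analyze": {"publish"},
    "publish": {"create"},  # published content becomes input to new work
}


@dataclass
class InfoObject:
    identifier: str
    stage: str = "create"
    history: list = field(default_factory=list)

    def __post_init__(self):
        if self.stage not in WORKFLOW:
            raise ValueError(f"unknown stage tag: {self.stage!r}")

    def advance(self, next_stage):
        """Move to next_stage only if the vocabulary allows the hand-off."""
        if next_stage not in WORKFLOW:
            raise ValueError(f"unknown stage tag: {next_stage!r}")
        if next_stage not in WORKFLOW[self.stage]:
            raise ValueError(f"{self.stage!r} cannot hand off to {next_stage!r}")
        self.history.append(self.stage)
        self.stage = next_stage


doc = InfoObject("http://example.org/id/report-1")
doc.advance("characterize")
doc.advance("publish")
print(doc.stage, doc.history)
```

The vocabulary here is tiny, which is rather the point: it need not be elaborate to be useful, so long as every tool along the content lifecycle reads and writes the same tags.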

The common grounding of data model and semantics means we can connect our sources of information. The properties that define the relationships between things determine the structure of our knowledge graph. Seeking commonalities for how our information sources relate to one another helps provide a coherent graph for drawing inferences. How we describe our entities with attributes provides a second type of property. Attribute profiles are also a good signal for testing entity relatedness. Properties, whether relations or attributes, provide another filter to draw insight from available information.
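One simple way to use attribute profiles as a relatedness signal is to compare the sets of attributes that two entities carry. The sketch below uses Jaccard overlap for this. It is only one possible measure, and the entities and attribute names are made up for the example.

```python
# Hypothetical triples; only the attribute names matter for the profile.
triples = [
    ("ex:CityA", "ex:population", "705000"),
    ("ex:CityA", "ex:area", "464"),
    ("ex:CityA", "ex:mayor", "A. Person"),
    ("ex:CityB", "ex:population", "51000"),
    ("ex:CityB", "ex:area", "77"),
    ("ex:CityB", "ex:mayor", "B. Person"),
    ("ex:RiverC", "ex:length", "885"),
    ("ex:RiverC", "ex:mouth", "ex:LakeD"),
]


def attribute_profile(triples, entity):
    """The set of attributes (predicates) used to describe an entity."""
    return {p for s, p, o in triples if s == entity}


def jaccard(a, b):
    """Shared attributes divided by all attributes; 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


city_a = attribute_profile(triples, "ex:CityA")
city_b = attribute_profile(triples, "ex:CityB")
river_c = attribute_profile(triples, "ex:RiverC")

print(jaccard(city_a, city_b))   # identical profiles
print(jaccard(city_a, river_c))  # no attributes in common
```

Two entities described by the same kinds of attributes are likely to be of the same type, even when no source says so explicitly. In practice such a score would be one input among several, not a decision on its own.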

If the above sounds like a dynamic and fluid environment, you would be right. Ultimately, interoperability is a knowledge challenge in a technology environment that is rapidly changing. New facts, perspectives, devices and circumstances are constantly arising. For these very reasons an interoperability framework must embrace the open world assumption [6], wherein the underlying logic structure and its vocabulary and data can be grown and extended at will. We are seeing some breakaway from conventional closed-world thinking of relational databases with NoSQL and graph databases, but a coherent logic based on description logics, such as is found with open standard semantic technologies like RDF and OWL and SPARQL, is even more responsive.
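The practical difference between the two assumptions shows up in how a missing statement is treated. The toy functions below are my own simplification, not how a description logics reasoner works internally: under the closed world assumption absence means false, while under the open world assumption absence means unknown until something is asserted either way.

```python
# Hypothetical facts and explicit denials; all ex: names are assumptions.
facts = {("ex:Winnipeg", "ex:locatedIn", "ex:Manitoba")}
denied = {("ex:Winnipeg", "ex:locatedIn", "ex:Ontario")}


def closed_world(facts, statement):
    """Closed world: anything not recorded is taken to be false."""
    return statement in facts


def open_world(facts, denied, statement):
    """Open world: True, False, or None when we simply do not know."""
    if statement in facts:
        return True
    if statement in denied:
        return False
    return None


query = ("ex:Winnipeg", "ex:hasIndicator", "ex:AirQuality")
before_closed = closed_world(facts, query)
before_open = open_world(facts, denied, query)

# New information arrives later; nothing already recorded has to change.
facts.add(query)
after_open = open_world(facts, denied, query)

print(before_closed, before_open, after_open)
```

The open world answer of "unknown" is what lets the vocabulary and data grow incrementally: a new property or fact extends what we know without contradicting an earlier answer.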

Though perhaps not quite at the level of a first principle, I also think interoperability improvements should be easy to use, easy to share, and easy to learn. Tooling is clearly implied in this, but also it is important we be able to develop a language and framing for what constitutes interoperability. We need to be able to talk about and inspect the question of interoperability in order to discover insights and gain efficiencies.

Aspects of Interoperability

The thing about interoperability is that it extends over all aspects of the information lifecycle, from capturing and creating information, to characterizing and vetting it, to analyzing it, or publishing or distributing it. Eventually, information and content already developed becomes input to new plans or requirements. These aspects extend across multiple individuals and departments and even organizations, with portions of the lifecycle governed (or not) by their own set of tools and practices. We can envision this overall interoperability workflow something like the following [7]:



[Figure: A Generalized Workflow Perspective of Interoperability]

Overall, only pieces of this cycle are represented in most daily workflows. Actually, in daily work, parts of this workflow are much more detailed and involved than what this simplistic overview implies. Editorial review and approvals, or database administration and management, or citation gathering or reference checking, or data cleaning, or ontology creation and management, or ETL activities, or hundreds of other specific tasks, sit astride this general backbone.

Besides showing that interoperability is a systemic activity for any organization (or should be), we can also derive a couple of other insights from this figure. First, we can see that some form of canonical representation and management is central to interoperability. As noted, this need not be a central storage system, but can be distributed using Web identifiers (URIs) and protocols (HTTP). Second, we characterize and tag our information objects using ontologies, both from structural and administrative viewpoints, but also by domain and meaning. Characterizing our information by a common semantics of meaning enables us to combine and analyze our information.

A third insight is that a global schema specific to workflows and information interoperability is the key for linking and combining activities at any point within the cycle. A common vocabulary for stages and interoperability tasks, included as a best practice for our standard tagging efforts, provides the conventions for how batons can get passed between activities at any stage in this cycle. The challenge of making this insight operational is one more of practice and governance than of technology. Inspecting and characterizing our information workflows with a common vocabulary and understanding needs to be a purposeful activity in its own right, backed with appropriate management attention and incentives.

A final insight is that such a perspective on interoperability is a bit of a fractal. As we get more specific in our workflows and activities, we can apply these same insights in order to help those new, more specific workflows become interoperable. We can learn where to plug into this structure. And, we can learn how our specific activities through the application of explicit metadata and tags with canonical representations can work to interact well with other aspects of the content lifecycle.

Interoperability can be achieved today with the right mindsets and approaches. Fortunately, because of the open world first principle, this challenge can be tackled in an incremental, piecemeal manner. While the overall framework provides guidance for where comprehensive efforts across the organization may go, we can also cleave off only parts of this cycle for immediate attention, following a “pay as you benefit” approach [8]. A global schema and a consistent approach to workflows and information characterizations can help ensure the baton is properly passed as we extend our interoperability guidance to other reaches of the enterprise.

General Architecture and a Sample Path

We can provide a similar high-level view for what an enterprise information architecture supporting interoperability might look like. We can broadly layer this architecture into content acquisition, representation and repository, and content consumption:



[Figure: An Architectural Perspective of Interoperability]

Content of all forms — structured, semi-structured and unstructured — is brought into the system and tagged or mapped into the governing domain or administrative schema. Text content is marked up with reduced versions of HTML (such as RASH [9] or Markdown [10]) in order to retain the author’s voice and intent in areas such as emphasis, titles or section headers; the structure of the content is also characterized by patterned areas such as abstracts, body and references. All structured data is characterized according to the RDF data model, with vocabularies as provided by OWL in some cases.

We already have an exemplar repository in the Open Semantic Framework [11] that shows the way (along with other possible riffs on this theme) for how just a few common representations and conventions can work to distribute both schema and information (data) across a potentially distributed network. Further, by not stopping at the water’s edge of data interoperability, we can also embrace further, structural characterization of our content. Adding this wrinkle enables us to efficiently support a variety of venues for content consumption simultaneously.
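The benefit of characterizing content structure can be illustrated with a small create once, publish everywhere sketch. One structured record is authored a single time and then rendered for several venues. The record layout and function names are assumptions for this example, not the repository's actual format.

```python
import html
import json

# Hypothetical canonical record; the field names mirror the patterned areas
# mentioned above (abstract, body, references) but are otherwise assumptions.
article = {
    "title": "Interoperability & Structure",
    "abstract": "One source, many venues.",
    "body": ["First paragraph.", "Second paragraph."],
    "references": ["A cited work"],
}


def to_html(doc):
    """Render for the Web, escaping text so markup characters are safe."""
    parts = [f"<h1>{html.escape(doc['title'])}</h1>"]
    parts.append(f'<p class="abstract">{html.escape(doc["abstract"])}</p>')
    parts += [f"<p>{html.escape(p)}</p>" for p in doc["body"]]
    items = "".join(f"<li>{html.escape(r)}</li>" for r in doc["references"])
    parts.append(f"<ol>{items}</ol>")
    return "\n".join(parts)


def to_text(doc):
    """Render as plain text, for example for email or a print pipeline."""
    lines = [doc["title"].upper(), "", doc["abstract"], ""]
    lines += doc["body"]
    lines += ["", "References:"]
    lines += [f"[{i}] {ref}" for i, ref in enumerate(doc["references"], start=1)]
    return "\n".join(lines)


def to_json(doc):
    """Render as JSON, for example as an API payload for other systems."""
    return json.dumps(doc, indent=2)


print(to_html(article))
print(to_text(article))
```

Each venue gets its own renderer, but none of them requires the content to be authored again. Supporting a new venue means adding a function, which is the onceness ideal expressed in code.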

This architecture is quite consistent with what is known as WOA (for Web-oriented architecture) [12]. Like the Internet itself, WOA has the advantage of being scalable and distributed, all (mostly) based on open standards. The interfaces between architectural components are also provided through mostly RESTful application programming interfaces (APIs), which extends interoperability to outside systems and provides flexibility for swapping in new features or functionality as new components or developments arise. Under this design, all components and engines become in effect “black boxes”, with information exchange via standard vocabularies and formats using APIs as the interface for interoperability.

A Global Context for Interoperability

Though data interoperability is a large and central piece, I hope I have demonstrated that interoperability is a much broader and far-reaching concept. We can see that “global interoperability” extends into all aspects of the information lifecycle. By expanding our viewpoint of what constitutes interoperability, we have discovered some more general principles and mindsets that can promise efficiencies, lower costs and greater insights across the enterprise.

An explicit attention to workflows and common vocabularies for those flows and the information objects they govern is a key to a more general understanding of interoperability and the realization of its benefits. Putting this kind of infrastructure in place is also a prerequisite to greater tooling and automation in processing information.

We can already put in place chains of tooling and workflows governed by these common vocabularies and canonical representations to achieve a degree of this interoperability. We do not need to tackle the whole enchilada at once or mount some form of “big bang” initiative. We can start piecemeal, and expand as we benefit. The biggest gaps remain codification of workflows in relation to the overall information lifecycle, and the application of taggers to provide the workflow and structure metadata at each stage in the cycle. Again, these are not matters so much of technology or tooling, but policy and information governance.

What I have outlined here provides the basic scaffolding for how such an infrastructure to promote interoperability may evolve. We know how we do our current tasks; we need to understand and codify those workflows. Then, we need to express our processing of information at any point along the content lifecycle. A number of years back I discussed climbing the data interoperability pyramid [2]. We have made much progress over the past five years and stand ready to take our emphasis on interoperability to the next level.

To be sure there is much additional tooling still needed, mostly in the form of mappers and taggers. But the basic principles, core concepts and backbone tools for supporting greater interoperability are known and relatively easy to put in place. Embracing the mindset and inculcating this process into our general information management routines is the next challenge. Working to obtain the ideal is doable today.


[1] See M. K. Bergman, 2009. “Advantages and Myths of RDF,” from AI3:::Adaptive Information blog, April 8, 2009.
[2] See M. K. Bergman, 2006. “Climbing the Data Federation Pyramid,” from AI3:::Adaptive Information blog, April 8, 2009.
[3] See M. K. Bergman, 2005. “Untapped Assets: The $3 Trillion Value of U.S. Enterprise Documents,” from AI3:::Adaptive Information blog, July 20, 2005.
[4] See M. K. Bergman, 2010. “I Have Yet to Metadata I Didn’t Like,” from AI3:::Adaptive Information blog, August 16, 2010.
[5] See M. K. Bergman, 2011. “An Ontologies Architecture for Ontology-driven Apps,” from AI3:::Adaptive Information blog, December 5, 2011.
[6] See M. K. Bergman, 2009. “The Open World Assumption: Elephant in the Room,” from AI3:::Adaptive Information blog, December 21, 2009.
[7] Some sources that helped form my thoughts on the information lifecycle include Backbone Media and Piktochart.
[8] See M. K. Bergman, 2010. “‘Pay as You Benefit’: A New Enterprise IT Strategy,” from AI3:::Adaptive Information blog, July 12, 2010.
[9] See Silvio Peroni, 2015. “RASH: Research Articles in Simplified HTML,” March 15, 2015.
[10] Many Markdown options exist for a reduced subset of HTML; one in this vein is Scholarly Markdown.
[11] The Open Semantic Framework has its own Web site (http://opensemanticframework.org/), supported by a wiki of more than 500 supporting technical articles (http://wiki.opensemanticframework.org/index.php/Main_Page).
[12] See M. K. Bergman, 2009. “A Generalized Web-oriented Architecture (WOA) for Structured Data,” from AI3:::Adaptive Information blog, May 3, 2009.

Posted: September 15, 2014

Big Structure has a Foundation in Reference Structures, But Any Structure Aids Interoperability

Big Structure is built on a foundation of reference structures, with domain structures capturing the domain at hand. These represent the target foundations for mapping schema and transforming data in the wild into an operable, canonical form. Any structure, even the most lightweight of lists and metadata, can contribute to and be mapped into this model, as this wall of structure shows:

[Figure: Foundations to Big Structure]

Described below are some of these structures, in rough descending order of completeness and usefulness, for making data interoperable. Please note that any of these structures might be available as linked data.

Reference Structures

In both semantics and artificial intelligence — and certainly in the realm of data interoperability — there is always the problem of symbol grounding. In the conceptual realm, symbol grounding means that when we use a term or phrase we are referring to the same thing. In the data value realm, symbol grounding means that when we refer to an object or a number, we are referring to the same measure.

UMBEL is the standard reference ontology used by Structured Dynamics. It contains 28,000 concepts (classes and relationships) derived from the Cyc knowledge base. The reference concepts of UMBEL are mapped to Wikipedia, schema.org (used in Google’s knowledge graph), DBpedia ontology classes, GeoNames and PROTON. Similar reference structures are used to ground the actual data values and attributes.

Other reference structures may be used, so long as they are rather complete in scope and coherent in their relationships. Logical consistency is a key requirement for grounding.

Knowledge Bases

Knowledge bases combine schema with data in a logical manner; well-constructed ones support computations, inference and reasoning. To date, the two primary knowledge bases that we use are Wikipedia and Cyc. However, many specific domain knowledge bases also exist.

Knowledge bases are important sources for symbol grounding. In addition, because of their computability, they may be used with artificial intelligence methods to both extend the knowledge base and to refine the feature estimates used in the AI algorithms.

Domain Ontologies

Domain ontologies, constructed as graphs, are the principal working structures in data interoperability. Though best practices recommend they be grounded in the reference structures, the domain structures are the ones that specifically capture the concepts and data attributes of the target information domain. More effort should be focused at this level in the wall of structure than any other.

Domain structures provide unique benefits in discovery, flexible access, and information integration due to their inherent connectedness. Further, these domain structures can be layered on top of existing information assets, which means they are an enhancement and not a displacement for prior investments. And, these domain structures may be matured incrementally, which means their development is cost-effective.

Mappings

Data and schema in the wild need to be mapped and transformed into these canonical structures. What is known as data wrangling is an aspect of these mappings and transformations. Mappings thus become the glue that ties native data to interoperable forms.

Mapping is the critical bridging function in data interoperability. It requires tools and background intelligence to suggest possible correspondences; how well this is done is a key to making the semi-automatic mapping process as efficient as possible. Mapping structures are the result of the final correspondences. Mapping effort is a function of the scope of Big Structure, not the volume of Big Data.
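The semi-automatic flavor of mapping can be sketched with nothing more than string similarity. The code proposes candidate correspondences between native field names and canonical properties, and a person confirms or rejects them. This is a deliberately naive illustration: the canonical property list is hypothetical, and real mapping tools would also draw on background knowledge such as reference structures, not just name similarity.

```python
import difflib

# Hypothetical canonical properties in the target structure.
CANONICAL = ["population", "medianIncome", "postalCode", "birthDate"]


def normalize(name):
    """Lower-case and drop punctuation so 'Postal_Code' ~ 'postalCode'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def suggest_mappings(native_fields, canonical=CANONICAL, cutoff=0.6):
    """Return candidate canonical properties for each native field name.

    An empty list means no suggestion cleared the cutoff, so the field
    needs manual attention.
    """
    lookup = {normalize(c): c for c in canonical}
    suggestions = {}
    for name in native_fields:
        matches = difflib.get_close_matches(
            normalize(name), list(lookup), n=3, cutoff=cutoff
        )
        suggestions[name] = [lookup[m] for m in matches]
    return suggestions


suggestions = suggest_mappings(["Postal_Code", "POPULATION", "xyz"])
for native, candidates in suggestions.items():
    print(native, "->", candidates)
```

Note that the work here scales with the number of distinct field names, not the number of records behind them, which is the sense in which mapping effort follows the scope of structure and not the volume of data.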

Existing Structures

A broad variety of structures occur in the wild — from database schema and taxonomies to dictionaries and lists — that need to be represented in a common form and then mapped in order to support interoperability. The common representation used by Structured Dynamics is the RDF data model.

Structure Scripts

Scripting and tooling are essential to help create Big Structure efficiently.

Editor’s Note: We are pleased to share with you in advance some of the text from Structured Dynamics’ new Web site.
Posted:August 12, 2014

[Image from http://www.slowfamilyonline.com/tag/tinker-toy/]
Defining the Guideposts for Big Data

In our recent two-part series we described a decade of experience working in the semantic Web (Part I) and our view that Big Structure, which resides at the nexus of the semantic Web, knowledge bases and artificial intelligence, was a key component of making sense of Big Data going forward (Part II). We are at a time when multiple advances are conjoining to create new opportunities and excitement.

Data without context and relationships is meaningless. The idea of Big Data is powerful, but it is often presented as either a “good thing” in and of itself, or a mantra for something that is rather undefined. There is no doubt that with the Internet and the Web we are now able to generate and access data at unprecedented scale. There is also no question that tracking mechanisms and cheap storage — and simpler, large-scale databases and Web services — mean that we can also capture data and structure of natures previously unseen. Everyone knows the remarkable growth in exabytes and more.

The prospect of data everywhere — some useful with important context and some not — has clearly captured the current discussion. Heck, if we claim Big Data, we even make more in wage or consulting charge-out fees. Who can argue with that?

Well, actually, anyone interested in meaningful data or cross-dataset interoperability can argue with that. Big Data is great, except it means little if we cannot combine that data across multiple sources for potentially multiple purposes. (Remember, one of the “V’s” of Big Data is variability.) Once the question of what data means gets brought to the fore, it is now time for context and relationships. Structure in an information context means that which situates or describes data in an interpretable way. Big Data needs a Big Structure complement to make sense of it all.

What is Big Structure?

Big Structure is data relationships and context that can be combined into a coherent framework to enable dataset interoperability and understanding. By necessity, Big Structure implies that the meaning of data can be understood and its values can be brought to common bases such that analysis, testing and validation can be applied across values. Big Structure is not a monolithic thing, but the combination of multiple things that give data meaning and context. As such, Big Structure is often a re-purposing of existing structural assets, plus other special sauce, organized for the aim of data interoperability.

The components of Big Structure can be identified and characterized. These components can be assessed for usefulness and authoritativeness, and then incorporated into broader structures that ultimately bring the topics of what the data is about and the values of that data into alignment. Thus, Big Structure is also a mindset and approach to selecting and combining structures such that broad dataset interoperability can be achieved.

Big Structure is actually a continuum or family of concept and data relationships, any one of which contributes to mapping and interoperating data. Ultimately, the components of Big Structure get combined into reference graph structures that place the concepts and actual data values of the Big Data into context. There are certain ways to use and organize existing structures to achieve these Big Structure objectives; some of these ways are described in this article.

Once the components of Big Structure are combined into these reference graphs we then can also use network or graph analysis to understand the relationships amongst the constituent data items. This recursive nature of graph reference structures to organize the constituent data and then to use those graphs to analyze the data is one of the hallmark characteristics of Big Structure.
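
A small sketch may help show how the same graph that organizes the data can also be used to analyze it. The Python below builds an adjacency structure from a handful of made-up edges and uses breadth-first search to find how two dataset items are related through the reference concepts. The node names, the edge list and both helper functions are hypothetical, and edges are treated as undirected for simplicity.

```python
from collections import deque

# Hypothetical reference graph: dataset items are linked to reference
# concepts, and reference concepts are linked to one another.
EDGES = [
    ("dataset_a:income", "ref:Income"),
    ("dataset_b:earnings", "ref:Income"),
    ("ref:Income", "ref:EconomicWellBeing"),
    ("dataset_c:rent", "ref:Housing"),
    ("ref:Housing", "ref:EconomicWellBeing"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map from (left, right) edge pairs."""
    adjacency = {}
    for left, right in edges:
        adjacency.setdefault(left, set()).add(right)
        adjacency.setdefault(right, set()).add(left)
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the list of nodes, or None if unconnected."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


adjacency = build_adjacency(EDGES)
path = shortest_path(adjacency, "dataset_a:income", "dataset_c:rent")
```

Here the two dataset items never reference each other directly; the path between them runs entirely through the reference concepts, which is the sense in which the reference graph both organizes and explains the constituent data.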

Big Structure thus involves the need to identify and then organize constituent forms of structure into coherent reference frameworks. Concepts in contributing datasets are then mapped to these structures, and the attributes and values of the underlying data are also transformed into canonical representations. It is these mappings and transformations that provide the interoperability of Big Structure. Big Structure therefore continues to evolve by adding more and more reference structures, all coherently organized.

Contributors to Big Structure

Big Structure is a family of canonical reference structures that help guide mapping and interoperability. The table below lists some of the possible contributors to Big Structure [1], roughly in descending order as to the degree of structure and its contribution to interoperability. The table provides both definitions and use descriptions for each component, plus optionally some notes regarding coverage and use:

| Structure Type | Definition | Use | Note |
|---|---|---|---|
| Reference ontologies | Major grounding structures for orienting and interoperating concepts or data | The reference concepts for orienting all data and domain information | [2] |
| Reference attributes | Major grounding structures for interoperating data and data characterizations | The reference relationships amongst data descriptions and characteristics, which also provides the means for transformations between heterogeneous representations | [3] |
| Data model (RDF) | A self-consistent means for describing the structure of data and their relationships | The “canonical” data model at the heart of the system; provides a single interoperability point; RDF is the canonical model used by Structured Dynamics for its Big Structures | [4] |
| Domain attributes | The data descriptions and characteristics for the constituent datasets in the applicable domain(s) | The reference attributes specific to the domain(s) at hand (which are generally more specific than general reference attributes) | |
| Domain ontologies | The formal conceptualization of a domain, using a shared vocabulary to denote the types, properties and interrelationships of those concepts | The reference concepts and their relationships specific to the domain(s) at hand; generally are mapped to the reference ontologies | [5] |
| Concept maps | A diagram that depicts suggested relationships between concepts | Structurally similar to a domain ontology; a few related terms shown in Note | [6] |
| Schema | The structure of a database that defines the objects and relationships in that database | Organizing framework for relational databases (and their tables) | [7] |
| Mappings | The process of creating data element correspondences between two distinct data models or schema | Mapping predicates are used to relate concepts or attributes from two different datasets or knowledge bases to one another. Mappings are often a precursor to various transformations to bring data into a common representation | [8] |
| Taxonomies | A particular classification of related concepts, often of a hierarchical nature | Hierarchical relationships are expressed in narrower or broader terms (or subClassOf); may also be see also relationships | [9] |
| Facets | Clearly defined, mutually exclusive, and collectively exhaustive aspects, properties or characteristics of a class or specific subject | Facets can provide alternative ways for classifying objects beyond a single taxonomy | |
| Categories | Grouping objects based on similar properties | A category may be viewed as equivalent to a concept | [10] |
| Tables | A collection of related data held in a structured format, generally a two-dimensional layout of rows (records) and columns (fields) | Simplest and most common data presentation format | |
| Synsets | A group of data elements or terms that are considered semantically equivalent for the purposes of information retrieval | Also known as a “semset” in the parlance of UMBEL | |
| Metadata | Data providing information about one or more aspects of the source data, thus “data about data” | It is the description of what data is about rather than the values and attributes of the actual data | |
| Thesauri | A form of controlled vocabulary that seeks to dictate semantic manifestations of metadata in the indexing of content objects | A thesaurus is composed of a list of words (or terms), a vocabulary for relating these words (or terms) to one another, often hierarchically, and a set of rules on how to use these aspects | |
| Gazetteers | A listing of similar entity types with associated structural data (such as countries and population or standard codes) | Often used in relation to people or place entity types, though any class of entities may have a gazetteer | |
| Controlled vocabularies | The use of predefined, authorized terms as preselected by the sponsor to enforce consistency in terminology | Applied to specific domains or sub-domains, with single controlled vocabularies per official language used | |
| Reference lists | Authoritative listings of similar objects, each uniquely identified by name or code | May be as simple as a comprehensive list of countries with associated ISO codes | [11] |
| Dictionaries | A repository of information about data such as meaning, relationships to other data, origin, usage, or format | In our context, can range from the meaning associated with standard word dictionaries to the more formal data dictionary | |
| Glossaries | An alphabetical list of terms in a particular domain with the definitions for those terms | Definition is the only structured information provided | |
| Nested lists | Related concepts or entities organized by some form of hierarchical relationship (narrower, broader, subClassOf, etc.) | Akin to a simple taxonomy | |
| Ordered lists | A finite, ordered collection of values for a given type | May also be additional information linked to the listing | |
| Clusters | A set of objects grouped according to some basis of similarity (type, attributes, or characteristics) | Basis for how the objects got clustered is not always obvious | |
| Unordered lists | A container of similar items or entities, with no implied order or sequence | Also known as a “bag” or “collection” | [12] |
| Values | The actual data; a normal form or a type member | Basic QUDT ontologies could contribute here | |

An alternate way to look at these contributor structures is to characterize them with respect to degree of structure and degree of contributing to interoperability:

Structure v Interoperability

In general, as might be expected, the greater the degree of structure, the greater its potential contribution to interoperability. The components in the upper right quadrant represent the most structured and interoperable ones. These also conform most to the use of W3C standards for the RDF data model and the OWL ontology languages. Expressions of structure are codified and standardized. Use of best practices also ensures completeness and suitability as reference groundings for interoperability.

The lower left portions of the quadrant represent the least structure and interoperability. However, as standard reference means for characterizing and describing data, even structures in this quadrant can contribute to meeting Big Structure requirements. Tagging of documents (unstructured data) occurs in this less-sophisticated lower left quadrant, but it gives equal footing to 80% of the content that generally resides in text form. (The interoperability system is further enhanced when the basis of the tags is derived from the “semsets” of the reference and domain ontologies, another example of a best practice.)
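
The parenthetical about deriving tags from “semsets” can be sketched as follows. This is a deliberately naive illustration that assumes single-word terms and exact token matching; the concept identifiers, the term lists and the `tag_document` helper are hypothetical and are not drawn from UMBEL itself.

```python
import re

# Hypothetical semsets: each reference concept lists the terms that are
# treated as semantically equivalent labels for it.
SEMSETS = {
    "ref:Housing": {"housing", "dwelling", "dwellings", "shelter"},
    "ref:Income": {"income", "earnings", "wages"},
}


def tag_document(text, semsets):
    """Return the concepts whose semset terms occur in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(concept for concept, terms in semsets.items() if tokens & terms)


tags = tag_document("Median wages rose while the number of dwellings fell.", SEMSETS)
```

Because the tags are concept identifiers rather than free-form keywords, a document mentioning “wages” and a dataset column labeled “earnings” resolve to the same reference concept.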

All of the listed components can thus contribute to Big Structure. However, the completeness of that structure and its usefulness for interoperability increases as one progresses along the blue arrow of the Big Structure continuum. Data interoperability arises from the continued efforts to drive Big Structure to the upper right of this quadrant. As noted, Big Structure is a mindset and process rather than some finite state. As more concepts and attributes get grounded in standard references, the degree of Big Structure (and, thus, data interoperability) continues to increase.

The Foundation of Reference Groundings

In both semantics and artificial intelligence — and certainly in the realm of data interoperability — there is always the problem of symbol grounding. In the conceptual realm, symbol grounding means that when we use a term or phrase we are referring to the same thing; that is, the referent is the same. In the data value realm, symbol grounding means that when we refer to an object or a number — say, the number 4.1 — we are also referring to the same metric. 4.1 inches is not the same as 4.1 centimeters or 4.1 on the Richter scale, and object names for set member types also have the same challenges of ambiguous semantics as do all other things referred to by language.
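
The 4.1 example can be made concrete with a short sketch of grounding values in a canonical unit before comparing them. The conversion table and the `ground_length` helper are hypothetical stand-ins for what a reference attribute or units ontology would provide; the inch and centimeter factors themselves are exact by definition.

```python
# Conversion factors to a canonical unit (meters). This small table is a
# hypothetical stand-in for a reference units ontology.
TO_METERS = {"inch": 0.0254, "centimeter": 0.01, "meter": 1.0}


def ground_length(value, unit):
    """Convert a (value, unit) pair to meters, refusing unknown units."""
    if unit not in TO_METERS:
        # A value with no grounding cannot be compared safely, so fail loudly.
        raise ValueError(f"no grounding for unit {unit!r}")
    return value * TO_METERS[unit]


inches_in_meters = ground_length(4.1, "inch")
centimeters_in_meters = ground_length(4.1, "centimeter")
```

The two results differ by a factor of 2.54 even though the raw number is identical, and a value on an ungrounded scale (say, `"richter"`) is rejected rather than silently treated as a length.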

The variability V in Big Data or the 40-some dimensions of potential semantic heterogeneity [13] are explicit recognitions of the symbol grounding challenge. Assuming we can determine context (itself an important consideration not further discussed here), fixity of reference is essential to these groundings. Context and groundings are the ways by which we remove ambiguity in what we measure and record.

Like dictionaries for human languages, or stars and constellations for navigators, or agreed standards in measurement, or the Greenwich meridian for timekeepers, fixed references are needed to orient and “ground” each new dataset over which we attempt to integrate. Without such fixities of reference, everything floats in reference to other things, the cursed “rubber ruler” phenomenon.

Thus, we can express our Big Structure components from a foundational perspective as well. In Structured Dynamics’ view of the world, the foundation for data interoperability is grounded in reference structures or ontologies that provide the fixity of reference for concepts and data and their attributes. Upon these foundations are then constructed the domain views of concepts and attributes, which become the target for mapping other references and Big Structures:

Foundations to Big Structure

The mappings, transformations and domain and reference ontologies are themselves written in the OWL languages of the W3C and the standards of the RDF data model. At this most expressive end of Big Structure, the representations are in the form of graphs. Network and graph analytics will expand still further business intelligence prospects. The use of these standards with common and testable logic is another means to ensure coherency and interoperability of the Big Structure that results.

Note a key aspect of the grounding foundation is missing: one or more reference ontologies for attributes. Though many examples exist on the concept side, little has been done to explicitly address the questions of data value interoperability. This major gap is a current emphasis of Structured Dynamics, with much more to be said over the coming weeks. Also expect an open source reference ontology for attributes in the near future.

The thing is that we are learning how to make the various parts of this interoperability stack work. We are leveraging existing structural assets of all kinds to establish the semantics and infrastructure for domain interoperability. We know how to match and map these existing structural assets to the reference frameworks that are the foundation to interoperability.

A Vision of Interoperability

The real world is one of heterogeneous datasets, multiple schema and differing viewpoints. Even within single enterprises (including those that formerly expressed little need or interest to interoperate with the broader world), data integration and interoperability have been a real challenge. Big Data itself is not solving these problems. Quite the opposite. Big Data trends are turning data interoperability molehills into mountain-high competitive threats.

Like any well-built structure, data interoperability requires a solid foundation. That foundation must reside in exemplar reference ontologies upon which to ground the semantics and exchange standards for data. Using the canonical RDF data model makes this task practical. Existing information structures of various types across the enterprise and the Web all can and should play a role in establishing reference structures. The accretion of reference structures will lead to still further interoperability and the ability to incorporate more datasets. Currently expensive practices in, say, master data management (MDM) can begin to transition to a new paradigm. It is easy to envision working from a library of existing reference standards for use across enterprises. This kind of incremental expansion of interoperability leads to still more interoperable data in a virtuous cycle of innovation and lower budgets.

As our computing continues to get more virtual and cloud-like, physical hardware and software architectures must give way to information architectures (in the true sense of interoperability). We have no choice but to treat the architecting of information as a first-order challenge. The totally cool thing about the data integration challenge is that the architecture can be readily varied and tested to achieve a working foundation. Much empirical information exists about how to do it and what to do next. The chief challenge has been to recognize that data interoperability (and its dependence on Big Structure) is a first-order concern (and opportunity). The intersection of Big Structure with Big Data, and with graph and AI algorithms, should create new approaches to chew across the data integration environment. I expect progress to be rapid.


[1] There are at least 40 terms or concepts across these various disciplines, most related to Web and general knowledge content, that have organizational or classificatory aspects that — loosely defined — could be called an “ontology” framework or approach. See M.K. Bergman, 2007. An Intrepid Guide to Ontologies, AI3:::Adaptive Information blog, May 16, 2007.
[2] UMBEL and other upper level ontologies are examples here. In the case of UMBEL, that Big Structure is used as a scaffolding of reference concepts to link external (unrelated) structures, helping data interoperate between two unrelated systems. Such a Big Structure can also be used for other tasks, such as helping machine learning techniques categorize and disambiguate pieces of data by leveraging such a structure of types.
[3] Unfortunately, no reference structures for attributes yet exist. For a discussion of this status, see the thread on the W3C semantic web mailing list beginning at http://lists.w3.org/Archives/Public/semantic-web/2014Jul/0129.html.
[4] Data models encompass a rather broad span. The RDF discussion represents a more formal end of the data model spectrum, wherein there are complete logic, syntax and serialization discussions, more involved than in most data models.
[5] Domain ontologies represent the most closely-aligned view of the domain and its relationships of all of the component structures listed.
[6] Concept maps are very closely related to ontologies, and may include topic maps, mind maps and other graph-like structures of concepts.
[7] Schema may apply to many realms, but in the IT and software context schema mostly refers to database schema related to relational databases. These are often expressed in UML diagrams or XML schema.
[8] Mappings and transformations are a huge area of diverse structure and different serializations and specifications. Fortunately, the task of mapping external structure to RDF removes the many-to-many issues with most transformation approaches.
[9] Taxonomies mask entire sub-categories of directories, folksonomies, subject trees, and more. The key aspect is that relevant concepts are expressed in a graph relationship manner to other concepts, often in a hierarchical fashion.
[10] Categories also include the general classification process.
[11] I would consider a canonical reference listing of country names and codes to be a part of Big Structure, since it acts as a controlled vocabulary.
[12] This is a key area for including unstructured documents, since tags are a primary means of adding metadata to a document. When the pool of tags is based on the governing reference and domain ontologies, then interoperability is further promoted.
[13] M.K. Bergman, 2006. Sources and Classification of Semantic Heterogeneities, AI3:::Adaptive Information blog, June 6, 2006.
Posted:July 16, 2014

[Image: Battle of Niemen, WWI, photo from Wikimedia]
Are We Losing the War? Was it Even the Right One?

Cinephiles will readily recognize Akira Kurosawa’s Rashomon film of 1951. And, in the 1960s, one of the most popular book series was Lawrence Durrell’s The Alexandria Quartet. Both, each in its own way, tried to get at the question of what is truth by telling the same story from the perspective of different protagonists. Whether you saw this movie or read these books you know the punchline: the truth was very different depending on the point of view and experience (including self-interest and delusion) of each protagonist. All of us recognize this phenomenon of the blind men’s view of the elephant.

I have been making my living and working full time on the semantic Web and semantic technologies now for a full decade. So has my partner at Structured Dynamics, Fred Giasson. Others have certainly worked longer in this field. The original semantic Web article appeared in Scientific American in 2001 [1], and the foundational Resource Description Framework data model dates from 1999. Fred and I have our own views of what has gone on in the trenches of the semantic Web over this period. We thought a decade was a good point to look back, share what we’ve experienced, and discover where to point our next offensive thrusts.

What Has Gone Well?

The vision of the semantic Web in the Scientific American article painted a picture of globally interconnected data leveraged by agents or bots designed to make our lives easier and more automated. However, by the time that I got directly involved, nearly five years after standards first started to be published, Tim Berners-Lee and many leading proponents of RDF were beginning to shift focus to linked data. The agents, and automation, and ontologies of the initial vision were being downplayed in favor of effective means to publish and consume data based on RDF. In many ways, linked data resembled a re-branding.

This break had been coming for a while, memorably captured by a 2008 ISWC session led by Peter F. Patel-Schneider [2]. This internal division of viewpoint likely split effort that would have been better spent on proselytizing and improving tools. Some of that effort was also diverted into internal squabbles. While many others have pointed to the tactical mistake of using an XML serialization for early versions of RDF as a key factor in slowing initial adoption, a factor I agree was at play, my own suspicion is that the philosophical split taking place in the community was the heavier burden.

Whatever the cause, many of the hopes of the heady days of the initial vision have not been realized over the past fifteen years, though there have been notable successes.

The biomedical community has been the shining exemplar for data interoperability across an entire discipline, with earth sciences, ecology and other science-based domains also showing interoperability success [3]. Families of ontologies accompanied by tooling and best practices have characterized many of these efforts. Sadly, though, most other domains have not followed suit, and commercial interoperability is nearly non-existent.

Almost all of the remaining success has resided in single-institution data integration and knowledge representation initiatives. IBM’s Watson and Apple’s Siri are two amazing capabilities run and managed by single institutions, as is Google’s Knowledge Graph. Also, some individual commercial and government enterprises, willing to pay for support from semantic technology experts, have shown success in data integration, using RDF, SKOS and OWL.

We have seen the close kinship between natural language, text, and Q & A with the semantic Web, also demonstrated by Siri and more recent offshoots. We have seen a trend toward pairing great-performing open source text engines, notably Solr, with RDF and triple stores. Recommendation systems have shown some success. Linked data publishing has also had some notable examples, including the first of the lot, DBpedia, with certain institutional publishers (such as the Library of Congress, Eurostat, The Getty, Europeana, OpenGLAM [galleries, archives, libraries, and museums]) showing leadership and the commitment of significant vocabularies to linked data form.

On the standards front, early experience led to new and better versions of the SPARQL query language (SPARQL 1.1 was greatly improved in the last decade and appears to be one capability that sells triple stores), RDF 1.1 and OWL 2. Certain open source tools have become prominent, including Protégé, Virtuoso (open source) and Jena (among unnamed others, of course). At least in the early part of this history, tool development was rapid and flourishing, though the innovation pace has dropped substantially according to my tracking database Sweet Tools.

What Has Disappointed?

My biggest disappointments have been, first, the complete lack of distributed data interoperability, and, second, the lack or inability of commercial enterprises to embrace and adopt semantic technologies on their own. The near absence of discussion about instance records and their attributes helps frame the current maturity of the semantic Web. Namely, it has yet to crack the real nuts of data integration and interoperability across organizations. Again, with the exception of the biomedical community, neither in the linked data realm nor in the broader semantic Web, can we point to information based on semantic Web principles being widely shared between systems and organizations.

Some in the linked data community have explicitly acknowledged this. The abstract for the upcoming COLD 2014 workshop, for example, states [4]:

. . . applications that consume Linked Data are not yet widespread. Reasons may include a lack of suitable methods for a number of open problems, including the seamless integration of Linked Data from multiple sources, dynamic discovery of available data and data sources, provenance and information quality assessment, application development environments, and appropriate end user interfaces.

We have written about many issues with linked data, ranging from the use of improper mapping predicates; to the difficulty in publishing; and to dereferencing URIs on the Web since they are sparse and not always properly implemented [5]. But ultimately, most linked data is just instance data that can be represented in simpler attribute-value form. By shunning a knowledge representation language (namely, OWL) at the processing end, we have put too much burden on what are really just instance records. Linked data does not get the balance of labor right. It ignores the reality that data consumers want actionable information over being able to click from data item to data item, with overall quality reduced to the lowest common denominator. If a publisher has the interest and capability to publish quality linked data, great! It should become part of the data ingest pool and the data becomes easy to consume. But to insist on linked data across the board creates unnecessary barriers. Linked data growth has not nearly kept pace with broader structured data growth on the Web [6].

At the enterprise level, the semantic technology stack is hard to grasp and understand for newcomers. RDF and OWL awareness and understanding are nearly nil in companies without prior semantic Web experience, or 99.9% of all companies. This is not a failure of the enterprises; it is the failure of us, the advocates and suppliers. While we (Structured Dynamics) have developed and continue to refine the turnkey Open Semantic Framework stack, and have spent more effort than most in documenting and explicating its use, the systems are still too complicated. We combine complicated content management systems as user front-ends to a complicated semantic technology stack that needs to be driven by a complicated (to develop) ontology. And we think we are doing some of the best technology transfer around!

Moreover, while these systems are good at integrating concepts and schema, they are virtually silent on the question of actual data integration. It is shocking to say, but the semantic Web has no vocabularies or tools sufficient to enable data items for the same entity from two different datasets to be combined or reconciled [7]. These issues can be solved within the individual enterprise, but again the system breaks when distributed interoperability is the desire. General Web-based inconsistencies, such as in HTML coding or mime types, impose hurdles on distributed interoperability. These are some of the reasons why we see the successes in the context (generally) of single institutions, as opposed to anything that is truly yet Web-wide.

These points, as is often the case with software-oriented technologies, come down to a disappointing state of tooling. Markets drive developer interest, and market share has been disappointing; thus, fewer tools. Tool interest comes from commercial engagements, and not generally grants, the major source of semantic Web funding, particularly in the European Union. Pragmatic tools that solve real problems in user adoption are rarely a sufficient basis for getting a Ph.D.

The weaknesses in tooling extend from basic installation, to configuration, unit and integrated tests, data conversion and lifting, and, especially, all things ontology. Weaknesses in ontology tooling include (critically) mapping, consistency and coherency checking, authoring, managing, version control, re-factoring, optimization, and workflows. All of these issues are solvable; they are standard software challenges. But it is hard to conquer markets largely with the wrong army pursuing the wrong objectives in response to the wrong incentives.

Yet, despite the weaknesses in tooling, we believe we have been fairly effective in transferring technology to our clients. It takes more documentation and more training and, often, accompanying tool development or improvement in the workflow areas critical to the project. But clients need to be told this as well. In these still early stages, successful clients are going to have to expend more staff effort. With reasonable commitment, it is demonstrable that an enterprise can take over and manage a large-scale semantic engagement on its own. Still, for semantic technologies to have greater market penetration, it will be necessary to lower those commitments.

How Has the Environment Changed?

Of course, over the period of this history, the environment as a whole has changed markedly. The Web today is almost unrecognizable from the Web of 15 years ago. If one assumes that Web technologies tend to have a five year or so period of turnover, we have gone through at least two to three generations of change on the Web since the initial vision for the semantic Web.

The most systemic changes in this period have been cloud computing and the adoption of the smartphone. These, plus the network of workstations approach to data centers, have radically changed what is desirable in a large-scale, distributed architecture. APIs have become RESTful and database infrastructures have become flatter and more distributed. These architectures and their supporting infrastructure — such as virtual servers, MapReduce variants, and many applications — have in turn opened the door to performant management of large volumes of flat (key-value or graph) data, or big data.

On the Web side, JavaScript, just a few years older than the semantic Web, is now dominant in Web pages and is taking on server-side roles (such as through Node.js). In turn, JSON has grown in popularity as a form of data representation and transfer, and is being adapted to the semantic Web (along with efforts to codify CSV). Mobile, too, affects the Web side because of the need for multiple-platform deployments, touchscreen use, and different user interface paradigms and layout designs. The app ecosystem around smartphones has become a huge source of change and innovation.

Extremely germane to the semantic Web (indeed, to artificial intelligence overall) has been the emergence of knowledge-based AI (KBAI). The marrying of electronic Web knowledge bases (such as Wikipedia, or internal ones like the Google search index or its Knowledge Graph) with improvements in machine-learning algorithms is systematically mowing down what used to be called the Grand Challenges of computing. Sensors are also now entering the picture, from our phones to our homes and our cars, which exposes the higher-order requirement for data integration combined with semantics. NLP kits have improved in terms of accuracy and execution speed; many semantic tasks such as tagging or categorizing or questioning already perform at acceptable levels for most projects.

On the tooling side, nearly all building blocks for what needs to be done next are available in open source, with some platform areas quite functional (including OSF, of course). We have also been successful in finding clients that agree to open source the development work we do for them, since they are benefiting from the open source development that went on before them.

What Did We Set Out to Achieve?

When Structured Dynamics entered the picture, there were already many tools available and the core languages had been released. Our view of the world at that time led us to adopt two priorities for what we thought might be a five-year or so plan. We have achieved the objectives we set for ourselves then, though it has taken us a couple of years longer than planned to realize them.

One priority was to develop a reference structure for concepts to serve as a “grounding” basis for relating datasets, vocabularies, schema, taxonomies, or ontologies. We achieved this with our first commercial release (v 1.00) of UMBEL in February 2011. Subsequent to that we have progressed to v 1.05. In the coming months we will see two further major updates that have been under active effort for about eight months.

The other priority was to create a turnkey foundation for a semantic enterprise. This, too, has been achieved, with many more releases. The Open Semantic Framework (OSF) is now in version 3.00, backed by a 500-article wiki of training documentation and technical material. Support tooling now includes automated installation, testing, and data transfer and synchronization.

Because our corporate objectives were largely achieved, it was time to look at lessons learned and set new directions. This article, in part, is a result of that process.

How Did Our Priorities Evolve Over the Decade?

I thought it would be helpful to use the content of this AI3 blog to track how concerns and priorities changed for me and Structured Dynamics over this history. Since I started my blog quite soon after my entry into the semantic Web, the record of my perspectives is contemporaneous and rather complete.

The fifty articles below trace my evolution in knowledge and skills, as well as a progression from structured data to the semantic Web. These 50 articles represent about 11% of all articles in my chronological archive; they were selected as being the most germane to the question of evolution of the semantic Web.

After an early ramp-up, most of the formative discussion below occurred in the early years. Posts have declined most recently as implementation has taken over. Note that most of the links below have PDFs available from their main pages.

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

The early years of this history were concentrated on gathering background information and getting educated. The release of DBpedia in 2007 showed how knowledge bases would become essential to the semantic Web. We also identified that a lack of shared reference concepts was making it difficult to “ground” different semantic Web datasets or schema to one another. Another key theme was the diversity of native data structures on the Web, but also how all of them could be readily represented in RDF.

By 2008 we began to study the logical underpinnings to the semantic Web as we were coming to understand how it should be practiced. We also began studying Web-oriented architectures as key design guidance going forward. These themes continued into 2009, though now informed by clients and applications, which were expanding our understanding of requirements (and, sometimes, shortcomings) in the enterprise marketplace. The importance of an open world approach, given the basic open nature of knowledge management, was clarifying the role and fit of semantic solutions in the overall information space. The general community shift to linked data was beginning to surface worries.

2010 marked a shift for us to become more of a popularizer of semantic technologies in the enterprise, which was useful for attracting and informing prospects. The central role of ontologies as the guiding structures for OSF (either as codified knowledge structures or as instruction sets for the platform) led to the realization that generic functional software could be designed for re-use in almost any knowledge domain, simply by changing the data and ontologies guiding it. This increased our efforts in ontology tooling and training, now geared more to the knowledge worker. The importance of groundings for aligning schema and data caused us to work hard on UMBEL in 2011 to get it to a commercial release state.

All of these efforts were converging on design thoughts about the nature of information and how it is signified and communicated. The bases of an overall philosophy regarding our work emerged around the teachings of Charles S. Peirce and Claude Shannon. Semantics and groundings were clearly essential to convey accurate messages. Simple forms, so long as they are correct, are always preferred over complex ones because message transmittal is more efficient and less subject to losses (inaccuracies). How these structures could be represented in graphs affirmed the structural correctness of the design approach. The now obvious re-awakening of artificial intelligence helps to put the semantic Web in context: a key subpart, but still a subset, of artificial intelligence. The percentage of formative articles directly related to the semantic Web drops markedly over these last couple of years, as the emphasis continues to shift to tech transfer.

What Else Did We Learn?

Not all lessons learned warranted an article on their own. So, we have also reflected on what other lessons we learned over this decade. The overall theme is: Simpler is better.

Distributed data interoperability across the Web is a fundamental weakness. There are no magic tricks to integrate data. Data mapping and integration will always require massaging. Each data integration activity needs its own solution. However, the work can be greatly helped by ontologies and by better tooling.

In keeping with the lesson of grounding, a reference ontology for attributes is missing. It is needed as a bridge across disparate datasets describing similar entities or with different attributes for the same entities. It is also a means to reduce the pairwise combinatorial issue of integrating multiple datasets. And, whatever is done in the data integration area, an open world approach will be essential given the nature of knowledge information.
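
To make the pairwise combinatorial point concrete, here is a minimal arithmetic sketch. It assumes that, without a shared reference, every pair of datasets needs its own mapping, and that with a reference ontology each dataset needs only one mapping to that reference. The function names are hypothetical illustrations, not part of UMBEL or OSF.

```python
from math import comb


def pairwise_mappings(n_datasets: int) -> int:
    """Mappings needed if every dataset is mapped directly to every other one."""
    return comb(n_datasets, 2)  # n * (n - 1) / 2


def reference_mappings(n_datasets: int) -> int:
    """Mappings needed if each dataset is mapped once to a shared reference."""
    return n_datasets


# Compare the two approaches for a few (assumed) dataset counts.
for n in (3, 10, 50):
    print(f"{n} datasets: {pairwise_mappings(n)} pairwise "
          f"vs {reference_mappings(n)} via a reference")
```

Under these assumptions the direct approach grows quadratically (45 mappings for 10 datasets, 1,225 for 50), while the reference approach grows only linearly with the number of datasets.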

There is good design and best practice for distributed architectures. The larger these installations become, the more important it is to use a lightweight, loosely-coupled design. RESTful Web services and their interfaces are key. Simpler services with fewer functions can be designed to complement one another and increase throughput effectiveness.

Functional programming languages align well with the data and schema in knowledge management functions. Ontologies, as structures, also fit well with functional languages. The ability to create DSLs should continue to improve, bringing the knowledge management function directly into the hands of its users, the knowledge workers.

In a broader sense, as alluded to above, the semantic Web is but a set of concepts. There are multiple ways to use it. It can be leveraged without requiring “core” semantic Web tools such as triple stores. Solr can act as a semantic store because semantics, NLP and search are naturally married. But the semantic Web, in turn, needs to become re-embedded in artificial intelligence, now backed by knowledge bases, which are themselves creatures of the semantic Web.

Design needs to move away from linked data or the semantic Web as the goals. The building blocks are there, though perhaps not yet combined or expressed well. The real improvements now to the overall knowledge function will result from knowledge bases, artificial intelligence, and the semantic Web working together. That is the next frontier.

Overall, we perhaps have been in the wrong war for the wrong reasons. Linked data is certainly not an end and mostly appears to represent work, rather than innovation. The semantic Web is no longer the right war, either, because improvements there will not come so much from arguing semantic languages and paradigms. Learning how to master distributed data integration will teach the semantic Web much, and coupling artificial intelligence with knowledge bases will do much to improve the most labor-intensive stumbling blocks in the knowledge management workflow: mappings and transformations. Further, these same bases will extend the reach into analytical and statistical realms.

The semantic Web has always been an infrastructure play to us. On that basis, it will be hard to ever judge market penetration or dominance. So, maybe in terms of a vision from 15 years ago the growth of the semantic Web has been disappointing. But, for Fred and me, we are finally seeing the landscape clearly and in perspective, even if from a viewpoint that may be different from others’. From our vantage point, we are at the exciting cusp of a new, broader synthesis.

NOTE: This is Part I of a two-part series. Part II will appear shortly.

[1] Tim Berners-Lee, James Hendler, and Ora Lassila, “The Semantic Web,” in Scientific American 284(5): pp 34-43, 2001. See http://www.scientificamerican.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21&catID=2.
[2] For those with a spare 90 minutes or so, you may also want to view this panel session and debate that took place on “An OWL 2 Far?” at ISWC ’08 in Karlsruhe, Germany, on October 28, 2008. The panel was chaired by Peter F. Patel-Schneider (Bell Labs, Alcatel-Lucent) with the panel members of Stefan Decker (DERI Galway), Michel Dumontier (Carleton University), Tim Finin (University of Maryland) and Ian Horrocks (University of Oxford), with much audience participation. See http://videolectures.net/iswc08_panel_schneider_owl/
[3] Open Biomedical Ontologies (OBO) is an effort to create controlled vocabularies for shared use across different biological and medical domains. As of 2006, OBO formed part of the resources of the U.S. National Center for Biomedical Ontology (NCBO). As of the date of this article, there were 376 ontologies listed on the NCBO’s BioOntology site. Both OBO and BioOntology provide tools and best practices.
[4] Fifth International Workshop on Consuming Linked Data (COLD 2014), co-located with the 13th International Semantic Web Conference (ISWC) in Riva del Garda, Italy, October 19-20.
[7] See the thread on the W3C semantic web mailing list beginning at http://lists.w3.org/Archives/Public/semantic-web/2014Jul/0129.html.