Posted: July 23, 2014

Light and Dark Structure of Universe, @NYT, see http://vimeo.com/100907866

Envisioning A New Adaptive Infrastructure for Data Interoperability

In Part I of this two-part series, Fred Giasson and I looked back over a decade of working within the semantic Web and found it partially successful but really the wrong question moving forward. The inadequacies of the semantic Web to date reside in its lack of attention to practical data interoperability across organizational or community boundaries. An emphasis on linked data has created an illusion that questions of data integration are being effectively addressed. They are not.

Linked data is hard to publish and not the only useful form for consuming data; linked data quality is often unreliable; the linking predicates for relating disparate data sources to one another may be inadequate or wrong; and, there are no reference groundings for relating data values across datasets. Neither the semantic Web nor linked data has developed the practices, tooling or experience to actually interoperate data across the Web. These criticisms are not meant to condemn linked data — it is, after all, the early years. Where it is compliant and from authoritative information sources, linked data can be a gold standard in data publishing. But, linked data is neither necessary nor essential, and may even be a diversion if it sucks the air from the room for what is more broadly useful.

This table summarizes the state-of-art in the semantic Web for frameworks and guidance in how to interoperate data:

Category | Related Terms | Status in the Semantic Web | Notes
Classes | sets, concepts, topics, types, kinds | Mature, but broader scope coverage desirable; equivalent linkages between datasets often mis-applied; more realistic proximate linkages in flux, with no bases to reason over them | [1]
Instances | individuals, entities, members, records, things | Current basis for linked data; many linkage properties mis-applied | [2]
Relation Properties | relations, predicates | Equivalent linkages between datasets often mis-applied; more realistic proximate linkages in flux, with no bases to reason over them | [3]
Descriptive Properties | attributes, descriptors | Save for a couple of minor exceptions, no basis for mapping attributes across datasets | [4]
Values | data | Basic QUDT ontologies could contribute here | [5]

We can relate the standard subject-predicate-object triple statement in RDF to this table, using the Category column. Classes and Instances relate to the subject, Relation and Descriptive Properties relate to the predicate, and Values relate to the object [6] in an RDF triple. The concepts and class schema of different information sources (their “aboutness”) can reasonably be made to interoperate. In terms of the description logics that underlie the logic bases of W3C ontologies, the focus and early accomplishments of the semantic Web have been on this “terminological box,” or T-Box [7]. Tooling to make the mappings more productive and means to test the coherence and completeness of the results still remain priority efforts, but the conceptual basis and best practices have progressed well.
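
To make the T-Box/A-Box distinction concrete, here is a minimal sketch in Python using rdflib; the tooling choice is illustrative only, and the ex: names are hypothetical. The T-Box statements describe classes and properties at the schema level, the A-Box statements assert an instance and its attribute value, and both reduce to the same triple form.

```python
# A minimal sketch (Python + rdflib, not named in the article) of how the
# categories above map onto RDF triples, and where T-Box and A-Box differ.
# All ex: names are hypothetical.
from rdflib import Graph

ttl = """
@prefix ex:   <http://example.org/> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# T-Box (terminological): classes and the properties that relate them
ex:Mammal  rdfs:subClassOf ex:Animal .          # class (subject) - relation (predicate) - class (object)
ex:mass    rdfs:domain     ex:Animal .          # descriptive property declared at the schema level

# A-Box (assertional): instances and their attribute values
ex:lion_1  rdf:type  ex:Mammal .                # instance - relation - class
ex:lion_1  ex:mass   "190.0"^^xsd:double .      # instance - descriptive property - value
"""

g = Graph()
g.parse(data=ttl, format="turtle")

# Every statement, T-Box or A-Box, reduces to the same subject-predicate-object shape
for s, p, o in g:
    print(s, p, o)
```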

In contrast, nearly lacking in focus and tooling has been the flip side of that description logics coin: the A-Box [7], or assertional and instance (data) level of the equation. Both the T-Box and A-Box are necessary to provide a knowledge base. Today, there are virtually no vocabularies, no tooling, no history, no best practices and no “grounding” for actual A-Box data integration within the semantic Web. Without such guidance, the semantic Web is silent on the questions of data interoperability. As David Karger explained in his keynote address at ISWC in 2013 [8], “we’ve got our heads in the clouds while people are stuck in the dirt.”

Yet these are not fatal flaws of the semantic Web, nor are they permanent. Careful inspection of current circumstances, combined with purposeful action, suggests:

  1. Data integration can be solved
  2. Leveraging background knowledge is a key enabler
  3. Interoperability requires reference structures, what we are calling Big Structure.

The Prism of Data Interoperability

Why do we keep pointing to the question of data interoperability? Consider these facts:

  • 80% of all available information is in text or documents (unstructured)
  • 40% of standard IT project expenses are devoted to data integration in one form or another, due to the manual effort needed for data migration and mapping
  • Information volumes are now doubling in fewer than two years
  • Other trends, including smartphones and sensors, are further accelerating information growth
  • Effective business intelligence requires the use of quality, integrated data.

The abiding, costly, frustrating and energy-sucking demands of data integration have been a constant within enterprises for more than three decades. The same challenges exist on the Web. The Internet of Things will further demand better interoperability frameworks and guidelines. Current data integration tooling relies little upon semantics, and no leading alternative is based principally around semantic approaches [9].

The data integration market is considered to include enterprise data integration and extract, transform and load (ETL) vendors. Gartner estimates tool sales for this market to be about $2 billion annually, with a growth rate faster than most IT areas [10]. But data integration also touches upon broader areas such as enterprise application integration (EAI), federated search and query, and master data management (MDM), among others. Given that data integration is also 40% of standard IT project costs, new approaches are needed to finally unblock the costly logjam of enterprise information integration. Most analysts see firms that are actively pursuing data integration innovations as forward-thinking and more competitive.

Data integration is the combining of information from multiple sources to provide users a uniform view of it. Data interoperability is the ability to exchange and work upon (inter-operate) information across system and organizational boundaries. The ability to integrate data precedes the ability to interoperate it. For example, I may have three datasets of mammals that I want to consolidate and describe in similar terms with common units of measurement. That is an example of data integration. I may then want to relate this mammal knowledge base to a more general perspective of the animal kingdom. That is an example of data interoperability. Data integration usually occurs within a single organization or enterprise or institutional offering (as would be, say, Wikipedia). Data interoperability additionally needs to define meanings and communicate them in common ways across organizational, domain or community boundaries.
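
The distinction can be made concrete with a small sketch in plain Python; the datasets, field names and conversion factors below are invented for illustration. The first step consolidates the records into a uniform view with common units (integration); the second relates that consolidated view to a broader, shared reference concept (interoperability).

```python
# A hedged, purely illustrative sketch: the datasets, field names, and unit
# factors below are invented, not drawn from the article.
dataset_a = [{"name": "lion", "mass_kg": 190.0}]
dataset_b = [{"species": "gray wolf", "weight_lb": 88.0}]
dataset_c = [{"label": "harbor seal", "mass_g": 132000.0}]

LB_TO_KG = 0.453592

def integrate():
    """Data integration: consolidate to a uniform view with common terms and units (kg)."""
    uniform = [{"name": r["name"], "mass_kg": r["mass_kg"]} for r in dataset_a]
    uniform += [{"name": r["species"], "mass_kg": r["weight_lb"] * LB_TO_KG} for r in dataset_b]
    uniform += [{"name": r["label"], "mass_kg": r["mass_g"] / 1000.0} for r in dataset_c]
    return uniform

# Data interoperability: relate the consolidated records to a broader, shared
# reference structure (the IRI below stands in for something like UMBEL).
MAMMAL_REFERENCE_CONCEPT = "http://example.org/ref/Mammal"

for record in integrate():
    print(record["name"], round(record["mass_kg"], 1), "kg ->", MAMMAL_REFERENCE_CONCEPT)
```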

These are natural applications for the semantic Web. Why, then, has there not been more practical use of the semantic Web for these purposes?

That is an interesting question that we only partially addressed in Part I of this series. All aspects of data have semantics: what the data is about, what its context is, how it relates to other data, and what its values are and what they mean. The semantic Web is closely allied with natural language processing, an essential for bringing the 80% of unstructured data into the equation. Semantic Web ontologies are useful structures for relating real-world data into common, reference forms. The open world logic of the semantic Web is the right perspective for knowledge functions under the real-world conditions of constantly expanding information and understandings.

While these requirements suggest an integral role for the semantic Web, it is also clear that the semantic Web has not yet made these contributions. One explanation may be that semantic Web advocates, let alone the linked data tribe, have not seen data integration — as traditionally defined — as their central remit. Another possibility is that trying to solve data interoperability through the primary lens of the semantic Web is the wrong focus. In any case, meeting the challenge of data interoperability clearly requires a much broader context.

Embedding Data Interoperability Into a Broader Context

The semantic Web, in our view, is properly understood as a sub-domain of artificial intelligence. Semantic technologies mesh smoothly with natural language tasks and objectives. But, as we noted in a recent review article, artificial intelligence is itself undergoing a renaissance [11]. These advances are coming about because of the use of knowledge-based AI (KBAI), which combines knowledge bases with machine learning and other AI approaches. Natural language and spoken interfaces combined with background knowledge and a few machine-learning utilities are what underlie Apple’s Siri, for example.

The realization that the semantic Web is useful but insufficient and that AI is benefitting from the leveraging of background knowledge and knowledge bases caused us to “decompose” the data-interoperability information space. Because artificial intelligence is a key player here, we also wanted to capture all of the main sub-domains of AI and their relationships to one another:

Artificial Intelligence Domains

Two core observations emerge from standing back and looking at these questions. First, many of AI’s main sub-domains have a role to play with respect to data integration and interoperability:

AI Domains Related to Data Interoperability

This places semantic Web technologies as a co-participant with natural language processing, knowledge mining, pattern recognizers, KR languages, reasoners, and machine learning as domains related to data interoperability.

And, second, generalizing the understanding of knowledge bases and other guiding structures in this space, such as ontologies, highlights the potential importance of Big Structure. Virtually every one of the domains displayed above would be aided by leveraging background knowledge.

Grounding Data Interoperability in Big Structure

As our previous AI review showed [11], reference knowledge bases — Wikipedia in the forefront — have been a tremendous boon to moving forward on many AI challenges. Our own experience with UMBEL has also shown how reference ontologies can help align and provide common grounding for mapping different information domains into one another [12]. Vetted, gold-standard reference structures provide a fixity of coherent touchpoints for orienting different concepts and domains (and, we believe, data) to one another.

In the data integration context, master data models (and management, or MDM) attempt to provide common reference terms and objects to aid the integration effort. Like other areas in conventional data integration, very few examples of MDM tools based on semantic technologies exist.

This use of reference structures and the importance of knowledge bases to help solve hard computational tasks suggests there may be a general principle at work. If ontologies can help orient domain concepts, why can’t they also be used to orient instance data and their attributes? In fact, must these structures always be ontologies? Are not other common reference structures such as taxonomies, vocabularies, reference entity sets, or other typologies potentially useful to data integration?

By standing back in this manner and asking these broader questions we can see a host of structures like reference concepts, reference attributes, reference places, reference identifiers, and the like, playing the roles of providing common groundings for integration and interoperation. Through the AI experience, we can also see that subsequent use of these reference structures — be they full knowledge bases or more limited structures like taxonomies or typologies — can further improve information extraction and organization. The virtuous circle of knowledge structures improving AI algorithms, which can then further improve the knowledge structures, has been a real Aha! moment for the artificial intelligence community. We should see rapid iterations of this virtuous circle in the months to come.

These perspectives can help lead to purposeful designs and approaches for attacking such next-generation problems as data interoperability. The semantic Web cannot solve this alone because additional AI capabilities need to be brought to bear. Conventional data integration approaches that lack semantic Big Structure groundings — let alone the use of AI techniques — have years of history of high cost and disappointing results. No conventional enterprise knowledge management problem appears sheltered from this whirlwind of knowledge-backed AI.

At Structured Dynamics, Fred Giasson and I have been discussing “Big Structure” for some time. However, it was only in researching this article that I came across the first public use of this phrase in the context of AI and big data. In May, Dr. Jiawei Han, a leading researcher in data mining, gave a lecture at Yahoo! Labs entitled Big Data Needs Big Structure. In it, he defines “Big Structure” as a type of information network. The correlation with ontologies and knowledge structures is obvious.

An Emerging Development Agenda

The intellectual foundations already exist to move aggressively on a focused development agenda to improve the infrastructure of data interoperability. This emerging agenda needs to look to new reference structures, better tooling, the use of functional languages and practices, and user interfaces and workflows that improve the mappings that are the heart of interoperability.

Big Structure, such as UMBEL for referencing what data is about, is the present exemplar for going forward. Excellent reference and domain ontologies for common domains already exist. Mapping predicates have been developed for these purposes. Though creation of the maps is still laborious, tooling improvements (see below) should speed up that process as well.

What are needed next are reference structures to help guide attribute mappings, data value mappings, and transformations into usable common attribute quantities and types. I will discuss in a later post our more detailed thoughts on what a reference, gold-standard attribute ontology should look like. This new Big Structure should also be helpful in guiding conversion, transformation and “lifting” utilities that may be used to bring attribute values from heterogeneous sources into a common basis. As mappings are completed, these too can become standard references as the bootstrapping continues.
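
As a rough illustration of the idea, and not the actual ontology to be described later, the sketch below declares two source attributes as expressions of a single hypothetical reference attribute, each annotated with its unit, and then “lifts” a source value into the reference unit. All of the refattr: names and the conversion table are invented.

```python
# A hedged sketch of what a reference attribute grounding could look like.
# The refattr: ontology and the unit conversion table are hypothetical;
# the real structure is to be described in a later post.
from rdflib import Graph, Namespace, URIRef

ttl = """
@prefix refattr: <http://example.org/refattr/> .
@prefix srcA:    <http://example.org/sourceA/> .
@prefix srcB:    <http://example.org/sourceB/> .

# Two source attributes declared as expressions of one reference attribute,
# each annotated with the unit its values are recorded in.
srcA:mass_kg   refattr:expressionOf refattr:BodyMass ; refattr:unit refattr:Kilogram .
srcB:weight_lb refattr:expressionOf refattr:BodyMass ; refattr:unit refattr:Pound .
"""

g = Graph()
g.parse(data=ttl, format="turtle")

REFATTR = Namespace("http://example.org/refattr/")
TO_KG = {REFATTR.Kilogram: 1.0, REFATTR.Pound: 0.453592}  # conversion to the reference unit

def lift(source_attr: str, value: float) -> float:
    """Convert a source attribute value into the reference unit (kg here)."""
    unit = g.value(URIRef(source_attr), REFATTR.unit)
    return value * TO_KG[unit]

print(lift("http://example.org/sourceB/weight_lb", 88.0))  # -> ~39.9 kg
```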

Mappings for data integration across the scales, scope and growth of data volumes on the Web and within enterprises can no longer be handled manually. Semi-automated tooling must be developed and refined that operates over large volumes with acceptable performance. Constant efforts to reduce the data volumes requiring manual curation are essential; AI approaches should be incorporated into the virtuous iterations to reduce these efforts. Meanwhile, attentiveness to productive user interfaces and efficient workflows is also essential to improve throughput.
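
One way such semi-automated tooling can reduce manual curation is to propose candidate attribute mappings for human review. The sketch below uses simple string similarity from the Python standard library purely as an illustration; real tooling would layer semantic and machine-learning signals on top, and the attribute names shown are invented.

```python
# A hedged illustration of semi-automated mapping: propose candidate attribute
# alignments for a human to confirm or reject, so that manual curation is
# focused on the uncertain cases only. Attribute names are invented.
from difflib import SequenceMatcher

source_attrs = ["body_mass_kg", "habitat_region", "avg_lifespan_yrs"]
target_attrs = ["mass", "region", "lifespan", "diet"]

def candidates(source, targets, threshold=0.4):
    """Score each target against the source and keep the plausible matches."""
    scored = [(t, SequenceMatcher(None, source, t).ratio()) for t in targets]
    return sorted([c for c in scored if c[1] >= threshold],
                  key=lambda c: c[1], reverse=True)

for s in source_attrs:
    print(s, "->", candidates(s, target_attrs))
# High-confidence pairs can be auto-accepted; the rest go to manual review.
```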

Further, by working off of standards-based Big Structures, this tooling can be made more-or-less generic, with ready application to different domains and different data. Because this tooling will often work in enterprises behind firewalls, standard enterprise capabilities (security, access, preservation, availability) should also be added to this infrastructure.

These Big Structures and tools should themselves be created and maintained via functional programming languages and DSLs specifically geared to the circumstances at hand. We want languages suited to RDF and AI purposes with superior performance across large mapped datasets and unstructured text. But we also want languages that are easier to use and maintain by knowledge workers themselves. Partitioning strategies may also need to be employed to ensure acceptable real-time feedback to users responsible for data integration mappings.

A New Adaptive Infrastructure for Data Interoperability

Structured Dynamics’ review exercise, now documented in this two-part series, affirms the semantic Web needs to become re-embedded in artificial intelligence, backed by knowledge bases, which are themselves creatures of the semantic Web. Coupling artificial intelligence with knowledge bases will do much to improve the most labor-intensive stumbling blocks in the data integration workflow: mappings and transformations. Through a purposeful approach of developing reference structures for attributes and data values, we will begin to see marked improvements in the efficiency and lower costs of data integration. In turn, what is learned by using these approaches for mastering MDM will teach the semantic Web much.

An approach using semantic technologies and artificial intelligence tools will begin to solve the data integration puzzle. By leveraging background knowledge, we will begin to extend into data interoperability. Purposeful attention to tooling and workflows geared to improve the mapping speed and efficiency by users will enable us to increase the stable of reference structures — that is, Big Structure — available for the next integration challenges. As this roster of Big Structures increases, they can be shared, allowing more generic issues of data integration to be overcome, freeing domains and enterprises to target what is unique.

Achieving this vision will not occur overnight. But, based on a decade of semantic Web experience and the insights being gained from today’s knowledge-based AI advances, the way forward looks pretty clear. We are entering a fundamental new era of knowledge-based computation. We welcome challenging case examples that will help us move this vision forward.

NOTE: This is Part II, concluding the two-part series begun with Part I, A Decade in the Trenches of the Semantic Web

[1] Using semantic ontologies can and has worked well for many domains and applications, such as the biomedical OBO ontologies, IBM’s Watson, Google’s Knowledge Graph, and hundreds in more specific domains. Combined with concept reference structures like UMBEL, both building blocks and exemplars exist for how to interoperate across what different domains are about.
[2] For examples of issues, see M. K. Bergman, 2009. When Linked Data Rules Fail, AI3:::Adaptive Information blog, November 16, 2009.
[3] Some of these options are overviewed by M. K. Bergman, 2010. The Nature of Connectedness on the Web, AI3:::Adaptive Information blog, November 22, 2010.
[4] See the thread on the W3C semantic web mailing list beginning at http://lists.w3.org/Archives/Public/semantic-web/2014Jul/0129.html.
[6] The object may also refer to another class or instance, in which case the relation property takes the form of an ObjectProperty and the “value” is the URI referring to that object.
[7] See, for example, M. K. Bergman, 2009. Making Linked Data Reasonable Using Description Logics, Part 2, AI3:::Adaptive Information blog, February 15, 2009.
[9] Info-Tech Research Group, 2011. Vendor Landscape Plus: Data Integration Tools, 72 pp.
[10] Gartner estimates that the data integration tool market was slightly over $2 billion at the end of 2012, an increase of 7.4% from 2011. This market is seeing an above-average growth rate of the overall enterprise software market, as data integration continues to be considered a strategic priority by organizations. See Eric Thoo, Ted Friedman, Mark A. Beyer, 2013. Magic Quadrant for Data Integration Tools, research Report G00248961 from Gartner, Inc., 17 July 2013; see: http://www.gartner.com/technology/reprints.do?id=1-1HBEFSF&ct=130717&st=sb
[11] See M. K. Bergman, 2014. Spring Dawns on Artificial Intelligence, AI3:::Adaptive Information blog, June 2, 2014.
[12] See M. K. Bergman, 2011. In Search of ‘Gold Standards’ for the Semantic Web, AI3:::Adaptive Information blog, February 28, 2011.
Posted: July 16, 2014

Battle of Niemen, WWI, photo from Wikimedia

Are We Losing the War? Was it Even the Right One?

Cinemaphiles will readily recognize Akira Kurosawa‘s Rashomon film of 1950, and in the 1960s one of the most popular book series was Lawrence Durrell‘s The Alexandria Quartet. Both, each in its own way, tried to get at the question of what is truth by telling the same story from the perspective of different protagonists. Whether you saw the movie or read the books, you know the punchline: the truth was very different depending on the point of view and experience — including self-interest and delusion — of each protagonist. All of us recognize this phenomenon of the blind men’s view of the elephant.

I have been making my living and working full time on the semantic Web and semantic technologies now for a full decade. So has my partner at Structured Dynamics, Fred Giasson. Others have certainly worked longer in this field. The original semantic Web article appeared in Scientific American in 2001 [1], and the foundational Resource Description Framework data model dates from 1999. Fred and I have our own views of what has gone on in the trenches of the semantic Web over this period. We thought a decade was a good point to look back, share what we’ve experienced, and discover where to point our next offensive thrusts.

What Has Gone Well?

The vision of the semantic Web in the Scientific American article painted a picture of globally interconnected data leveraged by agents or bots designed to make our lives easier and more automated. However, by the time that I got directly involved, nearly five years after standards first started to be published, Tim Berners-Lee and many leading proponents of RDF were beginning to shift focus to linked data. The agents, and automation, and ontologies of the initial vision were being downplayed in favor of effective means to publish and consume data based on RDF. In many ways, linked data resembled a re-branding.

This break had been coming for a while, memorably captured by a 2008 ISWC session led by Peter F. Patel-Schneider [2]. This internal division of viewpoint likely caused effort to be split that would have been better spent in proselytizing and improving tools. It also diverted energy into internal squabbles. While many others have pointed to a tactical mistake of using an XML serialization for early versions of RDF as a key factor in slowing initial adoption, a factor I agree was at play, my own suspicion is that the philosophical split taking place in the community was the heavier burden.

Whatever the cause, many of the hopes of the heady days of the initial vision have not been realized over the past fifteen years, though there have been notable successes.

The biomedical community has been the shining exemplar for data interoperability across an entire discipline, with earth sciences, ecology and other science-based domains also showing interoperability success [3]. Families of ontologies accompanied by tooling and best practices have characterized many of these efforts. Sadly, though, most other domains have not followed suit, and commercial interoperability is nearly non-existent.

Almost all of the remaining success has resided in single-institution data integration and knowledge representation initiatives. IBM’s Watson and Apple’s Siri are two amazing capabilities run and managed by single institutions, as is Google’s Knowledge Graph. Also, some individual commercial and government enterprises, willing to pay for support from semantic technology experts, have shown success in data integration, using RDF, SKOS and OWL.

We have seen the close kinship between natural language, text, and Q & A with the semantic Web, also demonstrated by Siri and more recent offshoots. We have seen a trend toward pairing great-performing open source text engines, notably Solr, with RDF and triple stores. Recommendation systems have shown some success. Linked data publishing has also had some notable examples, including the first of the lot, DBpedia, with certain institutional publishers (such as the Library of Congress, Eurostat, The Getty, Europeana, OpenGLAM [galleries, libraries, archives, and museums]) showing leadership and the commitment of significant vocabularies to linked data form.

On the standards front, early experience led to new and better versions of the SPARQL query language (SPARQL 1.1 was greatly improved in the last decade and appears to be one capability that sells triple stores), RDF 1.1 and OWL 2. Certain open source tools have become prominent, including Protégé, Virtuoso (open source) and Jena (among unnamed others, of course). At least in the early part of this history, tool development was rapid and flourishing, though the innovation pace has dropped substantially according to my tracking database Sweet Tools.

What Has Disappointed?

My biggest disappointments have been, first, the complete lack of distributed data interoperability, and, second, the lack or inability of commercial enterprises to embrace and adopt semantic technologies on their own. The near absence of discussion about instance records and their attributes helps frame the current maturity of the semantic Web. Namely, it has yet to crack the real nuts of data integration and interoperability across organizations. Again, with the exception of the biomedical community, neither in the linked data realm nor in the broader semantic Web, can we point to information based on semantic Web principles being widely shared between systems and organizations.

Some in the linked data community have explicitly acknowledged this. The abstract for the upcoming COLD 2014 workshop, for example, states [4]:

. . . applications that consume Linked Data are not yet widespread. Reasons may include a lack of suitable methods for a number of open problems, including the seamless integration of Linked Data from multiple sources, dynamic discovery of available data and data sources, provenance and information quality assessment, application development environments, and appropriate end user interfaces.

We have written about many issues with linked data, ranging from the use of improper mapping predicates; to the difficulty in publishing; and to dereferencing URIs on the Web since they are sparse and not always properly implemented [5]. But ultimately, most linked data is just instance data that can be represented in simpler attribute-value form. By shunning a knowledge representation language (namely, OWL) at the processing end, we have put too much burden on what are really just instance records. Linked data does not get the balance of labor right. It ignores the reality that data consumers want actionable information over being able to click from data item to data item, with overall quality reduced to the lowest common denominator. If a publisher has the interest and capability to publish quality linked data, great! It should become part of the data ingest pool and the data becomes easy to consume. But to insist on linked data across the board creates unnecessary barriers. Linked data growth has not nearly kept pace with broader structured data growth on the Web [6].
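
The point is easy to see side by side. In the hedged sketch below (hypothetical URIs), the same instance record is first expressed as linked-data triples and then flattened into the simpler attribute-value form that most consumers actually work with.

```python
# The same instance record two ways: as linked-data triples and as the simpler
# attribute-value form most consumers actually work with. URIs are hypothetical.
from rdflib import Graph, Literal

ttl = """
@prefix ex:   <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

ex:lion_1 a ex:Mammal ;
    foaf:name  "Lion" ;
    ex:mass    190 ;
    ex:habitat "savanna" .
"""

g = Graph()
g.parse(data=ttl, format="turtle")

# Flattening the triples for the one subject yields a plain attribute-value record.
record = {p.n3(g.namespace_manager): (o.toPython() if isinstance(o, Literal) else str(o))
          for _, p, o in g}
print(record)
# e.g. {'rdf:type': 'http://example.org/Mammal', 'foaf:name': 'Lion',
#       'ex:mass': 190, 'ex:habitat': 'savanna'}
```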

At the enterprise level, the semantic technology stack is hard to grasp and understand for newcomers. RDF and OWL awareness and understanding are nearly nil in companies without prior semantic Web experience, or 99.9% of all companies. This is not a failure of the enterprises; it is the failure of us, the advocates and suppliers. While we (Structured Dynamics) have developed and continue to refine the turnkey Open Semantic Framework stack, and have spent more efforts than most in documenting and explicating its use, the systems are still too complicated. We combine complicated content management systems as user front-ends to a complicated semantic technology stack that needs to be driven by a complicated (to develop) ontology. And we think we are doing some of the best technology transfer around!

Moreover, while these systems are good at integrating concepts and schema, they are virtually silent on the question of actual data integration. It is shocking to say, but the semantic Web has no vocabularies or tools sufficient to enable data items for the same entity from two different datasets to be combined or reconciled [7]. These issues can be solved within the individual enterprise, but again the system breaks when distributed interoperability is the desire. General Web-based inconsistencies, such as in HTML coding or mime types, impose hurdles on distributed interoperability. These are some of the reasons why we see the successes in the context (generally) of single institutions, as opposed to anything that is truly yet Web-wide.
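
The gap is easy to demonstrate. In the following sketch (hypothetical data and URIs), owl:sameAs asserts that two records describe the same entity, but nothing in the standard stack says how their differing attributes and units should be combined or reconciled; that burden falls entirely on the application.

```python
# A hedged illustration of the reconciliation gap: owl:sameAs can assert that
# two records describe the same entity, but the standards say nothing about
# how to combine or reconcile their attribute values. Data is hypothetical.
from rdflib import Graph
from rdflib.namespace import OWL

ttl = """
@prefix ex1: <http://example.org/datasetA/> .
@prefix ex2: <http://example.org/datasetB/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

ex1:lion ex1:averageMassKg 190 .
ex2:panthera_leo ex2:typicalWeightLb 420 .
ex1:lion owl:sameAs ex2:panthera_leo .
"""

g = Graph()
g.parse(data=ttl, format="turtle")

for s, _, o in g.triples((None, OWL.sameAs, None)):
    print("same entity, two descriptions:")
    print(" ", list(g.predicate_objects(s)))
    print(" ", list(g.predicate_objects(o)))
    # Which attribute wins, how the units reconcile, and what the merged
    # record should look like is left entirely to the application.
```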

These points, as is often the case with software-oriented technologies, come down to a disappointing state of tooling. Markets drive developer interest, and market share has been disappointing; thus, fewer tools. Tool interest comes from commercial engagements, and not generally from grants, the major source of semantic Web funding, particularly in the European Union. Pragmatic tools that solve real problems in user adoption are rarely a sufficient basis for getting a Ph.D.

The weaknesses in tooling extend from basic installation, to configuration, unit and integrated tests, data conversion and lifting, and, especially, all things ontology. Weaknesses in ontology tooling include (critically) mapping, consistency and coherency checking, authoring, managing, version control, re-factoring, optimization, and workflows. All of these issues are solvable; they are standard software challenges. But it is hard to conquer markets largely with the wrong army pursuing the wrong objectives in response to the wrong incentives.

Yet, despite the weaknesses in tooling, we believe we have been fairly effective in transferring technology to our clients. It takes more documentation and more training and, often, accompanying tool development or improvement in the workflow areas critical to the project. But clients need to be told this as well. In these still early stages, successful clients are going to have to expend more staff effort. With reasonable commitment, it is demonstrable that an enterprise can take over and manage a large-scale semantic engagement on its own. Still, for semantic technologies to have greater market penetration, it will be necessary to lower those commitments.

How Has the Environment Changed?

Of course, over the period of this history, the environment as a whole has changed markedly. The Web today is almost unrecognizable from the Web of 15 years ago. If one assumes that Web technologies tend to have a five year or so period of turnover, we have gone through at least two to three generations of change on the Web since the initial vision for the semantic Web.

The most systemic changes in this period have been cloud computing and the adoption of the smartphone. These, plus the network of workstations approach to data centers, have radically changed what is desirable in a large-scale, distributed architecture. APIs have become RESTful and database infrastructures have become flatter and more distributed. These architectures and their supporting infrastructure — such as virtual servers, MapReduce variants, and many applications — have in turn opened the door to performant management of large volumes of flat (key-value or graph) data, or big data.

On the Web side, JavaScript, just a few years older than the semantic Web, is now dominant in Web pages and taking on server-side roles (such as through Node.js). In turn, JSON has now grown in popularity as a form of data representation and transfer and is being adopted by the semantic Web (along with efforts to codify CSV). Mobile, too, affects the Web side because of the need for multiple-platform deployments, touchscreen use, and different user interface paradigms and layout designs. The app ecosystem around smartphones has become a huge source for change and innovation.

Extremely germane to the semantic Web — indeed, overall, for artificial intelligence — has been the emergence of knowledge-based AI (KBAI). The marrying of electronic Web knowledge bases — such as Wikipedia or internal ones like the Google search index or its Knowledge Graph — with improvements in machine-learning algorithms is systematically mowing down what used to be called the Grand Challenges of computing. Sensors are also now entering the picture, from our phones to our homes and our cars, which exposes the higher-order requirement for data integration combined with semantics. NLP kits have improved in terms of accuracy and execution speed; many semantic tasks such as tagging or categorizing or questioning already perform at acceptable levels for most projects.

On the tooling side, nearly all building blocks for what needs to be done next are available in open source, with some platform areas quite functional (including OSF, of course). We have also been successful in finding clients that agree to open source the development work we do for them, since they are benefiting from the open source development that went on before them.

What Did We Set Out to Achieve?

When Structured Dynamics entered the picture, there were already many tools available and core languages had been released. Our view of the world at that time led us to adopt two priorities for what we thought might be a five year or so plan. We have achieved the objectives we set for ourselves then, though it has taken us a couple of years longer than planned to realize them.

One priority was to develop a reference structure for concepts to serve as a “grounding” basis for relating datasets, vocabularies, schema, taxonomies, or ontologies. We achieved this with our first commercial release (v 1.00) of UMBEL in February 2011. Subsequent to that we have progressed to v 1.05. In the coming months we will see two further major updates that have been under active effort for about eight months.

The other priority was to create a turnkey foundation for a semantic enterprise. This, too, has been achieved, with many more releases. The Open Semantic Framework (OSF) is now in version 3.00, backed by a technical and training wiki of some 500 articles. Support tooling now includes automated installation, testing, and data transfer and synchronization.

Because our corporate objectives were largely achieved it was time to look at lessons learned and set new directions. This article, in part, is a result of that process.

How Did Our Priorities Evolve Over the Decade?

I thought it would be helpful to use the content of this AI3 blog to track how concerns and priorities changed for me and Structured Dynamics over this history. Since I started my blog quite soon after my entry into the semantic Web, the record of my perspectives was conterminous and rather complete.

The fifty articles below trace my evolution in knowledge and skills, as well as a progression from structured data to the semantic Web. These 50 articles represent about 11% of all articles in my chronological archive; they were selected as being the most germane to the question of evolution of the semantic Web.

After an early ramp-up, most of the formative discussion below occurred in the early years. Posts have declined most recently as implementation has taken over. Note that most of the links below have PDFs available from their main pages.

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

The early years of this history were concentrated on gathering background information and getting educated. The release of DBpedia in 2007 showed how knowledge bases would become essential to the semantic Web. We also identified that a lack of shared reference concepts was making it difficult to “ground” different semantic Web datasets or schema to one another. Another key theme was the diversity of native data structures on the Web, but also how all of them could be readily represented in RDF.

By 2008 we began to study the logical underpinnings of the semantic Web as we were coming to understand how it should be practiced. We also began studying Web-oriented architectures as key design guidance going forward. These themes continued into 2009, though now informed by clients and applications, which was expanding our understanding of requirements (and, sometimes, shortcomings) in the enterprise marketplace. The importance of an open world approach to the basic open nature of knowledge management was cementing clarity about the role and fit of semantic solutions in the overall information space. The general community shift to linked data was beginning to surface worries.

2010 marked a shift for us to become more of a popularizer of semantic technologies in the enterprise, useful to attract and inform prospects. The central role of ontologies as the guiding structures for OSF (either as codified knowledge structures or as instruction sets for the platform) led to the realization that generic software could be designed for re-use in most any knowledge domain simply by changing the data and ontologies guiding it. This increased our efforts in ontology tooling and training, now geared more to the knowledge worker. The importance of groundings for aligning schema and data caused us to work hard on UMBEL in 2011 to get it to a commercial release state.

All of these efforts were converging on design thoughts about the nature of information and how it is signified and communicated. The bases of an overall philosophy regarding our work emerged around the teachings of Charles S. Peirce and Claude Shannon. Semantics and groundings were clearly essential to convey accurate messages. Simple forms, so long as they are correct, are always preferred over complex ones because message transmittal is more efficient and less subject to losses (inaccuracies). How these structures could be represented in graphs affirmed the structural correctness of the design approach. The now obvious re-awakening of artificial intelligence helps to put the semantic Web in context: a key subpart, but still a subset, of artificial intelligence. The percentage of formative articles over these last couple of years that relate directly to the semantic Web has dropped markedly, as the emphasis continues to shift to tech transfer.

What Else Did We Learn?

Not all lessons learned warranted an article on their own. So, we have also reflected on what other lessons we learned over this decade. The overall theme is: Simpler is better.

Distributed data interoperability across the Web is a fundamental weakness. There are no magic tricks to integrate data. Data mapping and integration will always require massaging. Each data integration activity needs its own solution. However, it can greatly be helped with ontologies and with better tooling.

In keeping with the lesson of grounding, a reference ontology for attributes is missing. It is needed as a bridge across disparate datasets describing similar entities or with different attributes for the same entities. It is also a means to reduce the pairwise combinatorial issue of integrating multiple datasets. And, whatever is done in the data integration area, an open world approach will be essential given the nature of knowledge information.

There is good design and best practice for distributed architectures. The larger these installations become, the more important it is to use a lightweight, loosely-coupled design. RESTful Web services and their interfaces are key. Simpler services with fewer functions can be designed to complement one another and increase throughput effectiveness.

Functional programming languages align well with the data and schema in knowledge management functions. Ontologies, as structures, also fit well with functional languages. The ability to create DSLs should continue to improve, bringing the knowledge management function directly into the hands of its users, the knowledge workers.

In a broader sense, alluded to above, the semantic Web is but a set of concepts. There are multiple ways to use it. It can be leveraged without requiring “core” semantic Web tools such as triple stores. Solr can act as a semantic store because semantics, NLP and search are naturally married. But, the semantic Web, in turn, needs to become re-embedded in artificial intelligence, now backed by knowledge bases, which are themselves creatures of the semantic Web.
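
As a hedged illustration of that pairing, the sketch below flattens RDF descriptions into documents and posts them to Solr's standard JSON update endpoint; the Solr URL, core name and field naming scheme are hypothetical, and a schemaless or suitably configured core is assumed.

```python
# A hedged sketch of pairing RDF with Solr: flatten each RDF subject into a
# flat document and post it to Solr's JSON update handler. The Solr URL, core
# name ("knowledge"), and field naming scheme are hypothetical; a schemaless
# or suitably configured core is assumed.
import requests
from rdflib import Graph, Literal

g = Graph()
g.parse(data="""
@prefix ex:   <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
ex:lion_1 a ex:Mammal ; foaf:name "Lion" ; ex:habitat "savanna" .
""", format="turtle")

docs = {}
for s, p, o in g:
    doc = docs.setdefault(str(s), {"id": str(s)})
    field = p.n3(g.namespace_manager).replace(":", "_")      # e.g. foaf_name
    doc[field] = o.toPython() if isinstance(o, Literal) else str(o)

# Solr accepts a JSON array of documents on its standard /update endpoint.
requests.post("http://localhost:8983/solr/knowledge/update?commit=true",
              json=list(docs.values()), timeout=10)
```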

Design needs to move away from linked data or the semantic Web as the goals. The building blocks are there, though perhaps not yet combined or expressed well. The real improvements now to the overall knowledge function will result from knowledge bases, artificial intelligence, and the semantic Web working together. That is the next frontier.

Overall, we perhaps have been in the wrong war for the wrong reasons. Linked data is certainly not an end and mostly appears to represent work, rather than innovation. The semantic Web is no longer the right war, either, because improvements there will not come so much from arguing semantic languages and paradigms. Learning how to master distributed data integration will teach the semantic Web much, and coupling artificial intelligence with knowledge bases will do much to improve the most labor-intensive stumbling blocks in the knowledge management workflow: mappings and transformations. Further, these same bases will extend the reach into analytical and statistical realms.

The semantic Web has always been an infrastructure play to us. On that basis, it will be hard to ever judge market penetration or dominance. So, maybe in terms of a vision from 15 years ago the growth of the semantic Web has been disappointing. But, for Fred and me, we are finally seeing the landscape clearly and in perspective, even if from a viewpoint that may be different from others’. From our vantage point, we are at the exciting cusp of a new, broader synthesis.

NOTE: This is Part I of a two-part series. Part II will appear shortly.

[1] Tim Berners-Lee, James Hendler, and Ora Lassila, “The Semantic Web,” in Scientific American 284(5): pp 34-43, 2001. See http://www.scientificamerican.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21&catID=2.
[2] For those with a spare 90 minutes or so, you may also want to view this panel session and debate that took place on “An OWL 2 Far?” at ISWC ’08 in Karlsruhe, Germany, on October 28, 2008. The panel was chaired by Peter F. Patel-Schneider (Bell Labs, Alcatel-Lucent) with panel members Stefan Decker (DERI Galway), Michel Dumontier (Carleton University), Tim Finin (University of Maryland) and Ian Horrocks (University of Oxford), with much audience participation. See http://videolectures.net/iswc08_panel_schneider_owl/
[3] Open Biomedical Ontologies (OBO) is an effort to create controlled vocabularies for shared use across different biological and medical domains. As of 2006, OBO formed part of the resources of the U.S. National Center for Biomedical Ontology (NCBO). As of the date of this article, there were 376 ontologies listed on the NCBO’s BioOntology site. Both OBO and BioOntology provide tools and best practices.
[4] Fifth International Workshop on Consuming Linked Data (COLD 2014), co-located with the 13th International Semantic Web Conference (ISWC) in Riva del Garda, Italy, October 19-20.
[7] See the thread on the W3C semantic web mailing list beginning at http://lists.w3.org/Archives/Public/semantic-web/2014Jul/0129.html.
Posted: February 24, 2014

Smell the Money

To Combat a Decline in Mindshare, Follow What is Pragmatic

A secret of the semantic Web community is that energy, innovation and participation have slipped over, say, the past three or four years. This has been obvious for some time. I began collecting statistics on such things as prevalence in Google searches, attendance at SemTech or xSWC meetings, postings to user groups, blog postings, heck, even stupid and lengthy controversies on the mailing lists, or the sale and then sale and then sale of SemTech itself.

Fortunately, I realized that my observation of a decline did not depend on having documentary backup: the trend was obvious. So, I could stop collecting time-sucking statistics. I’m sure many of the participants in the formation of the semWeb know exactly of this decline in energy and focus of which I speak.

Other endeavors have kept me from worrying too much about such matters, but recent griping in public forums about the state of the semantic Web got me again thinking about premises and the state of semantic technologies. Such re-thinks are useful because they help put current circumstances into context, and because they help guide how to spot emerging opportunities.

While I am not feeling overwhelmingly passionate about such matters, there does appear to be a villain in this story, what I might term the FYN crowd [1]. But, like all good villains and stories, villainy is mostly a matter of context, with the winners being the ones writing the history. So, accept my thoughts as arising as much from my own worldview as from anything else . . . .

Galileo’s Balls

Once one embraces an intellectual domain with the premise of semantics, then meaning and context a priori become first citizens. Depending on viewpoint, what the semantic Web means to one individual can differ substantially from what it means to another. Moreover, the space becomes a sort of cipher for expressing any worldview, legitimately. For example, one tension at the heart of the semantic Web enterprise has been bottom up v top down; another has been anything goes v more structure and formalism. Hot buttons arise when worldviews differ, as they always surely do. The semantic Web is no exception.

Yet the stated bases for these semantic Web hot buttons, I would claim, are simplistic. What really occurs in the semantic technology space is something more akin to the Galileo thermometer, multiple viewpoints finding multiple resting points. Only in the semantic Web case, the natural resting points don’t just simply occur along a single dimension of, say, formalism, but other viewpoints as well. So, what we end up with is something more akin to a 3D- or multi-dimensional column. There are an infinite number of resting points in reasoned discourse.

Why should this be strange or threatening? Of course, upon inspection, it is not. The understanding that needs to arise is that semantics is truly about differences at all levels of human experience, perceptions and language. A pragmatic semantics must reflect this reality.

I don’t think that these sentiments will ever translate into precision or algorithms. But they can be modeled approximately with algorithms and refined with judgment. Much of their essence can also be captured by ontologies. These are viewpoints that can be captured in silico and used to help humans make better decisions. Semantics are essential to these prospects. At the heart of any pragmatic semantics must be an accommodation of viewpoints and terminology.

The real point in all of this — actually, also the major reason for semantic technologies in the first place — is that for any topic of normal human discourse there is a variety of viewpoints. Only a system expressly designed to respect these differences can be an effective digital means of interoperability.

Tribal Diversities

There are many tribes within the semantic technology space. Academic researchers are the most visible tribe. Because of funding nuances and general interest and tradition (though there are real differences between the US, Queen’s countries, EU or Asia), academics have — and sometimes continue to — set the tone for the semantic Web community. This has been useful to establish a coherent and (generally) logical basis to the underpinnings of the semantic Web. But most in the community would also acknowledge this basis is not sufficient to achieve commercial breakthrough.

In the US, there is a strange mix, with many semantic researchers flying below the radar, because they work for the three-letter intelligence agencies. Also, there is a very strong biomedical community, often funded from the National Library of Medicine. The biomedical community has been an exemplar innovator. Because of this community’s efforts, we now can see how an entire domain — biomedical — can develop and leverage ontologies, establish common vocabularies or standards, or cooperate on tools development. There is no public community more advanced in semantic technology developments than the biomedical one.

Another tribe in this space is the successful hunter, able to use semantic technology capabilities to attract and secure paying customers. Most of the activities of these tribe members are hidden from view, because their paying efforts are by nature infrastructural and concentrated on enterprise and commercial customers. But, also, many individuals within this tribe actively contribute to public efforts and conferences. Many of the more visible semantic technology companies, including my own, occupy this space.

But the most enriched tribe of the semantic Web has been the background semantic orchestrator, generally through infrastructure-based initiatives like broadscale knowledge representation, statistical analysis of massive text corpora, well-considered ontologies, or knowledge structures. The semantic efforts of the search engine vendors, including Bing and Google’s knowledge graph, are members of this tribe, as is Siri, now part of Apple.

These differences in market focus and visibility have tended to play out in expected ways. Academic researchers, Web enthusiasts and those committed to open data have been most vocal about “linked data”. They tend to be the more visible participants in semantic Web mailing lists and forums. Casual followers of the semantic technology space, or those new to it, mostly hear these same voices. By default, the apparent health and status of the semantic Web is more-or-less defined by these voices.

When I said in the intro that the semantic Web has slipped over the past few years, that perception is mostly the result of the lowered volume and fewer messages coming from the vocal tribe. But there are two problems with the accuracy of that perception. The first, as argued above, is that the vocal and visible linked data advocates are not the only representatives of the community. And, the second, which I’ll get to in a moment, is that the vocal community’s prescriptions for the semantic Web, in my opinion, are no longer the most meaningful ones.

Branding, Terminology and Marketing Messages

Pig Snouts

Many early proponents of the semantic Web, I think it fair to observe, would say that two positioning mistakes (from their perspective) have kept the paradigm from grabbing greater hold. The first reason often cited is the use of XML as the initial syntax of RDF. At first blush, I agree with this observation, given that when I was first entering into the dark chambers of the semantic Web it was at times difficult to separate XML from RDF. Today, though, most semWeb practitioners prefer the use of alternative serializations. I personally don’t think that any difficulties that semantic Web understanding and adoption may pose today are any longer influenced by a decade-old XML confusion. In Web years, these are eons.

The second reason seems to have been the flat-out retreat from “semantic Web” terminology. The conscious decision to switch to the “linked data” branding began in earnest about 2008. I find this shift interesting. I think it relates to looking to the wrong measures of success. What seemed like a clever re-branding at that time has both set the focus in the wrong direction and consequently set the wrong targets for measuring success.

In the areas of standards and movements, moral authority, suasion and prominence often become the bases for who is viewed as “owning” a new concept. There has been much of this posturing around the “semantic Web” and “linked data”, with parry thrusts from “Web 3.0”, “big data” and “open this or that”. So, I’m not surprised that branding many of the concepts of the semantic Web with a new term — “linked data” — was pushed and took hold. But why original semantic Web advocates adopted this term and its shift in focus from an ecosystem to data representation and exchange does surprise me.

The strange thing, in my opinion, is the monadic emphasis on “linked data” that acts to partially kill the semantic Web mindset. Whether by design or fallout, “linked data” inexorably shifts the focus to how data is represented and transmitted. It is a royal pain in the ass for publishers to publish “linked data” and then, when done, there is surprisingly little consumption of it. The MusicBrainz announcement last week that it was dropping RDFa is telling [2]. We are seeing the representation of structured Web data being driven on other bases, as evidenced by the success of JSON, something that linked data enthusiasts have only lately come to embrace, and the schema.org initiative of the major search engines.

Once linked data was raised as the lead banner, other branding messages followed. The first add-on message was “follow-your-nose”. FYN represents clicking from link to link, following data references of interest on the Web [1]. In order for that to be facilitated, but also as a means to clear up some confusions about linked data, the quality standard of “5-star linked data” was also put forth. To achieve all five stars, linked data should conform to open standards such as RDF and link to other data for context [3].
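
Mechanically, FYN amounts to dereferencing each URI encountered and repeating the process. Here is a hedged sketch, assuming (as with DBpedia) that the URIs content-negotiate to an RDF serialization.

```python
# A hedged sketch of follow-your-nose: dereference a URI, collect the URIs it
# links to, and repeat. Assumes each URI content-negotiates to an RDF
# serialization (as DBpedia does); depth is kept tiny to stay polite.
from rdflib import Graph, URIRef

def follow_your_nose(uri, depth=1):
    g = Graph()
    g.parse(uri)                                 # HTTP GET with RDF Accept headers
    neighbors = {o for o in g.objects() if isinstance(o, URIRef)}
    print(uri, "->", len(neighbors), "linked URIs")
    if depth > 0:
        for n in list(neighbors)[:3]:            # follow a few links onward
            try:
                follow_your_nose(str(n), depth - 1)
            except Exception:
                pass                             # many links will not dereference cleanly

follow_your_nose("http://dbpedia.org/resource/Berlin")
```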

Today, on virtually all “official” semantic Web forums you will see mention of the brands of linked data, FYN, 5-star linked data, and open data. Publishing of data according to best practices that enables global links from datum to datum across the Giant Global Graph has become the sort of gold standard associated with this new branding.

What is the Measure of Success?

Success is always measured against our premises and values. In the case of the vocal tribe, the premises and values relate to linked and open data. By these measures, the semantic Web is a mixed bag. On the positive front, many laudable sources of quality data — most recently the Getty Museum [4], but also the Library of Congress, arts and humanities publishers across Europe, many science realms beyond biology, and of course hundreds of others made famous by the LOD cloud [5] — are published as linked data, or are in the process of being so. Open data sets are coming from government at all levels [6].

On the negative front, the growth of published linked data has fallen behind the pace of publishing structured data in general, and notable evidence for where the consumption of linked data has made a difference is pretty hard to find. Linked data advocates only rarely discuss integration with “closed”, proprietary data or enterprise use, integration and realities. Shitty sameAs assertions abound everywhere. Markets find it hard to get excited when the arguments and reference frameworks don’t relate well to their actual problems and pain points. DBpedia can only go so far, and a mountain of links to it without relevance, context or quality is just so much more noise [7].

The point here is not to mount a screed against linked data, but to caution: Be careful how you brand yourself. By the measures of growth and penetration and uptake of linked data, moreover linked open data, the semantic Web space is generally not attracting developer interest, media attention or venture dollars. I hope the release of meaningful linked data continues, but setting that goal as the measure of the semantic Web’s success is selling the wrong product.

Rather than setting a FYN objective as the measure of whether our semantic technology efforts to date have been a success, I suggest we adopt a “follow the money” (FT$) premise. Who is investing or making money off of this stuff, and how and why? Herein lies a different measure of success.

If we look to the approaches taken by those making money in this market, we find that the:

  • Challenges of meaningful connections
  • Interoperability
  • Integration across document and structured data
  • Discovering new patterns and relationships
  • Facilitating semantic understanding across disparate communities and legacy data sources, and
  • Providing quality characteristics for new entities,

are where the bucks are being made. These activities are all at the heart of the knowledge worker’s job responsibilities. Even the earliest advocates of the semantic Web must have had aspirations that the semantic Web had the promise to address these meaningful challenges.

Another secret to systems like Freebase, Google knowledge graph, Bing, Watson, Siri, or similar innovations is their use and reliance on Wikipedia, at least in their formative stages. Though often DBpedia was the structural form of ingest, the core basis of these systems’ capabilities comes from content — Wikipedia — the access to which was only made easier via DBpedia.

The sentiment to follow the money is not a sell out or a political statement. It is a recognition that work worth doing is work others appreciate and are willing to pay for. It is the best signal amidst the noise of what is valuable to work on.

It’s Time for the Side B Hit

I’ve been a fairly active participant in the semantic Web for nearly 10 years. I sometimes have the image of an aspiring music artist from the ’50s or ’60s arguing with the record execs over which song should be the favored Side A cut on the 45. The visible voices of the semantic Web want to push FYN and linked data as Side A, but it really isn’t selling, according to the advocates’ own success measures.

The Side B of interoperability, RDF and OWL is not just “filler” for the main promotion, but where I think the hit clearly resides. Some have heard that track, bought it, and are enthused about it. It would be nice if the record execs could see what is right before their faces and begin promoting it as well.

FYN and its vocal proponents risk creating a perception of failure for the whole semantic Web enterprise by the simple fact of putting linked data front and center. Sure, it is a good approach with potentially rich information, so long as you can trust the source both for the content itself and for the quality of its RDF expression. No one is arguing with that.

But one could argue in a similar vein that SGML and ASN.1, among dozens of other examples, were great and useful notations, yet are now mostly historical footnotes. If a trusted source is going to serve me up 5-star linked data, I will take it. Yet the truth is I would take structured data in any form from a trusted source, but take no linked data from an unknown source or one with poor linkages. We spend much time looking at these issues for our clients, and it is the rare linked data set that becomes part of our solution. Even then, we carefully scrutinize all assumed connections.

The Side B semantic Web of vetted and interlinked, interoperable data organized by competent graphs is the winning side. It is the only location where true economic transactions are taking place around the semantic Web. To understand where the semantic Web makes sense, follow the Side B money to your answers.

The insight gained from a FT$ approach clearly points to the failure of FYN. I say, do linked data if you can; it is the best ingest format around. But don’t get too hung up on that. Spend your time figuring out how to bridge meaningful gaps in semantics or data across any enterprise, global or local. Information is not truffles, and following your nose is not the primary argument for the semantic Web.

[1] FYN, or Follow Your Nose, is the general practice of performing web retrieval on URIs in a knowledge base to obtain more knowledge. Two W3C articles provide additional commentary. In the linked data context, FYN represents clicking from link to link, following data references of interest; it is a specific pattern of linked data use. Ed Summers provided one of the better overviews of the use of FYN in the context of linked data and the Web of Data.
[2] See the MusicBrainz blog from February 18, 2014.
[3] Tim Berners-Lee describes 5-star linked open data in this article.
[4] The Getty Museum recently made a portion of its Arts and Architecture Thesaurus (AAT) open source using linked data; see http://blogs.getty.edu/iris/art-architecture-thesaurus-now-available-as-linked-open-data/.
[5] The linked open data (LOD) cloud diagram and supporting information is maintained at http://lod-cloud.net/.
[6] I have often written on the problems with linked and open data as presently practiced. See Practical P-P-P-Problems with Linked Data (October 4, 2010) and The Nature of Connectedness on the Web (November 22, 2010) as two examples. Specific commentary on open data in government is provided in When Linked Data Rules Fail (November 16, 2009).
[7] For another assessment of the state of the semantic Web, see Brian Sletten’s recent Keep On Keeping On article on semanticweb.com (January 13, 2014).
Posted:December 16, 2013

Complementary Efforts of the W3C Mirror the Trend

Two and one-half years ago the triumvirate of Google, Bing and Yahoo! — soon to be joined by Yandex, the major Russian search engine — released schema.org. The purpose of schema.org is to give Web site owners and authors a simple means to tag their sites with a vocabulary, designed to be understandable by search engines, for describing common things and activities on the Web. Though informed and led by innovators with impeccable backgrounds in the early semantic Web and knowledge representation [1], the founders of schema.org also understood that the Web is a messy place with often wrong syntax and usage. Their stated commitment to simplicity and practicality caused me to state on the day of release that schema.org was “perhaps the most important event for the structured Web since RDF was released a dozen years ago.”

Just a week ago schema.org version 1.0e was released. That event, plus much else in recent months, suggests a real maturation and uptake of schema.org. It looks like the promise of schema.org is being fulfilled.

Growth and Impact of the Schema

When first released, schema.org provided nearly 300 structured record types that can be used to tag information in Web pages. Via various collaborative processes since, supported by an active discussion group, the schema.org vocabulary has roughly doubled in size. Key areas of expansion have included describing various actions, basic medical terms, products and transactions (via linkages to GoodRelations), civic services and, most recently, accessibility. Many other additions are in progress.

In his keynote address at ISWC 2013 in Sydney on October 23, Ramanathan Guha [1] reported that 15 percent of crawled pages and 5 million sites have some schema.org markup. We can also see that some of the most widely used content management systems on the Web, notably including WordPress, Joomla and Drupal, have or plan to have native schema.org support. These tooling trends are important because, though the vocabulary was designed for simple manual markup, it does require a bit of attention and skill to get schema.org markup right. Having markup added to pages automatically in the background is the next threshold for even broader adoption.
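To give a flavor of what “getting the markup right” involves, here is a sketch of the kind of schema.org description a CMS plugin might generate automatically for a blog post. The field values are hypothetical, and JSON-LD is used for brevity (schema.org markup may also be expressed as microdata or RDFa); the snippet relies only on Python’s standard json module.

```python
# A minimal sketch of emitting schema.org markup as JSON-LD from page metadata.
# All field values here are hypothetical; a CMS would substitute its own.
import json

page = {
    "title": "The Maturation of Schema.org",
    "author": "Mike Bergman",
    "published": "2013-12-16",
    "url": "http://www.mkbergman.com/1696/the-maturation-of-schema-org/",
}

markup = {
    "@context": "http://schema.org",
    "@type": "BlogPosting",
    "headline": page["title"],
    "author": {"@type": "Person", "name": page["author"]},
    "datePublished": page["published"],
    "url": page["url"],
}

# The CMS would inject this script block into the rendered HTML page.
script_block = (
    '<script type="application/ld+json">\n'
    + json.dumps(markup, indent=2)
    + "\n</script>"
)
print(script_block)
```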

The ability of the schema.org vocabulary to capture essential domain facts as structured data is reflected in the growing list of prominent sites tagging with schema.org. According to Guha, these are some of the prominent sites now using schema.org:

Category Prominent Sites
News Nytimes, guardian.com, bbc.co.uk
Movies imdb, rottentomatoes, movies.com
Jobs / careers careerjet.com, monster.com, indeed.com
People linkedin.com
Products ebay.com, alibaba.com, sears.com, cafepress.com, sulit.com, fotolia.com
Videos youtube, dailymotion, frequency.com, vinebox.com
Medical cvs.com, drugs.com
Local yelp.com, allmenus.com, urbanspoon.com
Events wherevent.com, meetup.com, zillow.com, eventful
Music last.fm, myspace.com, soundcloud.com
Key Applications pinterest.com, opentable.com

Examples like Pinterest show how schema.org can also provide a central organizing point for new ventures and applications. There are also key relationships between schema.org and new search initiatives such as Google Now or Google’s Knowledge Graph.

From day one schema.org was released with a mechanism for other parties to extend its vocabulary. However, more recently, there has been a significant increase of attention on questions of interoperability and relation to other existing vocabularies. To wit:

  • Prominent knowledge representation experts, such as Peter Patel-Schneider, have become active in suggesting better interoperability and design considerations
  • The root of schema.org is now recognized as owl:Thing
  • Much discussion has occurred on integration or interoperability or not with SKOS, the simple knowledge organizational vocabulary
  • Provisions have been added to capture concepts such as domain and range
  • Calls have been made to increase the number of examples and documentation, including enforcing consistency across the vocabulary.

To be clear, it was never the intent for schema.org to become a single, governing vocabulary for the Web. Nonetheless, these broader means to enable others to tie in effectively with it are an indicator that schema.org’s sponsors are serious about finding effective common ground.
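As a concrete, if hypothetical, illustration of such a tie-in, an external vocabulary can position its own classes and properties underneath existing schema.org terms so that schema.org-aware consumers can still make some sense of the data. A minimal sketch with rdflib (the example.org vocabulary is invented, and this is only one of several possible tie-in mechanisms under discussion):

```python
# A hedged sketch (not an official schema.org mechanism) of tying an external
# vocabulary to schema.org via subclass / subproperty links. Assumes rdflib 6+.
from rdflib import Graph, Namespace, RDF, RDFS, OWL

SCHEMA = Namespace("http://schema.org/")
EX = Namespace("http://example.org/vocab/")   # hypothetical external vocabulary

g = Graph()
g.bind("schema", SCHEMA)
g.bind("ex", EX)

# A domain-specific class positioned under an existing schema.org type
g.add((EX.FieldGuide, RDF.type, OWL.Class))
g.add((EX.FieldGuide, RDFS.subClassOf, SCHEMA.Book))

# A domain-specific property narrowing an existing schema.org property
g.add((EX.speciesCovered, RDF.type, RDF.Property))
g.add((EX.speciesCovered, RDFS.subPropertyOf, SCHEMA.about))

print(g.serialize(format="turtle"))
```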

Aside from certain areas such as recipes or claiming site or blog ownership, it has been unclear how the search engines are actually using schema.org markup, if at all. The sponsors have oft stated a go-slow attitude to see whether the marketplace indeed embraces the vocabulary. I’m also sure that the sponsors, as familiar as they are with spam and erroneous markup, have wanted to put in place effective ingest procedures that do not reduce the quality of their search indexes.

Getting Dan Brickley, one of the better-known individuals in RDF and the semantic Web, to act as schema.org’s liaison to the broader community, and beginning to open up about actual usage and uptake of schema.org are great signs of the sponsors’ commitment to the vocabulary. We should expect to see a much quickened pace and more visibility for schema.org within the search services themselves within the coming months.

W3C’s Complementary Efforts

Meanwhile, back at the ranch, a number of other interesting efforts are occurring within the World Wide Web Consortium (W3C) that are complementary to these trends. As readers of this blog well know, I have argued for some time that RDF makes a fantastic data model for interoperating disparate content, one that our company Structured Dynamics centrally relies upon, but that RDF is not essential for metadata specification or exchange. Understood serializations based on understood vocabularies — in other words, exactly the design of schema.org — should be sufficient to describe the various types of things and their attributes found on the Web. This idea of structured data in a variety of forms puts control into the hands of content authors. Various markets will determine what makes best sense for them as to how they actually express that structured data.
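A small sketch of that point: the same schema.org-style description can arrive as JSON-LD embedded in a Web page yet still resolve to ordinary RDF triples on ingest. The example assumes rdflib 6 or later, which bundles a JSON-LD parser; the book and author are invented.

```python
# The same structured data can arrive as JSON-LD (schema.org-style markup) yet
# still resolve to the RDF data model. Assumes rdflib 6+ (built-in JSON-LD).
from rdflib import Graph

jsonld_doc = """
{
  "@context": {"@vocab": "http://schema.org/"},
  "@type": "Book",
  "name": "A Hypothetical Field Guide",
  "author": {"@type": "Person", "name": "Jane Doe"}
}
"""

g = Graph()
g.parse(data=jsonld_doc, format="json-ld")

# The markup is now plain RDF triples, interoperable with any other RDF source
for s, p, o in g:
    print(s, p, o)
```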

Last week the W3C announced the retirement of its Semantic Web group, subsuming it instead into the new W3C Data Activity. The W3C also announced a new working group on CSV (comma-separated values) data exchange to go along with recent efforts on JSON-LD (linked data).

These are great trends that reflect a bias toward adoption. Along with the advances taking place with schema.org, the Web now appears to be entering a golden age of structured data.


[1] For example, a Google Fellow instrumental in founding schema.org is Ramanathan V. Guha, whose background extends back to Cyc and continues through Apple and Netscape and what came to be RDF. Guha was also the lead executive behind Google’s Knowledge Graph, which has some key relations with schema.org.

Posted:May 15, 2013

Thinking About the Interstices of the Journey

It actually is a dimmer memory than I would like: the decision to begin a blog eight years ago, nearly to the day [1]. Since then, every month, and more often many more times per month, I have posted the results of my investigations or ramblings, mostly like clockwork. But, in a creeping realization, I see that I have not posted any new thoughts on this venue for more than two months! Until that hiatus, I had been biologically consistent.

Maybe, as some say, and I don’t disagree, the high-water mark of the blog has passed. Certainly blog activity has dropped dramatically. “Snippet communications” now appear dominant in terms of message volume and bandwidth. I don’t loathe it, nor fear it, but I find a world dominated by 140 characters and instant babbling mostly a bore.

From a data mining perspective — similar to peristalsis or the wave in a sports stadium — there is worth in the “crowd” coherence/incoherence and spontaneity. We can see the waves, but most are transient. I actually think that broad scrutiny helps promote separating the wheat from chaff. We can expose free radicals to the sanitizing effect of sunlight. Yet these are waves, only very rarely trends, and most generally not truths. That truth stuff is some really slippery stuff.

But, that is really not what is happening for me. (Though I really live to chaw on truth.) Mostly, I just had nothing interesting to say, so there was no reason to blog about it. And, now, as I look at why I broke my disciplined approach to blogging and why it has gone on hiatus, I am still left scratching my head a bit as to why my pontifications stalled.

Two obvious reasons are that our business is doing gangbusters, and it is hard to sneak away from company good-fortune. Another is that my family and children have been joyously demanding.

Yet all of that deflects from the more relevant explanation. The real reason, I think, that I have not been writing more recently relates to the circumstances of semantic technologies. Yes, progress is being made, and some instances are notable, but the general “semantic web” or “linked data” enterprise is stalled. The narrative for these things — let alone their expression and relevance — needs to change substantially.

I feel we are in the midst of this intense interstice, but the framing perspective for the next discussions has yet to emerge.

The strange thing about that statement is that the issue is not the basis in semantic technologies, which are now understood and powerful, but the incorporation of their advantages into enterprise practices and environments. In this sense, semantic technologies are now growing up. Their logic and role are clear and explainable, but how they fit into corporate practice with acceptable maintenance costs is still being discovered.