Posted: April 28, 2010

The Starry Night, by Vincent van Gogh
An Acceptance of Its Natural Role is the Prozac Substitute

There has been a bit of a manic-depressive character to the Web waves of late with respect to linked data. On the one hand, we have seen huzzahs and celebrations from the likes of ReadWriteWeb and SemanticWeb.com and, just concluded, the Linked Data on the Web (LDOW) workshop at WWW2010. This treatment has tended to tout the coming of the linked data era and to seek ideas about possible, cool linked data apps [1]. This rise in visibility has been accompanied by much manic and excited discussion on various mailing lists.

On the other hand, we have seen much wringing of hands and gnashing of teeth over why linked data is not being used more and why the broader issue of the semantic Web is not seeing more uptake. This depressive “call to arms” has sometimes felt like ravings, with blame variously assigned to the poor state of apps and user interfaces, to badly linked data, and to the difficulty of publishing it. Actually using linked data for anything productive (other than single sources like DBpedia) still appears to be an issue.

Meanwhile, among others, Kingsley Idehen, ubiquitous voice on the Twitter #linkeddata channel, has been promoting the separation of identity of linked data from the notion of the semantic Web. He is also trying to change the narrative away from the association of linked data with RDF, instead advocating “Data 3.0” and the entity-attribute-value (EAV) model understanding of structured data.

As someone less engaged in these topics since my own statements about linked data over the past couple of years [2], I have my own distanced-yet-still-biased view of what all of this crisis of confidence is about. I think I have a diagnosis for what may be causing this bipolar disorder of linked data [3].

The Semantic Web Boogie Man

A fairly universal response from enterprise prospects when raising the topic of the semantic Web is, “That was a big deal of about a decade ago, wasn’t it? It didn’t seem to go anywhere.” And, actually, I think both proponents and keen observers agree with this general sentiment. We have seen the original advocate, Tim Berners-Lee, float the Giant Global Graph balloon, and now Linked Data. Others have touted Web 3.0 or Web of Data or, frankly, dozens of alternatives. Linked data, which began as a set of techniques for publishing RDF, has emerged as a potential marketing hook and saviour for the tainted original semantic Web term.

And therein, I think, lies the rub and the answer to the bipolar disorder.

If one looks at the original principles for putting linked data on the Web or subsequent interpretations, it is clear that linked data (lower case) is merely a set of techniques. Useful techniques, for sure; but really a simple approach to exposing data using the Web with URLs as the naming convention for objects and their relationships. These techniques provide (1) methods to access data on the Web and (2) a way of specifying the relationships that link the data (resources). The first part is mechanistic and not really of further concern here. And, while any predicate can be used to specify a data (resource) relationship, that relationship should also be discoverable with a URL (dereferenceable) to qualify as linked data. Then, to actually be semantically useful, that relationship (predicate) should also have a precise definition and be part of a coherent schema. (Note, this last sentence is actually not part of the “standard” principles for linked data, which itself is a problem.)
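
To make those mechanics concrete, here is a minimal sketch in Python using the rdflib library (my own choice of tooling, not anything prescribed by the linked data principles; the example URIs and the FOAF vocabulary terms are merely illustrative). It publishes a small resource description in which both the things and the relationship predicates are named with dereferenceable HTTP URIs:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

# Illustrative URIs only: any HTTP URI that returns useful RDF when
# dereferenced would satisfy the linked data publishing techniques.
FOAF = Namespace("http://xmlns.com/foaf/0.1/")
person = URIRef("http://example.org/id/jane-doe")          # the resource being described
city = URIRef("http://dbpedia.org/resource/Iowa_City")     # a resource in another dataset

g = Graph()
g.add((person, RDF.type, FOAF.Person))               # typed against a published schema
g.add((person, FOAF.name, Literal("Jane Doe")))      # a literal attribute
g.add((person, FOAF.based_near, city))               # an outbound link to another dataset

# The predicate foaf:based_near is itself a dereferenceable URI, so a consumer
# can look it up to discover its definition and the schema it belongs to.
print(g.serialize(format="turtle"))
```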

When used right, these techniques can be powerful and useful. But, poor choices or execution in how relationships are specified often leads to saying little or nothing about semantics. Most linked data uses a woefully small vocabulary of data relationships, with an even smaller set ever used for setting linkages across existing linked data sets [4]. Linked data techniques are a part of the foundation to overall best practices, but not the total foundation. As I have argued for some time, linked data alone does not speak to issues of context nor coherence.

To speak semantically, linked data is not a synonym for the semantic Web nor is it the sameAs the semantic Web. But, many proponents have tried to characterize it as such. The general tenor is to blow the horns hard anytime some large data set is “exposed” as linked data. (No matter whether the data is incoherent, lacks a schema, or is even poorly described and defined.) Heralding such events, followed by no apparent usefulness to the data, causes confusion to reign supreme and disappointment to naturally occur.

The semantic Web (or semantic enterprise or semantic government or similar expressions) is a vision and an ideal. It is also a fairly complete one that potentially embraces machines and agents working in the background to serve us and make us more productive. There is an entire stack of languages and techniques and methods that enable schema to be described and non-conforming data to be interoperated. Now, of course this ideal is still a work in progress. Does that make it a failure?

Well, maybe so, if one sees the semantic Web as marketing or branding. But, who said we had to present it or understand it as such?

The issue is not one of marketing and branding, but the lack of benefits. Now, maybe I have it all wrong, but it seems to me that the argument needs to start with what “linked data” and the “semantic Web” can do for me. What I actually call it is secondary. Rejecting the branding of the semantic Web for linked data or Web 3.0 or any other such label is still dressing the emperor in new clothes.

A Nicely Progressing Continuum, Thank You!

For a couple of years now I have tried in various posts to present linked data in a broader framework of structured and semantic Web data. I first tried to capture this continuum in a diagram from July 2007:

Transition in Web Structure (recap of the diagram):

  • Document Web (circa 1993): document-centric; document resources; unstructured and semi-structured data; HTML; URL-centric
  • Structured Web (circa 2003): data-centric; structured data; semi-structured and structured data; XML, JSON, RDF, etc.; URI-centric
  • Linked Data (circa 2006): data-centric; linked data; semi-structured and structured data; RDF, RDF-S; URI-centric
  • Semantic Web (circa ???): data-centric; linked data; semi-structured and structured data; RDF, RDF-S, OWL; URI-centric

Now, three years later, I think the transitional phase of linked data is reaching an end. OK, we have figured out one useful way to publish large datasets staged for possible interoperability. Sure, we have billions of triples and assertions floating out there. But what are we to do with them? And, is any of it any good?

The Reality of a Heterogeneous World

I think Kingsley is right in one sense to point to EAV and structured data. We, too, have not met a structured data format we did not like. There are hundreds of attribute-value pair models, of an even more generic nature, that also belong in the conversation.

One of my most popular posts on this blog has been, ‘Structs’: Naïve Data Formats and the ABox, from January 2009. Today, we have a multitude of popular structured data formats from XML to JSON and even spreadsheets (CSV). Each form has its advocates, place and reasons for existence and popularity (or not). This inherent diversity is a fact and fixture of any discussion of data. It is a major reason why we developed the irON (instance record and object notation) non-RDF vocabulary to provide a bridge from such forms to RDF, which is accessible on the Web via URIs. irON clearly shows that entities can be usefully described and consumed in either RDF or non-RDF serialized forms.
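
As a rough illustration of that bridging idea (emphatically not the irON specification itself, just the general pattern), here is a short Python sketch using rdflib that maps a hypothetical JSON attribute-value record into RDF triples. The field names, base URI and vocabulary namespace are all invented for the example:

```python
import json

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

# A hypothetical instance record in plain attribute-value (JSON) form.
record = json.loads(
    '{"id": "rec-001", "type": "Book", '
    '"title": "The Gifts of Athena", "pubYear": 2002}')

EX = Namespace("http://example.org/vocab/")    # assumed vocabulary namespace
BASE = "http://example.org/id/"                # assumed base for instance URIs

g = Graph()
subject = URIRef(BASE + record["id"])
g.add((subject, RDF.type, EX[record["type"]]))
for key, value in record.items():
    if key in ("id", "type"):
        continue
    g.add((subject, EX[key], Literal(value)))  # each attribute becomes a predicate

print(g.serialize(format="turtle"))
```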

Though RDF and linked data are a great form for expressing this structured information, other forms can convey the same meaning as well. Of the billions of linked data triples exposed to date, surely more than 99% are of this instance-level, “ABox” type of data [5]. And, more telling, of all of the structured data that is publicly obtainable on the Web, my wild guess is that less than 0.0000000001% of that is even linked RDF data [6].

Neither linked data nor RDF alone will — today or in the near future — play a pivotal or essential role for instance data. The real contribution from RDF and the semantic Web will come from connecting things together, from interoperation and federation and conjoining. This is the province of the TBox and is a role barely touched by linked data. Publishing data as linked data helps tremendously in simplifying ingest and guiding the eventual connections, but making those connections, and testing their quality and reliability, are steps beyond linked data’s ken or purpose.
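
A minimal sketch of that ABox/TBox split, again in Python with rdflib and entirely made-up URIs: the TBox carries the class and property definitions that make interoperation possible, while the ABox carries the instance assertions that account for the bulk of linked data published today.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/ontology/")   # invented namespace

# TBox: terminological knowledge -- the schema and conceptual relationships
tbox = Graph()
tbox.add((EX.City, RDF.type, OWL.Class))
tbox.add((EX.Place, RDF.type, OWL.Class))
tbox.add((EX.City, RDFS.subClassOf, EX.Place))
tbox.add((EX.locatedIn, RDF.type, OWL.ObjectProperty))

# ABox: assertional knowledge -- the instance data most linked data publishes
abox = Graph()
city = URIRef("http://example.org/id/iowa-city")
state = URIRef("http://example.org/id/iowa")
abox.add((city, RDF.type, EX.City))
abox.add((city, EX.locatedIn, state))
abox.add((city, RDFS.label, Literal("Iowa City")))

# Interoperation -- the TBox role -- happens when the two are combined
combined = tbox + abox
print(len(combined), "triples in the combined graph")
```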

Promoting Linked Data to its Level of Incompetence

It seems, then, that we see two different forces and perspectives at work, each contributing in its own way to today’s bipolar nature of linked data.

On the manic side, we see the celebration for the release of each large, linked data set. This perspective seems to care most about volumes and numbers, with less interest in how and whether the data is of quality or useful. This perspective seems to believe “post the data, and the public will come.” This same perspective is also quite parochial with respect to the unsuitability of non-linked data, be it microdata, microformats or any of the older junk.

On the depressed side, linked data has been seen as a more palatable packaging for the disappointments and perceived failures or slow adoption of the earlier semantic Web phrasing. When this perspective sees the lack of structure, defensible connections and other quality problems with linked data as it presently exists, despair and frustration ensue.

But both of these perspectives very much miss the mark. Linked data will never become the universal technique for publishing structured data, and should not be expected to be such. Numbers are never a substitute for quality. And linked data lacks the standards, scope and investment made in the semantic Web to date. Be patient; don’t despair; structured data and the growth of semantics and useful metadata are proceeding just fine.

Unrealistic expectations or wrong roles and metrics simply confuse the public. We are fortunate that most potential buyers do not frequent the community’s various mailing lists. Reduced expectations and an understanding of linked data’s natural role is perhaps the best way to bring back balance.

Linked Data’s Natural Role

We have consciously moved our communications focus from speaking internally to the community to reaching out to the broader enterprise public. There is much education, clarification and dialog now needed with the buying public. The time has moved past software demos and toys to workable, pragmatic platforms, and the methodologies and documentation necessary to support them. This particular missive speaking to the founding community is likely to become even more rare (perhaps many will shout Hurray!) as we continue to focus outward.

As Structured Dynamics has stated many times, we are committed to linked data, presenting our information as such, and providing better tools for producing and consuming it. We have made it one of the seven foundations to our technology stack and methodology.

But, linked data on its own is inadequate as an interoperability standard. Many practitioners don’t publish it right, characterize it right, or link to it right. That does not negate its benefits, but it does make it a poor candidate to install on the semantic Web throne.

Linked data based on RDF is perhaps the first citizen amongst all structured data citizens. It is an expressive and readily consumed means for publishing and relating structured instance data and one that can be easily interoperated. It is a natural citizen of the Web.

If we can accept and communicate linked data for these strengths, for what it naturally is — a useful set of techniques and best practices for enabling data that can be easily consumed — we can rest easy at night and not go crazy. Otherwise, bring on the Prozac.


[1] Actually, in my opinion, the suggested listing of apps from these discussions is distinctly unimpressive and not compelling. As argued in the main body of the post, I think this is because linked data is really just a technique or best practice, and not a basis alone for enabling compelling apps. As initial developers of such apps as the UMBEL concept explorer or Dataviewer, Structured Dynamics understands the use of linked data and has a defensible basis to comment on applications. Our own applications intimately integrate linked data, but only as one of seven foundations.
[2] Here are some of my relevant posts over the past year discussing the role of linked data: Moving Beyond Linked Data (Sept. 20, 2009); Fresh Perspectives on the Semantic Enterprise (Sept. 28, 2009); The Law of Linked Data (Oct. 11, 2009); When Linked Data Rules Fail (Nov. 16, 2009).

[3] The current bipolar discussion reminds me of the “Six Phases of a Project,” a copy of which has been a permanent fixture on my office wall:

  1. Enthusiasm
  2. Disillusionment
  3. Panic
  4. Search for the guilty
  5. Punishment of the innocent
  6. Honors & praise for the non-participants.
[4] See, for example: Harry Halpin, 2009. “A Query-Driven Characterization of Linked Data,” paper presented at the Linked Data on the Web (LDOW) 2009 Workshop, April 20, 2009, Madrid, Spain, see http://events.linkeddata.org/ldow2009/papers/ldow2009_paper16.pdf; Prateek Jain, Pascal Hitzler, Peter Z. Yeh, Kunal Verma and Amit P. Sheth, 2010. “Linked Data is Merely More Data,” in Dan Brickley, Vinay K. Chaudhri, Harry Halpin, and Deborah McGuinness, Linked Data Meets Artificial Intelligence, Technical Report SS-10-07, AAAI Press, Menlo Park, California, 2010, pp. 82-86, see http://knoesis.wright.edu/library/publications/linkedai2010_submission_13.pdf; among others.

[5] Structured Dynamics’ best practices approach makes explicit splits between the “ABox” (for instance data) and “TBox” (for ontology schema) in accordance with our working definition for description logics, a fundamental underpinning for how we use RDF:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[6] This topic is deserving of some analysis in its own right, and my guess is really just that. For example, RSS feeds to mobile devices alone perhaps account for 2,000 petabytes today; see http://www.tgdaily.com/hardware-features/49167-8000-petabytes-of-mobile-data-traffic-expected-by-2014.

Posted: April 23, 2010

A Reprise AI3 Post from Four Years Ago

Friday Brown Bag Lunch
In 2002 Joel Mokyr, an economic historian from Northwestern University, wrote a book that should be read by anyone interested in knowledge and its role in economic growth. The Gifts of Athena: Historical Origins of the Knowledge Economy is a sweeping and comprehensive account of the period from 1760 (in what Mokyr calls the “Industrial Enlightenment”) through the Industrial Revolution beginning roughly in 1820 and then continuing through the end of the 19th century.

The book (and related expansions by Mokyr available as separate PDFs on the Internet) should be considered as the definitive reference on this topic to date. The book contains 40 pages of references to all of the leading papers and writers on diverse technologies from mining to manufacturing to health and the household. The scope of subject coverage, granted mostly focused on western Europe and America, is truly impressive.

Mokyr deals with ‘useful knowledge,’ as he acknowledges Simon Kuznets‘ phrase. Mokyr argues that the growth of recent centuries was driven by the accumulation of knowledge and the declining costs of access to it. Mokyr helps to break past logjams that have attempted to link single factors such as the growth in science or the growth in certain technologies (such as the steam engine or electricity) as the key drivers of the massive increases in economic growth that coincided with the era now known as the Industrial Revolution.

Mokyr cracks some of these prior impasses by picking up on ideas first articulated through Michael Polanyi‘s “tacit knowing” (among other recent philosophers interested in the nature and definition of knowledge). Mokyr’s own schema posits propositional knowledge, which he defines as the science, beliefs or the epistemic base of knowledge, which he labels omega (Ω), in combination with prescriptive knowledge, which are the techniques (“recipes”), and which he also labels lambda (λ). Mokyr notes that an addition to omega (Ω) is a discovery, an addition to lambda (λ) is an invention.

One of Mokyr’s key points is that both knowledge types reinforce one another and, of course, the Industrial Revolution was a period of unprecedented growth in such knowledge. Another key point, easily overlooked when “discoveries” are seemingly more noteworthy, is that techniques and practical applications of knowledge can provide a multiplier effect and are equivalently important. For example, in addition to his main case studies of the factory, health and the household, he says:

The inventions of writing, paper, and printing not only greatly reduced access costs but also materially affected human cognition, including the way people thought about their environment.

Mokyr also correctly notes how the accumulation of knowledge in science and the epistemic base promotes productivity and still-more efficient discovery mechanisms:

The range of experimentation possibilities that needs to be searched over is far larger if the searcher knows nothing about the natural principles at work. To paraphrase Pasteur’s famous aphorism once more, fortune may sometimes favor unprepared minds, but only for a short while. It is in this respect that the width of the epistemic base makes the big difference.

In my own opinion, I think Mokyr starts to get closer to the mark when he discusses knowledge “storage”, access costs and multiplier effects from basic knowledge-based technologies or techniques. Like some other recent writers, he also tries to find analogies with evolutionary biology. For example:

Much like DNA, useful knowledge does not exist by itself; it has to be “carried” by people or in storage devices. Unlike DNA, however, carriers can acquire and shed knowledge so that the selection process is quite different. This difference raises the question of how it is transmitted over time, and whether it can actually shrink as well as expand.

One of the real advantages of this book is to move forward a re-think of the “great man” or “great event” approach to history. There are indeed complicated forces at work. I think Mokyr summarizes well this transition when he states:

A century ago, historians of technology felt that individual inventors were the main actors that brought about the Industrial Revolution. Such heroic interpretations were discarded in favor of views that emphasized deeper economic and social factors such as institutions, incentives, demand, and factor prices. It seems, however, that the crucial elements were neither brilliant individuals nor the impersonal forces governing the masses, but a small group of at most a few thousand people who formed a creative community based on the exchange of knowledge. Engineers, mechanics, chemists, physicians, and natural philosophers formed circles in which access to knowledge was the primary objective. Paired with the appreciation that such knowledge could be the base of ever-expanding prosperity, these elite networks were indispensable, even if individual members were not. Theories that link education and human capital to technological progress need to stress the importance of these small creative communities jointly with wider phenomena such as literacy rates and universal schooling.

There is so much to like and to be impressed by in this book and even later Mokyr writings. My first criticism is that I found the pseudo-science of his knowledge labels confusing (I kept having to mentally translate the omega symbol), and I disliked the naming distinctions between propositional and prescriptive, even though I think the concepts are spot on.

My second criticism, a more major one, is that Mokyr notes, but does not adequately pursue, “In the decades after 1815, a veritable explosion of technical literature took place. Comprehensive technical compendia appeared in every industrial field.” Statements such as these, and there are many in the book, hint at perhaps some fundamental drivers.

Mokyr has provided the raw grist for answering his starting question of why such massive economic growth occurred in conjunction with the era of the Industrial Revolution. He has made many insights and posited new factors to explain this salutary discontinuity from all prior human history. But, in this reviewer’s opinion, he still leaves the why tantalizingly close but still unanswered. The fixity of information and growing storehouses because of declining production and access costs remain too poorly explored.

Friday Brown Bag Lunch: This Friday brown bag leftover was first placed into the AI3 refrigerator about four years ago, on July 6, 2006. It was part of a series of book reviews I was doing at that time getting at the importance of bulk paper production as a key enabler of economic growth. No changes have been made to the original posting.

Posted: April 13, 2010


You’ve Got to be Crazy to Look to an Ad-based Revenue Model

OK. After an experiment of more than three years, I have just now canceled my Google AdSense participation. (Which, by the way, Google makes almost impossible to do: finding the cancel link is hard enough; but who remembers the day they first signed up for ads and how many impressions they got that day? Both are required to get a cancellation request approved. Give me a break. It is worse than banks claiming small digits from bank interest for their own income!!)

Despite my sub-title, I never did expect to make much (or, really, any) money from Google ads. When I first signed up for it in Dec 2006, I stated I was doing it to find out how this ad-based business really works.

Well, from my standpoint, it does not work well; actually, not well at all.

Over the years I have seen visits on this site climb to nearly 3 K per day, and other nice growth factors. Perhaps if I were really focused on ad revenue I would have rotated stuff, tried alternative placements, yada yada. But, mostly, I was just trying to see who made out in this ad game.

It is certainly not the standard blog. I think my stats put me somewhere in the top 1% of all sites visited, but even that is not enough to pay my monthly server charges (now higher with Amazon EC2).

Yet, in recent months, I have noticed some vendors have specifically targeted advertising on my blog and there also has been an increase in full banner ads (away from the standard, unobtrusive link Google ads of years past). Maybe they know something I don’t and they are winning, but my monthly ad income has dropped or remained flat.

And, then, I began to get full-panel flashing ads on my site that just screamed Hit me! Hit me! WTF. It was the last straw. Where did the unobtrusive link stuff go? Screw it; I can afford to pay my own monthly chump change.

This is probably not the time or place to discuss business models on the Web, but the woeful state of ad-based revenue is apparent. My goodness, I’m getting tired of ReadWriteWeb, as an example and one of the biggest at that, shilling with repeats and big ads with stories for their prominent advertisers each weekend. And they are one of the few ad winners!

My honest guess is that fewer than 1/10 of 1% of Web sites with advertising make enough to cover their bandwidth and server costs. How do you spell s-m-a-r-t?

So, the experiment is over. I will now think a bit about how I can reclaim that valuable Web page space from my former charitable contribution to the Google cafeteria. Bring on the sushi!

Posted: April 9, 2010

Friday Brown Bag Lunch

Mediating semantic heterogeneities requires tools and automation (or semi-automation) at scale. But existing tools are still crude and lack across-the-board integration. This is one of the next challenges in getting more widespread acceptance of the semantic Web.

In earlier posts, I described the significant progress in climbing the data federation pyramid, today’s evolution in emphasis to the semantic Web, and the 40 or so sources of semantic heterogeneity. We now transition to an overview of how one goes about providing these semantics and resolving these heterogeneities.

Why the Need for Tools and Automation?

In an excellent recent overview of semantic Web progress, Paul Warren points out:[1]

Although knowledge workers no doubt believe in the value of annotating their documents, the pressure to create metadata isn’t present. In fact, the pressure of time will work in a counter direction. Annotation’s benefits accrue to other workers; the knowledge creator only benefits if a community of knowledge workers abides by the same rules. . . . Developing semiautomatic tools for learning ontologies and extracting metadata is a key research area . . . . Having to move out of a user’s typical working environment to ‘do knowledge management’ will act as a disincentive, whether the user is creating or retrieving knowledge.

Of course, even assuming that ontologies are created and semantics and metadata are added to content, there still remain the nasty problems of resolving heterogeneities (semantic mediation) and of efficiently storing and retrieving the metadata and semantic relationships.

Putting this process in place requires infrastructure in the form of tools and automation, plus proper incentives and rewards for users and suppliers to conform to it.

Areas Requiring Tools and Automation

In his paper, Warren repeatedly points to the need for “semi-automatic” methods to make the semantic Web a reality. He makes fully a dozen such references, in addition to multiple references to the need for “reasoning algorithms.” In any case, here are some of the areas noted by Warren needing “semi-automatic” methods:

  • Assign authoritativeness
  • Learn ontologies
  • Infer better search requests
  • Mediate ontologies (semantic resolution)
  • Support visualization
  • Assign collaborations
  • Infer relationships
  • Extract entities
  • Create ontologies
  • Maintain and evolve ontologies
  • Create taxonomies
  • Infer trust
  • Analyze links
  • etc.

In a different vein, SemWebCentral lists these clusters of semantic Web-related tasks, each of which also requires tools:[2]

  • Create an ontology — use a text or graphical ontology editor to create the ontology, which is then validated. The resulting ontology can then be viewed with a browser before being published
  • Disambiguate data — generate a mapping between multiple ontologies to identify where classes and properties are the same
  • Expose a relational database as OWL — an editor is first used to create the ontologies that represent the database schema, then the ontologies are validated, translated to OWL and then the generated OWL is validated
  • Intelligently query distributed data — distributed sources are brought together into a repository and are again able to be queried
  • Manually create data from an ontology — a user would use an editor to create new OWL data based on existing ontologies, which is then validated and browsable
  • Programmatically interact with OWL content — custom programs can view, create, and modify OWL content with an API (a small sketch follows this list)
  • Query non-OWL data — via an annotation tool, create OWL metadata from non-OWL content
  • Visualize semantic data — view semantic data in a custom visualizer.
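
As a concrete illustration of the “programmatically interact with OWL content” task above, here is a minimal Python sketch using rdflib (one API choice among many; the ontology namespace, class and individual names are invented for the example):

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/books#")   # hypothetical ontology namespace

g = Graph()

# Create: define a class and an individual programmatically
g.add((EX.Book, RDF.type, OWL.Class))
g.add((EX.ExampleBook, RDF.type, EX.Book))
g.add((EX.ExampleBook, RDFS.label, Literal("An example OWL individual")))

# View: enumerate everything typed as an owl:Class
for cls in g.subjects(RDF.type, OWL.Class):
    print("Class:", cls)

# Modify: replace the label on the individual
g.remove((EX.ExampleBook, RDFS.label, None))
g.add((EX.ExampleBook, RDFS.label, Literal("A renamed OWL individual")))
```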

With some ontologies reaching hundreds of thousands or even millions of triples, viewing, annotating and reconciling at scale can be daunting tasks, efforts that would never be undertaken without useful tools and automation.

A Workflow Perspective Helps Frame the Challenge

A 2005 paper by Izza, Vincent and Burlat (among many other excellent ones) at the first International Conference on Interoperability of Enterprise Software and Applications (INTEROP-ESA) provides a very readable overview on the role of semantics and ontologies in enterprise integration.[3] Besides proposing a fairly compelling unified framework, the authors also present a useful workflow perspective emphasizing Web services (WS), also applicable to semantics in general, that helps frame this challenge:

Generic Semantic Integration Workflow (adapted from [3])

For existing data and documents, the workflow begins with information extraction or annotation of semantics and metadata (#1) in accordance with a reference ontology. Newly found information via harvesting must also be integrated; however, external information or services may come bearing their own ontologies, in which case some form of semantic mediation is required.

Of course, this is a generic workflow, and depending on the interoperation task, different flows and steps may be required. Indeed, the overall workflow can vary by perspective and researcher, with semantic resolution workflow modeling a prime area of current investigations. (As one alternative among scores, see for example Cardoso and Sheth.[4])

Matching and Mapping Semantic Heterogeneities

Semantic mediation is a process of matching schemas and mapping attributes and values, often with intermediate transformations (such as unit or language conversions) also required. The general problem of schema integration is not new, with one prior reference going back as early as 1986.[5] According to Alon Halevy:[6]

As would be expected, people have tried building semi-automated schema-matching systems by employing a variety of heuristics. The process of reconciling semantic heterogeneity typically involves two steps. In the first, called schema matching, we find correspondences between pairs (or larger sets) of elements of the two schemas that refer to the same concepts or objects in the real world. In the second step, we build on these correspondences to create the actual schema mapping expressions.

The issues of matching and mapping have been addressed in many tools, notably commercial ones from MetaMatrix,[7] and open source and academic projects such as Piazza,[8] SIMILE,[9] and the WSMX (Web service modeling execution environment) protocol from DERI.[10][11] A superb description of the challenges in reconciling the vocabularies of different data sources is also found in the thesis by Dr. AnHai Doan, which won the 2003 ACM Doctoral Dissertation Award.[12]

What all of these efforts have found is that the mediation process cannot be completely automated. The current state of the art is to reconcile what is largely unambiguous automatically, and then prompt analysts or subject matter experts to decide the questionable matches. These are known as “semi-automated” systems, and the user interface, data presentation and workflow become as important as the underlying matching and mapping algorithms. According to the WSMX project, there is always a trade-off between how accurate these mappings are and the degree of automation that can be offered.
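
The flavor of these semi-automated approaches can be suggested with a toy sketch (Python standard library only; the two schemas, the lexical similarity measure and the thresholds are all invented for illustration): clear matches are accepted automatically, while borderline candidates are queued for an analyst to review.

```python
from difflib import SequenceMatcher

# Two invented schemas describing the same domain
schema_a = ["employee_name", "home_phone", "annual_salary", "hire_date"]
schema_b = ["EmpName", "Phone_Home", "Salary", "StartDate"]

def similarity(a: str, b: str) -> float:
    """Crude lexical similarity on normalized element names."""
    normalize = lambda s: s.lower().replace("_", "")
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

AUTO_ACCEPT = 0.70   # thresholds are arbitrary illustration values
REVIEW = 0.50

for a in schema_a:
    best = max(schema_b, key=lambda b: similarity(a, b))
    score = similarity(a, best)
    if score >= AUTO_ACCEPT:
        print(f"matched  : {a} -> {best}  ({score:.2f})")
    elif score >= REVIEW:
        print(f"review?  : {a} -> {best}  ({score:.2f})  # ask an analyst")
    else:
        print(f"no match : {a}  ({score:.2f})")
```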

Also a Need for Efficient Semantic Data Stores

Once all of these reconciliations take place there is the (often undiscussed) need to index, store and retrieve these semantics and their relationships at scale, particularly for enterprise deployments. This is a topic I have addressed many times from the standpoint of scalability, more scalability, and comparisons of database and relational technologies, but it is also not a new topic in the general community.

As Stonebraker and Hellerstein note in their retrospective covering 35 years of development in databases,[13] some of the first post-relational data models were typically called semantic data models, including those of Smith and Smith in 1977[14] and Hammer and McLeod in 1981.[15] Perhaps what is different now is our ability to address some of the fundamental issues.

At any rate, this subsection is included here because of the hidden importance of database foundations. It is therefore a topic often addressed in this series.

A Partial Listing of Semantic Web Tools

In all of these areas, there is a growing, but still spotty, set of tools for conducting these semantic tasks. SemWebCentral, the open source tools resource center, for example, lists many tools and whether they interact or not with one another (the general answer is often No).[16] Protégé also has a fairly long list of plugins, but unfortunately not well organized.[17]

In the table below, I begin to compile a partial listing of semantic Web tools, with more than 50 listed. Though a few are commercial, most are open source. Also, for the open source tools, only the most prominent ones are listed (Sourceforge, for example, has about 200 projects listed with some relation to the semantic Web, though most are minor or not yet in alpha release).

NAME    URL    DESCRIPTION

Almo http://ontoware.org/projects/almo An ontology-based workflow engine in Java
Altova SemanticWorks http://www.altova.com/products_semanticworks.html Visual RDF and OWL editor that auto-generates RDF/XML or nTriples based on visual ontology design
Bibster http://bibster.semanticweb.org/ A semantics-based bibliographic peer-to-peer system
cwm http://www.w3.org/2000/10/swap/doc/cwm.html A general purpose data processor for the semantic Web
Deep Query Manager http://www.brightplanet.com/products/dqm_overview.asp Search federator from deep Web sources
DOSE https://sourceforge.net/projects/dose A distributed platform for semantic annotation
ekoss.org http://www.ekoss.org/ A collaborative knowledge sharing environment where model developers can submit advertisements
Endeca http://www.endeca.com Facet-based content organizer and search platform
FOAM http://ontoware.org/projects/map Framework for ontology alignment and mapping
Gnowsis http://www.gnowsis.org/ A semantic desktop environment
GrOWL http://ecoinformatics.uvm.edu/technologies/growl-knowledge-modeler.html Open source graphical ontology browser and editor
HAWK http://swat.cse.lehigh.edu/projects/index.html#hawk OWL repository framework and toolkit
HELENOS http://ontoware.org/projects/artemis A Knowledge discovery workbench for the semantic Web
Jambalaya http://www.thechiselgroup.org/jambalaya Protégé plug-in for visualizing ontologies
Jastor http://jastor.sourceforge.net/ Open source Java code generator that emits Java Beans from ontologies
Jena http://jena.sourceforge.net/ Open source ontology API written in Java
KAON http://kaon.semanticweb.org/ Open source ontology management infrastructure
Kazuki http://projects.semwebcentral.org/projects/kazuki/ Generates a java API for working with OWL instance data directly from a set of OWL ontologies
Kowari http://www.kowari.org/ Open source database for RDF and OWL
LuMriX http://www.lumrix.net/xmlsearch.php A commercial search engine using semantic Web technologies
MetaMatrix http://www.metamatrix.com/ Semantic vocabulary mediation and other tools
Metatomix http://www.metatomix.com/ Commercial semantic toolkits and editors
MindRaider http://mindraider.sourceforge.net/index.html Open source semantic Web outline editor
Model Futures OWL Editor http://www.modelfutures.com/OwlEditor.html Simple OWL tools, featuring UML (XMI), ErWin, thesaurus and imports
Net OWL http://www.netowl.com/ Entity extraction engine from SRA International
Nokia Semantic Web Server https://sourceforge.net/projects/sws-uriqa An RDF based knowledge portal for publishing both authoritative and third party descriptions of URI denoted resources
OntoEdit/OntoStudio http://ontoedit.com/ Engineering environment for ontologies
OntoMat Annotizer http://annotation.semanticweb.org/ontomat Interactive Web page OWL and semantic annotator tool
Oyster http://ontoware.org/projects/oyster Peer-to-peer system for storing and sharing ontology metadata
Piggy Bank http://simile.mit.edu/piggy-bank/ A Firefox-based semantic Web browser
Pike http://pike.ida.liu.se/ A dynamic programming (scripting) language similar to Java and C for the semantic Web
pOWL http://powl.sourceforge.net/index.php Semantic Web development platform
Protégé http://protege.stanford.edu/ Open source visual ontology editor written in Java with many plug-in tools
RACER Project https://sourceforge.net/projects/racerproject A collection of Projects and Tools to be used with the semantic reasoning engine RacerPro
RDFReactor http://rdfreactor.ontoware.org/ Access RDF from Java using inferencing
Redland http://librdf.org/ Open source software libraries supporting RDF
RelationalOWL https://sourceforge.net/projects/relational-owl Automatically extracts the semantics of virtually any relational database and transforms this information automatically into RDF/OWL
Semantical http://semantical.org/ Open source semantic Web search engine
SemanticWorks http://www.altova.com/products_semanticworks.html SemanticWorks RDF/OWL Editor
Semantic Mediawiki https://sourceforge.net/projects/semediawiki Semantic extension to the MediaWiki wiki
Semantic Net Generator https://sourceforge.net/projects/semantag Utility for generating topic maps automatically
Sesame http://www.openrdf.org/ An open source RDF database with support for RDF Schema inferencing and querying
SMART http://web.ict.nsc.ru/smart/index.phtml?lang=en System for Managing Applications based on RDF Technology
SMORE http://www.mindswap.org/2005/SMORE/ OWL markup for HTML pages
SPARQL http://www.w3.org/TR/rdf-sparql-query/ Query language for RDF
SWCLOS http://iswc2004.semanticweb.org/demos/32/ A semantic Web processor using Lisp
Swoogle http://swoogle.umbc.edu/ A semantic Web search engine with 1.5 M resources
SWOOP http://www.mindswap.org/2004/SWOOP/ A lightweight ontology editor
Turtle http://www.ilrt.bris.ac.uk/discovery/2004/01/turtle/ Terse RDF “Triple” language
WSMO Studio https://sourceforge.net/projects/wsmostudio A semantic Web service editor compliant with WSMO as a set of Eclipse plug-ins
WSMT Toolkit https://sourceforge.net/projects/wsmt The Web Service Modeling Toolkit (WSMT) is a collection of tools for use with the Web Service Modeling Ontology (WSMO), the Web Service Modeling Language (WSML) and the Web Service Execution Environment (WSMX)
WSMX https://sourceforge.net/projects/wsmx/ Execution environment for dynamic use of semantic Web services

Tools Still Crude, Integration Not Compelling

Individually, there are some impressive and capable tools on this list. Generally, however, the interfaces are not intuitive, integration between tools is lacking, and it remains unclear why and how standard analysts should embrace them. In the semantic Web, we have yet to see an application of the magnitude of the first Mosaic browser that made HTML and the World Wide Web compelling.

It is perhaps likely that a similar “killer app” may not be forthcoming for the semantic Web. But it is important to remember just how entwined tools are to accelerating acceptance and growth of new standards and protocols.

Friday Brown Bag Lunch: This Friday brown bag leftover was first placed into the AI3 refrigerator about four years ago, on June 12, 2006. It was the follow-on to last week’s Brown Bag Lunch posting. It is also the first attempt I made at assembling semantic Web and related tools, which has now grown into the 800+ Sweet Tools listing. No changes have been made to the original posting.

[3] Said Izza, Lucien Vincent and Patrick Burlat, “A Unified Framework for Enterprise Integration: An Ontology-Driven Service-Oriented Approach,” pp. 78-89, in Pre-proceedings of the First International Conference on Interoperability of Enterprise Software and Applications (INTEROP-ESA’2005), Geneva, Switzerland, February 23 – 25, 2005, 618 pp. See http://interop-esa05.unige.ch/INTEROP/Proceedings/Interop-ESAScientific/OneFile/InteropESAproceedings.pdf.
[4] Jorge Cardoso and Amit Sheth, “Semantic Web Processes: Semantics Enabled Annotation, Discovery, Composition and Orchestration of Web Scale Processes,” in the 4th International Conference on Web Information Systems Engineering (WISE 2003), December 10-12, 2003, Rome, Italy. See http://lsdis.cs.uga.edu/lib/presentations/WISE2003-Tutorial.pdf.
[5] C. Batini, M. Lenzerini, and S.B. Navathe, “A Comparative Analysis of Methodologies for Database Schema Integration,” in ACM Computing Survey, 18(4):323-364, 1986.
[6] Alon Halevy, “Why Your Data Won’t Mix,” ACM Queue vol. 3, no. 8, October 2005. See http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=336.
[7] Chuck Moser, Semantic Interoperability: Automatically Resolving Vocabularies, presented at the 4th Semantic Interoperability Conference, February 10, 2006. See http://colab.cim3.net/file/work/SICoP/2006-02-09/Presentations/CMosher02102006.ppt.
[8] Alon Y. Halevy, Zachary G. Ives, Peter Mork and Igor Tatarinov, “Piazza: Data Management Infrastructure for Semantic Web Applications,” Journal of Web Semantics, Vol. 1 No. 2, February 2004, pp. 155-175. See http://www.cis.upenn.edu/~zives/research/piazza-www03.pdf.
[9] Stefano Mazzocchi, Stephen Garland, Ryan Lee, “SIMILE: Practical Metadata for the Semantic Web,” January 26, 2005. See http://www.xml.com/pub/a/2005/01/26/simile.html.
[10] Adrian Mocan, Ed., “WSMX Data Mediation,” in WSMX Working Draft, W3C Organization, 11 October 2005. See http://www.wsmo.org/TR/d13/d13.3/v0.2/20051011.
[11] J. Madhavan, P. A. Bernstein, P. Domingos and A. Y. Halevy, “Representing and Reasoning About Mappings Between Domain Models,” in the Eighteenth National Conference on Artificial Intelligence, pp. 80-86, Edmonton, Alberta, Canada, July 28-August 1, 2002.
[12] AnHai Doan, Learning to Map between Structured Representations of Data, Ph.D. Thesis to the Computer Science & Engineering Department, University of Washington, 2002, 133 pp. See http://anhai.cs.uiuc.edu/home/thesis/anhai-thesis.pdf.
[13] Michael Stonebraker and Joey Hellerstein, “What Goes Around Comes Around,” in Joseph M. Hellerstein and Michael Stonebraker, editors, Readings in Database Systems, Fourth Edition, pp. 2-41, The MIT Press, Cambridge, MA, 2005. See http://mitpress.mit.edu/books/chapters/0262693143chapm1.pdf.
[14] John Miles Smith and Diane C. P. Smith, “Database Abstractions: Aggregation and Generalization,” ACM Transactions on Database Systems 2(2): 105-133, 1977.
[15] Michael Hammer and Dennis McLeod, “Database Description with SDM: A Semantic Database Model,” ACM Transactions on Database Systems 6(3): 351-386, 1981.
Posted: April 2, 2010

Friday Brown Bag Lunch

Semantic mediation — that is, resolving semantic heterogeneities — must address more than 40 discrete categories of potential mismatches from units of measure, terminology, language, and many others. These sources may derive from structure, domain, data or language.

Earlier postings in this recent series traced the progress in climbing the data federation pyramid to today’s current emphasis on the semantic Web. Partly, this series is aimed at disabusing the notion that data extensibility can arise simply by using the XML (eXtensible Markup Language) data representation protocol. As Stonebraker and Hellerstein correctly observe:

XML is sometimes marketed as the solution to the semantic heterogeneity problem . . . . Nothing could be further from the truth. Just because two people tag a data element as a salary does not mean that the two data elements are comparable. One could be salary after taxes in French francs including a lunch allowance, while the other could be salary before taxes in US dollars. Furthermore, if you call them “rubber gloves” and I call them “latex hand protectors”, then XML will be useless in deciding that they are the same concept. Hence, the role of XML will be limited to providing the vocabulary in which common schemas can be constructed.[1]
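
A tiny sketch of Stonebraker and Hellerstein’s salary example (Python standard library; the attribute names, exchange rate and tax assumption are all invented for illustration): both sources tag an element salary, but the values only become comparable once the out-of-band semantics are made explicit and reconciled.

```python
import xml.etree.ElementTree as ET

source_a = ET.fromstring(
    '<employee><salary currency="FRF" basis="net" lunch_allowance="included">'
    '250000</salary></employee>')
source_b = ET.fromstring(
    '<employee><salary currency="USD" basis="gross">60000</salary></employee>')

FRF_TO_USD = 0.15   # invented conversion rate, purely for illustration

def usd_gross_equivalent(salary_el):
    """The shared XML tag says nothing; the attributes carry the real semantics."""
    value = float(salary_el.text)
    if salary_el.get("currency") == "FRF":
        value *= FRF_TO_USD
    if salary_el.get("basis") == "net":
        value /= 0.7   # assume a 30% tax wedge -- yet another modeling decision
    return value

for source in (source_a, source_b):
    print(round(usd_gross_equivalent(source.find("salary"))))
```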

This series also covers the ontologies and the OWL language (written in XML) that now give us the means to understand and process these different domains and “world views” by machine. According to Natalya Noy, one of the principal researchers behind the Protégé development environment for ontologies and knowledge-based systems:

How are ontologies and the Semantic Web different from other forms of structured and semi-structured data, from database schemas to XML? Perhaps one of the main differences lies in their explicit formalization. If we make more of our assumptions explicit and able to be processed by machines, automatically or semi-automatically integrating the data will be easier. Here is another way to look at this: ontology languages have formal semantics, which makes building software agents that process them much easier, in the sense that their behavior is much more predictable (assuming they follow the specified explicit semantics–but at least there is something to follow). [2]

Again, however, simply because OWL (or similar) languages now give us the means to represent an ontology, we still have the vexing challenge of how to resolve the differences between different “world views,” even within the same domain. According to Alon Halevy:

When independent parties develop database schemas for the same domain, they will almost always be quite different from each other. These differences are referred to as semantic heterogeneity, which also appears in the presence of multiple XML documents, Web services, and ontologies–or more broadly, whenever there is more than one way to structure a body of data. The presence of semi-structured data exacerbates semantic heterogeneity, because semi-structured schemas are much more flexible to start with. For multiple data systems to cooperate with each other, they must understand each other’s schemas. Without such understanding, the multitude of data sources amounts to a digital version of the Tower of Babel. [3]

In the sections below, we describe the sources for how this heterogeneity arises and classify the many different types of heterogeneity. I then describe some broad approaches to overcoming these heterogeneities, though a subsequent post looks at that topic in more detail.

Causes and Sources of Semantic Heterogeneity

There are many potential circumstances where semantic heterogeneity may arise (partially from Halevy [3]):

  • Enterprise information integration
  • Querying and indexing the deep Web (which is a classic data federation problem in that there are literally tens to hundreds of thousands of separate Web databases) [4]
  • Merchant catalog mapping
  • Schema v. data heterogeneity
  • Schema heterogeneity and semi-structured data.

Naturally, there will always be differences in how differing authors or sponsors create their own particular “world view,” which, if transmitted in XML or expressed through an ontology language such as OWL, may also result in differences based on expression or syntax. Indeed, the ease of conveying these schema as semi-structured XML, RDF or OWL is in and of itself a source of potential expression heterogeneities. There are also other sources, in simple schema use and versioning, that can create mismatches [3]. Thus, possible drivers of semantic mismatches include world view, perspective, syntax, structure, versioning and timing:

  • One schema may express a similar “world view” with different syntax, grammar or structure
  • One schema may be a new version of the other
  • Two or more schemas may be evolutions of the same original schema
  • There may be many sources modeling the same aspects of the underlying domain (“horizontal resolution” such as for competing trade associations or standards bodies), or
  • There may be many sources that cover different domains but overlap at the seams (“vertical resolution” such as between pharmaceuticals and basic medicine).

Regardless, the needs for semantic mediation are manifest, as are the ways in which semantic heterogeneities may arise.

Classification of Semantic Heterogeneities

The earliest classification scheme applied to data semantics that I am aware of is from William Kent nearly 20 years ago.[5] (If you know of earlier ones, please send me a note.) Kent’s approach dealt more with structural mapping issues (see below) than with differences in meaning, which he suggested data dictionaries might potentially solve.

The most comprehensive schema I have yet encountered is from Pluempitiwiriyawej and Hammer, “A Classification Scheme for Semantic and Schematic Heterogeneities in XML Data Sources.” [6] They classify heterogeneities into three broad classes:

  • Structural conflicts arise when the schemas of the sources representing related or overlapping data exhibit discrepancies. Structural conflicts can be detected when comparing the underlying DTDs. The class of structural conflicts includes generalization conflicts, aggregation conflicts, internal path discrepancy, missing items, element ordering, constraint and type mismatch, and naming conflicts between the element types and attribute names.
  • Domain conflicts arise when the semantics of the data sources that will be integrated exhibit discrepancies. Domain conflicts can be detected by looking at the information contained in the DTDs and using knowledge about the underlying data domains. The class of domain conflicts includes schematic discrepancy, scale or unit, precision, and data representation conflicts.
  • Data conflicts refer to discrepancies among similar or related data values across multiple sources. Data conflicts can only be detected by comparing the underlying DOCs. The class of data conflicts includes ID-value, missing data, incorrect spelling, and naming conflicts between the element contents and the attribute values.

Moreover, mismatches or conflicts can occur between set elements (a “population” mismatch) or attributes (a “description” mismatch).

The table below builds on Pluempitiwiriyawej and Hammer’s schema by adding the fourth major explicit category of language, leading to about 40 distinct potential sources of semantic heterogeneities:

Class / Category / Subcategory

  • STRUCTURAL
      • Naming: Case Sensitivity; Synonyms; Acronyms; Homonyms
      • Generalization / Specialization
      • Aggregation: Intra-aggregation; Inter-aggregation
      • Internal Path Discrepancy
      • Missing Item: Content Discrepancy; Attribute List Discrepancy; Missing Attribute; Missing Content
      • Element Ordering
      • Constraint Mismatch
      • Type Mismatch
  • DOMAIN
      • Schematic Discrepancy: Element-value to Element-label Mapping; Attribute-value to Element-label Mapping; Element-value to Attribute-label Mapping; Attribute-value to Attribute-label Mapping
      • Scale or Units
      • Precision
      • Data Representation: Primitive Data Type; Data Format
  • DATA
      • Naming: Case Sensitivity; Synonyms; Acronyms; Homonyms
      • ID Mismatch or Missing ID
      • Missing Data
      • Incorrect Spelling
  • LANGUAGE
      • Encoding: Ingest Encoding Mismatch; Ingest Encoding Lacking; Query Encoding Mismatch; Query Encoding Lacking
      • Languages: Script Mismatches; Parsing / Morphological Analysis Errors (many); Syntactical Errors (many); Semantic Errors (many)
Most of these line items are self-explanatory, but a few may not be:

  • Homonyms refer to the same name referring to more than one concept, such as Name referring to a person v. Name referring to a book
  • A generalization/specialization mismatch can occur when single items in one schema are related to multiple items in another schema, or vice versa. For example, one schema may refer to “phone” but the other schema has multiple elements such as “home phone,” “work phone” and “cell phone”
  • Intra-aggregation mismatches come when the same population is divided differently (Census v. Federal regions for states, or full person names v. first-middle-last, for examples) by schema, whereas inter-aggregation mismatches can come from sums or counts as added values
  • Internal path discrepancies can arise from different source-target retrieval paths in two different schema (for example, hierarchical structures where the elements are different levels of remove)
  • The four sub-types of schematic discrepancy refer to where attribute and element names may be interchanged between schema
  • Under languages, encoding mismatches can occur when either the import or export of data to XML assumes the wrong encoding type (a small illustration follows this list). While XML is based on Unicode, it is important that source retrievals and issued queries be in the proper encoding of the source. For Web retrievals this is very important because only about 4% of all documents are in Unicode; earlier BrightPlanet estimates suggested there may be on the order of 25,000 language-encoding pairs presently on the Internet
  • Even should the correct encoding be detected, there are significant differences in different language sources in parsing (white space, for example), syntax and semantics that can also lead to many error types.
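
As a small illustration of the encoding point above (plain Python, no external libraries; the sample string is arbitrary), the same byte stream decoded under a wrong encoding assumption yields silently different text, with no error raised to flag the mismatch:

```python
# Bytes as they might arrive from a harvested source document
raw = "Müller électricité".encode("utf-8")

correct = raw.decode("utf-8")     # matches the source's actual encoding
mangled = raw.decode("latin-1")   # a wrong guess -- no exception, just mojibake

print(correct)   # Müller électricité
print(mangled)   # MÃ¼ller Ã©lectricitÃ©

# Queries suffer the same way: a search term encoded differently from the
# indexed content will simply fail to match.
```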

It should be noted that a different take on classifying semantics and integration approaches is taken by Sheth et al. [7] Under their concept, they split semantics into three forms: implicit, formal and powerful. Implicit semantics are what is either largely present or can easily be extracted; formal semantics, though relatively scarce, occur in the form of ontologies or other description logics; and powerful (soft) semantics are fuzzy and not limited to rigid set-based assignments. Sheth et al.’s main point is that first-order logic (FOL) or description logic alone is inadequate to properly capture the needed semantics.

From my viewpoint, Pluempitiwiriyawej and Hammer’s [6] classification better lends itself to pragmatic tools and approaches, though the Sheth et al. approach also helps indicate what can be processed in situ from input data v. inferred or probabilistic matches.

Importance of Reference Standards

An attractive and compelling vision — perhaps even a likely one — is that standard reference ontologies become increasingly prevalent as time moves on and semantic mediation is seen as more of a mainstream problem. Certainly, a start on this has been seen with the use of the Dublin Core metadata initiative, and increasingly other associations, organizations, and major buyers are busy developing “standardized” or reference ontologies.[8] Indeed, there are now more than 10,000 ontologies available on the Web.[9] Insofar as these gain acceptance, semantic mediation can become an effort mostly at the periphery and not the core.
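
A quick sketch of what adopting a shared reference vocabulary can look like in practice (Python with rdflib; the document URI is invented, while the Dublin Core terms are the real dc: and dcterms: properties): describing a resource with community-standard terms rather than home-grown ones removes much of the downstream mediation burden.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC, DCTERMS

doc = URIRef("http://example.org/docs/deep-web-paper")   # illustrative URI

g = Graph()
g.add((doc, DC.title, Literal("The Deep Web: Surfacing Hidden Value")))
g.add((doc, DC.creator, Literal("Michael K. Bergman")))
g.add((doc, DCTERMS.issued, Literal("2001-07")))

# Any consumer that understands Dublin Core can interpret these triples
# without a source-specific mapping step.
print(g.serialize(format="turtle"))
```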

But such is not the case today. Standards have had only limited success, and then only in targeted domains where incentives are strong. That acceptance and benefit threshold has yet to be reached on the Web. Until such time, a multiplicity of automated methods, semi-automated methods and gazetteers will all be required to help resolve these potential heterogeneities.

Friday Brown Bag Lunch: This Friday brown bag leftover was first placed into the AI3 refrigerator about four years ago, on June 6, 2006. No changes have been made to the original posting. A current approach to dealing with these heterogeneities would be to use “bridging” ontologies that map the mismatches.

[1] Michael Stonebraker and Joey Hellerstein, “What Goes Around Comes Around,” in Joseph M. Hellerstein and Michael Stonebraker, editors, Readings in Database Systems, Fourth Edition, pp. 2-41, The MIT Press, Cambridge, MA, 2005. See http://mitpress.mit.edu/books/chapters/0262693143chapm1.pdf.
[2] Natalya Noy, “Order from Chaos,” ACM Queue vol. 3, no. 8, October 2005. See http://www.acmqueue.com/modules.php?name=Content&pa=showpage&pid=341&page=1
[3] Alon Halevy, “Why Your Data Won’t Mix,” ACM Queue vol. 3, no. 8, October 2005. See http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=336.
[4] Michael K. Bergman, “The Deep Web: Surfacing Hidden Value,” BrightPlanet Corporation White Paper, June 2000. The most recent version of the study was published by the University of Michigan’s Journal of Electronic Publishing in July 2001. See http://www.press.umich.edu/jep/07-01/bergman.html.
[5] William Kent, “The Many Forms of a Single Fact”, Proceedings of the IEEE COMPCON, Feb. 27-Mar. 3, 1989, San Francisco. Also HPL-SAL-88-8, Hewlett-Packard Laboratories, Oct. 21, 1988. [13 pp]. See http://www.bkent.net/Doc/manyform.htm.
[6] Charnyote Pluempitiwiriyawej and Joachim Hammer, “A Classification Scheme for Semantic and Schematic Heterogeneities in XML Data Sources,” Technical Report TR00-004, University of Florida, Gainesville, FL, 36 pp., September 2000. See ftp.dbcenter.cise.ufl.edu/Pub/publications/tr00-004.pdf.
[7] Amit Sheth, Cartic Ramakrishnan and Christopher Thomas, “Semantics for the Semantic Web: The Implicit, the Formal and the Powerful,” in Int’l Journal on Semantic Web & Information Systems, 1(1), 1-18, Jan-March 2005. See http://www.informatik.uni-trier.de/~ley/db/journals/ijswis/ijswis1.html
[8] See, among scores of possible examples, the NIEM (National Information Exchange Model) agreed to between the US Departments of Justice and Homeland Security; see http://www.niem.gov/.