There has been a bit of a manic-depressive character on the Web waves of late with respect to linked data. On the one hand, we have seen huzzahs and celebrations from the likes of ReadWriteWeb and Semantic Web.com and, just concluded, the Linked Data on the Web (LDOW) workshop at WWW2010. This treatment has tended to tout the coming of the linked data era and to seek ideas about possible, cool linked data apps . This rise in visibility has been accomplished by much manic and excited discussion on various mailing lists.
On the other hand, we have seen much wringing of hands and gnashing of teeth for why linked data is not being used more and why the broader issue of the semantic Web is not seeing more uptake. This depressive “call to arms” has sometimes felt like ravings with blame being given to the poor state of apps and user interfaces to badly linked data to the difficulty of publishing same. Actually using linked data for anything productive (other than single sources like DBpedia) still appears to be an issue.
Meanwhile, among others, Kingsley Idehen, ubiquitous voice on the Twitter #linkeddata channel, has been promoting the separation of identity of linked data from the notion of the semantic Web. He is also trying to change the narrative away from the association of linked data with RDF, instead advocating “Data 3.0″ and the entity-attribute-value (EAV) model understanding of structured data.
As someone less engaged in these topics since my own statements about linked data over the past couple of years , I have my own distanced-yet-still-biased view of what all of this crisis of confidence is about. I think I have a diagnosis for what may be causing this bipolar disorder of linked data .
A fairly universal response from enterprise prospects when raising the topic of the semantic Web is, “That was a big deal of about a decade ago, wasn’t it? It didn’t seem to go anywhere.” And, actually, I think both proponents and keen observers agree with this general sentiment. We have seen the original advocate, Tim Berners-Lee, float the Giant Global Graph balloon, and now Linked Data. Others have touted Web 3.0 or Web of Data or, frankly, dozens of alternatives. Linked data, which began as a set of techniques for publishing RDF, has emerged as a potential marketing hook and saviour for the tainted original semantic Web term.
And therein, I think, lies the rub and the answer to the bipolar disorder.
If one looks at the original principles for putting linked data on the Web or subsequent interpretations, it is clear that linked data (lower case) is merely a set of techniques. Useful techniques, for sure; but really a simple approach to exposing data using the Web with URLs as the naming convention for objects and their relationships. These techniques provide (1) methods to access data on the Web and (2) specifying the relationships to link the data (resources). The first part is mechanistic and not really of further concern here. And, while any predicate can be used to specify a data (resource) relationship, that relationship should also be discoverable with a URL (dereferencable) to qualify as linked data. Then, to actually be semantically useful, that relationship (predicate) should also have a precise definition and be part of a coherent schema. (Note, this last sentence is actually not part of the “standard” principles for linked data, which itself is a problem.)
When used right, these techniques can be powerful and useful. But, poor choices or execution in how relationships are specified often leads to saying little or nothing about semantics. Most linked data uses a woefully small vocabulary of data relationships, with even a smaller set ever used for setting linkages across existing linked data sets . Linked data techniques are a part of the foundation to overall best practices, but not the total foundation. As I have argued for some time, linked data alone does not speak to issues of context nor coherence.
To speak semantically, linked data is not a synonym for the semantic Web nor is it the sameAs the semantic Web. But, many proponents have tried to characterize it as such. The general tenor is to blow the horns hard anytime some large data set is “exposed” as linked data. (No matter whether the data is incoherent, lacks a schema, or is even poorly described and defined.) Heralding such events, followed by no apparent usefulness to the data, causes confusion to reign supreme and disappointment to naturally occur.
The semantic Web (or semantic enterprise or semantic government or similar expressions) is a vision and an ideal. It is also a fairly complete one that potentially embraces machines and agents working in the background to serve us and make us more productive. There is an entire stack of languages and techniques and methods that enable schema to be described and non-conforming data to be interoperated. Now, of course this ideal is still a work in progress. Does that make it a failure?
Well, maybe so, if one sees the semantic Web as marketing or branding. But, who said we had to present it or understand it as such?
The issue is not one of marketing and branding, but the lack of benefits. Now, maybe I have it all wrong, but it seems to me that the argument needs to start with what “linked data” and the “semantic Web” can do for me. What I actually call it is secondary. Rejecting the branding of the semantic Web for linked data or Web 3.0 or any other somesuch is still dressing the emperor in new clothes.
For a couple of years now I have tried in various posts to present linked data in a broader framework of structured and semantic Web data. I first tried to capture this continuum in a diagram from July 2007:
|Document Web||Structured Web||Semantic Web|
Now, three years later, I think the transitional phase of linked data is reaching an end. OK, we have figured out one useful way to publish large datasets staged for possible interoperability. Sure, we have billions of triples and assertions floating out there. But what are we to do with them? And, is any of it any good?
I think Kingsley is right in one sense to point to EAV and structured data. We, too, have not met a structured data format we did not like. There are hundreds of attribute-value pair models of even more generic nature that also belong to the conversation.
One of my most popular posts on this blog has been, ‘Structs’: Naïve Data Formats and the ABox, from January 2009. Today, we have a multitude of popular structured data formats from XML to JSON and even spreadsheets (CSV). Each form has its advocates, place and reasons for existence and popularity (or not). This inherent diversity is a fact and fixture of any discussion of data. It is a major reason why we developed the irON (instance record and object notation) non-RDF vocabulary to provide a bridge from such forms to RDF, which is accessible on the Web via URIs. irON clearly shows that entities can be usefully described and consumed in either RDF or non-RDF serialized forms.
Though RDF and linked data is a great form for expressing this structured information, other forms can convey the same meaning as well. Of the billions of linked data triples exposed to date, surely more than 99% are of this instance-level, “ABox” type of data . And, more telling, of all of the structured data that is publicly obtainable on the Web, my wild guess is that less than 0.0000000001% of that is even linked RDF data .
Neither linked data nor RDF alone will — today or in the near future — play a pivotal or essential role for instance data. The real contribution from RDF and the semantic Web will come from connecting things together, from interoperation and federation and conjoining. This is the provenance of the TBox and is a role barely touched by linked data. Publishing data as linked data helps tremendously in simplifying ingest and guiding the eventual connections, but the making of those connections, testing for their quality and reliability, are steps beyond the linked data ken or purpose.
It seems, then, that we see two different forces and perspectives at work, each contributing in its own way to today’s bipolar nature of linked data.
On the manic side, we see the celebration for the release of each large, linked data set. This perspective seems to care most about volumes and numbers, with less interest in how and whether the data is of quality or useful. This perspective seems to believe “post the data, and the public will come.” This same perspective is also quite parochial with respect to the unsuitability of non-linked data, be it microdata, microformats or any of the older junk.
On the depressed side, linked data has been seen as a more palatable packaging for the disappointments and perceived failures or slow adoption of the earlier semantic Web phrasing. When this perspective sees the lack of structure, defensible connections and other quality problems with linked data as it presently exists, despair and frustration ensue.
But both of these perspectives very much miss the mark. Linked data will never become the universal technique for publishing structured data, and should not be expected to be such. Numbers are never a substitute for quality. And linked data lacks the standards, scope and investment made in the semantic Web to date. Be patient; don’t despair; structured data and the growth of semantics and useful metadata is proceeding just fine.
Unrealistic expectations or wrong roles and metrics simply confuse the public. We are fortunate that most potential buyers do not frequent the community’s various mailing lists. Reduced expectations and an understanding of linked data’s natural role is perhaps the best way to bring back balance.
We have consciously moved our communications focus from speaking internally to the community to reaching out to the broader enterprise public. There is much of education, clarification and dialog that is now needed with the buying public. The time has moved past software demos and toys to workable, pragmatic platforms, and the methodologies and documentation necessary to support them. This particular missive speaking to the founding community is (perhaps many will Hurray!) likely to become even more rare as we continue to focus outward.
As Structured Dynamics has stated many times, we are committed to linked data, presenting our information as such, and providing better tools for producing and consuming it. We have made it one of the seven foundations to our technology stack and methodology.
But, linked data on its own is inadequate as an interoperability standard. Many practitioners don’t publish it right, characterize it right, or link to it right. That does not negate its benefits, but it does make it a poor candidate to install on the semantic Web throne.
Linked data based on RDF is perhaps the first citizen amongst all structured data citizens. It is an expressive and readily consumed means for publishing and relating structured instance data and one that can be easily interoperated. It is a natural citizen of the Web.
If we can accept and communicate linked data for these strengths, for what it naturally is — a useful set of techniques and best practices for enabling data that can be easily consumed — we can rest easy at night and not go crazy. Otherwise, bring on the Prozac.
In 2002 Joel Mokyr, an economic historian from Northwestern University, wrote a book that should be read by anyone interested in knowledge and its role in economic growth. The Gifts of Athena : Historical Origins of the Knowledge Economy is a sweeping and comprehensive account of the period from 1760 (in what Mokyr calls the “Industrial Enlightenment”) through the Industrial Revolution beginning roughly in 1820 and then continuing through the end of the 19th century.
The book (and related expansions by Mokyr available as separate PDFs on the Internet) should be considered as the definitive reference on this topic to date. The book contains 40 pages of references to all of the leading papers and writers on diverse technologies from mining to manufacturing to health and the household. The scope of subject coverage, granted mostly focused on western Europe and America, is truly impressive.
Mokyr deals with ‘useful knowledge,’ as he acknowledges Simon Kuznets‘ phrase. Mokyr argues that the growth of recent centuries was driven by the accumulation of knowledge and the declining costs of access to it. Mokyr helps to break past logjams that have attempted to link single factors such as the growth in science or the growth in certain technologies (such as the steam engine or electricity) as the key drivers of the massive increases in economic growth that coincided with the era now known as the Industrial Revolution.
Mokyr cracks some of these prior impasses by picking up on ideas first articulated through Michael Polanyi‘s “tacit knowing” (among other recent philosophers interested in the nature and definition of knowledge). Mokyr’s own schema posits propositional knowledge, which he defines as the science, beliefs or the epistemic base of knowledge, which he labels omega (Ω), in combination with prescriptive knowledge, which are the techniques (“recipes”), and which he also labels lambda (λ). Mokyr notes that an addition to omega (Ω) is a discovery, an addition to lambda (λ) is an invention.
One of Mokyr’s key points is that both knowledge types reinforce one another and, of course, the Industrial Revolution was a period of unprecedented growth in such knowledge. Another key point, easily overlooked when “discoveries” are seemingly more noteworthy, is that techniques and practical applications of knowledge can provide a multiplier effect and are equivalently important. For example, in addition to his main case studies of the factory, health and the household, he says:
The inventions of writing, paper, and printing not only greatly reduced access costs but also materially
affected human cognition, including the way people thought about their environment.
Mokyr also correctly notes how the accumulation of knowledge in science and the epistemic base promotes productivity and more still-more efficient discovery mechanisms:
The range of experimentation possibilities that needs to be searched over is far larger if the searcher knows nothing about the natural principles at work. To paraphrase Pasteur’s famous aphorism once more, fortune may sometimes favor unprepared minds, but only for a short while. It is in this respect that the width of the epistemic base makes the big difference.
In my own opinion, I think Mokyr starts to get closer to the mark when he discusses knowledge “storage”, access costs and multiplier effects from basic knowledge-based technologies or techniques. Like some other recent writers, he also tries to find analogies with evolutionary biology. For example:
Much like DNA, useful knowledge does not exist by itself; it has to be “carried” by people or in storage
devices. Unlike DNA, however, carriers can acquire and shed knowledge so that the selection process is quite different. This difference raises the question of how it is transmitted over time, and whether it can actually shrink as well as expand.
One of the real advantages of this book is to move forward a re-think of the “great man” or “great event” approach to history. There are indeed complicated forces at work. I think Mokyr summarizes well this transition when he states:
A century ago, historians of technology felt that individual inventors were the main actors that brought about
the Industrial Revolution. Such heroic interpretations were discarded in favor of views that emphasized deeper economic and social factors such as institutions, incentives, demand, and factor prices. It seems, however, that the crucial elements were neither brilliant individuals nor the impersonal forces governing the masses, but a small group of at most a few thousand peopled who formed a creative community based on the exchange of knowledge. Engineers, mechanics, chemists, physicians, and natural philosophers formed circles in which access to knowledge was the primary objective. Paired with the appreciation that such knowledge could be the base of ever-expanding prosperity, these elite networks were indispensible, even if individual members were not. Theories that link education and human capital of technological progress need to stress the importance of these small creative communities jointly with wider phenomena such as literacy rates and universal schooling.
There is so much to like and to be impressed with this book and even later Mokyr writings. My two criticisms are that, first, I found the pseudo-science of his knowledge labels confusing (I kept having to mentally translate the omega symbol) and I disliked the naming distinctions between propositional and prescriptive, even though I think the concepts are spot on.
My second criticism, a more major one, is that Mokyr notes, but does not adequately pursue, “In the decades after 1815, a veritable explosion of technical literature took place. Comprehensive technical compendia appeared in every industrial field.” Statements such as these, and there are many in the book, hint at perhaps some fundamental drivers.
Mokyr has provided the raw grist for answering his starting question of why such massive economic growth occurred in conjunction with the era of the Industrial Revolution. He has made many insights and posited new factors to explain this salutary discontinuity from all prior human history. But, in this reviewer’s opinion, he still leaves the why tantalizingly close but still unanswered. The fixity of information and growing storehouses because of declining production and access costs remain too poorly explored.
OK. After an experiment of more than three years, I have just now canceled my Google AdSense participation. (Which, Google, by the way, makes almost impossible to do: Finding the cancel link is hard enough; but who remembers the day they first signed up for ads and how many impressions they got that day? Both are required to get a cancellation request approved. Give me a break. It is worse than banks claiming small digits from bank interest for their own income!!)
Despite my sub-title, I never did expect to make much (or, really, any) money from Google ads. When I first signed up for it in Dec 2006, I stated I was doing it to find out how this ad-based business really works.
Well, from my standpoint, it does not work well; actually, not well at all.
Over the years I have seen visits on this site climb to nearly 3 K per day, and other nice growth factors. Perhaps if I were really focused on ad revenue I would have rotated stuff, tried alternative placements, yada yada. But, mostly, I was just trying to see who made out in this ad game.
It is certainly not the standard blog. I think my stats put me somewhere in the top 1% of all sites visited, but even that is not enough to even pay my monthly server charges (now higher with Amazon EC2).
Yet, in recent months, I have noticed some vendors have specifically targeted advertising on my blog and there also has been an increase in full banner ads (away from the standard, unobtrusive link Google ads of years past). Maybe they know something I don’t and they are winning, but my monthly ad income has dropped or remained flat.
And, then, I began to get full panel flashing ads on my site that just screamed Hit me! Hit me!. WTF. It was the last straw. Where did the unobtrusive link stuff go? Screw it; I can afford to pay my own monthly chump change.
This is probably not the time or place to discuss business models on the Web, but the woeful state of ad-based revenue is apparent. My goodness, I’m getting tired of ReadWriteWeb, as an example and one of the biggest at that, shilling with repeats and big ads with stories for their prominent advertisers each weekend. And, they are one of the only few ad winners!
My honest guess is that fewer than 1/10 of 1% of Web sites with advertising make enough to cover their bandwidth and server costs. How do you spell s-m-a-r-t?
So, the experiment is over. I will now think a bit about how I can reclaim that valuable Web page space from my former charitable contribution to the Google cafeteria. Bring on the sushi!
In earlier posts, I described the significant progress in climbing the data federation pyramid, today’s evolution in emphasis to the semantic Web, and the 40 or so sources of semantic heterogeneity. We now transition to an overview of how one goes about providing these semantics and resolving these heterogeneities.
In an excellent recent overview of semantic Web progress, Paul Warren points out:
Although knowledge workers no doubt believe in the value of annotating their documents, the pressure to create metadata isn’t present. In fact, the pressure of time will work in a counter direction. Annotation’s benefits accrue to other workers; the knowledge creator only benefits if a community of knowledge workers abides by the same rules. . . . Developing semiautomatic tools for learning ontologies and extracting metadata is a key research area . . . .Having to move out of a user’s typical working environment to ‘do knowledge management’ will act as a disincentive, whether the user is creating or retrieving knowledge.
Of course, even assuming that ontologies are created and semantics and metadata are added to content, there still remains the nasty problems of resolving heterogeneities (semantic mediation) and efficiently storing and retrieving the metadata and semantic relationships.
Putting all of this process in place requires the infrastructure in the form of tools and automation and proper incentives and rewards for users and suppliers to conform to it.
In his paper, Warren repeatedly points to the need for “semi-automatic” methods to make the semantic Web a reality. He makes fully a dozen such references, in addition to multiple references to the need for “reasoning algorithms.” In any case, here are some of the areas noted by Warren needing “semi-automatic” methods:
In a different vein, SemWebCentral lists these clusters of semantic Web-related tasks, each of which also requires tools:
With some ontologies approaching tens to hundreds of thousands to millions of triples, viewing, annotating and reconciling at scale can be daunting tasks, the efforts behind which would never be taken without useful tools and automation.
A 2005 paper by Izza, Vincent and Burlat (among many other excellent ones) at the first International Conference on Interoperability of Enterprise Software and Applications (INTEROP-ESA) provides a very readable overview on the role of semantics and ontologies in enterprise integration. Besides proposing a fairly compelling unified framework, the authors also present a useful workflow perspective emphasizing Web services (WS), also applicable to semantics in general, that helps frame this challenge:
Generic Semantic Integration Workflow (adapted from )
For existing data and documents, the workflow begins with information extraction or annotation of semantics and metadata (#1) in accordance with a reference ontology. Newly found information via harvesting must also be integrated; however, external information or services may come bearing their own ontologies, in which case some form of semantic mediation is required.
Of course, this is a generic workflow, and depending on the interoperation task, different flows and steps may be required. Indeed, the overall workflow can vary by perspective and researcher, with semantic resolution workflow modeling a prime area of current investigations. (As one alternative among scores, see for example Cardoso and Sheth.)
Semantic mediation is a process of matching schemas and mapping attributes and values, often with intermediate transformations (such as unit or language conversions) also required. The general problem of schema integration is not new, with one prior reference going back as early as 1986.  According to Alon Halevy:
As would be expected, people have tried building semi-automated schema-matching systems by employing a variety of heuristics. The process of reconciling semantic heterogeneity typically involves two steps. In the first, called schema matching, we find correspondences between pairs (or larger sets) of elements of the two schemas that refer to the same concepts or objects in the real world. In the second step, we build on these correspondences to create the actual schema mapping expressions.
The issues of matching and mapping have been addressed in many tools, notably commercial ones from MetaMatrix, and open source and academic projects such as Piazza,  SIMILE,  and the WSMX (Web service modeling execution environment) protocol from DERI.   A superb description of the challenges in reconciling the vocabularies of different data sources is also found in the thesis by Dr. AnHai Doan, which won the 2003 ACM’s Prestigious Doctoral Dissertation Award.
What all of these efforts has found is the inability to completely automate the mediation process. The current state-of-the-art is to reconcile what is largely unambiguous automatically, and then prompt analysts or subject matter experts to decide the questionable matches. These are known as “semi-automated” systems and the user interface and data presentation and workflow become as important as the underlying matching and mapping algorithms. According to the WSMX project, there is always a trade-off between how accurate these mappings are and the degree of automation that can be offered.
Once all of these reconciliations take place there is the (often undiscussed) need to index, store and retrieve these semantics and their relationships at scale, particularly for enterprise deployments. This is a topic I have addressed many times from the standpoint of scalability, more scalability, and comparisons of database and relational technologies, but it is also not a new topic in the general community.
As Stonebraker and Hellerstein note in their retrospective covering 35 years of development in databases, some of the first post-relational data models were typically called semantic data models, including those of Smith and Smith in 1977 and Hammer and McLeod in 1981. Perhaps what is different now is our ability to address some of the fundamental issues.
At any rate, this subsection is included here because of the hidden importance of database foundations. It is therefore a topic often addressed in this series.
In all of these areas, there is a growing, but still spotty, set of tools for conducting these semantic tasks. SemWebCentral, the open source tools resource center, for example, lists many tools and whether they interact or not with one another (the general answer is often No). Protégé also has a fairly long list of plugins, but not unfortunately well organized. 
In the table below, I begin to compile a partial listing of semantic Web tools, with more than 50 listed. Though a few are commercial, most are open source. Also, for the open source tools, only the most prominent ones are listed (Sourceforge, for example, has about 200 projects listed with some relation to the semantic Web though most of minor or not yet in alpha release).
|Almo||http://ontoware.org/projects/almo||An ontology-based workflow engine in Java|
|Altova SemanticWorks||http://www.altova.com/products_semanticworks.html||Visual RDF and OWL editor that auto-generates RDF/XML or nTriples based on visual ontology design|
|Bibster||http://bibster.semanticweb.org/||A semantics-based bibliographic peer-to-peer system|
|cwm||http://www.w3.org/2000/10/swap/doc/cwm.html||A general purpose data processor for the semantic Web|
|Deep Query Manager||http://www.brightplanet.com/products/dqm_overview.asp||Search federator from deep Web sources|
|DOSE||https://sourceforge.net/projects/dose||A distributed platform for semantic annotation|
|ekoss.org||http://www.ekoss.org/||A collaborative knowledge sharing environment where model developers can submit advertisements|
|Endeca||http://www.endeca.com||Facet-based content organizer and search platform|
|FOAM||http://ontoware.org/projects/map||Framework for ontology alignment and mapping|
|Gnowsis||http://www.gnowsis.org/||A semantic desktop environment|
|GrOWL||http://ecoinformatics.uvm.edu/technologies/growl-knowledge-modeler.html||Open source graphical ontology browser and editor|
|HAWK||http://swat.cse.lehigh.edu/projects/index.html#hawk||OWL repository framework and toolkit|
|HELENOS||http://ontoware.org/projects/artemis||A Knowledge discovery workbench for the semantic Web|
|Jambalaya||http://www.thechiselgroup.org/jambalaya||Protégé plug-in for visualizing ontologies|
|Jastor||http://jastor.sourceforge.net/||Open source Java code generator that emits Java Beans from ontologies|
|Jena||http://jena.sourceforge.net/||Opensource ontology API written in Java|
|KAON||http://kaon.semanticweb.org/||Open source ontology management infrastructure|
|Kazuki||http://projects.semwebcentral.org/projects/kazuki/||Generates a java API for working with OWL instance data directly from a set of OWL ontologies|
|Kowari||http://www.kowari.org/||Open source database for RDF and OWL|
|LuMriX||http://www.lumrix.net/xmlsearch.php||A commercial search engine using semantic Web technologies|
|MetaMatrix||http://www.metamatrix.com/||Semantic vocabulary mediation and other tools|
|Metatomix||http://www.metatomix.com/||Commercial semantic toolkits and editors|
|MindRaider||http://mindraider.sourceforge.net/index.html||Open source semantic Web outline editor|
|Model Futures OWL Editor||http://www.modelfutures.com/OwlEditor.html||Simple OWL tools, featuring UML (XMI), ErWin, thesaurus and imports|
|Net OWL||http://www.netowl.com/||Entity extraction engine from SRA International|
|Nokia Semantic Web Server||https://sourceforge.net/projects/sws-uriqa||An RDF based knowledge portal for publishing both authoritative and third party descriptions of URI denoted resources|
|OntoEdit/OntoStudio||http://ontoedit.com/||Engineering environment for ontologies|
|OntoMat Annotizer||http://annotation.semanticweb.org/ontomat||Interactive Web page OWL and semantic annotator tool|
|Oyster||http://ontoware.org/projects/oyster||Peer-to-peer system for storing and sharing ontology metadata|
|Piggy Bank||http://simile.mit.edu/piggy-bank/||A Firefox-based semantic Web browser|
|Pike||http://pike.ida.liu.se/||A dynamic programming (scripting) language similar to Java and C for the semantic Web|
|pOWL||http://powl.sourceforge.net/index.php||Semantic Web development platform|
|Protégé||http://protege.stanford.edu/||Open source visual ontology editor written in Java with many plug-in tools|
|RACER Project||https://sourceforge.net/projects/racerproject||A collection of Projects and Tools to be used with the semantic reasoning engine RacerPro|
|RDFReactor||http://rdfreactor.ontoware.org/||Access RDF from Java using inferencing|
|Redland||http://librdf.org/||Open source software libraries supporting RDF|
|RelationalOWL||https://sourceforge.net/projects/relational-owl||Automatically extracts the semantics of virtually any relational database and transforms this information automatically into RDF/OW|
|Semantical||http://semantical.org/||Open source semantic Web search engine|
|SemanticWorks||http://www.altova.com/products_semanticworks.html||SemanticWorks RDF/OWL Editor|
|Semantic Mediawiki||https://sourceforge.net/projects/semediawiki||Semantic extension to the MediaWiiki wiki|
|Semantic Net Generator||https://sourceforge.net/projects/semantag||Utility for generating topic maps automatically|
|Sesame||http://www.openrdf.org/||An open source RDF database with support for RDF Schema inferencing and querying|
|SMART||http://web.ict.nsc.ru/smart/index.phtml?lang=en||System for Managing Applications based on RDF Technology|
|SMORE||http://www.mindswap.org/2005/SMORE/||OWL markup for HTML pages|
|SPARQL||http://www.w3.org/TR/rdf-sparql-query/||Query language for RDF|
|SWCLOS||http://iswc2004.semanticweb.org/demos/32/||A semantic Web processor using Lisp|
|Swoogle||http://swoogle.umbc.edu/||A semantic Web search engine with 1.5 M resources|
|SWOOP||http://www.mindswap.org/2004/SWOOP/||A lightweight ontology editor|
|Turtle||http://www.ilrt.bris.ac.uk/discovery/2004/01/turtle/||Terse RDF “Triple” language|
|WSMO Studio||https://sourceforge.net/projects/wsmostudio||A semantic Web service editor compliant with WSMO as a set of Eclipse plug-ins|
|WSMT Toolkit||https://sourceforge.net/projects/wsmt||The Web Service Modeling Toolkit (WSMT) is a collection of tools for use with the Web Service Modeling Ontology (WSMO), the Web Service Modeling Language (WSML) and the Web Service Execution Environment (WSMX)|
|WSMX||https://sourceforge.net/projects/wsmx/||Execution environment for dynamic use of semantic Web services|
Individually, there are some impressive and capable tools on this list. Generally, however, the interfaces are not intuitive, integration between tools is lacking, and why and how standard analysts should embrace them is lacking. In the semantic Web, we have yet to see an application of the magnitude of the first Mosaic browser that made HTML and the World Wide Web compelling.
It is perhaps likely that a similar “killer app” may not be forthcoming for the semantic Web. But it is important to remember just how entwined tools are to accelerating acceptance and growth of new standards and protocols.
Earlier postings in this recent series traced the progress in climbing the data federation pyramid to today’s current emphasis on the semantic Web. Partially this series is aimed at disabusing the notion that data extensibility can arise simply by using the XML (eXtensible Markup Language) data representation protocol. As Stonebraker and Hellerstein correctly observe:
XML is sometimes marketed as the solution to the semantic heterogeneity problem . . . . Nothing could be further from the truth. Just because two people tag a data element as a salary does not mean that the two data elements are comparable. One could be salary after taxes in French francs including a lunch allowance, while the other could be salary before taxes in US dollars. Furthermore, if you call them “rubber gloves” and I call them “latex hand protectors”, then XML will be useless in deciding that they are the same concept. Hence, the role of XML will be limited to providing the vocabulary in which common schemas can be constructed.
This series also covers the ontologies and the OWL language (written in XML) that now give us the means to understand and process these different domains and “world views” by machine. According to Natalya Noy, one of the principal researchers behind the Protégé development environment for ontologies and knowledge-based systems:
How are ontologies and the Semantic Web different from other forms of structured and semi-structured data, from database schemas to XML? Perhaps one of the main differences lies in their explicit formalization. If we make more of our assumptions explicit and able to be processed by machines, automatically or semi-automatically integrating the data will be easier. Here is another way to look at this: ontology languages have formal semantics, which makes building software agents that process them much easier, in the sense that their behavior is much more predictable (assuming they follow the specified explicit semantics–but at least there is something to follow). 
Again, however, simply because OWL (or similar) languages now give us the means to represent an ontology, we still have the vexing challenge of how to resolve the differences between different “world views,” even within the same domain. According to Alon Halevy:
When independent parties develop database schemas for the same domain, they will almost always be quite different from each other. These differences are referred to as semantic heterogeneity, which also appears in the presence of multiple XML documents, Web services, and ontologies–or more broadly, whenever there is more than one way to structure a body of data. The presence of semi-structured data exacerbates semantic heterogeneity, because semi-structured schemas are much more flexible to start with. For multiple data systems to cooperate with each other, they must understand each other’s schemas. Without such understanding, the multitude of data sources amounts to a digital version of the Tower of Babel. 
In the sections below, we describe the sources for how this heterogeneity arises and classify the many different types of heterogeneity. I then describe some broad approaches to overcoming these heterogeneities, though a subsequent post looks at that topic in more detail.
There are many potential circumstances where semantic heterogeneity may arise (partially from Halevy ):
Naturally, there will always be differences in how differing authors or sponsors create their own particular “world view,” which, if transmitted in XML or expressed through an ontology language such as OWL may also result in differences based on expression or syntax. Indeed, the ease of conveying these schema as semi-structured XML, RDF or OWL is in and of itself a source of potential expression heterogeneities. There are also other sources in simple schema use and versioning that can create mismatches . Thus, possible drivers in semantic mismatches can occur from world view, perspective, syntax, structure and versioning and timing:
Regardless, the needs for semantic mediation are manifest, as are the ways in which semantic heterogeneities may arise.
The first known classification scheme applied to data semantics that I am aware of is from William Kent nearly 20 years ago. (If you know of earlier ones, please send me a note.) Kent’s approach dealt more with structural mapping issues (see below) than differences in meaning, which he pointed to data dictionaries as potentially solving.
The most comprehensive schema I have yet encountered is from Pluempitiwiriyawej and Hammer, “A Classification Scheme for Semantic and Schematic Heterogeneities in XML Data Sources.”  They classify heterogeneities into three broad classes:
- Structural conflicts arise when the schema of the sources representing related or overlapping data exhibit discrepancies. Structural conflicts can be detected when comparing the underlying DTDs. The class of structural conflicts includes generalization conflicts, aggregation conflicts, internal path discrepancy, missing items, element ordering, constraint and type mismatch, and naming conflicts between the element types and attribute names.
- Domain conflicts arise when the semantic of the data sources that will be integrated exhibit discrepancies. Domain conflicts can be detected by looking at the information contained in the DTDs and using knowledge about the underlying data domains. The class of domain conflicts includes schematic discrepancy, scale or unit, precision, and data representation conflicts.
- Data conflicts refer to discrepancies among similar or related data values across multiple sources. Data conflicts can only be detected by comparing the underlying DOCs. The class of data conflicts includes ID-value, missing data, incorrect spelling, and naming conflicts between the element contents and the attribute values.
Moreover, mismatches or conflicts can occur between set elements (a “population” mismatch) or attributes (a “description” mismatch).
The table below builds on Pluempitiwiriyawej and Hammer’s schema by adding the fourth major explicit category of language, leading to about 40 distinct potential sources of semantic heterogeneities:
|Generalization / Specialization|
|Internal Path Discrepancy|
|Missing Item||Content Discrepancy|
|Attribute List Discrepancy|
|DOMAIN||Schematic Discrepancy||Element-value to Element-label Mapping|
|Attribute-value to Element-label Mapping|
|Element-value to Attribute-label Mapping|
|Attribute-value to Attribute-label Mapping|
|Scale or Units|
|Data Representation||Primitive Data Type|
|ID Mismatch or Missing ID|
|LANGUAGE||Encoding||Ingest Encoding Mismatch|
|Ingest Encoding Lacking|
|Query Encoding Mismatch|
|Query Encoding Lacking|
|Parsing / Morphological Analysis Errors (many)|
|Syntactical Errors (many)|
|Semantic Errors (many)|
Most of these line items are self-explanatory, but a few may not be:
It should be noted that a different take on classifying semantics and integration approaches is taken by Sheth et al.  Under their concept, they split semantics into three forms: implicit, formal and powerful. Implicit semantics are what is either largely present or can easily be extracted; formal languages, though relatively scarce, occur in the form of ontologies or other descriptive logics; and powerful (soft) semantics are fuzzy and not limited to rigid set-based assignments. Sheth et al.’s main point is that first-order logic (FOL) or descriptive logic is inadequate alone to properly capture the needed semantics.
From my viewpoint, Pluempitiwiriyawej and Hammer’s  classification better lends itself to pragmatic tools and approaches, though the Sheth et al. approach also helps indicate what can be processed in situ from input data v. inferred or probabalistic matches.
An attractive and compelling vision — perhaps even a likely one — is that standard reference ontologies become increasingly prevalent as time moves on and semantic mediation is seen as more of a mainstream problem. Certainly, a start on this has been seen with the use of the Dublin Core metadata initiative, and increasingly other associations, organizations, and major buyers are busy developing “standardized” or reference ontologies. Indeed, there are now more than 10,000 ontologies available on the Web. Insofar as these gain acceptance, semantic mediation can become an effort mostly at the periphery and not the core.
But, such is not the case today. Standards only have limited success and in targeted domains where incentives are strong. That acceptance and benefit threshold has yet to be reached on the Web. Until such time, a multiplicity of automated methods, semi-automated methods and gazetteers will all be required to help resolve these potential heterogeneities.