Mediating semantic heterogeneities requires tools and automation (or semi-automation) at scale. But existing tools are still crude and lack across-the-board integration. This is one of the next challenges in getting more widespread acceptance of the semantic Web.
In earlier posts, I described the significant progress in climbing the data federation pyramid, today’s evolution in emphasis to the semantic Web, and the 40 or so sources of semantic heterogeneity. We now transition to an overview of how one goes about providing these semantics and resolving these heterogeneities.
Why the Need for Tools and Automation?
Although knowledge workers no doubt believe in the value of annotating their documents, the pressure to create metadata isn’t present. In fact, the pressure of time will work in a counter direction. Annotation’s benefits accrue to other workers; the knowledge creator only benefits if a community of knowledge workers abides by the same rules. . . . Developing semiautomatic tools for learning ontologies and extracting metadata is a key research area . . . . Having to move out of a user’s typical working environment to ‘do knowledge management’ will act as a disincentive, whether the user is creating or retrieving knowledge.
Of course, even assuming that ontologies are created and semantics and metadata are added to content, there still remain the nasty problems of resolving heterogeneities (semantic mediation) and of efficiently storing and retrieving the metadata and semantic relationships.
Putting all of this process in place requires infrastructure in the form of tools and automation, plus proper incentives and rewards for users and suppliers to conform to it.
Areas Requiring Tools and Automation
In his paper, Warren repeatedly points to the need for “semi-automatic” methods to make the semantic Web a reality. He makes fully a dozen such references, in addition to multiple references to the need for “reasoning algorithms.” In any case, here are some of the areas Warren notes as needing “semi-automatic” methods:
- Assign authoritativeness
- Learn ontologies
- Infer better search requests
- Mediate ontologies (semantic resolution)
- Support visualization
- Assign collaborations
- Infer relationships
- Extract entities
- Create ontologies
- Maintain and evolve ontologies
- Create taxonomies
- Infer trust
- Analyze links
- Create an ontology — use a text or graphical ontology editor to create the ontology, which is then validated. The resulting ontology can then be viewed with a browser before being published
- Disambiguate data — generate a mapping between multiple ontologies to identify where classes and properties are the same
- Expose a relational database as OWL — an editor is first used to create the ontologies that represent the database schema, then the ontologies are validated, translated to OWL and then the generated OWL is validated
- Intelligently query distributed data — distributed data sources are registered with a repository and are then able to be queried
- Manually create data from an ontology — a user would use an editor to create new OWL data based on existing ontologies, which is then validated and browsable
- Programmatically interact with OWL content — custom programs can view, create, and modify OWL content with an API
- Query non-OWL data — via an annotation tool, create OWL metadata from non-OWL content
- Visualize semantic data — view semantic data in a custom visualizer.
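The relational-to-OWL step in the list above can be sketched in miniature. This is an illustrative toy, not the pipeline of any particular tool: the table name becomes an `owl:Class` and each column a datatype property, emitted as Turtle. The table schema and `example.org` namespace are invented for the example.

```python
# Sketch: expose a relational table schema as OWL, emitted as Turtle.
# The schema below is hypothetical, not drawn from any real database.

def schema_to_owl(table, columns, ns="http://example.org/schema#"):
    """Map a table to an owl:Class and its columns to datatype properties."""
    lines = [
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        f"@prefix ex: <{ns}> .",
        "",
        f"ex:{table} a owl:Class .",
    ]
    for col in columns:
        lines += [
            f"ex:{col} a owl:DatatypeProperty ;",
            f"    rdfs:domain ex:{table} .",
        ]
    return "\n".join(lines)

print(schema_to_owl("Customer", ["name", "email", "region"]))
```

A real tool would of course also validate the generated OWL, as the workflow above notes, and map column datatypes to XSD ranges.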
With some ontologies containing hundreds of thousands to millions of triples, viewing, annotating and reconciling at scale are daunting tasks, efforts that would never be undertaken without useful tools and automation.
A Workflow Perspective Helps Frame the Challenge
A 2005 paper by Izza, Vincent and Burlat (among many other excellent ones) at the first International Conference on Interoperability of Enterprise Software and Applications (INTEROP-ESA) provides a very readable overview on the role of semantics and ontologies in enterprise integration. Besides proposing a fairly compelling unified framework, the authors also present a useful workflow perspective emphasizing Web services (WS), also applicable to semantics in general, that helps frame this challenge:
Generic Semantic Integration Workflow (adapted from Izza et al.)
For existing data and documents, the workflow begins with information extraction or annotation of semantics and metadata (#1) in accordance with a reference ontology. Newly found information via harvesting must also be integrated; however, external information or services may come bearing their own ontologies, in which case some form of semantic mediation is required.
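A toy version of that first annotation step can make it concrete. The "reference ontology" here is just a hypothetical label-to-URI lookup and the document text is invented; real annotators are far more sophisticated, but the shape of the output, triples tying a document to ontology terms, is the same.

```python
# Sketch: annotate a document by matching known ontology labels in its
# text and emitting (subject, predicate, object) triples. The tiny
# "reference ontology" is a hypothetical label-to-URI dictionary.

REFERENCE_ONTOLOGY = {
    "acme corp": "http://example.org/onto#AcmeCorp",
    "geneva": "http://example.org/onto#Geneva",
}

def annotate(doc_uri, text):
    """Return (doc, mentions, entity) triples for each label found."""
    triples = []
    lowered = text.lower()
    for label, uri in REFERENCE_ONTOLOGY.items():
        if label in lowered:
            triples.append((doc_uri, "http://example.org/onto#mentions", uri))
    return triples

for t in annotate("http://example.org/doc/42",
                  "Acme Corp opened an office in Geneva last year."):
    print(t)
```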
Of course, this is a generic workflow, and depending on the interoperation task, different flows and steps may be required. Indeed, the overall workflow can vary by perspective and researcher, with semantic resolution workflow modeling a prime area of current investigations. (As one alternative among scores, see for example Cardoso and Sheth.)
Matching and Mapping Semantic Heterogeneities
Semantic mediation is a process of matching schemas and mapping attributes and values, often with intermediate transformations (such as unit or language conversions) also required. The general problem of schema integration is not new, with one prior reference going back as early as 1986.  According to Alon Halevy:
As would be expected, people have tried building semi-automated schema-matching systems by employing a variety of heuristics. The process of reconciling semantic heterogeneity typically involves two steps. In the first, called schema matching, we find correspondences between pairs (or larger sets) of elements of the two schemas that refer to the same concepts or objects in the real world. In the second step, we build on these correspondences to create the actual schema mapping expressions.
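Halevy's two steps can be caricatured in a few lines of code. The heuristic here, normalized name similarity plus hand-written transformation expressions (including a unit conversion of the kind mentioned above), is purely illustrative; all field names and schemas are invented.

```python
# Sketch of the two-step process: (1) schema matching proposes
# correspondences between attribute names; (2) mapping expressions are
# then attached, including transformations such as unit conversions.
# All schemas and conversion rules here are hypothetical.

from difflib import SequenceMatcher

def match_schemas(source_attrs, target_attrs, threshold=0.6):
    """Step 1: propose correspondences above a name-similarity threshold."""
    pairs = []
    for s in source_attrs:
        for t in target_attrs:
            score = SequenceMatcher(None, s.lower(), t.lower()).ratio()
            if score >= threshold:
                pairs.append((s, t, round(score, 2)))
    return pairs

# Step 2: attach mapping expressions to the accepted correspondences.
MAPPINGS = {
    ("price_usd", "PriceUSD"): lambda v: v,           # identity
    ("weight_lb", "WeightKg"): lambda v: v * 0.4536,  # unit conversion
}

print(match_schemas(["price_usd", "weight_lb"],
                    ["PriceUSD", "WeightKg", "Color"]))
```

Real matchers combine many more signals (data values, structure, instance overlap), but the split between finding correspondences and expressing mappings carries over directly.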
The issues of matching and mapping have been addressed in many tools, notably commercial ones from MetaMatrix, and open source and academic projects such as Piazza, SIMILE, and the WSMX (Web service modeling execution environment) protocol from DERI. A superb description of the challenges in reconciling the vocabularies of different data sources is also found in the thesis by AnHai Doan, which won the 2003 ACM Doctoral Dissertation Award.
What all of these efforts have found is that the mediation process cannot be completely automated. The current state of the art is to reconcile automatically what is largely unambiguous, and then prompt analysts or subject matter experts to decide the questionable matches. These are known as “semi-automated” systems, and their user interfaces, data presentation and workflow become as important as the underlying matching and mapping algorithms. According to the WSMX project, there is always a trade-off between how accurate these mappings are and the degree of automation that can be offered.
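In code, that semi-automated pattern reduces to routing by confidence: auto-accept high-scoring matches, discard clear non-matches, and queue the ambiguous middle band for a human reviewer. The thresholds and attribute names below are invented for illustration, not drawn from any published system.

```python
# Sketch: triage candidate matches into auto-accepted, rejected, and
# needs-human-review buckets by similarity score. Thresholds are
# illustrative only.

from difflib import SequenceMatcher

def triage(candidates, accept_at=0.9, reject_below=0.5):
    accepted, review_queue, rejected = [], [], []
    for src, tgt in candidates:
        score = SequenceMatcher(None, src.lower(), tgt.lower()).ratio()
        if score >= accept_at:
            accepted.append((src, tgt))
        elif score < reject_below:
            rejected.append((src, tgt))
        else:
            review_queue.append((src, tgt, round(score, 2)))  # ask an expert
    return accepted, review_queue, rejected

accepted, review, rejected = triage([
    ("customer_name", "CustomerName"),  # near-certain match
    ("zip", "postal_code"),             # clearly different strings
    ("tel", "telephone"),               # ambiguous: send to a human
])
```

The interesting design work is in the review queue: how candidates are presented, in what order, and how an expert's decisions feed back into the matcher, which is exactly why interface and workflow matter as much as the algorithms.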
Also a Need for Efficient Semantic Data Stores
Once all of these reconciliations take place there is the (often undiscussed) need to index, store and retrieve these semantics and their relationships at scale, particularly for enterprise deployments. This is a topic I have addressed many times from the standpoint of scalability, more scalability, and comparisons of database and relational technologies, but it is also not a new topic in the general community.
As Stonebraker and Hellerstein note in their retrospective covering 35 years of development in databases, some of the first post-relational data models were typically called semantic data models, including those of Smith and Smith in 1977 and Hammer and McLeod in 1981. Perhaps what is different now is our ability to address some of the fundamental issues.
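One reason storage at scale is hard: a triple store must answer lookups by subject, by predicate, or by object equally well, which classically means maintaining several permuted indexes over the same data. A minimal in-memory sketch with invented example triples (real stores such as those surveyed below are vastly more elaborate, with disk layouts, joins and inferencing):

```python
# Sketch: a toy triple store with three permuted indexes (SPO, POS,
# OSP) so that any single-constant lookup avoids a full scan.

from collections import defaultdict

class TripleStore:
    def __init__(self):
        self.spo = defaultdict(list)  # subject   -> (p, o) pairs
        self.pos = defaultdict(list)  # predicate -> (o, s) pairs
        self.osp = defaultdict(list)  # object    -> (s, p) pairs

    def add(self, s, p, o):
        self.spo[s].append((p, o))
        self.pos[p].append((o, s))
        self.osp[o].append((s, p))

    def by_subject(self, s):
        return [(s, p, o) for p, o in self.spo[s]]

    def by_predicate(self, p):
        return [(s, p, o) for o, s in self.pos[p]]

store = TripleStore()
store.add("ex:Geneva", "rdf:type", "ex:City")
store.add("ex:Geneva", "ex:locatedIn", "ex:Switzerland")
store.add("ex:Rome", "rdf:type", "ex:City")
print(store.by_predicate("rdf:type"))
```

The cost is visible even in the toy: every triple is stored three times, which is precisely the kind of space-versus-retrieval trade-off that makes indexing millions of triples a serious engineering problem.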
At any rate, this subsection is included here because of the hidden importance of database foundations. It is therefore a topic often addressed in this series.
A Partial Listing of Semantic Web Tools
In all of these areas, there is a growing, but still spotty, set of tools for conducting these semantic tasks. SemWebCentral, the open source tools resource center, for example, lists many tools and whether or not they interact with one another (the general answer is often no). Protégé also has a fairly long list of plug-ins, though it is unfortunately not well organized.
In the table below, I begin to compile a partial listing of semantic Web tools, with more than 50 listed. Though a few are commercial, most are open source. Also, for the open source tools, only the most prominent ones are listed (Sourceforge, for example, has about 200 projects listed with some relation to the semantic Web, though most are minor or not yet in alpha release).
|Tool||URL||Description|
|Almo||http://ontoware.org/projects/almo||An ontology-based workflow engine in Java|
|Altova SemanticWorks||http://www.altova.com/products_semanticworks.html||Visual RDF and OWL editor that auto-generates RDF/XML or nTriples based on visual ontology design|
|Bibster||http://bibster.semanticweb.org/||A semantics-based bibliographic peer-to-peer system|
|cwm||http://www.w3.org/2000/10/swap/doc/cwm.html||A general purpose data processor for the semantic Web|
|Deep Query Manager||http://www.brightplanet.com/products/dqm_overview.asp||Search federator from deep Web sources|
|DOSE||https://sourceforge.net/projects/dose||A distributed platform for semantic annotation|
|ekoss.org||http://www.ekoss.org/||A collaborative knowledge sharing environment where model developers can submit advertisements|
|Endeca||http://www.endeca.com||Facet-based content organizer and search platform|
|FOAM||http://ontoware.org/projects/map||Framework for ontology alignment and mapping|
|Gnowsis||http://www.gnowsis.org/||A semantic desktop environment|
|GrOWL||http://ecoinformatics.uvm.edu/technologies/growl-knowledge-modeler.html||Open source graphical ontology browser and editor|
|HAWK||http://swat.cse.lehigh.edu/projects/index.html#hawk||OWL repository framework and toolkit|
|HELENOS||http://ontoware.org/projects/artemis||A Knowledge discovery workbench for the semantic Web|
|Jambalaya||http://www.thechiselgroup.org/jambalaya||Protégé plug-in for visualizing ontologies|
|Jastor||http://jastor.sourceforge.net/||Open source Java code generator that emits Java Beans from ontologies|
|Jena||http://jena.sourceforge.net/||Open source ontology API written in Java|
|KAON||http://kaon.semanticweb.org/||Open source ontology management infrastructure|
|Kazuki||http://projects.semwebcentral.org/projects/kazuki/||Generates a java API for working with OWL instance data directly from a set of OWL ontologies|
|Kowari||http://www.kowari.org/||Open source database for RDF and OWL|
|LuMriX||http://www.lumrix.net/xmlsearch.php||A commercial search engine using semantic Web technologies|
|MetaMatrix||http://www.metamatrix.com/||Semantic vocabulary mediation and other tools|
|Metatomix||http://www.metatomix.com/||Commercial semantic toolkits and editors|
|MindRaider||http://mindraider.sourceforge.net/index.html||Open source semantic Web outline editor|
|Model Futures OWL Editor||http://www.modelfutures.com/OwlEditor.html||Simple OWL tools, featuring UML (XMI), ErWin, thesaurus and imports|
|Net OWL||http://www.netowl.com/||Entity extraction engine from SRA International|
|Nokia Semantic Web Server||https://sourceforge.net/projects/sws-uriqa||An RDF based knowledge portal for publishing both authoritative and third party descriptions of URI denoted resources|
|OntoEdit/OntoStudio||http://ontoedit.com/||Engineering environment for ontologies|
|OntoMat Annotizer||http://annotation.semanticweb.org/ontomat||Interactive Web page OWL and semantic annotator tool|
|Oyster||http://ontoware.org/projects/oyster||Peer-to-peer system for storing and sharing ontology metadata|
|Piggy Bank||http://simile.mit.edu/piggy-bank/||A Firefox-based semantic Web browser|
|Pike||http://pike.ida.liu.se/||A dynamic programming (scripting) language similar to Java and C for the semantic Web|
|pOWL||http://powl.sourceforge.net/index.php||Semantic Web development platform|
|Protégé||http://protege.stanford.edu/||Open source visual ontology editor written in Java with many plug-in tools|
|RACER Project||https://sourceforge.net/projects/racerproject||A collection of Projects and Tools to be used with the semantic reasoning engine RacerPro|
|RDFReactor||http://rdfreactor.ontoware.org/||Access RDF from Java using inferencing|
|Redland||http://librdf.org/||Open source software libraries supporting RDF|
|RelationalOWL||https://sourceforge.net/projects/relational-owl||Automatically extracts the semantics of virtually any relational database and transforms this information into RDF/OWL|
|Semantical||http://semantical.org/||Open source semantic Web search engine|
|SemanticWorks||http://www.altova.com/products_semanticworks.html||SemanticWorks RDF/OWL Editor|
|Semantic Mediawiki||https://sourceforge.net/projects/semediawiki||Semantic extension to the MediaWiki wiki|
|Semantic Net Generator||https://sourceforge.net/projects/semantag||Utility for generating topic maps automatically|
|Sesame||http://www.openrdf.org/||An open source RDF database with support for RDF Schema inferencing and querying|
|SMART||http://web.ict.nsc.ru/smart/index.phtml?lang=en||System for Managing Applications based on RDF Technology|
|SMORE||http://www.mindswap.org/2005/SMORE/||OWL markup for HTML pages|
|SPARQL||http://www.w3.org/TR/rdf-sparql-query/||Query language for RDF|
|SWCLOS||http://iswc2004.semanticweb.org/demos/32/||A semantic Web processor using Lisp|
|Swoogle||http://swoogle.umbc.edu/||A semantic Web search engine with 1.5 M resources|
|SWOOP||http://www.mindswap.org/2004/SWOOP/||A lightweight ontology editor|
|Turtle||http://www.ilrt.bris.ac.uk/discovery/2004/01/turtle/||Terse RDF “Triple” language|
|WSMO Studio||https://sourceforge.net/projects/wsmostudio||A semantic Web service editor compliant with WSMO as a set of Eclipse plug-ins|
|WSMT Toolkit||https://sourceforge.net/projects/wsmt||The Web Service Modeling Toolkit (WSMT) is a collection of tools for use with the Web Service Modeling Ontology (WSMO), the Web Service Modeling Language (WSML) and the Web Service Execution Environment (WSMX)|
|WSMX||https://sourceforge.net/projects/wsmx/||Execution environment for dynamic use of semantic Web services|
Tools Still Crude, Integration Not Compelling
Individually, there are some impressive and capable tools on this list. Generally, however, the interfaces are not intuitive, integration between tools is lacking, and there is little guidance on why and how standard analysts should embrace them. In the semantic Web, we have yet to see an application of the magnitude of the first Mosaic browser, which made HTML and the World Wide Web compelling.
Perhaps a similar “killer app” will never be forthcoming for the semantic Web. But it is important to remember just how entwined tools are with accelerating the acceptance and growth of new standards and protocols.
|NOTE: This posting is part of an occasional series looking at a new category that I and BrightPlanet are terming the eXtensible Semantic Data Model (XSDM). Topics in this series cover all information related to extensible data models and engines applicable to documents, metadata, attributes, semi-structured data, or the processing, storing and indexing of XML, RDF, OWL, or SKOS data. A major white paper will be produced at the conclusion of the series.|
 Paul Warren, “Knowledge Management and the Semantic Web: From Scenario to Technology,” IEEE Intelligent Systems, vol. 21, no. 1, 2006, pp. 53-59. See http://dsonline.computer.org/portal/site/dsonline/menuitem.9ed3d9924aeb0dcd82ccc6716bbe36ec/index.jsp?&pName=dso_level1&path=dsonline/2006/02&file=x1war.xml&xsl=article.xsl&
 Said Izza, Lucien Vincent and Patrick Burlat, “A Unified Framework for Enterprise Integration: An Ontology-Driven Service-Oriented Approach,” pp. 78-89, in Pre-proceedings of the First International Conference on Interoperability of Enterprise Software and Applications (INTEROP-ESA’2005), Geneva, Switzerland, February 23 – 25, 2005, 618 pp. See http://interop-esa05.unige.ch/INTEROP/Proceedings/Interop-ESAScientific/OneFile/InteropESAproceedings.pdf.
 Jorge Cardoso and Amit Sheth, “Semantic Web Processes: Semantics Enabled Annotation, Discovery, Composition and Orchestration of Web Scale Processes,” in the 4th International Conference on Web Information Systems Engineering (WISE 2003), December 10-12, 2003, Rome, Italy. See http://lsdis.cs.uga.edu/lib/presentations/WISE2003-Tutorial.pdf.
 Alon Halevy, “Why Your Data Won’t Mix,” ACM Queue vol. 3, no. 8, October 2005. See http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=336.
 Chuck Moser, Semantic Interoperability: Automatically Resolving Vocabularies, presented at the 4th Semantic Interoperability Conference, February 10, 2006. See http://colab.cim3.net/file/work/SICoP/2006-02-09/Presentations/CMosher02102006.ppt.
 Alon Y. Halevy, Zachary G. Ives, Peter Mork and Igor Tatarinov, “Piazza: Data Management Infrastructure for Semantic Web Applications,” Journal of Web Semantics, Vol. 1 No. 2, February 2004, pp. 155-175. See http://www.cis.upenn.edu/~zives/research/piazza-www03.pdf.
 Stefano Mazzocchi, Stephen Garland, Ryan Lee, “SIMILE: Practical Metadata for the Semantic Web,” January 26, 2005. See http://www.xml.com/pub/a/2005/01/26/simile.html.
 Adrian Mocan, Ed., “WSMX Data Mediation,” in WSMX Working Draft, W3C Organization, 11 October 2005. See http://www.wsmo.org/TR/d13/d13.3/v0.2/20051011.
 J. Madhavan, P. A. Bernstein, P. Domingos and A. Y. Halevy, “Representing and Reasoning About Mappings Between Domain Models,” in the Eighteenth National Conference on Artificial Intelligence, pp. 80-86, Edmonton, Alberta, Canada, July 28-August 1, 2002.
 AnHai Doan, Learning to Map between Structured Representations of Data, Ph.D. Thesis to the Computer Science & Engineering Department, University of Washington, 2002, 133 pp. See http://anhai.cs.uiuc.edu/home/thesis/anhai-thesis.pdf.
 Michael Stonebraker and Joey Hellerstein, “What Goes Around Comes Around,” in Joseph M. Hellerstein and Michael Stonebraker, editors, Readings in Database Systems, Fourth Edition, pp. 2-41, The MIT Press, Cambridge, MA, 2005. See http://mitpress.mit.edu/books/chapters/0262693143chapm1.pdf.