Posted: June 18, 2006

This is the last entry in our recent series on data federation. This post compares interoperability models and concludes with a new approach for promoting enterprise interoperability and innovation.

We are now about to conclude this mini-series on data federation. In earlier posts, I described the significant progress in climbing the data federation pyramid, today’s evolution in emphasis to the semantic Web, the 40 or so sources of semantic heterogeneity, and tools and methods for processing and mediating semantic information. We now conclude with a comparison of implementation models for semantic interoperability.

Guido Vetere, an IBM research scientist and one of the clearest writers on this subject, has said:

Despite the increasing availability of semantic-oriented standards and technologies, the problem of dealing with semantics in Web-based cooperation, taken in its generality, is very far from trivial, not only for practical reasons, but also because it involves deep and controversial philosophical aspects. Nevertheless, for relatively small communities dealing with well-founded disciplines such as biology, concrete solutions can be effectively put in place. In fact, most of the data structures will represent commonly understood natural kinds (e.g. microorganisms), well-studied processes (e.g. syntheses) and so on. Still, significant differences in the way actual data structures are used to represent these concepts might require complex mappings and transformations. [1]

Yet there are easier ways and harder ways to achieve interoperability. Paul Warren, whose work is frequently cited in this series, observes:[2]

I believe there will be deep semantic interoperability within organizational intranets. This is already the focus of practical implementations, such as the SEKT (Semantically Enabled Knowledge Technologies) project, and across interworking organizations, such as supply chain consortia. In the global Web, semantic interoperability will be more limited.

I agree with this observation. But why might this be the case? Before we conclude why some models for semantic interoperation may work better and earlier than others, let’s first step back and look at four model paradigms.

Four Paradigms of Semantic Interoperability

In late 2005, G. Vetere and M. Lenzerini (V&L) published an excellent synthesis description of “Models for Semantic Interoperability in Service Oriented Architectures” in the IBM Systems Journal. [3] Though obviously geared to an SOA perspective, the observations ring true for any semantic-oriented architecture. These prototypical patterns provide a solid basis for discussing the pros and cons of various implementation approaches.

Any-to-Any Centralized

The ‘Any-to-Any Centralized’ model (what Vetere calls unmodeled-centralized in the first-cited paper [1]) is “tightly-coupled” in that the mapping requires a semantic negotiation or understanding among the integrated, central parties. The integrated pieces (services, in this instance) are usually atomic, independent and self-contained. The integration takes place within a single instantiation and is not generalized.

V&L diagrammed this model as follows:

Any-to-Any Centralized Semantic Interoperability Model (reprinted from [3])

It is clear in this model that there is no ready “swap out” of components. Bilateral agreements are needed between all integration components, and semantic errors can only be detected at runtime.

This model is really the traditional one used by system integrators (SIs). It is a one-off, high level-of-effort, non-generalized and non-repeatable model of integration that only works in closed environments. No ontology is involved. It is the most costly and fragile of the interoperability models.

Any-to-One Centralized

In V&L’s ‘Any-to-One Centralized’ model (also known as modeled-centralized), while it may not be explicit, there is a single “ontology” that is a superset of all contributing systems. This “ontology” framework may often take the form of an enterprise service bus, whose internal protocols provide the unifying ontology.

This interoperability model, too, is quite costly in that all suppliers (or service providers) must conform to the single ontology.

It is often remarked that the number of mappings required for the entire system is significantly reduced in any-to-one models, decreasing (in the limit) from N × (N − 1) to N, where N is the number of services involved. However, the reduction in the number of mappings is not the striking difference here. The real difference is in the existence of a business model.
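To make that reduction concrete, here is a quick sketch of the mapping arithmetic (the service counts are purely illustrative):

```python
def mapping_counts(n):
    """Return (any_to_any, any_to_one) mapping counts for n services.

    In the any-to-any case every ordered pair of services needs its own
    bilateral mapping: n * (n - 1). In the any-to-one case each service
    maps only once, to the shared model: n.
    """
    return n * (n - 1), n

# The gap widens quadratically as services are added.
for n in (5, 10, 50):
    any_to_any, any_to_one = mapping_counts(n)
    print(f"{n} services: {any_to_any} bilateral mappings vs {any_to_one}")
```

At 50 services the difference is 2,450 mappings versus 50 — which is why, as noted above, the mapping count alone is not the interesting part; the business model is.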

V&L diagrammed this model as follows:

Any-to-One Centralized Semantic Interoperability Model (reprinted from [3])

The extensibility of the business model is therefore the key to the success of this interoperability pattern. Generally, in this case, the enterprise makes the determination of what components to interoperate with and conducts the mapping.

Any-to-Any Decentralized

The normal condition in ‘loosely-coupled’ environments such as the Web in general is called the ‘Any-to-Any Decentralized’ model (also known as unmodeled-decentralized). In this model, the integration logic is distributed, and there are no shared ontologies. This is a peer-to-peer system, sometimes known as a P2P information integration system or ‘emergent semantics.’

The semantics are distributed in systems that are strongly isolated from one another, and though grids can help with the transaction aspects, much repeated effort occurs across the interoperating components. V&L diagrammed this model as follows:

Any-to-Any Decentralized Semantic Interoperability Model (reprinted from [3])

It is thus the responsibility of each party to perform the mapping to any other party with which it desires to interoperate. The lack of a central model greatly increases the effort needed for broad interoperability at the system level.

Any-to-One Decentralized

One way to decrease the effort required is to adopt the ‘Any-to-One Decentralized’ model (also known as modeled-decentralized), wherein an ontology model provides the mapping guidance. (Note there may need to be multiple layers of ontologies, from shared “upper level” ones to those that are domain specific.)

In this model, the integration logic is distributed across the service or component implementations, based on a shared ontology. It is this model that is generally referred to as the semantic Web approach. (In Web services, this is accomplished via the multiple WS* web service protocols.)

According to V&L:

. . . having business models specified in a sound and rich ontology language and having them based on a suitable foundational layer reduces the risk of misinterpretations injected by agents when mapping their own conceptualizations to such models. A suitable coverage of primitive concepts with respect to the business domain (completeness) that entails the possibility to progressively enhance conceptual schemas (extensibility) are also key factors of success in the adoption of shared ontologies within service-oriented infrastructures. But still there is the risk of inaccuracy, misunderstandings, errors, approximations, lack of knowledge, or even malicious intent that need to be considered, especially in uncontrolled, loosely coupled environments.

V&L then diagrammed this model as follows:

Any-to-One Decentralized Semantic Interoperability Model (reprinted from [3])

In this model, the ontology (or multiple ontologies) needs to comprehensively include the semantics and concepts of all participating components. Thus, while the components are “decentralized,” the ontology must somehow be “centralized” in that it needs to expand and grow in a coordinated way.

A collaboration environment similar to, say, Wikipedia could accomplish this task, though issues of authoritativeness and quality would arise, as they have for Wikipedia.

Alternative Approaches to Semantic Web Ontologies

There are a number of ways to provide a “centralized” ontology view for this semantic Web collaboration environment. Xiaomeng Su provides three approaches for thinking about this problem:[4]

Approaches to Semantic Web Ontologies (modified from [4])

The simplest, but least tractable, approach given the anarchic nature of the Web is to adopt a single ontology (this is what is implied in the ‘Any-to-One Decentralized’ interoperability model above). A more realistic approach is where there are multiple world views or ontologies, shown in the center of the diagram. This approach recognizes that different parties will have different world views, the reconciliation of which requires semantic mediation (see the previous post in this series). Finally, in order to help minimize the effort of mediation, some shared general ontologies may be adopted. This hybrid approach can also rely on so-called upper-level ontologies such as SUMO (Suggested Upper Merged Ontology) or the Dublin Core. While semantic mediation is still required between the local ontologies, the effort is somewhat lessened.
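As a toy sketch of this hybrid approach, the snippet below routes translations between two local vocabularies through a shared upper layer. Every term and concept name here is invented for illustration; real systems would use ontology languages and richer alignments, not flat dictionaries:

```python
# Two local ontologies each align their own terms to a shared upper-level
# concept, so each peer needs only one mapping (to the shared layer)
# rather than pairwise mappings to every other peer.

UPPER = {"Person", "Organization", "Document"}  # shared upper-level concepts

# Each local vocabulary declares how its terms align to the upper layer.
LOCAL_A = {"Employee": "Person", "Firm": "Organization", "Report": "Document"}
LOCAL_B = {"Staffer": "Person", "Company": "Organization", "Memo": "Document"}

def translate(term, source, target):
    """Translate a term from one local vocabulary to another via the
    shared upper concept; returns None when no alignment exists."""
    concept = source.get(term)
    if concept not in UPPER:
        return None
    # Invert the target mapping to find its local name for the concept.
    for local_term, upper_term in target.items():
        if upper_term == concept:
            return local_term
    return None

print(translate("Employee", LOCAL_A, LOCAL_B))  # -> Staffer
```

Note that the mediation effort has not disappeared: someone still has to author the LOCAL_A and LOCAL_B alignments, which is exactly where the semi-automated tools of the previous post come in.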

The Best Model Depends on Circumstance

These models illustrate trade-offs and considerations depending on the circumstance where interoperability is an imperative.

For the semantic Web, which is the most difficult environment given the lack of coordination possible between contributing parties, the best model appears to be the ‘Any-to-One Decentralized’ model with a hybrid approach to the ontology model. Besides the need for ontologies to mature, the means for semantic mediation and the tools to help automate the tagging and mediation process also need to mature significantly. Though isolated pockets of success will appear sooner, broad interoperability on the global Web awaits these developments.

Comprehensive Interfaces: A Hybrid Model for Enterprise Interoperability

I have argued in previous posts that enterprises are likely to be the place where semantic interoperability first proves itself. This is because, as we have seen above, centralized models are a simpler design and easier to implement, and because enterprises can provide the economic incentive for contributing players to conform to this model.

So, given the model discussions above, how might this best work?

First, by definition, we have a “centralized” model in that the enterprise is able to call the shots. Second, we do want a “One” model wherein a single ontology governs the semantics. This means we can eliminate the requirements and tools needed to mediate semantic heterogeneities.

On the other hand, we also want a “loosely-coupled” system in that we don’t want a central command-and-control system that requires upfront decisions as to which components can interoperate.

In other words, how can we gain the advantages of a free market for new tools and components at minimum cost and technical difficulty?

The key to answering this seeming dilemma is to “fix” the weaknesses of the ‘Any-to-One Centralized’ model while retaining its strengths. This hybrid shift is shown by this diagram:

Shifting Interoperability to the Supplier (modified from [3])

The diagram indicates that the enterprise adopts a single ontology model, but exposes its interoperability framework through a complete “recipe book” for interoperating, which external parties may embrace (the green arrows).

The idea is to expand beyond single concepts such as APIs (application programming interfaces), ontologies, service buses, and broker and conversion utilities to a complete set of community standards, specifications and conversion utilities that enables any outside party to interoperate with the enterprise’s central model. The interface thus becomes the comprehensive objective, comprehensively defined.

By definition, then, this type of interoperability is “loosely coupled” in that the third (external) party can effect the integration without any assistance or guidance (other than the standard “recipe book”) from the central authority. The central system thus becomes a “black box” as traditionally defined. This means that any aggressive potential supplier can adapt its components to the interface in order to convince the enterprise to buy its wares or services.
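To make the idea tangible, here is a hypothetical, machine-readable fragment of such a “recipe book” with a self-serve conformance check. All field names, units and rules below are invented for illustration:

```python
# The enterprise publishes the fields, types and vocabularies it expects;
# any would-be supplier can then check its own payload for conformance
# without any assistance from the central authority (the "black box").

RECIPE_BOOK = {
    "required_fields": {"part_number": str, "quantity": int, "unit": str},
    "allowed_units": {"each", "kg", "liter"},
}

def conforms(payload, recipe=RECIPE_BOOK):
    """Return a list of conformance errors (an empty list means acceptable)."""
    errors = []
    for field, ftype in recipe["required_fields"].items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], ftype):
            errors.append(f"wrong type for {field}")
    if payload.get("unit") not in recipe["allowed_units"]:
        errors.append("unit not in allowed vocabulary")
    return errors

# A supplier validates on its own side before ever approaching the enterprise.
print(conforms({"part_number": "A-100", "quantity": 5, "unit": "each"}))  # -> []
```

The point of the sketch is the direction of effort: the central authority publishes the spec once, and each aggressive supplier bears the cost of conforming to it.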

This design can suffer the weakness of the potential inefficiencies that result from loosely-coupled integration. However, if the new component proves itself and fits the bill, the central enterprise authority always has the option to go to more efficient, tightly-coupled integration with that third party to overcome any performance bottlenecks.

It should thus be possible for enterprises (central authorities) to both write these comprehensive “recipe books” and to establish “proof-of-concept” interoperability labs where any potential vendor can link in and prove its stuff. This design shifts the cost of overcoming barriers to entry to the potential supplier. If that supplier believes its offerings to be superior, it can incur the time and effort of coding to the interface and then demonstrating its superiority.

There are very exciting prospects in such an entirely new procurement and adoption model that I’ll be discussing further in future postings.

A key to such a design, of course, is comprehensive and easily implemented interfaces. One comprehensive approach to a similar design is provided by Izza et al.[5] The really cool thing about this new design is that today’s new standards and protocols provide easy means for third parties to comply. This new design completely overcomes the limitations of prior proprietary approaches for enterprises involving high-cost ETL (extract, transform, load) or their later enterprise service bus (ESB) cousins.

NOTE: This posting is part of an occasional series looking at a new category that I and BrightPlanet are terming the eXtensible Semantic Data Model (XSDM). Topics in this series cover all information related to extensible data models and engines applicable to documents, metadata, attributes, semi-structured data, or the processing, storing and indexing of XML, RDF, OWL, or SKOS data. A major white paper will be produced at the conclusion of the series.

[1] G. Vetere, “Semantics in Data Integration Processes,” presented at NETTAB 2005, Napoli, October 4-7, 2005. See

[2] Paul Warren, “Knowledge Management and the Semantic Web: From Scenario to Technology,” IEEE Intelligent Systems, vol. 21, no. 1, 2006, pp. 53-59. See

[3] G. Vetere and M. Lenzerini, “Models for Semantic Interoperability in Service Oriented Architectures,” IBM Systems Journal, Vol. 44, No. 4, 2005, pp. 887-904. See

[4] Xiaomeng Su, “A Text Categorization Perspective for Ontology Mapping,” a position paper. See

[5] Saïd Izza, Lucien Vincent and Patrick Burlat, “A Unified Framework for Enterprise Integration: An Ontology-Driven Service-Oriented Approach,” pp. 78-89, in Pre-proceedings of the First International Conference on Interoperability of Enterprise Software and Applications (INTEROP-ESA’2005), Geneva, Switzerland, February 23 – 25, 2005, 618 pp. See

Posted by AI3's author, Mike Bergman Posted on June 18, 2006 at 2:07 pm in Semantic Web | Comments (0)
Posted: June 12, 2006

Mediating semantic heterogeneities requires tools and automation (or semi-automation) at scale. But existing tools are still crude and lack across-the-board integration. This is one of the next challenges in getting more widespread acceptance of the semantic Web.

In earlier posts, I described the significant progress in climbing the data federation pyramid, today’s evolution in emphasis to the semantic Web, and the 40 or so sources of semantic heterogeneity. We now transition to an overview of how one goes about providing these semantics and resolving these heterogeneities.

Why the Need for Tools and Automation?

In an excellent recent overview of semantic Web progress, Paul Warren points out:[1]

Although knowledge workers no doubt believe in the value of annotating their documents, the pressure to create metadata isn’t present. In fact, the pressure of time will work in a counter direction. Annotation’s benefits accrue to other workers; the knowledge creator only benefits if a community of knowledge workers abides by the same rules. . . . Developing semiautomatic tools for learning ontologies and extracting metadata is a key research area . . . . Having to move out of a user’s typical working environment to ‘do knowledge management’ will act as a disincentive, whether the user is creating or retrieving knowledge.

Of course, even assuming that ontologies are created and semantics and metadata are added to content, there still remain the nasty problems of resolving heterogeneities (semantic mediation) and efficiently storing and retrieving the metadata and semantic relationships.

Putting all of this process in place requires infrastructure in the form of tools and automation, as well as proper incentives and rewards for users and suppliers to conform to it.

Areas Requiring Tools and Automation

In his paper, Warren repeatedly points to the need for “semi-automatic” methods to make the semantic Web a reality. He makes fully a dozen such references, in addition to multiple references to the need for “reasoning algorithms.” In any case, here are some of the areas noted by Warren needing “semi-automatic” methods:

  • Assign authoritativeness
  • Learn ontologies
  • Infer better search requests
  • Mediate ontologies (semantic resolution)
  • Support visualization
  • Assign collaborations
  • Infer relationships
  • Extract entities
  • Create ontologies
  • Maintain and evolve ontologies
  • Create taxonomies
  • Infer trust
  • Analyze links
  • etc.

In a different vein, SemWebCentral lists these clusters of semantic Web-related tasks, each of which also requires tools:[2]

  • Create an ontology — use a text or graphical ontology editor to create the ontology, which is then validated. The resulting ontology can then be viewed with a browser before being published
  • Disambiguate data – generate a mapping between multiple ontologies to identify where classes and properties are the same
  • Expose a relational database as OWL — an editor is first used to create the ontologies that represent the database schema, then the ontologies are validated, translated to OWL and then the generated OWL is validated
  • Intelligently query distributed data – distributed data sources are registered in a repository and are again able to be queried
  • Manually create data from an ontology — a user would use an editor to create new OWL data based on existing ontologies, which is then validated and browsable
  • Programmatically interact with OWL content — custom programs can view, create, and modify OWL content with an API
  • Query non-OWL data — via an annotation tool, create OWL metadata from non-OWL content
  • Visualize semantic data — view semantic data in a custom visualizer.
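To give a sense of what the “expose a relational database as OWL” task involves, the sketch below maps an invented two-table schema to skeletal OWL declarations in Turtle syntax (tables become classes, columns become datatype properties). Real tools also handle keys, foreign-key relations (as object properties) and datatype mapping:

```python
# Minimal relational-schema-to-OWL sketch. The table and column names
# are invented; output is Turtle-style triples.

SCHEMA = {
    "customer": ["name", "email"],
    "order": ["order_date", "total"],
}

def schema_to_owl(schema, base="http://example.org/vocab#"):
    lines = [
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        f"@prefix : <{base}> .",
        "",
    ]
    for table, columns in schema.items():
        # Each table becomes an OWL class...
        lines.append(f":{table} a owl:Class .")
        for col in columns:
            # ...and each column a datatype property anchored to that class.
            lines.append(f":{col} a owl:DatatypeProperty ; rdfs:domain :{table} .")
    return "\n".join(lines)

print(schema_to_owl(SCHEMA))
```

Even this toy shows why the SemWebCentral cluster includes validation steps: the generated OWL still has to be checked and, usually, hand-refined.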

With some ontologies approaching tens of thousands to millions of triples, viewing, annotating and reconciling at scale can be daunting tasks, the efforts behind which would never be undertaken without useful tools and automation.

A Workflow Perspective Helps Frame the Challenge

A 2005 paper by Izza, Vincent and Burlat (among many other excellent ones) at the first International Conference on Interoperability of Enterprise Software and Applications (INTEROP-ESA) provides a very readable overview on the role of semantics and ontologies in enterprise integration.[3] Besides proposing a fairly compelling unified framework, the authors also present a useful workflow perspective emphasizing Web services (WS), also applicable to semantics in general, that helps frame this challenge:

Generic Semantic Integration Workflow (adapted from [3])

For existing data and documents, the workflow begins with information extraction or annotation of semantics and metadata (#1) in accordance with a reference ontology. Newly found information via harvesting must also be integrated; however, external information or services may come bearing their own ontologies, in which case some form of semantic mediation is required.

Of course, this is a generic workflow, and depending on the interoperation task, different flows and steps may be required. Indeed, the overall workflow can vary by perspective and researcher, with semantic resolution workflow modeling a prime area of current investigations. (As one alternative among scores, see for example Cardoso and Sheth.[4])

Matching and Mapping Semantic Heterogeneities

Semantic mediation is a process of matching schemas and mapping attributes and values, often with intermediate transformations (such as unit or language conversions) also required. The general problem of schema integration is not new, with one prior reference going back as early as 1986. [5] According to Alon Halevy:[6]

As would be expected, people have tried building semi-automated schema-matching systems by employing a variety of heuristics. The process of reconciling semantic heterogeneity typically involves two steps. In the first, called schema matching, we find correspondences between pairs (or larger sets) of elements of the two schemas that refer to the same concepts or objects in the real world. In the second step, we build on these correspondences to create the actual schema mapping expressions.

The issues of matching and mapping have been addressed in many tools, notably commercial ones from MetaMatrix,[7] and open source and academic projects such as Piazza,[8] SIMILE,[9] and the WSMX (Web service modeling execution environment) protocol from DERI.[10] [11] A superb description of the challenges in reconciling the vocabularies of different data sources is also found in the thesis by Dr. AnHai Doan, which won the ACM’s prestigious 2003 Doctoral Dissertation Award.[12]

What all of these efforts have found is the inability to completely automate the mediation process. The current state of the art is to reconcile automatically what is largely unambiguous, and then prompt analysts or subject-matter experts to decide the questionable matches. These are known as “semi-automated” systems, and the user interface, data presentation and workflow become as important as the underlying matching and mapping algorithms. According to the WSMX project, there is always a trade-off between how accurate these mappings are and the degree of automation that can be offered.
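Halevy’s two steps can be illustrated with a deliberately naive matcher: string similarity proposes correspondences (step one, schema matching), high-confidence pairs are accepted automatically, and borderline pairs are queued for an analyst — the “semi-automated” pattern described above. Step two (building the actual mapping expressions) would follow from the accepted pairs. The field names and the 0.6/0.85 thresholds below are invented; real matchers draw on many more heuristics (datatypes, instance data, structure):

```python
import difflib

# Two invented schemas to be reconciled.
SOURCE = ["cust_name", "cust_email", "order_total"]
TARGET = ["customer_name", "email_address", "total_amount", "ship_date"]

def propose_matches(source, target, auto=0.85, review=0.6):
    """Propose correspondences between two schemas by name similarity."""
    auto_accepted, needs_review = [], []
    for s in source:
        # Pick the most similar target element for each source element.
        best = max(target, key=lambda t: difflib.SequenceMatcher(None, s, t).ratio())
        score = difflib.SequenceMatcher(None, s, best).ratio()
        if score >= auto:
            auto_accepted.append((s, best, round(score, 2)))
        elif score >= review:
            # The semi-automatic part: defer uncertain pairs to an analyst.
            needs_review.append((s, best, round(score, 2)))
    return auto_accepted, needs_review

accepted, review = propose_matches(SOURCE, TARGET)
print("auto-accepted:", accepted)
print("for analyst review:", review)
```

Even on this tiny example, name similarity alone leaves most pairs unresolved, which is precisely why the presentation and workflow around the analyst’s review queue matter so much.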

Also a Need for Efficient Semantic Data Stores

Once all of these reconciliations take place there is the (often undiscussed) need to index, store and retrieve these semantics and their relationships at scale, particularly for enterprise deployments. This is a topic I have addressed many times from the standpoint of scalability, more scalability, and comparisons of database and relational technologies, but it is also not a new topic in the general community.

As Stonebraker and Hellerstein note in their retrospective covering 35 years of development in databases,[13] some of the first post-relational data models were typically called semantic data models, including those of Smith and Smith in 1977[14] and Hammer and McLeod in 1981.[15] Perhaps what is different now is our ability to address some of the fundamental issues.

At any rate, this subsection is included here because of the hidden importance of database foundations. It is therefore a topic often addressed in this series.

A Partial Listing of Semantic Web Tools

In all of these areas, there is a growing, but still spotty, set of tools for conducting these semantic tasks. SemWebCentral, the open source tools resource center, for example, lists many tools and whether they interact with one another (the general answer is often no).[16] Protégé also has a fairly long list of plugins, though unfortunately not well organized.[17]

In the table below, I begin to compile a partial listing of semantic Web tools, with more than 50 listed. Though a few are commercial, most are open source. Also, for the open source tools, only the most prominent ones are listed (Sourceforge, for example, has about 200 projects listed with some relation to the semantic Web, though most are minor or not yet in alpha release).




Almo An ontology-based workflow engine in Java
Altova SemanticWorks Visual RDF and OWL editor that auto-generates RDF/XML or nTriples based on visual ontology design
Bibster A semantics-based bibliographic peer-to-peer system
cwm A general purpose data processor for the semantic Web
Deep Query Manager Search federator from deep Web sources
DOSE A distributed platform for semantic annotation
(unnamed entry) A collaborative knowledge sharing environment where model developers can submit advertisements
Endeca Facet-based content organizer and search platform
FOAM Framework for ontology alignment and mapping
Gnowsis A semantic desktop environment
GrOWL Open source graphical ontology browser and editor
HAWK OWL repository framework and toolkit
HELENOS A Knowledge discovery workbench for the semantic Web
Jambalaya Protégé plug-in for visualizing ontologies
Jastor Open source Java code generator that emits Java Beans from ontologies
Jena Open source ontology API written in Java
KAON Open source ontology management infrastructure
Kazuki Generates a java API for working with OWL instance data directly from a set of OWL ontologies
Kowari Open source database for RDF and OWL
LuMriX A commercial search engine using semantic Web technologies
MetaMatrix Semantic vocabulary mediation and other tools
Metatomix Commercial semantic toolkits and editors
MindRaider Open source semantic Web outline editor
Model Futures OWL Editor Simple OWL tools, featuring UML (XMI), ErWin, thesaurus and imports
NetOwl Entity extraction engine from SRA International
Nokia Semantic Web Server An RDF based knowledge portal for publishing both authoritative and third party descriptions of URI denoted resources
OntoEdit/OntoStudio Engineering environment for ontologies
OntoMat Annotizer Interactive Web page OWL and semantic annotator tool
Oyster Peer-to-peer system for storing and sharing ontology metadata
Piggy Bank A Firefox-based semantic Web browser
Pike A dynamic programming (scripting) language similar to Java and C for the semantic Web
pOWL Semantic Web development platform
Protégé Open source visual ontology editor written in Java with many plug-in tools
RACER Project A collection of Projects and Tools to be used with the semantic reasoning engine RacerPro
RDFReactor Access RDF from Java using inferencing
Redland Open source software libraries supporting RDF
RelationalOWL Automatically extracts the semantics of virtually any relational database and transforms this information automatically into RDF/OWL
Semantical Open source semantic Web search engine
Semantic Mediawiki Semantic extension to the MediaWiki wiki
Semantic Net Generator Utility for generating topic maps automatically
Sesame An open source RDF database with support for RDF Schema inferencing and querying
SMART System for Managing Applications based on RDF Technology
SMORE OWL markup for HTML pages
SPARQL Query language for RDF
SWCLOS A semantic Web processor using Lisp
Swoogle A semantic Web search engine with 1.5 M resources
SWOOP A lightweight ontology editor
Turtle Terse RDF “Triple” language
WSMO Studio A semantic Web service editor compliant with WSMO as a set of Eclipse plug-ins
WSMT Toolkit The Web Service Modeling Toolkit (WSMT) is a collection of tools for use with the Web Service Modeling Ontology (WSMO), the Web Service Modeling Language (WSML) and the Web Service Execution Environment (WSMX)
WSMX Execution environment for dynamic use of semantic Web services

Tools Still Crude, Integration Not Compelling

Individually, there are some impressive and capable tools on this list. Generally, however, the interfaces are not intuitive, integration between tools is lacking, and the case for why and how everyday analysts should embrace them has not been made. In the semantic Web, we have yet to see an application of the magnitude of the first Mosaic browser that made HTML and the World Wide Web compelling.

It is perhaps likely that a similar “killer app” may not be forthcoming for the semantic Web. But it is important to remember just how entwined tools are to accelerating acceptance and growth of new standards and protocols.


[1] Paul Warren, “Knowledge Management and the Semantic Web: From Scenario to Technology,” IEEE Intelligent Systems, vol. 21, no. 1, 2006, pp. 53-59. See

[2] See

[3] Said Izza, Lucien Vincent and Patrick Burlat, “A Unified Framework for Enterprise Integration: An Ontology-Driven Service-Oriented Approach,” pp. 78-89, in Pre-proceedings of the First International Conference on Interoperability of Enterprise Software and Applications (INTEROP-ESA’2005), Geneva, Switzerland, February 23 – 25, 2005, 618 pp. See

[4] Jorge Cardoso and Amit Sheth, “Semantic Web Processes: Semantics Enabled Annotation, Discovery, Composition and Orchestration of Web Scale Processes,” in the 4th International Conference on Web Information Systems Engineering (WISE 2003), December 10-12, 2003, Rome, Italy. See

[5] C. Batini, M. Lenzerini, and S.B. Navathe, “A Comparative Analysis of Methodologies for Database Schema Integration,” in ACM Computing Survey, 18(4):323-364, 1986.

[6] Alon Halevy, “Why Your Data Won’t Mix,” ACM Queue vol. 3, no. 8, October 2005. See

[7] Chuck Moser, Semantic Interoperability: Automatically Resolving Vocabularies, presented at the 4th Semantic Interoperability Conference, February 10, 2006. See

[8] Alon Y. Halevy, Zachary G. Ives, Peter Mork and Igor Tatarinov, “Piazza: Data Management Infrastructure for Semantic Web Applications,” Journal of Web Semantics, Vol. 1 No. 2, February 2004, pp. 155-175. See

[9] Stefano Mazzocchi, Stephen Garland, Ryan Lee, “SIMILE: Practical Metadata for the Semantic Web,” January 26, 2005. See

[10] Adrian Mocan, Ed., “WSMX Data Mediation,” in WSMX Working Draft, W3C Organization, 11 October 2005. See

[11] J.Madhavan , P. A. Bernstein , P. Domingos and A. Y. Halevy, “Representing and Reasoning About Mappings Between Domain Models,” in the Eighteenth National Conference on Artificial Intelligence, pp.80-86, Edmonton, Alberta, Canada, July 28-August 01, 2002.

[12] AnHai Doan, Learning to Map between Structured Representations of Data, Ph.D. Thesis to the Computer Science & Engineering Department, University of Washington, 2002, 133 pp. See

[13] Michael Stonebraker and Joey Hellerstein, “What Goes Around Comes Around,” in Joseph M. Hellerstein and Michael Stonebraker, editors, Readings in Database Systems, Fourth Edition, pp. 2-41, The MIT Press, Cambridge, MA, 2005. See

[14] John Miles Smith and Diane C. P. Smith, “Database Abstractions: Aggregation and Generalization,” ACM Transactions on Database Systems 2(2): 105-133, 1977.

[15] Michael Hammer and Dennis McLeod, “Database Description with SDM: A Semantic Database Model,” ACM Transactions on Database Systems 6(3): 351-386, 1981.

[16] See

[17] See

Posted by AI3's author, Mike Bergman Posted on June 12, 2006 at 6:04 pm in Adaptive Information, Semantic Web | Comments (9)
Posted: June 9, 2006

In an earlier posting I had some fun with the Website as a Graph utility, where you can enter a Web address and the system provides a visual analysis of that individual Web page (not an overall view of the site). Friends, family, indeed the entire Web, have been having some fun with this one.

Before I let this toy go, I decided to do some comparative stuff (and some image animation, a relic of the not too distant past).  So, the image below shows my blog at the time of its release about a year ago (a single post), then the structure of this page about one week ago, and then yesterday.  What a difference a week (or day) makes!  So here are these changes for my blog site:

Posted by AI3's author, Mike Bergman Posted on June 9, 2006 at 9:39 pm in Site-related | Comments (0)
Posted:June 8, 2006

Somehow, since Bloglines (via its parent) announced its new blog and feed search, I have noticed that my standard search feeds no longer return as many results.  I’ve checked via searches on Google’s Blogsearch and Technorati (not to mention Bloglines itself) and see no mention of this problem from others.  Has anyone else been experiencing Bloglines search degradation?

I sent a ding to Bloglines tech support, which was acknowledged, but I have received no real response as yet: 

I have noticed that some of my standard ‘Search feeds’ are no longer returning the number of results they did a week or so ago, by massively large amounts. Is this related to the new search function announced this week (which I like)? Or due to some other problem? (BTW, the ‘Search feed’ that most shows this behavior is "semantic web").

If anyone has an insight, I welcome your comments. 

Posted by AI3's author, Mike Bergman Posted on June 8, 2006 at 9:34 pm in Searching | Comments (0)
Posted:June 6, 2006

Semantic mediation — that is, resolving semantic heterogeneities — must address more than 40 discrete categories of potential mismatches, ranging from units of measure to terminology to language, among many others. These sources may derive from structure, domain, data or language.

Earlier postings in this recent series traced the progress in climbing the data federation pyramid to today’s current emphasis on the semantic Web. In part, this series aims to dispel the notion that data extensibility can arise simply by using the XML (eXtensible Markup Language) data representation protocol. As Stonebraker and Hellerstein correctly observe:

XML is sometimes marketed as the solution to the semantic heterogeneity problem . . . . Nothing could be further from the truth. Just because two people tag a data element as a salary does not mean that the two data elements are comparable. One could be salary after taxes in French francs including a lunch allowance, while the other could be salary before taxes in US dollars. Furthermore, if you call them “rubber gloves” and I call them “latex hand protectors”, then XML will be useless in deciding that they are the same concept. Hence, the role of XML will be limited to providing the vocabulary in which common schemas can be constructed.[1]
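Stonebraker and Hellerstein’s point can be made concrete in a few lines of Python. This is only an illustrative sketch — the XML fragments, attribute names and values below are hypothetical — but it shows how two documents can agree perfectly at the tag level while carrying entirely incomparable data:

```python
import xml.etree.ElementTree as ET

# Two hypothetical sources, each tagging a data element as "salary"
french = ET.fromstring(
    '<employee><salary currency="FRF" basis="after-tax">210000</salary></employee>'
)
us = ET.fromstring(
    '<employee><salary currency="USD" basis="pre-tax">60000</salary></employee>'
)

# A naive, tag-level view sees two "matching" salary elements ...
assert french.find("salary").tag == us.find("salary").tag

# ... but the semantics attached to each element differ entirely,
# so the values cannot be compared or merged directly.
def salary_semantics(doc):
    s = doc.find("salary")
    return (s.get("currency"), s.get("basis"))

assert salary_semantics(french) != salary_semantics(us)
```

XML happily validates both fragments; nothing in the markup itself says whether the two salaries mean the same thing.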

This series also covers the ontologies and the OWL language (written in XML) that now give us the means to understand and process these different domains and “world views” by machine. According to Natalya Noy, one of the principal researchers behind the Protégé development environment for ontologies and knowledge-based systems:

How are ontologies and the Semantic Web different from other forms of structured and semi-structured data, from database schemas to XML? Perhaps one of the main differences lies in their explicit formalization. If we make more of our assumptions explicit and able to be processed by machines, automatically or semi-automatically integrating the data will be easier. Here is another way to look at this: ontology languages have formal semantics, which makes building software agents that process them much easier, in the sense that their behavior is much more predictable (assuming they follow the specified explicit semantics–but at least there is something to follow). [2]

Again, however, even though OWL (or similar languages) now gives us the means to represent an ontology, we still face the vexing challenge of resolving the differences between different “world views,” even within the same domain. According to Alon Halevy:

When independent parties develop database schemas for the same domain, they will almost always be quite different from each other. These differences are referred to as semantic heterogeneity, which also appears in the presence of multiple XML documents, Web services, and ontologies–or more broadly, whenever there is more than one way to structure a body of data. The presence of semi-structured data exacerbates semantic heterogeneity, because semi-structured schemas are much more flexible to start with. For multiple data systems to cooperate with each other, they must understand each other’s schemas. Without such understanding, the multitude of data sources amounts to a digital version of the Tower of Babel. [3]

In the sections below, I describe the sources from which this heterogeneity arises and classify its many different types. I then describe some broad approaches to overcoming these heterogeneities, though a subsequent post will look at that topic in more detail.

Causes and Sources of Semantic Heterogeneity

There are many potential circumstances where semantic heterogeneity may arise (partially from Halevy [3]):

  • Enterprise information integration
  • Querying and indexing the deep Web (which is a classic data federation problem in that there are literally tens to hundreds of thousands of separate Web databases) [4]
  • Merchant catalog mapping
  • Schema v. data heterogeneity
  • Schema heterogeneity and semi-structured data.

Naturally, there will always be differences in how differing authors or sponsors create their own particular “world view.” If transmitted in XML or expressed through an ontology language such as OWL, these views may also differ in expression or syntax. Indeed, the ease of conveying these schemas as semi-structured XML, RDF or OWL is in and of itself a source of potential expression heterogeneities. There are also other sources, in simple schema use and versioning, that can create mismatches [3]. Thus, semantic mismatches can arise from world view, perspective, syntax, structure, versioning and timing:

  • One schema may express a similar “world view” with different syntax, grammar or structure
  • One schema may be a new version of the other
  • Two or more schemas may be evolutions of the same original schema
  • There may be many sources modeling the same aspects of the underlying domain (“horizontal resolution” such as for competing trade associations or standards bodies), or
  • There may be many sources that cover different domains but overlap at the seams (“vertical resolution” such as between pharmaceuticals and basic medicine).

Regardless, the needs for semantic mediation are manifest, as are the ways in which semantic heterogeneities may arise.
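The versioning driver listed above can be sketched in a few lines. The field names here are hypothetical; the point is that records from two versions of the “same” schema no longer align without an explicit mapping — here, a v2 schema that splits a single v1 field:

```python
# Hypothetical example: schema v1 stores a single "name" field, while
# schema v2 splits it into "first_name" and "last_name". Records from
# the two versions only interoperate through an explicit mapping.
def v1_to_v2(record):
    """Map a v1 record, e.g. {'name': 'First Last'}, onto the v2 schema."""
    first, _, last = record["name"].partition(" ")
    return {"first_name": first, "last_name": last}

assert v1_to_v2({"name": "Ada Lovelace"}) == {
    "first_name": "Ada",
    "last_name": "Lovelace",
}
```

Even this toy mapping embeds semantic assumptions (that the first token is the given name, for instance) that are not recorded anywhere in either schema.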

Classification of Semantic Heterogeneities

The earliest classification scheme applied to data semantics that I am aware of is from William Kent nearly 20 years ago.[5] (If you know of earlier ones, please send me a note.) Kent’s approach dealt more with structural mapping issues (see below) than with differences in meaning, for which he pointed to data dictionaries as a potential solution.

The most comprehensive scheme I have yet encountered is from Pluempitiwiriyawej and Hammer, “A Classification Scheme for Semantic and Schematic Heterogeneities in XML Data Sources.” [6] They classify heterogeneities into three broad classes:

  • Structural conflicts arise when the schema of the sources representing related or overlapping data exhibit discrepancies. Structural conflicts can be detected when comparing the underlying DTDs. The class of structural conflicts includes generalization conflicts, aggregation conflicts, internal path discrepancy, missing items, element ordering, constraint and type mismatch, and naming conflicts between the element types and attribute names.
  • Domain conflicts arise when the semantic of the data sources that will be integrated exhibit discrepancies. Domain conflicts can be detected by looking at the information contained in the DTDs and using knowledge about the underlying data domains. The class of domain conflicts includes schematic discrepancy, scale or unit, precision, and data representation conflicts.
  • Data conflicts refer to discrepancies among similar or related data values across multiple sources. Data conflicts can only be detected by comparing the underlying DOCs. The class of data conflicts includes ID-value, missing data, incorrect spelling, and naming conflicts between the element contents and the attribute values.

Moreover, mismatches or conflicts can occur between set elements (a “population” mismatch) or attributes (a “description” mismatch).
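Two of these conflict classes can be illustrated with a small Python sketch. The schema descriptors below are hypothetical toys, but they show how a structural (naming) conflict and a domain (scale-or-unit) conflict can coexist between two sources describing the same quantity:

```python
# Hypothetical field descriptors from two sources for the same quantity
schema_a = {"element": "weight", "unit": "kg"}
schema_b = {"element": "wt", "unit": "lb"}

def classify_conflicts(a, b):
    """Flag naming (structural) and scale-or-unit (domain) conflicts."""
    conflicts = []
    if a["element"] != b["element"]:
        conflicts.append("structural: naming conflict")
    if a["unit"] != b["unit"]:
        conflicts.append("domain: scale-or-unit conflict")
    return conflicts

assert classify_conflicts(schema_a, schema_b) == [
    "structural: naming conflict",
    "domain: scale-or-unit conflict",
]
```

Note that detecting the conflicts is the easy part; deciding that “weight” and “wt” are the same concept, and converting kilograms to pounds, is where the real mediation work lies.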

The table below builds on Pluempitiwiriyawej and Hammer’s scheme by adding the fourth major explicit category of language, leading to about 40 distinct potential sources of semantic heterogeneities:

STRUCTURAL
  • Naming: Case Sensitivity; Synonyms; Acronyms; Homonyms
  • Generalization / Specialization
  • Aggregation: Intra-aggregation; Inter-aggregation
  • Internal Path Discrepancy
  • Missing Item: Content Discrepancy; Attribute List Discrepancy; Missing Attribute; Missing Content
  • Element Ordering
  • Constraint Mismatch
  • Type Mismatch

DOMAIN
  • Schematic Discrepancy: Element-value to Element-label Mapping; Attribute-value to Element-label Mapping; Element-value to Attribute-label Mapping; Attribute-value to Attribute-label Mapping
  • Scale or Units
  • Precision
  • Data Representation: Primitive Data Type; Data Format

DATA
  • Naming: Case Sensitivity; Synonyms; Acronyms; Homonyms
  • ID Mismatch or Missing ID
  • Missing Data
  • Incorrect Spelling

LANGUAGE
  • Encoding: Ingest Encoding Mismatch; Ingest Encoding Lacking; Query Encoding Mismatch; Query Encoding Lacking
  • Languages: Script Mismatches; Parsing / Morphological Analysis Errors (many); Syntactical Errors (many); Semantic Errors (many)

Most of these line items are self-explanatory, but a few may not be:

  • Homonyms occur when the same name refers to more than one concept, such as Name referring to a person v. Name referring to a book
  • A generalization/specialization mismatch can occur when single items in one schema are related to multiple items in another schema, or vice versa. For example, one schema may refer to “phone” while the other schema has multiple elements such as “home phone,” “work phone” and “cell phone”
  • Intra-aggregation mismatches arise when the same population is divided differently by schema (Census v. Federal regions for states, or full person names v. first-middle-last, for examples), whereas inter-aggregation mismatches can come from sums or counts used as added values
  • Internal path discrepancies can arise from different source-target retrieval paths in two different schemas (for example, hierarchical structures where the elements are at different levels of remove)
  • The four sub-types of schematic discrepancy refer to cases where attribute and element names may be interchanged between schemas
  • Under language, encoding mismatches can occur when either the import or export of data to XML assumes the wrong encoding type. While XML is based on Unicode, it is important that source retrievals and issued queries be in the proper encoding of the source. This is especially important for Web retrievals, because only about 4% of all documents are in Unicode, and earlier BrightPlanet estimates suggest there may be on the order of 25,000 language-encoding pairs presently on the Internet
  • Even should the correct encoding be detected, significant differences among language sources in parsing (white space, for example), syntax and semantics can also lead to many error types.
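The encoding mismatches described above are easy to demonstrate. In this small Python sketch, the same bytes are read under two different encoding assumptions: a correct UTF-8 ingest round-trips cleanly, while a wrong Latin-1 assumption silently corrupts the data rather than raising an error:

```python
# UTF-8 encodes "é" as two bytes (0xC3 0xA9)
raw = "café".encode("utf-8")

# Ingest with the correct encoding round-trips cleanly ...
assert raw.decode("utf-8") == "café"

# ... but ingest that wrongly assumes Latin-1 reads each byte as its own
# character, silently producing mojibake instead of an error.
assert raw.decode("latin-1") == "cafÃ©"
```

The silent nature of the failure is what makes this heterogeneity pernicious: both decodes “succeed,” and only downstream matching or querying reveals the damage.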

It should be noted that Sheth et al. take a different approach to classifying semantics and integration.[7] They split semantics into three forms: implicit, formal and powerful. Implicit semantics are what is either largely present or can easily be extracted; formal semantics, though relatively scarce, occur in the form of ontologies or other descriptive logics; and powerful (soft) semantics are fuzzy and not limited to rigid set-based assignments. Sheth et al.’s main point is that first-order logic (FOL) or descriptive logic alone is inadequate to properly capture the needed semantics.

From my viewpoint, Pluempitiwiriyawej and Hammer’s [6] classification better lends itself to pragmatic tools and approaches, though the Sheth et al. approach also helps indicate what can be processed in situ from input data v. inferred or probabilistic matches.

Importance of Reference Standards

An attractive and compelling vision — perhaps even a likely one — is that standard reference ontologies will become increasingly prevalent as time moves on and semantic mediation comes to be seen as a mainstream problem. Certainly, a start has been made with the Dublin Core metadata initiative, and increasingly other associations, organizations and major buyers are busy developing “standardized” or reference ontologies.[8] Indeed, there are now more than 10,000 ontologies available on the Web.[9] Insofar as these gain acceptance, semantic mediation can become an effort mostly at the periphery and not the core.
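Mediation “at the periphery” can be sketched simply: each source maps its local field names onto a shared reference vocabulary once, rather than pairwise onto every other source. The Dublin Core element names below (dc:creator, dc:title, dc:date) are real; the local field names and the sample record are hypothetical:

```python
# One-time mapping from a source's local fields to Dublin Core terms
DC_MAP = {
    "author": "dc:creator",
    "headline": "dc:title",
    "pub_date": "dc:date",
}

def to_dublin_core(record):
    # Fields with no agreed mapping pass through unchanged and remain
    # candidates for further (manual or automated) mediation.
    return {DC_MAP.get(k, k): v for k, v in record.items()}

rec = {"author": "M. Bergman", "headline": "Sources of Semantic Heterogeneity"}
assert to_dublin_core(rec) == {
    "dc:creator": "M. Bergman",
    "dc:title": "Sources of Semantic Heterogeneity",
}
```

With N sources, a shared reference vocabulary needs only N such maps, versus on the order of N² pairwise mappings without one — which is the economic argument for reference ontologies.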

But such is not the case today. Standards have had only limited success, and then only in targeted domains where incentives are strong. That acceptance and benefit threshold has yet to be reached on the Web. Until then, a multiplicity of automated methods, semi-automated methods and gazetteers will all be required to help resolve these potential heterogeneities.

NOTE: This posting is part of an occasional series looking at a new category that I and BrightPlanet are terming the eXtensible Semantic Data Model (XSDM). Topics in this series cover all information related to extensible data models and engines applicable to documents, metadata, attributes, semi-structured data, or the processing, storing and indexing of XML, RDF, OWL, or SKOS data. A major white paper will be produced at the conclusion of the series.

[1] Michael Stonebraker and Joey Hellerstein, “What Goes Around Comes Around,” in Joseph M. Hellerstein and Michael Stonebraker, editors, Readings in Database Systems, Fourth Edition, pp. 2-41, The MIT Press, Cambridge, MA, 2005. See

[2] Natalya Noy, “Order from Chaos,” ACM Queue vol. 3, no. 8, October 2005. See

[3] Alon Halevy, “Why Your Data Won’t Mix,” ACM Queue vol. 3, no. 8, October 2005. See

[4] Michael K. Bergman, “The Deep Web: Surfacing Hidden Value,” BrightPlanet Corporation White Paper, June 2000. The most recent version of the study was published by the University of Michigan’s Journal of Electronic Publishing in July 2001. See

[5] William Kent, “The Many Forms of a Single Fact”, Proceedings of the IEEE COMPCON, Feb. 27-Mar. 3, 1989, San Francisco. Also HPL-SAL-88-8, Hewlett-Packard Laboratories, Oct. 21, 1988. [13 pp]. See

[6] Charnyote Pluempitiwiriyawej and Joachim Hammer, “A Classification Scheme for Semantic and Schematic Heterogeneities in XML Data Sources,” Technical Report TR00-004, University of Florida, Gainesville, FL, 36 pp., September 2000. See

[7] Amit Sheth, Cartic Ramakrishnan and Christopher Thomas, “Semantics for the Semantic Web: The Implicit, the Formal and the Powerful,” in Int’l Journal on Semantic Web & Information Systems, 1(1), 1-18, Jan-March 2005. See

[8] See, among scores of possible examples, the NIEM (National Information Exchange Model) agreed to between the US Departments of Justice and Homeland Security; see

[9] OWL Ontologies: When Machine Readable is Not Good Enough

Posted by AI3's author, Mike Bergman Posted on June 6, 2006 at 6:12 pm in Adaptive Information, Semantic Web | Comments (0)