Posted: February 14, 2006

How often do you see vendor literature or system and application descriptions that claim extensibility simply because of a heavy reliance on XML? I find it amazing how common the claim is, and how prevalent the logical fallacies surrounding this notion are.

Don’t get me wrong. As a data exchange format, eXtensible Markup Language (XML) does provide data representation extensibility. This contribution is great, with widespread adoption a major factor in its own right in helping to bring down the Tower of Babel. But the simple use of XML alone is insufficient to provide extensibility.

Fully extensible systems need to have at least these capabilities:

  • Extensible data representation so that any data type and form can be transmitted between two disparate systems. XML and its structured cousins such as RDF and OWL perform this role. Note, however, that standard data exchange formats have been an active topic of research and adoption for at least 20 years, with other notable formats such as ASN.1, CDF and EDI also performing this task before being largely overtaken by XML
  • Extensible semantics, since once more than one source of data is brought into an extended environment, new semantics and heterogeneities are likely introduced. These mismatches fall into the classic challenge areas of data federation. The key point, however, is that simply being able to ingest extended data does nothing if the meaning of that data is not also captured. Semantic extensibility requires more structured data representations (RDF-S or OWL, for example), reference vocabularies and ontologies, and utilities and means to map the meanings between different schemas
  • Extensible data management. Though native XML databases and other extensions to conventional data systems have been attempted, truly extensible data management systems have not yet been developed that: 1) perform at scale; 2) can be extended without re-architecting the schema; 3) can be extended without re-processing the original source data; and 4) perform efficiently. Until extensible infrastructure with these capabilities is available, extensibility will not become viable at the enterprise level and will remain an academic or startup curiosity, and
  • Extensible capabilities through extendable and interoperable applications or tools. Though we are now moving up the stack into the application layer, real extensibility comes from true interoperability. Service-oriented architectures (SOAs) and other approaches allow the registry and message brokering amongst extended apps and services. But centralized vs. decentralized systems, inclusion or not of business process interoperability, and even the accommodation of the other extensible imperatives above make this last layer potentially fiendishly difficult.
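The semantic-extensibility requirement in the second bullet can be sketched in miniature. All field names and mapping tables below are hypothetical; at real scale this role falls to reference vocabularies, ontologies, and schema-mapping tools, not hand-built dictionaries:

```python
# Illustrative sketch only: two sources describe the same concept with
# different terms. Simply ingesting both records does nothing until the
# meanings are mapped onto a shared vocabulary.

source_a = {"prodName": "Widget", "cost_usd": 9.99}   # hypothetical source A
source_b = {"title": "Widget", "price": 9.99}         # hypothetical source B

# Mapping tables stand in for the vocabularies/ontology mappings
# a real semantic infrastructure would supply.
mapping_a = {"prodName": "name", "cost_usd": "price"}
mapping_b = {"title": "name", "price": "price"}

def to_shared(record, mapping):
    """Re-express a source record in the shared vocabulary."""
    return {mapping[k]: v for k, v in record.items() if k in mapping}

# Once mapped, the two heterogeneous records agree.
assert to_shared(source_a, mapping_a) == to_shared(source_b, mapping_b)
```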

These challenges are especially daunting in a completely decentralized, chaotic, distributed environment such as the broader Internet. This environment requires peer-to-peer protocols and significant error checking and validation, and therefore suffers inefficiencies from excessive protocol layering. Moreover, there are always competing standards, and few incentives and fewer rewards for gaining compliance or adherence.

Thus it is likely that whatever progress is made on these extensibility and interoperability fronts will appear soonest in the enterprise. Enterprises can better enforce and reward centralized standards. Yet even in this realm, while perhaps virtually all of the extensible building blocks and nascent standards exist, pulling them together into a cohesive whole, in which the standards themselves are integrated and cohesive, is the next daunting challenge.

Thus, the next time you hear about a system with its amazing extensibility, look more closely at it in terms of these threshold criteria. The claims will likely fail. And, even if they do appear to work in a demo setting, make sure you look around carefully for the wizard’s curtain.

Posted: February 13, 2006

Conventional service-oriented architectures (SOAs) have been found to have:

  • Slow and inefficient bindings
  • Complete duplication of processing across requests because of the lack of caching, and
  • Generally slow performance because of RDBMS storage.

These problems are especially acute at scale.

Frank Cohen recently posted a paper on IBM’s developerWorks, "FastSOA: Accelerate SOA with XML, XQuery, and native XML database technology: The role of a mid-tier SOA cache architecture," that presents some interesting alternatives to this conundrum.

The specific FastSOA proposal may or may not be your preferred solution if you are working with complex SOA environments at scale. But the general overview of conventional SOA constraints (in the SOAP framework) is very helpful and highly recommended.
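The mid-tier caching idea can be sketched in miniature. The class and backend below are hypothetical stand-ins, not the paper's actual design; a real FastSOA-style deployment would cache parsed XML in a native XML database rather than in a Python dictionary:

```python
import hashlib

# Sketch of a mid-tier SOA response cache: identical requests are served
# from the cache instead of repeating the full SOAP/RDBMS round trip.
# All names here are illustrative.

class MidTierCache:
    def __init__(self, backend):
        self.backend = backend   # the slow, RDBMS-backed service
        self.store = {}          # request digest -> cached response
        self.hits = 0

    def request(self, xml_request: str) -> str:
        key = hashlib.sha256(xml_request.encode()).hexdigest()
        if key in self.store:
            self.hits += 1       # identical request: skip re-processing
            return self.store[key]
        response = self.backend(xml_request)
        self.store[key] = response
        return response

def slow_backend(req):           # stands in for full SOAP/RDBMS processing
    return "<quote symbol='IBM'>82.50</quote>"

svc = MidTierCache(slow_backend)
svc.request("<getQuote symbol='IBM'/>")
svc.request("<getQuote symbol='IBM'/>")   # second call never hits the backend
assert svc.hits == 1
```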

Posted by AI3's author, Mike Bergman Posted on February 13, 2006 at 9:59 am in Information Automation, Semantic Web | Comments (0)
Posted: February 12, 2006

The enterprise Semantic Web, like all Semantic Web instances, by definition depends on semi-structured data. Generally lacking in the move toward a semi-structured data paradigm has been the creation of adequate processing engines for efficient and scalable storage and retrieval of semi-structured data.[1]

While tremendous effort has gone into data representations like XML, when it comes to positing or designing engines for manipulating that data, the approach has been to graft kludgy workarounds onto existing relational DBMSs or text search engines. Neither meets the test. Thus, as the semantic Web, with its reliance on semi-structured data, looks forward, two impediments stand like gatekeepers blocking progress: 1) efficient processing engines and 2) scalable systems and architectures.

Unlike structured or unstructured data, there is no accepted database engine specific to semi-structured data. Some systems attempt to use relational DBMS approaches from the structured end of the spectrum; other systems attempt to add some structure to standard unstructured search engines (see the figure in my related posting). Structured data is dominated by RDBMSs, and unstructured data is largely the realm of text or search engines.

Attempts to manage the middle ground of semi-structured data have involved either modifying RDBMS systems to be XML-enabled, adding some structure to existing IR systems, or developing new, native XML data systems from scratch. The native XML systems are relatively new and unproven. For a listing of native XML databases, plus generally useful discussion about the use of XML within databases, see Ron Bourret’s Web site.[2]

Semi-structured data models are sometimes called “self-describing” (or schema-less). These data models are often represented as labeled graphs, or sometimes labeled trees with the data stored at the leaves. The schema information is contained in the edge labels of the graph. Semi-structured representations also lend themselves well to data exchange or the integration of heterogeneous data sources.
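The labeled-tree view can be made concrete with a small sketch. The record fields here are hypothetical; the point is that the schema lives in the edge labels and the data at the leaves, with no table definition required up front:

```python
# A self-describing, semi-structured record as a labeled tree: nested
# dict keys are the edge labels, leaf values are the data. Another
# record may carry different fields without any schema change.

record = {
    "person": {
        "name": "Ada",
        "address": {"city": "London"},   # nested edge labels
        "nickname": "The Countess",      # a field other records may lack
    }
}

def leaves(tree, path=()):
    """Enumerate (edge-label path, leaf value) pairs of the tree."""
    if isinstance(tree, dict):
        for label, sub in tree.items():
            yield from leaves(sub, path + (label,))
    else:
        yield path, tree

paths = dict(leaves(record))
assert paths[("person", "address", "city")] == "London"
```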

However, all three of these approaches to managing semi-structured XML data (XML-enabled RDBMSs, modified IR text engines, and native XML data systems) have their own strengths and weaknesses, as shown by the table below:

Type                   Pros                                Cons
XML-enabled RDBMS      Ubiquitous; mature vendor support   Fragile; poor document-centric retrieval; loses ordering and other critical information
IR-based text engine   Excellent text retrieval at scale   Cannot store or manage structured data; new attributes force complete re-indexing
Native XML database    Purpose-built for XML documents     Relatively new and unproven; slow retrievals across large document repositories

Because of their prevalence, XML-enabled RDBMSs are perhaps the most common approach, with all commercial vendors such as Oracle, IBM and Sybase offering their own versions. But realize that XML is itself text, much of its information requires text-based retrieval, and open XML schemas with the need to preserve ordering are very poorly suited to the relational data model. As a result, RDBMS options are very fragile, perform poorly for document-centric retrievals, and lose critical information.
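The ordering problem can be seen in a small shredding sketch, assuming a naive element-per-row decomposition (commercial XML-enabled RDBMSs are more sophisticated, but face the same fundamental mismatch):

```python
import xml.etree.ElementTree as ET

# Sketch of "shredding" an XML fragment into relational-style rows.
# The explicit order column is something the relational model does not
# supply for free; drop it, and sibling order is unrecoverable.

doc = "<para>XML is <em>text</em> first and <em>data</em> second.</para>"
root = ET.fromstring(doc)

rows = [(i, el.tag, el.text) for i, el in enumerate(root.iter())]
assert rows[1] == (1, "em", "text")   # the two <em> rows keep document
assert rows[2] == (2, "em", "data")   # order only via the order column

# The mixed content after each <em> (the .tail strings) never made it
# into the rows at all -- one way document-centric information is lost.
lost = [el.tail for el in root.iter("em")]
assert lost == [" first and ", " second."]
```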

IR-based text search systems do well on the text retrieval scale, but are not suited at all for storing and managing structured data. Further, many of these systems use in-line tagging of structural attributes. While this approach parses well and can seamlessly work with existing text token indexing, at scale it suffers the fatal flaw of requiring the complete re-indexing of existing content should new attributes or extensions be desired.
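The re-indexing flaw can be illustrated with a toy inverted index; the inline `ATTR_` token scheme below is a hypothetical stand-in for real in-line structural tagging:

```python
# Sketch: structural attributes embedded as tokens in the text stream.
# The inverted index only knows about attributes present at indexing
# time, so extending the attribute set means rewriting every document
# and rebuilding the index from scratch.

docs = {
    1: "ATTR_author_smith the quick brown fox",
    2: "ATTR_author_jones lazy dogs sleep",
}

def build_index(corpus):
    index = {}
    for doc_id, text in corpus.items():
        for token in text.split():
            index.setdefault(token, set()).add(doc_id)
    return index

index = build_index(docs)
assert index["ATTR_author_smith"] == {1}

# Adding a new attribute (say, a date) touches every document's token
# stream and forces a complete, not incremental, re-index:
docs[1] = "ATTR_date_2006 " + docs[1]
docs[2] = "ATTR_date_2005 " + docs[2]
index = build_index(docs)
assert index["ATTR_date_2006"] == {1}
```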

Finally, all native XML data systems perform poorly at scale. Some of these native systems build from a text-search basis, others from more object or relational approaches. But, in general, queries and other mechanisms are still highly XML document-centric, with very slow retrievals across large document repositories.

As XML and semi-structured data have become ubiquitous, clearly the path is opening in the marketplace for a “third way.” Later postings will look at efforts by new vendors such as Mark Logic to address this opportunity, as well as emerging efforts from BrightPlanet.

NOTE: This posting is part of an occasional series looking at a new category that I and BrightPlanet are terming the eXtensible Semantic Data Model (XSDM). Topics in this series cover all information related to extensible data models and engines applicable to documents, metadata, attributes, semi-structured data, or the processing, storing, indexing or semantic schemas and mappings of XML, RDF, OWL, or SKOS data. A major white paper will be produced at the conclusion of the series. Stay tuned!

[1] Matteo Magnani and Danilo Montesi, “A Unified Approach to Structured, Semistructured and Unstructured Data,” Technical Report UBLCS-2004-9, Department of Computer Science, University of Bologna, 29 pp., May 29, 2004.


Posted by AI3's author, Mike Bergman Posted on February 12, 2006 at 12:51 pm in Semantic Web | Comments (4)

The W3C has just published an update to "A Survey of RDF/Topic Maps Interoperability Proposals." This note, dated February 10, updates the previous version from one year ago.

It is well and good to embrace standards for semantic content such as RDF or OWL, but without standard mechanisms for expressing schemas it is difficult to actually map and resolve semantic heterogeneities. This introductory survey is useful from the standpoint of topic maps.

Posted by AI3's author, Mike Bergman Posted on February 12, 2006 at 11:52 am in Adaptive Information, Semantic Web | Comments (0)
Posted: February 1, 2006

IBM has announced it has completed the first step of making the Unstructured Information Management Architecture (UIMA) available to the open source community by publishing the UIMA source code. UIMA is an open software framework to aid the creation, development and deployment of technologies for unstructured content. IBM first unveiled UIMA in December 2004. The source code for the IBM reference implementation of UIMA is currently available for download. In addition, the IBM UIMA SDK, with additional facilities and components, can be downloaded for free.

UIMA has received support from the Defense Advanced Research Projects Agency (DARPA) and is currently in use as part of DARPA’s new human language technology research and development program called GALE (Global Autonomous Language Exploitation). UIMA is also embedded in various IBM products for processing unstructured information.