How often do you see vendor literature or system or application descriptions that claim extensibility simply because of a heavy reliance on XML? I find it amazing how common the claim is and how prevalent are the logical fallacies surrounding this notion.
Don’t get me wrong. As a data exchange format, eXtensible Markup Language (XML) does provide data representation extensibility. This contribution is great, and XML’s widespread adoption is a major factor in its own right in helping to bring down the Tower of Babel. But the use of XML alone is insufficient to provide extensibility.
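A minimal sketch of that distinction, in Python with the standard library's `xml.etree.ElementTree`. The order records, the `currency` element and both function names are hypothetical, made up for illustration only. The point is that the format happily carries a new element, while the consuming code still has no behavior attached to it:

```python
import xml.etree.ElementTree as ET

# Hypothetical records: version 2 adds a <currency> element that the
# consumer below was never written to handle.
V1 = "<order><id>17</id><total>9.50</total></order>"
V2 = "<order><id>17</id><total>9.50</total><currency>EUR</currency></order>"

KNOWN_FIELDS = ("id", "total")


def read_order(xml_text):
    """Read only the fields this consumer knows about; ignore the rest."""
    root = ET.fromstring(xml_text)
    return {name: root.findtext(name) for name in KNOWN_FIELDS}


def unknown_fields(xml_text):
    """List child elements that parsed fine but have no logic behind them."""
    root = ET.fromstring(xml_text)
    return [child.tag for child in root if child.tag not in KNOWN_FIELDS]


old_view = read_order(V1)
new_view = read_order(V2)
ignored = unknown_fields(V2)
```

Both documents parse and the old consumer keeps working, but the new element is silently dropped. The data representation was extended; the system was not.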
Fully extensible systems need to have at least these capabilities:
These challenges are especially daunting in a completely decentralized, chaotic, distributed environment such as the broader Internet. This environment requires peer-to-peer protocols and significant error checking and validation, and it therefore incurs the inefficiencies of excessive protocol layering. Moreover, there are always competing standards, with few incentives and fewer rewards for compliance or adherence.
Thus it is likely that whatever progress is made on these extensibility and interoperability fronts will show itself soonest in the enterprise. Enterprises can better enforce and reward centralized standards. Yet even in this realm, while perhaps virtually all of the extensible building blocks and nascent standards exist, pulling them together into a cohesive whole, in which the standards themselves are integrated and cohesive, is the next daunting challenge.
Thus, the next time you hear about a system with its amazing extensibility, look more closely at it in terms of these threshold criteria. The claims will likely fail. And, even if they do appear to work in a demo setting, make sure you look around carefully for the wizard’s curtain.
Conventional service-oriented architectures (SOAs) have been found to have:
These problems are especially acute at scale.
Frank Cohen recently posted a paper on IBM’s developerWorks, "FastSOA: Accelerate SOA with XML, XQuery, and native XML database technology: The role of a mid-tier SOA cache architecture," that presents some interesting alternatives to this conundrum.
The specific FastSOA proposal may or may not be your preferred solution if you are working with complex SOA environments at scale. But the general overview of conventional SOA constraints (in the SOAP framework) is very helpful and highly recommended.
The enterprise Semantic Web, as all Semantic Web instances, by definition depends on semi-structured data. Generally lacking in the move toward a semi-structured data paradigm has been the creation of adequate processing engines for efficient and scalable storage and retrieval of semi-structured data.
While tremendous effort has gone into data representations like XML, when it comes to positing or designing engines for manipulating that data the usual approach has been to bolt kludgy workarounds onto existing relational DBMSs or text search engines. Neither meets the test. Thus, as the Semantic Web and its association with semi-structured data look forward, two impediments stand like gatekeepers blocking progress: 1) efficient processing engines and 2) scalable systems and architectures.
Unlike structured or unstructured data, there is no accepted database engine specific to semi-structured data. Some systems attempt to use relational DBMS approaches from the structured end of the spectrum; other systems attempt to add some structure to standard unstructured search engines (see the figure in my related posting). Structured data is dominated by RDBMSs and unstructured data is largely the realm of text or search engines.
Attempts to manage the middle ground of semi-structured data have involved either modifying RDBMS systems to be XML enabled, adding some structure to existing IR systems, or developing new, native XML data systems from scratch. The native XML systems are relatively new and unproven. For a listing of native XML databases, plus generally useful discussion about the use of XML within databases, see Ron Bourret’s Web site.
Semi-structured data models are sometimes called “self-describing” (or schema-less). These data models are often represented as labeled graphs, or sometimes labeled trees with the data stored at the leaves. The schema information is contained in the edge labels of the graph. Semi-structured representations also lend themselves well to data exchange or the integration of heterogeneous data sources.
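As a rough illustration of the labeled-tree view, the Python sketch below walks a small XML document and emits each leaf value together with the path of edge labels leading to it. The sample document and the `leaf_paths` helper are assumptions made for this example, not part of any particular system:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the two <book> records do not share the same fields.
DOC = """<library>
  <book><title>Dune</title><year>1965</year></book>
  <book><title>Emma</title><author>Austen</author></book>
</library>"""


def leaf_paths(elem, prefix=()):
    """Yield (edge-label path, leaf value) pairs from a labeled tree."""
    path = prefix + (elem.tag,)
    children = list(elem)
    if not children:
        # Data lives at the leaves; the labels on the way down describe it.
        yield path, (elem.text or "").strip()
    for child in children:
        yield from leaf_paths(child, path)


pairs = list(leaf_paths(ET.fromstring(DOC)))

# The "schema" is whatever set of label paths the data itself happens to use.
schema = sorted({"/".join(path) for path, _ in pairs})
```

No schema was declared up front. The set of label paths is discovered from the data, which is what "self-describing" means here, and it is also why records with different fields can sit side by side.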
However, all three of these approaches to managing semi-structured XML data (XML-enabled RDBMSs, modified IR text engines, and native XML data systems) have their own strengths and weaknesses, as shown by the table below:
Because of the prevalence of relational databases, XML-enabled RDBMSs are perhaps the most common approach, with all of the major commercial vendors, such as Oracle, IBM and Sybase, offering their own versions. But realize that XML is itself text, much of its information requires text-based retrieval, and open XML schemas with the need to preserve ordering are very poorly suited to the relational data model. As a result, RDBMS options are very fragile, perform poorly for document-centric retrievals, and lose critical information.
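The ordering problem can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumed names (the `node` table, its `pos` column and the sample document are all hypothetical) and does not describe how any vendor's XML support works. Rows in a relation have no inherent order, so document order survives only if it is stored explicitly:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical document in which the interleaving of elements matters.
DOC = "<steps><step>mix</step><note>rest 5 min</note><step>bake</step></steps>"

conn = sqlite3.connect(":memory:")
# The pos column is the workaround: it records each child's document position.
conn.execute("CREATE TABLE node (pos INTEGER, tag TEXT, value TEXT)")
for pos, child in enumerate(ET.fromstring(DOC)):
    conn.execute("INSERT INTO node VALUES (?, ?, ?)", (pos, child.tag, child.text))

# Document order is recoverable only by sorting on the extra column.
in_order = conn.execute(
    "SELECT tag, value FROM node ORDER BY pos"
).fetchall()

# Grouping by element type, as a table-per-element design would, keeps the
# values but no longer shows that the note sits between the two steps.
by_tag = conn.execute(
    "SELECT tag, value FROM node ORDER BY tag, pos"
).fetchall()
conn.close()
```

Drop the `pos` column and the interleaving is gone for good, which is one concrete way an XML-to-relational mapping can lose information.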
IR-based text search systems do well on the text retrieval scale, but are not suited at all for storing and managing structured data. Further, many of these systems use in-line tagging of structural attributes. While this approach parses well and can seamlessly work with existing text token indexing, at scale it suffers the fatal flaw of requiring the complete re-indexing of existing content should new attributes or extensions be desired.
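A toy sketch of why in-line tagging forces re-indexing, written in Python. It assumes a deliberately simple design in which structural attributes are folded into the token stream as tokens such as `author:smith`. The corpus, the token format and `build_index` are all hypothetical, and real engines differ in detail:

```python
from collections import defaultdict

# Hypothetical corpus: body text plus a few structural attributes.
DOCS = {
    1: {"text": "xml storage engines", "author": "smith", "year": "2005"},
    2: {"text": "text search at scale", "author": "jones", "year": "2006"},
}


def build_index(docs, attrs):
    """Build an inverted index whose tokens carry in-line attribute tags."""
    index = defaultdict(set)
    touched = 0
    for doc_id, doc in docs.items():
        tokens = doc["text"].split()
        # In-line tagging: attributes become ordinary tokens in the stream.
        tokens += [f"{name}:{doc[name]}" for name in attrs if name in doc]
        for token in tokens:
            index[token].add(doc_id)
        touched += 1
    return index, touched


index, _ = build_index(DOCS, attrs=["author"])

# Supporting a new attribute means re-tokenizing every stored document.
index_v2, reindexed = build_index(DOCS, attrs=["author", "year"])
```

Because the attribute tokens are produced during tokenization, adding `year` means a pass over the entire corpus. With two documents that is trivial; the cost grows with every document already indexed.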
Finally, all native XML data systems perform poorly at scale. Some of these native systems build from a text-search basis, others from more object or relational approaches. But, in general, queries and other mechanisms are still highly XML document-centric, with very slow retrievals across large document repositories.
As XML and semi-structured data have become ubiquitous, clearly the path is opening in the marketplace for a “third way.” Later postings will look at efforts by new vendors such as Mark Logic to address this opportunity, as well as emerging efforts from BrightPlanet.
NOTE: This posting is part of an occasional series looking at a new category that I and BrightPlanet are terming the eXtensible Semantic Data Model (XSDM). Topics in this series cover all information related to extensible data models and engines applicable to documents, metadata, attributes, semi-structured data, or the processing, storing, indexing or semantic schemas and mappings of XML, RDF, OWL, or SKOS data. A major white paper will be produced at the conclusion of the series. Stay tuned!
The W3C has just published an update to "A Survey of RDF/Topic Maps Interoperability Proposals." This note, dated Feb 10, updates the previous version from one year earlier.
It is well and good to embrace standards for semantic content such as RDF or OWL, but without mechanisms for expressing schemas in a standard way it is difficult to actually map and resolve semantic heterogeneities. This introductory survey is useful from the standpoint of topic maps.
IBM has announced it has completed the first step of making the Unstructured Information Management Architecture (UIMA) available to the open source community by publishing the UIMA source code to SourceForge.net. UIMA is an open software framework to aid the creation, development and deployment of technologies for unstructured content. IBM first unveiled UIMA in December of 2004. The source code for the IBM reference implementation of UIMA is currently available and can be downloaded from http://uima-framework.sourceforge.net/ . In addition, the IBM UIMA SDK, with additional facilities and components, can be downloaded for free from http://www.alphaworks.ibm.com/tech/uima .
UIMA has received support from the Defense Advanced Research Projects Agency (DARPA) and is currently in use as part of DARPA’s new human language technology research and development program called GALE (Global Autonomous Language Exploitation). UIMA is also embedded in various IBM products for processing unstructured information.