This post introduces a new category on my blog covering what BrightPlanet and I are calling the eXtensible Semi-structured Data Model (XSDM). Posts in this category will cover extensible data models and engines applicable to documents, metadata, attributes, or semi-structured data, as well as the processing, storing, and indexing of the XML, RDF, OWL, and SKOS data formats.
Fengdong Du explains why this category matters in the master’s thesis, Moving from XML Documents to XML Databases, submitted to the University of British Columbia in March 2004. As the introduction to that thesis succinctly states:
Depending on the characteristics of XML applications, the current XML storage techniques can be classified into two major categories. Most text-centric applications (e.g., newspapers) choose an existing file system for data storage. Data is usually divided into logical units, and each logical unit is physically stored as a separate file. As an example, a newspaper application may divide the entire year's newspapers into 12 collections by months, and store each collection as a document file. This type of application usually provides a keyword-based search tool and manipulates the data in application-specific processes. While this approach simplifies the storage problem, it has some major drawbacks. First, storing XML data as plain text makes it difficult to develop a generic data manipulation interface.
Second, mapping logical units of data to individual files makes it difficult to view the data from a different perspective. For this reason, this type of application only provides services with limited functionalities and therefore restricts the usage of data.
On the other hand, in data-centric applications such as e-commerce applications, data is typically highly-structured, e.g., extracted from a relational database management system (RDBMS). XML is primarily used as a tool to publish data to the Web or deliver information in a self-descriptive way in place of the conventional relative files. This type of application relies on the RDBMS for data storage. Data received in XML format is eventually put into an RDBMS when persistence is desired. Over the years, an RDBMS has been well developed to efficiently store and retrieve well-structured data. Structured Query Language (SQL) and many useful extended RDBMS utilities (e.g., Programming Language SQL, stored procedures) act as an application-independent data manipulation interface. Applications can communicate with databases through this generic interface and, on top of it, provide services with very rich functionalities.
While storing XML data into an RDBMS can take advantage of the well-developed relational database techniques and open interfaces, this approach requires an extra schema-mapping process applied to XML data, which involves schema transformation and usually decomposition. The schemas of XML data have to be mapped to strictly-defined relational schemas before data is actually stored. This process is strongly application-dependent or domain-dependent because there must be enough information available to determine many relational database design issues such as which table in the target RDBMS is a good place to store the information delivered, what new tables need to be created, which elements/attributes should be indexed, etc. No matter how this kind of information is obtained, whether delivered with XML data as schemas and processing instructions, or the application context makes it obvious, it is hard to develop an automatic and generic schema-mapping mechanism. Instead, application-specific work needs to take care of the schema-mapping problem. This involves non-trivial work of database server-side programming and database administration.
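As an aside from the quoted passage, the schema-mapping step it describes can be made concrete with a small sketch. This is my own minimal illustration, not code from the thesis: the sample document, the element names, and the two-table layout are all assumptions chosen for the example, and SQLite stands in for a generic RDBMS.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document; every element and attribute name is an assumption.
XML_DOC = """
<university>
  <department name="Computer Science">
    <street>Main Mall</street>
    <student id="1"><name>Ana</name><nationality>Brazilian</nationality></student>
    <student id="2"><name>Wei</name><nationality>Chinese</nationality></student>
  </department>
  <department name="Physics">
    <street>Agricultural Road</street>
    <student id="3"><name>Lars</name><nationality>Swedish</nationality></student>
  </department>
</university>
"""


def shred(xml_text, conn):
    """Decompose one XML tree into two hand-designed relational tables."""
    # The table layout is decided by a person who knows the document
    # structure in advance; nothing here is derived automatically.
    conn.execute(
        "CREATE TABLE department ("
        "dept_id INTEGER PRIMARY KEY, name TEXT, street TEXT)"
    )
    conn.execute(
        "CREATE TABLE student ("
        "student_id INTEGER PRIMARY KEY, "
        "dept_id INTEGER REFERENCES department(dept_id), "
        "name TEXT, nationality TEXT)"
    )
    root = ET.fromstring(xml_text)
    for dept_id, dept in enumerate(root.findall("department"), start=1):
        conn.execute(
            "INSERT INTO department VALUES (?, ?, ?)",
            (dept_id, dept.get("name"), dept.findtext("street")),
        )
        # Nesting in the XML becomes a foreign key in the relational schema.
        for student in dept.findall("student"):
            conn.execute(
                "INSERT INTO student VALUES (?, ?, ?, ?)",
                (
                    int(student.get("id")),
                    dept_id,
                    student.findtext("name"),
                    student.findtext("nationality"),
                ),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
shred(XML_DOC, conn)
dept_count = conn.execute("SELECT COUNT(*) FROM department").fetchone()[0]
student_count = conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]
print(dept_count, student_count)
```

Every choice in `shred` (which elements become tables, which become columns, how nesting turns into keys) is hard-coded for this one document shape. A document with a different structure would need a different mapping, which is the application-specific work the passage refers to.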
Another drawback of storing XML data in an RDBMS is that it is hard to efficiently support many types of queries that people want to ask on XML data. In RDBMS, each table has a pre-defined primary key field, and possibly a few other indexed fields. Queries not on the key field and not on the indexed fields will result in table scans (i.e., possibly a very large number of I/O's, which can be very time consuming) such as for the following path and predicate expression:
It is very likely that "department" is not indexed on "street" and that "student" is not indexed on "nationality". Therefore, resolving this path expression will cause table scans. Moreover, storing XML data in an RDBMS often results in schema decomposition and produces many small tables. Hence, evaluating a query often needs many expensive join operations.
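As an aside from the quoted passage, the table-scan point can be checked on a small scale. The sketch below is my own and rests on several assumptions: SQLite stands in for a generic RDBMS, the `student` table and its columns are hypothetical, and the query is a simple stand-in predicate on nationality rather than the thesis's own expression. It asks the query planner how it would evaluate the same query before and after an index is added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical decomposed schema; only the primary keys are indexed at first.
conn.executescript(
    """
    CREATE TABLE department (
        dept_id INTEGER PRIMARY KEY, name TEXT, street TEXT);
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY, dept_id INTEGER,
        name TEXT, nationality TEXT);
    """
)

# A predicate on a column that is neither the key nor otherwise indexed.
QUERY = "SELECT name FROM student WHERE nationality = ?"


def plan(conn, sql, params):
    """Return the planner's description of each step for the given query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable description is the fourth column of each row.
    return [row[3] for row in rows]


plan_before = plan(conn, QUERY, ("Canadian",))

# Adding an index is a schema-design decision someone has to make in advance.
conn.execute("CREATE INDEX idx_student_nationality ON student(nationality)")
plan_after = plan(conn, QUERY, ("Canadian",))

print(plan_before)
print(plan_after)
```

The exact wording of the plan differs between SQLite versions, but the first plan reports a scan of the whole table while the second reports a search using the new index. The catch is that an index only helps for the predicates someone anticipated, whereas path expressions over XML can place a predicate on almost any element.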
For unstructured or semi-structured data, an RDBMS has greater difficulty, and query performance is usually unacceptable for relatively large amount of data. For these reasons, a native database management system is expected in the XML world. Like a traditional RDBMS, native XML databases would provide a comprehensive and generic data management interface, and therefore isolate lower level details from the database applications. Unlike an RDBMS, an ideal native XML database would make no distinction between unstructured data and strictly structured data. It treats all valid XML data in the same way and manages them equally efficiently. Its performance is only affected by the type of data manipulation. In other words, an ideal XML native database is not only access transparent but also performance transparent upon the structural difference of data.
Future posts in this XSDM category will expand on these challenges and describe the new standards-based solutions that BrightPlanet is developing to address them.