I just finished participating in a discussion that has mirrored many others I have observed in the past: We have a complicated problem with much data before us, and we don’t know where it may evolve or trend. Can we architect a single database schema up front that handles all possible options?
Every programmer or database administrator (DBA) will recommend keeping designs to a single database, schema and vendor. It makes life easier for them.
However, every real-world application and community points to the natural outcome of multiple schemas and databases. This reality, in fact, is what has led to the whole topic of data federation and the need to resolve physical, semantic, syntactic and other schema heterogeneities.
Designers can certainly be clever in anticipating growth, drawing on the kinds of changes seen in the past. Leaving "open slots" or "generic fields" in schemas is often proposed and may allow for a little bit of growth. Quite a bit of schema evolution can perhaps also be mitigated up front.
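To make the "generic fields" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and sample values are all hypothetical, made up purely for illustration; the point is only to show how quickly a fixed number of open slots gets used up.

```python
import sqlite3

# Hypothetical schema: two fixed columns plus two "generic" slots left open
# for attributes nobody anticipated at design time.
GENERIC_SLOTS = ("generic1", "generic2")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample (id INTEGER PRIMARY KEY, name TEXT, "
    "generic1 TEXT, generic2 TEXT)"
)

# A side table records what each generic slot currently means.
conn.execute("CREATE TABLE slot_meaning (slot TEXT PRIMARY KEY, meaning TEXT)")

# A new, unplanned attribute arrives and claims the first open slot.
conn.execute("INSERT INTO slot_meaning VALUES ('generic1', 'assay_type')")
conn.execute("INSERT INTO sample (name, generic1) VALUES ('S-001', 'ELISA')")


def slots_remaining(connection):
    """Count generic slots that have not yet been assigned a meaning."""
    used = connection.execute("SELECT COUNT(*) FROM slot_meaning").fetchone()[0]
    return len(GENERIC_SLOTS) - used


remaining = slots_remaining(conn)
row = conn.execute("SELECT name, generic1 FROM sample").fetchone()
```

With only two slots in this sketch, a third unanticipated attribute already forces a schema change, which is why this technique buys only "a little bit" of growth.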
But the reality of diversity remains. The semantic Web and proliferation of user-generated metadata will only exacerbate these challenges. Simply ask the biological or physics communities what they have seen in trying to find a single "grand schema." They have not found one, and the search is a chimera.
Thus, smart design does not begin with a naive single database premise. It recognizes that information exists in many forms in many places and in many transmutations from many viewpoints. And what is important today will surely change tomorrow. Explicit recognition of these realities is critical to successful upfront information management design.
Viva la multiple databases!
I just came across an easily readable and accessible short series on the semantic Web by Sunil Goyal of the enventure blog. The four-part series consists of:
- Part 1 — introduction and overview of various Web services
- Part 2 — the challenges of data integration
- Part 3 – RDF and OWL data models and service-oriented middleware, and
- Part 4 — user applications, enterprise systems, research applications and themes and services.
If you are new to this topic, you may find this series an easy first introduction.
Earlier posts have noted the near-term importance of the federal government to integrated document management and integration of open source software. A recent article by Darryl K. Taft of eWeek.com titled "GSA Modernizes With Open-Source Stack" indicates the lead role the General Services Administration will play, at least on the civilian side of the government. According to the article:
George Thomas, a chief architect at the General Services Administration, said the GSA is leading the effort to deliver an OSERA (Open Source eGov Reference Architecture) that will feature foundational technologies such as MDA (Model Driven Architecture), an ESB (enterprise service bus), an SOA (service-oriented architecture) and the Semantic Web, among other things.
OSERA deserves close tracking as the federal government implements these standards. GSA has set up a Web site on OSERA that is still awaiting content.
There were a number of references to the UMBC Semantic Web Reference Card – v2 when it was first posted about a month ago. Because it is so useful, I chose to bookmark the reference and post about it again later (today), after the initial attention had faded.
According to the site:
The UMBC Semantic Web Reference Card is a handy "cheat sheet" for Semantic Web developers. It can be printed double sided on one sheet of paper and tri-folded. The card includes the following content:
- RDF/RDFS/OWL vocabulary
- RDF/XML reserved terms (they are outside RDF vocabulary)
- a simple RDF example in different formats
- SPARQL semantic web query language reference
- many handy facts for developers.
The reference card is provided through the University of Maryland, Baltimore County (UMBC) eBiquity program. The eBiquity site provides excellent links to semantic Web publications as well as generally useful information on context-aware computing; data mining; ecommerce; high-performance computing; knowledge representation and reasoning; language technology; mobile computing; multi-agent systems; networking and systems; pervasive computing; RFID; security, trust and privacy; semantic Web, and Web services.
The UMBC eBiquity program also maintains the Swoogle service. Swoogle crawls and indexes semantic Web RDF and OWL documents encoded in XML or N3. As of today, Swoogle contains about 350,000 documents and over 4,100 ontologies.
The Reference Card itself is available as a PDF download. Highly recommended!
In earlier posts I have put forward a vision for the semantic Web in the enterprise that has an extensible database supporting semi-structured data at its core with XML mediating multiple ingest feeds, interaction with analytic tools, and sending results to visualization and reporting tools.
This is well and good as far as it goes. However, inevitably, whenever more than one tool or semi-structured dataset is added to a system, it brings with it a different “view” of the world. Formalized and standardized protocols and languages are needed to both: 1) capture these disparate “views” and 2) provide facilities to map them to resolve data and schema federation heterogeneities. These are the roles of RDF and OWL.
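As a rough illustration of these two roles, the sketch below represents two "views" of the same customer as plain subject-predicate-object tuples and then maps them into one merged view. It deliberately uses no RDF library. Every identifier in it (the `crm:`, `bill:` and `ex:` prefixes, the property names, the sample values) is an assumption invented for the example, and the two mapping dictionaries only loosely imitate what OWL statements such as `owl:equivalentProperty` and `owl:sameAs` express. No reasoning is performed.

```python
# Two hypothetical sources describe the same customer in different vocabularies.
crm_triples = {
    ("crm:cust42", "crm:fullName", "Ada Lovelace"),
    ("crm:cust42", "crm:city", "London"),
}
billing_triples = {
    ("bill:acct7", "bill:name", "Ada Lovelace"),
    ("bill:acct7", "bill:balance", "120.50"),
}

# Mapping statements (assumed for illustration):
# property_map plays the part of property equivalences,
# same_as plays the part of identity links between subjects.
property_map = {"crm:fullName": "ex:name", "bill:name": "ex:name"}
same_as = {"bill:acct7": "crm:cust42"}


def federate(*sources):
    """Merge triple sets, rewriting subjects and predicates via the mappings."""
    merged = set()
    for triples in sources:
        for subject, predicate, obj in triples:
            merged.add(
                (
                    same_as.get(subject, subject),
                    property_map.get(predicate, predicate),
                    obj,
                )
            )
    return merged


merged = federate(crm_triples, billing_triples)
for triple in sorted(merged):
    print(triple)
```

Because both name properties map to the same shared term and both subjects resolve to one identifier, the duplicate name statement collapses into a single triple, while the unmapped city and balance properties pass through unchanged.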
Fortunately, there is a very active community with tools and insights for working in RDF and OWL. Stanford and UMBC are perhaps the two leading centers of academic excellence.
If you are not generally familiar with this stuff, I recommend you begin with the recent “Order from Chaos” from Natalya Noy of the Protégé group at Stanford Medical. This piece describes issues such as trust that are likely less relevant to applying the semantic Web to enterprise intranets than to the cowboy nature of the broader Internet. However, much of the rest of the article is of general use to the architect considering enterprise applications.
To keep things simple and to promote interoperability, a critical aspect of any enterprise semantic Web implementation will be providing the “data API” standards (including extensible XML, RDF and OWL) that govern the rules of how to play in the sandbox. Time spent defining these rules of engagement will pay off in spades relative to any other approach for multiple ingest, multiple analytic tools and multiple audiences, reports and collaboration.
Another advantage of this approach is the existence of many open source tools for managing such schemas (e.g., Protégé) and for visualization (literally dozens), along with thousands of ontologies and other intellectual property.