A little-remarked chink in the armor for deployment of an Internet-wide semantic Web is scalability.
Recent Jena Results
The most recent report on this topic, from Katie Portwin & Priya Parvatikar, entitled "Scaling Jena in a Commercial Environment: The Ingenta MetaStore Project," is one of a number of very useful papers from last week’s (May 10th and 11th) 2006 Jena User Conference in Bristol, UK. Though this paper shows continued incremental progress from prior reports, it is just that — incremental — and gives pause to the ability of the semantic Web to scale with current approaches.
The common terminology for semantic Web databases or data stores is a "triple store" or RDF store or sometimes even a native XML database. (Terminology is still being worked out and for some of the reasons noted below is uncertain and in flux.) RDF or the triple store idea comes from the fact that to address the meaning (read: semantics) of data it is necessary to establish its relationship to some reference base. Thus, in the subject-predicate-object nomenclature of first-order logic inherent to triples, this whole "relation to" stuff introduces a lot of unusual complexity. That truth, plus the fact that the RDF, RDF-S or OWL standards mean the data interchange is also occurring via straight text, makes indexing and storage problematic from the get-go.
The net effect is that scalability is questionable. (Among other technical challenges.)
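To make the triple idea concrete, here is a minimal sketch of a handful of triples held in memory and written out as text lines in an N-Triples-like form. The `Triple` type, the helper names and every `example.org` identifier are invented for illustration; this is not how Jena or any other store represents its data internally.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One statement: a subject related to an object through a predicate."""
    subject: str
    predicate: str
    obj: str


# Hypothetical identifiers, made up for this sketch.
DOC = "http://example.org/doc/1"
TITLE = "http://example.org/vocab/title"
CREATOR = "http://example.org/vocab/creator"
NAME = "http://example.org/vocab/name"
PERSON = "http://example.org/person/1"

triples = [
    Triple(DOC, TITLE, '"An Example Title"'),
    Triple(DOC, CREATOR, PERSON),
    Triple(PERSON, NAME, '"A. Author"'),
]


def to_text_line(t: Triple) -> str:
    """Serialize one triple as an N-Triples-style line of plain text."""
    def term(value: str) -> str:
        # Literals are already quoted; everything else is treated as a URI.
        return value if value.startswith('"') else f"<{value}>"
    return f"{term(t.subject)} {term(t.predicate)} {term(t.obj)} ."


def objects_of(store: List[Triple], subject: str, predicate: str) -> List[str]:
    """Answer 'what is this subject related to through this predicate?'"""
    return [t.obj for t in store if t.subject == subject and t.predicate == predicate]


if __name__ == "__main__":
    for t in triples:
        print(to_text_line(t))
    print(objects_of(triples, DOC, TITLE))
```

Even in this toy, each short fact drags along three full identifiers as text, which hints at why storage and indexing get heavy as the count grows.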
In the Portwin & Parvatikar paper, their design does all of the proper front-end stuff regarding processing the triples for querying, and then stores the resulting combinations in a relational database management system (RDBMS). Where an RDBMS is not used, other approaches rely on native datastores, object-oriented databases or inline tagging within conventional search engine (or information retrieval) systems, a subject I have spoken about regarding Enterprise Semantic Webs (ESW) Demand New Database Paradigms and the unique demands of Semi-structured Data Systems. These are subjects I will return to many times in the future. (Databases may not be sexy, but they are the foundation and the crux of such matters.) The approach taken by Portwin & Parvatikar is quite common because of the market acceptance and mature capabilities of RDBMSs, though such choices are often flamed because of silly "religious" prejudices. In any case, these topics remain the targets of hot debate.
The Portwin & Parvatikar paper claims scaling up to 200 million triples at acceptable performance. However, apples may be crabapples or sometimes they may be road apples, even if we don’t have to get into the apples-and-oranges wider debate. So, let’s dissect these 200 million triples:
- They actually represent only 4.3 million document records, since each document had an average of 47 triples
- These were only titles and minor abstracts, not full document content
- Use of OWL was actually problematic, because it introduces even more complex and subtle relationships (also therefore harder to model with greater storage requirements and processing times). Indeed, the use of OWL in this context seemed to hit a brick wall at about 11 million triples, or about 5% of what was shown for RDF
- Total storage for these records was about 65 GB.
While use of the RDBMS provides the backup, administration, transaction and other support desirable with a commercial DB, the system does not provide easy XML RDF support and has a hard time with text and text indexing. Query performance is also hard to benchmark vis-à-vis other systems. It is kind of hard to understand why each doc requires about 16 KB of storage for these relationships, when that is pretty close to the average when storing a complete full-text index for the entire document itself using an efficient system (as BrightPlanet’s XSD Engine provides).
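The per-document figures above are easy to sanity-check with a few lines of arithmetic. This is only a sketch using the numbers quoted from the paper; the one assumption of mine is that "GB" is read as binary gigabytes (2^30 bytes), which is the reading that lands closest to 16 K per document.

```python
# Figures quoted above from the Portwin & Parvatikar paper.
TOTAL_TRIPLES = 200_000_000
DOCUMENTS = 4_300_000
TOTAL_STORAGE_GB = 65
OWL_CEILING_TRIPLES = 11_000_000

# Assumption: "GB" means 2**30 bytes. With 10**9 bytes the result is ~15.1 KB.
BYTES_PER_GB = 2 ** 30

total_bytes = TOTAL_STORAGE_GB * BYTES_PER_GB

triples_per_doc = TOTAL_TRIPLES / DOCUMENTS
kib_per_doc = total_bytes / DOCUMENTS / 1024
bytes_per_triple = total_bytes / TOTAL_TRIPLES
owl_fraction = OWL_CEILING_TRIPLES / TOTAL_TRIPLES

if __name__ == "__main__":
    print(f"triples per document: {triples_per_doc:.1f}")
    print(f"storage per document: {kib_per_doc:.1f} KiB")
    print(f"storage per triple:   {bytes_per_triple:.0f} bytes")
    print(f"OWL ceiling vs. RDF:  {owl_fraction:.1%}")
```

Seen per triple, the same totals work out to a few hundred bytes for each stored statement, which is another way of looking at the overhead.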
‘Triple Stores’ Are Way Too Optimistic
So, again, we come to another dirty little secret of the semantic Web. It is truly (yes, TRULY) not uncommon to see ten-fold storage increases with semantically-aware document sets. Sure, storage is cheap, but it is hard to see why a given document should bloat in size more than 10-fold simply by adding metadata and relationship awareness. First, efficient full-text indexing systems, again pointing to BrightPlanet, actually result in storage of 30% of the original document, and without stemming or stop lists. Second, most of the semantic stuff stored regarding a document is solely in nouns and verbs. These are already understandable within the actual text: WHY DO WE NEED TO STORE THIS STUFF AGAIN??? So, actually, while many lament the 10x increase in semantic Web and RDF-type triple storage, the actual reality is even worse. Since we can store the complete full-text document index in 30% of original, which overlaps to 90% or more with what is stored for "semantic" purposes, the semantic storage bloat with typical approaches is actually closer to 33 times!
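For anyone who wants to see where the 33x comes from, the arithmetic can be sketched as below. The two ratios are the ones stated above; the 100 KB document size is an arbitrary value picked for illustration and cancels out of the result.

```python
# Ratios stated in the text, both relative to the original document size.
SEMANTIC_STORAGE_RATIO = 10.0   # ten-fold increase with semantic awareness
FULLTEXT_INDEX_RATIO = 0.30     # full-text index at 30% of the original


def storage_bloat(doc_kb: float) -> float:
    """Semantic storage divided by full-text index storage for one document."""
    semantic_kb = doc_kb * SEMANTIC_STORAGE_RATIO
    index_kb = doc_kb * FULLTEXT_INDEX_RATIO
    return semantic_kb / index_kb


if __name__ == "__main__":
    doc_kb = 100.0  # hypothetical document size; any positive value works
    print(f"semantic storage:  {doc_kb * SEMANTIC_STORAGE_RATIO:.0f} KB")
    print(f"full-text index:   {doc_kb * FULLTEXT_INDEX_RATIO:.0f} KB")
    print(f"bloat factor:      {storage_bloat(doc_kb):.1f}x")
```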
Storage size comparisons seem pretty esoteric. But every programmer worth his salt understands that performance — the real metric of interest — is highly correlated to code size and data storage.
Finally, it should be appreciated that inferred information has extremely high value across repositories. As soon as per-document limits are broken (which they already have been), cross-repository issues move into the cross hairs. These are the problems of ultimate payback. And, because of bottlenecks resulting from poor performance and storage demands, virtually no one is doing that today.
Apples v. Oranges v. Road Apples v. Crabapples
The unfortunate reality is that we don’t have accepted standards and protocols for testing the indexing, storage, querying and set retrieval of semantic Web data. What is a "statement"? What is a "triple"? What is the relationship of these to an actual document? What should be in the denominator to meaningfully measure performance? When will search on metadata and full text (surely, yes!) be efficiently achieved in a combined way? Are we testing with exemplar current technology, or relying on atypical hardware, CPU and storage to create the appearance of performance? Even the concept of the "triple store" and its reliable use as a performance metric is easily called into question.
According to Scalability: A Semantic Web Perspective My Background:
It turns out that Sesame’s (and probably every triple store’s) performance is highly dependent on not just things like processor speed, amount of RAM, etc., but also very much depends on the structure of the RDF that you are using to test the system.
There are a number of aspects of test data that may influence results. One is the number of new URIs introduced per triple. . . . if you decide to test scalability with an artificially generated dataset, and you are not careful, you can get badly skewed results. . . .Another big one has to do with namespaces. Sesame internally splits each incoming URI into a namespace and local name . . . Works great. Except, the Uniprot dataset contains URNs of this form:
When we tried to add the dataset to a Sesame store, Sesame dutifully split each URI into a namespace and a local name . . . [with unfortunate results] . . . . and we ended up with a single, shared, namespace for all these URNs). Because we never assumed the namespace set to become that big and simply cache every namespace in memory (to improve lookup speed), we ran into big performance problems after adding little more than 5 million statements. Conclusion: Sesame sucks!
On the other hand, we also ran some tests with the Lehigh University benchmark data set. This (generated) dataset does not have this particular URI-splitting problem, so we happily added 70 million triples to a native store, without hitting a ceiling. Conclusion: Sesame rules!
Of course, these disparities are not unique to RDF data storage nor the semantic Web. But it is also the case that needs drive the formulation of standards and testing protocols. Lehigh University is to be commended for its leadership in showing the way in this area as well as being a willing clearinghouse for standards and datasets.
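The namespace pitfall in the quote above can be sketched in a few lines. To be clear about the assumptions: the splitting rule below (cut at the last `#`, `/` or `:`) is a guess made for illustration and is not Sesame's actual code, and the URN shape is invented, since the Uniprot form is not reproduced in the quote. The point is only that the same number of URIs can produce either one cached namespace or one per URI, depending on their structure.

```python
from typing import Dict, Iterable, Tuple


def split_uri(uri: str) -> Tuple[str, str]:
    """Split a URI into (namespace, local name) at its last '#', '/' or ':'.

    This rule is an assumption made for the sketch, not Sesame's code.
    """
    cut = max(uri.rfind(sep) for sep in "#/:")
    return uri[:cut + 1], uri[cut + 1:]


def namespace_cache_size(uris: Iterable[str]) -> int:
    """Cache every namespace in memory, as the quote describes, and count them."""
    cache: Dict[str, int] = {}
    for uri in uris:
        namespace, _local_name = split_uri(uri)
        cache.setdefault(namespace, len(cache))  # assign an id on first sight
    return len(cache)


N = 10_000

# Typical HTTP-style URIs: the varying part comes after the last separator.
http_style = [f"http://example.org/protein/{i}" for i in range(N)]

# Hypothetical URNs where the varying part sits BEFORE the last separator.
urn_style = [f"urn:example:{i}:record" for i in range(N)]

if __name__ == "__main__":
    print("HTTP-style namespaces:", namespace_cache_size(http_style))
    print("URN-style namespaces: ", namespace_cache_size(urn_style))
```

With the first dataset the cache holds a single entry; with the second it grows in step with the data, which is the kind of behavior that turns a lookup optimization into a memory problem.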
250 M Triples: The Current High Water Mark
Efforts of the past few years have only inched up the performance and scale metric. Cynically, one can say that trending this stuff out over the past few years has been the result of hardware improvements alone. Unfortunately, there may be some real truth to this. Many of the notable performance benchmark increases of the past couple of years seemingly have more to do with the use of 64-bit platforms than the fundamental insights to truly scale to Internet dimensions.
The fault has been in the software paradigm and indexing paradigm and perhaps other data representation paradigms, because be it an AI language, an existing RDBMS or OODBMS data system, or even a "native" storage system, they all choke and burn at the limit of about 250 million triples, and most choke well before that. Remember, this triple metric translates into only a bit more than 5 million documents (still without full-text search), fairly typical for perhaps a Fortune 2000 enterprise, but at least four orders of magnitude lower than real Internet scales. While useful, we should not have to look to 64-bit computing to overcome these limits, but rather combine it with a better understanding of what is appearing to be a truly unique data storage paradigm.
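The conversion from triples to documents, and what "four orders of magnitude" implies, can be sketched as follows. The 47 triples per document comes from the Ingenta figures earlier in this post and is assumed here to hold for other stores as well; the Internet-scale numbers are simply what the stated gap implies, not independent measurements.

```python
TRIPLE_CEILING = 250_000_000   # approximate high-water mark cited above
TRIPLES_PER_DOC = 47           # average from the Ingenta figures (assumed typical)
ORDERS_OF_MAGNITUDE_GAP = 4    # "at least four orders of magnitude"

documents_at_ceiling = TRIPLE_CEILING / TRIPLES_PER_DOC

# What the stated gap implies if the same ratios held at Internet scale.
implied_internet_documents = documents_at_ceiling * 10 ** ORDERS_OF_MAGNITUDE_GAP
implied_internet_triples = TRIPLE_CEILING * 10 ** ORDERS_OF_MAGNITUDE_GAP

if __name__ == "__main__":
    print(f"documents at the ceiling:   {documents_at_ceiling / 1e6:.1f} million")
    print(f"implied Internet documents: {implied_internet_documents / 1e9:.0f} billion")
    print(f"implied Internet triples:   {implied_internet_triples / 1e12:.1f} trillion")
```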
At any rate, a compilation of triple store performance would not be complete without citing these sources:
- Prolog-based Infrastructure for RDF: Scalability and Performance – storage has achieved about 40 million triples
- 3store: Efficient Bulk RDF Storage – has achieved about 20 million triples and 5000 classes and properties.
- Tucana Semantic Analysis and Interoperability – has achieved about 100 million statements (32-bit) or 1 billion statements (64-bit); most recently, the 32-bit figure increased to 350 million RDF statements (Northrop Grumman on Tucana)
- According to Solutions: Technology – other key entities have benchmarks of: Aduna, ~ 70 million statements; BRAHMS (U. of Georgia), ~ 100 million statements; OWLIM (Ontotext), ~ 40 million statements; DLDB (Lehigh Univ), ~ 13 million statements; general consensus, ~ 45 million statements for real Semantic Web documents.
- Finally, a posting on Data Stores from the W3C points to other benchmarks and engines. Do check it out.
These numbers are hardly Google class. Given these scale demands, it is perhaps not surprising, then, that many of our favorite Web 2.0 sites such as Flickr, Technorati or Bloglines at times perform atrociously. The major Internet players of Amazon, eBay, Yahoo! and Google truly get it, but they are also truly adding hardware capacity at either breakneck or bankruptcy speed (depending on how one looks at it) to keep pace with these interactive site designs.
Oh, By the Way: Indexing and Querying Also Suck
This post has looked mostly at the question of storage, but another unfortunate truth is that existing semantic Web data systems also suck in the times it takes for initial ingest and subsequent repository querying and retrieval. Some general sources for these observations come from combining metrics from OntoWorld and an interesting post on LargeTripleStores on the ESW (W3C) Wiki.
Here are some additional useful links from the ESW wiki, TripleStoreScalability regarding scalability:
- Vineet Sinha, 2006, RDF DB Shootout: Preliminary Results. See also the follow-up discussion on simile-general
- Guo, Yuanbo; Pan, Zhengxiang; Heflin, Jeff, Lehigh University Benchmark.
- R. Lee, MIT 2004, Scalability Report on Triple Store Applications
- S. Harris and N. Gibbins, University of Southampton 2003, 3store: Efficient Bulk RDF Storage (PDF)
- D. Beckett, University of Bristol, 2002, SWAD-Europe: Scalability and storage: Survey of free software / open source RDF storage systems
- D. Wood, P. Gearon, University of Queensland, and T. Adams, Bosatsu Consulting, Inc., 2002, Kowari: A Platform for Semantic Web Storage and Analysis
- Scalable Semantic Web Knowledge Base Systems (SSWS 2005), at WISE 05, New York, November 20, 2005
- Proceedings of the First International Workshop on Practical and Scalable Semantic Systems, Sanibel Island, Florida, USA October 20, 2003
Is It Bikini Briefs or the Full Monty for the Emperor?
So, is this semantic Web stuff totally naked? Does the emperor have any clothes? And how do we separate the hype from the reality?
It is hard to escape the fact that enterprise-scale performance still eludes current technologies attempting to provide the semantic Web infrastructure. Since enterprises are themselves still three to five orders of magnitude smaller than the general Internet, I think we can also safely assert that semantic Web data storage and indexing is also not ready for global prime time.
Actually, the latter point is not all that upsetting, given the more fundamental bottleneck of having the need for sufficient semantic content across the Internet to trigger the network effect. That will happen, but not tomorrow (though also likely sooner than any of us rational data and statistics hounds will foresee).
With today’s limited enterprise experiments, it is essential there be loads of RAM (and, BTW, fast SANs or SCSI drives to boot since everything has to page to disk once we stop talking about toys) for any semantic Web system to even sort of approximate acceptable performance. This is not acceptable and not sustainable.
More hardware, fancy hash algorithms, or work-shifting to external smart parsers are only temporary panaceas to the abysmal performance profiles of semantic Web implementations. Fundamentally new technology is needed not only to redress the limitations of existing standups, but to open the door for the next generation of applications. And, for all of this to occur, performance and storage breakthroughs of at least one order of magnitude, and preferably two to three, will be required. I have the personal good fortune of being right in the middle of these exciting prospects. . . .