Posted: May 23, 2006

Three related shifts in data use and management are intersecting to create a unique market opportunity. It represents a generational change from an era of structured data in stove-piped systems managed by relational databases to one of semi-structured data in hugely scaled, interconnected and interoperable networks. How this next-generation system will be managed is still unclear; the answer represents the major market opportunity.

The specific shifts driving this change are:

And, of course, all of this is occurring within the context of explosive data access and growth, literally at Internet scales.

This opportunity is first presenting itself through the leadership and support of the federal intelligence, defense and homeland security agencies. Certain industries, notably financial services and pharmaceuticals, are the next likely to show enterprise applicability, and specific academic domains, particularly in biology, are also blazing this trail. The emerging leaders in this nascent market will likely be determined within the next two to three years.

Posted by AI3's author, Mike Bergman Posted on May 23, 2006 at 8:14 pm in Semantic Web, Software and Venture Capital | Comments (0)
Posted: May 18, 2006

A little-remarked chink in the armor for deployment of an Internet-wide semantic Web is scalability.

Recent Jena Results

The most recent report on this topic, from Katie Portwin & Priya Parvatikar, entitled "Scaling Jena in a Commercial Environment: The Ingenta MetaStore Project," is one of a number of very useful papers from last week’s (May 10th and 11th) 2006 Jena User Conference in Bristol, UK. Though this paper shows continued incremental progress from prior reports, it is just that — incremental — and gives pause regarding the ability of the semantic Web to scale with current approaches.

The common terminology for semantic Web databases or data stores is a "triple store," RDF store, or sometimes even a native XML database.  (Terminology is still being worked out and, for some of the reasons noted below, remains uncertain and in flux.)  RDF and the triple store idea come from the fact that to address the meaning (read: semantics) of data it is necessary to establish its relationship to some reference base.  Thus, in the subject-predicate-object nomenclature of the first-order logic inherent to triples, all of this "relation to" machinery introduces a lot of unusual complexity.  That truth, plus the fact that data interchange under the RDF, RDF-S or OWL standards occurs via straight text, makes indexing and storage problematic from the get-go.
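For readers new to the model, the triple idea is easy to sketch in code. The following is a minimal, illustrative in-memory store in Python; the statements are hypothetical examples (not from any real dataset), and real triple stores must do this at hundreds of millions of statements, which is where the trouble begins:

```python
# A minimal, illustrative in-memory "triple store": each statement is a
# (subject, predicate, object) tuple, as in RDF. Names are hypothetical.

triples = set()

def add(subject, predicate, obj):
    """Record one statement."""
    triples.add((subject, predicate, obj))

def match(subject=None, predicate=None, obj=None):
    """Return all triples matching the given pattern (None = wildcard)."""
    return [
        (s, p, o) for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

add("doc:123", "dc:title", "Scaling Jena in a Commercial Environment")
add("doc:123", "dc:creator", "Katie Portwin")
add("doc:123", "dc:creator", "Priya Parvatikar")

# Every query is a pattern over the relation itself; at scale, each
# wildcard position ideally wants its own index, which is one reason
# storage and indexing grow so quickly.
print(len(match("doc:123", "dc:creator")))  # 2
```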

The net effect is that scalability is questionable (among other technical challenges).

In the Portwin & Parvatikar paper, their design does all of the proper front-end processing of the triples for querying, then stores the resulting combinations in a relational database management system (RDBMS).  Other approaches use native datastores, object-oriented databases, or inline tagging within conventional search engine (information retrieval) systems, subjects I have addressed in Enterprise Semantic Webs (ESW) Demand New Database Paradigms and regarding the unique demands of semi-structured data systems.  These are subjects I will return to many times in the future.  (Databases may not be sexy, but they are the foundation and the crux of such matters.)  The approach taken by Portwin & Parvatikar is quite common because of the market acceptance and mature capabilities of RDBMSs, though such choices are often flamed because of silly "religious" prejudices.  In any case, these topics remain the targets of hot debate.
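The RDBMS-backed pattern is simple to sketch. The schema and sample data below are purely illustrative (using SQLite for convenience), not the actual Ingenta MetaStore design; the point is the shape of the approach: one skinny three-column table, plus an index per query pattern.

```python
# Sketch of the RDBMS-backed triple store approach: triples in a single
# three-column table, with indexes to support pattern queries.
# Schema and data are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE triples (
        subject   TEXT NOT NULL,
        predicate TEXT NOT NULL,
        object    TEXT NOT NULL
    )
""")
# In practice, each access pattern (subject-bound, predicate-bound,
# object-bound) wants its own index, compounding storage requirements.
conn.execute("CREATE INDEX idx_sp ON triples (subject, predicate)")
conn.execute("CREATE INDEX idx_po ON triples (predicate, object)")

conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("doc:123", "dc:title", "Scaling Jena in a Commercial Environment"),
        ("doc:123", "dc:creator", "Katie Portwin"),
        ("doc:123", "dc:creator", "Priya Parvatikar"),
    ],
)

rows = conn.execute(
    "SELECT object FROM triples WHERE subject = ? AND predicate = ? "
    "ORDER BY object",
    ("doc:123", "dc:creator"),
).fetchall()
print([r[0] for r in rows])  # ['Katie Portwin', 'Priya Parvatikar']
```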

The Portwin & Parvatikar paper claims scaling up to 200 million triples at acceptable performance.  However, apples may be crabapples or sometimes they may be road apples, even if we don’t have to get into the wider apples-and-oranges debate.  So, let’s dissect these 200 million triples:

  • They actually represent only 4.3 million document records, since each document averaged 47 triples
  • These were only titles and brief abstracts, not full document content
  • Use of OWL was actually problematic, because it introduces even more complex and subtle relationships (also therefore harder to model with greater storage requirements and processing times).  Indeed, the use of OWL in this context seemed to hit a brick wall at about 11 million triples, or about 5% of what was shown for RDF
  • Total storage for these records was about 65 GB.
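The arithmetic behind these bullets is worth making explicit. A quick back-of-the-envelope check, with all figures as reported above:

```python
# Back-of-the-envelope check of the reported numbers: 200 million
# triples, ~47 triples per document, ~65 GB total storage.
total_triples = 200_000_000
triples_per_doc = 47
total_storage_gb = 65

docs = total_triples / triples_per_doc
print(round(docs / 1e6, 1))          # ~4.3 million document records

bytes_per_doc = total_storage_gb * 1e9 / docs
print(round(bytes_per_doc / 1024))   # ~15 KB of triple storage per doc
```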

While use of an RDBMS provides the backup, administration, transaction and other support desirable in a commercial DB, the system does not provide easy RDF/XML support and has a hard time with text and text indexing.  Query performance is also hard to benchmark vis-à-vis other systems.  And it is hard to understand why each document requires about 16 KB of storage for these relationships, when that is pretty close to the average for storing a complete full-text index of the entire document itself in an efficient system (such as BrightPlanet’s XSD Engine).

‘Triple Stores’ Are Way Too Optimistic

So, again, we come to another dirty little secret of the semantic Web.  It is truly (yes, TRULY) not uncommon to see ten-fold storage increases with semantically aware document sets.  Sure, storage is cheap, but it is hard to see why a given document should bloat more than 10-fold in size simply by adding metadata and relationship awareness.  First, efficient full-text indexing systems (again pointing to BrightPlanet) actually result in storage of about 30% of the original document size, without stemming or stop lists.  Second, most of the semantic information stored about a document resides solely in its nouns and verbs, which are already discernible within the actual text:  WHY DO WE NEED TO STORE THIS STUFF AGAIN???  So, while many lament the 10x increase in storage for semantic Web and RDF-type triples, the actual reality is even worse.  Since we can store the complete full-text document index in 30% of the original size, and that index overlaps 90% or more with what is stored for "semantic" purposes, the semantic storage bloat with typical approaches is actually closer to 33 times!
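For those who want the arithmetic behind that 33x figure, here it is, assuming (per the paragraph above) a full-text index at roughly 30% of original document size and typical semantic storage at roughly 10x:

```python
# The 33x bloat figure, made explicit. Ratios are as stated in the post.
fulltext_index_ratio = 0.30    # efficient full-text index vs. original doc
semantic_storage_ratio = 10.0  # typical RDF/triple storage vs. original doc

bloat_vs_fulltext = semantic_storage_ratio / fulltext_index_ratio
print(round(bloat_vs_fulltext, 1))  # 33.3
```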

Storage size comparisons seem pretty esoteric.  But every programmer worth his salt understands that performance — the real metric of interest — is highly correlated to code size and data storage.

Finally, it should be appreciated that inferred information has extremely high value across repositories.  As soon as per-document limits are broken, which they already have been, cross-repository issues move into the cross hairs.  These are the problems of ultimate payback.  And, because of bottlenecks resulting from poor performance and storage demands, virtually no one is doing that today.

Apples v. Oranges v. Road Apples v. Crabapples

The unfortunate reality is that we don’t have accepted standards and protocols for testing the indexing, storage, querying and set retrieval of semantic Web data.  What is a "statement"?  What is a "triple"?  What is the relationship of these to an actual document?  What should be in the denominator to meaningfully measure performance?  When will search on metadata and full text (surely, yes!) be efficiently achieved in a combined way?  Are we testing with representative current technology, or relying on atypical hardware, CPU and storage to create the appearance of performance?  Even the concept of the "triple store" and its reliable use as a performance metric is easily called into question.

According to the post Scalability: A Semantic Web Perspective:

It turns out that Sesame’s (and probably every triple store’s) performance is highly dependent on not just things like processor speed, amount of RAM, etc., but also very much depends on the structure of the RDF that you are using to test the system.

There are a number of aspects of test data that may influence results. One is the number of new URIs introduced per triple. . . . If you decide to test scalability with an artificially generated dataset, and you are not careful, you can get badly skewed results. . . . Another big one has to do with namespaces. Sesame internally splits each incoming URI into a namespace and local name . . . Works great. Except, the Uniprot dataset contains URNs of this form:


When we tried to add the dataset to a Sesame store, Sesame dutifully split each URI into a namespace and a local name . . .  [with unfortunate results] . . . and we ended up with a single, shared namespace for all these URNs. Because we never assumed the namespace set would become that big and simply cache every namespace in memory (to improve lookup speed), we ran into big performance problems after adding little more than 5 million statements. Conclusion: Sesame sucks!

On the other hand, we also ran some tests with the Lehigh University benchmark data set. This (generated) dataset does not have this particular URI-splitting problem, so we happily added 70 million triples to a native store, without hitting a ceiling. Conclusion: Sesame rules!
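The namespace-splitting behavior described above is easy to illustrate. The split heuristic and the sample identifiers below are my own sketch, not Sesame's actual code; the point is that a heuristic tuned for http-style URIs behaves very differently on URN-style identifiers.

```python
# Sketch of the namespace/local-name split that tripped up Sesame.
# Heuristic and examples are illustrative only.

def split_uri(uri):
    """Split at the last '#' or '/' into (namespace, local name)."""
    for sep in ("#", "/"):
        idx = uri.rfind(sep)
        if idx != -1:
            return uri[: idx + 1], uri[idx + 1 :]
    # URN-style identifiers contain neither separator, so a colon-based
    # fallback kicks in. Depending on the heuristic and the data, URNs
    # either collapse into a few shared namespaces or explode into
    # millions of them; either way, a namespace cache sized for
    # well-behaved http URIs can fall over.
    idx = uri.rfind(":")
    return uri[: idx + 1], uri[idx + 1 :]

print(split_uri("http://example.org/ontology#Person"))
# ('http://example.org/ontology#', 'Person')
print(split_uri("urn:lsid:uniprot.org:Q12345"))
# ('urn:lsid:uniprot.org:', 'Q12345')
```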

Of course, these disparities are not unique to RDF data storage nor the semantic Web.  But it is also the case that needs drive the formulation of standards and testing protocols.  Lehigh University is to be commended for its leadership in showing the way in this area, as well as being a willing clearinghouse for standards and datasets.

250 M Triples:  The Current High Water Mark

Efforts of the past few years have only inched up the performance and scale metric.  Cynically, one can say that the improvement over this period has been the result of hardware alone.  Unfortunately, there may be some real truth to this.  Many of the notable benchmark increases of the past couple of years seemingly have more to do with the use of 64-bit platforms than with the fundamental insights needed to truly scale to Internet dimensions.

The fault has been in the software paradigm, the indexing paradigm and perhaps other data representation paradigms: be it an AI language, an existing RDBMS or OODBMS data system, or even a "native" storage system, they all choke and burn at the limit of about 250 million triples, and most choke well before that. Remember, this triple metric translates into only a bit more than 5 million documents (still without full-text search), fairly typical for perhaps a Fortune 2000 enterprise, but at least four orders of magnitude lower than real Internet scales.  While useful, we should not have to look to 64-bit computing to overcome these limits, but rather combine it with a better understanding of what is appearing to be a truly unique data storage paradigm.

At any rate, a compilation of triple store performance would not be complete without citing these sources:

  • Prolog-based Infrastructure for RDF: Scalability and Performance – storage has achieved about 40 million triples
  • 3store: Efficient Bulk RDF Storage – has achieved about 20 million triples and 5,000 classes and properties.
  • Tucana Semantic Analysis and Interoperability – has achieved about 100 million statements (32-bit) or 1 billion statements (64-bit), with the 32-bit figure most recently increased to 350 million RDF statements (Northrop Grumman on Tucana)
  • According to Solutions: Technology – key other entities have benchmarks of:  Aduna, ~ 70 million statements; BRAHMS (U. of Georgia), ~ 100 million statements; OWLIM (Ontotext), ~ 40 million statements; DLDB (Lehigh Univ), ~ 13 million statements; general consensus, ~ 45 million statements for real Semantic Web documents.
  • Finally, a posting on Data Stores from the W3C points to other benchmarks and engines.  Do check it out.

These numbers are hardly Google class.  Given these scale demands, it is perhaps not surprising, then, that many of our favorite Web 2.0 sites such as Flickr, Technorati or Bloglines at times perform atrociously.  The major Internet players of Amazon, eBay, Yahoo! and Google truly get it, but they are also truly adding hardware capacity at either breakneck or bankruptcy speed (depending on how one looks at it) to keep pace with these interactive site designs.

Oh, By the Way:  Indexing and Querying Also Suck

This post has looked mostly at the question of storage, but another unfortunate truth is that existing semantic Web data systems also suck in the time it takes for initial ingest and for subsequent repository querying and retrieval.  Some general sources for these observations come from combining metrics from OntoWorld and an interesting post on LargeTripleStores on the ESW (W3C) wiki.

Here are some additional useful links regarding scalability from the ESW wiki page TripleStoreScalability:

Is It Bikini Briefs or the Full Monty for the Emperor? 

So, is this semantic Web stuff totally naked?  Does the emperor have any clothes?  And what is the hype from the reality?

It is hard to escape the fact that enterprise-scale performance still eludes current technologies attempting to provide the semantic Web infrastructure.  Since enterprises are themselves still three to five orders of magnitude smaller than the general Internet, I think we can also safely assert that semantic Web data storage and indexing is also not ready for global prime time.

Actually, the latter point is not all that upsetting, given the more fundamental bottleneck of having the need for sufficient semantic content across the Internet to trigger the network effect.  That will happen, but not tomorrow (though also likely sooner than any of us rational data and statistics hounds will foresee). 

With today’s limited enterprise experiments, it is essential there be loads of RAM (and, BTW, fast SANs or SCSI drives to boot, since everything has to page to disk once we stop talking about toys) for any semantic Web system to even sort of approximate acceptable performance.  This is not acceptable and not sustainable.

More hardware, fancy hash algorithms, or work-shifting to external smart parsers are only temporary palliatives for the abysmal performance profiles of semantic Web implementations.  Fundamentally new technology is needed not only to redress the limitations of existing stand-ups, but to open the door for the next generation of applications.  And, for all of this to occur, performance and storage breakthroughs of at least one order of magnitude, and preferably two to three, will be required.  I have the personal good fortune of being right in the middle of these exciting prospects. . . .

Posted by AI3's author, Mike Bergman Posted on May 18, 2006 at 4:03 am in Semantic Web | Comments (6)
Posted: May 15, 2006

There is often no substitute for learning about a subject from the key practitioners and thinkers behind it. Even though one noted researcher in this field, Hamish Cunningham at the University of Sheffield, a specialist in human language technology (HLT), does sometimes cite the low 5% retention of information from a lecture, I am finding repeated viewings and rewinds to be a pretty effective way to learn:

[The original Learning Pyramid analysis traces back to the 1960s and is now attributed to the NTL Institute; the actual picture is from Dr. Tom Bayston at the University of Central Florida. It would be interesting to know whether repeated viewings of online videos, or simultaneous video plus PowerPoint slides, act to increase retention. (Actually, blogging about something may be at the higher end of retention within the Learning Pyramid.) I'm also finding the combination of video/slides with the audio explanations to be immensely helpful.]

Nonetheless, I have previously reported on some great online Semantic Web videos, for example, one by Henry Story and another by Tim Berners-Lee, and faced with a rainy day I tried to be more comprehensive in my discovery.

I found many distribution points, but was most taken with a series of video tutorials and training sessions from SEKT (Semantically-enabled Knowledge Technologies), a three-year, EU-sponsored project that ends at the end of 2006.

SEKT – Online video presentations. SEKT offers 19 different presentations, some with synchronized slides and others with the slides (PPTs) available separately. My three favorites (block out four hours!!!) are:

Besides these, here are some other good introductory-to-advanced videos that you can watch off the Web:

So, wait for that rainy day, grab some hot chocolate, and enjoy!

An AI3 Jewels & Doubloons Winner

Posted by AI3's author, Mike Bergman Posted on May 15, 2006 at 3:51 pm in Jewels & Doubloons, Semantic Web | Comments (0)
Posted: April 30, 2006

Despite page ranking and other techniques, the scale of the Internet is straining the ability of commercial search engines to deliver truly relevant content.  This observation is not new, but its relevance is growing.  Similarly, the integration and interoperability challenges facing enterprises have never been greater.  One approach to address these needs, among others, is to adopt semantic Web standards and technologies.

The image is compelling:  targeted and unambiguous information from all relevant sources, served in usable bite-sized chunks.  It sounds great; why isn’t it happening?

There are clues — actually, reasons — why semantic Web technology is not being embraced in a broad-scale way.  I have spoken elsewhere as to why enterprises or specific organizations will be the initial adopters and promoters of these technologies.  I still believe that to be the case.  The complexity and lack of a network effect ensure that semantic Web stuff will not initially arise from the public Internet.

Parallels with Knowledge Management

Paul Warren, in “Knowledge Management and the Semantic Web: From Scenario to Technology,” IEEE Intelligent Systems, vol. 21, no. 1, 2006, pp. 53-59, has provided a structured framework for why these assertions make sense.  This February online article is essential reading for anyone interested in semantic Web issues (and has a listing of fairly classic references).

If you can get past the first silly paragraphs regarding Sally the political scientist and her research example (perhaps in a separate post I will provide better real-world examples from open source intelligence, or OSINT), Warren actually begins to dissect the real issues and challenges in effecting the semantic Web.  It is this latter two-thirds or so of Warren’s piece that is essential reading.

He does not organize his piece in the manner listed below, but real clues emerge in the repeated pointing to the need for “semi-automatic” methods to make the semantic Web a reality.  Fully a dozen such references are provided.  Relatedly, in second place, are multiple references to the need or value of “reasoning algorithms.”  In any case, here are some of the areas noted by Warren needing “semi-automatic” methods:

  • Assign authoritativeness
  • Learn ontologies
  • Infer better search requests
  • Mediate ontologies (semantic resolution)
  • Support visualization
  • Assign collaborations
  • Infer relationships
  • Extract entities
  • Create ontologies
  • Maintain and evolve ontologies
  • Create taxonomies
  • Infer trust
  • Analyze links
  • etc.

These challenges are not listed by relevance, but as encountered in reading the Warren piece.  Tagging, extracting, classifying and organizing all are pretty intense tasks that certainly cannot be done solely manually at scale.
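As a toy illustration of why these methods can only be "semi-automatic," consider naive entity extraction by capitalization alone. This sketch is mine, not Warren's; real HLT systems use far richer linguistic features, and even this toy shows why human review is needed (sentence-initial words, acronyms and hyphenated names all misfire):

```python
# Toy "semi-automatic" entity extraction: treat capitalized runs of
# words as candidate named entities. Illustrative only; note that
# all-caps acronyms (e.g., IEEE) and many other cases are missed.
import re

def extract_candidate_entities(text):
    """Return capitalized word runs as candidate named entities."""
    pattern = r"\b(?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+)*\b"
    return re.findall(pattern, text)

sample = "Paul Warren discussed the Semantic Web in IEEE Intelligent Systems."
print(extract_candidate_entities(sample))
# ['Paul Warren', 'Semantic Web', 'Intelligent Systems']
```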

Keep It Simple, Stupid

The lack of “simple” approaches is posited as another reason for slow adoption of the semantic Web.  In the article “Spread the word, and join it up,” in the April 6 Guardian, SA Matheson reports Tim O’Reilly as saying:

“I completely believe in the long-term vision of the semantic web – that we’re moving towards a web of data, and sophisticated applications that manipulate and navigate that data web.  However, I don’t believe that the W3C semantic web activity is what’s going to take us there…. It always seemed a bit ironic to me that Berners-Lee, who overthrew many of the most cherished tenets of both hypertext theory and SGML with his ‘less is more and worse is better’ implementation of ideas from both in the world wide web, has been deeply enmeshed in a theoretical exercise rather than just celebrating the bottom-up activity that will ultimately result in the semantic web…. It’s still too early to formalise the mechanisms for the semantic web. We’re going to learn by doing, and make small, incremental steps, rather than a great leap forward.”

There is certainly much need for simplicity to encourage voluntary compliance with semantic Web potentials, at least until the broad benefits and network effects of the semantic Web are actually realized. However, simplicity and broad use are but two of the factors limiting adoption; others include incentives, self-interest and rewards.

As Warren points out in his piece:

Although knowledge workers no doubt believe in the value of annotating their documents, the pressure to create metadata isn’t present. In fact, the pressure of time will work in a counter direction. Annotation’s benefits accrue to other workers; the knowledge creator only benefits if a community of knowledge workers abides by the same rules. In addition, the volume of information in this scenario is much greater than in the services scenario. So, it’s unlikely that manual annotation of information will occur to the extent required to make this scenario work. We need techniques for reducing the load on the knowledge creator.

Somehow we keep coming back to the tools and automated ways to ease the effort and workflow necessary to put in place all of this semantic Web infrastructure. These aids are no doubt important, perhaps critical, but in my mind this still shortchanges the most determinant dynamic of semantic Web technology adoption: the imperatives of the loosely federated, peer-to-peer broader Web v. enterprise adoption.

Oligarchical (Enterprise) Control Precedes the Network Effect

There are some analogies between service-oriented architectures (SOA) and their associated standards, and the standards contemplated for the semantic Web.  Both are rigorous, prescribed, and meant to be intellectually and functionally complete.  (In fact, many of the WS-* standards are to SOA what the semantic Web standards aspire to be for the data web.)  The past week has seen some very interesting posts on the tension of “SOA Versus Web 2.0?”, triggered by John Hagel’s post:

. . . a cultural chasm separates these two technology communities, despite the fact that they both rely heavily on the same foundational standard – XML. The evangelists for SOA tend to dismiss Web 2.0 technologies as light-weight “toys” not suitable for the “real” work of enterprises.  The champions of Web 2.0 technologies, on the other hand, make fun of the “bloated” standards and architectural drawings generated by enterprise architects, skeptically asking whether SOAs will ever do real work. This cultural gap is highly dysfunctional and IMHO precludes extraordinary opportunities to harness the potential of these two complementary technology sets.

This theme was picked up by Dion Hinchcliffe, among others.  Dion consistently posts on this topic in his ZDNet Enterprise Web 2.0 and Web 2.0 blogs, and is always a thoughtful read.   In his response to Hagel’s post, Hinchcliffe notes “… these two cultures are generally failing to cross-pollinate like they should, despite potentially ‘extraordinary opportunities.’”

Supposedly, kitchen and garage coders playing around with cool mashups while surfing and blogging and posting pictures to Flickr are seen as a different “culture” than supposedly buttoned-down IT geeks (even if they wear T-shirts or knit shirts).  But, in my experience, these differences have more to do with the claim on time than the fact we are talking about different tribes of people.  From a development standpoint, we’re talking about the same people, with the real distinction being whether they are on payroll time or personal time.

I like the graphic that Hinchcliffe offers where he is talking about the SaaS model in the enterprise and the fact it may be the emerging form.  You can take this graphic and say the left-hand side of the diagram is corporate time, the right-hand side personal time.

Web 2.0 Enterprise Directions

I make this distinction because where systems may go is perhaps more useful to look at in terms of imperatives and opportunities v. some form of “culture” clash.  In the broad Web, there is no control other than broadly-accepted standards, there is no hegemony, there is only what draws attention and can be implemented in a decentralized way.  This impels simpler standards, and simpler “loosely-coupled” integrations.  We thus see mashups and simpler Web 2.0 sites like social bookmarking.  The drivers are not “complete” solutions to knowledge creation and sharing, but what is fun, cool and gets buzz.

The corporate, or enterprise, side, on the other hand, has a different set of imperatives and, as importantly, a different set of control mechanisms to set higher and more constraining standards to meet those imperatives.   SOA and true semantic Web standards like RDF-S or OWL can be imposed, because the sponsor can either require it or pay for it.  Of course, this oligarchic control still does not ensure adherence, just as IT departments were not able to prevent PC adoption 20 years ago, so it is important that productivity tools, workflows and employee incentives also be aligned with the desired outcomes.

So, what we are likely to see, indeed are seeing now, is that more innovation and experimentation in “looser” ways will take place in Web 2.0 by lots of folks, many of them in their personal time away from the office.  Enterprises, on the other hand, will take the near-term lead on more rigorous and semantically demanding integration and interoperability using semantic Web standards.

Working Both Ends to the Middle

I guess, then, this puts me squarely in the optimists’ camp where I normally reside.  (I also come squarely from an enterprise perspective, since that is where my company resides.)   I see innovation at an unprecedented level with Web 2.0, mashups and participatory media, matched with effort and focus by leading enterprises to climb the data federation pyramid while dealing with very real and intellectually challenging semantic mediation.  Both ends of this spectrum are right, both will instruct, and therefore both should be monitored closely.

Warren gets it right when he points to prior knowledge management challenges as also informing the adoption challenges for the semantic Web in enterprises:

Currently, the main obstacle for introducing ontology-based knowledge management applications into commercial environments is the effort needed for ontology modeling and metadata creation. Developing semiautomatic tools for learning ontologies and extracting metadata is a key research area…. Having to move out of a user’s typical working environment to ‘do knowledge management’ will act as a disincentive, whether the user is creating or retrieving knowledge…. I believe there will be deep semantic interoperability within organizational intranets. This is already the focus of practical implementations, such as the SEKT (Semantically Enabled Knowledge Technologies) project, and across interworking organizations, such as supply chain consortia. In the global Web, semantic interoperability will be more limited.

My suspicion is that Web 2.0 is the sandbox where the tools, interfaces and approaches will emerge that help overcome these enterprise obstacles.  But we will still look strongly to enterprises for much of the money and the W3C for the standards necessary to make it all happen within semantic Web imperatives.

Posted by AI3's author, Mike Bergman Posted on April 30, 2006 at 7:14 pm in Information Automation, Semantic Web | Comments (1)
Posted: April 21, 2006

I recently posted up a listing and description of 40 social bookmarking sites. Little did I realize what a small tip of the iceberg I was describing!

In commentary to that post, I was directed to Baris Karadogan’s posting of Web 2.0 Companies, which contains a fantastic compilation of 980 specific sites with links. Reviewing user comments, I discovered that the original compiler of this list is possibly Bob Stumpel of the Everything 2.0 blog. It is hard to establish the provenance of the list, but Bob is active in assembling long lists of updates, while Baris’ version is more attractive with embedded live links.

Bob has updated the master list a number of times, and now shows 1,601 sites as of April 16.

Both of these reference lists have organized the sites into about 70 different categories, a sampling of which shows the diversity and innovation taking place:

  • Audio and video
  • Social bookmarking
  • Venture capital
  • Wish lists
  • Search (biggest category)
  • Images
  • Collaboration
  • Fun and games
  • Etc.

The next useful step for some enterprising soul is to provide more commentary on each site and better describe what is meant by each category. Perhaps someone will step forward with a wiki somewhere.

Likely few of us have the time to look at all of the sites listed. But I am slowly sampling my way through the list, checking out the variety of approaches being taken by clever innovators out there. While some of the links are dead — not unexpected in such a nascent area or with such a long list — I’m also seeing a lot of clever ideas.

This listing is a useful service. If you know of a missing site, please suggest it to one of the two compilation sites. And, I do hope someone takes authoritative ownership of the list and proper attribution is given where appropriate.

Posted by AI3's author, Mike Bergman Posted on April 21, 2006 at 10:40 am in Semantic Web, Semantic Web Tools | Comments (0)