Katie Portwin, one of the Ingenta developers whose Jena paper stimulated my recent posting on semantic Web scalability, has expanded on the scalability theme in interesting ways in her recent performance, triplestores, and going round in circles.. post.
In her post, Katie asks rhetorically, Can industrial scale triplestores be made to perform? Is breaking the "triple table" model the answer? She then goes on to note that in a related XTech paper, the Ingenta team showed that even a simple, bread and butter sample query takes 1.5 seconds on a 200 million triple-store. The post also contains interesting links to other speakers at the Jena User’s Conference last week, including clever ways to cluster triples in an RDBMS.
I asked Tom Tiahrt, BrightPlanet’s chief scientist and lead developer on our text and semantic engines, to review this post and give me his thoughts. Here are his comments:
I always like to see this: "re-modelling" or "modelling" instead of "modeling" because I abhor human-induced language entropy. Kudos to Katie Portwin (KP) for that alone.
Kevin Wilkinson (KW) defines a triple store as a three-column table in a relational system. This is unfortunate because a triple-store is not exclusive to RDB systems. It must be provided by any RDF system as part of its logical design, even if does not use it for its physical design.
KW's patterns-identification aspect is likely true in many instances, and his 'breaking' the clean RDF format is what DBAs and RDB developers always do to improve performance (denormalizing the database). KP points out the problem with this, viz., that you must maintain a more complex schema, and duplicates raise data retrieval issues (though they are tractable). Moreover, KP writes "The great thing about the triplestore is that we don’t have to bake assumptions about the data into the database – we can have as many whatevers as we like."
The point is to achieve acceptable performance you cannot simply rely on the triple store alone. At the same time, RDF requires triples, and to prevent assumption baking the user should not have to decide how to denormalize the triple-store. In addition,
the transitive closure computation is the onerous query that the RDBMS cannot do within a reasonable amount of time.
Here are the parameters of the great problem. Static assumptions about what will happen directly oppose what RDF is supposed to provide. Open-ended dynamic processing cannot perform well enough to solve the problem.
Katie Portwin also points out that re-modelling is a real problem as well when the system is hosted by an RDBMS, though the triple stores can remain intact.
I’ll keep monitoring this topic and post other interesting perspectives on RDF, triple-store and semantic Web scalability as I encounter them.