In Part 1 of this series, I advocated the placement of linked data in an ABox construct from description logics  based on a separation of concerns argument. In Part 2 of this series, I reinforced that argument from the perspective of the work to be done within a knowledge base.
I came to these viewpoints independently. I do not have any special background in these disciplines; I am a recent researcher and practitioner in the field, perhaps akin to a gentleman natural scientist of the 1800s. As these ideas have formed, therefore, I have also attempted to see what some of the noted experts in the field have said and wrote.
Like any other field, there is no common viewpoint or doctrine about these matters. But, there is considerable — and historic — support for this viewpoint of splitting the TBox and the ABox for many different reasons. To my knowledge, this viewpoint has yet to be consolidated and applied to linked data. Perhaps this series will help stimulate that discussion.
The first specific discussion of this matter I was able to discover (though I suspect it had been discussed in earlier internal forums or papers) was on the W3C’s RDF logic mail lists in 2001. While only eight years ago, it does feel a bit like ancient history with regard to the development and understanding of semantic Web languages.
The mail list topic was what role RDF should assume, at a formative point in the language’s development. A possibly restricted scope for RDF akin to a relational database or even as an “ABox” was being discussed. (This restriction, of course, was dropped for the more open, free scope of the present RDF.) To help clarify these matters, Jérôme Euzenat first noted in a thread with Pat Hayes, who would later author the RDF semantics W3C standards document  two years hence, that:
I find two important ideas emerging at this point. First, the roles and purposes of the TBox and ABox are being made clearer, consistent with the definition we have been using  and with better clarity and applicability to the semantic Web than what was earlier presented in the Description Logics Handbook . And, second, the idea of a language split between OWL and RDF viz the TBox and ABox is made public for the first time.
The conceptual foundations of the relational model and RDF are indeed quite similar, based as they are on set theory and relationships. The relational model and description logics are both based on FOL (first-order predicate logic). The linkage with the data table of relational database systems is especially close and direct and was first a topic of a design architecture document by Tim Berners-Lee in 1998 .
There is a rich literature to investigate these aspects in detail. And, of course, these matters are of critical import because 99% of the current structured data in the world is being managed by RDBMs .
However, of more direct interest to this specific series of articles is how this close relationship between RDB and RDF is viewed with respect to the ABox and TBox separation of concerns.
Ian Horrocks (not surprisingly given the nature of his comments above), among others, has played a prominent role in looking at questions such as “hybrid DL-DB” (description logics + [relational] database) systems  and building conceptual links between relational databases and ontological-level reasoning .
We need not deconstruct his observations and arguments here in detail. What I glean from his strong background in description logics, however, is that relational data tables can be left in situ as ABox constructs, in the process gaining the efficiency of limited ABox reasoning and the efficiency of RDBMs. With proper design — which I also understand to be pretty straightforward — it should be possible to design hybrid ABox and TBox systems that work in a distributed context.
(Hmmm; sounds to me like ideas applicable to linked data !).
Horrocks’ et al. more recent paper  and its expansion  propose an extension of OWL for such hybrid or split knowledge bases. These extensions are designed to allow modelers to designate a subset of TBox axioms as integrity constraints. For TBox-level reasoning, these axioms are treated as usual. However, these axioms may also be applied separately to ABox instance data to perform integrity checks. Integrity can then be checked in advance with those axioms ignored during standard TBox reasoning, thus also improving performance. I think these points also have direct relevance to linked data.
As a proponent of the OWL side of the spectrum, Horrocks’ viewpoints have perhaps been too readily dismissed by some in the linked data community. Yet the major reason for looking at all of these questions from the perspective of description logics is to gain a coherent view across the entire semWeb enterprise. In the end, we are linking data for a purpose, to be able to do meaningful work with it. Just as RDB data tables can be looked at and integrated productively as ABoxes in a DL construct, so may linked data.
If there truly is a separation of concerns between instance records (ABox) and reasoning constructs (TBox ontologies), what does that begin to tell us about the languages we need for these purposes? If we can postulate no OWL in a linked data instance construct (the ABox), why not narrow RDFS  as well in order to have a vocabulary only as expressive as what an instance record and its assertions require?
de Bruijn et al in 2005 demonstrated logically how RDF models can be related to description logics-based ontology languages, especially OWL DL, without the need to change syntax or sematics in either language . They noted specifically the use of RDF graphs as ABoxes that could be readily queried using SPARQL.
Herman ter Horst wrote another influential paper in 2005 where he looked closely at the proofs of completeness and decidability for RDF and RDFS . He defined a general RDF graph extension that was fully decidable, and importantly looked at each statement in the language from the standpoints of complexity and computational tractability. He was particularly seeking logics that would be more computationally efficient due to fewer entailments, while still being “decidable” (that is, provable to reach computational closure). Here is the basic chart of plotting the various language dialects he investigated:
He noted that inclusion of XML datatypes required the use of RDFS for closure and the addition of the so-called ‘D*entailment‘ could extend RDFS to include reasoning with datatypes. He then extended that construct into what he called the ‘pD*semantics,’ which was intended to allow useful conclusions to be drawn about instances in the presence of an ontology with relatively low computational complexity.
What this construct means — as I understand it in the context of this series — is that specialized dialects (pD*) could govern the work of instance checking and other specialized work at the ABox level (RDFS) while being fully compatible with the TBox level (OWL) ontology. This means that languages and dialects could be tailored for the work at hand for efficiency and representational reasons, while maintaining logical integrity. Indeed, this very pD* dialect of OWL is now included as one of the proposed profiles, OWL 2 RL, for the new release of OWL 2 .
In a different vein, the paper that won the best award at ESWC in 2007 looked at the question of simplifying RDFS . The authors were able to identify a fragment of RDFS that captured the complete semantics of RDF by carefully removing pieces that only described or allowed reasoning of the language itself.
A relatively streamlined and simplified structure for the ABox is not a new idea. Through version 3x, the Protégé ontology editor included a built-in for SWRL (the Semantic Web Rule Language) that included an ABox ontology :
<rdf:RDF xml:base='http://swrl.stanford.edu/ontologies/built-ins/3.3/abox.owl'> <owl:Ontology rdf:about=' '/> <swrl:Builtin rdf:ID='hasValue'/> <swrl:Builtin rdf:ID='hasURI'/> <swrl:Builtin rdf:ID='isNumeric'/> <swrl:Builtin rdf:ID='notNumeric'/> <swrl:Builtin rdf:ID='isIndividual'/> <swrl:Builtin rdf:ID='isConstant'/> <swrl:Builtin rdf:ID='hasClass'/> <swrl:Builtin rdf:ID='hasProperty'/> <swrl:Builtin rdf:ID='hasIndividual'> <swrlb:maxArgs rdf:datatype='http://www.w3.org/2001/XMLSchema#int'>1</swrlb:maxArgs> <swrlb:args rdf:datatype='http://www.w3.org/2001/XMLSchema#int'>1</swrlb:args> <swrlb:minArgs rdf:datatype='http://www.w3.org/2001/XMLSchema#int'>1</swrlb:minArgs> </swrl:Builtin> <swrl:Builtin rdf:ID='setValue'/> </rdf:RDF>
Don’t be fooled by the OWL designation in this file; for these uses, Protégé by convention requires all of its files to be of the OWL type. Note the simple vocabulary above has solely RDF predicates. We do not think this is yet the correct design (see Part 4), but it captures the right idea.
Logical and mathematical advances since the first releases of RDF and OWL now suggest that, with proper care and design, various dialects or fragments can be designed for specific purposes and for computational efficiency while maintaining — in their combination — logical integrity. An RDFS fragment, if you will, dedicated for linked instance data and ABox instance record purposes, appears conceptually doable. And, it may be computationally advisable.
The SWSE (“swizzy”) project from DERI and the National University of Ireland in Galway has an interesting legacy and has been combining many of these threads into one approach to a semantic web search engine (hence, SWSE). You can use and test for yourself the new VisiNav interface and service, the newest instantiation of SWSE.
Relative to ABox (instance) data, the volume of TBox (structural) data on the Web is small: only around 0.7% of statements were classifiable as TBox statements .
In building what they call SAOR (for Scalable, Authoritative OWL Reasoner) for the SWSE effort, Aidan Hogan, Andreas Harth and Axel Polleres have intersected a number of interesting approaches and have taken some innovative paths to the questions of separating the TBox and ABox . They have further applied this to the large-scale Billion Triples Challenge with interesting findings and results .
In their approach to building SAOR, the designers:
They further picked up on a variant of the ter Horst pD*semantics noted above to speed the reasoner for calculating the forward-chaining inferences.
According to these decisions and rules, they found the overwhelming majority of statements within the 315,000 sources they crawled as being “non-authoritative” and indeed made many decisions that, in essence, threw out statements in the source instance sets. One interpretation, related to the thesis of this series but not directly noted by the authors, is that much of the linked data presently available on the Web is either over-specified or mis-specified. (I would argue that is due in part to linked data instance records trying to do more than their natural assertional role.)
Now, perhaps one could quibble with the rules and the decisions the authors employed (indeed, we do), but that is a topic for another day. What is interesting about the entire SAOR approach, I think, is its close attention to value and authoritativeness, all being split and recast into more tractable ABox and TBox portions, for handling reasoning at scale over large numbers of instances.
In my opinion, this is a seminal approach to the next generation of linked data that warrants much inspection and discussion.
A cursory discussion of the literature also shows some efforts that address the interstitial work areas noted in my conceptual architecture from Part 2.
Part 2 discussed full-text search engines in the broad semantic sense, and not specifically related to the ABox-TBox split. A couple of those and some others deserve a look because of their tighter integration of full-text search and attention to work splits.
A sister project to SWSE is Sindice, which also uses Solr and employs the ter Horst pD* semantic framework . An inspection of Sweet Tools, the semantic Web and -related tools listing, also suggests Aperture (a broadscale, full-text harvester with semantic capabilities); LARQ (which adds free text search to ARQ); Virtuoso (full-text and faceted search on top of a universal datastore); Watson (full-text search of metadata fields); and Zebra (specializing in structured library data and related).
A couple of different approaches are being taken to identity testing, similarity or relations. The more direct approach is to do identity matching with a canonical ID or similar.
The SWSE group has one approach to object consolidation , which uses a clever method based on the owl:InverseFunctionalProperty (IFP) for performing large-scale consolidation of object identifiers for equivalent instances across data sources. Yet, as the authors note,
This is both bad logic and wrong in many cases [see 19 for a critique]. The authors therefore needed to drop this assignment from their method. But, frankly, I think the broader problem again is too much predicate firepower for what should be a simple assertion that Joe Farmer has a blog (in fact, may have three!) and here are their URIs.
A very large, multi-year project to assign unique identifiers to entities is the OKKAM project . The intent is to provide a single and globally unique identifier for any entity through an ‘Entity Name System‘, plus tools. Many methods will be employed to assign the identity relationship; specifics are still forthcoming with dozens of researchers working on the problem. I should note that the reference paper also touches upon some of the massive challenges associated with the current use of owl:sameAs.
Others have questioned a centralized ID service, instead preferring a mechanism that is more local and builds on co-reference research . The ReSIST project has noted some of the issues of owl:sameAs use and management. It has proposed, instead, a ‘Consistent Reference Service‘ (or CRS). Asserting a co-reference in this approach is like its use in linguistics: it means a URI that describes the same entity, as does ‘he’, ‘she’ or ‘it’ as a co-reference in a sentence. This predicate indicates that the two resources are describing the same thing without carrying all of the heavy entailment of the owl:sameAs predicate that semantically means the two resources are exactly the same. The CRS are proposed to be set up and managed locally and in a distributed fashion.
A very different approach to identity assignments is Rough DL, which is a qualitative, “fuzzy” ontology for relating entities or concepts to their similar resources . The method has also been applied to the very difficult problem of bibliographic records , where similarity is harder to judge because of use of initials and abbreviations. Rough DL may be especially appealing because even with the best state-of-the-art, there are error rates in any of the identity relating or disambiguation methods available. And, rather than try to assess these similarities with a probabilistic score, the “fuzzy” approach may even be one that can be reasoned over.
To my knowledge, there is no disambiguation of entities presently taking place for distributed linked data sets. But, if not already, it soon will.
An example of how such a service might occur is the uBio Taxonomic Name Server from The Marine Biological Laboratory at Woods Hole. Via Web service or direct HTML form, an entity name (in this case a biological species) or its variants can be submitted for disambiguation and assignment to the proper identifier (name).
There is much research behind the algorithms and approaches to named entity disambiguation beyond the scope of this present series.
Our arguments to this point in this series do not suggest nor require that current practice need change. Clearly, we are seeing growth, uptake and use with current practices regarding linked data.
One of the beautiful aspects of RDF as a data model and the semantic Web is that the underlying languages and standards are so flexible. Find a way to do stuff better in the future? Fine; go ahead and do it, because what has come before can be easily transitioned or accommodated.
The real thrust of this series has been “best practice.” There are certainly many viewpoints on that topic, and the understanding of it for a linked data environment at scale is also evolving. This is healthy, vibrant and exciting. Who knows what is truly best practice? I personally believe the market will determine that by what gets adopted and becomes self-sustaining by providing value.
However, as Structured Dynamics attempts to think through these issues — to look seriously at moving from simply proving the exposure of Web data to one of meaningfully doing work and relying on it at large scale — we see warts and challenges. Such is growth. It is natural. And change does not mean that what came before was wrong.
So, what do we see as some of these ‘big picture’ implications?:
If we can do these things, we can simplify what it means to publish “linked data-ready” structured data. Being coherent about these matters is a key.
In Part 1 of this series, I advocated the placement of linked data in an ABox construct from description logics  based on a separation of concerns argument. Actually, this broader position arose from close inspection of an earlier table on TBox and ABox purposes and roles that I had assembled from the literature.
That table synthesized TBox and ABox purposes and roles from the citations listed in Back to the Future with Description Logics, though was based for the most part on the The Description Logic Handbook . As we continued to look further at the assignments in that table we saw a few issues:
Work, of course, is what our computers do for us on the data. Properly identifying and isolating these work tasks is a good starting point for teasing out architectural and schema design.
These perspectives allowed Fred Giasson and me to re-evaluate this earlier assignments table (see earlier version), as shown below. Note that some items have been moved to a different column (shown in blue, all of which were formerly in the ‘ABox’) and some have been added and may be external (shown in green):
|TBox||TBox <–> ABox||ABox|
Blue = moved
Green = added
As the table now shows, the TBox is where the reasoning work occurs, the ABox is where assertions and data integrity occurs, and knowledge base work in the middle (among other aspects) requires both.
As Part 1 discussed, linked data in its current form, which are mostly instance records, most closely resembles ABoxes. The linkage in current linked data occurs via asserted relations and identities with things in other datasets (or sources). This might be the references to the objects themselves or to the predicates (properties) that describe the nature of the asserted relationship. The identifiers for these references are URIs, which are often external to the originating dataset.
This construction has led to the now well-known ‘LOD cloud‘ for publicly accessible open linked data:
Each of the arrows on the diagram reflects an external linkage based on these property or object identity assertions.
This same approach can be applied as well for specific domains. This example from the LODD (‘linking open drug data’) “cloud” shows  a similar “linked” structure; also note the central node of DBpedia, also shared with the previous diagram:
Part of my argument in Part 1 was for the linked data community and its advocates to view its role via an ABox perspective. I further argued that this could help simplify schema and vocabulary design for publishing linked data, helping to promote faster and broader uptake.
We can now take these arguments a bit further by adding the perspective of useful work that can be performed within this ABox framework, building upon the table above.
Looking at linked data from the perspective of work makes sense as we see the scale of linked data grow. Besides the sheer increase in numbers arising from that growth, newer publishers may have fewer skills and less conformance to best practice. Questions as to quality and erroneous mappings now appear more frequently on the support forums and mailing lists.
Linked data — or even more generically, distributed or federated data systems — have particular needs for search, for checking identity relations, and for disambiguation that arise from their distributed nature. These are challenges, and were not generally topics in the initial discourse around description logics. On the other hand, I think it fair to say that linked data has already shown its benefits in browsing and discovery, and for linked data destination sites with SPARQL endpoints, benefits in local search.
The greatest strength of linked data, imo, is that it is showing the way to how the Web can become a global, interoperating database, the Web of Data. Standards are being applied, practices are being worked out, and actual data is being exposed. Visibility and the realization is growing. This is the real story — and an important one — and frankly a massive development that is a mere two years old.
Fundamental to this growth has been the superb flexibility and simplicity of RDF as a data model. No matter what else, the benefits of RDF and linked data have a signal that now exceeds the noise. I believe it can be comfortably asserted that the next generation of information systems will be built on this and REST, the combination of which has been called Web-oriented architecture (WOA).
If you know the locations and if you know the format and (sometimes) if you know the syntax, you can see and discover much from linked data on the Web. If you can get beyond these start-up hurdles, you can now start swimming in a sea of linked references.
Granted, sometimes you don’t know what those cryptic link references mean; other times those links may take you to dead ends or silly stuff. I think it is fair to say that the community that is exposing this linked data stuff would acknowledge that their user interfaces and work flows are not yet intuitive or often easily grasped.
But, bear with this for a moment. Try a couple of these links:
(BTW, any URI will work so long as it has RDF in the correct, servicable form; a challenge unto itself! )
Just like we first discovered the magic of moving from hyperlink to hyperlink in the initial Web, we are now beginning to see the process of moving from hyperdata to hyperdata with regard to structured and organized information.
Browsing and “stumbling upon” new cool stuff is the current main strength of linked data; again, assuming you are given some tips about how to start. Here are some linked data browers that might help you on this discovery journey, often with useful accompanying help and link examples: OpenLink Data Explorer or Zitgist Data Viewer or Marbles or DISCO or Tabulator. Don’t be surprised if some of them break or don’t work! (Again,if you don’t know where to begin, try issuing one of the two link queries above after the ‘url’ or ‘uri’ part.)
If you are a bit more experienced, some of the linked data source sites can be queried directly. The language used is SPARQL, which bears many resemblances to SQL for standard relational databases.
Generally (unless you are an expert), you will issue your queries over the local database at the target site. To see and test some examples, first bring up one of the many SPARQL clients, for example this one presents a rather complete overview and demo environment:
A couple of other options with pre-loaded queries include:
The SPARQL facility is really quite powerful and elegant; for the knowledgable user or system, this query syntax can also be applied across linked datasets. (Indeed, SPARQL is a key component in the background to Structured Dynamics’ services.) SPARQL itself is a deserving topic in its own right.
There are a few RDF search engines that can do property, object or ontology searches, including Falcons (IWS China), Sindice (DERI Ireland), Watson (KMi), Semantic Web Search Engine (SWSE) (DERI Ireland) and Swoogle (the Ebiquity Group at UMBC). My personal favorite is Falcons.
We also are beginning to see structured data and attribute information creep into major search engines, often with little fanfare or notice; however, these are structured supplements to standard search results and not directly searchable alone. And, finally, there is a class of semantic search engines that do provide structured search, but not directly relevant to linked data (that is, they add structure and semantics to conventional search results). Prominent examples include Powerset (Microsoft), Hakia, True Knowledge and Cognition.
In general, user interfaces have not been strong points for linked data. The emphasis, after all, has been more on the principles of linking and getting more data exposed.
More critical, however, from the standpoint of making the compelling case for an evolving Web of Data, is the weak tools support as the number and breadth of sources goes up. These limitations include full-text search, which requires performance equivalent to standard search engines; a lack of identity testing; and — increasingly — the need for disambiguation.
Though there are linked data datastores (generally called “triple stores”) for RDF that do full text search, they do not perform as well as dedicated text search engines. Depending on scale or hardware choices and configuration, these issues may often be overcome and not directly apparent to end users.
<rdf:resource=”http://mkbergman.com/me/” foaf:knows rdf:resource=”http://fgiasson.com/me/”>
Now, what happens if I enter the search phrase “Fred Giasson” into my search box? Will this record appear?
The answer, often times, is no. And the reason it may not is that the “thing” of “Fred Giasson” is now represented by the resource URI of “http://fgiasson.com/me/” that does not contain the “Fred Giasson” search string.
In fact, one of the first observations that new users exposed to searching link data make is, Where are all of the results? Because, you see, they expect to see the same complete listing of results that would be returned by a conventional search engine.
The reason this occurs is that one of the very strengths of linked data — to give unique resources a unique Web ID or URI — is also what abstracts the resource from its literal string descriptor.
Now, this issue can be overcome, though to my knowledge no one is doing it. First, for a given resource, you can get all the triples where the object is a text “string” and index them. Then, for a subsequent resource, you can get all of the triples where the object is a resource and, if it is of the right type indicating a label, we then try to find the matching property for that resource.
This is generally not work done by triple stores. We are in essence trying to trace back from a URI identifier to its literal text descriptor (if it has one) and substituting the text descriptor into the search index such that it can be found during a full-text search (whew!). That sounds hard and like a lot of work. Indeed it is.
However, as time goes on, this is likely functionality that users will demand. And, because the nature of the work is really text search and not triple stores or inferencing based from a graph, it may require special processing at ingest and the use of dedicated text engines.
Remember what we said earlier about linked data and its relation to instances and the ABox? It can assert it is related to or has properties or attributes of one kind or another, but from the information contained in the instance record itself we can not determine if those assertions are true. We can believe or trust some sources more than others as being more authoritative, but even for authoritative sources we may want to test the assertion.
Perhaps this is less of a problem when the linked datasets come from a relatively small community, as has been the case. But that is rapidly changing, and how do we begin to test identity relations?
Again, to my knowledge, identity testing is not yet being applied to linked instance data; we are relying on assertions and trust, each confounded by differences of opinion as to what asserting “identity” even means. Identity testing is a good example of a possible service or component that could either reside external to the TBox, or could use the concept graph at the TBox level to aid its logic.
Identity relations would check the assertions of relatedness made at the ABox level. At its simplest level, this may only be a lookup service for synonyms or aliases (or what we more broadly call ‘semsets‘ in UMBEL). In more complex or comprehensive forms it could, for example, do a check for similar instances across the instance base (all ABoxes) using techniques similar to disambiguation. In this manner, for example, owl:sameAs could now be better determined, and under well defined conditions that users could understand and then accept or reject. Disjointedness and similarity are two additional identity checks possible.
Identity checking could thus better work across disparate datasets and could have a relatively simple and optimized index structure. This would be useful for real-time access; newly discovered identities could be separately slip-streamed and later evaluated more rigorously.
Disambiguation is another component that could similarly reside internal or external to the standard system. Like identity relations, it might rely on concept graph information but its logic need not be explicitly modeled at the TBox level.
Disambiguation basically tries to test whether Joe Farmer the farmer is the same or different than Joe Farmer the truck driver. TBox information can certainly aid this — such as whether agricultural concepts or attributes are in association with the entity at hand — and the identity testing and synonyms or alternate name forms can also play into this determination.
Again, this is work that has largely not been critical for the early phases of linked data, but is becoming essential as the application of linked data is being contemplated for doing actual, meaningful stuff.
Of course, linked data and the ABox by definition don’t provide this kind of work; reasoning and inference are appropriately the work of the TBox.
We can now start diagramming these pieces into a conceptual architecture as follows:
This figure maps the work activities noted in the table at the top of this article and described throughout. Note there are a number of possible and specialized work activities at the interstices between the TBox and ABox, some of which are somewhat new to the description logics discourse as noted in the prior section.
I’m certainly no AI or KR expert, but it appears to me that there are pragmatic work tasks that emerge that do not easily fit into purely conceptual discussions of these things. The interstitial items possibly fall into this category, and those in the middle “bubbles” could even be done via separate processing or indexes not invoking the TBox level at all.
I suspect, though, that would be a mistake. The organizational view of the world at the TBox level provides useful reasoning and inferential bases for aiding other work tasks. For example, for disambiguation work, knowing that the instance of Joe Farmer was found in context with concepts relating to fertilizer or farm implements would help distinguish from Joe Farmer the lawyer (though would need additional clues or reasoning to dismiss that is was Joe Farmer the trucker shipping these things).
If you recall from Part 1, considerable discussion was devoted to the suggestion that linked data belonged in the ABox and that even if linkages are being made to other entities or instances, those assignments remain only assertions. We discussed identity evaluation and disambiguation in the previous section. These really come down to the questions of whether we are talking about the same or different things across multiple data sources.
Maintaining identity relations and disambiguation as separate components also has the advantage of enabling different methodologies or algorithms to be determined or swapped out as better methods become available. A low-fidelity service, for example, could be applied for quick or free uses, with more rigorous methods reserved for paid or batch mode analysis. Similarly, maintaining full-text search as a separate component means the work can be done by optimized search engines with built-in faceting (such as the excellent open-source Solr application). I will likely have more to say on this aspect after this series concludes.
People can do as they like and will within the semantic scope of the RDF and OWL languages. But, besides my recommendation to view linked data as ABoxes through the lens of description logics, I’d like to suggest a couple of additional ‘best practices.’
I think linked data does itself a disservice to throw the owl:sameAs predicate around as much as it does. Ooooh! OWL; this is complicated stuff, no? And, actually, owl:sameAs is perhaps the most powerful entailment predicate around [5,6], which is also reflexive and transitive. How can we really use this property when all we are really talking about is an assertion at the ABox instance level?
I think use of predicates like this confuse ourselves and confuse the public. We give stuff a gloss and patina that is not supportable by any reasoning.
The OWL2 folks, I think, understood this issue in moving toward specific “assertion” language in the new draft version . OWL2 now proposes these assertion predicates of owl:SameIndividual, owl:DifferentIndividuals, owl:ClassAssertion, owl:ObjectPropertyAssertion, owl:NegativeObjectPropertyAssertion, owl:DataPropertyAssertion, and owl:NegativeDataPropertyAssertion. Note that, axiomatically, the owl:SameIndividual is now the functional equivalent of the older owl:sameAs .
I think it is a step in the right direction, though frankly I would have preferred a semantics that used the “assert” naming for instances (individuals) as well.
But, actually, I have a more fundamental question, related to confusion argument above: Why should assertions be an OWL property? Would it not make better and cleaner sense to establish assertion predicates within RDF or RDFS and to minimize stronger implications of possible decidability or entailment? Would this approach not lead to a cleaner split between instance records (ABoxes) and linked data that could be kept solely within the RDF vocabulary without invoking OWL at all?
In this manner, we could leave the various emerging profiles for OWL2  reserved for TBox-level reasoning. It would make for easier and understandable reasoners, while making it easier for publishers to expose linked instance data.
Well, OK; so, I have now gotten most of the arguments off my chest about ABoxes and linked data, the separation of concerns, and maintaining distinctions in language.
I will discuss in the next Part 3 what other researchers have to say on this use of the TBox and ABox and the split between them. Then, in the concluding Part 4, I discuss how this perspective might lead to a pretty simple RDF vocabulary and structure for linked data and instance records.
It is clear that Fred Giasson and I have been spending considerable time on description logics of late. While we could perhaps claim that we like hard-to-read and -understand stuff, our reasons for this interest have been quite pragmatic: How to apply linked data principles to real-world commercial and organizational environments? Indeed, what should those principles even be?
Our first intuition, reaching back nearly two years now, was that linked data needed a context for bringing related datasets together. This belief led us to construct the UMBEL subject concept ontology, a basic reference roadmap for helping to point to information related (or “about”) similar subjects. UMBEL as a set of subject concepts has proved useful as a reference roadmap; and the approach to construct UMBEL and its resulting vocabulary (heavily based in SKOS) has also proved helpful to construct specific domain-level ontologies.
By their nature, these ontologies have been conceptual and structural. They define relationships, but are instance-poor. They focus on ways to describe various lenses — world views — into the domains for which we have been engaged.
But superstructures are meant to be built upon and fleshed out. For that, real instance data is required.
Thus, we have more recently shifted from concepts and structure to focus on how to represent the actual things that populate that structure; that is, a domain’s actual objects or instances. We appreciate that different audiences and proponents will use terminology such as instance, or object, or entity, or individual or even foo to describe such things, but for us (Peirceian logic aside) we simply wanted a way to describe the referents to a specific real-world thing .
We initially chose the term ‘named entities‘ to describe these actual objects. This naming arose from the work of Sekine and his 200 named entity types . Typical named entities are specific (individual) people, organizations, events, artifacts (‘Mona Lisa’), places, products, or whatever. For example, here is our first published definition describing ‘named entities‘:
Actually, ‘named entities‘, in even that sense, do not all have proper names with capitalization. Some accepted ‘named entities‘ are also written in lower case, with examples such as rocks (‘gneiss’) or common animals or plants (‘daisy’) or chemicals (‘ozone’) or minerals (‘mica’) or drugs (‘aspirin’) or foods (‘sushi’) or whatever.
|City of Iowa City|
|Clinton St., Iowa City|
|Location in the state of Iowa|
|Coordinates: 41°39′21″N 91°31′30″W 41.65583°N 91.525°W|
|Metro||Iowa City Metropolitan Area|
|- Type||Council-manager government|
|- Mayor||Regenia Bailey|
|- City Manager||Michael Lombardo|
|- City||24.4 sq mi (63.3 km2)|
|- Land||24.2 sq mi (62.6 km2)|
|- Water||0.3 sq mi (0.7 km2)|
|Elevation||668 ft (203.6 m)|
|Population (2007 est.)|
|- Density||2,748.4/sq mi (1,059.4/km2)|
|Time zone||CST (UTC-6)|
|- Summer (DST)||CDT (UTC-5)|
|GNIS feature ID||0457827|
But, hmmm. While ‘daisy’ might be an instance of the common flowers, is that the same as a specific daisy flower? especially when I can see literally thousands of daisy flowers at present in my back yard?
This epistemological question of thing v instance v individual can really mess you up! Furthermore, from the standpoint of describing these things on the Web, are we talking about the real thing, a symbol of some sort (Peirce again!) for that thing, or a multitude of similar descriptive terms (flower, bloom, daisy, florescence, bellis, chrysanthenum) for that thing?
Whether a thing is an instance or an individual or even a class depends on context. A plant taxonomy could represent its terminal nodes as specific species or subspecies of daisy. But, in a flower show, the specific thing being referred to could actually be a unique individual, the Pretty Miss Daisy blue-ribbon winner.
Description logics with its TBox and ABox splits  actually helps considerably to unravel these potentially confounding distinctions. The ABox covers the description of instances with their asserted attributes or characteristics. Thus, we can have an ABox description of the daisy instance that refers to daisies in general or daisies as a species, or we can have an ABox description of an individual daisy with specific proper name in a flower show.
This instance idea is really a very clean one. As long as we focus on the idea of an instance and its attributes, we can put off for the moment (or defer to another layer, that is the reasoning TBox) what kind of instance this is.
After another segue, we’ll return to this instance concept in a moment.
DBpedia, the structured and linked data extraction of “facts” from Wikipedia, was first released about two years ago. Happy 2nd Birthday! I first wrote about DBpedia shortly thereafter claiming, I think somewhat accurately, the birth of the structured Web. We now know that phenomenon and the many additional datasets that nucleate around DBpedia as linked data.
When first explained, DBpedia used examples of so-called Wikipedia infoboxes for the cities of Leipzig or Innsbruck to describe the source of its structured data. (Subsequently Berlin has also been commonly referenced, all understandable given DBpedia’s two principal founders of SÃ¶ren Auer at the UniversitÃ¤t Leipzig and Chris Bizer at Freie UniversitÃ¤t Berlin; of course, many others have joined and meaningfully supported the project since.) Infoboxes are Wikipedia templates that provide standardized, structured information across related articles of a similar type.
I have copied a similar infobox — in this case for the same city type from Wikipedia for my home town of Iowa City — to show one of these structured data templates (shown to right). In using it I am, of course being a bit parochial, but it is also interesting to see the growth of structured data (attributes) that such templates now contain compared to what was available at the time of DBpedia’s first release.
This infobox is a perfect example of an ABox. The instance it describes is the ‘City of Iowa City’. Each of the items that follow show an attribute or data characteristic of some form with its associated value as a key-value pair. Sometimes those values refer to other instances, some of which are individuals, such as the county or name of the mayor.
In ABox terminology, these values are asserted for each attribute. Because this is Wikipedia, which has a reputation for accuracy and authority, we tend to believe and accept these assertions. But, we also know that sometimes these values are not correct, even for Wikipedia. We also know that instance records can come from many, many different sources, perhaps most not with the accuracy or authority of Wikipedia.
It is these types of instance records (for many other types of things than city, of course!) that are now being published as linked data. Today more than 50 general public datasets and perhaps another 50 from the sciences (especially biology) have been published. The total assertions across all datasets now exceeds millions, and the RDF statements that capture all of the relationships between these instances, attributes and concepts that describe them exceed one billion, as the recent Billion Triples challenge attests.
What is nice about this ABox structure is that they are relatively simple — instances characterized by attributes — with the “facts” so expressed understood to be assertions and not necessarily verified truth or accuracy. No matter what the source, there is no guarantee that all assertions will be complete and accurate. (Though, as has proved to be the case for DBpedia because of its Wikipedia heritage, some of the sources can be comfortably asserted to be authoritative.)
Assertions about many of the attributes are relatively straightforward such as, in the Iowa City instance, zip codes or time zones or population. (Still, the estimates used could also be out of date or the estimation methods could be argued.) However, other assertions, more based on interpretation or personal opinion, such as subject matter or political or religious affiliation or bias, can be quite controversial.
Another potential source of error is the linked data assertion that one instance is the owl:sameAs a different instance in a different dataset. Erroneous ‘same as’ assertions can arise quite simply and not require malice or stupidity. For example, for me, I actually live in Coralville, Iowa, not Iowa City. But, Coralville completely abuts Iowa City, shares a school district, and my wife works in Iowa City. I more often than not claim Iowa City as my location, though my actual mailing address is Coralville. How does one reasonably say that the identity of Michael Bergman of Coralville is the same as the Michael Bergman of Iowa City?
So, what can the perspective of the ABox and description logics tell us about these issues?:
This is the rationale from an earlier posting from me called Back to the Future with Description Logics that clearly separates the TBox and ABox functions:
Now, it is true that the ABox and TBox distinctions are conceptual, and in practice not often actual, with no mandate or requirement based in description logics that they remain separate. However, for reasons of tractability and communication and computational performance at scale, there may be justification for keeping these constructs separate .
In the diagram, note that each ABox instance has the simple appearance of an instance record. Also note that the attributes that describe or characterize those instances should also be included and described with relationships modeled at the TBox level. The TBox is the proper place to describe all of the attribute relationships.
So, for Structured Dynamics, we have made a clean split in these roles and data structures in those client architectures over which we have design control. Ontologies populate the TBox level. Instance records assembled into instance dictionaries populate the ABox level, with various instance types governed by their own lightweight schema and vocabularies. This simple functional split leads to cleaner architectures and easier decisions about what belongs in which box or another for a given circumstance. It is also more performant, but more on that in a later part.
Of course, this is not how the linked data and semantic Web is currently architected or conceptualized. This smearing of roles and work responsibilities leads, we think, to many communication issues and slower uptake. As our own thinking gets clearer on these issues, we see there are some key benefits arising from keeping distinct the TBox (ontologies) and ABox (instances) pieces of the semWeb pie.
Wikipedia is a good case in point where conjoining facts with a world view does not work well. One part that does work well are the “facts” in the specific Wikipedia pages that describe things. They are the ABox structure of the Wikipedia knowledge base. Another useful aspect of Wikipedia, kind of at the interstices of the ABox and TBox, are its see also and disambiguation pages. These, too, have proved to be very useful for gathering synonyms for a specific instance or for disambiguating two similarly named instances.
But at the conceptual level of how the world is organized — what are the relations between instances and how those instances are categorized — Wikipedia has arguably been unsatisfactory. Why that might be is a discussion for another time.
One perhaps could make an inverse observation about the Cyc knowledge base where a quite coherent world view of concepts exists (and, actually, many world views through Cyc’s very useful microtheories construct), but is often hard to discern and discover because of the admixing of instances, the coverage of which is also quite lumpy. Some domains have many instances, others are quite sparse.
Trying assiduously to keep bodies of facts and assertions (ABox) separate from how to interpret that world (TBox) brings distinct benefits. The facts base (ABox) is more easily tested for consistency. Different world views (TBox) can be more easily applied and compared against these fact bases. Testing and accepting different aspects of different sources is made easier if the ABox and TBox are not conjoined.
When the different purposes and roles and resulting work that might be applied to ABox and TBox are conjoined, our ability to describe things gets murky. We sometimes call mere controlled vocabularies “ontologies”, for example, which only acts to dilute the concept. We have facts and assertions and relations and hierarchies and stuff ranging from the minutiae to the abstract and sublime being lumped and described with the same terminology. Because we can not clarify and describe to ourselves roles and responsibilities for this stuff, no wonder we can’t communicate well with the broader public.
I believe if the semantic Web community could stand back and try again to apply the rigor of description logics to its enterprise, now that we are gaining some real exposure and success with linked data, we could begin to clean up this emerging mess we are creating for ourselves.
Here are some starting suggestions. Let’s call the combination of ABox and TBox a knowledge base, not an ontology. Let’s reserve the term ontology for the terminological relationships and concepts at the TBox level. And let’s focus on ABox instances as requiring only simple vocabularies to describe the assertions of attributes (what we might call schema consistent with RDFS and relational database schema). We thus could see a set of pieces similar to:
Note I suggest a couple of interesting work items at the interface between the TBox and ABox: disambiguating instances and determining the identity relatedness (for example, ‘same as’) between instances. This is work that should be kept apart from the ABox, but may or may not be best handled in the TBox (and, in any case, is generally separate work from the conceptual structure of the TBox).
This separation of concerns, or something akin to it, would result in a much cleaner — and, therefore, simpler — terminology for communicating with the interested public.
Prima facie, an instance schema that merely needs to capture attribute assertions for an instance will be much simpler than current practice. In turn, that should lead to more patterned schema with easier and quicker extension to new domains and vocabularies. And, that, in turn, will aid ABox consistency checking.
Without the need for ontology mapping at time of conversion, existing RDFizers could be more readily applied to convert other structured data forms to simple RDF schema.
Splitting the pie as suggested is merely the application of separation of concerns, which I believe all would largely acknowledge as leading to better substitutability and modularity. Besides swapping alternative world views to test their implications against common ABox datasets (the Benefit #1 case), we would also likely see quicker improvements in methods and algorithms for ABox consistency checking.
There has been growing interest and effort behind finding methods and vocabularies for describing datasets. The Sindice effort has led to the creation of suggested sitemap standard for crawling purposes; UMBEL has suggested standard vocabularies for describing what datasets are “about”, and voiD has been working to standardize how to characterize the nature of a dataset.
Insofar as the ABox and TBox are more cleanly separated, the decisions and tradeoffs for accomplishing these tasks should enable better dataset descriptions.
The discourse between the OWL and RDF communities can often be strained and at cross purposes. Many data publishers in the OWL community are from the sciences, where reasoning and decidability is imperative . Many in the linked data community are trying to get as much data exposed and published as possible. Kendall Clark recently blogged about these ‘tribes’ to which I also commented.
Like any world view, there is nothing inherently wrong with being more comfortable or wanting to live in one world as opposed to another. But ultimately, the assertions made by most linked data at the ABox level needs to be tested for reasonableness, and structure and an organizational view of the world (TBox) is not terribly helpful without instance data.
I wonder, in fact, whether it might be best for linked data publishers to eschew OWL altogether. Different RDF predicates could be adopted to claim sameAs-type assertions, for example, and ABox vocabularies and schema could be greatly simplified and patterned for easier development and templating. No matter how we cut it, all of this published data and its properties are only assertions until they can be tested for reasonableness, so why not accept that and make linked data generation faster and easier?
Everyone knows that data for data’s sake — linked or not — has to be tested for reasonableness before it can be relied upon for real work. Simple RDF schema for structured search purposes can work alone just fine: simply look at the error rates with current search engines. But, beyond search and non-critical linked browsing, reasoning is necessary.
The reasoning community has known for some time that all of these linked data assertions will have to be tested anyway. So, why not accept roles? Make linked data easier for search and browsing and publishing, and keep silly entailment assertions out of the mix. Then, allocate the reasoning work to coherent ontologies that know their world view and how to test for it. Instance records and ABoxes are not decidable on their own, so why pretend otherwise?
Ever since I first started to learn in earnest about ontology, something has been gnawing at me. The term seemed to be (shall I say?) an obtuse one whose obscurity was not the result of subtle precision or technicality, but rather one of fuzziness. As I introduced my Intrepid Guide to Ontology two years ago, I noted:
Since then, I have continued to find ontology one of the hardest concepts to communicate to clients and quite a muddled mess even as used by practitioners. I have come to the conclusion that this problem is not because I have failed to grasp some ephemeral nuance, but because the term as used in practice is indeed fuzzy and imprecise.
Even two years ago, I noted more than 40 different types of information structure that have at one time or another been labelled as an example of an “ontology”:
Since then, I could add even more terms to this list.
Lack of precision as to what ontology means has meant that it has been sloppily defined. As I have harped upon many times regarding semantic Web terminology, this is a sad state of affairs for the semWeb endeavor that has meaning at the core of its purpose.
I’m pretty sure that the original intent in embracing the concept of ontology within the realm of knowledge representation was not to see this term so broadly misused or mis-applied. I suspect, as well, that if we could sharpen up our understanding and remove some of the fuzziness that we could improve communications with the lay public across many levels of the semWeb enterprise.
Recently, I have been looking to the semantic Web’s roots in description logics. One of my writings, Thinking ‘Inside the Box’ with Description Logics, looked at the conceptual distinctions between the so-called ‘TBox‘ and ‘ABox‘. That is, a knowledge base is a logical schema of roles and concepts and the relationships between them (the TBox), which is populated by the actual data (instances) asserting memberships and attributes (“facts”) (the ABox).
By analogy, in a conventional relational database system, the database or logical schema would correspond to the TBox; the actual data records or tables would correspond to the ABox. Often, the term ontology is used to cover both ABox and TBox statements (which, I argue, only makes the understanding of the ‘ontology’ concept more difficult).
My recent writing, Back to the Future with Description Logics, discussed at some length the advantages of keeping the TBox and ABox separate. This current article now expands on those thoughts, particularly with respect to the definition and understanding of ontology.
The starting point for this new mindset is to return to the ideas of data records or data tables v. the logical schema that is prevalent in relational databases.
The last time I took a census, about a year ago, there were more than 100 converters of various record and data structure types to RDF . These converters — also sometimes known as translators or ‘RDFizers’ — generally take some input data records with varying formats or serializations and convert them to a form of RDF serialization (such as RDF/XML or N3), often with some ontology matching or characterizations. That last census listed these converters:
||Note that MIT’s SIMILE RDFizers also recognizes these formats:||There is a growing list of third-party RDFizers as well:|
This wealth of formats shows the robustness of the RDF data model to capture structure and data relationships from virtually any input form. This is what makes RDF so exciting as a canonical target for getting data to interoperate.
However — and this is crucial — standard users for decades have preferred simple, text-based and human readable formats for writing and transferring their structured data.
These various forms, sometimes well specified with APIs and sometimes almost ad hoc as in spreadsheet listings, are what we call ‘structs‘. Structs can all be displayed as text and have, at minimum, explicit or inferrable key-value pairs to convey data relationships and attributes, with data types and values often noted by various white space, delimiter or other text conventions.
There is no doubt that the vast majority of extant data is found in such formats, including the results of data or information extraction from unstructured text. Indeed, even HTML and many markup languages with their angle bracket-delimited fields fall into this category.
There have literally been hundreds of various formats proposed over decades for conveying lightweight data structures. Most have been proprietary or limited to specific domains or users. Some, such as fielded text, structured text, simple declarative language (SDL), or more recently YAML or its simpler cousin JSON, have become more widely adopted and supported by formal specifications, tools or APIs. JSON, especially, is a preferred form for Web 2.0 applications.
Some, like microformats or this example BibTeX record below (with some non-standard extensions), rely less on syntax conventions and may use reserved keywords (such as AUTHOR or TITLE as shown) to signal the key type for the key-value pair:
ID_LOCAL arXiv:0711.3808 AUTHOR <a href="#Schramm_O">Oded Schramm</a> BIBTYPE ARTICLE ID arXiv:0711.3808 JOURNAL Electron. Res. Announc. Math. Sci. PAGES 17--23 SUBJECTS geom TITLE Hyperfinite graph limits URL http://www.aimsciences.org/journals/doIpChk.jsp?paperID=3117&mode=full URL http://www.aimsciences.org/journals/displayPapers0.jsp?comments=&pubID=221&journID=14&pubString_num=Volume: 15, 2008 Journal Issue VOLUME 15 YEAR 2008
Some of these simple formats have been more successful than others, though none have achieved market dominance. There also appear to be few universal principles that have emerged as to syntax or format. Nonetheless, any of these various struct forms are easy for casual readers to understand and easy for domain experts to write.
For modeling and interoperability purposes, many of these forms are patently inadequate. That is why many of these simpler forms might be called “naïve”: they achieve their immediate purpose of simple relationships and communication, but require understood or explicit context in order to be meaningfully (semantically) related to other forms or data.
Yet, if we have learned nothing else with the phenomenal success of the Web it is this: simplicity trumps elegance or expressivity.
The RDF (Resource Description Framework) data model is expressed as simple subject-predicate-object “triple” statements. That sounds fancy, but just substitute verb for predicate and noun for subject and object. In other words: Dick sees Jane; or, the ball is round. It may sound like a kindergartner reader, but it is how data can be easily represented and built up into more complex structures and stories.
RDF triples can be applied equally to all structured, semi-structured and unstructured content. RDF is clearly a most capable data model that — through its ability to be extended with further concepts and relationships (vocabulary) — can create elegant and logical structures to represent comprehensive domains and knowledge bases. Finding such a model has been a quest in my professional life; I believe we finally have a winner to facilitate data interoperability using RDF.
But RDF has not achieved the market acceptance that its suitability as a data representation model might suggest. I think there are three reasons for this:
Canonical forms embody all of the specification that the canon guiding them requires. What we may have failed to see in embracing RDF, however, is that getting useful data into the system need not carry all of this burden.
So, what does all of this have to do with my starting diatribe about the term ontology?
Whether a single database or the federation across all information known to human kind, we have data records (structs of instances) and a logical schema (ontology of concepts and relationships) by which we try to relate this information. This is a natural and meaningful split: structure and relationships v. the instances that populate that structure.
Stated this way, particularly for anyone with a relational database background, the split between schema and data is clear and obvious. Yet, the RDF, semantic Web and linked data communities have done an abysmal job of recognizing this fundamental separation of concerns.
We create “ontologies” that mix instances and schema. We insist on simple data record conversions that are burdened with relationship specifications as well. We tout a “linked data” infrastructure that is based solely on the same identity of instances without respect or attention to structure or conceptual relationships. We dismiss communities that work to express their data with useful local structures. We insist on standards and practices up and down the data staging and preparation chain that turns off the general market and makes us seem arrogant and dismissive. Frankly, in so many ways, we just don’t get it .
What has struck me personally over the past few months as these realizations have unfolded has been how much our own mindsets and language may be trapping us.
At least for this diatribe, my essential conclusion is that we need to shift the burden of the schema and conceptual relations and (yes) world views to the TBox. We need to skinny down the ABox and make it a warm and welcoming environment by which any structured data (including the most naïve) can join.
So, ultimately, the bottom line is this: the burden of the semantic Web rests on us, not the providers of structured data.
It is time to streamline the ABox to smooth data contributions, assume as publishers the responsibility for the TBox, and keep those concerns separate. As for instance-related stuff, I now intend to refer to them as structs governed by a controlled vocabulary (at most). I intend to reserve ontology as a means to describe a given world view, a TBox, the schema and its relations of the domain at hand. And, frankly, this definition of ontology brings it back in balance with its roots in ontos and the nature of the world.
It’s a good time to lighten up!
I was glad to see Kendall Clark pick up on parts of my earlier piece on Thinking ‘Inside the Box’ with Description Logics. He took one point of view in his posting — that I mostly agree with — but I’d also like to reinforce some other thoughts. And, those thoughts are: description logics (DL) provides earlier lessons and insights that our current zeal for linked data should not overlook, and the lessons we can gain from DL are really fundamental and architectural.
For those of you who have not read Kendall’s piece — which I heartily recommend — let me give you my Cliffs Note’s summary: there are those within the semantic Web community that want to capture the conceptual relationships within knowledge and domains, the Maximum Fidelity tribe, and then those that want to link and describe as many things as possible, the Maximum Scalability tribe, with those (like Kendall’s firm, Clark & Parsia) residing in the middle and following the precepts of DL. The theme is that extremes exist and need to be bridged. 
Posing these contrasts is an effective way to describe different ideas and approaches, but, like all straw men, perhaps it hides nuances and complexity. And, as I note below, it may also pose the wrong straw man dichotomy.
Jim Hendler, for one, took exception to Kendall’s characterization to make the obvious point that different use cases demand different approaches. What was interesting, however, in these interchanges was that a nerve was seemingly struck about differences in viewpoints and approaches. Indeed, the very reference to “tribes” seemed to bring out the (ahem) tribal response.
So, just so we are clear, in what I say below I take on the position of a tribe of one; that is, my own opinion. Of course, this is what all of us do. By positing tribes and viewpoints we simplify what is nuanced and subtly convey that opinions are cultural (“tribal”) and not subject to learning and change. Perhaps within the temporal viewpoint of whatever may be today’s trends and “memes” such thinking may hold, but I fundamentally disagree with such a static view of collective understanding and communities over more meaningful periods of years or decades. But, I digress. . . .
At the risk of being simplistic, I think we can say that there was a rich academic and intellectual history behind description logics going back to the early 1990s . Then, with the seminal semantic Web paper built from thinking in the late 90s by Berners-Lee and published by him and Hendler and Lassila in 2001 in Scientific American , a real marker was put down for machine-readable and -actionable data (via “agents”) accessible on the Web. Many have been disappointed at the slow pace of the semWeb’s unfolding and some have blamed and rejected AI and “big” ontologies for this slowness. As usable standards finally emerged, a newer set of acolytes pushed “just getting data out there” and RDF linked data began to assume prominence from about 2006 onward, spearheaded by DBpedia and the linked open data community.
In so many ways we are coming full circle — coming back to the future — in seeing how our new linked data techniques can again benefit from this earlier DL thinking. Rather then poles and spectrums, I think we are experiencing the need to revisit our intellectual past now that workable publishing mechanisms and scalability and organization assume real prominence. Though clearly not intentional, the linked data community (and, in a related way, microformats), may just have stumbled upon a very cool architectural design that can leverage DL precepts.
Some of this DL and semWeb terminology can be off-putting. But it is helpful to know the lingo if one wants to look into the technical literature. Though most of this stuff can be described without resorting to such terms and can be readily grasped on an intuitive basis, here are some important grounding terms:
Within description logics and for our purposes herein, the two concepts we will most focus upon are the ABox and the TBox. As the definitions above suggest, the TBox is more structural and reflects the logical and conceptual relationships within a domain; that is, the role and concept and class relationships. The ABox provides the data (instance) records and characterizations within that schema; that is the instances and facts assertions. By analogy, in a conventional relational database system, the database or logical schema would correspond to the TBox; the actual data records or tables would correspond to the ABox.
These distinctions suggest very different purposes and roles, then, for the TBox and the ABox:
While certainly many of the ABox tests and checks require TBox structure, there is a pretty clear separation of purpose and role. Moreover: 1) the scale of the information in each “box” is vastly different (perhaps a few to hundreds to at most thousands of concepts in the TBox in contrast to potentially millions or more instances in ABoxes); and 2) ABox dataset repositories may also be (indeed, often are!) numerous, spatially distributed and semantically heterogeneous.
DL and semantic Web stuff in general are data and logic models, not architectural guidance. So, rarely does one see discussion of the architectural imperatives that some of these logical underpinnings provide. We see knowledge bases and ontologies both used as umbrella terms encompassing both the ABox and the TBox.
However, our own deployment experiences and the literature suggest there are manifest advantages to keeping the TBox and ABox separate:
|Advantages of Keeping the TBox and ABox Separate|
It would be useful to refrain from lumping the very different purposes of ABoxes and TBoxes under the umbrella rubric of ‘ontology’. It would also be useful for designers and vocabulary authors to be more explicit in their own minds as to purpose and content when formulating new ontologies. Smushing all of these concepts into one bubbling mess may not lead to clarity nor good performance.
Taking these basic ideas we can visualize a general schematic for best practice splits within the ontology or knowledge base:
The TBox is clearly focused on the domain at hand, but also includes links and equivalents to external ontologies. The TBox level should be entirely free of instance data, though all attributes, properties and concepts that might be found at the ABox level are also defined with their relationships at the TBox level. Like any semWeb ontology, this TBox level should also re-use common Web ontologies such as FOAF, SIOC, UMBEL, etc.
It is also the case that because of the reasoning needs at the TBox layer, the semantic Web language used should likely be a dialect of OWL (see below).
(BTW, for my own practice, I will try to limit my use of the ‘ontology’ term to the concepts and classes at this TBox level.)
The ABox level, in contrast, may consist of multiple datasets and name spaces. These structures are most appropriately seen as lightweight controlled vocabularies with limited structure; if written by scratch perhaps limited to RDF or RDFS (the schema variant). This layer, however, can also remain in non-semWeb native form — such as RDBMS data tables, microformats or other formats — that are wrapperized for interoperability through one or more ‘RDFizers‘ or GRDDL.
These structures should likely not make many external assertions, if any, and if done, perhaps in separate mapping or linkage file that can be processed and analyzed independently. It is important, however, to make sure that all attributes at this ABox layer have a counterpart with relationships and structure defined at the TBox layer.
This architectural design enables complete independence of the instance datasets from the inferencing logic or federation that might be applied to them.
Since it first took off in 2006, linked data and the various datasets now shown in the ‘LOD cloud‘ have been dominated by instance data. There are perhaps 10 million to 20 million instance objects available as linked data, many of which are derived from Wikipedia (via DBpedia) with attributes or structure coming from the Wikipedia infoboxes.
“There need not be a trade-off between expressiveness and scalability. Proper design, language choice and architecture can readily achieve both — while maintaining independence of scope or purpose.”
A similarly fast explosion has taken place with structured records via microformats and other simple data structures. For example, some earlier estimates suggest there are perhaps more than 2 billion pages that include microformats .
I have at times recently made comments about the dominance of instance data within the linked data community and the need for organizing structure. While this observation, I believe, remains true and provides a rationale for UMBEL as an organizing subject structure (or any other organizing structure, for that matter), perhaps I have been missing a more fundamental point: linked data (at least as practiced to date) is really about exposing ABoxes with simple structure. Perhaps, by serendipity, linked data (and other light structures like microformats) are showing the way to a distributed, mixed ABox-TBox structure for the Web.
With this altered viewpoint, a number of new observations emerge:
Linked data and microformats and other lightweight structures are now giving us the exposed instance data to begin reasoning and showing differences due to inferencing and other logic advantages for the semantic Web. Now that the ABox is being proven, let’s move on and stress-test the TBox!
Since the first version of OWL there has been confusion and some limitations with the dialects of the language. Only OWL Full allowed classes to be treated both as instances and classes (so-called metamodeling), and was therefore used as the basis for mapping UMBEL, for example, to RDF and RDFS vocabularies and to Cyc. This design was necessary, but left UMBEL undecidable using standard DL reasoners; only the two dialects of OWL DL and OWL Lite met description logics requirements.
Indeed, it was even hard to determine what dialect an OWL file represented, among many other problems and issues. The technical committee behind OWL 2, in fact, has written an excellent critique of issues with this first version of OWL .
For nearly two years the next version of OWL, OWL 2, has been undergoing development, with the last draft now published and available for last comment before January 23 . Lessons and refinements to the use of DL have also occurred. Some have criticized this effort and have criticized the need for OWL 2′s growing expressiveness and vocabulary [see 1, for example]. I believe these criticisms to be unfair and to miss many of the thoughtful improvements in this new version.
Version control and expressiveness are two of these benefits. A broader benefit, though, has been the keen attention the developers have given to compliance with description logics and the ability to formulate fragments (called “profiles”) that only present subsets of DL useful for computational considerations . For example, one profile, OWL 2 QL, appears well suited to the ABox; another, EL, appears well suited to the TBox. Users and tools builders may define other subsets of OWL 2 to deal with different use cases.
What is emerging are possible design patterns that would have comprehensive TBox guidance and inference structures that first receive a query, then do query rewriting for less capable OWL dialects and mapping to distributed ABox datasets, some of which might be kept in native relational DB or other structural forms [9, 14, 15]. Other approaches and designs, such as overviewed for DLDB2, KAON2, OWLIM, BigOWLIM and Minerva, are testing other architectural and DL combinations [see 15]. And, at the level of the specific triplestore, other optimizations are being made such as owl:sameAs or query rewrite with Virtuoso . This new version of OWL and its profiles have adapted to past lessons and can be matched well to the emerging hardware and architectural designs.
These changes appear to now provide the option for various dialects of OWL to be matched with reasoners and architectural designs in order to optimize for different purposes. Rather than a spectrum, we appear to be learning and maturing. Hopefully, getting back to the architectural implications of the TBox – ABox split can show us there need not be a trade-off between expressiveness and scalability. Proper design, language and dialect choice, and architecture can readily achieve both — while maintaining independence of scope or purpose.
Thanks, OWL 2! You have fulfilled your commitment to description logics. It is now our turn to figure out the best practices for working with these tools.