Posted: February 23, 2009

Making Linked Data Reasonable using Description Logics, Part 4

Concluding with a Simplified Instance Record Vocabulary for Linked Data ABoxes

In Part 1 of this series, I advocated the placement of linked data in an ABox construct from description logics [1] based on a separation of concerns argument. In Part 2, I reinforced that argument from the perspective of the work to be done within a knowledge base. In Part 3 we surveyed some of the key literature, finding justification for the split of the TBox from the ABox and the use of specialty RDFS and OWL dialects for work-oriented reasoning in the context of an integral logics.

We now conclude this series and try to bring these threads full circle to address what might be a vocabulary for an ABox instance record design. We’d very much like to thank Dr. Jim Pitman of the Bibliographic Knowledge Network project for having stimulated much of the thinking about the benefits and design of simple, human-authored and -readable instance records.

A Re-cap

Up until about six to eight months ago Fred Giasson and I were spending much of our thinking and design time on UMBEL, ontologies and what we now more precisely define as the TBox. Our intent all along was to get our process and thinking down pat there, and then turn ourselves to the representation of the actual entity data.

We have wanted to keep data records separate from logic and structure all along. Some clients have their own specific data records but may still want to interact with Web stuff or apply similar logic. Moreover, some client data is proprietary, some public. By organizing the data into “named entity dictionaries” we could modularize the architecture to allow swapping in and out of data appropriate to the customer or circumstance at hand.

Our initial design of this and what we share publicly has UMBEL and various standard public ontologies (FOAF, DC, SIOC, BIBO, etc) for the TBox, with Wikipedia entities and stuff from the BBC at the entity level (the ABox).

However, earlier work with another client showed us that our initial named entity structure was not sufficiently general or robust. That company’s records have complex relationships, such as affiliations for entities embedded in the same data record.

To improve the design, we went back to the drawing board to see if we could find guidance from the literature and other researchers as to how to “best” architect instance data in relation to the logic in the TBox (though we were not yet thinking about and framing our questions in terms of description logics, or DL).

This series of postings itself, and some of its predecessor articles, were motivated by probing the description logics space and the guidance it might provide to help determine performant architectures and designs.

Folks, We’re Making Linked Data Just Too Tough

For linked data to become truly successful, we need to find easier ways for data publishers to write, expose and share structured data on the Web.

As anyone who reads my blog knows, I frequently rail against poor semantics or other aspects of the linked data space that I feel are counterproductive. At the same time, I’d like to think that I am also a vocal advocate and proponent for linked data. I am indeed a fan.

To me, the fundamental precepts of RDF as a data model able to capture virtually any data structure or relationship, and the use of Web URIs as linkable identifiers for a global ‘Web of Data’, are simply foundational and game changing. Stuff like this quickens my pulse.

But look at what it takes someone today to publish linked data:

He must understand the terminology and standards and best practices — and actually, even amongst current practitioners, few do
She must assign Web identifiers (URIs) to her data objects, which means finding them and making them (gawd, I hate this word) “dereferencable”
He must understand the semantics of the relationships and linkages his data asserts (which, unfortunately, many don’t)
She must present her data in serialized subject-predicate-object “triples”, which are arcane and difficult for most to understand, and
They both often confuse data and instances with structure and world views.

Now, come on. This is not the recipe to success.

Simple and unbreakable and forgiving is the recipe to success.

As I noted in an earlier posting, there are many different data structures (‘structs’) for describing and conveying (transmitting) data records. Most of these are easy to understand and easy to read. We know that microformats have tried to capture a part of this space, as have, in other ways, data serializations such as JSON. What can we learn from such formats?

Well, one thing I have learned is that many on the Web positively want to expose their data. Another thing I have learned is that there is much structured data that will not get exposed unless the hurdles to doing so are small.

Revenge of the ABox

The phrase ‘revenge of the ABox’ comes from Heiko Stoermer’s thesis [2]; it conveys well, I think, the fact that everyone wants to capture and structure “world views” via ontologies and the big picture, but many do not want to grub around at the level of individual instances and data records. As he states, “. . . the most valuable knowledge is typically the one about individuals, but research on ontology integration has traditionally concentrated on concepts and relations.”

(The perverse outcome of this is that even though linked data as practiced to date is almost 100% about instance data, the discussion rarely looks at ABox-level work or instance data integrity.)

As this series and its predecessor posts have argued, description logics (DL) is an excellent guiding framework for how to make architectural and design decisions about linked data. DL and its ABox-TBox split have meshed beautifully with our earlier intuition to separate ontologies and a structural and organizational view of the world (TBox) from the instance records (ABox, or what we had been calling internally our ‘named entity dictionaries’).

As this four-part series and its predecessor pieces indicate, not only can we gain better conceptual understanding and realization of some of this semantic Web stuff by using DL, but also, perhaps, many of today’s silly or inefficient design practices may be remedied by better grounding our architectures in these logics.

One area, for example, where this has helped us much is in getting away from the confusing terminology of ‘individuals’ v ‘instances’. Once we come to see an instance record as just that (which is why collections can play on an equal footing with individual things, for example), we need only worry about asserting the attributes of the instance. We can defer all of the logic and reasoning about individuals and members and sets and collections and classes, etc., to the TBox and just get on with capturing and conveying our instance record, as an ABox.

For this reason alone (but there are others), Structured Dynamics has now abandoned the terminology of a ‘named entity dictionary’ in favor of ‘instance dictionaries’ or ABox (either of which is understood to contain one or more instance records).

The ‘Instance Record’

An instance record is simply a means to either represent or convey the information (“attributes”) of a given instance. An instance is the thing at hand, and need not represent an individual; it could, for example, represent the entire holdings or collection of books in a given library.

An instance record may convey information about multiple instances, but each block of information for each instance is about that instance alone. Thus, for example, if the instance is a paper citation, the instance is the paper. If as attributes it asserts multiple authors, each with different institutional affiliations, those affiliations get asserted in a separate instance for each author. They are attributes of the authors, not of the paper.

In this manner it is easy to see attributes as only pertaining to a given instance. If the overall information to be conveyed discusses attributes for multiple instances, then the instance record presents each characterized instance in series.

The Simplicity of Key-Value Pairs

The objective is to make it easy for data owners to write, read and publish data. This means the starting format should be a human readable, easily writable means for authoring and conveying these instance records (that is, instances and their attributes and assigned values).

The simplest, naïve format (independent of syntax or serialization) is the key-value (name-value) pair. In the key-value pair, the subject is always implied. So, for me, MikeBergman, as the subject:

first_name:Mike

sex:male

citizenship:USA

town:Iowa City

Because an instance record only describes attributes for a single instance at a time, all assertions can easily be transformed into the subject–predicate–object (s–p–o) “triples” of RDF. So,

<subject:MikeBergman> <hasFirstName> <Mike>
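As a minimal sketch of that transformation, and only under stated assumptions, the following Python turns the key-value lines above into s–p–o tuples. The function name pairs_to_triples and the "key:value" line convention are illustrative choices made here, not part of any published format; the keys are carried over unchanged as predicates rather than mapped to names such as hasFirstName.

```python
def pairs_to_triples(subject, text):
    """Turn 'key:value' lines into (subject, predicate, object) tuples."""
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between pairs
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a key-value pair: {line!r}")
        # The subject is implied by the record, so it is supplied once.
        triples.append((subject, key.strip(), value.strip()))
    return triples


record = """
first_name:Mike
sex:male
citizenship:USA
town:Iowa City
"""

triples = pairs_to_triples("MikeBergman", record)
for s, p, o in triples:
    print(f"<{s}> <{p}> <{o}>")
```

Mapping a hand-written key such as first_name to a vocabulary term, or to a URI, is left to a later step, in keeping with the idea that such work belongs to separate services.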

Now, of course, in conventional linked data many of these entries need to be expressed as URIs in order to “define” the item. Our design allows for that, of course, but also allows the user to simply provide literals (that is, not identifiers, but text strings or numeric or actual values) for each item. Thus, a “new” attribute is declared simply by using it, with its value declared just as simply.

Separate, specialized services (see below) may be (and often will need to be!) employed to look up and de-reference URIs, do datatype or data instance validation checks, evaluate identity relationships, disambiguate terms and so forth. The data supplier may choose to publish more-or-less complete “records” on their own, or they may not.

Through this design, nothing need change with regard to how linked data is being done today (other than the addition of some simple converters to accommodate the new format; see below). But, by shifting testing and validation work to external services, we can make it much easier for more data to get exposed and published. It is now time for linked data intermediaries and services to evolve in the linked data ecosystem.

In its most naïve form, this key-value pair format allows for fast and easy instance record creation with the ability to create instances and new attributes on the fly. Sure, these assertions need to be checked, but so does most data when it is asked to participate in any meaningful work.

This simple design, then, is very much in keeping with the limited roles and work associated with an ABox. Only attributes and metadata for an instance are being asserted. Conceptual relationships and specialized work that might be applied against the ABox to determine data validity or whatever is shifted to be external to the instance record, where it properly and logically belongs.

Relation to RDF

In Part 3 we discussed how fragments of the RDF and OWL languages can be used for specialized purposes within a knowledge base while keeping the overall logics of the system integral and decidable. Clearly, this instance record approach where the sole purpose is to assert attributes and values for an instance does not require any OWL. In fact, most linked data to date only brings OWL into the picture for the owl:sameAs property, the common errors of which we discussed in Part 2.

The instance record only requires a small subset of the RDF language. But it does require use of RDFS (Schema) because of the appropriate use of datatypes within the instance data record.

At the level of the TBox and the “specialized work” areas, there are other fragments of OWL, now called profiles in the soon to be released OWL 2 [3], that similarly can be applied to areas such as instance checking and validation, identity relation testing, etc., that I mentioned above. In other words, we can logically fragment RDF and OWL to do the individual parts of a complete system in order to simplify things and aid performance and computational efficiency.

The Instance Record Vocabulary

We are implementing this design internally through what we call the Instance Record Vocabulary (QName: irv). It is still quite experimental and we are testing some important aspects, some of which we describe below. As we get these nuances worked out better, we will release this vocabulary publicly for anyone to use and comment on.

As we presently see it, the namespace languages required for the IRV vocabulary are RDF, RDFS, DCterms and XSD. The RDFS (Schema) is required because, at minimum, of the incorporation of XML Schema datatypes (XSD), which we think to be a desirable requirement for what is, after all, an instance data specification and transfer protocol. However, the actual RDF and RDFS vocabulary used would be extremely minimal, with no OWL required.

In pseudo-form, with many serializations and simple syntaxes possible, this Instance Record Vocabulary has the following properties. Note as discussed above that the <s> in s–p–o is implied. Thus, in its naïve or handwritten form, it could be expressed in pretty simple key-value pairs:

<InstanceRecord>
  <Instance>
    <hasLabel> <[literal]> @en
    <hasAltLabel> <[literal]> @en
    <hasURI> <[URI]>
    <hasDescription> <[literal]> @en
    <Attribute>
      <hasAttribute1> <[literal with optional XSD (@en) or URI]>
      <hasAttribute2> <[literal with optional XSD (@en) or URI]>
      <hasAttribute3> <[literal with optional XSD (@en) or URI]>
      <hasAttributeX> <[literal with optional XSD (@en) or URI]>
    </Attribute>
    <assertIdentity> <[literal or URI]>
    <assertType> <[literal or URI]>
    <hasSource> <[literal or URI]>
    <hasVetting> <[literal or URI]>
  </Instance>
  <Instance>
    . . . repeat as needed . . .
  </Instance>
</InstanceRecord>

Note that most values allow either literal or URI specifications. Some of the properties are obviously optional, others, such as hasLabel, will be required. hasURI, for example, is one case of an optional property that then may require a separate lookup service to complete it as a linked data record.
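To make the shape of such a record concrete, here is a rough Python sketch of the pseudo-form as plain data classes. It is an illustration under assumptions, not the IRV itself: the class and field names are simply snake-case renderings of the properties above, only hasLabel is treated as required (as the text states), and the paper and author values are made up, following the citation example given earlier.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Instance:
    has_label: str                          # required, per the text
    has_uri: Optional[str] = None           # optional; may be filled in later
    has_description: Optional[str] = None
    attributes: Dict[str, str] = field(default_factory=dict)
    assert_type: Optional[str] = None       # literal or URI
    has_source: Optional[str] = None
    has_vetting: List[str] = field(default_factory=list)

    def __post_init__(self):
        if not self.has_label.strip():
            raise ValueError("hasLabel is required")

    def needs_uri_lookup(self):
        """True when a separate lookup service would have to supply a URI."""
        return self.has_uri is None


@dataclass
class InstanceRecord:
    instances: List[Instance] = field(default_factory=list)


# Affiliations are attributes of the author, not of the paper,
# so the author gets its own instance in the same record.
paper = Instance(has_label="An example paper", assert_type="Paper",
                 attributes={"author": "A. Author"})
author = Instance(has_label="A. Author", assert_type="Person",
                  attributes={"affiliation": "Example University"})
record = InstanceRecord([paper, author])

pending = [i.has_label for i in record.instances if i.needs_uri_lookup()]
print(pending)
```

Both instances here are expressed entirely as literals, so both are flagged as still needing a URI before the record could be treated as complete linked data.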

Instance records with literal specifications would need to be validated and checked before actually being used for standard linked data or meaningful data purposes. However, this approach is already well proven through, for example, OpenLink’s Virtuoso Sponger cartridges and design. Sure, some work would need to be done at time of ingest, but there are no technical challenges.

The language used to write a literal can be specified for any kind of attribute (metadata or not). The language is specified using the “@lang-tag” at the end of the literal. This method is similar to the N3 serialization of RDF, which is also equivalent to the XML serialization of RDF using the “xml:lang” attribute.
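A small sketch of how a parser might peel the language tag off a literal follows. The function name split_lang and the rule used to recognize a tag (a trailing "@" followed by letters, digits and hyphens) are assumptions made for illustration; a real IRV parser may well decide this differently, and a literal such as "user@host" would be misread by this simple rule.

```python
def split_lang(literal, default_lang=None):
    """Split 'text@lang' into (text, lang); no tag means default_lang."""
    text, sep, tag = literal.rpartition("@")
    # Treat the suffix as a language tag only if it looks like one,
    # e.g. 'en' or 'fr-CA'; an e-mail domain with dots does not qualify.
    looks_like_tag = (
        bool(sep) and bool(text) and bool(tag)
        and tag[0].isalpha()
        and tag.replace("-", "").isalnum()
    )
    if looks_like_tag:
        return text.rstrip(), tag
    return literal, default_lang


print(split_lang("Iowa City @en"))
print(split_lang("mike@example.com"))
```

The default_lang argument is one place where a language declared elsewhere, for example in a source characterization, could be applied to untagged literals at ingest.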

Metadata

Most of the first properties are simply metadata describing the instance. The strings could be qualified by language.

Attributes

The bulk of the instance record is devoted to the attributes and their values. Attributes could be optionally declared with XSD datatypes. URI references could be specified or later substituted by vetting services (see below).

Attributes could also optionally be characterized in a list format, similar to the Lists specification for Notation 3 (N3).

Asserted Relations

Identity and class membership (rdf:type) assertions could be made; these could later be checked for correctness or identity relations with external or specialized services. The assertIdentity property, in particular, is a replacement for owl:sameAs with more appropriate ABox semantics.

hasSource

A separate Source record is being developed to cover source or dataset characterizations. A single instance extraction from a Web page, for example, would be accompanied by a simple source characterization. Instances of particular types, such as microformats for example, would be so noted (as they might invoke specialized processors or carry certain authority). Instances from large datasets would have a still longer list of possible characterizations.

This property may draw closely on what is also being done for the voiD dataset vocabulary.

Certain parameters in a Source record, such as language for example, may also be applied in special ways by the IRV parser at time of ingest with respect to specific literal specifications.

In any event, this is one of the properties still needing much more thought and definition.

hasVetting

This property, too, needs much more thought and definition.

The hasVetting property, for which multiples are allowed, would identify the specific checks and services applied to the instance data. Depending on service, such checks might include URI lookup or de-referencing, identity relations and testing, record completeness and sufficiency checks, data type checking and validation, general instance checking, disambiguation, and so forth (see “Specialized Work” below).

Some services might also re-write the instance record with corrected values or URIs returned in place of literals.

Best practice for external services would suggest identifying them by URI, though literals would also be allowed to identify internal checks or for lookup purposes.

This property is meant to be a key indicator of how third parties may want to rely on the data. Combined with hasSource, these hasVetting entries provide essential authority and provenance information about the data at hand.
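One way to picture hasVetting in operation is as a log that each check appends to as it runs. The sketch below is purely hypothetical: the two services, their identifiers and the example.org URIs are invented for illustration, and the instance is represented as a plain dictionary keyed by the property names used above.

```python
def label_check(instance):
    """Hypothetical internal check: require a non-empty hasLabel."""
    if not instance.get("hasLabel"):
        raise ValueError("hasLabel is required")
    return dict(instance)


def uri_lookup(instance, known_uris):
    """Hypothetical service: fill in hasURI from a table keyed by label."""
    updated = dict(instance)
    if "hasURI" not in updated and updated.get("hasLabel") in known_uris:
        updated["hasURI"] = known_uris[updated["hasLabel"]]
    return updated


def vet(instance, services):
    """Apply (service_id, function) pairs in order, logging each under hasVetting."""
    current = dict(instance)
    vetting = list(current.get("hasVetting", []))
    for service_id, check in services:
        current = check(current)      # a service may re-write the record
        vetting.append(service_id)    # multiples are allowed
    current["hasVetting"] = vetting
    return current


known = {"Iowa City": "http://example.org/id/iowa-city"}
services = [
    ("internal:label-check", label_check),                       # literal id
    ("http://example.org/services/uri-lookup",                   # URI id
     lambda inst: uri_lookup(inst, known)),
]
vetted = vet({"hasLabel": "Iowa City"}, services)
print(vetted)
```

A third party reading the result can see both what was asserted and which checks stand behind it, which is the provenance role described above.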

Putting it All Together

This diagram attempts to show the relationship of how many of these pieces may interact:

Some of these bubbles deserve some additional commentary.

Hand-crafted Input

An important objective in this design is to allow naïve, simple text specifications to be hand-crafted for instance records. There are many relatively simple formats for specifying key-value pairs with relatively few conventions, ranging from BibTeX to YAML and JSON and others. There are literally hundreds of such formats available, as my earlier overview of Naïve Representations and Structs discussed.

There may be justification for still another form in relation to this Instance Record Vocabulary or not; this topic is still under active discussion.

External Structs

However, whether there is a separate format or not, that same earlier piece overviewed the many simple data structs presently out there. It also noted the nearly 100 existing converters for these forms to RDF. These same converters, with quite slight modifications, could all output the Instance Record Vocabulary in an appropriate serialization as well.

Hooks to Functional and Scripting Languages

Another option is to combine this design with a functional language front-end to generate these records. (Though they could be produced in other ways, as well.) For example, lambda calculus or even a domain-specific language (DSL) could be used to create this very simple record generator. This simple system, in turn, could have a straightforward API that would allow existing scripting languages (such as Python or others) to be used as well.

Specialized Work

So, in fact, we can also now see the specialized work (see also Part 2) that itself is not part of the ABox but can and often should be applied to the instance data in the ABox:

Record sufficiency checking
De-duplication
Membership testing
Most specific concept identifying

Datatype checking
Identity relation testing
New attribute checking
ABox consistency testing

Data range checking
Disambiguation
Source-specific testing
Uniqueness testing

URI lookup
URI de-referencing
Satisfiability checking
Others . . .

Though, strictly speaking, such specialty work could be seen to occur at the TBox level, it is actually different and separate logic from “standard” inferencing or reasoning. Specialized work can therefore often occur as separate tests or in batch mode with fragments of OWL or other dedicated indexes and algorithms. Some of this specialized work may take advantage of the conceptual relationships in the TBox, but may not necessarily need to do so. In these ways, the inferencing work of the TBox can be kept clean and efficient.
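Two of the listed checks, datatype checking and de-duplication, can be sketched as simple batch tests that need nothing from the TBox. Everything here is an assumption for illustration: the mapping of XSD datatype names to Python parsers covers only three types, duplicates are judged naively on label and asserted type, and the sample batch is invented.

```python
from datetime import date

# Hypothetical mapping from a few XSD datatype names to Python parsers.
XSD_PARSERS = {"integer": int, "decimal": float, "date": date.fromisoformat}


def datatype_errors(attributes):
    """Return names of attributes whose value does not parse as its declared type."""
    bad = []
    for name, (value, xsd_type) in attributes.items():
        parser = XSD_PARSERS.get(xsd_type)
        if parser is None:
            continue  # untyped or unknown type: nothing to check here
        try:
            parser(value)
        except ValueError:
            bad.append(name)
    return bad


def deduplicate(instances):
    """Keep the first instance seen for each (label, type) pair, ignoring case."""
    seen = set()
    unique = []
    for inst in instances:
        key = (inst["hasLabel"].strip().lower(), inst.get("assertType"))
        if key not in seen:
            seen.add(key)
            unique.append(inst)
    return unique


batch = [
    {"hasLabel": "Example paper", "assertType": "Paper",
     "attributes": {"page_count": ("twelve", "integer"),
                    "issued": ("2009-02-23", "date")}},
    {"hasLabel": "example paper ", "assertType": "Paper", "attributes": {}},
]

unique = deduplicate(batch)
problems = {inst["hasLabel"]: datatype_errors(inst["attributes"]) for inst in unique}
print(problems)
```

Checks like these can run at ingest or in batch, entirely apart from TBox inferencing.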

Beyond Browsing and Unvalidated Queries

Today, linked data has largely been used for browsing and providing unvalidated responses to queries; focus and attention to its ABox roles are important to move beyond this baseline into meaningful work [2]. In those limited instances where this linked data has been looked at and evaluated as a complete knowledge base, such as the SWSE search engine with the SAOR approach as discussed in Part 3, more than 97% of the RDF triples provided in those cases were removed from consideration, often for logical or mis-assertion reasons [4].

The ideas presented here for a simpler linked data specification that can be easily represented in readable text are not new. RDF in JSON has been looked at in this way by Talis and JDIL, YAML has been looked at similarly, and similar and simpler approaches have been looked at closely for topic maps. There are other examples.

A key thrust of these efforts is to make it easier for the data publisher, thereby encouraging the exposure of more structured data.

These emerging ideas do not change in any way the usefulness of current linked data. Our suggested approach interoperates seamlessly with current practices and easily co-resides with them. But, these ideas do:

Provide a simpler path for writing and publishing human-readable instance data
Provide an ABox instance record structure that can have much specialized work applied against it in a consistent way, and
Contribute to an overall logic and architecture that is performant and scalable for doing meaningful work.

Though still needing further thought and refinement, this broad outline of roles and architecture and structure for the ABox supplies the last missing piece of Structured Dynamics’ overall approach to linked data and RDF. Much time, thought and research have gone into it. Again, we’d very much like to thank Jim Pitman for his ideas that have helped catalyze this design [5].

We think the combination of a generalized Instance Record Vocabulary that can be reasoned over for ABox-level data checking, and that works with a simple, text-based key-value pair input format, might be a winning combination.

[1] This is our working definition for description logics:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”

[2] Heiko Stoermer, 2008. Okkam: Enabling Entity-centric Information Integration in the Semantic Web, Ph.D. thesis presented to the DIT – University of Trento, January 2008, 185 pp. See http://eprints.biblio.unitn.it/archive/00001394/01/dissertation_camera_ready.pdf.

[3] Boris Motik et al., eds., 2008. “OWL 2 Web Ontology Language: Profiles,” a W3C Working Draft, December 2, 2008. See http://www.w3.org/TR/owl2-profiles/.

[4] Aidan Hogan, Andreas Harth and Axel Polleres, 2008. “Scalable Authoritative OWL Reasoning on a Billion Triples,” in Proceedings of Billion Triple Semantic Web Challenge 2008, at the 7th International Semantic Web Conference (ISWC2008), Karlsruhe, Germany, 2008. See http://sw.deri.org/~aidanh/docs/saor_billiontc08.pdf.

[5] This input has come as a result of research supported in part by NSF Award 0835851, Bibliographic Knowledge Network.

Posted: February 19, 2009

Satisfied and Tickled, Too

Refinancing Can Save You Money, and Is Patriotic!

I don’t usually stray from my normal topics of the semantic and structured Web, but a serendipitous event of the past week warrants an exception, I think.

Last week, I was at my local bank arranging a new transaction for me involving an international wire transfer of funds. It was taking a bit of time as the staff put all of this in place for future transactions. Because I had some time, I asked the local manager to fill me in on mortgage rates, etc. I could not really remember the details of my own current home mortgage, so the upshot was to have one of the bank’s mortgage bankers give me a later call.

The next day I got the call back (kudos to Cindy Lynch at MidwestOne for exceptional service!). Within two days, we have refinanced our home mortgage. While there is no need to share our specifics or details, what I found blew my mind and may have some broader implications.

Our General Circumstance

We have good — but not exceptional — credit ratings and had 8 yrs remaining on a 15-yr fixed loan. When we got that mortgage in 2002, the rates were as low as they ever had been at that time. Though I knew rates had dropped somewhat, I also thought current rates were not sufficiently lower to justify a refinancing.

I was wrong.

Circumstances will certainly vary due to many, many aspects, but, in our case, we learned we could decrease our monthly mortgage payment by 30% while only adding a year to our payoff period. Furthermore, we are able to recoup the total closing and mortgage refinance costs in a bit over a month. Needless to say, we jumped at the chance and locked in a refinancing today.

Be Patriotic by Removing your Assets from Toxic Consideration

That is all well and good and would have remained just a family matter except for something we realized over dinner tonight: We are being patriots!

And here’s why.

Much of what apparently is sick at the core of the current worldwide financial mess is “toxic assets”, most of which are due to property mortgages and financing. For example, our own existing mortgage holder, Washington Mutual (WaMu), was in fact one of the troubled firms recently gobbled up by JP Morgan Chase, at a purchase price of $1.9 billion against a book asset base of $310 billion; in other words, less than one cent on the dollar.

Wow. Simply, wow.

These things happen because of uncertainty and fear. Because of their mortgage basis, however, no one seems to know what the true valuation of these companies might be, since we have this fundamental circumstance:

(toxic assets) + good assets

______________________________

existing assets

And, of course, no one seems to know what the various ratios and aspects are.

Now, however, if each of us good citizens goes forward to refinance our homes, by the way saving money to boot!, we now could see the equation shifting as follows:

(toxic assets)

______________________________

existing assets – good assets

By refinancing, we have shifted existing assets from uncertain to good, thereby reducing risk and saving money at the same time! The denominator of what is at risk is decreased, and the numerator of what is of concern gets clearer and becomes smaller.

This is, of course, a bit naive, since demand for refinancing may reduce the mortgage rate split and thereby reduce the individual homeowner’s incentive to even participate. But, even if the total appraised value of your home has decreased, that does not affect your ability to refinance so long as you meet standard ratios, which remains true for the vast majority of us.

So, even if each of us only saves a few dollars per month by taking this approach, we can both save ourselves money and help increase certainty.

Absent each of us contributing in this way, some body or system will need to be put into place to look top down and inspect all of these asset holdings. Just as the Internet has shown us the power of collective effort across many, many individuals, each of us choosing to refinance can multiply many fold whatever sort of centralized solution the government might eventually be able to mount, and quicker while saving ourselves money!

It is that simple. This effect and delta may not last forever, but each of us has some power to act in a minor and ultimately forceful way to help work ourselves out of this mess.

Not Meaning to be Parochial

I suspect what I have discussed above applies to most countries and circumstances around the globe. If my example and the money are too centered on the US, just replace them with your own currency and terminology. I think you will still find that you can save yourself money and be patriotic!

Posted: February 18, 2009

Making Linked Data Reasonable using Description Logics, Part 3

Historical and Research Support for Splitting the TBox and ABox

In Part 1 of this series, I advocated the placement of linked data in an ABox construct from description logics [1] based on a separation of concerns argument. In Part 2 of this series, I reinforced that argument from the perspective of the work to be done within a knowledge base.

I came to these viewpoints independently. I do not have any special background in these disciplines; I am a recent researcher and practitioner in the field, perhaps akin to a gentleman natural scientist of the 1800s. As these ideas have formed, therefore, I have also attempted to see what some of the noted experts in the field have said and written.

Like any other field, there is no common viewpoint or doctrine about these matters. But, there is considerable — and historic — support for this viewpoint of splitting the TBox and the ABox for many different reasons. To my knowledge, this viewpoint has yet to be consolidated and applied to linked data. Perhaps this series will help stimulate that discussion.

A Bit of ‘Ancient History’

The first specific discussion of this matter I was able to discover (though I suspect it had been discussed in earlier internal forums or papers) was on the W3C’s RDF logic mail lists in 2001. While only eight years ago, it does feel a bit like ancient history with regard to the development and understanding of semantic Web languages.

The mail list topic was what role RDF should assume, at a formative point in the language’s development. A possibly restricted scope for RDF akin to a relational database or even as an “ABox” was being discussed. (This restriction, of course, was dropped for the more open, free scope of the present RDF.) To help clarify these matters, Jérôme Euzenat first noted in a thread with Pat Hayes, who two years later would author the RDF semantics W3C standards document [2], that:

The TBox defines a terminology, i.e. assign terms (which are interpreted as sets) to names, while the ABox asserts formulas generally about individuals. The ABox can use the names assigned in the TBox, while the TBox usually do not refer to those interpreted as individuals in the ABox (but this is evolving). This meets the mathematical delimitation between definitions and assertions (no?!).

Logically this can be cleanly defined by separating the languages. In model theoretic terms, this means that all terms are interpreted as sets (and all assertions can be reduced to inclusion between these sets).

There is an important consequence on that separation (that held at least in first DL languages): the terminology itself cannot be found inconsistent. This is because, it only deals with sets (e.g. the set of things with more than four legs and less than two legs) and the worst that can happen to them is to be empty (but having a whole theory about plenty of empty sets is not inconsistent). On the contrary the ABox can be inconsistent, just because it asserts things (e.g. that there exists something with four legs belonging to the set of things with at most three legs).

This elicited some discussion about TBox and ABox roles and purposes. Ian Horrocks, one of the lead authors of DAML+OIL and now OWL, replied later in that thread:

The DL distinction between Tbox and Abox has changed over the years. In the days when reasoning algorithms where less well understood, the split was more or less into that part of the KB [knowledge base] for which (something like) sound and complete (subsumption) reasoning could be performed (the Tbox) and the part for which it could not (the Abox). Nowadays, the distinction is simply the type of axioms: the Tbox consists of (class) concept inclusion axioms (and/or equivalence) – e.g., “C subsumes D” while the Abox consists of individual/tuple membership axioms – e.g., “x is an instance of C” or “<x,y> is an instance of R”. For most DLs this is a useful distinction as a simpler algorithm can be used for reasoning with a Tbox alone (no individual names).

. . . Even today, where Tboxes are assumed to contain arbitrary inclusion axioms, it is useful in practice to divide the Tbox into a set of “definition” axioms and the remaining “general” axioms, which set is made as small as possible by applying a rewriting optimisation known as absorption. Tableaux algorithms can then exploit the definitional part of the Tbox by using a lazy expansion technique.

. . . I think Jerome is right. Without the ONE-OF construct, DAML+OIL [what has now evolved to OWL] could be seen as a Tbox with “raw” RDF acting as an Abox.
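The distinction Horrocks draws by axiom type can be sketched as a small data structure. This is a minimal illustration under my own assumptions: the class names `Subsumes`, `ConceptAssertion` and `RoleAssertion` and the helper `split_kb` are hypothetical, and equivalence axioms are left out for brevity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Subsumes:            # TBox axiom: "C subsumes D"
    general: str
    specific: str

@dataclass(frozen=True)
class ConceptAssertion:    # ABox axiom: "x is an instance of C"
    individual: str
    concept: str

@dataclass(frozen=True)
class RoleAssertion:       # ABox axiom: "<x,y> is an instance of R"
    subject: str
    object: str
    role: str

def split_kb(axioms):
    """Partition a knowledge base purely by the type of each axiom."""
    tbox = [a for a in axioms if isinstance(a, Subsumes)]
    abox = [a for a in axioms if not isinstance(a, Subsumes)]
    return tbox, abox

kb = [
    Subsumes("Animal", "Dog"),
    ConceptAssertion("rex", "Dog"),
    RoleAssertion("rex", "joe", "ownedBy"),
]
tbox, abox = split_kb(kb)
```

Note that the TBox half mentions no individual names at all, which is what allows a simpler reasoning algorithm to work on it alone.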

I find two important ideas emerging at this point. First, the roles and purposes of the TBox and ABox are being made clearer, consistent with the definition we have been using [1] and with better clarity and applicability to the semantic Web than what was earlier presented in the Description Logics Handbook [3]. And, second, the idea of a language split between OWL and RDF vis-à-vis the TBox and ABox is made public for the first time.

Similarities to the Relational Model

The conceptual foundations of the relational model and RDF are indeed quite similar, based as they are on set theory and relationships. The relational model and description logics are both based on FOL (first-order predicate logic). The linkage with the data table of relational database systems is especially close and direct and was first a topic of a design architecture document by Tim Berners-Lee in 1998 [4].

Indeed, today:

Many RDF triple stores are based on relational database (RDB) systems
Close analogies exist between the RDB query language SQL and the RDF query language SPARQL
There are effective working systems for translating from relational data to RDF (such as OpenLink’s RDF Views), and
A W3C incubator group is now proposed to transition into a full work group devoted to RDB to RDF mappings.
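The closeness of a data table to RDF is easy to see in a naive row-to-triple mapping. The sketch below is an assumption-laden illustration, not how OpenLink's RDF Views or any W3C mapping works: the function `rows_to_triples`, the `http://example.org/` base and the URI patterns are all hypothetical.

```python
def rows_to_triples(table, key, rows, base="http://example.org/"):
    """Map each row to triples: subject from the primary key, predicate from each column."""
    triples = []
    for row in rows:
        subject = f"{base}{table}/{row[key]}"
        for column, value in row.items():
            if column == key or value is None:
                continue  # the key is already in the subject; NULLs yield no triple
            triples.append((subject, f"{base}{table}#{column}", value))
    return triples

rows = [
    {"id": 1, "name": "Ada", "city": "London"},
    {"id": 2, "name": "Bob", "city": None},
]
triples = rows_to_triples("person", "id", rows)
```

Each cell becomes one assertion about one individual, which is why a table read this way behaves like an ABox.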

There is a rich literature to investigate these aspects in detail. And, of course, these matters are of critical import because 99% of the current structured data in the world is being managed by RDBMSs [5].

However, of more direct interest to this specific series of articles is how this close relationship between RDB and RDF is viewed with respect to the ABox and TBox separation of concerns.

Ian Horrocks (not surprisingly given the nature of his comments above), among others, has played a prominent role in looking at questions such as “hybrid DL-DB” (description logics + [relational] database) systems [6] and building conceptual links between relational databases and ontological-level reasoning [7].

We need not deconstruct his observations and arguments here in detail. What I glean from his strong background in description logics, however, is that relational data tables can be left in situ as ABox constructs, in the process gaining both the efficiency of limited ABox reasoning and the efficiency of RDBMSs. With proper design (which I also understand to be fairly straightforward), it should be possible to build hybrid ABox and TBox systems that work in a distributed context.

(Hmmm; sounds to me like ideas applicable to linked data!)

The more recent paper by Horrocks et al. [7] and its expansion [8] propose an extension of OWL for such hybrid or split knowledge bases. These extensions are designed to allow modelers to designate a subset of TBox axioms as integrity constraints. For TBox-level reasoning, these axioms are treated as usual. However, these axioms may also be applied separately to ABox instance data to perform integrity checks. Integrity can then be checked in advance, with those axioms ignored during subsequent query answering over the instance data, thus also improving performance. I think these points also have direct relevance to linked data.
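As a rough illustration of the integrity-constraint idea, the sketch below checks ABox data against axioms that a modeler has flagged as constraints, in the closed-world spirit of a relational database. It assumes one simple constraint shape ("every instance of class C must have property P"); the names `check_constraints`, `Person` and `hasSSN` are hypothetical and not the syntax of the proposed OWL extension.

```python
def check_constraints(constraints, abox):
    """Closed-world check: every instance of a class must carry the required property."""
    types = {(s, o) for s, p, o in abox if p == "type"}
    present = {(s, p) for s, p, o in abox}
    return [(individual, prop)
            for cls, prop in constraints
            for individual, c in sorted(types)
            if c == cls and (individual, prop) not in present]

# Hypothetical TBox axiom flagged as an integrity constraint.
constraints = [("Person", "hasSSN")]

abox = [
    ("alice", "type", "Person"), ("alice", "hasSSN", "123"),
    ("bob", "type", "Person"),
]
missing = check_constraints(constraints, abox)
```

Under ordinary open-world reasoning, `bob` would simply be inferred to have some unknown value; treated as a constraint, the same axiom reports the record as incomplete.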

As a proponent of the OWL side of the spectrum, Horrocks’ viewpoints have perhaps been too readily dismissed by some in the linked data community. Yet the major reason for looking at all of these questions from the perspective of description logics is to gain a coherent view across the entire semWeb enterprise. In the end, we are linking data for a purpose, to be able to do meaningful work with it. Just as RDB data tables can be looked at and integrated productively as ABoxes in a DL construct, so may linked data.

What of a Simpler RDFS for the ABox?

If there truly is a separation of concerns between instance records (ABox) and reasoning constructs (TBox ontologies), what does that begin to tell us about the languages we need for these purposes? If we can postulate no OWL in a linked data instance construct (the ABox), why not narrow RDFS [9] as well in order to have a vocabulary only as expressive as what an instance record and its assertions require?

De Bruijn et al. in 2005 demonstrated logically how RDF models can be related to description logics-based ontology languages, especially OWL DL, without the need to change syntax or semantics in either language [10]. They noted specifically the use of RDF graphs as ABoxes that could be readily queried using SPARQL.

Herman ter Horst wrote another influential paper in 2005 where he looked closely at the proofs of completeness and decidability for RDF and RDFS [11]. He defined a general RDF graph extension that was fully decidable, and importantly looked at each statement in the language from the standpoints of complexity and computational tractability. He was particularly seeking logics that would be more computationally efficient due to fewer entailments, while still being “decidable” (that is, provable to reach computational closure). Here is the basic chart of plotting the various language dialects he investigated:

He noted that inclusion of XML datatypes required the use of RDFS for closure and the addition of the so-called ‘D* entailment’ could extend RDFS to include reasoning with datatypes. He then extended that construct into what he called the ‘pD* semantics,’ which was intended to allow useful conclusions to be drawn about instances in the presence of an ontology with relatively low computational complexity.

What this construct means — as I understand it in the context of this series — is that specialized dialects (pD*) could govern the work of instance checking and other specialized work at the ABox level (RDFS) while being fully compatible with the TBox level (OWL) ontology. This means that languages and dialects could be tailored for the work at hand for efficiency and representational reasons, while maintaining logical integrity. Indeed, this very pD* dialect of OWL is now included as one of the proposed profiles, OWL 2 RL, for the new release of OWL 2 [12].
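To give a feel for what rule-based entailment over instance data looks like, here is a minimal forward-chaining sketch. It applies only two standard RDFS rules (rdfs11, subClassOf transitivity, and rdfs9, type inheritance) until nothing new is derived; it is not the pD* rule set, and the function name `forward_chain` and the sample data are assumptions for illustration.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

def forward_chain(triples):
    """Apply rdfs11 (subClassOf transitivity) and rdfs9 (type inheritance) to a fixpoint."""
    closure = set(triples)
    while True:
        new = set()
        for s, p, o in closure:
            if p != SUBCLASS:
                continue
            for s2, p2, o2 in closure:
                if p2 == SUBCLASS and s2 == o:
                    new.add((s, SUBCLASS, o2))  # rdfs11: s < o and o < o2 gives s < o2
                if p2 == TYPE and o2 == s:
                    new.add((s2, TYPE, o))      # rdfs9: s2 is an s, s < o, so s2 is an o
        if new <= closure:
            return closure                      # nothing new was derived
        closure |= new

triples = {
    ("Dog", SUBCLASS, "Mammal"),
    ("Mammal", SUBCLASS, "Animal"),
    ("rex", TYPE, "Dog"),
}
closure = forward_chain(triples)
```

Because the rule set is small and every rule only adds triples over existing terms, the loop is guaranteed to terminate, which is the practical appeal of such restricted dialects.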

In a different vein, the paper that won the best paper award at ESWC in 2007 looked at the question of simplifying RDFS [13]. The authors were able to identify a fragment of RDFS that captured the complete semantics of RDF by carefully removing pieces that only described or allowed reasoning of the language itself.

A relatively streamlined and simplified structure for the ABox is not a new idea. Through version 3.x, the Protégé ontology editor included a built-in for SWRL (the Semantic Web Rule Language) that included an ABox ontology [14]:

<rdf:RDF xml:base='http://swrl.stanford.edu/ontologies/built-ins/3.3/abox.owl'>
  <owl:Ontology rdf:about=' '/>
  <swrl:Builtin rdf:ID='hasValue'/>
  <swrl:Builtin rdf:ID='hasURI'/>
  <swrl:Builtin rdf:ID='isNumeric'/>
  <swrl:Builtin rdf:ID='notNumeric'/>
  <swrl:Builtin rdf:ID='isIndividual'/>
  <swrl:Builtin rdf:ID='isConstant'/>
  <swrl:Builtin rdf:ID='hasClass'/>
  <swrl:Builtin rdf:ID='hasProperty'/>
  <swrl:Builtin rdf:ID='hasIndividual'>
    <swrlb:maxArgs rdf:datatype='http://www.w3.org/2001/XMLSchema#int'>1</swrlb:maxArgs>
    <swrlb:args rdf:datatype='http://www.w3.org/2001/XMLSchema#int'>1</swrlb:args>
    <swrlb:minArgs rdf:datatype='http://www.w3.org/2001/XMLSchema#int'>1</swrlb:minArgs>
  </swrl:Builtin>
  <swrl:Builtin rdf:ID='setValue'/>
</rdf:RDF>

Don’t be fooled by the OWL designation in this file; for these uses, Protégé by convention requires all of its files to be of the OWL type. Note the simple vocabulary above has solely RDF predicates. We do not think this is yet the correct design (see Part 4), but it captures the right idea.

Logical and mathematical advances since the first releases of RDF and OWL now suggest that, with proper care and design, various dialects or fragments can be designed for specific purposes and for computational efficiency while maintaining — in their combination — logical integrity. An RDFS fragment, if you will, dedicated for linked instance data and ABox instance record purposes, appears conceptually doable. And, it may be computationally advisable.

SAOR: One Approach to Combining These Pieces

The SWSE (“swizzy”) project from DERI and the National University of Ireland in Galway has an interesting legacy and has been combining many of these threads into one approach to a semantic web search engine (hence, SWSE). You can use and test for yourself the new VisiNav interface and service, the newest instantiation of SWSE.

Relative to ABox (instance) data, the volume of TBox (structural) data on the Web is small: only around 0.7% of statements were classifiable as TBox statements [16].

In building what they call SAOR (for Scalable, Authoritative OWL Reasoner) for the SWSE effort, Aidan Hogan, Andreas Harth and Axel Polleres have intersected a number of interesting approaches and have taken some innovative paths to the questions of separating the TBox and ABox [15]. They have further applied this to the large-scale Billion Triples Challenge with interesting findings and results [16].

In their approach to building SAOR, the designers:

Found, at scale, that complete inferencing at the instance level is neither feasible nor desirable
Separated terminological knowledge (TBox) from assertional data (ABox) according to their use of the RDFS and OWL vocabulary
Did not carry over owl:sameAs inferences to the TBox, with their rationale being that such an approach was in line with a first-order-logic point of view where equalities do not affect predicates
Stored all forward-chained inferences in the ABox only
Tested for and removed what they called “ontology hijacking”, which they defined as third parties attempting to broaden the definition of concepts for which they were not the original, authoritative authors, and
Constructed a resulting synthesized, static TBox based on these rules.

They further picked up on a variant of the ter Horst pD* semantics noted above to speed the reasoner for calculating the forward-chaining inferences.

According to these decisions and rules, they found the overwhelming majority of statements within the 315,000 sources they crawled to be “non-authoritative” and indeed made many decisions that, in essence, threw out statements in the source instance sets. One interpretation, related to the thesis of this series but not directly noted by the authors, is that much of the linked data presently available on the Web is either over-specified or mis-specified. (I would argue that is due in part to linked data instance records trying to do more than their natural assertional role.)
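Two of the design decisions listed above, splitting statements by vocabulary and discarding “ontology hijacking”, can be sketched in a few lines. This is a deliberate simplification and not SAOR's actual rules: the predicate list is partial, and the authority test (the source URI must be a prefix of the subject term) is my own stand-in assumption for their definition. All names are hypothetical.

```python
TERMINOLOGICAL_PREDICATES = {
    "rdfs:subClassOf", "rdfs:subPropertyOf", "rdfs:domain", "rdfs:range",
    "owl:equivalentClass", "owl:inverseOf",
}

def split_statements(quads):
    """Partition (subject, predicate, object, source) quads by predicate vocabulary."""
    tbox = [q for q in quads if q[1] in TERMINOLOGICAL_PREDICATES]
    abox = [q for q in quads if q[1] not in TERMINOLOGICAL_PREDICATES]
    return tbox, abox

def is_authoritative(quad):
    """Simplified test: the source document must share the subject term's namespace."""
    subject, _, _, source = quad
    return subject.startswith(source)

FOAF = "http://xmlns.com/foaf/0.1/"
quads = [
    (FOAF + "Person", "rdfs:subClassOf", FOAF + "Agent", FOAF),
    # A third party redefining a FOAF term: treated here as ontology hijacking.
    (FOAF + "Person", "rdfs:subClassOf", "http://example.org/Robot", "http://example.org/"),
    ("http://example.org/joe", "rdf:type", FOAF + "Person", "http://example.org/"),
]
tbox, abox = split_statements(quads)
static_tbox = [q for q in tbox if is_authoritative(q)]
```

Only the statement published under the FOAF namespace itself survives into the static TBox; the instance assertion about `joe` stays in the ABox untouched.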

Now, perhaps one could quibble with the rules and the decisions the authors employed (indeed, we do), but that is a topic for another day. What is interesting about the entire SAOR approach, I think, is its close attention to value and authoritativeness, all being split and recast into more tractable ABox and TBox portions, for handling reasoning at scale over large numbers of instances.

In my opinion, this is a seminal approach to the next generation of linked data that warrants much inspection and discussion.

Research on Other Work Tasks

A cursory discussion of the literature also shows some efforts that address the interstitial work areas noted in my conceptual architecture from Part 2.

Full-text Search Engines

Part 2 discussed full-text search engines in the broad semantic sense, and not specifically related to the ABox-TBox split. A couple of those and some others deserve a look because of their tighter integration of full-text search and attention to work splits.

A sister project to SWSE is Sindice, which also uses Solr and employs the ter Horst pD* semantic framework [17]. An inspection of Sweet Tools, the semantic Web and -related tools listing, also suggests Aperture (a broadscale, full-text harvester with semantic capabilities); LARQ (which adds free text search to ARQ); Virtuoso (full-text and faceted search on top of a universal datastore); Watson (full-text search of metadata fields); and Zebra (specializing in structured library data and related).

Identity Relations

A couple of different approaches are being taken to identity testing, similarity or relations. The more direct approach is to do identity matching with a canonical ID or similar.

The SWSE group has one approach to object consolidation [18], which uses a clever method based on the owl:InverseFunctionalProperty (IFP) for performing large-scale consolidation of object identifiers for equivalent instances across data sources. Yet, as the authors note,

“One major issue we discovered involved foaf:weblog, which is defined as being an owl:IFP in the FOAF ontology. Thus its semantics mandates that the property uniquely defines the foaf:Person instance for which it is described.”

This is both bad logic and wrong in many cases [see 19 for a critique]. The authors therefore needed to drop this assignment from their method. But, frankly, I think the broader problem again is too much predicate firepower for what should be a simple assertion that Joe Farmer has a blog (in fact, may have three!) and here are their URIs.
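The general shape of IFP-based consolidation, and why one badly chosen property matters, can be shown with a small union-find sketch. This is not the SWSE implementation; the function `consolidate`, the sample identifiers and the choice of union-find are assumptions made for illustration.

```python
from collections import defaultdict

def consolidate(triples, ifps):
    """Group subjects that share a value for any inverse-functional property."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}
    for s, p, o in triples:
        find(s)  # register every subject, even ones that never merge
        if p in ifps:
            key = (p, o)
            if key in first_seen:
                parent[find(s)] = find(first_seen[key])  # same value, so same entity
            else:
                first_seen[key] = s

    groups = defaultdict(set)
    for s in list(parent):
        groups[find(s)].add(s)
    return sorted(sorted(g) for g in groups.values())

triples = [
    ("a", "foaf:mbox", "mailto:joe@example.org"),
    ("b", "foaf:mbox", "mailto:joe@example.org"),
    ("c", "foaf:weblog", "http://blog.example.org/"),
    ("d", "foaf:weblog", "http://blog.example.org/"),
]
careful = consolidate(triples, {"foaf:mbox"})
trusting = consolidate(triples, {"foaf:mbox", "foaf:weblog"})
```

With `foaf:weblog` treated as inverse functional, `c` and `d` are merged merely because they share a blog, which is exactly the kind of over-eager identity the authors had to back out.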

A very large, multi-year project to assign unique identifiers to entities is the OKKAM project [20]. The intent is to provide a single and globally unique identifier for any entity through an ‘Entity Name System’, plus tools. Many methods will be employed to assign the identity relationship; specifics are still forthcoming with dozens of researchers working on the problem. I should note that the reference paper also touches upon some of the massive challenges associated with the current use of owl:sameAs.

Others have questioned a centralized ID service, instead preferring a mechanism that is more local and builds on co-reference research [21]. The ReSIST project has noted some of the issues of owl:sameAs use and management. It has proposed, instead, a ‘Consistent Reference Service’ (or CRS). Asserting a co-reference in this approach is like its use in linguistics: it means a URI that describes the same entity, as does ‘he’, ‘she’ or ‘it’ as a co-reference in a sentence. This predicate indicates that the two resources are describing the same thing without carrying all of the heavy entailment of the owl:sameAs predicate that semantically means the two resources are exactly the same. CRSs are proposed to be set up and managed locally and in a distributed fashion.

A very different approach to identity assignments is Rough DL, which is a qualitative, “fuzzy” ontology for relating entities or concepts to their similar resources [22]. The method has also been applied to the very difficult problem of bibliographic records [23], where similarity is harder to judge because of use of initials and abbreviations. Rough DL may be especially appealing because even with the best state-of-the-art, there are error rates in any of the identity relating or disambiguation methods available. And, rather than try to assess these similarities with a probabilistic score, the “fuzzy” approach may even be one that can be reasoned over.
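The rough-set idea underneath this approach can be sketched with lower and upper approximations. The example assumes records have already been grouped into indiscernibility classes (records that cannot be told apart on the available evidence); the function `approximations` and the sample author names are hypothetical and not drawn from the cited papers.

```python
def approximations(target, classes):
    """Return (lower, upper) approximations of `target` given indiscernibility classes."""
    lower, upper = set(), set()
    for cls in classes:
        if cls <= target:
            lower |= cls  # every member is certainly in the target set
        if cls & target:
            upper |= cls  # at least one member is, so all are possibly in it
    return lower, upper

# Records that cannot be distinguished from one another share a class.
classes = [
    {"J. Smith", "John Smith"},
    {"J. Smith Jr."},
    {"A. Jones", "Ann Jones"},
]
# Records believed to refer to the author we are looking for.
target = {"John Smith", "J. Smith Jr."}
lower, upper = approximations(target, classes)
```

The gap between the two sets ("J. Smith" and "John Smith" here) is the region of uncertainty, stated qualitatively rather than as a probability score.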

Disambiguation

To my knowledge, there is no disambiguation of entities presently taking place for distributed linked data sets. But, if not already, it soon will.

An example of how such a service might occur is the uBio Taxonomic Name Server from The Marine Biological Laboratory at Woods Hole. Via Web service or direct HTML form, an entity name (in this case a biological species) or its variants can be submitted for disambiguation and assignment to the proper identifier (name).

There is much research behind the algorithms and approaches to named entity disambiguation beyond the scope of this present series.

Some Concluding ‘Big Picture’ Implications

Our arguments to this point in this series neither suggest nor require that current practice change. Clearly, we are seeing growth, uptake and use with current practices regarding linked data.

One of the beautiful aspects of RDF as a data model and the semantic Web is that the underlying languages and standards are so flexible. Find a way to do stuff better in the future? Fine; go ahead and do it, because what has come before can be easily transitioned or accommodated.

The real thrust of this series has been “best practice.” There are certainly many viewpoints on that topic, and the understanding of it for a linked data environment at scale is also evolving. This is healthy, vibrant and exciting. Who knows what is truly best practice? I personally believe the market will determine that by what gets adopted and becomes self-sustaining by providing value.

However, as Structured Dynamics attempts to think through these issues — to look seriously at moving from simply proving the exposure of Web data to one of meaningfully doing work and relying on it at large scale — we see warts and challenges. Such is growth. It is natural. And change does not mean that what came before was wrong.

So, what do we see as some of these ‘big picture’ implications?

Linked data is surely proving the idea of a Web of Data and is bringing broader awareness about the usefulness of the semantic Web, its languages, and its standards
The semantic Web does not require “reasoning” at the point of linked data publishing, only that the linked data be published in a form that ultimately supports reasoning
Linked data is largely a basis for exposing and sharing instance data
As instance records, linked data is about assertions and attributes, not provability or decidability
Linked data can be guided by the constructs of description logics
Linked data instance records only require a subset of RDFS to be structured sufficiently for assertions and attributes
Work applied to linked data in a distributed dataset setting can be segregated and optimized
This specialized work (identity testing, disambiguation or full-text retrieval) does not belong at the ABox level, and it can also be conducted separately from standard TBox reasoning and inference
Linked data has appropriately embraced RDF, but has often overstepped its natural bounds by “cherry-picking” OWL predicates without regard to actual use in an open-world knowledge base
As linked data is incorporated into knowledge bases, fragments and dialects of both RDF and OWL can be applied to specific work tasks to improve scalability and computational tractability
owl:sameAs and other owl: statements within linked data instance records are being rejected by aggregation and consolidation services; let’s figure out better ways to assert identities, memberships and linkages without entailing what is not being (and cannot logically be) supported.

If we can do these things, we can simplify what it means to publish “linked data-ready” structured data. Being coherent about these matters is key.

[1] This is our working definition for description logics:

[2] Patrick Hayes, 2004. “RDF Semantics,” a W3C Recommendation, February 2004. See http://www.w3.org/TR/rdf-mt/.

[3] F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and P. F. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press, 2003. Sample chapters may be viewed from Enrico Franconi’s Description Logics course notes and tutorial at http://www.inf.unibz.it/~franconi/dl/course/, which is an excellent starting reference point on the subject.

[4] See Tim Berners-Lee, 1998. Relational Databases on the Semantic Web; see http://www.w3.org/DesignIssues/RDB-RDF.html; updated in 2002. (An interesting aside from that document was its observation that there may only be a limited number of RDF ontologies moving forward, with quite the opposite now apparent.) For a general discussion of relationships between RDF and standard relational data tables, I recommend: Andrew Newman, 2007. “A Relational View of the Semantic Web,” an online article at XML.com, March 14, 2007; see http://www.xml.com/pub/a/2007/03/14/a-relational-view-of-the-semantic-web.html.

[5] Yet structured data is typically seen as only a minor portion of all available information, with text (unstructured) data generally estimated as comprising 80% to 85% of all useful information. Most text information is managed by text engines [also known as search or information retrieval (IR) index engines], and not by RDBMSs.

[6] Sean Bechhofer, Ian Horrocks, and Daniele Turi. “The OWL Instance Store: System Description,” in Proceedings of the 20th International Conference on Automated Deduction (CADE-20), Lecture Notes in Artificial Intelligence, pages 177-181. Springer, 2005. See http://www.comlab.ox.ac.uk/people/ian.horrocks/Publications/download/2005/BeHT05.pdf.

[7] Boris Motik, Ian Horrocks, and Ulrike Sattler, 2007. “Bridging the Gap Between OWL and Relational Databases,” in Proceedings of the Sixteenth International World Wide Web Conference (WWW 2007). See http://www.comlab.ox.ac.uk/people/ian.horrocks/Publications/download/2007/MoHS07a.pdf.

[8] Boris Motik, Ian Horrocks and Ulrike Sattler, 2006. Integrating Description Logics and Relational Databases, Technical Report from the University of Manchester, UK, December 6, 2006, 44 pp. See http://web.comlab.ox.ac.uk/people/Boris.Motik/pubs/mhs06constraints-report.pdf. This is really a very in-depth, comprehensive treatment of the subject.

[9] We are still evaluating whether the simpler ABox vocabulary discussed in Part 4 can be met solely with RDF or requires RDFS. Our preference is RDF, but subClassOf relationships and reasoning over datatypes [11] seem to point to RDFS. What is presented in Part 4 is still quite preliminary and subject to change.

[10] Jos de Bruijn, Enrico Franconi and Sergio Tessaris, 2005. “Logical Reconstruction of Normative RDF,” presented at the International Workshop on OWL: Experiences and Directions (OWLED 2005), Galway, Ireland. See http://www.inf.unibz.it/~jdebruijn/publications/owl-05.pdf.

[11] Herman J. ter Horst, 2005. “Completeness, Decidability and Complexity of Entailment for RDF Schema and a Semantic Extension involving the OWL Vocabulary,” Journal of Web Semantics, Vol. 3, 2005, pp. 79-115. See http://www.websemanticsjournal.org/papers/20050719/document5.pdf.

[12] Boris Motik et al., eds., 2008. “OWL 2 Web Ontology Language: Profiles,” a W3C Working Draft, December 2, 2008. See http://www.w3.org/TR/owl2-profiles/, esp. http://www.w3.org/TR/owl2-profiles/#OWL_2_RL. Reasoners are also available for such specialized dialects; see, for example, http://www.ivan-herman.net/Misc/PythonStuff/RDFSClosure/Doc/RDFSClosure-module.html.

[13] Sergio Muñoz, Jorge Perez and Claudio Gutierrez, 2007. “Minimal Deductive Systems for RDF,” paper presented at the 4th European Semantic Web Conference, ESWC 2007, Innsbruck, Austria, June 2007. See http://www2.ing.puc.cl/~jperez/papers/minimal-rdf-camera-ready-ext.pdf.

[14] See http://protege.cim3.net/cgi-bin/wiki.pl?SWRLABoxBuiltIns for a description; the actual ontology may be found at http://swrl.stanford.edu/ontologies/built-ins/3.3/abox.owl.

[15] Aidan Hogan, Andreas Harth and Axel Polleres, 2008. “SAOR: Authoritative Reasoning for the Web,” in Proceedings of the 3rd Asian Semantic Web Conference (ASWC 2008), Bangkok, Thailand, Dec. 2008. See http://sw.deri.org/~aidanh/docs/aswc08.pdf.

[16] Aidan Hogan, Andreas Harth and Axel Polleres, 2008. “Scalable Authoritative OWL Reasoning on a Billion Triples,” in Proceedings of Billion Triple Semantic Web Challenge 2008, at the 7th International Semantic Web Conference (ISWC2008), Karlsruhe, Germany, 2008. See http://sw.deri.org/~aidanh/docs/saor_billiontc08.pdf.

[17] Giovanni Tummarello, Renaud Delbru and Eyal Oren, 2007. “Sindice.com: Weaving the Open Linked Data,” in Proceedings of the International Semantic Web Conference (ISWC), 2007. See presentation slides at http://www.eyaloren.org/slides/iswc2007.pdf.

[18] Aidan Hogan, Andreas Harth and Stefan Decker, 2007. “Performing Object Consolidation on the Semantic Web Data Graph,” in 1st I3 Workshop: Identity, Identifiers, Identification Workshop, 2007. http://sw.deri.org/2007/02/objcon/paper.pdf.

[19] See http://esw.w3.org/topic/InverseFunctionalProperty.

[20] Paolo Bouquet, Heiko Stoermer, Claudia Niederee, and Antonio Mana, 2008. “Entity Name System: The Backbone of an Open and Scalable Web of Data,” in Proceedings of the IEEE International Conference on Semantic Computing, ICSC 2008, August 2008. See http://www.okkam.org/publications/stoermer-EntityNameSystem.pdf.

[21] Afraz Jaffri, Hugh Glaser and Ian Millard, 2008. “URI Identity Management for Semantic Web Data Integration and Linkage,” in 3rd International Workshop On Scalable Semantic Web Knowledge Base Systems, November 25-30, 2007, Vilamoura, Algarve, Portugal. See http://eprints.ecs.soton.ac.uk/14361/1/URI_Identity_Management_for_Semantic_Web_Data_Integration_and_Linkage.pdf.

[22] Stefan Schlobach, Michel Klein and Linda Peelen, 2007. “Description Logics with Approximate Definitions: Precise Modeling of Vague Concepts,” in Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI 07, Hyderabad, India, January 2007; see http://www.ijcai.org/papers07/Papers/IJCAI07-088.pdf.

[23] Michel C.A. Klein, Peter Mika and Stefan Schlobach, 2007. “Rough Description Logics for Modeling Uncertainty in Instance Unification,” in Proceedings of the Third ISWC Workshop on Uncertainty Reasoning for the Semantic Web, Busan, Korea, November 12, 2007. See http://ftp.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-327/paper4.pdf.

Posted:February 15, 2009

Making Linked Data Reasonable using Description Logics, Part 2

Good Design is Based on the Work to be Done

In Part 1 of this series, I advocated the placement of linked data in an ABox construct from description logics [1] based on a separation of concerns argument. Actually, this broader position arose from close inspection of an earlier table on TBox and ABox purposes and roles that I had assembled from the literature.

That table synthesized TBox and ABox purposes and roles from the citations listed in Back to the Future with Description Logics, though it was based for the most part on The Description Logic Handbook [2]. As we continued to look further at the assignments in that table we saw a few issues:

First, some of the assignments seemed to be in error (as likely due to my own misunderstandings when making the assignments as to anything else)
Second, the use of my phrase “purposes and roles” in describing the assignments seemed too mushy: what work actually could occur within a ‘box’ seemed to be a more precise way of thinking and evaluating the assignments
Third, two columns were inadequate: some work requires both the ontology and instances in order to be performed (that is, the work occurs at the conjoined layer of the knowledge base), and
Last, some useful or necessary work may not even occur in whole or part within the confines of either the ABox or TBox.

Work, of course, is what our computers do for us on the data. Properly identifying and isolating these work tasks is a good starting point for teasing out architectural and schema design.

These perspectives allowed Fred Giasson and me to re-evaluate this earlier assignments table (see earlier version), as shown below. Note that some items have been moved to a different column (shown in blue, all of which were formerly in the ‘ABox’) and some have been added and may be external (shown in green):

Work Tasks for an ‘Open World’ Knowledge Base

TBox:

Definitions of the concepts and properties (relationships) of the controlled vocabulary
Declarations of concept axioms or roles
Inferencing of relationships, be they transitive, symmetric, functional or inverse to another property
Equivalence testing as to whether two classes or properties are equivalent to one another
Subsumption, which is checking whether one concept is more general than another
Satisfiability, which is the problem of checking whether a concept has been defined (is not an empty concept)
Classification, which places a new concept in the proper place in a taxonomic hierarchy of concepts
Logical implication, which is whether a generic relationship is a logical consequence of the declarations in the TBox
Infer property assertions implicit through the transitive property

TBox <–> ABox:

Entailments, which are whether other propositions are implied by the stated condition
Instance checking, which verifies whether a given individual is an instance of (belongs to) a specified concept
Knowledge base consistency, which is to verify whether all concepts admit at least one individual
Realization, which is to find the most specific concept for an individual object
Retrieval, which is to find the individuals that are instances of a given concept
[Identity relations, which is to determine the equivalence or relatedness of instances in different datasets]
[Disambiguation, which is resolving references to the proper instance]

ABox:

Membership assertions, either as concepts or as roles
Attributes assertions
Linkages assertions that capture the above but also assert the external sources for these assignments
Consistency checking of instances
Satisfiability checks, which are that the conditions of instance membership are met

Blue = moved
Green = added

As the table now shows, the TBox is where the reasoning work occurs, the ABox is where assertions and data integrity occur, and knowledge base work in the middle (among other aspects) requires both.

The State of Linked Data

As Part 1 discussed, linked data in its current form, which are mostly instance records, most closely resembles ABoxes. The linkage in current linked data occurs via asserted relations and identities with things in other datasets (or sources). This might be the references to the objects themselves or to the predicates (properties) that describe the nature of the asserted relationship. The identifiers for these references are URIs, which are often external to the originating dataset.

This construction has led to the now well-known ‘LOD cloud’ for publicly accessible open linked data:

Each of the arrows on the diagram reflects an external linkage based on these property or object identity assertions.

This same approach can be applied as well for specific domains. This example from the LODD (‘linking open drug data’) “cloud” shows [3] a similar “linked” structure; also note the central node of DBpedia, also shared with the previous diagram:

Part of my argument in Part 1 was for the linked data community and its advocates to view its role via an ABox perspective. I further argued that this could help simplify schema and vocabulary design for publishing linked data, helping to promote faster and broader uptake.

We can now take these arguments a bit further by adding the perspective of useful work that can be performed within this ABox framework, building upon the table above.

Looking at linked data from the perspective of work makes sense as we see the scale of linked data grow. Besides the sheer increase in numbers arising from that growth, newer publishers may have fewer skills and less conformance to best practice. Questions as to quality and erroneous mappings now appear more frequently on the support forums and mailing lists.

Linked data — or even more generically, distributed or federated data systems — have particular needs for search, for checking identity relations, and for disambiguation that arise from their distributed nature. These are challenges, and were not generally topics in the initial discourse around description logics. On the other hand, I think it fair to say that linked data has already shown its benefits in browsing and discovery, and for linked data destination sites with SPARQL endpoints, benefits in local search.

Where Linked Data Works

The greatest strength of linked data, imo, is that it is showing the way to how the Web can become a global, interoperating database, the Web of Data. Standards are being applied, practices are being worked out, and actual data is being exposed. Visibility and recognition are growing. This is the real story, and an important one, and frankly a massive development that is a mere two years old.

Fundamental to this growth has been the superb flexibility and simplicity of RDF as a data model. No matter what else, the benefits of RDF and linked data have a signal that now exceeds the noise. I believe it can be comfortably asserted that the next generation of information systems will be built on this and REST, the combination of which has been called Web-oriented architecture (WOA).

OK Browsing and Discovery

If you know the locations and if you know the format and (sometimes) if you know the syntax, you can see and discover much from linked data on the Web. If you can get beyond these start-up hurdles, you can now start swimming in a sea of linked references.

Granted, sometimes you don’t know what those cryptic link references mean; other times those links may take you to dead ends or silly stuff. I think it is fair to say that the community that is exposing this linked data stuff would acknowledge that their user interfaces and work flows are not yet intuitive or often easily grasped.

But, bear with this for a moment. Try a couple of these links:

(BTW, any URI will work so long as it has RDF in the correct, serviceable form; a challenge unto itself! [4])

Just like we first discovered the magic of moving from hyperlink to hyperlink in the initial Web, we are now beginning to see the process of moving from hyperdata to hyperdata with regard to structured and organized information.

Browsing and “stumbling upon” new cool stuff is the current main strength of linked data; again, assuming you are given some tips about how to start. Here are some linked data browsers that might help you on this discovery journey, often with useful accompanying help and link examples: OpenLink Data Explorer or Zitgist Data Viewer or Marbles or DISCO or Tabulator. Don’t be surprised if some of them break or don’t work! (Again, if you don’t know where to begin, try issuing one of the two link queries above after the ‘url’ or ‘uri’ part.)

Good Individual Dataset Querying

If you are a bit more experienced, some of the linked data source sites can be queried directly. The language used is SPARQL, which bears many resemblances to SQL for standard relational databases.

Generally (unless you are an expert), you will issue your queries over the local database at the target site. To see and test some examples, first bring up one of the many SPARQL clients; this one, for example, presents a rather complete overview and demo environment:

http://demo.openlinksw.com/sparql_demo/ — try it with both local and remote options.

A couple of other options with pre-loaded queries include:

General: http://librdf.org/query (scroll down and use the Run this query links)
Bio2RDF: http://bio2rdf.wiki.sourceforge.net/Demo+queries (and use the Try it out! links)

The SPARQL facility is really quite powerful and elegant; for the knowledgeable user or system, this query syntax can also be applied across linked datasets. (Indeed, SPARQL is a key component in the background to Structured Dynamics’ services.) SPARQL itself is a deserving topic in its own right.

OK Property and Object Search

There are a few RDF search engines that can do property, object or ontology searches, including Falcons (IWS China), Sindice (DERI Ireland), Watson (KMi), Semantic Web Search Engine (SWSE) (DERI Ireland) and Swoogle (the Ebiquity Group at UMBC). My personal favorite is Falcons.

We also are beginning to see structured data and attribute information creep into major search engines, often with little fanfare or notice; however, these are structured supplements to standard search results and not directly searchable alone. And, finally, there is a class of semantic search engines that do provide structured search, but not directly relevant to linked data (that is, they add structure and semantics to conventional search results). Prominent examples include Powerset (Microsoft), Hakia, True Knowledge and Cognition.

Where Linked Data is Not Working

In general, user interfaces have not been strong points for linked data. The emphasis, after all, has been more on the principles of linking and getting more data exposed.

More critical, however, from the standpoint of making the compelling case for an evolving Web of Data, is the weak tools support as the number and breadth of sources goes up. These limitations include full-text search, which requires performance equivalent to standard search engines; a lack of identity testing; and — increasingly — the need for disambiguation.

Only Partial Full-text Searching

Though there are linked data datastores (generally called “triple stores”) for RDF that do full text search, they do not perform as well as dedicated text search engines. Depending on scale or hardware choices and configuration, these issues may often be overcome and not directly apparent to end users.

A more interesting concern and weakness arises from the nature of linked data itself. For example, here is the RDF triple statement (in N-Triples form) for Mike Bergman knows Fred Giasson:

<http://mkbergman.com/me/> <http://xmlns.com/foaf/0.1/knows> <http://fgiasson.com/me/> .

Now, what happens if I enter the search phrase “Fred Giasson” into my search box? Will this record appear?

The answer, oftentimes, is no. And the reason it may not is that the “thing” of “Fred Giasson” is now represented by the resource URI of “http://fgiasson.com/me/” that does not contain the “Fred Giasson” search string.

In fact, one of the first observations that new users exposed to searching linked data make is, Where are all of the results? Because, you see, they expect to see the same complete listing of results that would be returned by a conventional search engine.

The reason this occurs is that one of the very strengths of linked data — to give unique resources a unique Web ID or URI — is also what abstracts the resource from its literal string descriptor.

Now, this issue can be overcome, though to my knowledge no one is doing it. First, for a given resource, you can get all the triples where the object is a text “string” and index them. Then, for the triples where the object is itself a resource, you can look up that resource and, if it has a property of the right type indicating a label, use the text of that label in the index.

This is generally not work done by triple stores. We are in essence trying to trace back from a URI identifier to its literal text descriptor (if it has one) and then substitute the text descriptor into the search index such that it can be found during a full-text search (whew!). That sounds hard and like a lot of work. Indeed it is.
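The two-pass idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not how any particular triple store or search engine works: the triples are held in a plain Python list, the two foaf:name statements are hypothetical additions to the single foaf:knows triple shown earlier, and the choice of rdfs:label and foaf:name as “label” properties is my own.

```python
from collections import defaultdict

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"

# Assumption: these are the properties treated as carrying a human-readable label.
LABEL_PROPERTIES = {RDFS_LABEL, FOAF_NAME}

# Each triple is (subject, predicate, object, object_is_literal).
# The two foaf:name triples are hypothetical; only the foaf:knows one is from the text.
TRIPLES = [
    ("http://mkbergman.com/me/", FOAF_NAME, "Mike Bergman", True),
    ("http://fgiasson.com/me/", FOAF_NAME, "Fred Giasson", True),
    ("http://mkbergman.com/me/", FOAF_KNOWS, "http://fgiasson.com/me/", False),
]


def build_text_index(triples):
    """Return a token -> set of subject URIs index for full-text lookup."""
    # Pass 1: trace each resource URI back to its literal label(s), if any.
    labels = defaultdict(set)
    for subject, predicate, obj, is_literal in triples:
        if is_literal and predicate in LABEL_PROPERTIES:
            labels[subject].add(obj)

    # Pass 2: index literal objects directly; for resource objects,
    # substitute the label text of the resource being pointed to.
    index = defaultdict(set)
    for subject, predicate, obj, is_literal in triples:
        texts = [obj] if is_literal else labels.get(obj, ())
        for text in texts:
            for token in text.lower().split():
                index[token].add(subject)
    return index


def search(index, phrase):
    """Return the subjects whose indexed text contains every token of the phrase."""
    tokens = phrase.lower().split()
    if not tokens:
        return set()
    return set.intersection(*(index.get(token, set()) for token in tokens))


INDEX = build_text_index(TRIPLES)
```

With this index, a search for “Fred Giasson” returns both the record for Fred and the record for Mike (who knows Fred), even though the foaf:knows triple itself contains only URIs. A production system would hand the substituted text to a dedicated text engine rather than a dictionary.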

However, as time goes on, this is likely functionality that users will demand. And, because the nature of the work is really text search and not triple stores or inferencing based from a graph, it may require special processing at ingest and the use of dedicated text engines.

No Identity Testing

Remember what we said earlier about linked data and its relation to instances and the ABox? It can assert it is related to or has properties or attributes of one kind or another, but from the information contained in the instance record itself we cannot determine if those assertions are true. We can believe or trust some sources more than others as being more authoritative, but even for authoritative sources we may want to test the assertion.

Perhaps this is less of a problem when the linked datasets come from a relatively small community, as has been the case. But that is rapidly changing, and how do we begin to test identity relations?

Again, to my knowledge, identity testing is not yet being applied to linked instance data; we are relying on assertions and trust, each confounded by differences of opinion as to what asserting “identity” even means. Identity testing is a good example of a possible service or component that could either reside external to the TBox, or could use the concept graph at the TBox level to aid its logic.

Identity relations would check the assertions of relatedness made at the ABox level. At its simplest level, this may only be a lookup service for synonyms or aliases (or what we more broadly call ‘semsets’ in UMBEL). In more complex or comprehensive forms it could, for example, do a check for similar instances across the instance base (all ABoxes) using techniques similar to disambiguation. In this manner, for example, owl:sameAs could now be better determined, and under well defined conditions that users could understand and then accept or reject. Disjointness and similarity are two additional identity checks possible.

Identity checking could thus better work across disparate datasets and could have a relatively simple and optimized index structure. This would be useful for real-time access; newly discovered identities could be separately slip-streamed and later evaluated more rigorously.
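At the simple end of that range, a synonym or alias lookup could look something like the sketch below. Everything named here is hypothetical: the example.org URIs, the alias lists standing in for semsets, and the function names. It only shows the shape of a simple, optimized index that proposes candidate identities and lets an asserted ‘same as’ be checked against shared names.

```python
def normalize(name):
    """Lower-case and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def build_alias_index(semsets):
    """Map each normalized alias to the set of instance URIs that carry it."""
    index = {}
    for uri, names in semsets.items():
        for name in names:
            index.setdefault(normalize(name), set()).add(uri)
    return index


def candidate_identities(index, name):
    """Return every instance URI, in any dataset, known under this name."""
    return set(index.get(normalize(name), set()))


def shared_names(semsets, uri_a, uri_b):
    """Names two instances have in common; an empty set gives no support
    for an asserted identity between them."""
    names_a = {normalize(n) for n in semsets.get(uri_a, ())}
    names_b = {normalize(n) for n in semsets.get(uri_b, ())}
    return names_a & names_b


# Hypothetical alias lists for instances drawn from two different datasets.
SEMSETS = {
    "http://example.org/dataset-a/Iowa_City": ["Iowa City", "City of Iowa City"],
    "http://example.org/dataset-b/iowa-city-ia": ["Iowa City", "Iowa City, Iowa"],
    "http://example.org/dataset-a/Coralville": ["Coralville"],
}
ALIAS_INDEX = build_alias_index(SEMSETS)
```

A shared name is of course only weak evidence; the more rigorous checks described above would sit behind the same interface and could be applied later to identities first flagged by this quick lookup.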

No Disambiguation

Disambiguation is another component that could similarly reside internal or external to the standard system. Like identity relations, it might rely on concept graph information but its logic need not be explicitly modeled at the TBox level.

Disambiguation basically tries to test whether Joe Farmer the farmer is the same or different than Joe Farmer the truck driver. TBox information can certainly aid this — such as whether agricultural concepts or attributes are in association with the entity at hand — and the identity testing and synonyms or alternate name forms can also play into this determination.
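A very rough sketch of that kind of test is below, assuming each candidate instance has already been associated with a set of concepts and that the mention in question comes with its own set of context concepts. The URIs, the concept names and the simple overlap count are all illustrative assumptions, not a description of any existing service.

```python
def disambiguate(context_concepts, candidates):
    """Pick the candidate whose associated concepts best overlap the context.

    Returns the winning URI, or None when there are no candidates, when
    nothing overlaps, or when the top two candidates tie (a real service
    would then need additional clues).
    """
    context = set(context_concepts)
    scored = sorted(
        ((len(context & set(concepts)), uri) for uri, concepts in candidates.items()),
        key=lambda pair: (-pair[0], pair[1]),
    )
    if not scored or scored[0][0] == 0:
        return None
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None
    return scored[0][1]


# Hypothetical candidates for the name "Joe Farmer".
CANDIDATES = {
    "http://example.org/people/joe-farmer-farmer": {
        "agriculture", "fertilizer", "farm implement",
    },
    "http://example.org/people/joe-farmer-trucker": {
        "trucking", "freight", "highway",
    },
}
```

For a mention found alongside the concepts fertilizer and farm implement, the farmer is chosen; for a mention with no overlapping concepts at all, the function declines to decide rather than guess.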

Again, this is work that has largely not been critical for the early phases of linked data, but is becoming essential as the application of linked data is being contemplated for doing actual, meaningful stuff.

No Reasoning or Inference

Of course, linked data and the ABox by definition don’t provide this kind of work; reasoning and inference are appropriately the work of the TBox.

Some Implications for Architectural Design

We can now start diagramming these pieces into a conceptual architecture as follows:

This figure maps the work activities noted in the table at the top of this article and described throughout. Note there are a number of possible and specialized work activities at the interstices between the TBox and ABox, some of which are somewhat new to the description logics discourse as noted in the prior section.

I’m certainly no AI or KR expert, but it appears to me that there are pragmatic work tasks that emerge that do not easily fit into purely conceptual discussions of these things. The interstitial items possibly fall into this category, and those in the middle “bubbles” could even be done via separate processing or indexes not invoking the TBox level at all.

I suspect, though, that would be a mistake. The organizational view of the world at the TBox level provides useful reasoning and inferential bases for aiding other work tasks. For example, for disambiguation work, knowing that the instance of Joe Farmer was found in context with concepts relating to fertilizer or farm implements would help distinguish him from Joe Farmer the lawyer (though we would need additional clues or reasoning to dismiss that it was Joe Farmer the trucker shipping these things).

If you recall from Part 1, considerable discussion was devoted to the suggestion that linked data belonged in the ABox and that even if linkages are being made to other entities or instances, those assignments remain only assertions. We discussed identity evaluation and disambiguation in the previous section. These really come down to the questions of whether we are talking about the same or different things across multiple data sources.

Maintaining identity relations and disambiguation as separate components also has the advantage of enabling different methodologies or algorithms to be determined or swapped out as better methods become available. A low-fidelity service, for example, could be applied for quick or free uses, with more rigorous methods reserved for paid or batch mode analysis. Similarly, maintaining full-text search as a separate component means the work can be done by optimized search engines with built-in faceting (such as the excellent open-source Solr application). I will likely have more to say on this aspect after this series concludes.

A Couple Thoughts to Conclude this Part

People can do as they like and will within the semantic scope of the RDF and OWL languages. But, besides my recommendation to view linked data as ABoxes through the lens of description logics, I’d like to suggest a couple of additional ‘best practices.’

I think linked data does itself a disservice to throw the owl:sameAs predicate around as much as it does. Ooooh! OWL; this is complicated stuff, no? And, actually, owl:sameAs is perhaps the most powerful entailment predicate around [5,6], which is also reflexive, symmetric and transitive. How can we really use this property when all we are really talking about is an assertion at the ABox instance level?

I think use of predicates like this confuses us and confuses the public. We give stuff a gloss and patina that is not supportable by any reasoning.
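To see why a casually asserted owl:sameAs carries so much weight, consider what its symmetric and transitive nature implies: every asserted pair merges two whole groups of identifiers. The sketch below is only an illustration (the ex: identifiers are hypothetical and a union-find structure is simply my choice for computing the groups), but it shows how one extra assertion collapses two previously separate clusters into one.

```python
def same_as_clusters(pairs):
    """Group identifiers into the equivalence classes implied by sameAs pairs."""
    parent = {}

    def find(node):
        # Follow parent links to the cluster representative, halving the path as we go.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_left] = root_right  # merge the two clusters

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical assertions gathered from several datasets.
ASSERTED = [("ex:A", "ex:B"), ("ex:B", "ex:C"), ("ex:D", "ex:E")]

BEFORE = same_as_clusters(ASSERTED)
# One more (possibly mistaken) assertion links the two groups.
AFTER = same_as_clusters(ASSERTED + [("ex:C", "ex:D")])
```

Before the extra pair there are two clusters, A-B-C and D-E; afterwards all five identifiers are entailed to be the very same thing, which is a strong conclusion to rest on a single unverified assertion.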

The OWL2 folks, I think, understood this issue in moving toward specific “assertion” language in the new draft version [7]. OWL2 now proposes these assertion predicates of owl:SameIndividual, owl:DifferentIndividuals, owl:ClassAssertion, owl:ObjectPropertyAssertion, owl:NegativeObjectPropertyAssertion, owl:DataPropertyAssertion, and owl:NegativeDataPropertyAssertion. Note that, axiomatically, the owl:SameIndividual is now the functional equivalent of the older owl:sameAs [8].

I think it is a step in the right direction, though frankly I would have preferred a semantics that used the “assert” naming for instances (individuals) as well.

But, actually, I have a more fundamental question, related to the confusion argument above: Why should assertions be an OWL property? Would it not make better and cleaner sense to establish assertion predicates within RDF or RDFS and to minimize stronger implications of possible decidability or entailment? Would this approach not lead to a cleaner split between instance records (ABoxes) and linked data that could be kept solely within the RDF vocabulary without invoking OWL at all?

In this manner, we could leave the various emerging profiles for OWL2 [9] reserved for TBox-level reasoning. It would make for easier and understandable reasoners, while making it easier for publishers to expose linked instance data.

Well, OK; so, I have now gotten most of the arguments off my chest about ABoxes and linked data, the separation of concerns, and maintaining distinctions in language.

I will discuss in the next Part 3 what other researchers have to say on this use of the TBox and ABox and the split between them. Then, in the concluding Part 4, I discuss how this perspective might lead to a pretty simple RDF vocabulary and structure for linked data and instance records.

[1] This is our working definition for description logics:

[2] F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and P. F. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press, 2003. Sample chapters may be viewed from Enrico Franconi’s Description Logics course notes and tutorial at http://www.inf.unibz.it/~franconi/dl/course/, which is an excellent starting reference point on the subject.

[3] This figure is from a presentation by Christian Bizer, “Linking Open Drug Data (HCLSIG LODD),” presented at the ISWC 2008 Tutorial on Semantic Web for Health Care and Life Sciences, Karlsruhe, Germany, October 27, 2008. See http://carbon.videolectures.net/2008/active/iswc08_karlsruhe/prudhommeaux_swhcls/iswc08_prudhommeaux_swhcls_05.ppt. Also, for a broader evaluation of LODD datasets, see http://esw.w3.org/topic/HCLSIG/LODD/Data/DataSetEvaluation.

[4] See, particularly, Chris Bizer, Richard Cyganiak and Tom Heath, “How to Publish Linked Data on the Web,” last updated July 2008, at http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/.

[5] owl:sameAs is even more powerful than owl:equivalentClass. Pat Hayes can often be counted on for clear explanations about semantic Web semantics, as this quote (hat tip to Tim Finin for pulling this out of the archives) describes:

“Allow me to suggest the appropriate intuition. In RDFS and OWL-Full there is a distinction between a class and the (set which is the) extension of the class. So two different classes might have the same sets of instances and yet still be distinct classes. (This is often described by saying that RDFS and OWLFull classes are ‘intensional’ as opposed to ‘extensional’. For example, the classes of Human and HairlessBiped have the same members, but one might want to distinguish them since they have different defining conditions on membership. OWL-DL refuses to countenance such a possibility, although this may be rectified in OWL 1.1.) Thus there is an intuitive distinction in meaning between equivalentClass (having the same instances) and sameAs (being exactly the same thing). When, as in RDFS and OWL-Full, classes can have properties, one wants to preserve this distinction by saying that if A sameAs B then all the properties of A are also properties of B (since A and B are the very same thing); but this does not follow for A equivalentClass B, since A and B might still be distinct even if they do have the same members.”

[6] Though long at 59 pp and geared to another purpose, I found the first sections of this paper to also be very helpful in understanding entailments and decidability for the various semantic Web languages of RDF, RDFS and the dialects of OWL (Lite, DL and Full): Herman J. ter Horst, 2005. “Completeness, Decidability and Complexity of Entailment for RDF Schema and a Semantic Extension involving the OWL Vocabulary,” paper preprint submitted to Elsevier Science, May 31, 2005, 59 pp. See http://www.websemanticsjournal.org/papers/20050719/document5.pdf.

[7] Boris Motik, Peter F. Patel-Schneider, Bijan Parsia, eds., 2008. “OWL 2 Web Ontology Language: Structural Specification and Functional-Style Syntax,” a W3C Working Draft, 02 December 2008. See for latest version http://www.w3.org/TR/owl2-syntax/. Note the XML serialization used is documented at http://www.w3.org/TR/owl2-xml-serialization/. I think it fair to say that many of these OWL2 changes were the result of mistakes and learning from the current OWL.

[8] Bernardo Cuenca Grau and Boris Motik, 2007. “OWL 1.1 Web Ontology Language Mapping to RDF Graphs,” Editor’s Draft May 23, 2007. See http://www.webont.com/owl/1.1/rdf_mapping.html.

[9] Boris Motik et al., eds., 2008. “OWL 2 Web Ontology Language: Profiles,” a W3C Working Draft, December 2, 2008. See http://www.w3.org/TR/owl2-profiles/.

Posted:February 11, 2009

Making Linked Data Reasonable using Description Logics, Part 1

Can a ‘Separation of Concerns’ Lead to Better Slicing of the Pie?

It is clear that Fred Giasson and I have been spending considerable time on description logics of late. While we could perhaps claim that we like hard-to-read and -understand stuff, our reasons for this interest have been quite pragmatic: How to apply linked data principles to real-world commercial and organizational environments? Indeed, what should those principles even be?

Our first intuition, reaching back nearly two years now, was that linked data needed a context for bringing related datasets together. This belief led us to construct the UMBEL subject concept ontology, a basic reference roadmap for helping to point to information related (or “about”) similar subjects. UMBEL as a set of subject concepts has proved useful as a reference roadmap; and the approach to construct UMBEL and its resulting vocabulary (heavily based in SKOS) has also proved helpful to construct specific domain-level ontologies.

By their nature, these ontologies have been conceptual and structural. They define relationships, but are instance-poor. They focus on ways to describe various lenses — world views — into the domains for which we have been engaged.

But superstructures are meant to be built upon and fleshed out. For that, real instance data is required.

The Formative ‘Named Entities’

Thus, we have more recently shifted from concepts and structure to focus on how to represent the actual things that populate that structure; that is, a domain’s actual objects or instances. We appreciate that different audiences and proponents will use terminology such as instance, or object, or entity, or individual or even foo to describe such things, but for us (Peirceian logic aside) we simply wanted a way to describe the referents to a specific real-world thing [1].

We initially chose the term ‘named entities’ to describe these actual objects. This naming arose from the work of Sekine and his 200 named entity types [2]. Typical named entities are specific (individual) people, organizations, events, artifacts (‘Mona Lisa’), places, products, or whatever. For example, here is our first published definition describing ‘named entities’:

Named entities are the real things or instances in the world that are themselves natural and notable class members of subject concepts. Named entities are the instances of the subject concepts in the standard definition of the term. Each named entity is mapped to a governing subject concept for ontology purposes.

Actually, ‘named entities’, in even that sense, do not all have proper names with capitalization. Some accepted ‘named entities’ are also written in lower case, with examples such as rocks (‘gneiss’) or common animals or plants (‘daisy’) or chemicals (‘ozone’) or minerals (‘mica’) or drugs (‘aspirin’) or foods (‘sushi’) or whatever.

City of Iowa City
Clinton St., Iowa City
Location in the state of Iowa
Coordinates: 41°39′21″N 91°31′30″W 41.65583°N 91.525°W
Country	United States
State	Iowa
County	Johnson
Metro	Iowa City Metropolitan Area
Government
– Type	Council-manager government
– Mayor	Regenia Bailey
– City Manager	Michael Lombardo
Area
– City	24.4 sq mi (63.3 km²)
– Land	24.2 sq mi (62.6 km²)
– Water	0.3 sq mi (0.7 km²)
Elevation	668 ft (203.6 m)
Population (2007 est.)
– City	67,062
– Density	2,748.4/sq mi (1,059.4/km²)
– Metro	147,038
Time zone	CST (UTC-6)
– Summer (DST)	CDT (UTC-5)
ZIP codes	52240-52246
Area code(s)	319
FIPS code	19-38595
GNIS feature ID	0457827
Website	http://www.icgov.org/

But, hmmm. While ‘daisy’ might be an instance of the common flowers, is that the same as a specific daisy flower? especially when I can see literally thousands of daisy flowers at present in my back yard?

This epistemological question of thing v instance v individual can really mess you up! Furthermore, from the standpoint of describing these things on the Web, are we talking about the real thing, a symbol of some sort (Peirce again!) for that thing, or a multitude of similar descriptive terms (flower, bloom, daisy, florescence, bellis, chrysanthemum) for that thing?

Whether a thing is an instance or an individual or even a class depends on context. A plant taxonomy could represent its terminal nodes as specific species or subspecies of daisy. But, in a flower show, the specific thing being referred to could actually be a unique individual, the Pretty Miss Daisy blue-ribbon winner.

Description logics with its TBox and ABox splits [3] actually helps considerably to unravel these potentially confounding distinctions. The ABox covers the description of instances with their asserted attributes or characteristics. Thus, we can have an ABox description of the daisy instance that refers to daisies in general or daisies as a species, or we can have an ABox description of an individual daisy with a specific proper name in a flower show.

This instance idea is really a very clean one. As long as we focus on the idea of an instance and its attributes, we can put off for the moment (or defer to another layer, that is the reasoning TBox) what kind of instance this is.

After another segue, we’ll return to this instance concept in a moment.

Happy Birthday!: DBpedia and the Beginning of Linked Data

DBpedia, the structured and linked data extraction of “facts” from Wikipedia, was first released about two years ago. Happy 2nd Birthday! I first wrote about DBpedia shortly thereafter claiming, I think somewhat accurately, the birth of the structured Web. We now know that phenomenon and the many additional datasets that nucleate around DBpedia as linked data.

When first explained, DBpedia used examples of so-called Wikipedia infoboxes for the cities of Leipzig or Innsbruck to describe the source of its structured data. (Subsequently Berlin has also been commonly referenced, all understandable given DBpedia’s two principal founders of Sören Auer at the Universität Leipzig and Chris Bizer at Freie Universität Berlin; of course, many others have joined and meaningfully supported the project since.) Infoboxes are Wikipedia templates that provide standardized, structured information across related articles of a similar type.

I have copied a similar infobox, in this case for the same city type from Wikipedia for my home town of Iowa City, to show one of these structured data templates (shown to right). In using it I am, of course, being a bit parochial, but it is also interesting to see the growth of structured data (attributes) that such templates now contain compared to what was available at the time of DBpedia’s first release.

This infobox is a perfect example of an ABox. The instance it describes is the ‘City of Iowa City’. Each of the items that follow shows an attribute or data characteristic of some form with its associated value as a key-value pair. Sometimes those values refer to other instances, some of which are individuals, such as the county or name of the mayor.
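To make the key-value nature of such a record concrete, here is a minimal sketch of part of the Iowa City infobox held as an instance record and then flattened into subject-attribute-value assertions. The values are copied from the infobox; the record layout, the attribute key names and the example.org URIs are hypothetical choices of mine, not any published vocabulary.

```python
# A hypothetical instance record: one identifier plus asserted key-value pairs.
IOWA_CITY = {
    "id": "http://example.org/instance/Iowa_City",  # hypothetical URI
    "attributes": {
        "name": "City of Iowa City",
        "country": "United States",
        "state": "Iowa",
        "county": "Johnson",
        "mayor": "Regenia Bailey",
        "elevation_ft": 668,
        "population_city": 67062,
        "time_zone": "CST (UTC-6)",
        "area_code": "319",
        "website": "http://www.icgov.org/",
    },
}


def record_to_assertions(record, vocabulary="http://example.org/attr/"):
    """Flatten an instance record into (subject, attribute, value) assertions.

    The vocabulary prefix is a placeholder; each resulting tuple is only an
    assertion made by the record, not a verified fact.
    """
    subject = record["id"]
    return [
        (subject, vocabulary + key, value)
        for key, value in record["attributes"].items()
    ]


ASSERTIONS = record_to_assertions(IOWA_CITY)
```

Note that values such as the county or the mayor are held here as plain strings; in a fuller record they could instead point to the identifiers of other instances.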

In ABox terminology, these values are asserted for each attribute. Because this is Wikipedia, which has a reputation for accuracy and authority, we tend to believe and accept these assertions. But, we also know that sometimes these values are not correct, even for Wikipedia. We also know that instance records can come from many, many different sources, perhaps most not with the accuracy or authority of Wikipedia.

It is these types of instance records (for many other types of things than city, of course!) that are now being published as linked data. Today more than 50 general public datasets and perhaps another 50 from the sciences (especially biology) have been published. The total number of assertions across all datasets now runs into the millions, and the RDF statements that capture all of the relationships between these instances, attributes and concepts that describe them exceed one billion, as the recent Billion Triples challenge attests.

What is nice about these ABox structures is that they are relatively simple (instances characterized by attributes), with the “facts” so expressed understood to be assertions and not necessarily verified truth or accuracy. No matter what the source, there is no guarantee that all assertions will be complete and accurate. (Though, as has proved to be the case for DBpedia because of its Wikipedia heritage, some of the sources can be comfortably asserted to be authoritative.)

Assertions about many of the attributes are relatively straightforward such as, in the Iowa City instance, zip codes or time zones or population. (Still, the estimates used could also be out of date or the estimation methods could be argued.) However, other assertions, more based on interpretation or personal opinion, such as subject matter or political or religious affiliation or bias, can be quite controversial.

Another potential source of error is the linked data assertion that one instance is the owl:sameAs a different instance in a different dataset. Erroneous ‘same as’ assertions can arise quite simply and not require malice or stupidity. For example, for me, I actually live in Coralville, Iowa, not Iowa City. But, Coralville completely abuts Iowa City, shares a school district, and my wife works in Iowa City. I more often than not claim Iowa City as my location, though my actual mailing address is Coralville. How does one reasonably say that the identity of Michael Bergman of Coralville is the same as the Michael Bergman of Iowa City?

Cutting the ABox Slice Out of the Pie

So, what can the perspective of the ABox and description logics tell us about these issues?

First, instance records on their own can only contain assertions; instance records can not alone be a basis to decide the reasonableness of those assertions
Second, if we stay focused on the idea of an instance record, we can wiggle off the hook about whether we are talking about classes or groups or individuals. The instance is merely the thing at hand, with appropriate attributes or not based on the nature of the instance
Third, the role — or “burden” if you will — of the instance record is merely to convey attribute assertions about a single instance. The ABox can be streamlined with comparatively little structure and comparatively little semantics
Fourth, some attribute assertions are more straightforward and more easily tested, other attribute assertions are more problematic. That consideration should not limit the scope of any assertions that can be made in an instance record, just that certain attribute types may be harder to test or accept
And, fifth, and most importantly, these considerations strongly suggest a clean break between data characterizations and structures to describe instances (the ABox) from how instances relate to one another or whether the attributes asserted for a given instance are reasonable or not (both being the work of the TBox).

This is the rationale from an earlier posting from me called Back to the Future with Description Logics that clearly separates the TBox and ABox functions:

Now, it is true that the ABox and TBox distinctions are conceptual, and in practice not often actual, with no mandate or requirement based in description logics that they remain separate. However, for reasons of tractability and communication and computational performance at scale, there may be justification for keeping these constructs separate [4].

In the diagram, note that each ABox instance has the simple appearance of an instance record. Also note that the attributes that describe or characterize those instances should also be included and described with relationships modeled at the TBox level. The TBox is the proper place to describe all of the attribute relationships.

So, for Structured Dynamics, we have made a clean split in these roles and data structures in those client architectures over which we have design control. Ontologies populate the TBox level. Instance records assembled into instance dictionaries populate the ABox level, with various instance types governed by their own lightweight schema and vocabularies. This simple functional split leads to cleaner architectures and easier decisions about what belongs in which box for a given circumstance. It is also more performant, but more on that in a later part.
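To make the split concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the actual Structured Dynamics design: the names `abox`, `tbox` and `unreasonable_assertions` are hypothetical, and the "world view" is reduced to a toy table of which attributes are sanctioned for which class. The point is simply that the instance records carry nothing but assertions, while the test for reasonableness lives elsewhere.

```python
# Hypothetical sketch: all names here are illustrative assumptions.

# ABox: an instance dictionary of simple key-value instance records.
# The records only assert attributes; they carry no logic of their own.
abox = {
    "ford_motor_company": {"type": "Organization", "founded": 1903,
                           "founder": "Henry Ford"},
    "gandalf": {"type": "Person", "founded": 1903},  # a dubious assertion
}

# TBox: a toy world view stating which attributes each class may carry.
tbox = {
    "Organization": {"founded", "founder"},
    "Person": {"birthDate"},
}


def unreasonable_assertions(record, tbox):
    """Return the attributes the TBox does not sanction for the record's type."""
    allowed = tbox.get(record["type"], set())
    return sorted(key for key in record
                  if key != "type" and key not in allowed)


for name, record in abox.items():
    print(name, unreasonable_assertions(record, tbox))
```

Swapping in a different `tbox` changes which assertions are flagged without touching a single instance record, which is the substitutability argued for here.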

Summary of Benefits for Keeping the Pie Slices Separate

Of course, this is not how linked data and the semantic Web are currently architected or conceptualized. This smearing of roles and work responsibilities leads, we think, to many communication issues and slower uptake. As our own thinking gets clearer on these issues, we see there are some key benefits arising from keeping distinct the TBox (ontologies) and ABox (instances) pieces of the semWeb pie.

Benefit 1: Keeps World View Separate from Facts and Assertions

Wikipedia is a good case in point where conjoining facts with a world view does not work well. One part that does work well is the set of “facts” in the specific Wikipedia pages that describe things. They are the ABox structure of the Wikipedia knowledge base. Another useful aspect of Wikipedia, kind of at the interstices of the ABox and TBox, is its set of see also and disambiguation pages. These, too, have proved to be very useful for gathering synonyms for a specific instance or for disambiguating two similarly named instances.

But at the conceptual level of how the world is organized — what are the relations between instances and how those instances are categorized — Wikipedia has arguably been unsatisfactory. Why that might be is a discussion for another time.

One perhaps could make an inverse observation about the Cyc knowledge base where a quite coherent world view of concepts exists (and, actually, many world views through Cyc’s very useful microtheories construct), but is often hard to discern and discover because of the admixing of instances, the coverage of which is also quite lumpy. Some domains have many instances, others are quite sparse.

Trying assiduously to keep bodies of facts and assertions (ABox) separate from how to interpret that world (TBox) brings distinct benefits. The facts base (ABox) is more easily tested for consistency. Different world views (TBox) can be more easily applied and compared against these fact bases. Testing and accepting different aspects of different sources is made easier if the ABox and TBox are not conjoined.

Benefit 2: Keeps Terminology Simpler

When the different purposes and roles and resulting work that might be applied to ABox and TBox are conjoined, our ability to describe things gets murky. We sometimes call mere controlled vocabularies “ontologies”, for example, which only acts to dilute the concept. We have facts and assertions and relations and hierarchies and stuff ranging from the minutiae to the abstract and sublime being lumped and described with the same terminology. If we cannot clarify and describe to ourselves the roles and responsibilities for this stuff, it is no wonder we can’t communicate well with the broader public.

I believe if the semantic Web community could stand back and try again to apply the rigor of description logics to its enterprise, now that we are gaining some real exposure and success with linked data, we could begin to clean up this emerging mess we are creating for ourselves.

Here are some starting suggestions. Let’s call the combination of ABox and TBox a knowledge base, not an ontology. Let’s reserve the term ontology for the terminological relationships and concepts at the TBox level. And let’s focus on ABox instances as requiring only simple vocabularies to describe the assertions of attributes (what we might call schema consistent with RDFS and relational database schema). We thus could see a set of pieces similar to:

Knowledge Base = Ontology + [Disambiguation] + [Identity Relatedness] + Instance Schema

Note I suggest a couple of interesting work items at the interface between the TBox and ABox: disambiguating instances and determining the identity relatedness (for example, ‘same as’) between instances. This is work that should be kept apart from the ABox, but may or may not be best handled in the TBox (and, in any case, is generally separate work from the conceptual structure of the TBox).
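One way to picture this composition is as a container with separate slots for each piece. The sketch below is an assumption-laden illustration, not a proposed standard: the `KnowledgeBase` class, its field names and the `equivalents` method are all hypothetical. It shows identity relatedness (‘same as’ pairs) being held apart from the instance records themselves, so the ABox stays a plain set of attribute assertions.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    """Hypothetical container mirroring the pieces listed above."""
    ontology: dict    # TBox: e.g. class -> parent class
    instances: dict   # ABox: instance id -> attribute assertions
    # Identity relatedness is stored outside the instance records.
    same_as: list = field(default_factory=list)

    def equivalents(self, instance_id):
        """Return every id linked to instance_id through same_as pairs,
        following the pairs transitively and in both directions."""
        seen = {instance_id}
        frontier = [instance_id]
        while frontier:
            current = frontier.pop()
            for a, b in self.same_as:
                for x, y in ((a, b), (b, a)):
                    if x == current and y not in seen:
                        seen.add(y)
                        frontier.append(y)
        return seen


kb = KnowledgeBase(
    ontology={"Organization": "Thing"},
    instances={"ford": {"type": "Organization"},
               "ford_motor_company": {"type": "Organization"}},
    same_as=[("ford", "ford_motor_company"),
             ("ford_motor_company", "ext:Ford")],
)
print(sorted(kb.equivalents("ford")))
```

Because the `same_as` pairs are data in their own slot, a different identity-matching method could replace them without rewriting either the ontology or the instance records.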

This separation of concerns, or something akin to it, would result in a much cleaner — and, therefore, simpler — terminology for communicating with the interested public.

Benefit 3: Enables Simpler Instance Schema

Prima facie, an instance schema that merely needs to capture attribute assertions for an instance will be much simpler than current practice. In turn, that should lead to more patterned schema with easier and quicker extension to new domains and vocabularies. And, that, in turn, will aid ABox consistency checking.

Benefit 4: Easier Conversion of non-RDF Structured Data

Without the need for ontology mapping at time of conversion, existing RDFizers could be more readily applied to convert other structured data forms to simple RDF schema.
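As a rough illustration of how little is needed once ontology mapping is set aside, the sketch below turns rows of CSV into N-Triples using only the Python standard library. It is a toy under stated assumptions, not any existing RDFizer: the `rdfize` and `escape` functions and the `http://example.org/` namespace are hypothetical, identifiers and column names are assumed to be safe to embed in an IRI, and every value is emitted as a plain string literal.

```python
import csv
import io

BASE = "http://example.org/"  # assumed namespace, for illustration only


def escape(value):
    """Escape the characters that would break an N-Triples string literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def rdfize(csv_text, id_column):
    """Emit one triple per non-empty cell; each column becomes a predicate."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}id/{row[id_column]}>"
        for column, value in row.items():
            # Skip the id column, empty cells, and overflow cells (column None).
            if column is None or column == id_column or not value:
                continue
            triples.append(
                f'{subject} <{BASE}attr/{column}> "{escape(value)}" .')
    return triples


sample = "id,name,founded\nford,Ford Motor Company,1903\n"
for triple in rdfize(sample, "id"):
    print(triple)
```

No class hierarchy or property mapping is consulted at conversion time; deciding what these attributes mean, and whether the values are reasonable, is left to later work against a TBox.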

Benefit 5: Enables Better Substitutability, Modularity

Splitting the pie as suggested is merely the application of separation of concerns, which I believe all would largely acknowledge as leading to better substitutability and modularity. Besides swapping alternative world views to test their implications against common ABox datasets (the Benefit #1 case), we would also likely see quicker improvements in methods and algorithms for ABox consistency checking.

Benefit 6: Enables Better Dataset Descriptions

There has been growing interest and effort behind finding methods and vocabularies for describing datasets. The Sindice effort has led to the creation of a suggested sitemap standard for crawling purposes; UMBEL has suggested standard vocabularies for describing what datasets are “about”; and voiD has been working to standardize how to characterize the nature of a dataset.

Insofar as the ABox and TBox are more cleanly separated, the decisions and tradeoffs for accomplishing these tasks should enable better dataset descriptions.

Benefit 7: Minimize Tensions Between OWL and RDF Proponents

The discourse between the OWL and RDF communities can often be strained and at cross purposes. Many data publishers in the OWL community are from the sciences, where reasoning and decidability are imperative [5]. Many in the linked data community are trying to get as much data exposed and published as possible. Kendall Clark recently blogged about these ‘tribes’, a post on which I also commented.

Like any world view, there is nothing inherently wrong with being more comfortable or wanting to live in one world as opposed to another. But ultimately, the assertions made by most linked data at the ABox level need to be tested for reasonableness, and structure and an organizational view of the world (TBox) are not terribly helpful without instance data.

I wonder, in fact, whether it might be best for linked data publishers to eschew OWL altogether. Different RDF predicates could be adopted to claim sameAs-type assertions, for example, and ABox vocabularies and schema could be greatly simplified and patterned for easier development and templating. No matter how we cut it, all of this published data and its properties are only assertions until they can be tested for reasonableness, so why not accept that and make linked data generation faster and easier?

Everyone knows that data for data’s sake — linked or not — has to be tested for reasonableness before it can be relied upon for real work. Simple RDF schema for structured search purposes can work alone just fine: simply look at the error rates with current search engines. But, beyond search and non-critical linked browsing, reasoning is necessary.

The reasoning community has known for some time that all of these linked data assertions will have to be tested anyway. So, why not accept roles? Make linked data easier for search and browsing and publishing, and keep silly entailment assertions out of the mix. Then, allocate the reasoning work to coherent ontologies that know their world view and how to test for it. Instance records and ABoxes are not decidable on their own, so why pretend otherwise?

[1] Including imaginary ones, such as fantasy or mythical things like Gandalf.

[2] In a named entity, the word named applies to entities that have “rigid designators”, as defined by Kripke, for the referent. For instance, the automotive company created by Henry Ford in 1903 is referred to as Ford or Ford Motor Company. Rigid designators include proper names as well as certain natural kind terms like biological species and substances. The BBN categories proposed in 2002 consist of 29 types and 64 subtypes; Sekine’s extended hierarchy, also proposed in 2002, is made up of 200 subtypes. We use Sekine (http://nlp.cs.nyu.edu/ene/version6_1_0eng.html) as our guide. For example, Sekine’s top 15 named entity classes are: Name_Other, Person, Organization, Location, Facility, Product, Event, Natural_Object, Title, Unit, Vocation, Disease, God, Id_Number and Color; the remaining types are subsumed under these. See further http://en.wikipedia.org/wiki/Named_entity_recognition. Generally, named entities are the instances of classes.

[3] As I earlier wrote in Thinking ‘Inside the Box’ with Description Logics (now updating ‘instances’ for ‘individuals’):

[4] From the Wikipedia article on Description Logics (2/9/09):

So why was the distinction [between TBox/ABox] introduced? The primary reason is that the separation can be useful when describing and formulating decision-procedures for various DLs. For example, a reasoner might process the TBox and ABox separately, in part because certain key inference problems are tied to one but not the other one (‘classification’ is related to the TBox, ‘instance checking’ to the ABox). Another example is that the complexity of the TBox can greatly affect the performance of a given decision-procedure for a certain DL, independently of the ABox. Thus, it is useful to have a way to talk about that specific part of the knowledge base.

The secondary reason is that the distinction can make sense from the knowledge base modeler’s perspective. It is plausible to distinguish between our conception of terms/concepts in the world (class axioms in the TBox) and particular manifestations of those terms/concepts (instance assertions in the ABox.)

[5] See, most recently, Alan Ruttenberg’s comment on the W3C’s semantic-web mailing list at http://lists.w3.org/Archives/Public/semantic-web/2009Feb/0081.html.
