# Making Linked Data Reasonable using Description Logics, Part 1

## Can a ‘Separation of Concerns’ Lead to Better Slicing of the Pie?

It is clear that Fred Giasson and I have been spending considerable time on description logics of late. While we could perhaps claim that we like hard-to-read and -understand stuff, our reasons for this interest have been quite pragmatic: How to apply linked data principles to real-world commercial and organizational environments? Indeed, what should those principles even be?

Our first intuition, reaching back nearly two years now, was that linked data needed a context for bringing related datasets together. This belief led us to construct the UMBEL subject concept ontology, a basic reference roadmap for helping to point to information related (or “about”) similar subjects. UMBEL as a set of subject concepts has proved useful as a reference roadmap; and the approach to construct UMBEL and its resulting vocabulary (heavily based in SKOS) has also proved helpful to construct specific domain-level ontologies.

By their nature, these ontologies have been conceptual and structural. They define relationships, but are instance-poor. They focus on ways to describe various lenses — world views — into the domains for which we have been engaged.

But superstructures are meant to be built upon and fleshed out. For that, real instance data is required.

### The Formative ‘Named Entities’

Thus, we have more recently shifted from concepts and structure to focus on how to represent the actual things that populate that structure; that is, a domain’s actual objects or instances. We appreciate that different audiences and proponents will use terminology such as instance, or object, or entity, or individual or even foo to describe such things, but for us (Peirceian logic aside) we simply wanted a way to describe the referents to a specific real-world thing [1].

We initially chose the term ‘named entities‘ to describe these actual objects. This naming arose from the work of Sekine and his 200 named entity types [2]. Typical named entities are specific (individual) people, organizations, events, artifacts (‘Mona Lisa’), places, products, or whatever. For example, here is our first published definition describing ‘named entities‘:

Named entities are the real things or instances in the world that are themselves natural and notable class members of subject concepts. Named entities are the instances of the subject concepts in the standard definition of the term. Each named entity is mapped to a governing subject concept for ontology purposes.

Actually, ‘named entities‘, in even that sense, do not all have proper names with capitalization. Some accepted ‘named entities‘ are also written in lower case, with examples such as rocks (‘gneiss’) or common animals or plants (‘daisy’) or chemicals (‘ozone’) or minerals (‘mica’) or drugs (‘aspirin’) or foods (‘sushi’) or whatever.

 City of Iowa City Clinton St., Iowa City Location in the state of Iowa Coordinates: Country United States State Iowa County Johnson Metro Iowa City Metropolitan Area Government - Type Council-manager government - Mayor Regenia Bailey - City Manager Michael Lombardo Area - City 24.4 sq mi (63.3 km2) - Land 24.2 sq mi (62.6 km2) - Water 0.3 sq mi (0.7 km2) Elevation 668 ft (203.6 m) Population (2007 est.) - City 67,062 - Density 2,748.4/sq mi (1,059.4/km2) - Metro 147,038 Time zone CST (UTC-6) - Summer (DST) CDT (UTC-5) ZIP codes 52240-52246 Area code(s) 319 FIPS code 19-38595 GNIS feature ID 0457827 Website http://www.icgov.org/

But, hmmm. While ‘daisy’ might be an instance of the common flowers, is that the same as a specific daisy flower? especially when I can see literally thousands of daisy flowers at present in my back yard?

This epistemological question of thing v instance v individual can really mess you up! Furthermore, from the standpoint of describing these things on the Web, are we talking about the real thing, a symbol of some sort (Peirce again!) for that thing, or a multitude of similar descriptive terms (flower, bloom, daisy, florescence, bellis, chrysanthenum) for that thing?

Whether a thing is an instance or an individual or even a class depends on context. A plant taxonomy could represent its terminal nodes as specific species or subspecies of daisy. But, in a flower show, the specific thing being referred to could actually be a unique individual, the Pretty Miss Daisy blue-ribbon winner.

Description logics with its TBox and ABox splits [3] actually helps considerably to unravel these potentially confounding distinctions. The ABox covers the description of instances with their asserted attributes or characteristics. Thus, we can have an ABox description of the daisy instance that refers to daisies in general or daisies as a species, or we can have an ABox description of an individual daisy with specific proper name in a flower show.

This instance idea is really a very clean one. As long as we focus on the idea of an instance and its attributes, we can put off for the moment (or defer to another layer, that is the reasoning TBox) what kind of instance this is.

After another segue, we’ll return to this instance concept in a moment.

### Happy Birthday!: DBpedia and the Beginning of Linked Data

DBpedia, the structured and linked data extraction of “facts” from Wikipedia, was first released about two years ago. Happy 2nd Birthday! I first wrote about DBpedia shortly thereafter claiming, I think somewhat accurately, the birth of the structured Web. We now know that phenomenon and the many additional datasets that nucleate around DBpedia as linked data.

When first explained, DBpedia used examples of so-called Wikipedia infoboxes for the cities of Leipzig or Innsbruck to describe the source of its structured data. (Subsequently Berlin has also been commonly referenced, all understandable given DBpedia’s two principal founders of SÃ¶ren Auer at the UniversitÃ¤t Leipzig and Chris Bizer at Freie UniversitÃ¤t Berlin; of course, many others have joined and meaningfully supported the project since.) Infoboxes are Wikipedia templates that provide standardized, structured information across related articles of a similar type.

I have copied a similar infobox — in this case for the same city type from Wikipedia for my home town of Iowa City — to show one of these structured data templates (shown to right). In using it I am, of course being a bit parochial, but it is also interesting to see the growth of structured data (attributes) that such templates now contain compared to what was available at the time of DBpedia’s first release.

This infobox is a perfect example of an ABox. The instance it describes is the ‘City of Iowa City’. Each of the items that follow show an attribute or data characteristic of some form with its associated value as a key-value pair. Sometimes those values refer to other instances, some of which are individuals, such as the county or name of the mayor.

In ABox terminology, these values are asserted for each attribute. Because this is Wikipedia, which has a reputation for accuracy and authority, we tend to believe and accept these assertions. But, we also know that sometimes these values are not correct, even for Wikipedia. We also know that instance records can come from many, many different sources, perhaps most not with the accuracy or authority of Wikipedia.

It is these types of instance records (for many other types of things than city, of course!) that are now being published as linked data. Today more than 50 general public datasets and perhaps another 50 from the sciences (especially biology) have been published. The total assertions across all datasets now exceeds millions, and the RDF statements that capture all of the relationships between these instances, attributes and concepts that describe them exceed one billion, as the recent Billion Triples challenge attests.

What is nice about this ABox structure is that they are relatively simple — instances characterized by attributes — with the “facts” so expressed understood to be assertions and not necessarily verified truth or accuracy. No matter what the source, there is no guarantee that all assertions will be complete and accurate. (Though, as has proved to be the case for DBpedia because of its Wikipedia heritage, some of the sources can be comfortably asserted to be authoritative.)

Assertions about many of the attributes are relatively straightforward such as, in the Iowa City instance, zip codes or time zones or population. (Still, the estimates used could also be out of date or the estimation methods could be argued.) However, other assertions, more based on interpretation or personal opinion, such as subject matter or political or religious affiliation or bias, can be quite controversial.

Another potential source of error is the linked data assertion that one instance is the owl:sameAs a different instance in a different dataset. Erroneous ‘same as’ assertions can arise quite simply and not require malice or stupidity. For example, for me, I actually live in Coralville, Iowa, not Iowa City. But, Coralville completely abuts Iowa City, shares a school district, and my wife works in Iowa City. I more often than not claim Iowa City as my location, though my actual mailing address is Coralville. How does one reasonably say that the identity of Michael Bergman of Coralville is the same as the Michael Bergman of Iowa City?

### Cutting the ABox Slice Out of the Pie

So, what can the perspective of the ABox and description logics tell us about these issues?:

• First, instance records on their own can only contain assertions; instance records can not alone be a basis to decide the reasonableness of those assertions
• Second, if we stay focused on the idea of an instance record, we can wiggle off the hook about whether we are talking about classes or groups or individuals. The instance is merely the thing at hand, with appropriate attributes or not based on the nature of the instance
• Third, the role — or “burden” if you will — of the instance record is merely to convey attribute assertions about a single instance. The ABox can be streamlined with comparatively little structure and comparatively little semantics
• Fourth, some attribute assertions are more straightforward and more easily tested, other attribute assertions are more problematic. That consideration should not limit the scope of any assertions that can be made in an instance record, just that certain attribute types may be harder to test or accept
• And, fifth, and most importantly, these considerations strongly suggest a clean break between data characterizations and structures to describe instances (the ABox) from how instances relate to one another or whether the attributes asserted for a given instance are reasonable or not (both being the work of the TBox).

This is the rationale from an earlier posting from me called Back to the Future with Description Logics that clearly separates the TBox and ABox functions:

Now, it is true that the ABox and TBox distinctions are conceptual, and in practice not often actual, with no mandate or requirement based in description logics that they remain separate. However, for reasons of tractability and communication and computational performance at scale, there may be justification for keeping these constructs separate [4].

In the diagram, note that each ABox instance has the simple appearance of an instance record. Also note that the attributes that describe or characterize those instances should also be included and described with relationships modeled at the TBox level. The TBox is the proper place to describe all of the attribute relationships.

So, for Structured Dynamics, we have made a clean split in these roles and data structures in those client architectures over which we have design control. Ontologies populate the TBox level. Instance records assembled into instance dictionaries populate the ABox level, with various instance types governed by their own lightweight schema and vocabularies. This simple functional split leads to cleaner architectures and easier decisions about what belongs in which box or another for a given circumstance. It is also more performant, but more on that in a later part.

### Summary of Benefits for Keeping the Pie Slices Separate

Of course, this is not how the linked data and semantic Web is currently architected or conceptualized. This smearing of roles and work responsibilities leads, we think, to many communication issues and slower uptake. As our own thinking gets clearer on these issues, we see there are some key benefits arising from keeping distinct the TBox (ontologies) and ABox (instances) pieces of the semWeb pie.

#### Benefit 1: Keeps World View Separate from Facts and Assertions

Wikipedia is a good case in point where conjoining facts with a world view does not work well. One part that does work well are the “facts” in the specific Wikipedia pages that describe things. They are the ABox structure of the Wikipedia knowledge base. Another useful aspect of Wikipedia, kind of at the interstices of the ABox and TBox, are its see also and disambiguation pages. These, too, have proved to be very useful for gathering synonyms for a specific instance or for disambiguating two similarly named instances.

But at the conceptual level of how the world is organized — what are the relations between instances and how those instances are categorized — Wikipedia has arguably been unsatisfactory. Why that might be is a discussion for another time.

One perhaps could make an inverse observation about the Cyc knowledge base where a quite coherent world view of concepts exists (and, actually, many world views through Cyc’s very useful microtheories construct), but is often hard to discern and discover because of the admixing of instances, the coverage of which is also quite lumpy. Some domains have many instances, others are quite sparse.

Trying assiduously to keep bodies of facts and assertions (ABox) separate from how to interpret that world (TBox) brings distinct benefits. The facts base (ABox) is more easily tested for consistency. Different world views (TBox) can be more easily applied and compared against these fact bases. Testing and accepting different aspects of different sources is made easier if the ABox and TBox are not conjoined.

#### Benefit 2: Keeps Terminology Simpler

When the different purposes and roles and resulting work that might be applied to ABox and TBox are conjoined, our ability to describe things gets murky. We sometimes call mere controlled vocabularies “ontologies”, for example, which only acts to dilute the concept. We have facts and assertions and relations and hierarchies and stuff ranging from the minutiae to the abstract and sublime being lumped and described with the same terminology. Because we can not clarify and describe to ourselves roles and responsibilities for this stuff, no wonder we can’t communicate well with the broader public.

I believe if the semantic Web community could stand back and try again to apply the rigor of description logics to its enterprise, now that we are gaining some real exposure and success with linked data, we could begin to clean up this emerging mess we are creating for ourselves.

Here are some starting suggestions. Let’s call the combination of ABox and TBox a knowledge base, not an ontology. Let’s reserve the term ontology for the terminological relationships and concepts at the TBox level. And let’s focus on ABox instances as requiring only simple vocabularies to describe the assertions of attributes (what we might call schema consistent with RDFS and relational database schema). We thus could see a set of pieces similar to:

Knowledge Base = Ontology + [Disambiguation] + [Identity Relatedness] + Instance Schema

Note I suggest a couple of interesting work items at the interface between the TBox and ABox: disambiguating instances and determining the identity relatedness (for example, ‘same as’) between instances. This is work that should be kept apart from the ABox, but may or may not be best handled in the TBox (and, in any case, is generally separate work from the conceptual structure of the TBox).

This separation of concerns, or something akin to it, would result in a much cleaner — and, therefore, simpler — terminology for communicating with the interested public.

#### Benefit 3: Enables Simpler Instance Schema

Prima facie, an instance schema that merely needs to capture attribute assertions for an instance will be much simpler than current practice. In turn, that should lead to more patterned schema with easier and quicker extension to new domains and vocabularies. And, that, in turn, will aid ABox consistency checking.

#### Benefit 4: Easier Conversion of non-RDF Structured Data

Without the need for ontology mapping at time of conversion, existing RDFizers could be more readily applied to convert other structured data forms to simple RDF schema.

#### Benefit 5: Enables Better Substitutability, Modularity

Splitting the pie as suggested is merely the application of separation of concerns, which I believe all would largely acknowledge as leading to better substitutability and modularity. Besides swapping alternative world views to test their implications against common ABox datasets (the Benefit #1 case), we would also likely see quicker improvements in methods and algorithms for ABox consistency checking.

#### Benefit 6: Enables Better Dataset Descriptions

There has been growing interest and effort behind finding methods and vocabularies for describing datasets. The Sindice effort has led to the creation of suggested sitemap standard for crawling purposes; UMBEL has suggested standard vocabularies for describing what datasets are “about”, and voiD has been working to standardize how to characterize the nature of a dataset.

Insofar as the ABox and TBox are more cleanly separated, the decisions and tradeoffs for accomplishing these tasks should enable better dataset descriptions.

#### Benefit 7: Minimize Tensions Between OWL and RDF Proponents

The discourse between the OWL and RDF communities can often be strained and at cross purposes. Many data publishers in the OWL community are from the sciences, where reasoning and decidability is imperative [5]. Many in the linked data community are trying to get as much data exposed and published as possible. Kendall Clark recently blogged about these ‘tribes’ to which I also commented.

Like any world view, there is nothing inherently wrong with being more comfortable or wanting to live in one world as opposed to another. But ultimately, the assertions made by most linked data at the ABox level needs to be tested for reasonableness, and structure and an organizational view of the world (TBox) is not terribly helpful without instance data.

I wonder, in fact, whether it might be best for linked data publishers to eschew OWL altogether. Different RDF predicates could be adopted to claim sameAs-type assertions, for example, and ABox vocabularies and schema could be greatly simplified and patterned for easier development and templating. No matter how we cut it, all of this published data and its properties are only assertions until they can be tested for reasonableness, so why not accept that and make linked data generation faster and easier?

Everyone knows that data for data’s sake — linked or not — has to be tested for reasonableness before it can be relied upon for real work. Simple RDF schema for structured search purposes can work alone just fine: simply look at the error rates with current search engines. But, beyond search and non-critical linked browsing, reasoning is necessary.

The reasoning community has known for some time that all of these linked data assertions will have to be tested anyway. So, why not accept roles? Make linked data easier for search and browsing and publishing, and keep silly entailment assertions out of the mix. Then, allocate the reasoning work to coherent ontologies that know their world view and how to test for it. Instance records and ABoxes are not decidable on their own, so why pretend otherwise?

[1] Including imaginary ones, such as fantasy or mythical things like Gandalf.
[2] In a named entity, the word named applies to entities that have a “rigid designators” as defined by Kripke for the referent. For instance, the automotive company created by Henry Ford in 1903 is referred to as Ford or Ford Motor Company. Rigid designators include proper names as well as certain natural kind of terms like biological species and substances. BBN categories proposed in 2002 consists of 29 types and 64 subtypes; Sekine’s extended hierarchy also proposed in 2002 is made up of 200 subtypes. We use Sekine (http://nlp.cs.nyu.edu/ene/version6_1_0eng.html) as our guide. For example, Sekine’s top 15 named entity classes are: Name_Other, Person, Organization, Location, Facility, Product, Event, Natural_Object, Title, Unit, Vocation, Disease, God, Id_Number and Color; the remaining types are subsumed under these. See further http://en.wikipedia.org/wiki/Named_entity_recognition. Generally, named entities are the instances of classes.

[3] As I earlier wrote in Thinking ‘Inside the Box’ with Description Logics (now updating ‘instances’ for ‘individuals’):

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”

[4] From the Wikipedia article on Description Logics (2/9/09):

So why was the distinction [between TBox/ABox] introduced? The primary reason is that the separation can be useful when describing and formulating decision-procedures for various DLs. For example, a reasoner might process the TBox and ABox separately, in part because certain key inference problems are tied to one but not the other one (‘classification’ is related to the TBox, ‘instance checking’ to the ABox). Another example is that the complexity of the TBox can greatly affect the performance of a given decision-procedure for a certain DL, independently of the ABox. Thus, it is useful to have a way to talk about that specific part of the knowledge base.

The secondary reason is that the distinction can make sense from the knowledge base modeler’s perspective. It is plausible to distinguish between our conception of terms/concepts in the world (class axioms in the TBox) and particular manifestations of those terms/concepts (instance assertions in the ABox.)
[5] See, most recently, Alan Ruttenberg’s comment on the W3C’s semantic-web mailing list at http://lists.w3.org/Archives/Public/semantic-web/2009Feb/0081.html.

# Back to the Future with Description Logics

## Have Linked Data, Microformats Stumbled into an Adaptive Design?; Benefits from Keeping the TBox and ABox Separate

I was glad to see Kendall Clark pick up on parts of my earlier piece on Thinking ‘Inside the Box’ with Description Logics. He took one point of view in his posting — that I mostly agree with — but I’d also like to reinforce some other thoughts. And, those thoughts are: description logics (DL) provides earlier lessons and insights that our current zeal for linked data should not overlook, and the lessons we can gain from DL are really fundamental and architectural.

For those of you who have not read Kendall’s piece — which I heartily recommend — let me give you my Cliffs Note’s summary: there are those within the semantic Web community that want to capture the conceptual relationships within knowledge and domains, the Maximum Fidelity tribe, and then those that want to link and describe as many things as possible, the Maximum Scalability tribe, with those (like Kendall’s firm, Clark & Parsia) residing in the middle and following the precepts of DL. The theme is that extremes exist and need to be bridged. [1]

Posing these contrasts is an effective way to describe different ideas and approaches, but, like all straw men, perhaps it hides nuances and complexity. And, as I note below, it may also pose the wrong straw man dichotomy.

### We Are All Tribes of One

Jim Hendler, for one, took exception to Kendall’s characterization to make the obvious point that different use cases demand different approaches. What was interesting, however, in these interchanges was that a nerve was seemingly struck about differences in viewpoints and approaches. Indeed, the very reference to “tribes” seemed to bring out the (ahem) tribal response.

So, just so we are clear, in what I say below I take on the position of a tribe of one; that is, my own opinion. Of course, this is what all of us do. By positing tribes and viewpoints we simplify what is nuanced and subtly convey that opinions are cultural (“tribal”) and not subject to learning and change. Perhaps within the temporal viewpoint of whatever may be today’s trends and “memes” such thinking may hold, but I fundamentally disagree with such a static view of collective understanding and communities over more meaningful periods of years or decades. But, I digress. . . .

At the risk of being simplistic, I think we can say that there was a rich academic and intellectual history behind description logics going back to the early 1990s [2]. Then, with the seminal semantic Web paper built from thinking in the late 90s by Berners-Lee and published by him and Hendler and Lassila in 2001 in Scientific American [3], a real marker was put down for machine-readable and -actionable data (via “agents”) accessible on the Web. Many have been disappointed at the slow pace of the semWeb’s unfolding and some have blamed and rejected AI and “big” ontologies for this slowness. As usable standards finally emerged, a newer set of acolytes pushed “just getting data out there” and RDF linked data began to assume prominence from about 2006 onward, spearheaded by DBpedia and the linked open data community.

In so many ways we are coming full circle — coming back to the future — in seeing how our new linked data techniques can again benefit from this earlier DL thinking. Rather then poles and spectrums, I think we are experiencing the need to revisit our intellectual past now that workable publishing mechanisms and scalability and organization assume real prominence. Though clearly not intentional, the linked data community (and, in a related way, microformats), may just have stumbled upon a very cool architectural design that can leverage DL precepts.

### Some Terminology Revisited

Some of this DL and semWeb terminology can be off-putting. But it is helpful to know the lingo if one wants to look into the technical literature. Though most of this stuff can be described without resorting to such terms and can be readily grasped on an intuitive basis, here are some important grounding terms:

• TBox — according to [2], a TBox “contains intensional knowledge in the form of a terminology (hence the term ‘TBox,’ but ‘taxonomy’ could be used as well) and is built through declarations that describe general properties of concepts. Because of the nature of the subsumption relationships among the concepts that constitute the terminology, TBoxes are usually thought of as having a lattice-like structure; this mathematical structure is entailed by the subsumption relationship — it has nothing to do with any implementation.” A TBox uses a controlled vocabulary to define the concepts and roles of a domain of interest and the relations or properties amongst them
• ABox — according to [2], an ABox “contains extensional knowledge — also called assertional knowledge (hence the term ‘ABox’) — that is specific to the individuals of the domain of discourse.” An ABox provides the concept and role membership assertions for instance data, as well as assertions or “facts” about the attributes of those instances using the same controlled vocabulary as defined in the TBox
• First-order logic (FOL) — is a formal deductive system with unambiguous logic and mathematical structures for declaring, testing and inferring propositions (statements) and predicates (relations). A first-order theory consists of a set of axioms (usually finite or recursive) and the statements deducible from them based on FOL’s base logical axioms (such as the operators found in classical set theory, which itself is built on FOL)
• Description logics (DL) — are any of a family of knowledge representation languages that can be translated and characterized according to first-order logic. A DL language has a syntax that consists of unary predicate symbols to denote concepts, binary relations to denote roles, and recursion. DL semantics define concepts as sets of individuals and roles as sets of pairs of individuals. The expressivity of a DL language is a function of the logical operators the language supports (shown with representations such as $\mathcal{SROIQ}^\mathcal{(D)}$, the expressiveness of OWL 2). DL languages can be translated into other DL languages that support the same expressivity, regardless of syntax, but more expressive languages can not be equivalently represented by less expressive ones. The current OWL dialects of OWL Lite and OWL DL are DL languages
• Axiom — in traditional logic or FOL, an axiom (also called a ‘postulate’) is a proposition that is not proved or demonstrated but considered to be either self-evident or consistent with the base logic of the system. As such, its truth is taken for granted and the axiom serves as a starting point for deducing and inferring other (theory-dependent) truths
• Intensional — is a form of set membership that is based on the propositions and concepts which defines the set; there may be many possible members that remain unenumerated so long as they meet the conditions for membership. The intensional principal judges objects to be a member based on the properties or conditions they must have
• Extensional — is a form of set membership that arises from (“extends”) its listed set members. The extensional principle judges objects to be a member if they have the same external characteristics (whether as explicitly defined properties or not)
• Ontology — as used in knowledge representation or information science, this term is most often defined using Tom Gruber’s “explicit specification of a conceptualization” [4]. In practice on the semantic Web, it is any defined schema or data record structure including the most lightweight controlled vocabularies and structures (such as microformats). In DL, both ABox and TBox specifications and statements are lumped under the term
• Vocabulary — also ‘controlled vocabulary,’ is an organized, variously structured set of terms used for information retrieval or characterization. In its simplest form, a controlled vocabulary is merely a list for checking possible matches for set membership or not; at its more complex, it is the set of terms contained within a detailed and specified ontology or schema with formalized (axiomatized) relationships
• Knowledge base — in the DL community, a knowledge base is simply defined as TBox + Abox. In other words, a knowledge base is a logical schema of roles and concepts and the relationships between them (the TBox) as populated by the actual data (instances) asserting memberships and attributes (“facts”) (the ABox).

### TBox v ABox: Different Purposes and Roles

Within description logics and for our purposes herein, the two concepts we will most focus upon are the ABox and the TBox. As the definitions above suggest, the TBox is more structural and reflects the logical and conceptual relationships within a domain; that is, the role and concept and class relationships. The ABox provides the data (instance) records and characterizations within that schema; that is the instances and facts assertions. By analogy, in a conventional relational database system, the database or logical schema would correspond to the TBox; the actual data records or tables would correspond to the ABox.

These distinctions suggest very different purposes and roles, then, for the TBox and the ABox:

 TBox ABox Definitions of the concepts and properties (relationships) of the controlled vocabulary Declarations of concept axioms or roles Inferencing of relationships, be they transitive, symmetric, functional or inverse to another property Equivalence testing as to whether two classes or properties are equivalent to one another Subsumption, which is checking whether one concept is more general than another Satisfiability, which is the problem of checking whether a concept has been defined (is not an empty concept) Classification, which places a new concept in the proper place in a taxonomic hierarchy of concepts Logical implication, which is whether a generic relationship is a logical consequence of the declarations in the TBox Membership assertions, either as concepts or as roles Attributes assertions Consistency checking of instances Entailments, which are whether other propositions are implied by the stated condition Satisfiability checks, which are that the conditions of instance membership are met Infer property assertions implicit through the transitive property Instance checking, which verifies whether a given individual is an instance of (belongs to) a specified concept Knowledge base consistency, which is to verify whether all concepts admit at least one individual Realization, which is to find the most specific concept for an individual object Retrieval, which is to find the individuals that are instances of a given concept

While certainly many of the ABox tests and checks require TBox structure, there is a pretty clear separation of purpose and role. Moreover: 1) the scale of the information in each “box” is vastly different (perhaps a few to hundreds to at most thousands of concepts in the TBox in contrast to potentially millions or more instances in ABoxes); and 2) ABox dataset repositories may also be (indeed, often are!) numerous, spatially distributed and semantically heterogeneous.

### The Wisdom of Separating Concerns

DL and semantic Web stuff in general are data and logic models, not architectural guidance. So, rarely does one see discussion of the architectural imperatives that some of these logical underpinnings provide. We see knowledge bases and ontologies both used as umbrella terms encompassing both the ABox and the TBox.

However, our own deployment experiences and the literature suggest there are manifest advantages to keeping the TBox and ABox separate:

 Advantages of Keeping the TBox and ABox Separate Better performance by keeping inference and reasoning purposes separate Better scalability through separation of function Use of tailored reasoners and rules engines based on purpose [5] More modular design, including keeping attribute information separate from structural and conceptual relationships Faster, global instance checking using summary ABox tests [6] Assignment of named entities (instances) to distinct and disjoint super types [7] that can bring significant tableaux benefits to ABox reasoning Easier partitioning of ABoxes [8] Easy swapping in and mix-and-matching of varied, multiple and private or public named entity dictionaries (ABoxes) Integration with extant relational (RDBMs) data structures and data stores for instance (ABox) data [9] Integration with other lightweight structures (microformats, other) for instance (ABox) data Faster retrievals via TBox routing to appropriate ABoxes Simpler ABox vocabularies that are easier to understand and extend (including continued reliance on RDF and RDFS) More capable TBox ontologies, including integration of rules systems Relatively easy extension of the TBox schema ontology into specific domains Easy ABox data entry and updating via wiki or sematic wiki Ability to triangulate between separate concept (TBox) and instance (ABox) disambiguation approaches to improve overall precision and recall.

It would be useful to refrain from lumping the very different purposes of ABoxes and TBoxes under the umbrella rubric of ‘ontology’. It would also be useful for designers and vocabulary authors to be more explicit in their own minds as to purpose and content when formulating new ontologies. Smushing all of these concepts into one bubbling mess may not lead to clarity nor good performance.

### A Simple Schematic of Best Practice

Taking these basic ideas we can visualize a general schematic for best practice splits within the ontology or knowledge base:

The TBox is clearly focused on the domain at hand, but also includes links and equivalents to external ontologies. The TBox level should be entirely free of instance data, though all attributes, properties and concepts that might be found at the ABox level are also defined with their relationships at the TBox level. Like any semWeb ontology, this TBox level should also re-use common Web ontologies such as FOAF, SIOC, UMBEL, etc.

It is also the case that because of the reasoning needs at the TBox layer, the semantic Web language used should likely be a dialect of OWL (see below).

(BTW, for my own practice, I will try to limit my use of the ‘ontology’ term to the concepts and classes at this TBox level.)

The ABox level, in contrast, may consist of multiple datasets and name spaces. These structures are most appropriately seen as lightweight controlled vocabularies with limited structure; if written by scratch perhaps limited to RDF or RDFS (the schema variant). This layer, however, can also remain in non-semWeb native form — such as RDBMS data tables, microformats or other formats — that are wrapperized for interoperability through one or more ‘RDFizers‘ or GRDDL.

These structures should likely not make many external assertions, if any, and if done, perhaps in separate mapping or linkage file that can be processed and analyzed independently. It is important, however, to make sure that all attributes at this ABox layer have a counterpart with relationships and structure defined at the TBox layer.

This architectural design enables complete independence of the instance datasets from the inferencing logic or federation that might be applied to them.

### The Relevance to Linked Data Instances

Since it first took off in 2006, linked data and the various datasets now shown in the ‘LOD cloud‘ have been dominated by instance data. There are perhaps 10 million to 20 million instance objects available as linked data, many of which are derived from Wikipedia (via DBpedia) with attributes or structure coming from the Wikipedia infoboxes.

“There need not be a trade-off between expressiveness and scalability. Proper design, language choice and architecture can readily achieve both — while maintaining independence of scope or purpose.”

A similarly fast explosion has taken place with structured records via microformats and other simple data structures. For example, some earlier estimates suggest there are perhaps more than 2 billion pages that include microformats [10].

I have at times recently made comments about the dominance of instance data within the linked data community and the need for organizing structure. While this observation, I believe, remains true and provides a rationale for UMBEL as an organizing subject structure (or any other organizing structure, for that matter), perhaps I have been missing a more fundamental point: linked data (at least as practiced to date) is really about exposing ABoxes with simple structure. Perhaps, by serendipity, linked data (and other light structures like microformats) are showing the way to a distributed, mixed ABox-TBox structure for the Web.

With this altered viewpoint, a number of new observations emerge:

• Linked data instance structures perhaps need to be consciously designed as such, with lightweight structure and limited external (OWL-based or TBox-oriented) class structure
• The recent interest shown in so-called VoCamps (for lightweight vocabulary development) and voiD (vocabulary of interlinked datasets) might be usefully viewed specifically through an ABox “lens” with perhaps best practices for structures and vocabulary syntaxes to emerge
• TBox class and relationship structures should remain apart from the instance data and can be used to operate independently of the datasets
• Architectural and design changes might also emerge through clear separation of ABox and TBox that can benefit semantic Web scalability.

Linked data and microformats and other lightweight structures are now giving us the exposed instance data to begin reasoning and showing differences due to inferencing and other logic advantages for the semantic Web. Now that the ABox is being proven, let’s move on and stress-test the TBox!

### OWL 2 and Query Rewriting

Since the first version of OWL there has been confusion and some limitations with the dialects of the language. Only OWL Full allowed classes to be treated both as instances and classes (so-called metamodeling), and was therefore used as the basis for mapping UMBEL, for example, to RDF and RDFS vocabularies and to Cyc. This design was necessary, but left UMBEL undecidable using standard DL reasoners; only the two dialects of OWL DL and OWL Lite met description logics requirements.

Indeed, it was even hard to determine what dialect an OWL file represented, among many other problems and issues. The technical committee behind OWL 2, in fact, has written an excellent critique of issues with this first version of OWL [11].

For nearly two years the next version of OWL, OWL 2, has been undergoing development, with the last draft now published and available for last comment before January 23 [12]. Lessons and refinements to the use of DL have also occurred. Some have criticized this effort and have criticized the need for OWL 2′s growing expressiveness and vocabulary [see 1, for example]. I believe these criticisms to be unfair and to miss many of the thoughtful improvements in this new version.

Version control and expressiveness are two of these benefits. A broader benefit, though, has been the keen attention the developers have given to compliance with description logics and the ability to formulate fragments (called “profiles”) that only present subsets of DL useful for computational considerations [13]. For example, one profile, OWL 2 QL, appears well suited to the ABox; another, EL, appears well suited to the TBox. Users and tools builders may define other subsets of OWL 2 to deal with different use cases.

What is emerging are possible design patterns that would have comprehensive TBox guidance and inference structures that first receive a query, then do query rewriting for less capable OWL dialects and mapping to distributed ABox datasets, some of which might be kept in native relational DB or other structural forms [9, 14, 15]. Other approaches and designs, such as overviewed for DLDB2, KAON2, OWLIM, BigOWLIM and Minerva, are testing other architectural and DL combinations [see 15]. And, at the level of the specific triplestore, other optimizations are being made such as owl:sameAs or query rewrite with Virtuoso [16]. This new version of OWL and its profiles have adapted to past lessons and can be matched well to the emerging hardware and architectural designs.

These changes appear to now provide the option for various dialects of OWL to be matched with reasoners and architectural designs in order to optimize for different purposes. Rather than a spectrum, we appear to be learning and maturing. Hopefully, getting back to the architectural implications of the TBox – ABox split can show us there need not be a trade-off between expressiveness and scalability. Proper design, language and dialect choice, and architecture can readily achieve both — while maintaining independence of scope or purpose.

Thanks, OWL 2! You have fulfilled your commitment to description logics. It is now our turn to figure out the best practices for working with these tools.

[1] For those with a spare 90 minutes or so, you may also want to view this panel session and debate that took place on “An OWL 2 Far?” at ISWC ’08 in Karlsruhe, Germany, on October 28, 2008. The panel was chaired by Peter F. Patel-Schneider (Bell Labs, Alcathor) with the panel members of Stefan Decker (DERI Galway), Michel Dumontier (Carleton University), Tim Finin (University of Maryland) and Ian Horrocks (University of Oxford), with much audience participation. See http://videolectures.net/iswc08_panel_schneider_owl/.
[2] F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and P. F. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press, 2003. See Chapter 1. Sample chapters may be viewed from Enrico Franconis Description Logics course notes and tutorial at http://www.inf.unibz.it/~franconi/dl/course/, which is an excellent starting reference point on the subject.
[3] Tim Berners-Lee, James Hendler and Ora Lassila, 2001. “The Semantic Web,” in Scientific American, 284(5):34-43, May 2001; see http://www.sciam.com/article.cfm?id=the-semantic-web.
[4] Thomas R. Gruber, 1993. “A Translation Approach to Portable Ontology Specifications,” in Knowledge Acquisition 5(2): 199-220; see http://tomgruber.org/writing/ontolingua-kaj-1993.pdf.
[5] Georgios Meditskos and Nick Bassiliades, 2008. “Combining a DL Reasoner and a Rule Engine for Improving Entailment-based OWL Reasoning,” presented at the 7th International Semantic Web Conference (ISWC2008); see http://lpis.csd.auth.gr/publications/med-iswc08.pdf.
[6] Achille Fokoue, Aaron Kershenbaum, Li Ma, Edith Schonberg, and Kavitha Srinivas, 2006. “The Summary Abox: Cutting Ontologies Down to Size,” presented at the 5th International Semantic Web Conference, Athens, GA, USA, November 5-9, 2006; see http://iswc2006.semanticweb.org/items/Kershenbaum2006qo.pdf.
[7] These are akin to the lexicographer supersenses that have been applied in WordNet for nouns and verbs (though only nouns are used here). See Massimiliano Ciaramita and Mark Johnson, 2003. Supersense Tagging of Unknown Nouns in WordNet, in Proceedings of the Conf. on Empirical Methods in Natural Language Processing, pp. 168173, 2003. See http://www.aclweb.org/anthology-new/W/W03/W03-1022.pdf.
[8] Yuanbo Guo and Jeff Heflin, 2006. “A Scalable Approach for Partitioning OWL Knowledge Bases,” in Proceedings of the 2nd International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS2006), Athens, Georgia, USA, November, 2006; see http://swat.cse.lehigh.edu/pubs/guo06c.pdf.
[9] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini and Riccardo Rosati, 2007. “Tractable Reasoning and Efficient Query Answering in Description Logics: The DL-Lite Family,” in Journal of Automated Reasoning, 39 (3), October 2007; see http://www.dis.uniroma1.it/~degiacom/papers/2007/calv-etal-JAR-2007.pdf.
[10] The original citation could not be found, but it is referenced on the Microformats mailing list on Oct. 1, 2008; see http://microformats.org/discuss/mail/microformats-discuss/2008-October/012550.html.
[11] Bernardo Cuenca Grau, Ian Horrocks, Boris Motik, Bijan Parsia, Peter Patel-Schneider and Ulrike Sattler, 2008. “OWL2: The Next Step for OWL,” in Journal of Web Semantics, 6(4): 309-322, November 2008; see http://www.comlab.ox.ac.uk/people/ian.horrocks/Publications/download/2008/CHMP+08.pdf.
[12] Boris Motik, Peter F. Patel-Schneider, Bijan Parsia, eds., 2008. OWL 2 Web Ontology Language: Structural Specification and Functional-Style Syntax, W3C Working Draft, December 2, 2008. See http://www.w3.org/TR/owl2-syntax/.
[13] Boris Motik, Bernardo Cuenca Grau, Ian Horrocks, Zhe Wu, Achille Fokoue and Carsten Lutz, eds., 2008. OWL 2 Web Ontology Language: Profiles, W3C Working Draft, December 2, 2008. See http://www.w3.org/TR/owl2-profiles/.
[14] Luciano Serafini and Andrei Tamilin, 2007. “Instance Migration in Heterogeneous Ontology Environments,” in Proceedings of 6th International Semantic Web Conference / 2nd Asian Semantic Web Conference (ISWC/ASWC 2007), pages 452-465, 2007. See http://iswc2007.semanticweb.org/papers/449.pdf.
[15] Zhengxiang Pan, Xingjian Zhang and Jeff Heflin, 2008. “DLDB2: A Scalable Multi-Perspective Semantic Web Repository,” in WI 08: Proceedings of the International Conference on Web Intelligence, IEEE Computer Society Press, pp. 489-495; see http://swat.cse.lehigh.edu/pubs/pan08a.pdf.
[16] Orri Erling and Ivan Mikhailov, 2007. “RDF Support in the Virtuoso DBMS,” in Proceedings of the 1st Conference on Social Semantic Web, Leipzig, Germany, Sep 26-28, 2007; see http://aksw.org/cssw07/paper/5_erling.pdf.

# Multi-part Federated Search Interview

## Topics Range from the Deep Web to Semantic Web in this Search Luminaries Series

I’m pleased to wrap up a multi-part interview with the Federated Search Blog as part of their ongoing ‘Search Luminaries’ series. Sol Lederman, editor of the blog, does a thorough and comprehensive job! Over the past month on every Friday, I have answered some 25 or so of his detailed questions.

Federated Search Blog was particularly interested in the deep Web, its discovery and size. Many of the early questions deal with those themes. However, by Part 4 things get a bit more current, with the topics shifting to the semantic Web, linked data and Zitgist.

Here are the links to the series:

To give you a flavor of the interview, here is an example of one of the questions (and probably my favorite):

20. Tim Berners-Lee, credited with inventing the World Wide Web, has been talking about the importance and value of the Semantic Web for years yet common folks don't see much evidence of the Semantic Web gaining traction. Is there substance to the Semantic Web? What's happening with it now and what does its future look like?

Wow, in 10,000 words or less?

No, actually, this is a very good question. As things go, I am a relative newbie to the semantic Web, only having studied and followed it closely since about 2005. I'm sure my perspective in coming later to the party may not be shared by those at the beginning, which dates to the mid-1990s as Berners-Lee's vision naturally progressed from a Web of documents, as most of us currently know the Web, to a Web of data.

I think there is indeed incredibly important substance to the semantic Web. But, as I have written elsewhere, the semantic Web is more of a vision than a discernable point in time or a milestone.

The basic idea of the semantic Web is to shift the focus from documents to data. Give data a unique Web address. Characterize that data with rich metadata. Describe how things are related to one another so that relationships and connections can be traced. Provide defined structures for what these things and relationships "mean"; this is what provides the semantics, with the structures and their defined vocabularies known as "ontologies" (which in one analog can be seen as akin to a relational database schema).

As these structures and definitions get put in place, the Web itself then becomes the infrastructure for relating information from everywhere and anywhere on any given topic or subject. While this vision may sound grandiose, just think back to what the Web itself has done for us and documents over the past decade or so. This same architecture and infrastructure can and should be extended to the actual information in those documents, the data. And, oh, by the way, conventional databases can now join this party as well. The vision is very powerful and very cool.

Progress has indeed been slow. Many advocates fairly point to how long it takes to get standards in place and for a while people spoke of the "chicken-and-egg" problem of getting over the threshold of having enough structured data to consume to make it worthwhile to create the tools and applications and showcases that consume that data.

From my perspective, the early visions of the semantic Web were too abstract, a bit off perhaps. First, there was the whole idea of artificial intelligence and machines using the data as opposed to better ways for humans to draw use from the data at hand. The fundamental and exciting engine underneath the semantic Web — the RDF (Resource Description Framework) data model — was not initially treated on its own. It got admixed with XML that made understanding difficult and distinctions vague. There is and remains too much academia and not enough pragmatics driving the bus.

But that is changing and fast.

There is now an immediate and practical "flavor" of the semantic Web called linked data. It has three simple bases:

(1) RDF as the simple but adaptable data model that can represent any information — structured or unstructured — as the basic "triple" statement of subject-predicate-object. That sounds fancy, but just substitute verb for predicate and noun for subject and object. In other words: Dick sees Jane; or the ball is round. It sounds like a kindergartner reader, but that is how data can be easily represented and built up into more complex structures and stories

(2) Give all objects a unique Web identifier. Unique identifiers are common to any database; in linked data, we just make sure those identifiers conform to the same URIs we see constantly in the address bar of our Web browsers, and:

(3) Post and expose this stuff as accessible on the Web (namely, HTTP).

My company adds some essential "spice" to these flavors with respect to reference structures and concepts to give the information context, but these simple bases remain the foundation.

These are really not complex steps. They are really no different than the early phases of posting documents on the Web. Only now, we are exposing data.

More importantly, we can forget the chicken-and-egg problem. Each new data link we make brings value, in the similar way that adding a node to a network brings value according to Metcalfe's Law. Only with linked data, we already have the nodes — the data — we are just establishing the link connections (the verbs, predicates or relations) to flesh out the network graph. Same principle, only our focus is now to connect what is there rather than to add more nodes. (Of course, adding more linked nodes helps as well!)

The absolutely amazing thing about our current circumstance as Web users is that we truly now have simple and readily deployable mechanisms available to finally overcome the decades of enterprise stovepipes. The whole answer is so simple it can be mistaken as snake oil when first presented and not inspected a bit.

As an industry accustomed to hype and cynical about so much of this, I only ask that your readers check out these assertions for themselves and suspend their normal and expected disbelief. For me, in a career of more than 30 years focusing on information and access, I feel like we finally now have the tools, data model and architecture at hand to actually achieve data interoperability.

Thanks again to Sol and Federated Search Blog for this opportunity.

# Thinking ‘Inside the Box’ with Description Logics

## Linked Data Need Not Rediscover the Past; A Surprise in Every Box

A standard cliché of management consultants is the exhortation to think “outside the box.” Of course, what is meant by this is to question assumptions, to think differently, to look at problems from new perspectives.

With our recent release of the (linked open data) ‘LOD constellation‘ of linked data classes based around UMBEL, I have been fielding a lot of inquiries on what the relationship is of UMBEL to DBpedia. (See, for example, this current interview by the Semantic Web Company with me and Sören Auer of the DBpedia project.) This also fits into the ongoing distinction we have made in the UMBEL project between our subject concepts (classes) and named entities (instances).

What has actually most been helping my thinking is to get fully inside the box (or, rather, boxes, hehe). Let me explain.

The problem with urging outside-the-box thinking is that many of us do a less-than-stellar job of thinking inside the box. We often fail to realize the options and opportunities that are blatantly visible inside the box that could dramatically improve our chances of success.

Naomi Karten [1]

### The Description Logics Underpinnings of the Semantic Web

Description logics are one of the key underpinnings to the semantic Web. They grew out of earlier frame-based logic systems from Marvin Minsky and also semantic networks; the term and discipline was first given definition in the 1980s by Ron Brachman, among many others [2].

Description logics (DL, most often expressed in the plural) are a logic semantics for knowledge representation (KR) systems based on first-order predicate logic (FOL). They are a kind of logical metalanguage that can help describe and determine (with various logic tests) the consistency, decidability and inferencing power of a given KR language. The semantic Web ontology languages, OWL Lite and OWL DL (which stands for description logics), are based on DL and were themselves outgrowths of earlier DL languages.

Description logics and their semantics traditionally split concepts and their relationships from the different treatment of individuals and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships.

Thus, the model is an abstraction of a concrete world where the concepts are interpreted as subsets of the domain as required by the TBox and where the membership of the individuals to concepts and their relationships with one another in terms of roles respect the assertions in the ABox.

Franz Baader and Werner Nutt [3]

The second split of individuals is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of individuals, the roles between individuals, and other assertions about individuals regarding their class membership with the TBox concepts. Both the TBox and ABox are consistent with set-theoretic principles.

TBox and ABox logic operations differ and their purposes differ. TBox operations are based more on inferencing and tracing or verifying class memberships in the hierarchy (that is, the structural placement or relation of objects in the structure). ABox operations are more rule-based and govern fact checking, instance checking, consistency checking, and the like [3]. ABox reasoning is generally more complex and at a larger scale than that for the TBox.

Early semantic Web systems tended to be very diligent about maintaining these “box” distinctions of purpose, logic and treatment. One might argue, as I do herein, that the usefulness and basis for these splits has been lost somewhat in our first implementations and publishing of linked data systems.

### ABox and TBox Analogs in the Linked Data Web

Most of the semantic Web work at the beginning of this decade was pretty explicit about references to description logics and related inferencing engines and computational efficiency. Some of the early commercial semantic Web vendors are still very much focused on this space.

However, with the first release and emphasis on linked data about two years ago, the emphasis seemed to shift to the more pragmatic questions of actually posting and getting data out there. Best practices for cool URIs and publishing and linkage modes assumed prominence. The linking open data (LOD) movement began in earnest and gained mindshare. Of course, many in the DL and OWL development communities continued to discuss logic and inferencing, but now seemingly more as a separate camp to which the linked data tribe paid little heed.

The central hub of this linked data effort has been DBpedia and its pivotal place within the ‘LOD cloud.’ What is remarkable about the LOD cloud, however, is that it is almost entirely an ABox representation of the world and its instances. Starting from the core set of individual instances within Wikipedia, this cloud has now grown to many other sources and the central place for finding linked instance data. If one looks carefully at the LOD cloud and its linkages we can see the prevalence of instance-level relationships and attributes.

In fact, the LOD cloud diagram to upper right from the Wikipedia article on linked data has become the key visual metaphor for the movement. But, as noted, this view is almost exclusively one at the ABox instance level.

The UMBEL project began at roughly the same time and as a response to the release of DBpedia. My question in looking at the first data linked to DBpedia was, What is this content about? Sure, I might be able to find multiple records discussing Abraham Lincoln as a US president regarding attributes like birth date and a list of children, but where could I retrieve records about other presidents or, more broadly, other types of leaders such as prime ministers, kings or dictators?

The intuition was that the linked data and the various FOAF and other distributed instance records it was combining lacked a coherent reference structure of subject topics or concepts with which to describe content. The further intuition was that — while tagging systems and folksonomies would allow any and all users to describe this content with their own metadata — a framework for relating these various assignments to one another was still lacking.

In the nearly two years of development leading to the first beta release of UMBEL we have tried many analogies and metaphors to describe the basis of the 20,000 subject concept classes within UMBEL in relation to its role and other linked data initiatives. While many of those metaphors help visualize use and role, the more formal basis offered by description logics actually helps to most precisely cast UMBEL’s role. For example, in today’s interview with the Semantic Web Company, I note:

“. . . we have described UMBEL as a roadmap, or middleware, or a backbone, or a concept ontology, or an infocline, or a meta layer for metadata, and others. Today, what I tend to use, particularly in reference to DBpedia, is the TBox-ABox distinction in computer science and description logics. UMBEL is more of a class or structural and concept relationships schema — a TBox — while DBpedia is more of an an instance and entity layer with attributes — an ABox. I think they are pretty complementary. . . “

The resulting class level structure produced by UMBEL and its mappings to other classes within existing linked data enabled us to create and then publish the ‘LOD constellation‘, a complementary TBox structure to the linked data’s existing ABox one. This diagram to the lower right from the Wikipedia article on linked data now shows this complement.

### Completeness and Sufficiency

Description logics have arisen to aid our creating and understanding of knowledge representation systems. From this basis, we can see that the first efforts of the linked data initiative have lacked context, the TBox. At a specific level, the question is not DBpedia v. UMBEL or cloud v. constellation. Both types of structure are required in order to complete the logical framework. By thinking inside the box — by paying attention to our logical underpinnings — we can see that both TBoxes and ABoxes are essential and complementary to creating a useful knowledge representation system.

By more explicitly adopting a description logics framework we can also better address many prior questions of context, coherence and sufficiency. These have been constant themes in my recent writings that I will be revisiting again through the helpful prism of formal description logics.

My interview today with Sören Auer also brought up some important points regarding context. As we have said in other venues, it is important that any TBox be available for context purposes. Whether that should be UMBEL or some other framework depends on the use case. As I noted in the interview, “UMBEL’s specific purpose is to provide a coherent framework for serious knowledge engineers looking to federate data.” Other uses may warrant other frameworks, and certainly not always UMBEL.

But, in any event, I have two cautions to the linked data community: 1) do not take the suggestion to have a reference framework of concepts as being equivalent to adopting a single ontology for the Web; think of any reference structure as an essential missing TBox, and not some call to adopt “one ontology to rule them all,” but 2) in adopting alternative frameworks, take care that whatever is designed or adopted itself be able to meet basic DL logic tests of consistency and coherence.

### A Serendipitous Surprise

No one has yet elaborated the significant advantages from design, performance, architectural and flexibility perspectives from a distinct and explicit separation of TBox from ABox — but they’re there!

The many advantages from separate TBox and ABox frameworks are one serendipitous surprise coming from the early development of linked data. To my knowledge, no one has yet elaborated the significant advantages from design, performance, architectural and flexibility perspectives from a distinct and explicit separation of TBox from ABox. We believe these advantages to be substantial.

Realize, as distributed, UMBEL already has both TBox and ABox components. The TBox component is the lightweight UMBEL ontology, with its 20,000 subject concept classes and their hierarchical and other relationships. This component has a vocabulary (or terminology) for aiding the linking to external ontologies. The vocabulary is quite suitable for extension into new domains as well.

The ABox component is the named entities part of instances drawn from Wikipedia and the BBC’s John Peel sessions. Besides being of common, broad interest, these 1.5 million instances (per the current version) are included in the distribution to instantiate the ontology for demonstration and sandbox purposes.

So, UMBEL’s world is quite simple: subject concepts (SCs) and named entities (NEs). Subject concepts are the TBox and classes that define the structure and concept relationships. Named entities are the individual “things” in the world (some lower case such as animals or foods) and are the ABox of instances that populate this structure.

In our early efforts, we concentrated on the SC portion of UMBEL. Most recently, we have been concentrating on the NE component and its NE dictionaries. It was these investigations that drew us into an ABox perspective when looking at design options. The logic and rationale had been sitting there for some years, but it took cracking open the older textbooks to become reacquainted with it.

Once we again began looking inside the box, we began to see and enumerate some significant advantages to an explicit TBox-ABox design, as well as advantages for keeping these components distinct:

• Easier understood ontologies with a very limited number of predicates
• Lightweight schema design that is easy to extend
• Ability to “triangulate” between separate SC (concept) and NE (instance) disambiguation approaches to improve overall precision and recall
• Attribute information is kept separate from structural and conceptual relationships
• Easy to swap in varied, multiple and private or public named entity dictionaries
• Relatively easy extension of the schema ontology into specific domains
• A design suitable to computation efficiency (rules for ABox; inference and standard reasoning for TBox), and
• Assignment of NEs to distinct and disjoint “super types” [4] that can bring significant tableaux benefits to ABox reasoning.

We are still learning about these advantages and will document them further in pending work on coherence and named entity dictionary (NED) creation.

### Thinking Inside the TBox and ABox

The two main points of this article have been to: 1) recognize the important intellectual legacy of description logics and how they can inform the linked data enterprise moving forward; and 2) be explicit about the functional and architectural splits of the TBox from the ABox. Making this split brings many advantages.

There will continue to be many design challenges as linked data proliferates and actually begins to play its role of aiding meaningful knowledge work. The grounding in description logics and the use of DL for testing alternative designs and approaches is a powerful addition to our toolkit.

Sometimes there are indeed many benefits to thinking inside the box.

[2] F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and P. F. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press, 2003. See Chapter 1. Sample chapters may be viewed from Enrico Franconi’s Description Logics course notes and tutorial at http://www.inf.unibz.it/~franconi/dl/course/, which is an excellent starting reference point on the subject.
[3] Ibid.; see Chapter 2.
[4] These are akin to the lexicographer supersenses that have been applied in WordNet for nouns and verbs (though only nouns are used here). See Massimiliano Ciaramita and Mark Johnson, 2003. Supersense Tagging of Unknown Nouns in WordNet, in Proceedings of the Conf. on Empirical Methods in Natural Language Processing, pp. 168173, 2003. See http://www.aclweb.org/anthology-new/W/W03/W03-1022.pdf.

# Virtuoso Now Supports UMBEL

## Version 5.0.9 includes UMBEL Class Lookups and Named Entity Extraction

I first wrote about OpenLink Software‘s stellar suite of structured Web-related software back in April 2007, with a spotlight on Virtuoso, the company’s flagship ‘universal server’ product. As it has for years, OpenLink continues a steady drumbeat of new releases and extensions. The most recent version upgrade, 5.0.9, was announced today.

In the intervening period I have now personally had the chance to experience Virtuoso first hand, both as the standard hosting platform for Zitgist’s linked data products and services, and as the hosting environment for UMBEL‘s various and growing Web services. I can state quite categorically that our ability to get things done fast with few resources depends critically on the unbelievable high-productivity platform that Virtuoso provides. (And, hehe, given our close relationship to OpenLink, we also get great responsiveness and technical support! Though, truthfully, OpenLink continues to amaze with its outreach and embrace of all of the important initiatives within the semantic Web community.)

I normally let these standard Virtuoso release announcements pass without comment. But today’s release v. 5.0.9 has an especially important feature from my parochial perspective: the first support for UMBEL.

### Virtuoso Reprised

Just to refresh memories, OpenLink’s Virtuoso is a cross-platform universal server for SQL, XML, and RDF data, including data management. It includes a powerful virtual database engine, full-text indexing, native hosting of existing applications, Web Services (WS*) deployment platform, Web application server, and bridges to numerous existing programming languages. Now in version 5.0, Virtuoso is also offered in an open source version. The basic technical architecture of Virtuoso and its robust capabilities is:

[Click on image for full-size pop-up]

From an RDF and linked data perspective, Virtuoso is the most scalable and fastest platform on the market. Critically from Zitgist’s perspective is Virtuoso’s more than 100 built-in RDF-izers (or “Sponger cartridges”) that address all major data formats, serializations, relational data and Web 2.0 APIs. But don’t take my word for it: Check out OpenLink’s impressive list of these cartridges and their various linkages throughout the linked data space.

### UMBEL Support

The key aspect of the new UMBEL support in Virtuoso is its incorporation of UMBEL lookups and its use of Named Entity extraction into the RDF-izer cartridges. This is but the first of growing support anticipated for UMBEL.

### Other New Features

In addition to UMBEL, this version 5.0.9 includes significant performance optimizations to the SQL Engine, SPARQL+RDF Engine, and the ODBC and JDBC drivers.

Other new features include:

• An Excel mime-type output option in the SPARQL endpoint
• Enhanced triple options for bif:contains plus new options for transitivity
• New RDF-izer Cartridges for the Sponger RDF Middleware Layer
• Support for very large HTTP client requests
• A sparql-auth endpoint with digest authentication for using SPARUL via SPARQL Protocol
• New commands for the Ubiquity Firefox plugin.

Finally, per usual, there are also minor bug-fixes:

• Memory leaks
• SQL query syntax handling
• SPARQL ‘select distinct’
• XHTML and Javascript validation and other UI issues in the ODS application suite.

### For More Details

For more details, you can see these Virtuoso release notes: https://sourceforge.net/project/shownotes.php?release_id=626647&group_id=161622

You can also get information on the Virtuoso open source edition or download it.