# Back to the Future with Description Logics

## Have Linked Data, Microformats Stumbled into an Adaptive Design?; Benefits from Keeping the TBox and ABox Separate

I was glad to see Kendall Clark pick up on parts of my earlier piece on Thinking ‘Inside the Box’ with Description Logics. He took one point of view in his posting — that I mostly agree with — but I’d also like to reinforce some other thoughts. And, those thoughts are: description logics (DL) provides earlier lessons and insights that our current zeal for linked data should not overlook, and the lessons we can gain from DL are really fundamental and architectural.

For those of you who have not read Kendall’s piece — which I heartily recommend — let me give you my Cliffs Note’s summary: there are those within the semantic Web community that want to capture the conceptual relationships within knowledge and domains, the Maximum Fidelity tribe, and then those that want to link and describe as many things as possible, the Maximum Scalability tribe, with those (like Kendall’s firm, Clark & Parsia) residing in the middle and following the precepts of DL. The theme is that extremes exist and need to be bridged. [1]

Posing these contrasts is an effective way to describe different ideas and approaches, but, like all straw men, perhaps it hides nuances and complexity. And, as I note below, it may also pose the wrong straw man dichotomy.

### We Are All Tribes of One

Jim Hendler, for one, took exception to Kendall’s characterization to make the obvious point that different use cases demand different approaches. What was interesting, however, in these interchanges was that a nerve was seemingly struck about differences in viewpoints and approaches. Indeed, the very reference to “tribes” seemed to bring out the (ahem) tribal response.

So, just so we are clear, in what I say below I take on the position of a tribe of one; that is, my own opinion. Of course, this is what all of us do. By positing tribes and viewpoints we simplify what is nuanced and subtly convey that opinions are cultural (“tribal”) and not subject to learning and change. Perhaps within the temporal viewpoint of whatever may be today’s trends and “memes” such thinking may hold, but I fundamentally disagree with such a static view of collective understanding and communities over more meaningful periods of years or decades. But, I digress. . . .

At the risk of being simplistic, I think we can say that there was a rich academic and intellectual history behind description logics going back to the early 1990s [2]. Then, with the seminal semantic Web paper built from thinking in the late 90s by Berners-Lee and published by him and Hendler and Lassila in 2001 in Scientific American [3], a real marker was put down for machine-readable and -actionable data (via “agents”) accessible on the Web. Many have been disappointed at the slow pace of the semWeb’s unfolding and some have blamed and rejected AI and “big” ontologies for this slowness. As usable standards finally emerged, a newer set of acolytes pushed “just getting data out there” and RDF linked data began to assume prominence from about 2006 onward, spearheaded by DBpedia and the linked open data community.

In so many ways we are coming full circle — coming back to the future — in seeing how our new linked data techniques can again benefit from this earlier DL thinking. Rather then poles and spectrums, I think we are experiencing the need to revisit our intellectual past now that workable publishing mechanisms and scalability and organization assume real prominence. Though clearly not intentional, the linked data community (and, in a related way, microformats), may just have stumbled upon a very cool architectural design that can leverage DL precepts.

### Some Terminology Revisited

Some of this DL and semWeb terminology can be off-putting. But it is helpful to know the lingo if one wants to look into the technical literature. Though most of this stuff can be described without resorting to such terms and can be readily grasped on an intuitive basis, here are some important grounding terms:

• TBox — according to [2], a TBox “contains intensional knowledge in the form of a terminology (hence the term ‘TBox,’ but ‘taxonomy’ could be used as well) and is built through declarations that describe general properties of concepts. Because of the nature of the subsumption relationships among the concepts that constitute the terminology, TBoxes are usually thought of as having a lattice-like structure; this mathematical structure is entailed by the subsumption relationship — it has nothing to do with any implementation.” A TBox uses a controlled vocabulary to define the concepts and roles of a domain of interest and the relations or properties amongst them
• ABox — according to [2], an ABox “contains extensional knowledge — also called assertional knowledge (hence the term ‘ABox’) — that is specific to the individuals of the domain of discourse.” An ABox provides the concept and role membership assertions for instance data, as well as assertions or “facts” about the attributes of those instances using the same controlled vocabulary as defined in the TBox
• First-order logic (FOL) — is a formal deductive system with unambiguous logic and mathematical structures for declaring, testing and inferring propositions (statements) and predicates (relations). A first-order theory consists of a set of axioms (usually finite or recursive) and the statements deducible from them based on FOL’s base logical axioms (such as the operators found in classical set theory, which itself is built on FOL)
• Description logics (DL) — are any of a family of knowledge representation languages that can be translated and characterized according to first-order logic. A DL language has a syntax that consists of unary predicate symbols to denote concepts, binary relations to denote roles, and recursion. DL semantics define concepts as sets of individuals and roles as sets of pairs of individuals. The expressivity of a DL language is a function of the logical operators the language supports (shown with representations such as $\mathcal{SROIQ}^\mathcal{(D)}$, the expressiveness of OWL 2). DL languages can be translated into other DL languages that support the same expressivity, regardless of syntax, but more expressive languages can not be equivalently represented by less expressive ones. The current OWL dialects of OWL Lite and OWL DL are DL languages
• Axiom — in traditional logic or FOL, an axiom (also called a ‘postulate’) is a proposition that is not proved or demonstrated but considered to be either self-evident or consistent with the base logic of the system. As such, its truth is taken for granted and the axiom serves as a starting point for deducing and inferring other (theory-dependent) truths
• Intensional — is a form of set membership that is based on the propositions and concepts which defines the set; there may be many possible members that remain unenumerated so long as they meet the conditions for membership. The intensional principal judges objects to be a member based on the properties or conditions they must have
• Extensional — is a form of set membership that arises from (“extends”) its listed set members. The extensional principle judges objects to be a member if they have the same external characteristics (whether as explicitly defined properties or not)
• Ontology — as used in knowledge representation or information science, this term is most often defined using Tom Gruber’s “explicit specification of a conceptualization” [4]. In practice on the semantic Web, it is any defined schema or data record structure including the most lightweight controlled vocabularies and structures (such as microformats). In DL, both ABox and TBox specifications and statements are lumped under the term
• Vocabulary — also ‘controlled vocabulary,’ is an organized, variously structured set of terms used for information retrieval or characterization. In its simplest form, a controlled vocabulary is merely a list for checking possible matches for set membership or not; at its more complex, it is the set of terms contained within a detailed and specified ontology or schema with formalized (axiomatized) relationships
• Knowledge base — in the DL community, a knowledge base is simply defined as TBox + Abox. In other words, a knowledge base is a logical schema of roles and concepts and the relationships between them (the TBox) as populated by the actual data (instances) asserting memberships and attributes (“facts”) (the ABox).

### TBox v ABox: Different Purposes and Roles

Within description logics and for our purposes herein, the two concepts we will most focus upon are the ABox and the TBox. As the definitions above suggest, the TBox is more structural and reflects the logical and conceptual relationships within a domain; that is, the role and concept and class relationships. The ABox provides the data (instance) records and characterizations within that schema; that is the instances and facts assertions. By analogy, in a conventional relational database system, the database or logical schema would correspond to the TBox; the actual data records or tables would correspond to the ABox.

These distinctions suggest very different purposes and roles, then, for the TBox and the ABox:

 TBox ABox Definitions of the concepts and properties (relationships) of the controlled vocabulary Declarations of concept axioms or roles Inferencing of relationships, be they transitive, symmetric, functional or inverse to another property Equivalence testing as to whether two classes or properties are equivalent to one another Subsumption, which is checking whether one concept is more general than another Satisfiability, which is the problem of checking whether a concept has been defined (is not an empty concept) Classification, which places a new concept in the proper place in a taxonomic hierarchy of concepts Logical implication, which is whether a generic relationship is a logical consequence of the declarations in the TBox Membership assertions, either as concepts or as roles Attributes assertions Consistency checking of instances Entailments, which are whether other propositions are implied by the stated condition Satisfiability checks, which are that the conditions of instance membership are met Infer property assertions implicit through the transitive property Instance checking, which verifies whether a given individual is an instance of (belongs to) a specified concept Knowledge base consistency, which is to verify whether all concepts admit at least one individual Realization, which is to find the most specific concept for an individual object Retrieval, which is to find the individuals that are instances of a given concept

While certainly many of the ABox tests and checks require TBox structure, there is a pretty clear separation of purpose and role. Moreover: 1) the scale of the information in each “box” is vastly different (perhaps a few to hundreds to at most thousands of concepts in the TBox in contrast to potentially millions or more instances in ABoxes); and 2) ABox dataset repositories may also be (indeed, often are!) numerous, spatially distributed and semantically heterogeneous.

### The Wisdom of Separating Concerns

DL and semantic Web stuff in general are data and logic models, not architectural guidance. So, rarely does one see discussion of the architectural imperatives that some of these logical underpinnings provide. We see knowledge bases and ontologies both used as umbrella terms encompassing both the ABox and the TBox.

However, our own deployment experiences and the literature suggest there are manifest advantages to keeping the TBox and ABox separate:

 Advantages of Keeping the TBox and ABox Separate Better performance by keeping inference and reasoning purposes separate Better scalability through separation of function Use of tailored reasoners and rules engines based on purpose [5] More modular design, including keeping attribute information separate from structural and conceptual relationships Faster, global instance checking using summary ABox tests [6] Assignment of named entities (instances) to distinct and disjoint super types [7] that can bring significant tableaux benefits to ABox reasoning Easier partitioning of ABoxes [8] Easy swapping in and mix-and-matching of varied, multiple and private or public named entity dictionaries (ABoxes) Integration with extant relational (RDBMs) data structures and data stores for instance (ABox) data [9] Integration with other lightweight structures (microformats, other) for instance (ABox) data Faster retrievals via TBox routing to appropriate ABoxes Simpler ABox vocabularies that are easier to understand and extend (including continued reliance on RDF and RDFS) More capable TBox ontologies, including integration of rules systems Relatively easy extension of the TBox schema ontology into specific domains Easy ABox data entry and updating via wiki or sematic wiki Ability to triangulate between separate concept (TBox) and instance (ABox) disambiguation approaches to improve overall precision and recall.

It would be useful to refrain from lumping the very different purposes of ABoxes and TBoxes under the umbrella rubric of ‘ontology’. It would also be useful for designers and vocabulary authors to be more explicit in their own minds as to purpose and content when formulating new ontologies. Smushing all of these concepts into one bubbling mess may not lead to clarity nor good performance.

### A Simple Schematic of Best Practice

Taking these basic ideas we can visualize a general schematic for best practice splits within the ontology or knowledge base:

The TBox is clearly focused on the domain at hand, but also includes links and equivalents to external ontologies. The TBox level should be entirely free of instance data, though all attributes, properties and concepts that might be found at the ABox level are also defined with their relationships at the TBox level. Like any semWeb ontology, this TBox level should also re-use common Web ontologies such as FOAF, SIOC, UMBEL, etc.

It is also the case that because of the reasoning needs at the TBox layer, the semantic Web language used should likely be a dialect of OWL (see below).

(BTW, for my own practice, I will try to limit my use of the ‘ontology’ term to the concepts and classes at this TBox level.)

The ABox level, in contrast, may consist of multiple datasets and name spaces. These structures are most appropriately seen as lightweight controlled vocabularies with limited structure; if written by scratch perhaps limited to RDF or RDFS (the schema variant). This layer, however, can also remain in non-semWeb native form — such as RDBMS data tables, microformats or other formats — that are wrapperized for interoperability through one or more ‘RDFizers‘ or GRDDL.

These structures should likely not make many external assertions, if any, and if done, perhaps in separate mapping or linkage file that can be processed and analyzed independently. It is important, however, to make sure that all attributes at this ABox layer have a counterpart with relationships and structure defined at the TBox layer.

This architectural design enables complete independence of the instance datasets from the inferencing logic or federation that might be applied to them.

### The Relevance to Linked Data Instances

Since it first took off in 2006, linked data and the various datasets now shown in the ‘LOD cloud‘ have been dominated by instance data. There are perhaps 10 million to 20 million instance objects available as linked data, many of which are derived from Wikipedia (via DBpedia) with attributes or structure coming from the Wikipedia infoboxes.

“There need not be a trade-off between expressiveness and scalability. Proper design, language choice and architecture can readily achieve both — while maintaining independence of scope or purpose.”

A similarly fast explosion has taken place with structured records via microformats and other simple data structures. For example, some earlier estimates suggest there are perhaps more than 2 billion pages that include microformats [10].

I have at times recently made comments about the dominance of instance data within the linked data community and the need for organizing structure. While this observation, I believe, remains true and provides a rationale for UMBEL as an organizing subject structure (or any other organizing structure, for that matter), perhaps I have been missing a more fundamental point: linked data (at least as practiced to date) is really about exposing ABoxes with simple structure. Perhaps, by serendipity, linked data (and other light structures like microformats) are showing the way to a distributed, mixed ABox-TBox structure for the Web.

With this altered viewpoint, a number of new observations emerge:

• Linked data instance structures perhaps need to be consciously designed as such, with lightweight structure and limited external (OWL-based or TBox-oriented) class structure
• The recent interest shown in so-called VoCamps (for lightweight vocabulary development) and voiD (vocabulary of interlinked datasets) might be usefully viewed specifically through an ABox “lens” with perhaps best practices for structures and vocabulary syntaxes to emerge
• TBox class and relationship structures should remain apart from the instance data and can be used to operate independently of the datasets
• Architectural and design changes might also emerge through clear separation of ABox and TBox that can benefit semantic Web scalability.

Linked data and microformats and other lightweight structures are now giving us the exposed instance data to begin reasoning and showing differences due to inferencing and other logic advantages for the semantic Web. Now that the ABox is being proven, let’s move on and stress-test the TBox!

### OWL 2 and Query Rewriting

Since the first version of OWL there has been confusion and some limitations with the dialects of the language. Only OWL Full allowed classes to be treated both as instances and classes (so-called metamodeling), and was therefore used as the basis for mapping UMBEL, for example, to RDF and RDFS vocabularies and to Cyc. This design was necessary, but left UMBEL undecidable using standard DL reasoners; only the two dialects of OWL DL and OWL Lite met description logics requirements.

Indeed, it was even hard to determine what dialect an OWL file represented, among many other problems and issues. The technical committee behind OWL 2, in fact, has written an excellent critique of issues with this first version of OWL [11].

For nearly two years the next version of OWL, OWL 2, has been undergoing development, with the last draft now published and available for last comment before January 23 [12]. Lessons and refinements to the use of DL have also occurred. Some have criticized this effort and have criticized the need for OWL 2′s growing expressiveness and vocabulary [see 1, for example]. I believe these criticisms to be unfair and to miss many of the thoughtful improvements in this new version.

Version control and expressiveness are two of these benefits. A broader benefit, though, has been the keen attention the developers have given to compliance with description logics and the ability to formulate fragments (called “profiles”) that only present subsets of DL useful for computational considerations [13]. For example, one profile, OWL 2 QL, appears well suited to the ABox; another, EL, appears well suited to the TBox. Users and tools builders may define other subsets of OWL 2 to deal with different use cases.

What is emerging are possible design patterns that would have comprehensive TBox guidance and inference structures that first receive a query, then do query rewriting for less capable OWL dialects and mapping to distributed ABox datasets, some of which might be kept in native relational DB or other structural forms [9, 14, 15]. Other approaches and designs, such as overviewed for DLDB2, KAON2, OWLIM, BigOWLIM and Minerva, are testing other architectural and DL combinations [see 15]. And, at the level of the specific triplestore, other optimizations are being made such as owl:sameAs or query rewrite with Virtuoso [16]. This new version of OWL and its profiles have adapted to past lessons and can be matched well to the emerging hardware and architectural designs.

These changes appear to now provide the option for various dialects of OWL to be matched with reasoners and architectural designs in order to optimize for different purposes. Rather than a spectrum, we appear to be learning and maturing. Hopefully, getting back to the architectural implications of the TBox – ABox split can show us there need not be a trade-off between expressiveness and scalability. Proper design, language and dialect choice, and architecture can readily achieve both — while maintaining independence of scope or purpose.

Thanks, OWL 2! You have fulfilled your commitment to description logics. It is now our turn to figure out the best practices for working with these tools.

[1] For those with a spare 90 minutes or so, you may also want to view this panel session and debate that took place on “An OWL 2 Far?” at ISWC ’08 in Karlsruhe, Germany, on October 28, 2008. The panel was chaired by Peter F. Patel-Schneider (Bell Labs, Alcathor) with the panel members of Stefan Decker (DERI Galway), Michel Dumontier (Carleton University), Tim Finin (University of Maryland) and Ian Horrocks (University of Oxford), with much audience participation. See http://videolectures.net/iswc08_panel_schneider_owl/.
[2] F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and P. F. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press, 2003. See Chapter 1. Sample chapters may be viewed from Enrico Franconis Description Logics course notes and tutorial at http://www.inf.unibz.it/~franconi/dl/course/, which is an excellent starting reference point on the subject.
[3] Tim Berners-Lee, James Hendler and Ora Lassila, 2001. “The Semantic Web,” in Scientific American, 284(5):34-43, May 2001; see http://www.sciam.com/article.cfm?id=the-semantic-web.
[4] Thomas R. Gruber, 1993. “A Translation Approach to Portable Ontology Specifications,” in Knowledge Acquisition 5(2): 199-220; see http://tomgruber.org/writing/ontolingua-kaj-1993.pdf.
[5] Georgios Meditskos and Nick Bassiliades, 2008. “Combining a DL Reasoner and a Rule Engine for Improving Entailment-based OWL Reasoning,” presented at the 7th International Semantic Web Conference (ISWC2008); see http://lpis.csd.auth.gr/publications/med-iswc08.pdf.
[6] Achille Fokoue, Aaron Kershenbaum, Li Ma, Edith Schonberg, and Kavitha Srinivas, 2006. “The Summary Abox: Cutting Ontologies Down to Size,” presented at the 5th International Semantic Web Conference, Athens, GA, USA, November 5-9, 2006; see http://iswc2006.semanticweb.org/items/Kershenbaum2006qo.pdf.
[7] These are akin to the lexicographer supersenses that have been applied in WordNet for nouns and verbs (though only nouns are used here). See Massimiliano Ciaramita and Mark Johnson, 2003. Supersense Tagging of Unknown Nouns in WordNet, in Proceedings of the Conf. on Empirical Methods in Natural Language Processing, pp. 168173, 2003. See http://www.aclweb.org/anthology-new/W/W03/W03-1022.pdf.
[8] Yuanbo Guo and Jeff Heflin, 2006. “A Scalable Approach for Partitioning OWL Knowledge Bases,” in Proceedings of the 2nd International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS2006), Athens, Georgia, USA, November, 2006; see http://swat.cse.lehigh.edu/pubs/guo06c.pdf.
[9] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini and Riccardo Rosati, 2007. “Tractable Reasoning and Efficient Query Answering in Description Logics: The DL-Lite Family,” in Journal of Automated Reasoning, 39 (3), October 2007; see http://www.dis.uniroma1.it/~degiacom/papers/2007/calv-etal-JAR-2007.pdf.
[10] The original citation could not be found, but it is referenced on the Microformats mailing list on Oct. 1, 2008; see http://microformats.org/discuss/mail/microformats-discuss/2008-October/012550.html.
[11] Bernardo Cuenca Grau, Ian Horrocks, Boris Motik, Bijan Parsia, Peter Patel-Schneider and Ulrike Sattler, 2008. “OWL2: The Next Step for OWL,” in Journal of Web Semantics, 6(4): 309-322, November 2008; see http://www.comlab.ox.ac.uk/people/ian.horrocks/Publications/download/2008/CHMP+08.pdf.
[12] Boris Motik, Peter F. Patel-Schneider, Bijan Parsia, eds., 2008. OWL 2 Web Ontology Language: Structural Specification and Functional-Style Syntax, W3C Working Draft, December 2, 2008. See http://www.w3.org/TR/owl2-syntax/.
[13] Boris Motik, Bernardo Cuenca Grau, Ian Horrocks, Zhe Wu, Achille Fokoue and Carsten Lutz, eds., 2008. OWL 2 Web Ontology Language: Profiles, W3C Working Draft, December 2, 2008. See http://www.w3.org/TR/owl2-profiles/.
[14] Luciano Serafini and Andrei Tamilin, 2007. “Instance Migration in Heterogeneous Ontology Environments,” in Proceedings of 6th International Semantic Web Conference / 2nd Asian Semantic Web Conference (ISWC/ASWC 2007), pages 452-465, 2007. See http://iswc2007.semanticweb.org/papers/449.pdf.
[15] Zhengxiang Pan, Xingjian Zhang and Jeff Heflin, 2008. “DLDB2: A Scalable Multi-Perspective Semantic Web Repository,” in WI 08: Proceedings of the International Conference on Web Intelligence, IEEE Computer Society Press, pp. 489-495; see http://swat.cse.lehigh.edu/pubs/pan08a.pdf.
[16] Orri Erling and Ivan Mikhailov, 2007. “RDF Support in the Virtuoso DBMS,” in Proceedings of the 1st Conference on Social Semantic Web, Leipzig, Germany, Sep 26-28, 2007; see http://aksw.org/cssw07/paper/5_erling.pdf.

# Thinking ‘Inside the Box’ with Description Logics

## Linked Data Need Not Rediscover the Past; A Surprise in Every Box

A standard cliché of management consultants is the exhortation to think “outside the box.” Of course, what is meant by this is to question assumptions, to think differently, to look at problems from new perspectives.

With our recent release of the (linked open data) ‘LOD constellation‘ of linked data classes based around UMBEL, I have been fielding a lot of inquiries on what the relationship is of UMBEL to DBpedia. (See, for example, this current interview by the Semantic Web Company with me and Sören Auer of the DBpedia project.) This also fits into the ongoing distinction we have made in the UMBEL project between our subject concepts (classes) and named entities (instances).

What has actually most been helping my thinking is to get fully inside the box (or, rather, boxes, hehe). Let me explain.

The problem with urging outside-the-box thinking is that many of us do a less-than-stellar job of thinking inside the box. We often fail to realize the options and opportunities that are blatantly visible inside the box that could dramatically improve our chances of success.

Naomi Karten [1]

### The Description Logics Underpinnings of the Semantic Web

Description logics are one of the key underpinnings to the semantic Web. They grew out of earlier frame-based logic systems from Marvin Minsky and also semantic networks; the term and discipline was first given definition in the 1980s by Ron Brachman, among many others [2].

Description logics (DL, most often expressed in the plural) are a logic semantics for knowledge representation (KR) systems based on first-order predicate logic (FOL). They are a kind of logical metalanguage that can help describe and determine (with various logic tests) the consistency, decidability and inferencing power of a given KR language. The semantic Web ontology languages, OWL Lite and OWL DL (which stands for description logics), are based on DL and were themselves outgrowths of earlier DL languages.

Description logics and their semantics traditionally split concepts and their relationships from the different treatment of individuals and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships.

Thus, the model is an abstraction of a concrete world where the concepts are interpreted as subsets of the domain as required by the TBox and where the membership of the individuals to concepts and their relationships with one another in terms of roles respect the assertions in the ABox.

Franz Baader and Werner Nutt [3]

The second split of individuals is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of individuals, the roles between individuals, and other assertions about individuals regarding their class membership with the TBox concepts. Both the TBox and ABox are consistent with set-theoretic principles.

TBox and ABox logic operations differ and their purposes differ. TBox operations are based more on inferencing and tracing or verifying class memberships in the hierarchy (that is, the structural placement or relation of objects in the structure). ABox operations are more rule-based and govern fact checking, instance checking, consistency checking, and the like [3]. ABox reasoning is generally more complex and at a larger scale than that for the TBox.

Early semantic Web systems tended to be very diligent about maintaining these “box” distinctions of purpose, logic and treatment. One might argue, as I do herein, that the usefulness and basis for these splits has been lost somewhat in our first implementations and publishing of linked data systems.

### ABox and TBox Analogs in the Linked Data Web

Most of the semantic Web work at the beginning of this decade was pretty explicit about references to description logics and related inferencing engines and computational efficiency. Some of the early commercial semantic Web vendors are still very much focused on this space.

However, with the first release and emphasis on linked data about two years ago, the emphasis seemed to shift to the more pragmatic questions of actually posting and getting data out there. Best practices for cool URIs and publishing and linkage modes assumed prominence. The linking open data (LOD) movement began in earnest and gained mindshare. Of course, many in the DL and OWL development communities continued to discuss logic and inferencing, but now seemingly more as a separate camp to which the linked data tribe paid little heed.

The central hub of this linked data effort has been DBpedia and its pivotal place within the ‘LOD cloud.’ What is remarkable about the LOD cloud, however, is that it is almost entirely an ABox representation of the world and its instances. Starting from the core set of individual instances within Wikipedia, this cloud has now grown to many other sources and the central place for finding linked instance data. If one looks carefully at the LOD cloud and its linkages we can see the prevalence of instance-level relationships and attributes.

Linking Open Data’s “ABox”

Linking Open Data’s “TBox”

In fact, the LOD cloud diagram to upper right from the Wikipedia article on linked data has become the key visual metaphor for the movement. But, as noted, this view is almost exclusively one at the ABox instance level.

The UMBEL project began at roughly the same time and as a response to the release of DBpedia. My question in looking at the first data linked to DBpedia was, What is this content about? Sure, I might be able to find multiple records discussing Abraham Lincoln as a US president regarding attributes like birth date and a list of children, but where could I retrieve records about other presidents or, more broadly, other types of leaders such as prime ministers, kings or dictators?

The intuition was that the linked data and the various FOAF and other distributed instance records it was combining lacked a coherent reference structure of subject topics or concepts with which to describe content. The further intuition was that — while tagging systems and folksonomies would allow any and all users to describe this content with their own metadata — a framework for relating these various assignments to one another was still lacking.

In the nearly two years of development leading to the first beta release of UMBEL we have tried many analogies and metaphors to describe the basis of the 20,000 subject concept classes within UMBEL in relation to its role and other linked data initiatives. While many of those metaphors help visualize use and role, the more formal basis offered by description logics actually helps to most precisely cast UMBEL’s role. For example, in today’s interview with the Semantic Web Company, I note:

“. . . we have described UMBEL as a roadmap, or middleware, or a backbone, or a concept ontology, or an infocline, or a meta layer for metadata, and others. Today, what I tend to use, particularly in reference to DBpedia, is the TBox-ABox distinction in computer science and description logics. UMBEL is more of a class or structural and concept relationships schema — a TBox — while DBpedia is more of an an instance and entity layer with attributes — an ABox. I think they are pretty complementary. . . “

The resulting class level structure produced by UMBEL and its mappings to other classes within existing linked data enabled us to create and then publish the ‘LOD constellation‘, a complementary TBox structure to the linked data’s existing ABox one. This diagram to the lower right from the Wikipedia article on linked data now shows this complement.

### Completeness and Sufficiency

Description logics have arisen to aid our creating and understanding of knowledge representation systems. From this basis, we can see that the first efforts of the linked data initiative have lacked context, the TBox. At a specific level, the question is not DBpedia v. UMBEL or cloud v. constellation. Both types of structure are required in order to complete the logical framework. By thinking inside the box — by paying attention to our logical underpinnings — we can see that both TBoxes and ABoxes are essential and complementary to creating a useful knowledge representation system.

By more explicitly adopting a description logics framework we can also better address many prior questions of context, coherence and sufficiency. These have been constant themes in my recent writings that I will be revisiting again through the helpful prism of formal description logics.

My interview today with Sören Auer also brought up some important points regarding context. As we have said in other venues, it is important that any TBox be available for context purposes. Whether that should be UMBEL or some other framework depends on the use case. As I noted in the interview, “UMBEL’s specific purpose is to provide a coherent framework for serious knowledge engineers looking to federate data.” Other uses may warrant other frameworks, and certainly not always UMBEL.

But, in any event, I have two cautions to the linked data community: 1) do not take the suggestion to have a reference framework of concepts as being equivalent to adopting a single ontology for the Web; think of any reference structure as an essential missing TBox, and not some call to adopt “one ontology to rule them all,” but 2) in adopting alternative frameworks, take care that whatever is designed or adopted itself be able to meet basic DL logic tests of consistency and coherence.

### A Serendipitous Surprise

No one has yet elaborated the significant advantages from design, performance, architectural and flexibility perspectives from a distinct and explicit separation of TBox from ABox — but they’re there!

The many advantages from separate TBox and ABox frameworks are one serendipitous surprise coming from the early development of linked data. To my knowledge, no one has yet elaborated the significant advantages from design, performance, architectural and flexibility perspectives from a distinct and explicit separation of TBox from ABox. We believe these advantages to be substantial.

Realize, as distributed, UMBEL already has both TBox and ABox components. The TBox component is the lightweight UMBEL ontology, with its 20,000 subject concept classes and their hierarchical and other relationships. This component has a vocabulary (or terminology) for aiding the linking to external ontologies. The vocabulary is quite suitable for extension into new domains as well.

The ABox component is the named entities part of instances drawn from Wikipedia and the BBC’s John Peel sessions. Besides being of common, broad interest, these 1.5 million instances (per the current version) are included in the distribution to instantiate the ontology for demonstration and sandbox purposes.

So, UMBEL’s world is quite simple: subject concepts (SCs) and named entities (NEs). Subject concepts are the TBox and classes that define the structure and concept relationships. Named entities are the individual “things” in the world (some lower case such as animals or foods) and are the ABox of instances that populate this structure.

In our early efforts, we concentrated on the SC portion of UMBEL. Most recently, we have been concentrating on the NE component and its NE dictionaries. It was these investigations that drew us into an ABox perspective when looking at design options. The logic and rationale had been sitting there for some years, but it took cracking open the older textbooks to become reacquainted with it.

Once we again began looking inside the box, we began to see and enumerate some significant advantages to an explicit TBox-ABox design, as well as advantages for keeping these components distinct:

• Easier understood ontologies with a very limited number of predicates
• Lightweight schema design that is easy to extend
• Ability to “triangulate” between separate SC (concept) and NE (instance) disambiguation approaches to improve overall precision and recall
• Attribute information is kept separate from structural and conceptual relationships
• Easy to swap in varied, multiple and private or public named entity dictionaries
• Relatively easy extension of the schema ontology into specific domains
• A design suitable to computation efficiency (rules for ABox; inference and standard reasoning for TBox), and
• Assignment of NEs to distinct and disjoint “super types” [4] that can bring significant tableaux benefits to ABox reasoning.

We are still learning about these advantages and will document them further in pending work on coherence and named entity dictionary (NED) creation.

### Thinking Inside the TBox and ABox

The two main points of this article have been to: 1) recognize the important intellectual legacy of description logics and how they can inform the linked data enterprise moving forward; and 2) be explicit about the functional and architectural splits of the TBox from the ABox. Making this split brings many advantages.

There will continue to be many design challenges as linked data proliferates and actually begins to play its role of aiding meaningful knowledge work. The grounding in description logics and the use of DL for testing alternative designs and approaches is a powerful addition to our toolkit.

Sometimes there are indeed many benefits to thinking inside the box.

[2] F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and P. F. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press, 2003. See Chapter 1. Sample chapters may be viewed from Enrico Franconi’s Description Logics course notes and tutorial at http://www.inf.unibz.it/~franconi/dl/course/, which is an excellent starting reference point on the subject.
[3] Ibid.; see Chapter 2.
[4] These are akin to the lexicographer supersenses that have been applied in WordNet for nouns and verbs (though only nouns are used here). See Massimiliano Ciaramita and Mark Johnson, 2003. Supersense Tagging of Unknown Nouns in WordNet, in Proceedings of the Conf. on Empirical Methods in Natural Language Processing, pp. 168173, 2003. See http://www.aclweb.org/anthology-new/W/W03/W03-1022.pdf.

# An Intrepid Guide to Ontologies

## There’s an Endless Variety of World Views, and Almost as Many Ways to Organize and Describe Them

Ontology is one of the more daunting terms for those exposed for the first time to the semantic Web. Not only is the word long and without many common antecedents, but it is also a term that has widely divergent use and understanding within the community. It can be argued that this not-so-little word is one of the barriers to mainstream understanding of the semantic Web.

The root of the term is the Greek ontos, or being or the nature of things. Literally — and in classical philosophy — ontology was used in relation to the study of the nature of being or the world, the nature of existence. Tom Gruber, among others, made the term popular in relation to computer science and artificial intelligence about 15 years ago when he defined ontology as a “formal specification of a conceptualization.”

While there have been attempts to strap on more or less formal understandings or machinery around ontology, it still has very much the sense of a world view, a means of viewing and organizing and conceptualizing and defining a domain of interest. As is made clear below, I personally prefer a loose and embracing understanding of the term (consistent with Deborah McGuinness’s 2003 paper, Ontologies Come of Age [1]).

There has been a resurgence of interest in ontologies of late. Two reasons have been the emergence of Web 2.0 and tagging and folksonomies, as well as the nascent emergence of the structured Web. In fact, on April 23-24 one of the noted communities of practice around ontologies, Ontolog, sponsored the Ontology Summit 2007 ,“Ontology, Taxonomy, Folksonomy: Understanding the Distinctions.”

These events have sparked my preparing this guide to ontologies. I have to admit this is a somewhat intrepid endeavor given the wealth of material and diversity of opinions.

### Overview and Role of Ontologies

Of course, a fancy name is not sufficient alone to warrant an interest in ontologies. There are reasons why understanding, using and manipulating ontologies can bring practical benefit:

• Depending on their degree of formalism (an important dimension), ontologies help make explicit the scope, definition, and language and meaning (semantics) of a given domain or world view
• Ontologies may provide the power to generalize about their domains
• Ontologies, if hierarchically structured in part (and not all are), can provide the power of inheritance
• Ontologies provide guidance for how to correctly “place” information in relation to other information in that domain
• Ontologies may provide the basis to reason or infer over its domain (again as a function of its formalism)
• Ontologies can provide a more effective basis for information extraction or content clustering
• Ontologies, again depending on their formalism, may be a source of structure and controlled vocabularies helpful for disambiguating context; they can inform and provide structure to the “lexicons” in particular domains
• Ontologies can provide guiding structure for browsing or discovery within a domain, and
• Ontologies can help relate and “place” other ontologies or world views in relation to one another; in other words, ontologies can organize ontologies from the most specific to the most abstract.

Both structure and formalism are dimensions for classifying ontologies, which combined are often referred to as an ontology’s “expressiveness.” How one describes this structure and formality differs. One recent attempt is this figure from the Ontology Summit 2007‘s wrap-up communique:

Note the bridging role that an ontology plays between a domain and its content. (By its nature, every ontology attempts to “define” and bound a domain.) Also note that the Summit’s 50 or so participants were focused on the trade-off between semantics v. pragmatic considerations. This was a result of the ongoing attempts within the community to understand, embrace and (possibly) legitimize “less formal” Web 2.0 efforts such as tagging and the folksonomies that can result from them.

There is an M.C. Escher-like recursion of the lizard eating its tail when one observes ontologists creating an ontology to describe the ontological domain. The above diagram, which itself would be different with a slight change in Summit participation or editorship, is, of course, but one representative view of the world. Indeed, a tremendous variety of scientific and research disciplines concern themselves with classifying and organizing the “nature of things.” Those disciplines go by such names as logicians, taxonomists, philosophers, information architects, computer scientists, librarians, operations researchers, systematicists, statisticians, historians, and so forth. (In short, given our ontos, every area of human endeavor has the urge to classify, to organize.) In each of these areas not only do their domains differ, but so do the adopted structures and classification schemes often used.

There are at least 40 terms or concepts across these various disciplines, most related to Web and general knowledge content, that have organizational or classificatory aspects that — loosely defined — could be called an “ontology” framework or approach:

Actual domains or subject coverage are then mostly orthogonal to these approaches.

Loosely defined, the number of possible ontologies is therefore close to infinite: domain X perspective X schema. (Just kidding — sort of! In fact, UMBC’s Swoogle ontology search service claims 10,000 ontologies presently on the Web; the actual data from August 2006 ranges from about 16,000 to 92,000 ontologies, depending on how “formal” the definition. These counts are also limited to OWL-based ontologies.)

Many have misunderstood the semantic Web because of this diversity and the slipperiness of the concept of an ontology. This misunderstanding becomes flat wrong when people claim the semantic Web implies one single grand ontology or organizational schema, One Ring to Rule Them All. Human and domain diversities makes this viewpoint patently false.

### Diversity, ‘Naturalness’ and Change

The choice of an ontological approach to organize Web and structured content can be contentious. Publishers and authors perhaps have too many choices: from straight Atom or RSS feeds and feeds with tags to informal folksonomies and then Outline Processor Markup Language or microformats. From there, the formalism increases further to include the standard RDF ontologies such as SIOC (Semantically-Interlinked Online Communities), SKOS (Simple Knowledge Organizing System), DOAP (Description of a Project), and FOAF (Friend of a Friend) and the still greater formalism of OWL’s various dialects.

Arguing which of these is the theoretical best method is doomed to failure, except possibly in a bounded enterprise environment. We live in the real world, where multiple options will always have their advocates and their applications. All of us should welcome whatever structure we can add to our information base, no matter where it comes from or how it’s done. The sooner we can embrace content in any of these formats and convert it to a canonical form, we can then move on to needed developments in semantic mediation, the threshold condition for the semantic Web.

 There are at least 40 concepts — loosely defined — that could be called an “ontology” framework or approach.

So, diversity is inevitable and should be accepted. But that observation need not also embrace chaos.

In my early training in biological systematics, Ernst Haeckel’s recapitulation theory that “ontogeny recapitulates phylogeny” (note the same ontos root, the difference from ontology being growth v. study) was losing favor fast. The theory was that the development of an organism through its embryological phases mirrors its evolutionary history. Today, modern biologists recognize numerous connections between ontogeny and phylogeny, explain them using evolutionary theory, or view them as supporting evidence for that theory.

Yet, like the construction of phylogenetic trees, systematicists strive for their classifications of the relatedness of organisms to be “natural”, to reflect the true nature of the relationship. Thus, over time, that understanding of a “natural” system has progressed from appearance → embryology → embryology + detailed morphology → species and interbreeding → DNA. While details continue to be worked out, the degree of genetic relatedness is now widely accepted by biologists as a “natural” basis for organizing the Tree of Life.

It is not unrealistic to also seek “naturalness” in the organization of other knowledge domains, to seek “naturalness” in the organization of their underlying ontologies. Like natural systems in biology, this naturalness should emerge from the shared understandings and perceptions of the domain’s participants. While subject matter expertise and general and domain knowledge are essential to this development, they are not the only factors. As tagging systems on the Web are showing, common usage and broad acceptance by the community at hand is important as well.

While it may appear that a domain such as the biological relatedness of organisms is more empirical than the concepts and ambiguous words in most domains of human endeavor, these attempts at naturalness are still not foolish. The phylogeny example shows that understanding changes over time as knowledge is gained. We now accept DNA over the recapitulation theory.

As the formal SKOS organizational schema for knowledge systems recognizes (see below), the ideas of narrower and broader concepts can be readily embraced, as well as concepts of relatedness and aliases (synonyms). These simple constructs, I would argue, plus the application of knowledge being gained in related domains, will enable tomorrow’s understandings to be more “natural” than today’s, no matter the particular domain at hand.

So, in seeking a “naturalness” within our organizational schema, we can also see that change is a constant. We also see that the tools and ideas underlying the seemingly abstract cause of merging and relating existing ontologies to one another will further a greater “naturalness” within our organizations of the world.

### A Spectrum of Formalisms

According to the Summit, expressiveness is the extent and ease by which an ontology can describe domain semantics. Structure they define as the degree of organization or hierarchical extent of the ontology. They further define granularity as the level of detail in the ontology. And, as the diagram above alludes, they define other dimensions of use, logical basis, purpose and so forth of an ontology.

The over fifty respondents from 42 communities submitted some 70 different ontologies under about 40 terms to a survey that was used by the Summit to construct their diagram. These submissions included:

. . . formal ontologies (e.g., BFO, DOLCE, SUMO), biomedical ontologies (e.g., Gene Ontology, SNOMED CT, UMLS, ICD), thesauri (e.g., MeSH, National Agricultural Library Thesaurus), folksonomies (e.g., Social bookmarking tags), general ontologies (WordNet, OpenCyc) and specific ontologies (e.g., Process Specification Language). The list also includes markup languages (e.g., NeuroML), representation formalisms (e.g., Entity-Relation model, OWL, WSDL-S) and various ISO standards (e.g., ISO 11179). This [Ontolog] sample clearly illustrates the diversity of artifacts collected under “ontology”.

I think the simplest spectrum for such distinctions is the formalism of the ontology and its approach (and language or syntax, not further discussed here). More formal ontologies have greater expressiveness and structure and inferential power, less formal ones the opposite. Constructing more formal ontologies is more demanding, and takes more effort and rigor, resulting in an approach that is more powerful but also more rigid and less flexible. Like anything else, there are always trade-offs.

Based on work by Leo Obrst of Mitre as interpreted by Dan McCreary, we can view this as a trade-off as one of semantic clarity v. the time and money required to construct the formalism [12, 13]:

[Click on image for full-size pop-up]

Note this diagram reflects the more conventional, practitioner’s view of the “formal” ontology, which does not include taxonomies or controlled vocabularies (for example) in the definition. This represents the more “closely defined” end of the ontology (semantic) spectrum.

However, since we are speaking here of ontologies and the structured Web or the semantic Web, I believe we need to embrace a concept of ontology aligned to actual practice. Not all content providers can or want to employ ontology engineers to enable formal inferencing of their content. Yet, on the other hand, their content in its various forms does have some meaningful structure, some organization. The trick is to extract this structure for more meaningful use such as data exchange or data merging.

### Ontology Approaches on the Web

Under such “loosely defined” bases we can thus see a spectrum of ontology approaches on the Web, proceeding from less structure and formalism to more so:

 Type or Schema Examples Comments on Structure and Formalism Standard Web Page entire Web General metadata fields in the and internal HTML codes and tags provide minimal, but useful sources of structure; other HTTP and retrieval data can also contribute Blog / Wiki Page examples from Technorati, Bloglines, Wikipedia Provides still greater formalism for the organization and characterization of content (subjects, categories, posts, comments, date/time stamps, etc.). Importantly, with the addition of plug-ins, some of the basic software may also provide other structured characterizations or output (SIOC, FOAF, etc.; highly varied and site-specific given the diversity of publishing systems and plug-ins) RSS / Atom feeds most blogs and most news feeds RSS extends basic XML schema for more robust syndication of content with a tightly controlled vocabulary for feed concepts and their relationships. Because of its ubiquity, this is becoming a useful baseline of structure and formalism; also, the nature of adoption shows much about how ontological structure is an artifact, not driver, for use RSS / Atom feeds with tags or OPML Grazr, most newsfeed aggregators can import and export OPML lists of RSS feeds The OPML specification defines an outline as a hierarchical, ordered list of arbitrary elements. The specification is fairly open which makes it suitable for many types of list data. See also OML and XOXO Hierarchical Faceted Metadata XFML, Flamenco These and related efforts from the information architecture (IA) community are geared more to library science. However, they directly contribute to faceted browsing, which is one of the first practical instantiations of the semantic Web Folksonomies Flickr, del.icio.us Based on user-generated tags and informal organizations of the same; not linked to any “standard” Web protocols. Both tags and hierarchical structure are arbitrary, but some researchers now believe over large enough participant sets that structural consensus and value does emerge Microformats Example formats include hAtom, hCalendar, hCard, hReview, hResume, rel-directory, xFolk, XFN and XOXO A microformat is HTML mark up to express semantics with strictly controlled vocabularies. This markup is embedded using specific HTML attributes such as class, rel, and rev. This method is easy to implement and understand, but is not free-form Embedded RDF RDFa, eRDF An embedded format, like microformats, but free-form, and not subject to the approval strictures associated with microformats Topic Maps Infoloom, Topic Maps Search Engine A topic map can represent information using topics (representing any concept, from people, countries, and organizations to software modules, individual files, and events), associations (which represent the relationships between them), and occurrences (which represent relationships between topics and information resources relevant to them) RDF Many; DBpedia, etc. RDF has become the canonical data model since it represents a “universal” conversion format RDF Schema SKOS, SIOC, DOAP, FOAF RDFS or RDF Schema is an extensible knowledge representation language, providing basic elements for the description of ontologies, otherwise called RDF vocabularies, intended to structure RDF resources. This becomes the canonical ontology common meeting ground OWL Lite Here are some existing OWL ontologies; also see Swoogle for OWL search facilities The Web Ontology Language (OWL) is a language for defining and instantiating Web ontologies. An OWL ontology may include descriptions of classes, along with their related properties and instances. OWL is designed for use by applications that need to process the content of information instead of just presenting information to humans. It facilitates greater machine interpretability of Web content than that supported by XML, RDF, and RDF Schema (RDF-S) by providing additional vocabulary along with a formal semantics. The three language versions are in order of increasing expressiveness OWL DL OWL Full Higher-order “formal” and “upper-level” ontologies SUMO, DOLCE, PROTON, BFO, Cyc, OpenCyc These provide comprehensive ontologies and often related knowledge bases, with the goal of enabling AI applications to perform human-like reasoning. Their reasoning languages often use higher-order logics

As a rule of thumb, items that are less “formal” can be converted to a more formal expression, but the most formal forms can generally not be expressed in less formal forms.

As latter sections elaborate, I see RDF as the universal data model for representing this structure into a common, canonical format, with RDF Schema (specifically SKOS, but also supplemented by FOAF, DOAP and SIOC) as the organizing ontology knowledge representation language (KRL).

This is not to say that the various dialects of OWL should be neglected. In bounded environments, they can provide superior reasoning power and are warranted if they can be sufficiently mandated or enforced. But the RDF and RDF-S systems represent the most tractable “meeting place” or “middle ground,” IMHO.

### Still-Another “Level” of Ontologies

As if the formalism dimension were not complicated enough, there is also the practice within the ontology community to characterize ontologies by “levels”, specifically upper, middle and lower levels. For example, chances are that you have heard particularly of “upper-level” ontologies.

The following figure helps illustrate this “level” dimension. This diagram is also from Leo Obrst of Mitre [12], and was also used in another 2006 talk by Jack Park and Patrick Durusau (discussed further below for other reasons):

Examples of upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc and BFO (Basic Formal Ontology). Most of the content in their upper-levels is akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget’s Thesaurus — that is, real ontos stuff) than to “generic common knowledge.” Most all of them have both a hierarchical and networked structure, though their actual subject structure relating to concrete things is generally pretty weak [2].

The above diagram conveys a sense of how multiple ontologies can relate to one another both in terms of narrower and broader topic matter and at the same “levels” of generalization. Such “meta-structure” (if you will) can provide a reference structure for relating multiple ontologies to one another.

 The relationships and mappings amongst ontologies is a critical infrastructure component of the semantic Web.

It resides exactly in such bindings or relationships that we can foresee the promise of querying and relating multiple endpoints on the Web with accurate semantics in order to connect dots and combine knowledge bases. Thus, the understanding of the relationships and mappings amongst ontologies becomes a critical infrastructural component of the semantic Web.

### The SUMO Example

We can better understand these mapping and inter-relationship concepts by using a concrete example with a formal ontology. We’ll choose to use the Suggested Upper Merged Ontology simply because it is one of the best known. We could have also selected another upper-level system such as PROTON [3] or Cyc [4] or one of the many with narrower concept or subject coverage.

SUMO is one of the formal ontologies that has been mapped to the WordNet lexicon, which adds to its semantic richness. SUMO is written in the SUO-KIF language. SUMO is free and owned by the IEEE. The ontologies that extend SUMO are available under GNU General Public License.

The abstract, conceptual organization of SUMO is shown by this diagram, which also points to its related MILO (MId-Level Ontology), which is being developed as a bridge between the abstract content of the SUMO and the richer detail of various domain ontologies:

At this level, the structure is quite abstract. But one can easily browse the SUMO structure. A nifty tool to do so is the KSMSA (Knowledge Support for Modeling and Simulation) ontology browser. Using a hierarchical tree representation, you can navigate through SUMO, MILO, WordNet, and (with the locally installed version) Wikipedia.

The figure below shows the upper-level entity concept on the left; the right-hand panel shows a drill-down into the example atom entity:

[Click on image for full-size pop-up]

These views may be a bit misleading because the actual underlying structure, while it has hierarchical aspects as shown here, really is in the form of a directed acyclic graph (showing other relatedness options, not just hierarchical ones). So, alternate visualizations include traditional network graphs.

The other thing to note is that the “things” covered in the ontology, the entities, are also fairly abstract. That is because the intention of a standard “upper-level” ontology is to cover all relevant knowledge aspects of each entity’s domain. This approach results in a subject and topic coverage that feels less “concrete” than the coverage in, say, an encyclopedia, directory or card catalog.

### Ontology Binding and Integration Mechanisms

According to Park and Durusau, upper ontologies are diverse, middle ontologies are even more diverse, and lower ontologies are more diverse still. A key observation is that ontological diversity is a given and increases as we approach real user interaction levels. Moreover, because of the “loose” nature of ontologies on the Web (now and into the future), diversity of approach is a further key factor.

Recall the initial discussion on the role and objectives of ontologies. About half of those roles involve effectively accessing or querying more than one ontology. The objective of “upper-level” ontologies, many with their own binding layers, is also expressly geared to ontology integration or federation. So, what are the possible mechanisms for such binding or integration?

A fundamental distinction within mechanisms to combine ontologies is whether it is a unified or centralized approach (often imposed or required by some party) or whether it is a schema mapping or binding approach. We can term this distinction centralized v. federated.

#### Centralized Approaches

Centralized approaches can take a number of forms. At the most extreme, adherence to a centralized approach can be contractual. At the other end are reference models or standards. For example, illustrative reference models include:

• the Data Reference Model (DRM), one of the five reference models of the Federal Enterprise Architecture (FEA)
• UDEF (Unified Data Element Framework), an approach toward a unified description framework, or
• the eXtended MetaData Registry (XMDR) project.

Though I have argued that One Ring to Rule them All is not appropriate to the general Web, there may be cases within certain enterprises or where through funding clout (such as government contracts), some form of centralized approach could be imposed [5]. And, frankly, even where compliance can not be assured, there are advantages in economy, efficiency and interoperability to attempt central ontologies. Certain industries — notably pharmaceuticals and petrochemicals — and certain disciplines — such as some areas of biology among others — have through trade associations or community consensus done admirable jobs in adopting centralized approaches.

#### Federated Approaches

However, combining ontologies in the context of the broader Internet is more likely through federated approaches. (Though federated approaches can also be improved when there are consensual standards within specific communities.) The key aspect of a federated approach is to acknowledge that multiple schema need to be brought together, and that each contributing data set and its schema will not be altered directly and will likely remain in place.

Thus, the key distinctions within this category are the mechanisms by which those linkages may take place An important goal in any federated approach is to achieve interoperability at the data or instance level without unacceptable loss of information or corruption of the semantics. Numerous specific approaches are possible, but three example areas in RDF-topic map interoperability, the use of “subject maps”, and binding layers can illustrate some of the issues at hand.

In 2006 the W3C set up a working group to look at the issue of RDF and topic maps interoperability. Topic maps have been embraced by the library and information architecture community for some time, and have standards that have been adopted under ISO. Somewhat later but also in parallel was the development of the RDF standard by W3C. The interesting thing was that the conceptual underpinnings and objectives between these two efforts were quite similar. Also, because of the substantive thrust of topic maps and the substantive needs of its community, quite a few topic maps had been developed and implemented.

One of the first efforts of the W3C work group was to evaluate and compare five or six extant proposals for how to relate RDF and topic maps [6]. That report is very interesting reading for any one desirous of learning more about specific issues in combining ontologies and their interoperability. The result of that evaluation then led to some guidelines for best practices in how to complete this mapping [7]. Evaluations such as these provide confidence that interoperability can be achieved between relatively formal schema definitions without unacceptable loss in meaning.

A different, “looser” approach, but one which also grew out of the topic map community, is the idea of “subject maps.” This effort, backed by Park and Durusau noted above, but also with the support of other topic map experts such as Steve Newcomb and Robert Barta via their proposed Topic Maps Reference Model (ISO 13250-5), seems to be one of the best attempts I’ve seen that both respects the reality of the actual Web while proposing a workable, effective scheme for federation.

The basic idea of a subject map is built around a set of subject “proxies.” A subject proxy is a computer representation of a subject that can be implemented as an object, must have an identity, and must be addressable (this point provides the URI connector to RDF). Each contributing schema thus defines its own subjects, with the mappings becoming meta-objects. These, in turn, would benefit from having some accepted subject reference schema (not specifically addressed by the proponents) to reduce the breadth of the ultimate mapped proxy “space.”

I don’t have the expertise to judge further the specifics, but I find the presentation and papers by Park and Durusau, Avoiding Hobson’s Choice In Choosing An Ontology and Towards Subject-centric Merging of Ontologies to be worthwhile reading in any case. I highly recommend these papers for further background and clarity.

As the third example, “binding layers” are a comparatively newer concept. Leading upper-level ontologies such as SUMO or PROTON propose their own binding protocols to their “lower” domains, but that approach takes place within the construct of the parent upper ontology and language. Such designs are not yet generalized solutions. By far the most promising generalized binding solution is the SKOS (Simple Knowledge Organization System). Because of its importance, the next section is devoted to it.

Finally, with respect to federated approaches, there are quite a few software tools that have been developed to aid or promote some of these specific methods. For, example, about twenty of the software applications in my Sweet Tools listing of 500+ semantic Web and -related tools could be interpreted as aiding ontology mapping or creation. You may want to check out some of these specific tools depending on your preferred approach [8].

### The Role of SKOS – the Simple Knowledge Organization System

SKOS, or the Simple Knowledge Organization System, is a formal language and schema designed to represent such structured information domains as thesauri, classification schemes, taxonomies, subject-heading systems, controlled vocabularies, or others; in short, most all of the “loosely defined” ontology approaches discussed herein. It is a W3C initiative more fully defined in its SKOS Core Guide [9].

SKOS is built upon the RDF data model of the subject-predicate-object “triple.” The subjects and objects are akin to nouns, the predicate a verb, in a simple Dick-sees-Jane sentence. Subjects and predicates by convention are related to a URI that provides the definitive reference to the item. Objects may be either a URI resource or a literal (in which case it might be some indexed text, an actual image, number to be used in a calculation, etc.).

Being an RDF Schema simply means that SKOS adds some language and defined relationships to this RDF baseline. This is a bit of recursive understanding, since RDFS is itself defined in RDF by virtue of adding some controlled vocabulary and relations. The power, though, is that these schema additions are also easily expressed and referenced.

This RDFS combination can thus be shown as a standard RDF triple graph, but with the addition of the extended vocabulary and relations:

The power of the approach arises from the ability of the triple to express virtually any concept, further extended via the RDFS language defined for SKOS. SKOS includes concepts such as “broader” and “narrower”, which enable hierarchical relations to be modeled, as well as “related” and “member” to support networks and arrays, respectively [9].

We can visualize this transforming power by looking at how an “ontology” in a totally foreign scheme can be related to the canonical SKOS scheme. In the figure below the left-hand portion shows the native hierarchical taxonomy structure of the UK Archival Thesaurus (UKAT), next as converted to SKOS on the right (with the overlap of categories shown in dark purple). Note the hierarchical relationships visualize better via a taxonomy, but that the RDF graph model used by SKOS allows a richer set of additional relationships including related and alternative names:

[Click on image for full-size pop-up]

SKOS also has a rich set of annotation and labeling properties to enhance human readability of schema developed in it [9]. There is also a useful draft schema that the W3C’s SWEO (Semantic Web Education and Outreach) group is developing to organize semantic Web-related information [10].

Combined, these constructs provide powerful mechanisms for giving contributory ontologies a common conceptualization. When added to other sibling RDF schema such as FOAF or SIOC or DOAP, still additional concepts can be collated.

### Conclusions

While not addressed directly in this piece, it is obviously of first importance to have content with structure before the questions of connecting that information can even arise. Then, that structure must also be available in a form suitable for merging or connection.

At that point, the subjects of this posting come into play.

 We are stubbing our toes on the rocks while we gaze at the heavens.

We see that the daily Web has a diversity of schema or ontologies “loosely defined” for representing the structure of the content. These representations can be transferred to more complex schema, but not in the opposite direction. Moreover, the semantic basis for how to make these mappings also needs some common referents.

RDF provides the canonical data model for the data transfers and representations. RDFS, especially in the form of SKOS, appears to form one basis for the syntax and language for these transformations. And SKOS, with other schema, also appears to offer much of the appropriate “middle ground” for data relationships mapping.

However, lacking in this story is a referential structure for subject relationships [11]. (Also lacking are the ultimately critical domain specifics required for actual implementation.)

Abstract concepts of interest to philosophers and deep thinkers have been given much attention. Sadly, to date, concrete subject structures in which tangible things and tangible actions can be shared, is still very, very weak. We are stubbing our toes on the rocks while we gaze at the heavens.

Yet, despite this, simple and powerful infrastructures are well in-hand to address all foreseeable syntactic and semantic issues. There appear to be no substantive limits to needed next steps.

Lastly, many valuable resources for further reading and learning may be found within the Ontolog Community, W3C, TagCommons and Topics Maps groups. Enjoy! And be wary of ontology no longer.

[1] Deborah L. McGuinness. “Ontologies Come of Age”. In Dieter Fensel, Jim Hendler, Henry Lieberman, and Wolfgang Wahlster, editors. Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential. MIT Press, 2003. See http://www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-mit-press-(with-citation).htm

[2] I think it would be much clearer to refer to “upper level” ontologies as abstract or conceptual, “mid levels” as mapping or binding, and “lower levels” as domain (without any hierarchical distinctions such as lower or lowest or sub-domain), but current practice is probably too entrenched to change now.

[3] There are many aspects that make PROTON one of the more attractive reference ontologies. The PROTON ontology (PROTo ONtology), developed within the scope of the SEKT project, is attractive because of its understandability, relatively small size, modular architecture and a simple subsumption hierarchy. It is available in an OWL Lite form and is easy to adopt and extend. On the face of it, the Topic class within PROTON, which is meant to serve as a bridge between different ontologies, may also provide a binding layer to specific subject topics as sub-classes or class instances.

[4] See my earlier post on Cyc.

[5] Even with such clout, it is questionable to get rather complete adherence, as Ada showed within the Federal government. However, where circumstances allow it, central schema and ontologies may be worth pursuing because of improved interoperability and lower costs, even where some portions do not adhere and are more chaotic like the standard Web.

[6] See, A Survey of RDF/Topic Maps Interoperability Proposals, W3C Working Group Note 10 February 2006, Pepper, Vitali, Garshol, Gessa, Presutti (eds.)

[7] See, Guidelines for RDF/Topic Maps Interoperability, W3C Editor’s Draft 30 June 2006, Pepper, Presutti, Garshol, Vitali (eds.)

[8] Here are some Sweet Tools that may have a usefulness to ontology federation and creation:

• Adaptiva — is a user-centered ontology building environment, based on using multiple strategies to construct an ontology, minimising user input by using adaptive information extraction
• Altova SemanticWorks — is a visual RDF and OWL editor that auto-generates RDF/XML or nTriples based on visual ontology design
• CMS — the CROSI Mapping System is a structure matching system that capitalizes on the rich semantics of the OWL constructs found in source ontologies and on its modular architecture that allows the system to consult external linguistic resources
• ConcepTool — is a system to model, analyze, verify, validate, share, combine, and reuse domain knowledge bases and ontologies, reasoning about their implication
• ConRef — is a service discovery system which uses ontology mapping techniques to support different user vocabularies
• FOAM — is the Framework for Ontology Alignment and Mapping. It is based on heuristics (similarity) of the individual entities (concepts, relations, and instances)
• hMAFRA (Harmonize Mapping Framework) — is a set of tools supporting semantic mapping definition and data reconciliation between ontologies. The targeted formats are XSD, RDFS and KAON
• IF-Map — is an Information Flow based ontology mapping method. It is based on the theoretical grounds of logic of distributed systems and provides an automated streamlined process for generating mappings between ontologies of the same domain
• IODT — is IBM’s toolkit for ontology-driven development. The toolkit includes EMF Ontology Definition Metamodel (EODM), EODM workbench, and an OWL Ontology Repository (named Minerva)
• KAON — is an open-source ontology management infrastructure targeted for business applications. It includes a comprehensive tool suite allowing easy ontology creation and management and provides a framework for building ontology-based applications. An important focus of KAON is scalable and efficient reasoning with ontologies
• LinKFactory — is Language & Computing’s ontology management tool. It provides an effective and user-friendly way to create, maintain and extend extensive multilingual terminology systems and ontologies (English, Spanish, French, etc.). It is designed to build, manage and maintain large, complex, language independent ontologies
• M3t4.Studio Semantic Toolkit — is Metatomix’s free set of Eclipse plug-ins to allow developers to create and manage OWL ontologies and RDF documents
• MAFRA Toolkit — the Ontology MApping FRAmework Toolkit allows to create semantic relations between two (source and target) ontologies, and apply such relations in translating source ontology instances into target ontology instances
• OntoEngine — is a step toward allowing agents to communicate even though they use different formal languages (i.e., different ontologies). It translates data from a “source” ontology to a “target.”
• OntoPortal — enables the authoring and navigation of large semantically-powered portals
• OWLS-MX — the hybrid semantic Web service matchmaker OWLS-MX 1.0 utilizes both description logic reasoning, and token based IR similarity measures. It applies different filters to retrieve OWL-S services that are most relevant to a given query
• pOWL — is a semantic Web development platform for ontologies in PHP. pOWL consists of a number of components, including RAP
• Protege — is an open source visual ontology editor written in Java with many plug-in tools
• Semantic Net Generator — is a utility for generating topic maps automatically from different data sources by using rules definitions specified with Jelly XML syntax. This Java library provides Jelly tags to access and modify data sources (also RDF) to create a semantic network
• SOFA — is a Java API for modeling ontologies and Knowledge Bases in ontology and Semantic Web applications. It provides a simple, abstract and language neutral ontology object model, inferencing mechanism and representation of the model with OWL, DAML+OIL and RDFS languages
• Terminator — is a tool for creating term to ontology resource mappings (documentation in Finnish)
• WebOnto — supports the browsing, creation and editing of ontologies through coarse grained and fine grained visualizations and direct manipulation.

[9] The SKOS language has the following classes:

• CollectableProperty — A property which can be used with a skos:Collection
• Collection — A meaningful collection of concepts
• Concept — An abstract idea or notion; a unit of thought
• ConceptScheme — A set of concepts, optionally including statements about semantic relationships between those concepts. Thesauri, classification schemes, subject heading lists, taxonomies, ‘folksonomies’, and other types of controlled vocabulary are all examples of concept schemes. Concept schemes are also embedded in glossaries and terminologies.
• OrderedCollection — An ordered collection of concepts, where both the grouping and the ordering are meaningful

. . . and the following properties:

• altLabel — An alternative lexical label for a resource. Acronyms, abbreviations, spelling variants, and irregular plural/singular forms may be included among the alternative labels for a concept
• altSymbol — An alternative symbolic label for a resource
• broader — A concept that is more general in meaning. Broader concepts are typically rendered as parents in a concept hierarchy (tree)
• changeNote — A note about a modification to a concept
• definition — A statement or formal explanation of the meaning of a concept
• editorialNote — A note for an editor, translator or maintainer of the vocabulary
• example — An example of the use of a concept
• hasTopConcept — A top level concept in the concept scheme
• hiddenLabel — A lexical label for a resource that should be hidden when generating visual displays of the resource, but should still be accessible to free text search operations
• historyNote — A note about the past state/use/meaning of a concept
• inScheme — A concept scheme in which the concept is included. A concept may be a member of more than one concept scheme
• isPrimarySubjectOf — A resource for which the concept is the primary subject
• isSubjectOf –A resource for which the concept is a subject
• member — A member of a collection
• memberList — An RDF list containing the members of an ordered collection
• narrower — A concept that is more specific in meaning. Narrower concepts are typically rendered as children in a concept hierarchy (tree)
• note — A general note, for any purpose. The other human-readable properties of definition, scopeNote, example, historyNote, editorialNote and changeNote are all sub-properties of note
• prefLabel — The preferred lexical label for a resource, in a given language. No two concepts in the same concept scheme may have the same value for skos:prefLabel in a given language
• prefSymbol — The preferred symbolic label for a resource
• primarySubject — A concept that is the primary subject of the resource. A resource may have only one primary subject per concept scheme
• related — A concept with which there is an associative semantic relationship
• scopeNote — A note that helps to clarify the meaning of a concept
• semanticRelation — A concept related by meaning. This property should not be used directly, but as a super-property for all properties denoting a relationship of meaning between concepts
• subject — A concept that is a subject of the resource
• subjectIndicator — A subject indicator for a concept. [The notion of 'subject indicator' is defined here with reference to the latest definition endorsed by the OASIS Published Subjects Technical Committee]
• symbol — An image that is a symbolic label for the resource. This property is roughly analagous to rdfs:label, but for labelling resources with images that have retrievable representations, rather than RDF literals. Symbolic labelling means labelling a concept with an image.

[10] The SWEO classification ontology is still under active development and has these draft classes. Note, however, the relative lack of actual subject or topic matter:

Classes are currently defined as:

• article – magazine article
• blog – blog discussing SW topics
• book – indicates a textbook, applies to the book’s home page, review or listing in Amazon or such
• casestudy – Article on a business case
• conference/event – conferences or events where you can learn about the Semantic Web
• demo/demonstration – interactive SW demo
• forum – a forum on semantic web or related topics
• presentation – Powerpoint or similar slide show
• person – If this is a person’s home page or blog, see below
• publication – a scientific publication
• ontology – a formalisation of a shared conceptualization using OWL, RDFS, SKOS or something else based on RDF
• organization – If the page is the home page of an organization, research, vendor etc, see below
• portal – a portal website Semantic Web or related topics, usually hosting information items, mailinglists, community tools
• project – a research (for example EU-IST) or other project that addresses Semantic Web issues
• mailinglist – a mailinglist on semantic Web or related topics
• person – ideally a person that is well known regarding the Semantic Web (people who can do keynote speakers), may also be any related person
• press – a press release by a company or an article about Semantic Web
• recommended – If the resource is seen to be in the top 10 of its kind
• specification – a Semantic Web specification (RDF, RDF/S, OWL, etc)
• categories – (perhaps using tags or other free form annotation
• successstory – Article that can contain advertisment and clearly shows the benefit of semantic web
• tutorial – a tutorial teaching some aspect of semantic web, an example
• vocabulary – a RDF vocabulary
• software project/tool – For product/project home pages

If the page describes an organization, it can be tagged as:

• vendor
• research
• enduser

If the page is a person’s home page or blog or similar, it could be:

• opinionleader
• researcher
• journalist
• executive
• geek

The type of audience can also be tagged, for example:

• general public
• beginners
• technicians
• researchers.

[11] The OASIS Topic Maps Published Subjects Technical Committee was formed a number of years back to promote Topic Maps interoperability through the use of Published Subjects Indicators (PSIs). Their resulting report was a very interesting effort that unfortunately did not lead to wide adoption, perhaps because the effort was a bit ahead of its time or it was in advance of the broader acceptance of RDF. This general topic is the subject of a later posting by me.

[12] See further, Leo Obrst, “The Semantic Spectrum & Semantic Models,” a Powerpoint presentation (http://ontolog.cim3.net/file/resource/presentation/LeoObrst_20060112/OntologySpectrumSemanticModels–LeoObrst_20060112.ppt)
made as part of an Ontolog Forum (http://ontolog.cim3.net/) presentation in two parts, “What is an Ontology? – A Briefing on the Range of Semantic Models” (see http://ontolog.cim3.net/cgi-bin/wiki.pl?ConferenceCall_2006_01_12), in January 2006. Leo Obrst is a principal artificial intelligence scientist at MITRE’s (http://www.mitre.org) Center for Innovative Computing and Informatics and a co-convener of the Ontolog Forum. His presentation is a rich source of practical overview information on ontologies.

[13] The actual diagram is an unattributed modification by Dan McCreary (see http://www.danmccreary.com/presentations/sem_int/sem_int.ppt) based on Obrst’s material in [12].