Ontology is one of the more daunting terms for those exposed for the first time to the semantic Web. Not only is the word long and without many common antecedents, but it is also a term that has widely divergent use and understanding within the community. It can be argued that this not-so-little word is one of the barriers to mainstream understanding of the semantic Web.
The root of the term is the Greek ontos, or being or the nature of things. Literally — and in classical philosophy — ontology was used in relation to the study of the nature of being or the world, the nature of existence. Tom Gruber, among others, made the term popular in relation to computer science and artificial intelligence about 15 years ago when he defined ontology as a “formal specification of a conceptualization.”
While there have been attempts to strap on more or less formal understandings or machinery around ontology, it still has very much the sense of a world view, a means of viewing and organizing and conceptualizing and defining a domain of interest. As is made clear below, I personally prefer a loose and embracing understanding of the term (consistent with Deborah McGuinness’s 2003 paper, Ontologies Come of Age).
There has been a resurgence of interest in ontologies of late. Two reasons are the emergence of Web 2.0, with its tagging and folksonomies, and the nascent emergence of the structured Web. In fact, on April 23-24 one of the noted communities of practice around ontologies, Ontolog, sponsored the Ontology Summit 2007, “Ontology, Taxonomy, Folksonomy: Understanding the Distinctions.”
These events have sparked my preparing this guide to ontologies. I have to admit this is a somewhat intrepid endeavor given the wealth of material and diversity of opinions.
Of course, a fancy name is not sufficient alone to warrant an interest in ontologies. There are reasons why understanding, using and manipulating ontologies can bring practical benefit:
Both structure and formalism are dimensions for classifying ontologies, which combined are often referred to as an ontology’s “expressiveness.” How one describes this structure and formality differs. One recent attempt is this figure from the Ontology Summit 2007’s wrap-up communiqué:
Note the bridging role that an ontology plays between a domain and its content. (By its nature, every ontology attempts to “define” and bound a domain.) Also note that the Summit’s 50 or so participants were focused on the trade-off between semantic v. pragmatic considerations. This was a result of ongoing attempts within the community to understand, embrace and (possibly) legitimize “less formal” Web 2.0 efforts such as tagging and the folksonomies that can result from them.
There is an M.C. Escher-like recursion of the lizard eating its tail when one observes ontologists creating an ontology to describe the ontological domain. The above diagram, which itself would be different with a slight change in Summit participation or editorship, is, of course, but one representative view of the world. Indeed, a tremendous variety of scientific and research disciplines concern themselves with classifying and organizing the “nature of things.” Their practitioners go by such names as logicians, taxonomists, philosophers, information architects, computer scientists, librarians, operations researchers, systematicists, statisticians, historians, and so forth. (In short, given our ontos, every area of human endeavor has the urge to classify, to organize.) Not only do the domains of these areas differ, but so do the structures and classification schemes they adopt.
There are at least 40 terms or concepts across these various disciplines, most related to Web and general knowledge content, that have organizational or classificatory aspects that — loosely defined — could be called an “ontology” framework or approach:
Actual domains or subject coverage are then mostly orthogonal to these approaches.
Loosely defined, the number of possible ontologies is therefore close to infinite: domain × perspective × schema. (Just kidding — sort of! In fact, UMBC’s Swoogle ontology search service claims 10,000 ontologies presently on the Web; the actual data from August 2006 ranges from about 16,000 to 92,000 ontologies, depending on how “formal” the definition. These counts are also limited to OWL-based ontologies.)
Many have misunderstood the semantic Web because of this diversity and the slipperiness of the concept of an ontology. This misunderstanding becomes flat wrong when people claim the semantic Web implies one single grand ontology or organizational schema, One Ring to Rule Them All. The diversity of humans and domains makes this viewpoint patently false.
The choice of an ontological approach to organize Web and structured content can be contentious. Publishers and authors perhaps have too many choices: from straight Atom or RSS feeds and feeds with tags to informal folksonomies and then Outline Processor Markup Language or microformats. From there, the formalism increases further to include the standard RDF ontologies such as SIOC (Semantically-Interlinked Online Communities), SKOS (Simple Knowledge Organizing System), DOAP (Description of a Project), and FOAF (Friend of a Friend) and the still greater formalism of OWL’s various dialects.
Arguing which of these is the theoretical best method is doomed to failure, except possibly in a bounded enterprise environment. We live in the real world, where multiple options will always have their advocates and their applications. All of us should welcome whatever structure we can add to our information base, no matter where it comes from or how it’s done. The sooner we can embrace content in any of these formats and convert it to a canonical form, the sooner we can move on to needed developments in semantic mediation, the threshold condition for the semantic Web.
So, diversity is inevitable and should be accepted. But that observation need not also embrace chaos.
In my early training in biological systematics, Ernst Haeckel’s recapitulation theory that “ontogeny recapitulates phylogeny” (note the same ontos root, the difference from ontology being growth v. study) was losing favor fast. The theory was that the development of an organism through its embryological phases mirrors its evolutionary history. Today, modern biologists recognize numerous connections between ontogeny and phylogeny, explain them using evolutionary theory, or view them as supporting evidence for that theory.
Yet, as with the construction of phylogenetic trees, systematicists strive for their classifications of the relatedness of organisms to be “natural”, to reflect the true nature of the relationship. Thus, over time, that understanding of a “natural” system has progressed from appearance → embryology → embryology + detailed morphology → species and interbreeding → DNA. While details continue to be worked out, the degree of genetic relatedness is now widely accepted by biologists as a “natural” basis for organizing the Tree of Life.
It is not unrealistic to also seek “naturalness” in the organization of other knowledge domains, to seek “naturalness” in the organization of their underlying ontologies. Like natural systems in biology, this naturalness should emerge from the shared understandings and perceptions of the domain’s participants. While subject matter expertise and general and domain knowledge are essential to this development, they are not the only factors. As tagging systems on the Web are showing, common usage and broad acceptance by the community at hand is important as well.
While it may appear that a domain such as the biological relatedness of organisms is more empirical than the concepts and ambiguous words in most domains of human endeavor, these attempts at naturalness are still not foolish. The phylogeny example shows that understanding changes over time as knowledge is gained. We now accept DNA over the recapitulation theory.
As the formal SKOS organizational schema for knowledge systems recognizes (see below), the ideas of narrower and broader concepts can be readily embraced, as well as concepts of relatedness and aliases (synonyms). These simple constructs, I would argue, plus the application of knowledge being gained in related domains, will enable tomorrow’s understandings to be more “natural” than today’s, no matter the particular domain at hand.
So, in seeking a “naturalness” within our organizational schema, we can also see that change is a constant. We also see that the tools and ideas underlying the seemingly abstract cause of merging and relating existing ontologies to one another will further a greater “naturalness” within our organizations of the world.
According to the Summit, expressiveness is the extent and ease by which an ontology can describe domain semantics. Structure they define as the degree of organization or hierarchical extent of the ontology. They further define granularity as the level of detail in the ontology. And, as the diagram above alludes to, they define still other dimensions of an ontology: use, logical basis, purpose and so forth.
Over fifty respondents from 42 communities submitted some 70 different ontologies, under about 40 terms, to a survey the Summit used to construct its diagram. These submissions included:
I think the simplest spectrum for such distinctions is the formalism of the ontology and its approach (and language or syntax, not further discussed here). More formal ontologies have greater expressiveness and structure and inferential power, less formal ones the opposite. Constructing more formal ontologies is more demanding, and takes more effort and rigor, resulting in an approach that is more powerful but also more rigid and less flexible. Like anything else, there are always trade-offs.
Note this diagram reflects the more conventional, practitioner’s view of the “formal” ontology, which does not include taxonomies or controlled vocabularies (for example) in the definition. This represents the more “closely defined” end of the ontology (semantic) spectrum.
However, since we are speaking here of ontologies and the structured Web or the semantic Web, I believe we need to embrace a concept of ontology aligned to actual practice. Not all content providers can or want to employ ontology engineers to enable formal inferencing of their content. Yet, on the other hand, their content in its various forms does have some meaningful structure, some organization. The trick is to extract this structure for more meaningful use such as data exchange or data merging.
Under such “loosely defined” bases we can thus see a spectrum of ontology approaches on the Web, proceeding from less structure and formalism to more so:
| Type or Schema | Examples | Comments on Structure and Formalism |
| --- | --- | --- |
| Standard Web Page | entire Web | General metadata fields in the <head> element and internal HTML codes and tags provide minimal, but useful, sources of structure; other HTTP and retrieval data can also contribute |
| Blog / Wiki Page | examples from Technorati, Bloglines, Wikipedia | Provides still greater formalism for the organization and characterization of content (subjects, categories, posts, comments, date/time stamps, etc.). Importantly, with the addition of plug-ins, some of the basic software may also provide other structured characterizations or output (SIOC, FOAF, etc.; highly varied and site-specific given the diversity of publishing systems and plug-ins) |
| RSS / Atom Feeds | most blogs and most news feeds | RSS extends the basic XML schema for more robust syndication of content, with a tightly controlled vocabulary for feed concepts and their relationships. Because of its ubiquity, this is becoming a useful baseline of structure and formalism; also, the nature of its adoption shows much about how ontological structure is an artifact of, not a driver for, use |
| RSS / Atom Feeds with Tags or OPML | Grazr; most newsfeed aggregators can import and export OPML lists of RSS feeds | The OPML specification defines an outline as a hierarchical, ordered list of arbitrary elements. The specification is fairly open, which makes it suitable for many types of list data. See also OML and XOXO |
| Hierarchical Faceted Metadata | XFML, Flamenco | These and related efforts from the information architecture (IA) community are geared more to library science. However, they directly contribute to faceted browsing, which is one of the first practical instantiations of the semantic Web |
| Folksonomies | Flickr, del.icio.us | Based on user-generated tags and informal organizations of the same; not linked to any “standard” Web protocols. Both tags and hierarchical structure are arbitrary, but some researchers now believe that, over large enough participant sets, structural consensus and value do emerge |
| Microformats | example formats include hAtom, hCalendar, hCard, hReview, hResume, rel-directory, xFolk, XFN and XOXO | A microformat is HTML markup to express semantics with strictly controlled vocabularies. This markup is embedded using specific HTML attributes such as class, rel and rev. This method is easy to implement and understand, but is not free-form |
| Embedded RDF | RDFa, eRDF | An embedded format, like microformats, but free-form and not subject to the approval strictures associated with microformats |
| Topic Maps | Infoloom, Topic Maps Search Engine | A topic map can represent information using topics (representing any concept, from people, countries and organizations to software modules, individual files and events), associations (which represent the relationships between them), and occurrences (which represent relationships between topics and information resources relevant to them) |
| RDF | many; DBpedia, etc. | RDF has become the canonical data model since it represents a “universal” conversion format |
| RDF Schema | SKOS, SIOC, DOAP, FOAF | RDFS, or RDF Schema, is an extensible knowledge representation language, providing basic elements for the description of ontologies (otherwise called RDF vocabularies) intended to structure RDF resources. This becomes the canonical ontology common meeting ground |
| OWL Lite / OWL DL / OWL Full | some existing OWL ontologies; also see Swoogle for OWL search facilities | The Web Ontology Language (OWL) is a language for defining and instantiating Web ontologies. An OWL ontology may include descriptions of classes, along with their related properties and instances. OWL is designed for use by applications that need to process the content of information instead of just presenting information to humans. It facilitates greater machine interpretability of Web content than that supported by XML, RDF and RDF Schema (RDF-S) by providing additional vocabulary along with a formal semantics. The three language versions are in order of increasing expressiveness |
| Higher-order “formal” and “upper-level” ontologies | SUMO, DOLCE, PROTON, BFO, Cyc, OpenCyc | These provide comprehensive ontologies and often related knowledge bases, with the goal of enabling AI applications to perform human-like reasoning. Their reasoning languages often use higher-order logics |
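To make the microformat row above a bit more concrete, here is a minimal sketch of pulling structure out of class-based markup with Python’s standard-library HTML parser. The class names ("fn", "org", "locality") follow hCard conventions; the snippet itself and the parser design are my own illustration, not a production microformat parser.

```python
from html.parser import HTMLParser

class HCardParser(HTMLParser):
    """Extract a few hCard-style properties from embedded markup."""

    def __init__(self):
        super().__init__()
        self._field = None   # hCard class of the tag we are currently inside
        self.fields = {}     # extracted property -> text

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        for cls in ("fn", "org", "locality"):
            if cls in classes:
                self._field = cls

    def handle_data(self, data):
        if self._field and data.strip():
            self.fields[self._field] = data.strip()
            self._field = None

# A hypothetical hCard snippet embedded in an ordinary page
snippet = """
<div class="vcard">
  <span class="fn">Jane Example</span>,
  <span class="org">Example Labs</span>,
  <span class="locality">Springfield</span>
</div>
"""

parser = HCardParser()
parser.feed(snippet)
print(parser.fields)
# {'fn': 'Jane Example', 'org': 'Example Labs', 'locality': 'Springfield'}
```

The point is that the semantics ride along in attributes of otherwise ordinary HTML, which is exactly why microformats are easy to implement but constrained to their controlled vocabularies.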
As a rule of thumb, items that are less “formal” can be converted to a more formal expression, but the most formal forms can generally not be expressed in less formal forms.
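That upward conversion can be sketched in a few lines: plain folksonomy tags, which carry no structure at all, can be lifted into SKOS-style concept triples. The URIs, tag data and predicate shorthands below are invented for illustration; only the SKOS namespace is the real one.

```python
# Hypothetical tagged items, as a folksonomy site might expose them
tagged_items = {
    "http://example.org/photo/1": ["cat", "pet"],
    "http://example.org/photo/2": ["lynx", "wildlife"],
}

SKOS = "http://www.w3.org/2004/02/skos/core#"

def lift_tags(items):
    """Lift free-form tags into SKOS-style concept triples."""
    triples = []
    for item_uri, tags in items.items():
        for tag in tags:
            concept = "http://example.org/concept/" + tag
            triples.append((concept, "rdf:type", SKOS + "Concept"))
            triples.append((concept, SKOS + "prefLabel", tag))
            triples.append((item_uri, "dc:subject", concept))
    return triples

for t in lift_tags(tagged_items):
    print(t)
```

The reverse direction illustrates the rule of thumb: collapsing these triples back to bare tags loses the concept URIs and any relations later attached to them.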
As latter sections elaborate, I see RDF as the universal data model for representing this structure into a common, canonical format, with RDF Schema (specifically SKOS, but also supplemented by FOAF, DOAP and SIOC) as the organizing ontology knowledge representation language (KRL).
This is not to say that the various dialects of OWL should be neglected. In bounded environments, they can provide superior reasoning power and are warranted if they can be sufficiently mandated or enforced. But the RDF and RDF-S systems represent the most tractable “meeting place” or “middle ground,” IMHO.
As if the formalism dimension were not complicated enough, there is also the practice within the ontology community to characterize ontologies by “levels”, specifically upper, middle and lower levels. For example, chances are that you have heard particularly of “upper-level” ontologies.
The following figure helps illustrate this “level” dimension. This diagram is also from Leo Obrst of Mitre, and was also used in another 2006 talk by Jack Park and Patrick Durusau (discussed further below for other reasons):
Examples of upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc and BFO (Basic Formal Ontology). Most of the content in their upper levels is closer to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget’s Thesaurus — that is, real ontos stuff) than to “generic common knowledge.” Almost all of them have both a hierarchical and a networked structure, though their actual subject structure relating to concrete things is generally pretty weak.
The above diagram conveys a sense of how multiple ontologies can relate to one another both in terms of narrower and broader topic matter and at the same “levels” of generalization. Such “meta-structure” (if you will) can provide a reference structure for relating multiple ontologies to one another.
It is exactly in such bindings or relationships that we can foresee the promise of querying and relating multiple endpoints on the Web with accurate semantics in order to connect dots and combine knowledge bases. Thus, the understanding of the relationships and mappings amongst ontologies becomes a critical infrastructural component of the semantic Web.
We can better understand these mapping and inter-relationship concepts by using a concrete example with a formal ontology. We’ll choose to use the Suggested Upper Merged Ontology simply because it is one of the best known. We could have also selected another upper-level system such as PROTON or Cyc or one of the many with narrower concept or subject coverage.
SUMO is one of the formal ontologies that has been mapped to the WordNet lexicon, which adds to its semantic richness. SUMO is written in the SUO-KIF language. SUMO is free and owned by the IEEE. The ontologies that extend SUMO are available under GNU General Public License.
The abstract, conceptual organization of SUMO is shown by this diagram, which also points to its related MILO (MId-Level Ontology), which is being developed as a bridge between the abstract content of the SUMO and the richer detail of various domain ontologies:
At this level, the structure is quite abstract. But one can easily browse the SUMO structure. A nifty tool to do so is the KSMSA (Knowledge Support for Modeling and Simulation) ontology browser. Using a hierarchical tree representation, you can navigate through SUMO, MILO, WordNet, and (with the locally installed version) Wikipedia.
The figure below shows the upper-level entity concept on the left; the right-hand panel shows a drill-down into the example atom entity:
These views may be a bit misleading because the actual underlying structure, while it has hierarchical aspects as shown here, really is in the form of a directed acyclic graph (showing other relatedness options, not just hierarchical ones). So, alternate visualizations include traditional network graphs.
The other thing to note is that the “things” covered in the ontology, the entities, are also fairly abstract. That is because the intention of a standard “upper-level” ontology is to cover all relevant knowledge aspects of each entity’s domain. This approach results in a subject and topic coverage that feels less “concrete” than the coverage in, say, an encyclopedia, directory or card catalog.
According to Park and Durusau, upper ontologies are diverse, middle ontologies are even more diverse, and lower ontologies are more diverse still. A key observation is that ontological diversity is a given and increases as we approach real user interaction levels. Moreover, because of the “loose” nature of ontologies on the Web (now and into the future), diversity of approach is a further key factor.
Recall the initial discussion on the role and objectives of ontologies. About half of those roles involve effectively accessing or querying more than one ontology. The objective of “upper-level” ontologies, many with their own binding layers, is also expressly geared to ontology integration or federation. So, what are the possible mechanisms for such binding or integration?
A fundamental distinction within mechanisms to combine ontologies is whether it is a unified or centralized approach (often imposed or required by some party) or whether it is a schema mapping or binding approach. We can term this distinction centralized v. federated.
Centralized approaches can take a number of forms. At the most extreme, adherence to a centralized approach can be contractual. At the other end are reference models or standards. For example, illustrative reference models include:
Though I have argued that One Ring to Rule Them All is not appropriate to the general Web, there may be cases within certain enterprises, or where funding clout exists (such as government contracts), in which some form of centralized approach could be imposed. And, frankly, even where compliance cannot be assured, there are advantages in economy, efficiency and interoperability to attempting central ontologies. Certain industries — notably pharmaceuticals and petrochemicals — and certain disciplines — such as some areas of biology, among others — have through trade associations or community consensus done admirable jobs in adopting centralized approaches.
However, combining ontologies in the context of the broader Internet is more likely through federated approaches. (Though federated approaches can also be improved when there are consensual standards within specific communities.) The key aspect of a federated approach is to acknowledge that multiple schema need to be brought together, and that each contributing data set and its schema will not be altered directly and will likely remain in place.
Thus, the key distinctions within this category are the mechanisms by which those linkages may take place. An important goal in any federated approach is to achieve interoperability at the data or instance level without unacceptable loss of information or corruption of the semantics. Numerous specific approaches are possible, but three example areas (RDF-topic map interoperability, the use of “subject maps”, and binding layers) can illustrate some of the issues at hand.
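Before turning to those examples, the federated mapping idea itself can be sketched simply: each source keeps its own schema untouched, and a mapping layer translates records into shared target fields. All source names, field names and records below are hypothetical, and the unmapped-field behavior is deliberately naive to show one way information gets lost.

```python
# Per-source mappings from local field names to shared target fields
mappings = {
    "source_a": {"name": "label", "birth_year": "born"},
    "source_b": {"title": "label", "yob": "born"},
}

def federate(source, record):
    """Translate one source record into the shared target schema."""
    mapped = {}
    for local_field, value in record.items():
        target = mappings[source].get(local_field)
        if target is None:
            continue  # unmapped fields are silently dropped: information loss
        mapped[target] = value
    return mapped

a = federate("source_a", {"name": "Paul Newman", "birth_year": 1925})
b = federate("source_b", {"title": "Paul Newman", "yob": 1925, "spouse": "J. Woodward"})
print(a == b)  # True, but note source_b's unmapped "spouse" field was lost
```

Even this toy version makes the trade-off visible: the two sources now interoperate on the shared fields, at the cost of whatever their mappings fail to cover.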
In 2006 the W3C set up a working group to look at the issue of RDF and topic maps interoperability. Topic maps have been embraced by the library and information architecture community for some time, and have standards that have been adopted under ISO. Somewhat later but also in parallel was the development of the RDF standard by W3C. The interesting thing was that the conceptual underpinnings and objectives between these two efforts were quite similar. Also, because of the substantive thrust of topic maps and the substantive needs of its community, quite a few topic maps had been developed and implemented.
One of the first efforts of the W3C working group was to evaluate and compare five or six extant proposals for how to relate RDF and topic maps. That report is very interesting reading for anyone desirous of learning more about specific issues in combining ontologies and their interoperability. The result of that evaluation then led to some guidelines for best practices in how to complete this mapping. Evaluations such as these provide confidence that interoperability can be achieved between relatively formal schema definitions without unacceptable loss in meaning.
A different, “looser” approach, but one which also grew out of the topic map community, is the idea of “subject maps.” This effort, backed by Park and Durusau noted above, but also with the support of other topic map experts such as Steve Newcomb and Robert Barta via their proposed Topic Maps Reference Model (ISO 13250-5), seems to be one of the best attempts I’ve seen that both respects the reality of the actual Web while proposing a workable, effective scheme for federation.
The basic idea of a subject map is built around a set of subject “proxies.” A subject proxy is a computer representation of a subject that can be implemented as an object, must have an identity, and must be addressable (this point provides the URI connector to RDF). Each contributing schema thus defines its own subjects, with the mappings becoming meta-objects. These, in turn, would benefit from having some accepted subject reference schema (not specifically addressed by the proponents) to reduce the breadth of the ultimate mapped proxy “space.”
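The proxy idea can be made concrete with a rough sketch: a proxy object carries an identity, an address (the URI hook to RDF), and the set of source-local keys it stands in for. The class, its fields and the example keys are my own invention for illustration, not constructs from the Topic Maps Reference Model itself.

```python
class SubjectProxy:
    """A computer representation of a subject: addressable, with identity."""

    def __init__(self, uri):
        self.uri = uri      # addressable identity (the URI connector to RDF)
        self.sources = {}   # source name -> that source's local subject key

    def bind(self, source, local_key):
        """Record that this proxy stands in for a source's local subject."""
        self.sources[source] = local_key

# Two contributing schemas define their own subjects; the proxy maps them
proxy = SubjectProxy("http://example.org/proxy/paul-newman")
proxy.bind("nyt", "newman_paul_articles")
proxy.bind("dbpedia", "Paul_Newman")
print(sorted(proxy.sources))  # ['dbpedia', 'nyt']
```

The mappings themselves become the meta-objects the text describes, and a shared subject reference schema would keep the space of such proxies from ballooning.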
I don’t have the expertise to judge further the specifics, but I find the presentation and papers by Park and Durusau, Avoiding Hobson’s Choice In Choosing An Ontology and Towards Subject-centric Merging of Ontologies to be worthwhile reading in any case. I highly recommend these papers for further background and clarity.
As the third example, “binding layers” are a comparatively newer concept. Leading upper-level ontologies such as SUMO or PROTON propose their own binding protocols to their “lower” domains, but that approach takes place within the construct of the parent upper ontology and language. Such designs are not yet generalized solutions. By far the most promising generalized binding solution is the SKOS (Simple Knowledge Organization System). Because of its importance, the next section is devoted to it.
Finally, with respect to federated approaches, there are quite a few software tools that have been developed to aid or promote some of these specific methods. For example, about twenty of the software applications in my Sweet Tools listing of 500+ semantic Web and -related tools could be interpreted as aiding ontology mapping or creation. You may want to check out some of these specific tools depending on your preferred approach.
SKOS, or the Simple Knowledge Organization System, is a formal language and schema designed to represent such structured information domains as thesauri, classification schemes, taxonomies, subject-heading systems, controlled vocabularies, or others; in short, most all of the “loosely defined” ontology approaches discussed herein. It is a W3C initiative more fully defined in its SKOS Core Guide.
SKOS is built upon the RDF data model of the subject-predicate-object “triple.” The subjects and objects are akin to nouns, the predicate a verb, in a simple Dick-sees-Jane sentence. Subjects and predicates by convention are related to a URI that provides the definitive reference to the item. Objects may be either a URI resource or a literal (in which case it might be some indexed text, an actual image, number to be used in a calculation, etc.).
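The triple pattern is simple enough to demonstrate with plain tuples, using the Dick-sees-Jane sentence from the text. The URIs are illustrative placeholders, and the lookup function is a toy stand-in for what a real triple store indexes efficiently.

```python
# Subject-predicate-object triples; objects may be URIs or literals
triples = [
    ("http://example.org/Dick", "http://example.org/sees",
     "http://example.org/Jane"),
    ("http://example.org/Jane", "http://example.org/age", 7),  # literal object
]

def objects(store, subject, predicate):
    """Return all objects matching a given subject and predicate."""
    return [o for s, p, o in store if s == subject and p == predicate]

print(objects(triples, "http://example.org/Dick", "http://example.org/sees"))
# ['http://example.org/Jane']
```

Note the second triple's object is a bare literal (a number), exactly the URI-or-literal distinction described above.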
Being an RDF Schema simply means that SKOS adds some language and defined relationships to this RDF baseline. This is a bit of recursive understanding, since RDFS is itself defined in RDF by virtue of adding some controlled vocabulary and relations. The power, though, is that these schema additions are also easily expressed and referenced.
This RDFS combination can thus be shown as a standard RDF triple graph, but with the addition of the extended vocabulary and relations:
The power of the approach arises from the ability of the triple to express virtually any concept, further extended via the RDFS language defined for SKOS. SKOS includes concepts such as “broader” and “narrower”, which enable hierarchical relations to be modeled, as well as “related” and “member” to support networks and arrays, respectively.
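A small sketch shows how the broader/narrower constructs yield hierarchy: model each concept's broader concept as a mapping, then walk up the chain (the transitive closure of the relation). The concept scheme here is a made-up fragment, not an actual SKOS vocabulary, and real SKOS would of course express these links as triples.

```python
# skos:broader modeled as concept -> its (single, for simplicity) broader concept
broader = {
    "house cat": "domestic cat",
    "domestic cat": "cat",
    "lynx": "cat",
    "cat": "mammal",
}

def broader_transitive(concept):
    """Walk the broader chain upward from a concept to the top."""
    chain = []
    while concept in broader:
        concept = broader[concept]
        chain.append(concept)
    return chain

print(broader_transitive("house cat"))  # ['domestic cat', 'cat', 'mammal']
```

The inverse mapping gives “narrower” for free, and a separate symmetric mapping would model “related” links that cut across the hierarchy.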
We can visualize this transforming power by looking at how an “ontology” in a totally foreign scheme can be related to the canonical SKOS scheme. In the figure below the left-hand portion shows the native hierarchical taxonomy structure of the UK Archival Thesaurus (UKAT), next as converted to SKOS on the right (with the overlap of categories shown in dark purple). Note that the hierarchical relationships visualize better in a taxonomy, but that the RDF graph model used by SKOS allows a richer set of additional relationships, including related concepts and alternative names:
SKOS also has a rich set of annotation and labeling properties to enhance human readability of schema developed in it. There is also a useful draft schema that the W3C’s SWEO (Semantic Web Education and Outreach) group is developing to organize semantic Web-related information.
Combined, these constructs provide powerful mechanisms for giving contributory ontologies a common conceptualization. When added to other sibling RDF schema such as FOAF or SIOC or DOAP, still additional concepts can be collated.
While not addressed directly in this piece, it is obviously of first importance to have content with structure before the questions of connecting that information can even arise. Then, that structure must also be available in a form suitable for merging or connection.
At that point, the subjects of this posting come into play.
We see that the daily Web has a diversity of schema or ontologies “loosely defined” for representing the structure of the content. These representations can be transferred to more complex schema, but not in the opposite direction. Moreover, the semantic basis for how to make these mappings also needs some common referents.
RDF provides the canonical data model for the data transfers and representations. RDFS, especially in the form of SKOS, appears to form one basis for the syntax and language for these transformations. And SKOS, with other schema, also appears to offer much of the appropriate “middle ground” for data relationships mapping.
However, lacking in this story is a referential structure for subject relationships. (Also lacking are the ultimately critical domain specifics required for actual implementation.)
Abstract concepts of interest to philosophers and deep thinkers have been given much attention. Sadly, to date, the concrete subject structures by which tangible things and tangible actions can be shared are still very, very weak. We are stubbing our toes on the rocks while we gaze at the heavens.
Yet, despite this, simple and powerful infrastructures are well in-hand to address all foreseeable syntactic and semantic issues. There appear to be no substantive limits to needed next steps.
What does it mean to interoperate information on the Web? With linked data and other structured data now in abundance, why don’t we see more information effectively combined? Why express your information as linked data if no one is going to use it?
Interoperability comes down to the nature of things and how we describe those things or quite similar things from different sources. This was the major thrust of my recent keynote presentation to the Dublin Core annual conference. In that talk I described two aspects of the semantic “gap”:
I’ll discuss the first “gap” in a later post. What we’ll discuss here is the fact that most relationships between putatively same things on the Web are rarely exact, and are most often approximate in nature.
The use of labels for matching or descriptive purposes was the accepted practice in early libraries and library science. However, with the move to electronic records and machine bases for matching, appreciation for ambiguities and semantics has come to the fore. Labels are no longer an adequate — let alone a sufficient — basis for matching references.
The ambiguity point is pretty straightforward. Refer to Jimmy Johnson by his name, and you might be referring to a former football coach, a NASCAR driver, a former boxing champ, a blues guitarist, or perhaps even a plumber in your home town. Or perhaps none of these individuals. Clearly, the label “Jimmy Johnson” is insufficient to establish identity.
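The failure mode is easy to demonstrate: several distinct individuals share one label, and only an identifier (URI-style here) pins down a single one. The records and URIs below are invented for illustration.

```python
# Hypothetical records: three different people, one shared label
records = [
    {"id": "http://example.org/person/jj-coach",
     "label": "Jimmy Johnson", "role": "football coach"},
    {"id": "http://example.org/person/jj-driver",
     "label": "Jimmy Johnson", "role": "NASCAR driver"},
    {"id": "http://example.org/person/jj-guitar",
     "label": "Jimmy Johnson", "role": "blues guitarist"},
]

# Matching on the label is ambiguous...
by_label = [r for r in records if r["label"] == "Jimmy Johnson"]
print(len(by_label))  # 3: the label matches three different people

# ...while matching on the identifier is exact
by_id = [r for r in records if r["id"] == "http://example.org/person/jj-driver"]
print(len(by_id))     # 1: the identifier pins down exactly one
```

This is the core argument for URIs as identity on the semantic Web: the label becomes an annotation of the thing, not its identity.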
Of course, not all things are named entities such as a person’s name. Some are general things or concepts. But, here, semantic heterogeneities can also lead to confusion and mismatches. It is always helpful to revisit the sources and classification of semantic heterogeneities, which I first discussed at length nearly five years ago. Here is a schema classifying more than 40 categories of potential semantic mismatches:
- STRUCTURE
  - Generalization / Specialization
  - Internal Path Discrepancy
  - Missing Item
    - Content Discrepancy
    - Attribute List Discrepancy
- DOMAIN
  - Schematic Discrepancy
    - Element-value to Element-label Mapping
    - Attribute-value to Element-label Mapping
    - Element-value to Attribute-label Mapping
    - Attribute-value to Attribute-label Mapping
  - Scale or Units
- DATA
  - Data Representation
    - Primitive Data Type
  - ID Mismatch or Missing ID
- LANGUAGE
  - Encoding
    - Ingest Encoding Mismatch
    - Ingest Encoding Lacking
    - Query Encoding Mismatch
    - Query Encoding Lacking
  - Parsing / Morphological Analysis Errors (many)
  - Syntactical Errors (many)
  - Semantic Errors (many)
Even with the same label, two items in different information sources can refer generally to the same thing, but may not be the same thing or may define it with a different scope and content. In broad terms, these mismatches can be due to structure, domain, data or language, with many nuances within each type.
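To make one of these categories concrete, here is a small sketch (plain Python, with invented data) of a “Scale or Units” mismatch: the same attribute from two sources cannot be meaningfully compared until the units are reconciled.

```python
# "Scale or Units" mismatch, sketched with hypothetical records: two
# sources report the same attribute in different units. A naive raw
# comparison fails; normalizing to a common unit reconciles them.

TO_METERS = {"m": 1.0, "cm": 0.01, "ft": 0.3048}  # conversion factors

source_a = {"thing": "Tower", "height": 93, "unit": "m"}
source_b = {"thing": "Tower", "height": 9300, "unit": "cm"}

def same_height(a, b, tolerance=0.01):
    """Compare heights after normalizing both values to meters."""
    ha = a["height"] * TO_METERS[a["unit"]]
    hb = b["height"] * TO_METERS[b["unit"]]
    return abs(ha - hb) <= tolerance

print(source_a["height"] == source_b["height"])  # False: raw values differ
print(same_height(source_a, source_b))           # True: same after normalizing
```

The point is not the arithmetic, but that a matcher working only on surface values, like one working only on labels, misses identity that a semantics-aware comparison recovers.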
The sameAs approach used by many of the inter-dataset linkages in linked data ignores these heterogeneities. In a machine and reasoning sense, indeed even in a linking sense, such assertions can be as nonsensical as ascribing facts about the blues guitarist to the plumber.
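A minimal sketch of why this matters (plain Python, with hypothetical triples, standing in for what an OWL reasoner entails): a naive process that merges the attributes of any two resources asserted sameAs will happily commingle the plumber with the blues guitarist.

```python
# Naive owl:sameAs "smushing": merge all attributes of resources
# asserted to be identical. Triples are (subject, predicate, object);
# the data below is invented for illustration only.

triples = [
    ("ex:JimmyJohnson_guitarist", "plays", "blues guitar"),
    ("ex:JimmyJohnson_plumber", "occupation", "plumber"),
    # A careless linkage: a label match taken as identity.
    ("ex:JimmyJohnson_guitarist", "owl:sameAs", "ex:JimmyJohnson_plumber"),
]

def smush(triples):
    """Merge attributes across owl:sameAs links (union-find style)."""
    canon = {}
    def find(x):
        while canon.get(x, x) != x:
            x = canon[x]
        return x
    for s, p, o in triples:
        if p == "owl:sameAs":
            canon[find(o)] = find(s)
    merged = {}
    for s, p, o in triples:
        if p != "owl:sameAs":
            merged.setdefault(find(s), {}).setdefault(p, []).append(o)
    return merged

merged = smush(triples)
# All facts now hang off a single node: one "individual" who both
# "plays blues guitar" and has "occupation: plumber" -- nonsense if
# these are two different people.
```

This is exactly the entailment that an unqualified sameAs licenses; weaker, approximate predicates avoid forcing it.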
The first example is the seemingly simple idea of “cats”. In one source, the focus might be on house cats, in another domestic cats, and in a third, cats as pets. Are these ideas the same thing? Now, let’s bring in some taxonomic information about the cat family, the Felidae. Now, the idea of “cats” includes lynx, tigers, lions, cougars and many other kinds of cats, domestic and wild (and, also extinct!). Clearly, the “cat” label used alone fails us miserably here.
Another example is one that Fred Giasson and I brought up a year ago in When Linked Data Rules Fail . That piece discussed many poor practices within linked data, and used as one case the treatment of articles in the New York Times about the (deceased) actor Paul Newman. The NYT dataset describes the newspaper’s historical pool of articles about people. Its record about Paul Newman covers that pool of articles, with attributes such as first published and so forth, but with no direct attribute information about Paul Newman the person. The NYT then asserted a sameAs relationship with external records in Freebase and DBpedia, which commingles person attributes like birth, death and marriage with article attributes such as first and last published. Clearly, the NYT has confused the topic of a record (Paul Newman) with the nature of that record (articles about topics). This misunderstanding of the “thing” at hand makes the entailed assertions from the multiple sources illogical and useless.
Our third example is the concept or idea or named entity of Great Britain. Depending on usage and context, Great Britain can refer to quite different scopes and things. In one sense, Great Britain is an island. In a political sense, Great Britain can comprise the territory of England, Scotland and Wales. But, even more precise understandings of that political grouping may include a number of outlying islands such as the Isle of Wight, Anglesey, the Isles of Scilly, the Hebrides, and the island groups of Orkney and Shetland. Sometimes the Isle of Man and the Channel Islands, which are not part of the United Kingdom, are fallaciously included in that political grouping. And, then, in a sporting context, Great Britain may also include Northern Ireland. Clearly, these, plus other confusions, can mean quite different things when referring to “Great Britain.” So, without definition, a seemingly simple question such as what the population of Great Britain is could legitimately return quite disparate values (not to mention the time dimension and how that has changed boundaries as well!).
These cases are quite typical of what “things” mean when provided from different sources with different perspectives and different contexts. If we are to get meaningful interoperation or linkage of these things, we clearly need some different linking predicates.
The realization that many connections across datasets on the Web need to be “approximate” is growing. Here is the result of an informal survey of leading predicates in this regard:
Besides the standard OWL and RDFS predicates, SKOS, UMBEL and DOLCE provide the largest number of choices above. In combination, these predicates probably provide a good scoping of “approximateness” in mappings.
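As an illustrative (not canonical) sketch, a mapping pipeline might choose among the standard SKOS mapping predicates based on the judged degree and direction of overlap between two concepts. The function name and thresholds here are invented for the example; only the predicate names are real SKOS properties.

```python
# Hypothetical chooser for SKOS mapping predicates, illustrating the
# spectrum from exact to merely related alignments. The numeric
# thresholds are invented for this sketch, not part of any standard.

def skos_mapping_predicate(overlap, direction=None):
    """Pick a SKOS mapping predicate for an alignment.

    overlap:   judged conceptual overlap in [0, 1]
    direction: "broader" if the target is more general than the
               source, "narrower" if more specific, else None
    """
    if overlap >= 0.95:
        return "skos:exactMatch"      # interchangeable in most applications
    if direction == "broader":
        return "skos:broadMatch"      # target is more general
    if direction == "narrower":
        return "skos:narrowMatch"     # target is more specific
    if overlap >= 0.7:
        return "skos:closeMatch"      # similar, but not interchangeable
    return "skos:relatedMatch"        # associated, weaker still

# e.g. "house cats" mapped to the Felidae family: a broader, inexact match
print(skos_mapping_predicate(0.5, direction="broader"))  # skos:broadMatch
```

The design point is that the predicate carries the hedge: a downstream reasoner entails much less from a broadMatch than from a sameAs.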
It is time for some leadership to emerge to provide a more canonical set of linking predicates for these real-world connection requirements. It would also be extremely useful to have such a canonical set adopted by some leading reasoners such that useful work could be done against these properties.
Structured Dynamics and Ontotext are pleased to announce the latest release of UMBEL, version 0.80. It has been more than a year since the last update of UMBEL, and well past earlier announced targets for this upgrade. UMBEL was first publicly released as version 0.70 on July 16, 2008.
UMBEL (Upper Mapping and Binding Exchange Layer) has two roles. It is firstly a vocabulary for building reference ontologies to guide the interoperation of domain information. It is secondly a reference ontology in its own right that contains about 21,000 general reference concepts. With more than two years of practical experience with UMBEL, much has been learned.
This learning has now been reflected into five major changes for the system, embodying numerous minor changes. I summarize these major changes below. The formal release of UMBEL v. 0.80 is also being accompanied by a complete revamping and updating of the project’s Web site. I hope you will find these changes as compelling and exciting as we do.
In the broader context, it is probably best to view this release as but the interim first step of a two-step release sequence leading to UMBEL version 1.00. We are on track to release version 1.00 by the end of this year. This second step will include a complete mapping to the PROTON upper-level ontology and the re-organization and categorization of Wikipedia content into the UMBEL structure. We anticipate the pragmatic challenges in this massive effort will also inform some further refinements to UMBEL itself, which will also lead to further changes in its specification.
Nonetheless, UMBEL v. 0.80 does embody most of the language and structural changes anticipated over this evolution. It is fully ready for use and evaluation; it will, for example, be incorporated into a next version of FactForge. But, do be aware that the major revisions discussed herein are subject to further refinements as the efforts leading to version 1.00 culminate over the next few weeks.
Let’s now overview these major changes in UMBEL v. 0.80.
The genesis of UMBEL more than three years ago was the recognition that data interoperability on the semantic Web depended on shared reference concepts to link related content. We spent much effort to construct such a reference structure with about 21,000 concepts. That purpose remains.
But, the way in which we created this structure — its vocabulary — has also proven to have value in its own right. We have now applied the same basic approach used to construct the original UMBEL to multiple, specific domain ontologies. With use, it has become clear that the vocabulary for creating reference ontologies is on an equal footing with the reference concepts themselves.
With this understanding has come clarity of role and description of UMBEL. With version 0.80, we now have explicitly split and defined these roles and files.
Thus, UMBEL’s first purpose is to provide a general vocabulary (the UMBEL Vocabulary) of classes and predicates for describing domain ontologies, with the specific aim of promoting interoperability with external datasets and domains. It is usable exclusive of the UMBEL Reference Concept Ontology.
The UMBEL Vocabulary recognizes that different sources of information have different contexts and different structures. A meaningful vocabulary is necessary that can express potential relationships between two information sources with respect to their differences in structure and scope. By nature, these connections are not always exact. Means for expressing the “approximateness” of relationships are essential.
The vocabulary has been greatly simplified from earlier versions (see Major Change #2 below); it now defines two classes:
These are explained further below. And, the vocabulary has 10 properties:
(Note: the latter four are also in SKOS.)
In addition, UMBEL re-uses certain properties from external vocabularies. These classes and properties are used to instantiate the UMBEL Reference Concept ontology (see next), and to link Reference Concepts to external ontology classes. For more detail on the vocabulary see Part I: Vocabulary Specification in the specifications.
The second purpose of UMBEL is to provide a coherent framework of broad subjects and topics, the “reference concepts” or RefConcepts, expressed as the UMBEL Reference Concept Ontology. The RefConcepts act as binding nodes for mapping relevant Web-accessible content, again with the specific aim of promoting interoperability and of enabling reasoning over a coherent reference structure and its linked resources. UMBEL presently has about 21,000 of these reference concepts drawn from the Cyc knowledge base, which are organized into more than 30 mostly disjoint SuperTypes (see Major Change #3).
The UMBEL Reference Concept Ontology is, in essence, a content graph of subject nodes related to one another via narrower-than relations. In turn, these internal UMBEL RefConcepts may be related to external classes and individuals (instances and named entities) via a set of relational, equivalent, or alignment predicates (the UMBEL Vocabulary; see above).
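To make the content-graph idea concrete, here is a small sketch (plain Python, with made-up concept names) of how a narrower-than hierarchy supports transitive queries, retrieving everything subsumed under a reference concept:

```python
# A toy subject graph of narrower-than relations among hypothetical
# reference concepts; (child, parent) means child is narrower than parent.

narrower_than = [
    ("DomesticCat", "Felidae"),
    ("Lion",        "Felidae"),
    ("Felidae",     "Mammal"),
    ("Mammal",      "Animal"),
]

def descendants(concept, edges):
    """All concepts transitively narrower than `concept`."""
    children = {}
    for child, parent in edges:
        children.setdefault(parent, set()).add(child)
    found, stack = set(), [concept]
    while stack:
        for c in children.get(stack.pop(), ()):
            if c not in found:
                found.add(c)
                stack.append(c)
    return found

print(sorted(descendants("Mammal", narrower_than)))
# ['DomesticCat', 'Felidae', 'Lion']
```

A binding node works the same way: content mapped anywhere under a RefConcept is reachable by walking these transitive relations.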
The actual RefConcepts used are the least-changed part of UMBEL from previous versions, and still have the same identifiers as prior versions. The Reference Concept Ontology now uses a recently updated release of the OpenCyc KB v3. Cycorp also added some additional mapping predicates in this release that allow items such as fields of study to be added to the structure. (Thanks, Cycorp!)
Here is a large-graph view of the 21,000 reference concepts in the ontology (click to expand; large file):
More detail on the RefConcepts is provided in Part II: Reference Concepts Specification of the full specifications.
Another set of major changes was the simplification and streamlining of the predicates and construction of the UMBEL Vocabulary. Again, the specifications detail these changes, but the significant ones include:
See further Part II in the full specifications.
Shortly after the first public release of UMBEL, it was apparent that the 21,000 reference concepts tended to “cluster” into some natural groupings. Further, upon closer investigation, it was also apparent that most of these concepts were disjoint with one another. As subsequent analysis showed, more fully detailed in the Annex G document, fully 75% of the reference concepts in the UMBEL ontology are disjoint with one another.
Natural clusters provide a tractable way to access and manage some 21,000 items. And, large degrees of disjointedness between concepts also can lead to reasoning benefits and faster processing and selection of those items.
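A sketch of how that disjointness pays off in practice (the SuperType names and assignments below are illustrative, not UMBEL's actual ones): if two concepts sit in disjoint SuperTypes, a matcher or reasoner can reject the pairing immediately, with no deeper comparison.

```python
# Illustrative SuperType assignments and a disjointness test. If two
# concepts belong to disjoint SuperTypes, they cannot be the same
# thing, so candidate matches can be pruned cheaply.

supertype_of = {           # hypothetical assignments
    "Lion":       "Animals",
    "PaulNewman": "Persons",
    "NYTArticle": "ConceptualWorks",
}

# Pairs of SuperTypes asserted to be disjoint (symmetric).
disjoint_pairs = {
    frozenset({"Animals", "Persons"}),
    frozenset({"Persons", "ConceptualWorks"}),
}

def can_match(a, b):
    """False if the concepts' SuperTypes are asserted disjoint."""
    pair = frozenset({supertype_of[a], supertype_of[b]})
    return pair not in disjoint_pairs

# A person can never be the same thing as a pool of articles:
print(can_match("PaulNewman", "NYTArticle"))  # False
```

Note this is precisely the check that would have caught the NYT's Paul Newman confusion discussed earlier.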
For these reasons, a dedicated effort was undertaken to analyze and assign all UMBEL reference concepts to a new class of SuperTypes. SuperTypes are now a major enhancement in UMBEL v. 0.80. The assignment results and the SuperType specification are discussed in Part II, with full analysis results in Annex G.
In addition, all of these SuperTypes are clustered into nine “dimensions”, which are useful for aggregation and organizational purposes, but which have no direct bearing on logic assertions or disjointedness testing. These nine dimensions, with their associated SuperTypes, are shown in the table to the right. Note the last two dimensions (and four SuperTypes), shown in italics, are by definition non-disjoint.
The construct of the SuperType may be applied to any domain ontology constructed with the UMBEL Vocabulary. The UMBEL Reference Concept Ontology includes all disjoint assertions for all of its RefConcepts.
One of the most challenging improvements in the new UMBEL version 0.80 was to make its vocabulary and ontology compliant with the new OWL 2 Web Ontology Language. We wanted to convert to OWL 2 in order to:
The latter reason is the most important given the reference role of UMBEL and ontologies based on the UMBEL Vocabulary. It is not unusual to want to treat things either as a class or an instance in an ontology. Among other aspects, this is known as metamodeling and it can be accomplished in a number of ways. “Punning” is one metamodeling technique that importantly allows us to use concepts in ontologies as either classes or instances, depending on context.
To better understand why we should metamodel, let’s look at a couple of examples, both of which combine organizing categories of things and then describing or characterizing those things. This dual need is common to most domains.
As one example, let’s take a categorization of apes as a kind of mammal, which is then a kind of animal. In these cases, ape is a class, which relates to other classes, and apes may also have members, be they particular kinds of apes or individual apes. Yet, at the same time, we want to assert some characteristics of apes, such as being hairy, having two legs and two arms, lacking tails, being capable of walking bipedally, having grasping hands, and, for some species, being endangered. These characteristics apply to the notion of apes as an instance.
As another example, we may have the category of trucks, which may further be split by truck type, brand, engine type, and so forth. Yet, again, we may want to characterize that a truck is designed primarily for the transport of cargo (as opposed to automobiles, designed for the transport of people), or that trucks may have different driver’s license requirements or different license fees than autos. These descriptive properties refer to trucks as an instance.
These mixed cases combine both the organization of concepts in relation to one another and with respect to their set members, with the description and characterization of these concepts as things unto themselves. This is a natural and common way to express most any domain of interest. It is also a general requirement for a reference ontology, as we use in the sense of UMBEL.
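The punning idea can be sketched in plain Python (the identifiers and property names below are invented for illustration): the same term appears both in class axioms and as an individual carrying its own assertions, and which reading applies depends on the context of the triple.

```python
# OWL 2 punning, sketched with hypothetical triples: the identifier
# "ex:Ape" is read as a CLASS in subclass axioms, and as an
# INDIVIDUAL when it carries descriptive assertions of its own.

triples = [
    ("ex:Ape",    "rdfs:subClassOf", "ex:Mammal"),         # class reading
    ("ex:Mammal", "rdfs:subClassOf", "ex:Animal"),         # class reading
    ("ex:Ape",    "ex:conservationStatus", "endangered"),  # individual reading
    ("ex:Ape",    "ex:locomotion", "bipedal-capable"),     # individual reading
]

CLASS_PREDICATES = {"rdfs:subClassOf", "owl:equivalentClass"}

def roles(term, triples):
    """Which readings a term receives: 'class', 'individual', or both."""
    found = set()
    for s, p, o in triples:
        if s == term:
            found.add("class" if p in CLASS_PREDICATES else "individual")
    return found

print(sorted(roles("ex:Ape", triples)))  # ['class', 'individual']
```

An OWL 2 reasoner keeps the two readings logically separate even though they share one IRI, which is exactly what a reference ontology like UMBEL needs.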
When we combine this “punning” aspect of OWL 2 with our standard way of relating concepts in a hierarchical manner, this general view of the predicates within UMBEL emerges (click to expand):
On the left-hand side (quadrants A and C) is the “class” view of the structure; the right-hand side is the “individual” (or instance) view of the structure (quadrants B and D). These two views represent alternative perspectives for looking at the UMBEL reference concepts based on metamodeling.
The top side of the diagram (quadrants A and B) is an internal view of UMBEL reference concepts (RefConcepts) and their predicates (properties). This internal view applies to the UMBEL Reference Concept Ontology or to domain ontologies based on the UMBEL Vocabulary. These relationships show how RefConcepts are clustered into SuperTypes or how hierarchical relationships are established between Reference Concepts (via the skos:broaderTransitive relations). The concept relationships and their structure constitute the “class” view (quadrant A); treating these concepts as instances in their own right and relating them to SKOS gives the right-hand “individual” (instance) view (quadrant B).
The bottom of the diagram (quadrants C and D) shows either classes or individuals in external ontologies. The key mapping predicates cross this boundary (the broad dotted line) between UMBEL-based ontologies and external ontologies. See further Part I in the full specification for a more detailed discussion of this figure and its relation to metamodeling.
These changes also warranted better documentation and a better project Web site. From a documentation standpoint, the organization was simplified between the actual specifications and related annexes. Also, because of a more collaborative basis resulting from the new partnership with Ontotext, we also established an internal wiki following TechWiki designs. Initial authoring occurs there, with final results re-purposed and published on the project Web site.
The UMBEL Web site also underwent a major upgrade. It is now based on Drupal, and therefore will be able to embrace our conStruct advances in visualization and access over time. We also posted the full Reference Concept Ontology as an OWLDoc portal.
We feel these changes have now resulted in a clean and easy-to-maintain framework for the next phase in UMBEL’s growth and maturation.
As noted in the intro, this version is but an interim step to the pending next release of UMBEL v. 1.00. This next version will provide mappings to leading ontologies and knowledge bases, as well as the upgrade of existing Web services and other language support features. Intended production or commercial uses would best await this next version.
However, the current version 0.80 is fully consistent and OWL 2-compliant. It loads and can be reasoned over with OWL 2 reasoners (see those available with Protégé 4.1, for example). We encourage you to download, test and comment upon this version. Specifics are:
As co-editors, Frédérick Giasson and I are extremely enthused about the changes and cleanliness of version 0.80. It is already helping our client work. We think these improvements are a good harbinger for UMBEL version 1.00 to come by the end of the year. We hope you agree.
This past Friday the Peg project was unveiled for the first time to an enthusiastic welcome at the Winnipeg Poverty Reduction Partnership Forum. A beta version of its website (www.mypeg.ca) was also launched. Peg is an innovative Web portal for community indicators of well-being for the city of Winnipeg, Manitoba. First conceived in 2002, with much subsequent refinement, its strong consortium of members from the local community and recent backing have now allowed it to be shared with the public.
Since early this year, Structured Dynamics has been the lead technical contractor on the project. But Peg is about people and involvement, not technology. Peg is an effort of community and perspectives and information and stories, all designed to coalesce ideas on how to make Winnipeg a better community moving forward. So, while the technology underlying the site is innovative (yes, we’re proud), more so is the effort and vision of the community making it happen. Though just a beta release, the current site and the commitment behind it point to some exciting future developments.
Here is the main screen for Peg (clicking on any of the screen captures below will take you directly to the relevant part of the site):
Winnipeg’s community indicator system (CIS) is organized around themes, cross-cutting issues that bridge across themes, and indicators and supporting data to track and measure the city’s well-being. Peg’s major themes, agreed upon after extensive community consultation, are: basic needs; health; education & learning; social vitality; governance; built environment; economy; and natural environment. In this first beta release, the emphasis has been on the cross-cutting issue of poverty and some of the indicators to track it.
The perspective being brought to bear on these questions of well-being is comprehensive and embracing. Data and demographics and quantitative indicators of well-being are matched with stories and narratives from affected parties, videos, and a variety of display and visualization options. Much of the supporting data is organized by the 236 neighborhoods in Winnipeg, or broader community areas, with comparative baselines to city, province and nation. The information is both hard and soft, and presented in engaging, exciting and dynamic ways. Using the best of current social media, Peg is meant to be a virtual meeting place and town hall for the public to share and engage one another.
This beta is but a first expression of Peg’s longer-term vision, yet already has the backbone to take on these labors. A concept explorer allows the public to explore and navigate through the entire information space; much information is mapped and presented in locational relevance; narratives and stories and videos are linked contextually to topics and issues; and many, many dashboards can be created and displayed for showing trends and comparing neighborhoods, and letting the data speak visually:
The current beta is but a start. The Peg project, in continued consultation with stakeholders, will be developing further indicators for each of its eight major themes, providing information about past and current trends, and expanding into additional cross-cutting issues. Daily, the site will see an increase in richness and relevance.
Peg has been spearheaded by the United Way of Winnipeg and the International Institute for Sustainable Development (IISD), also based in Winnipeg, with the partnership of the Province of Manitoba, the City of Winnipeg, Health in Common, and a cross section of community interests and members across the city. Peg is a non-profit effort, and is embarking on a new three-year work plan to oversee further funding and expansion.
Peg is governed by a Steering Committee with budgetary and strategic responsibilities. Peg also works with an Engagement Group — a broad-based group of Winnipeggers — that serves as a testing ground for ideas, direction and policy. The site provides credits for the various entities involved and responsible for the effort.
IISD has provided overall project management for the current effort. As personal thanks, we’d especially like to recognize Connie Walker, Laszlo Pinter, Christa Rust and Charles Thrift. Tactica, also of Winnipeg, has been the lead graphics and site designer for Peg. SD has worked closely with them to ensure a smooth launch, and they’ve done a great job. Thanks to all!
Of course, for more on the project, go directly to the Peg site or those of its other major participants and contributors. But, in our role as implementers of the behind-the-scenes wizardry powering the site, we would be remiss if we did not mention a couple of technical items.
As lead technical developer, SD was responsible for all data access, management, development and visualization software for the site. The site was developed in Drupal, with Virtuoso as the RDF data store and Solr for faceted site search. As part of its Open Semantic Framework, based on the Citizen Dan local government appliance, SD contributed and extended major open source software for Peg. These contributions included the structWSF Web services framework, conStruct modules for linking the system into Drupal, and the Flex-based semantic Components including the explorer, map, story viewer, browse/search, dashboard, workbench and back office widgets. We also developed the adaptive ontology driving the entire site, based on the Peg framework vocabulary already hashed out by the community participants.
During the course of the project we developed an entirely new workbench capability for creating new, persistent dashboards. We extended the sRelationBrowser semantic component with complete and flexible theming and styling; virtually all aspects of nodes, edges and behavior have now been exposed for tailoring, including fonts, colors and use of images. We enhanced the irON format to make it easier for project participants to submit spreadsheet datasets to the site for new indicator data. We will be migrating these advances to our existing open source software over the coming weeks. Check Fred Giasson’s blog for release details; he has also begun a series on the technology details.
But, in my opinion, what is most remarkable about all of this is that these bloody details are completely hidden from the user. Though real geeks can get RDF and linked data via export options, standard users simply interact with and experience the site. No triples are shoved in their face, no technology screams out for attention, and ne’er a URI is to be found. The thing simply works, all the while being flexible, contextual, attractive and fun.
And that, folks, I submit, is semantics done right!
Jennifer Zaino of SemanticWeb.com has just published an interview with me regarding our recently announced partnership with Ontotext and its relation to linked data. Thanks, Jenny, for a fair and accurate representation of our conversation!
Some of the questions related to reference vocabularies and linking predicates are somewhat hard to convey. Jenny did a very nice job capturing some nuanced concepts. I invite you to read the article yourself and judge.