Posted: May 29, 2007

The Why, How and What of the Semantic Web are Becoming Clear — Yet the Maps and Road Signs to Guide Our Way are Largely Missing

There has been much recent excitement surrounding RDF, linked data, “RDFizers” and GRDDL for converting existing structured information to RDF. The belief is that 2007 is the breakout year for the semantic Web.

The why of the semantic Web is now clear. The how of the semantic Web is now appearing clear, built around RDF as the canonical data model and the availability and maturation of a growing set of tools. And the what is also becoming clear, with the massive new datastores of DBpedia, Wikipedia3, the HCLS demo, Musicbrainz, Freebase, and the Encyclopedia of Life being only some of the most recent and visible exemplars.

Yet the where aspect seems to be largely missing.

By where I mean: Where do we look for our objects or data? If we have new objects or data, where do we plug into the system? Amongst all possible information and domains, where do we fit?

These questions seem simplistic, even elemental. It is striking that we cannot yet say where all of this data now emerging in RDF sits in relation to other data — what the overall frame of reference is — but my investigations point to exactly such a gap. In other words, a key piece of the emerging semantic Web infrastructure — where is this stuff? — seems to be missing. This gap spans domains and standard ontologies alike.

What are the specific components of this missing where, this missing infrastructure? I believe them to be:

  • Lack of a central look-up point for where to find this RDF data in reference to desired subject matter
  • Lack of a reference subject context for where this relevant RDF data fits; where can we place this data in a contextual frame of reference — animal, mineral, vegetable?
  • Lack of an open means by which any content structure — from formal ontologies to “RDFized” documents to complete RDF data sets — can “bind” or “map” to other data sets relevant to its subject domain
  • Lack of a registration or publication mechanism for data sets that do become properly placed, that is, the where of finding SPARQL or similar query endpoints (see the sketch after this list), and
  • In filling these gaps, the need for a broad community process to give these essential infrastructure components legitimacy.
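To make the first and fourth gaps above a bit more concrete, here is a minimal sketch (in Python with rdflib) of what a subject-based registry lookup could look like once such infrastructure exists. Everything in it is a hypothetical placeholder: the namespaces, the reg:coversSubject and reg:sparqlEndpoint properties, the data set URIs and the endpoint addresses are mine, not part of any existing standard or proposal.

```python
# A minimal sketch (not a proposal) of a subject-based registry: a small RDF
# graph recording which SPARQL endpoints have registered themselves against a
# reference subject. All namespaces, properties and URLs are hypothetical.
from rdflib import Graph, Literal, Namespace, URIRef

SUBJ = Namespace("http://example.org/reference-subjects/")  # hypothetical subject backbone
REG = Namespace("http://example.org/registry#")             # hypothetical registry vocabulary

g = Graph()
g.bind("subj", SUBJ)
g.bind("reg", REG)

# Two data sets register themselves against the reference subject "Music"
g.add((URIRef("http://example.org/datasets/musicbrainz"), REG.coversSubject, SUBJ.Music))
g.add((URIRef("http://example.org/datasets/musicbrainz"), REG.sparqlEndpoint,
       Literal("http://example.org/musicbrainz/sparql")))
g.add((URIRef("http://example.org/datasets/discography"), REG.coversSubject, SUBJ.Music))
g.add((URIRef("http://example.org/datasets/discography"), REG.sparqlEndpoint,
       Literal("http://example.org/discography/sparql")))

# "Where do we look?" becomes a simple query against the registry
results = g.query("""
    SELECT ?dataset ?endpoint WHERE {
        ?dataset reg:coversSubject subj:Music ;
                 reg:sparqlEndpoint ?endpoint .
    }""", initNs={"reg": REG, "subj": SUBJ})

for dataset, endpoint in results:
    print(dataset, endpoint)
```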

I discuss these missing components in a bit more detail below, concluding with some preliminary thoughts on how the problem of this critical infrastructure can be redressed. The good news, I believe, is that these potholes on the road to the semantic Web can be relatively easily and quickly filled.

The Lack of Road Signs Causes Collisions and Missed Turns

I think it is fair to say that structure on the current Web is a jumbled mess.

As my recent Intrepid Guide to Ontologies pointed out, there are at least 40 different approaches (or types of ontologies, loosely defined) extant on the Web for organizing information. These approaches embrace every conceivable domain and subject. The individual data sets using these approaches span many, many orders of magnitude in size and range of scope. Diversity and chaos we have aplenty, as the illustrative diagram of this jumbled structural mess shows below.

Jumbled and Diverse Formalisms

Mind you, we are not yet even talking about whether one dot is equivalent or can be related to another dot and in what way (namely, connecting the dots via real semantics), but rather at a more fundamental level. Does one entire data set have a relation to any other data set?

Unfortunately, the essential precondition of getting data into the canonical RDF data model — a challenge in its own right — does little to guide us as to where these data sets exist or how they may relate. Even in RDF form, all of this wonderful RDF data exists as isolated and independent data sets, bouncing off of one another in some gross parody of Brownian motion.

What this means, of course, is that useful data that could be of benefit is overlooked or not known. As with problems of data silos everywhere, that blindness leads to unnecessary waste, incomplete analysis, inadequate understanding, and duplicated effort [1].

These gaps were easy to overlook when the focus of attention was on the why, what and how of the semantic Web. But, now that we are seeing RDF data sets emerge in meaningful numbers, the time is ripe to install the road signs and print up the maps. It is time to figure out where we want to go.

The Need for a Lightweight Subject Mapping Layer

As I discussed in an earlier posting, there’s not yet enough backbone to the structured Web. I believe this structure should firstly be built around a lightweight subject- or topic-oriented reference layer.

Unlike traditional upper-level ontologies (see the Intrepid Guide), this backbone is not meant to be comprised of abstract concepts or a logical completeness of the “nature of knowledge”. Rather, it is meant to be only the thinnest veneer of (mostly) hierarchically organized subjects and topic references (see more below).

This subject or topic vocabulary (at least for the backbone) is meant to be quite small: likely more than a few hundred reference subjects, but less than many thousands. (There may be considerably more terms in the overall controlled vocabulary to assist context and disambiguation.)

This “umbrella” subject structure could be thought of as the reference subject “super-structure” to which other specific ontologies could place themselves in a sort of locational or “info-spatial” context.

One way to think of these subject reference nodes is as the major destinations — the key cities, locations or interchanges — on the broader structured Web highway system. A properly constructed subject structure could also help disambiguate many common misplacements by virtue of the context of actual subject mappings.

For example, an ambiguous term such as “driver” becomes unambiguous once it is properly mapped to one of its possible topics such as golf, printers, automobiles, screws, NASCAR, or whatever. In this manner, context is also provided for other terms in that contributing domain. (For example, we would now know how to disambiguate “cart” as a term for that domain.)
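As a toy illustration of how a subject mapping supplies that disambiguating context, consider the following sketch. The sense inventory and the subject binding are invented purely for illustration; a real system would, of course, work against URIs in the reference subject structure rather than plain strings.

```python
# Toy sketch of disambiguation by subject context. The sense inventory and
# the subject binding below are invented purely for illustration.
SENSES = {
    "driver": {"Golf": "golf club", "Printers": "device driver software",
               "Automobiles": "person operating a vehicle"},
    "cart":   {"Golf": "golf cart", "Automobiles": "horse-drawn cart"},
}

def disambiguate(term: str, bound_subject: str) -> str:
    """Pick the sense of an ambiguous term, given the reference subject the
    contributing ontology has bound itself to."""
    return SENSES.get(term, {}).get(bound_subject, "unknown sense")

# A content ontology that has mapped itself to the reference subject "Golf"
print(disambiguate("driver", "Golf"))   # -> "golf club"
print(disambiguate("cart", "Golf"))     # -> "golf cart"
```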

A high-level and lightweight subject mapping layer does not warrant difficult (and potentially contentious) specificity. The point is not to comprehensively define the scope of all knowledge, but to provide the fewest choices necessary for what subject or subjects a given domain ontology may appropriately reference. We want a listing of the major destinations, not every town and parish in existence.

(That is not to say that more specific subject references won’t emerge or be appropriate for specific domains. Indeed, the hope is that an “umbrella” reference subject structure might be a tie-in point for such specific maps. The more salient issue addressed here is to create such an “umbrella” backbone in the first place.)

This subject reference “super-structure” would in no way impose any limits on what a specific community might do itself with respect to its own ontology scope, definition, format, schema or approach. Moreover, there would be no limit to a community mapping its ontology to multiple subject references (or “destinations”, if you will).

The reason for this high-level subject structure, then, is simply to provide a reference map for where we might want to go — no more, no less. Such a reference structure would greatly aid finding, viewing and querying actual content ontologies — of whatever scope and approach — wherever that content may exist on the Web.

This is not a new idea. Around the year 2000 the topic map community was active with published subject indicators (PSIs) and other attempts at topic or subject landmarks. For example, the OASIS recommendation on published subjects [2] stated:

The goal of any application which aggregates information, be it a simple back-of-book index, a library classification system, a topic map or some other kind of application, is to achieve the “collocation objective;” that is, to provide binding points from which everything that is known about a given subject can be reached. In topic maps, binding points take the form of topics; for a topic map application to fully achieve the collocation objective there must be an exact one-to-one correspondence between subjects and topics: Every topic must represent exactly one subject and every subject must be represented by exactly one topic.
When aggregating information (for example, when merging topic maps), comparing ontologies, or matching vocabularies, it is crucially important to know when two topics represent the same subject, in order to be able to combine them into a single topic. To achieve this, the correspondence between a topic and the subject that it represents needs to be made clear. This in turn requires subjects to be identified in a non-ambiguous manner.

The identification of subjects is not only critical to individual topic map applications and to interoperability between topic map applications; it is also critical to interoperability between topic map applications and other applications that make explicit use of abstract representations of subjects, such as RDF.

From that earlier community, Bernard Vatant has subsequently spoken of the need for and use of “hubjects” as organizing and binding points, as have Jack Park and Patrick Durusau with the related concept of “subject maps” [3]. Another effort with some overlap with a subject structure is the Metadata Registry maintained by the National Science Digital Library (NSDL).

However, while these efforts support the idea of subjects as partial binding or mapping targets, none of them actually proposed a reference subject structure. Actual subject structures may be a bit of a “third rail” in ontology circles, owing to a historical desire to avoid the pitfalls of older library classification systems such as the Dewey Decimal Classification or the Library of Congress Subject Headings.

Be that as it may. I now think the timing is right for us to close this subject gap.

A General Conceptual Model

This mapping layer lends itself to a three-tiered general conceptual model. The first tier is the subject structure, the conceptualization embracing all possible subject content. This referential layer is the lookup point that provides guidance for where to search and find “stuff.”

The second layer is the representation layer, made up of informal to “formal” ontologies. Depending on the formalism, the ontology provides more or less understanding about the subject matter it represents, but at minimum binds to the major subject concepts in the top subject mapping layer.

The third tier comprises the data sets and “data spaces” [4] that provide the actual content instantiations of these subjects and their ontology representations. This data space layer is the actual source for getting the target information.

Here is a diagram of this general conceptual model:

Three-tiered Conceptual Model

The layers in this general conceptual model progress from the more abstract and conceptual at the upper level, useful for directing where traffic needs to go, to concrete information and data at the lower level, the real object of manipulation and analysis.

The data spaces and ontologies of various formalisms in the lower two tiers exist in part today. The upper mapping layer does not.

By its nature the general upper mapping layer, in its role as a universal reference “backbone,” must be somewhat general in the scope of its reference subjects. As this mapping infrastructure begins to be fleshed out, it is also therefore likely that additional intermediate mapping layers will emerge for specific domains, which will have more specific scopes and terminology for more accurate and complete understanding with their contributing specific data spaces.
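One way to picture the three tiers in code is as three simple record types, with each lower tier pointing upward to the layer above. This is a conceptual sketch only; the class and field names are mine, not part of any specification.

```python
# Conceptual sketch only: three record types mirroring the three tiers.
# All class and field names are illustrative, not part of any specification.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReferenceSubject:            # tier 1: the lightweight subject "backbone"
    uri: str
    label: str

@dataclass
class DomainOntology:              # tier 2: informal-to-formal representation layer
    uri: str
    formalism: str                 # e.g., "folksonomy", "SKOS", "OWL DL"
    bound_subjects: List[ReferenceSubject] = field(default_factory=list)

@dataclass
class DataSpace:                   # tier 3: the actual content instantiations
    uri: str
    sparql_endpoint: str
    ontology: DomainOntology

music = ReferenceSubject("http://example.org/subjects/Music", "Music")
mb_ontology = DomainOntology("http://example.org/mb/schema", "RDF Schema", [music])
mb_data = DataSpace("http://example.org/mb/data", "http://example.org/mb/sparql", mb_ontology)
```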

Six Principles for a Possible Lightweight Binding Mechanism

So, let’s set aside for a moment the question of what an actual reference subject structure might contain. Let’s assume one exists. What should we do with this structure? How should we bind to it?

  • First, we need to assume that the various ontologies that might bind to this structure reside in the real world, and have a broad diversity of domains, topics and formality of structure. Therefore, we should: a) provide a binding mechanism responsive to the real-world range of formalisms (that is, make no grand assumptions or requirements of structure; each contributed structure will be provided as is); and b) thus place the responsibility for registering or binding the subject mapping assignment(s) on the publisher of the contributing content ontology [5].
  • Second, we can assume that the reference subject structure (light green below) and its binding ontology basis (dark green) are independent actors. As with many other RESTful services, this needs to work in a peer-to-peer (P2P) manner.
  • Third, as the Intrepid Guide argues, RDF and its emergent referential schema provide the natural data model and characterization “middle ground” across all Web ontology formalisms. This observation leads to SKOS as the organizing schema for ontology integration, supplemented by the related RDF schema of DOAP, SIOC, FOAF and Geonames for the concepts of projects, communities, people and places, respectively. Other standard referents may emerge, and these should also be able to be incorporated.
  • Fourth, the actual binding points of “subjects” are themselves only that: binding points. In other words, they are representative “proxies”, not to be confused with the actual subjects themselves. This implies no further semantics than a possible binding, and no assertion about the accuracy, relevance or completeness of the binding. How such negotiation may resolve itself needs to remain outside the scope of a simple mapping and binding reference layer (see again [3]).
  • Fifth, the binding mechanism and its subject structure need to have community acceptance; no “wise guys” here, just well-intentioned bozos.
  • And, sixth, keep it simple. While not all publishers of Web sites need comply — and the critical threshold is to get some of the major initial publishers to comply — the truth, like everything else on the Web, is that the network effect makes things go. By keeping it simple, individual publishers and tool makers are more likely to use and contribute to the system.

How this might look in a representative subject binding to two candidate ontology data sets is shown below:

Example Binding Structure

This diagram shows two contributing data sets and their respective ontologies (no matter how “formal”) binding to a given “subject.” This subject proxy may not be the only one bound by a given data set and its ontology. Also note the “subject” is completely arbitrary and, in fact, is only a proxy for the binding to the potential topic.
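In RDF terms, such a binding might be expressed with SKOS mapping properties, along roughly the following lines. The subject proxy URI and the two contributing concept URIs are invented, and skos:closeMatch is shown only as one plausible choice of binding predicate; the actual mechanism is exactly what a community process would need to settle.

```python
# Sketch of two contributing ontologies binding to one subject proxy.
# The URIs are invented; skos:closeMatch is just one plausible binding predicate.
from rdflib import Graph, Namespace, URIRef

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")
proxy = URIRef("http://example.org/reference-subjects/Music")   # the subject proxy

g = Graph()
g.bind("skos", SKOS)

# Data set A: a music-retailer product ontology
g.add((URIRef("http://example.org/shopA/schema#MusicRecordings"), SKOS.closeMatch, proxy))
# Data set B: a fan-site folksonomy lifted into SKOS
g.add((URIRef("http://example.org/fansiteB/tags#music"), SKOS.closeMatch, proxy))

# Anything that walks the graph can now discover that both data sets claim
# relevance to the same reference subject, without either asserting anything
# stronger than "these bind to the same proxy."
print(g.serialize(format="turtle"))
```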

One Possible Road Map

Clearly, this approach does not provide a powerful inferential structure.

But what it does provide is a quite powerful organizational structure. For the sake of simplicity and adoption, access and the where are given preference over inferential elegance. Thus, assuming again a subject structure backbone, matched with these principles and the same jumbled structures noted above, we can see an organizational order emerge from the chaos:

Lightweight Binding to an Upper Subject Structure Can Bring Order

The assumptions to get to this point are not heroic. Simple binding mechanisms matched with a high-level subject “backbone” are all that is required mechanically for such an approach to emerge. All we have done to achieve this road map is to follow the trails above.

Use Existing Consensus to Achieve Authority

So, there is nothing really revolutionary in any of the discussion to this point. Indeed, many have cited the importance of reference structures previously. Why hasn’t such a subject structure yet been developed?

One explanation is that no one accepts any global “external authority” for such subject identifications and organization. The very nature of the Web is participatory and democratic (with a small “d”). Everyone’s voice is equal, and any structure that suggests otherwise will not be accepted.

It is not unusual, therefore, that some of the most accepted venues on the Web are the ones where everyone has an equal chance to participate and contribute: Wikipedia, Flickr, eBay, Amazon, Facebook, YouTube, etc. Indeed, figuring out what generates such self-sustaining “magic” is the focus of many wannabe ventures.

Figuring out that “magic” is not our purpose here. Our purpose is to set the preconditions for what would constitute a referential subject structure that can achieve broad acceptance on the Web. And in the paragraph above, we already have insight into the answer: build a subject structure from already accepted sources rich in subject content.

A suitable subject structure must be adaptable and self-defining. These criteria reflect expressions of actual social usage and practice, which of course changes over time as knowledge increases and technologies evolve.

One obvious foundation for building a subject structure is thus Wikipedia. That is because Wikipedia has been built entirely from the bottom up — the community itself decides what is a deserving topic. This has served Wikipedia and the world extremely well, with now nearly 1.8 million articles online in English alone (versions exist for about 100 different languages) [6]. There is also a wealth of internal structure within Wikipedia’s “infobox” templates, structure that has been utilized by DBpedia (among others) to actually transform Wikipedia into an RDF database (as I described in an earlier article). Because it is socially driven and evolving, I foresee Wikipedia continuing to be the substantive core at the center of a knowledge organization framework for some time to come.

But Wikipedia was never designed with an organizing, high-level subject structure in mind. For my arguments herein, creating such an organizing (yes, in part, hierarchical) structure is pivotal.

One innovative approach to provide a more hierarchical structural underpinning to Wikipedia has been YAGO (“yet another great ontology”), an effort from the Max-Planck-Institute Saarbrücken [7]. YAGO matches key nouns between Wikipedia and WordNet, and then uses WordNet’s well-defined taxonomy of synsets to superimpose the hierarchical class structure. The match is more than 95% accurate; YAGO is also designed for extensibility with other quality data sets.
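The flavor of that matching can be illustrated with WordNet itself (here via NLTK). This is only a sketch of the general idea — take the head noun of a Wikipedia category, look up its WordNet synset, and read a class hierarchy off the hypernym chain — not a description of YAGO's actual algorithm, and the naive "first sense" choice below is exactly the kind of shortcut YAGO's disambiguation avoids.

```python
# Sketch of the general idea behind a Wikipedia-to-WordNet class hierarchy:
# take a category's head noun, look up its WordNet synset, then walk hypernyms.
# Illustrative only, not YAGO's actual matching algorithm.
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet")

def hypernym_chain(noun: str):
    """Return the chain of increasingly general classes above a noun."""
    synsets = wn.synsets(noun, pos=wn.NOUN)
    if not synsets:
        return []
    chain, current = [], synsets[0]      # naively take the most common sense
    while current.hypernyms():
        current = current.hypernyms()[0]
        chain.append(current.name())
    return chain

# e.g., the Wikipedia category "Cities in Germany" has the head noun "city"
print(hypernym_chain("city"))
# -> something like ['municipality.n.01', ..., 'entity.n.01']
```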

I believe YAGO or similar efforts show how the foundational basis of Wikipedia can be supplemented with other accepted lexicons to derive a suitable subject structure with appropriate high-level “binding” attributes. In any case, however constructed, I believe that a high-level reference subject structure must evolve from the global community of practice, as has Wikipedia and WordNet.

I have previously described this formula as W + W + S + ? (for Wikipedia + WordNet + SKOS + other?). There indeed may need to be “other” contributing sources to construct this high-level reference subject structure. Other such potential data sets could be analyzed for subject hierarchies and relationships using fairly well accepted ontology learning methods. Additional techniques will also be necessary for multiple language versions. Those are important details to be discussed and worked out.

The real point, however, is that existing and accepted information systems already exist on the Web that can inform and guide the construction of a high-level subject map. As the contributing sources evolve over time, so could periodic updates and new versions of this subject structure be generated.

Though the choice of the contributing data sets from which this subject structure could be built will never be unanimous, using sources that have already been largely selected through survival of the “fittest” by large portions of the Web-using public will go a long way toward establishing authoritativeness. Moreover, since the subject structure is only intended as a lightweight reference structure — and not a complete closed-world definition — we are also setting realistic thresholds for acceptance.

Conclusion and Next Steps

The specific topic of this piece has been on a subject reference mapping and binding layer that is lightweight, extensible, and reflects current societal practice (broadly defined). In the discussion, there has been recognition of existing schema in such areas as people (FOAF), projects (DOAP), communities (SIOC) and geographical places (Geonames) that might also contribute to the overall binding structure. There very well may need to be some additional expansions in other dimensions such as time and events, organizations, products or whatever. I hope that a consensus view on appropriate high-level dimensions emerges soon.

A number of individuals are presently working on a draft proposal for an open process to create this subject structure. We are working quickly to draft and share with the broader community a proposal covering:

  1. A reference umbrella subject binding ontology, with its own high-level subject structure
  2. Lightweight mechanisms for binding subject-specific community ontologies to this structure
  3. Identification of existing data sets for high-level subject extraction
  4. Codification of high-level subject structure extraction techniques
  5. Identification and collation of tools to work with this subject structure, and
  6. A public Web site for related information, collaboration and project coordination.

We believe this to be an exciting and worthwhile endeavor. Prior to the unveiling of our public Web site and project, I encourage any of you with interest in helping to further this cause to contact me directly at mike at mkbergman dot com [8].


[1] I attempted to quantify this problem in a white paper from about two years ago, Untapped Assets: The $3 Trillion Value of U.S. Enterprise Documents, BrightPlanet Corporation, 42 pp., July 20, 2005. Some reasons for how such waste occurs were documented in a four-part series on this AI3 blog, Why Are $800 Billion in Document Assets Wasted Annually?, beginning in October 2005 through parts two, three and four concluding in November 2005.

[2] See Published Subjects: Introduction and Basic Requirements (OASIS Published Subjects Technical Committee Recommendation, 2003-06-24).

[3] See especially Park and Durusau, Avoiding Hobson’s Choice In Choosing An Ontology and Towards Subject-centric Merging of Ontologies.

[4] The concept of “data spaces” has been well-articulated by Kingsley Idehen of OpenLink Software and Frédérick Giasson of Zitgist LLC. A “data space” can be personal, collective or topical, and is a virtual “container” for related information irrespective of storage location, schema or structure.

[5] If the publisher gets it wrong, and users through the reference structure don’t access their desired content, there will be sufficient motivation to correct the mapping.

[6] See Wikipedia’s statistics sections.

[7] Fabian M. Suchanek, Gjergji Kasneci and Gerhard Weikum, “Yago – A Core of Semantic Knowledge” (also in bib or ppt). Presented at the 16th International World Wide Web Conference (WWW 2007) in Banff, Alberta, on May 8-12, 2007. YAGO contains over 900,000 entities (like persons, organizations, cities, etc.) and 6 million facts about these entities, organized under a hierarchical schema. YAGO is available for download (400Mb) and converters are available for XML, RDFS, MySQL, Oracle and Postgres. The YAGO data set may also be queried directly online.

[8] I’d especially like to thank Frédérick Giasson and Bernard Vatant of Mondeca for their reviews of a draft of this posting. Fred was also instrumental in suggesting the ideas behind the figure on the general conceptual model.

Posted: May 25, 2007

T-SIOC, Object-centered Sociality

AI3 Note: This is the first experiment in directly publishing a blog post from another blog based solely on its RDF. The output came from the new WP SIOC plugin from the inestimable CaptSolo. I also created a new title and then edited the original title slightly as the sub-title. The direct stuff comes next. Thx, Capt!

This post was created by the WordPress SIOC Import plugin based on this SIOC RDF data describing a post located at http://www.johnbreslin.com/blog/2007/04/23/t-sioc-object-centred-sociality/.

I've been reading Jyri Zengestrom's post about object-centred sociality again and I think this illustrates one usage of our SIOC Types module (T-SIOC) very nicely. I've extended my previous picture showing a person being linked across communities to this idea of people (via their user profiles) being connected by the content they create together, co-annotate, or for which they use similar annotations. Bob and Carol are connected via bookmarked URLs that they both have annotated and also through events that they are both attending, and Alice and Bob are using similar tags and are subscribed to the same blogs.


(See also Jyri Zengestrom's presentation on object-centred sociality, his paper on collaborative intentionality and social knots, and this resource about organisations and objects.)

Posted: May 16, 2007

There’s an Endless Variety of World Views, and Almost as Many Ways to Organize and Describe Them

Ontology is one of the more daunting terms for those exposed for the first time to the semantic Web. Not only is the word long and without many common antecedents, but it is also a term that has widely divergent use and understanding within the community. It can be argued that this not-so-little word is one of the barriers to mainstream understanding of the semantic Web.

The root of the term is the Greek ontos, meaning being or the nature of things. Literally — and in classical philosophy — ontology was used in relation to the study of the nature of being or the world, the nature of existence. Tom Gruber, among others, made the term popular in relation to computer science and artificial intelligence about 15 years ago when he defined an ontology as a “formal specification of a conceptualization.”

While there have been attempts to strap on more or less formal understandings or machinery around ontology, it still has very much the sense of a world view, a means of viewing and organizing and conceptualizing and defining a domain of interest. As is made clear below, I personally prefer a loose and embracing understanding of the term (consistent with Deborah McGuinness’s 2003 paper, Ontologies Come of Age [1]).

There has been a resurgence of interest in ontologies of late. Two reasons have been the emergence of Web 2.0, with its tagging and folksonomies, and the nascent emergence of the structured Web. In fact, on April 23-24 one of the noted communities of practice around ontologies, Ontolog, sponsored the Ontology Summit 2007, “Ontology, Taxonomy, Folksonomy: Understanding the Distinctions.”

These events have sparked my preparing this guide to ontologies. I have to admit this is a somewhat intrepid endeavor given the wealth of material and diversity of opinions.

Overview and Role of Ontologies

Of course, a fancy name is not sufficient alone to warrant an interest in ontologies. There are reasons why understanding, using and manipulating ontologies can bring practical benefit:

  • Depending on their degree of formalism (an important dimension), ontologies help make explicit the scope, definition, and language and meaning (semantics) of a given domain or world view
  • Ontologies may provide the power to generalize about their domains
  • Ontologies, if hierarchically structured in part (and not all are), can provide the power of inheritance
  • Ontologies provide guidance for how to correctly “place” information in relation to other information in that domain
  • Ontologies may provide the basis to reason or infer over their domains (again as a function of their formalism)
  • Ontologies can provide a more effective basis for information extraction or content clustering
  • Ontologies, again depending on their formalism, may be a source of structure and controlled vocabularies helpful for disambiguating context; they can inform and provide structure to the “lexicons” in particular domains
  • Ontologies can provide guiding structure for browsing or discovery within a domain, and
  • Ontologies can help relate and “place” other ontologies or world views in relation to one another; in other words, ontologies can organize ontologies from the most specific to the most abstract.

Both structure and formalism are dimensions for classifying ontologies, which combined are often referred to as an ontology’s “expressiveness.” How one describes this structure and formality differs. One recent attempt is this figure from the Ontology Summit 2007‘s wrap-up communique:

Ontology Summit 2007 Communique Diagram

Note the bridging role that an ontology plays between a domain and its content. (By its nature, every ontology attempts to “define” and bound a domain.) Also note that the Summit’s 50 or so participants were focused on the trade-off between semantics v. pragmatic considerations. This was a result of the ongoing attempts within the community to understand, embrace and (possibly) legitimize “less formal” Web 2.0 efforts such as tagging and the folksonomies that can result from them.

There is an M.C. Escher-like recursion of the lizard eating its tail when one observes ontologists creating an ontology to describe the ontological domain. The above diagram, which itself would be different with a slight change in Summit participation or editorship, is, of course, but one representative view of the world. Indeed, a tremendous variety of scientific and research disciplines concern themselves with classifying and organizing the “nature of things.” The practitioners of these disciplines go by such names as logicians, taxonomists, philosophers, information architects, computer scientists, librarians, operations researchers, systematicists, statisticians, historians, and so forth. (In short, given our ontos, every area of human endeavor has the urge to classify, to organize.) In each of these areas, not only do the domains differ, but so do the structures and classification schemes they adopt.

There are at least 40 terms or concepts across these various disciplines, most related to Web and general knowledge content, that have organizational or classificatory aspects that — loosely defined — could be called an “ontology” framework or approach.

Actual domains or subject coverage are then mostly orthogonal to these approaches.

Loosely defined, the number of possible ontologies is therefore close to infinite: domain × perspective × schema. (Just kidding — sort of! In fact, UMBC’s Swoogle ontology search service claims 10,000 ontologies presently on the Web; the actual data from August 2006 ranges from about 16,000 to 92,000 ontologies, depending on how “formal” the definition. These counts are also limited to OWL-based ontologies.)

Many have misunderstood the semantic Web because of this diversity and the slipperiness of the concept of an ontology. This misunderstanding becomes flat wrong when people claim the semantic Web implies one single grand ontology or organizational schema, One Ring to Rule Them All. Human and domain diversities make this viewpoint patently false.

Diversity, ‘Naturalness’ and Change

The choice of an ontological approach to organize Web and structured content can be contentious. Publishers and authors perhaps have too many choices: from straight Atom or RSS feeds and feeds with tags to informal folksonomies and then Outline Processor Markup Language or microformats. From there, the formalism increases further to include the standard RDF ontologies such as SIOC (Semantically-Interlinked Online Communities), SKOS (Simple Knowledge Organizing System), DOAP (Description of a Project), and FOAF (Friend of a Friend) and the still greater formalism of OWL’s various dialects.

Arguing which of these is the theoretical best method is doomed to failure, except possibly in a bounded enterprise environment. We live in the real world, where multiple options will always have their advocates and their applications. All of us should welcome whatever structure we can add to our information base, no matter where it comes from or how it’s done. The sooner we can embrace content in any of these formats and convert it to a canonical form, the sooner we can move on to the needed developments in semantic mediation, the threshold condition for the semantic Web.

So, diversity is inevitable and should be accepted. But that observation need not also embrace chaos.

In my early training in biological systematics, Ernst Haeckel’s recapitulation theory that “ontogeny recapitulates phylogeny” (note the same ontos root, the difference from ontology being growth v. study) was losing favor fast. The theory was that the development of an organism through its embryological phases mirrors its evolutionary history. Today, modern biologists recognize numerous connections between ontogeny and phylogeny, explain them using evolutionary theory, or view them as supporting evidence for that theory.

Yet, as with the construction of phylogenetic trees, systematicists strive for their classifications of the relatedness of organisms to be “natural”, to reflect the true nature of the relationships. Thus, over time, that understanding of a “natural” system has progressed from appearance → embryology → embryology + detailed morphology → species and interbreeding → DNA. While details continue to be worked out, the degree of genetic relatedness is now widely accepted by biologists as a “natural” basis for organizing the Tree of Life.

It is not unrealistic to also seek “naturalness” in the organization of other knowledge domains, to seek “naturalness” in the organization of their underlying ontologies. Like natural systems in biology, this naturalness should emerge from the shared understandings and perceptions of the domain’s participants. While subject matter expertise and general and domain knowledge are essential to this development, they are not the only factors. As tagging systems on the Web are showing, common usage and broad acceptance by the community at hand is important as well.

While it may appear that a domain such as the biological relatedness of organisms is more empirical than the concepts and ambiguous words in most domains of human endeavor, these attempts at naturalness are still not foolish. The phylogeny example shows that understanding changes over time as knowledge is gained. We now accept DNA over the recapitulation theory.

As the formal SKOS organizational schema for knowledge systems recognizes (see below), the ideas of narrower and broader concepts can be readily embraced, as well as concepts of relatedness and aliases (synonyms). These simple constructs, I would argue, plus the application of knowledge being gained in related domains, will enable tomorrow’s understandings to be more “natural” than today’s, no matter the particular domain at hand.

So, in seeking a “naturalness” within our organizational schema, we can also see that change is a constant. We also see that the tools and ideas underlying the seemingly abstract cause of merging and relating existing ontologies to one another will further a greater “naturalness” within our organizations of the world.

A Spectrum of Formalisms

According to the Summit, expressiveness is the extent and ease by which an ontology can describe domain semantics. Structure they define as the degree of organization or hierarchical extent of the ontology. They further define granularity as the level of detail in the ontology. And, as the diagram above alludes, they define other dimensions of use, logical basis, purpose and so forth of an ontology.

More than fifty respondents from 42 communities submitted some 70 different ontologies, under about 40 terms, to a survey that the Summit used to construct its diagram. These submissions included:

. . . formal ontologies (e.g., BFO, DOLCE, SUMO), biomedical ontologies (e.g., Gene Ontology, SNOMED CT, UMLS, ICD), thesauri (e.g., MeSH, National Agricultural Library Thesaurus), folksonomies (e.g., Social bookmarking tags), general ontologies (WordNet, OpenCyc) and specific ontologies (e.g., Process Specification Language). The list also includes markup languages (e.g., NeuroML), representation formalisms (e.g., Entity-Relation model, OWL, WSDL-S) and various ISO standards (e.g., ISO 11179). This [Ontolog] sample clearly illustrates the diversity of artifacts collected under “ontology”.

I think the simplest spectrum for such distinctions is the formalism of the ontology and its approach (and language or syntax, not further discussed here). More formal ontologies have greater expressiveness and structure and inferential power, less formal ones the opposite. Constructing more formal ontologies is more demanding, and takes more effort and rigor, resulting in an approach that is more powerful but also more rigid and less flexible. Like anything else, there are always trade-offs.

Based on work by Leo Obrst of Mitre, as interpreted by Dan McCreary, we can view this trade-off as one of semantic clarity v. the time and money required to construct the formalism [12, 13]:

Structure and Formalism Increases Semantic Expressiveness

Note this diagram reflects the more conventional, practitioner’s view of the “formal” ontology, which does not include taxonomies or controlled vocabularies (for example) in the definition. This represents the more “closely defined” end of the ontology (semantic) spectrum.

However, since we are speaking here of ontologies and the structured Web or the semantic Web, I believe we need to embrace a concept of ontology aligned to actual practice. Not all content providers can or want to employ ontology engineers to enable formal inferencing of their content. Yet, on the other hand, their content in its various forms does have some meaningful structure, some organization. The trick is to extract this structure for more meaningful use such as data exchange or data merging.

Ontology Approaches on the Web

Under such “loosely defined” bases we can thus see a spectrum of ontology approaches on the Web, proceeding from less structure and formalism to more so:

(For each type or schema below, representative examples are given in parentheses, followed by comments on its structure and formalism.)

  • Standard Web Page (the entire Web): General metadata fields in the page header and internal HTML codes and tags provide minimal, but useful, sources of structure; other HTTP and retrieval data can also contribute.
  • Blog / Wiki Page (examples from Technorati, Bloglines, Wikipedia): Provides still greater formalism for the organization and characterization of content (subjects, categories, posts, comments, date/time stamps, etc.). Importantly, with the addition of plug-ins, some of the basic software may also provide other structured characterizations or output (SIOC, FOAF, etc.); these are highly varied and site-specific given the diversity of publishing systems and plug-ins.
  • RSS / Atom Feeds (most blogs and most news feeds): RSS extends the basic XML schema for more robust syndication of content with a tightly controlled vocabulary for feed concepts and their relationships. Because of its ubiquity, this is becoming a useful baseline of structure and formalism; also, the nature of adoption shows much about how ontological structure is an artifact of, not a driver for, use.
  • RSS / Atom Feeds with Tags or OPML (Grazr; most newsfeed aggregators can import and export OPML lists of RSS feeds): The OPML specification defines an outline as a hierarchical, ordered list of arbitrary elements. The specification is fairly open, which makes it suitable for many types of list data. See also OML and XOXO.
  • Hierarchical Faceted Metadata (XFML, Flamenco): These and related efforts from the information architecture (IA) community are geared more to library science. However, they directly contribute to faceted browsing, which is one of the first practical instantiations of the semantic Web.
  • Folksonomies (Flickr, del.icio.us): Based on user-generated tags and informal organizations of the same; not linked to any “standard” Web protocols. Both tags and hierarchical structure are arbitrary, but some researchers now believe that, over large enough participant sets, structural consensus and value do emerge.
  • Microformats (hAtom, hCalendar, hCard, hReview, hResume, rel-directory, xFolk, XFN and XOXO): A microformat is HTML markup to express semantics with strictly controlled vocabularies. This markup is embedded using specific HTML attributes such as class, rel, and rev. This method is easy to implement and understand, but is not free-form.
  • Embedded RDF (RDFa, eRDF): An embedded format, like microformats, but free-form, and not subject to the approval strictures associated with microformats.
  • Topic Maps (Infoloom, Topic Maps Search Engine): A topic map can represent information using topics (representing any concept, from people, countries, and organizations to software modules, individual files, and events), associations (which represent the relationships between them), and occurrences (which represent relationships between topics and information resources relevant to them).
  • RDF (many; DBpedia, etc.): RDF has become the canonical data model since it represents a “universal” conversion format.
  • RDF Schema (SKOS, SIOC, DOAP, FOAF): RDFS or RDF Schema is an extensible knowledge representation language, providing basic elements for the description of ontologies, otherwise called RDF vocabularies, intended to structure RDF resources. This becomes the canonical ontology common meeting ground.
  • OWL Lite, OWL DL and OWL Full (some existing OWL ontologies; also see Swoogle for OWL search facilities): The Web Ontology Language (OWL) is a language for defining and instantiating Web ontologies. An OWL ontology may include descriptions of classes, along with their related properties and instances. OWL is designed for use by applications that need to process the content of information instead of just presenting information to humans. It facilitates greater machine interpretability of Web content than that supported by XML, RDF, and RDF Schema (RDF-S) by providing additional vocabulary along with a formal semantics. The three language versions are listed in order of increasing expressiveness.
  • Higher-order “formal” and “upper-level” ontologies (SUMO, DOLCE, PROTON, BFO, Cyc, OpenCyc): These provide comprehensive ontologies and often related knowledge bases, with the goal of enabling AI applications to perform human-like reasoning. Their reasoning languages often use higher-order logics.

As a rule of thumb, items that are less “formal” can be converted to a more formal expression, but the most formal forms can generally not be expressed in less formal forms.
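As a small sketch of that one-way conversion, here is how a flat set of folksonomy tags (about as informal as Web structure gets) might be lifted into minimal SKOS concepts using Python and rdflib. The tag list and the example namespace are invented; and note that nothing in the conversion recovers structure the informal source never carried.

```python
# Sketch: lifting flat folksonomy tags (informal) into minimal SKOS concepts
# (more formal). Tags and namespace are invented for illustration.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")
EX = Namespace("http://example.org/tags/")

tags = ["jazz", "vinyl", "concerts"]          # a user's flat, informal tag set

g = Graph()
g.bind("skos", SKOS)
for tag in tags:
    concept = EX[tag]
    g.add((concept, RDF.type, SKOS.Concept))
    g.add((concept, SKOS.prefLabel, Literal(tag, lang="en")))

# The reverse direction is lossy or impossible: an OWL ontology's class axioms,
# restrictions and disjointness statements have no tag-level equivalent.
print(g.serialize(format="turtle"))
```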

As later sections elaborate, I see RDF as the universal data model for representing this structure in a common, canonical format, with RDF Schema (specifically SKOS, but also supplemented by FOAF, DOAP and SIOC) as the organizing ontology knowledge representation language (KRL).

This is not to say that the various dialects of OWL should be neglected. In bounded environments, they can provide superior reasoning power and are warranted if they can be sufficiently mandated or enforced. But the RDF and RDF-S systems represent the most tractable “meeting place” or “middle ground,” IMHO.

Still-Another “Level” of Ontologies

As if the formalism dimension were not complicated enough, there is also the practice within the ontology community to characterize ontologies by “levels”, specifically upper, middle and lower levels. For example, chances are that you have heard particularly of “upper-level” ontologies.

The following figure helps illustrate this “level” dimension. This diagram is also from Leo Obrst of Mitre [12], and was also used in another 2006 talk by Jack Park and Patrick Durusau (discussed further below for other reasons):

Ontology Levels

Examples of upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc and BFO (Basic Formal Ontology). Most of the content in their upper levels is more akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget’s Thesaurus — that is, real ontos stuff) than to “generic common knowledge.” Almost all of them have both a hierarchical and networked structure, though their actual subject structure relating to concrete things is generally pretty weak [2].

The above diagram conveys a sense of how multiple ontologies can relate to one another both in terms of narrower and broader topic matter and at the same “levels” of generalization. Such “meta-structure” (if you will) can provide a reference structure for relating multiple ontologies to one another.

It is exactly in such bindings or relationships that we can foresee the promise of querying and relating multiple endpoints on the Web with accurate semantics in order to connect dots and combine knowledge bases. Thus, the understanding of the relationships and mappings amongst ontologies becomes a critical infrastructural component of the semantic Web.

The SUMO Example

We can better understand these mapping and inter-relationship concepts by using a concrete example with a formal ontology. We’ll choose to use the Suggested Upper Merged Ontology simply because it is one of the best known. We could have also selected another upper-level system such as PROTON [3] or Cyc [4] or one of the many with narrower concept or subject coverage.

SUMO is one of the formal ontologies that has been mapped to the WordNet lexicon, which adds to its semantic richness. SUMO is written in the SUO-KIF language. SUMO is free and owned by the IEEE. The ontologies that extend SUMO are available under GNU General Public License.

The abstract, conceptual organization of SUMO is shown by this diagram, which also points to its related MILO (MId-Level Ontology), which is being developed as a bridge between the abstract content of the SUMO and the richer detail of various domain ontologies:

At this level, the structure is quite abstract. But one can easily browse the SUMO structure. A nifty tool to do so is the KSMSA (Knowledge Support for Modeling and Simulation) ontology browser. Using a hierarchical tree representation, you can navigate through SUMO, MILO, WordNet, and (with the locally installed version) Wikipedia.

The figure below shows the upper-level entity concept on the left; the right-hand panel shows a drill-down into the example atom entity:

Example SUMO Categories

These views may be a bit misleading because the actual underlying structure, while it has hierarchical aspects as shown here, really is in the form of a directed acyclic graph (showing other relatedness options, not just hierarchical ones). So, alternate visualizations include traditional network graphs.

The other thing to note is that the “things” covered in the ontology, the entities, are also fairly abstract. That is because the intention of a standard “upper-level” ontology is to cover all relevant knowledge aspects of each entity’s domain. This approach results in a subject and topic coverage that feels less “concrete” than the coverage in, say, an encyclopedia, directory or card catalog.

 

Ontology Binding and Integration Mechanisms

According to Park and Durusau, upper ontologies are diverse, middle ontologies are even more diverse, and lower ontologies are more diverse still. A key observation is that ontological diversity is a given and increases as we approach real user interaction levels. Moreover, because of the “loose” nature of ontologies on the Web (now and into the future), diversity of approach is a further key factor.

Recall the initial discussion on the role and objectives of ontologies. About half of those roles involve effectively accessing or querying more than one ontology. The objective of “upper-level” ontologies, many with their own binding layers, is also expressly geared to ontology integration or federation. So, what are the possible mechanisms for such binding or integration?

A fundamental distinction within mechanisms to combine ontologies is whether it is a unified or centralized approach (often imposed or required by some party) or whether it is a schema mapping or binding approach. We can term this distinction centralized v. federated.

Centralized Approaches

Centralized approaches can take a number of forms. At the most extreme, adherence to a centralized approach can be contractual. At the other end are reference models or standards. For example, illustrative reference models include:

  • the Data Reference Model (DRM), one of the five reference models of the Federal Enterprise Architecture (FEA)
  • UDEF (Unified Data Element Framework), an approach toward a unified description framework, or
  • the eXtended MetaData Registry (XMDR) project.

Though I have argued that One Ring to Rule them All is not appropriate to the general Web, there may be cases within certain enterprises or where through funding clout (such as government contracts), some form of centralized approach could be imposed [5]. And, frankly, even where compliance can not be assured, there are advantages in economy, efficiency and interoperability to attempt central ontologies. Certain industries — notably pharmaceuticals and petrochemicals — and certain disciplines — such as some areas of biology among others — have through trade associations or community consensus done admirable jobs in adopting centralized approaches.

Federated Approaches

However, combining ontologies in the context of the broader Internet is more likely through federated approaches. (Though federated approaches can also be improved when there are consensual standards within specific communities.) The key aspect of a federated approach is to acknowledge that multiple schema need to be brought together, and that each contributing data set and its schema will not be altered directly and will likely remain in place.

Thus, the key distinctions within this category are the mechanisms by which those linkages may take place. An important goal in any federated approach is to achieve interoperability at the data or instance level without unacceptable loss of information or corruption of the semantics. Numerous specific approaches are possible, but three example areas (RDF-topic map interoperability, the use of “subject maps”, and binding layers) can illustrate some of the issues at hand.

In 2006 the W3C set up a working group to look at the issue of RDF and topic maps interoperability. Topic maps have been embraced by the library and information architecture community for some time, and have standards that have been adopted under ISO. Somewhat later but also in parallel was the development of the RDF standard by W3C. The interesting thing was that the conceptual underpinnings and objectives between these two efforts were quite similar. Also, because of the substantive thrust of topic maps and the substantive needs of its community, quite a few topic maps had been developed and implemented.

One of the first efforts of the W3C work group was to evaluate and compare five or six extant proposals for how to relate RDF and topic maps [6]. That report is very interesting reading for anyone desirous of learning more about specific issues in combining ontologies and their interoperability. The result of that evaluation then led to some guidelines for best practices in how to complete this mapping [7]. Evaluations such as these provide confidence that interoperability can be achieved between relatively formal schema definitions without unacceptable loss in meaning.

A different, “looser” approach, but one which also grew out of the topic map community, is the idea of “subject maps.” This effort, backed by Park and Durusau noted above, but also with the support of other topic map experts such as Steve Newcomb and Robert Barta via their proposed Topic Maps Reference Model (ISO 13250-5), seems to be one of the best attempts I’ve seen that both respects the reality of the actual Web while proposing a workable, effective scheme for federation.

The basic idea of a subject map is built around a set of subject “proxies.” A subject proxy is a computer representation of a subject that can be implemented as an object, must have an identity, and must be addressable (this point provides the URI connector to RDF). Each contributing schema thus defines its own subjects, with the mappings becoming meta-objects. These, in turn, would benefit from having some accepted subject reference schema (not specifically addressed by the proponents) to reduce the breadth of the ultimate mapped proxy “space.”
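A subject proxy, as described, is little more than an addressable object with an identity that collects the identifiers contributing schema use for the same subject. A minimal sketch along those lines follows; the class and field names are mine, not those of the Topic Maps Reference Model.

```python
# Minimal sketch of a subject proxy: an addressable object with an identity
# that collects the identifiers other schema use for the same subject.
# Class and field names are mine, not the ISO 13250-5 model's.
from dataclasses import dataclass, field
from typing import Set

@dataclass
class SubjectProxy:
    uri: str                                  # addressable identity (the URI connector to RDF)
    label: str
    mapped_identifiers: Set[str] = field(default_factory=set)

    def bind(self, contributed_identifier: str) -> None:
        """Record that a contributing schema uses this identifier for the same subject."""
        self.mapped_identifiers.add(contributed_identifier)

music = SubjectProxy("http://example.org/proxies/Music", "Music")
music.bind("http://example.org/shopA/schema#MusicRecordings")
music.bind("http://example.org/fansiteB/tags#music")
```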

I don’t have the expertise to judge further the specifics, but I find the presentation and papers by Park and Durusau, Avoiding Hobson’s Choice In Choosing An Ontology and Towards Subject-centric Merging of Ontologies to be worthwhile reading in any case. I highly recommend these papers for further background and clarity.

As the third example, “binding layers” are a comparatively newer concept. Leading upper-level ontologies such as SUMO or PROTON propose their own binding protocols to their “lower” domains, but that approach takes place within the construct of the parent upper ontology and language. Such designs are not yet generalized solutions. By far the most promising generalized binding solution is the SKOS (Simple Knowledge Organization System). Because of its importance, the next section is devoted to it.

Finally, with respect to federated approaches, there are quite a few software tools that have been developed to aid or promote some of these specific methods. For example, about twenty of the software applications in my Sweet Tools listing of 500+ semantic Web and -related tools could be interpreted as aiding ontology mapping or creation. You may want to check out some of these specific tools depending on your preferred approach [8].

The Role of SKOS – the Simple Knowledge Organization System

SKOS, or the Simple Knowledge Organization System, is a formal language and schema designed to represent such structured information domains as thesauri, classification schemes, taxonomies, subject-heading systems, controlled vocabularies, or others; in short, most all of the “loosely defined” ontology approaches discussed herein. It is a W3C initiative more fully defined in its SKOS Core Guide [9].

SKOS is built upon the RDF data model of the subject-predicate-object “triple.” The subjects and objects are akin to nouns, the predicate a verb, in a simple Dick-sees-Jane sentence. Subjects and predicates by convention are related to a URI that provides the definitive reference to the item. Objects may be either a URI resource or a literal (in which case it might be some indexed text, an actual image, a number to be used in a calculation, etc.).

Being an RDF Schema simply means that SKOS adds some language and defined relationships to this RDF baseline. This is a bit of recursive understanding, since RDFS is itself defined in RDF by virtue of adding some controlled vocabulary and relations. The power, though, is that these schema additions are also easily expressed and referenced.

This RDFS combination can thus be shown as a standard RDF triple graph, but with the addition of the extended vocabulary and relations:

Standard RDF Graph Model

The power of the approach arises from the ability of the triple to express virtually any concept, further extended via the RDFS language defined for SKOS. SKOS includes concepts such as “broader” and “narrower”, which enable hierarchical relations to be modeled, as well as “related” and “member” to support networks and arrays, respectively [9].
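A few triples make these constructs concrete. Here is a minimal, hand-rolled fragment in Python with rdflib; the concept scheme and URIs are invented for illustration and are not drawn from any published vocabulary.

```python
# Minimal sketch of SKOS hierarchical and associative relations.
# The concept scheme and URIs are invented for illustration.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")
EX = Namespace("http://example.org/scheme/")

g = Graph()
g.bind("skos", SKOS)

for name in ("Music", "Jazz", "MusicFestivals"):
    g.add((EX[name], RDF.type, SKOS.Concept))
    g.add((EX[name], SKOS.prefLabel, Literal(name, lang="en")))

g.add((EX.Jazz, SKOS.broader, EX.Music))            # hierarchical: Jazz is narrower than Music
g.add((EX.Music, SKOS.narrower, EX.Jazz))           # the inverse assertion
g.add((EX.Jazz, SKOS.related, EX.MusicFestivals))   # associative, non-hierarchical link
g.add((EX.Jazz, SKOS.altLabel, Literal("jazz music", lang="en")))  # an alias (synonym)

print(g.serialize(format="turtle"))
```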

We can visualize this transforming power by looking at how an “ontology” in a totally foreign scheme can be related to the canonical SKOS scheme. In the figure below, the left-hand portion shows the native hierarchical taxonomy structure of the UK Archival Thesaurus (UKAT); the right-hand portion shows the same structure converted to SKOS (with the overlapping categories shown in dark purple). Note that the hierarchical relationships are easier to visualize in the taxonomy, but that the RDF graph model used by SKOS allows a richer set of relationships, including related terms and alternative names:

Example Structural Comparison of Hierarchical Taxonomy with Network Graph
[Click on image for full-size pop-up]

SKOS also has a rich set of annotation and labeling properties to enhance human readability of schema developed in it [9]. There is also a useful draft schema that the W3C’s SWEO (Semantic Web Education and Outreach) group is developing to organize semantic Web-related information [10].

Combined, these constructs provide powerful mechanisms for giving contributory ontologies a common conceptualization. When added to other sibling RDF schema such as FOAF or SIOC or DOAP, still additional concepts can be collated.

Conclusions

While not addressed directly in this piece, it is obviously of first importance to have structured content before questions of connecting that information can even arise. That structure must then also be available in a form suitable for merging or connection.

At that point, the subjects of this posting come into play.

We see that the everyday Web has a diversity of schema, or ontologies “loosely defined,” for representing the structure of its content. These representations can be transferred into more complex schema, but not in the opposite direction. Moreover, the semantic basis for how to make these mappings also needs some common referents.

RDF provides the canonical data model for the data transfers and representations. RDFS, especially in the form of SKOS, appears to form one basis for the syntax and language for these transformations. And SKOS, with other schema, also appears to offer much of the appropriate “middle ground” for data relationships mapping.

However, lacking in this story is a referential structure for subject relationships [11]. (Also lacking are the ultimately critical domain specifics required for actual implementation.)

Abstract concepts of interest to philosophers and deep thinkers have been given much attention. Sadly, to date, the concrete subject structures by which tangible things and tangible actions can be shared are still very, very weak. We are stubbing our toes on the rocks while we gaze at the heavens.

Yet, despite this, simple and powerful infrastructures are well in hand to address all foreseeable syntactic and semantic issues. There appear to be no substantive limits to the needed next steps.

Lastly, many valuable resources for further reading and learning may be found within the Ontolog Community, W3C, TagCommons and Topic Maps groups. Enjoy! And be wary of ontology no longer.


[1] Deborah L. McGuinness. “Ontologies Come of Age”. In Dieter Fensel, Jim Hendler, Henry Lieberman, and Wolfgang Wahlster, editors. Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential. MIT Press, 2003. See http://www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-mit-press-(with-citation).htm

[2] I think it would be much clearer to refer to “upper level” ontologies as abstract or conceptual, “mid levels” as mapping or binding, and “lower levels” as domain (without any hierarchical distinctions such as lower or lowest or sub-domain), but current practice is probably too entrenched to change now.

[3] The PROTON ontology (PROTo ONtology), developed within the scope of the SEKT project, is one of the more attractive reference ontologies because of its understandability, relatively small size, modular architecture and simple subsumption hierarchy. It is available in an OWL Lite form and is easy to adopt and extend. On the face of it, the Topic class within PROTON, which is meant to serve as a bridge between different ontologies, may also provide a binding layer to specific subject topics as sub-classes or class instances.

[4] See my earlier post on Cyc.

[5] Even with such clout, achieving complete adherence is questionable, as Ada showed within the Federal government. However, where circumstances allow it, central schema and ontologies may be worth pursuing because of improved interoperability and lower costs, even where some portions do not adhere and remain more chaotic, like the standard Web.

[6] See, A Survey of RDF/Topic Maps Interoperability Proposals, W3C Working Group Note 10 February 2006, Pepper, Vitali, Garshol, Gessa, Presutti (eds.)

[7] See, Guidelines for RDF/Topic Maps Interoperability, W3C Editor’s Draft 30 June 2006, Pepper, Presutti, Garshol, Vitali (eds.)

[8] Here are some Sweet Tools that may be useful for ontology federation and creation:

  • Adaptiva — is a user-centered ontology building environment, based on using multiple strategies to construct an ontology, minimising user input by using adaptive information extraction
  • Altova SemanticWorks — is a visual RDF and OWL editor that auto-generates RDF/XML or nTriples based on visual ontology design
  • CMS — the CROSI Mapping System is a structure matching system that capitalizes on the rich semantics of the OWL constructs found in source ontologies and on its modular architecture that allows the system to consult external linguistic resources
  • ConcepTool — is a system to model, analyze, verify, validate, share, combine, and reuse domain knowledge bases and ontologies, reasoning about their implication
  • ConRef — is a service discovery system which uses ontology mapping techniques to support different user vocabularies
  • FOAM — is the Framework for Ontology Alignment and Mapping. It is based on heuristics (similarity) of the individual entities (concepts, relations, and instances)
  • hMAFRA (Harmonize Mapping Framework) — is a set of tools supporting semantic mapping definition and data reconciliation between ontologies. The targeted formats are XSD, RDFS and KAON
  • IF-Map — is an Information Flow based ontology mapping method. It is based on the theoretical grounds of logic of distributed systems and provides an automated streamlined process for generating mappings between ontologies of the same domain
  • IODT — is IBM’s toolkit for ontology-driven development. The toolkit includes EMF Ontology Definition Metamodel (EODM), EODM workbench, and an OWL Ontology Repository (named Minerva)
  • KAON — is an open-source ontology management infrastructure targeted for business applications. It includes a comprehensive tool suite allowing easy ontology creation and management and provides a framework for building ontology-based applications. An important focus of KAON is scalable and efficient reasoning with ontologies
  • LinKFactory — is Language & Computing’s ontology management tool. It provides an effective and user-friendly way to create, maintain and extend extensive multilingual terminology systems and ontologies (English, Spanish, French, etc.). It is designed to build, manage and maintain large, complex, language independent ontologies
  • M3t4.Studio Semantic Toolkit — is Metatomix’s free set of Eclipse plug-ins to allow developers to create and manage OWL ontologies and RDF documents
  • MAFRA Toolkit — the Ontology MApping FRAmework Toolkit allows users to create semantic relations between two (source and target) ontologies, and to apply such relations in translating source ontology instances into target ontology instances
  • OntoEngine — is a step toward allowing agents to communicate even though they use different formal languages (i.e., different ontologies). It translates data from a “source” ontology to a “target.”
  • OntoPortal — enables the authoring and navigation of large semantically-powered portals
  • OWLS-MX — the hybrid semantic Web service matchmaker OWLS-MX 1.0 utilizes both description logic reasoning, and token based IR similarity measures. It applies different filters to retrieve OWL-S services that are most relevant to a given query
  • pOWL — is a semantic Web development platform for ontologies in PHP. pOWL consists of a number of components, including RAP
  • Protege — is an open source visual ontology editor written in Java with many plug-in tools
  • Semantic Net Generator — is a utility for generating topic maps automatically from different data sources by using rules definitions specified with Jelly XML syntax. This Java library provides Jelly tags to access and modify data sources (also RDF) to create a semantic network
  • SOFA — is a Java API for modeling ontologies and Knowledge Bases in ontology and Semantic Web applications. It provides a simple, abstract and language neutral ontology object model, inferencing mechanism and representation of the model with OWL, DAML+OIL and RDFS languages
  • Terminator — is a tool for creating term to ontology resource mappings (documentation in Finnish)
  • WebOnto — supports the browsing, creation and editing of ontologies through coarse grained and fine grained visualizations and direct manipulation.

[9] The SKOS language has the following classes:

  • CollectableProperty — A property which can be used with a skos:Collection
  • Collection — A meaningful collection of concepts
  • Concept — An abstract idea or notion; a unit of thought
  • ConceptScheme — A set of concepts, optionally including statements about semantic relationships between those concepts. Thesauri, classification schemes, subject heading lists, taxonomies, ‘folksonomies’, and other types of controlled vocabulary are all examples of concept schemes. Concept schemes are also embedded in glossaries and terminologies.
  • OrderedCollection — An ordered collection of concepts, where both the grouping and the ordering are meaningful

. . . and the following properties:

  • altLabel — An alternative lexical label for a resource. Acronyms, abbreviations, spelling variants, and irregular plural/singular forms may be included among the alternative labels for a concept
  • altSymbol — An alternative symbolic label for a resource
  • broader — A concept that is more general in meaning. Broader concepts are typically rendered as parents in a concept hierarchy (tree)
  • changeNote — A note about a modification to a concept
  • definition — A statement or formal explanation of the meaning of a concept
  • editorialNote — A note for an editor, translator or maintainer of the vocabulary
  • example — An example of the use of a concept
  • hasTopConcept — A top level concept in the concept scheme
  • hiddenLabel — A lexical label for a resource that should be hidden when generating visual displays of the resource, but should still be accessible to free text search operations
  • historyNote — A note about the past state/use/meaning of a concept
  • inScheme — A concept scheme in which the concept is included. A concept may be a member of more than one concept scheme
  • isPrimarySubjectOf — A resource for which the concept is the primary subject
  • isSubjectOf — A resource for which the concept is a subject
  • member — A member of a collection
  • memberList — An RDF list containing the members of an ordered collection
  • narrower — A concept that is more specific in meaning. Narrower concepts are typically rendered as children in a concept hierarchy (tree)
  • note — A general note, for any purpose. The other human-readable properties of definition, scopeNote, example, historyNote, editorialNote and changeNote are all sub-properties of note
  • prefLabel — The preferred lexical label for a resource, in a given language. No two concepts in the same concept scheme may have the same value for skos:prefLabel in a given language
  • prefSymbol — The preferred symbolic label for a resource
  • primarySubject — A concept that is the primary subject of the resource. A resource may have only one primary subject per concept scheme
  • related — A concept with which there is an associative semantic relationship
  • scopeNote — A note that helps to clarify the meaning of a concept
  • semanticRelation — A concept related by meaning. This property should not be used directly, but as a super-property for all properties denoting a relationship of meaning between concepts
  • subject — A concept that is a subject of the resource
  • subjectIndicator — A subject indicator for a concept. [The notion of ‘subject indicator’ is defined here with reference to the latest definition endorsed by the OASIS Published Subjects Technical Committee]
  • symbol — An image that is a symbolic label for the resource. This property is roughly analogous to rdfs:label, but for labelling resources with images that have retrievable representations, rather than RDF literals. Symbolic labelling means labelling a concept with an image.

[10] The SWEO classification ontology is still under active development and has these draft classes. Note, however, the relative lack of actual subject or topic matter:

Classes are currently defined as:

  • article – magazine article
  • blog – blog discussing SW topics
  • book – indicates a textbook, applies to the book’s home page, review or listing in Amazon or such
  • casestudy – Article on a business case
  • conference/event – conferences or events where you can learn about the Semantic Web
  • demo/demonstration – interactive SW demo
  • forum – a forum on semantic web or related topics
  • presentation – Powerpoint or similar slide show
  • person – If this is a person’s home page or blog, see below
  • publication – a scientific publication
  • ontology – a formalisation of a shared conceptualization using OWL, RDFS, SKOS or something else based on RDF
  • organization – If the page is the home page of an organization, research, vendor etc, see below
  • portal – a portal website on Semantic Web or related topics, usually hosting information items, mailinglists, community tools
  • project – a research (for example EU-IST) or other project that addresses Semantic Web issues
  • mailinglist – a mailinglist on semantic Web or related topics
  • person – ideally a person who is well known regarding the Semantic Web (someone who could give a keynote), but may also be any related person
  • press – a press release by a company or an article about Semantic Web
  • recommended – If the resource is seen to be in the top 10 of its kind
  • specification – a Semantic Web specification (RDF, RDF/S, OWL, etc)
  • categories – (perhaps using tags or other free-form annotation)
  • successstory – Article that can contain advertisement and clearly shows the benefit of the semantic web
  • tutorial – a tutorial teaching some aspect of semantic web, an example
  • vocabulary – a RDF vocabulary
  • software project/tool – For product/project home pages

If the page describes an organization, it can be tagged as:

  • vendor
  • research
  • enduser

If the page is a person’s home page or blog or similar, it could be:

  • opinionleader
  • researcher
  • journalist
  • executive
  • geek

The type of audience can also be tagged, for example:

  • general public
  • beginners
  • technicians
  • researchers.

[11] The OASIS Topic Maps Published Subjects Technical Committee was formed a number of years back to promote Topic Maps interoperability through the use of Published Subjects Indicators (PSIs). Their resulting report was a very interesting effort that unfortunately did not lead to wide adoption, perhaps because the effort was a bit ahead of its time or it was in advance of the broader acceptance of RDF. This general topic is the subject of a later posting by me.

[12] See further, Leo Obrst, “The Semantic Spectrum & Semantic Models,” a Powerpoint presentation (http://ontolog.cim3.net/file/resource/presentation/LeoObrst_20060112/OntologySpectrumSemanticModels–LeoObrst_20060112.ppt)
made as part of an Ontolog Forum (http://ontolog.cim3.net/) presentation in two parts, “What is an Ontology? – A Briefing on the Range of Semantic Models” (see http://ontolog.cim3.net/cgi-bin/wiki.pl?ConferenceCall_2006_01_12), in January 2006. Leo Obrst is a principal artificial intelligence scientist at MITRE’s (http://www.mitre.org) Center for Innovative Computing and Informatics and a co-convener of the Ontolog Forum. His presentation is a rich source of practical overview information on ontologies.

[13] The actual diagram is an unattributed modification by Dan McCreary (see http://www.danmccreary.com/presentations/sem_int/sem_int.ppt) based on Obrst’s material in [12].

Posted:May 15, 2007

Eat Your Greens!

You Can Make a Contribution by Adopting These Standards

You may have observed some changes to my masthead thingeys. I have added a couple of new “Get FOAF Profile” and “Get SIOC Profile” icons (stage right, upper quadrant) to (hopefully) more prominently display some important stuff. These icons are for some standard RDF ontologies that are becoming prevalent, FOAF (Friend of a Friend) and SIOC (Semantically Interlinked Online Communities, pronounced “shock”).

If you click on either icon, you will see the respective FOAF or SIOC profile for my site, courtesy of Uldis Bojārs’ SIOC browser (see below).

The truth is, today, these things are largely for the “insiders” working on semantic Web stuff. But the truth is also, tomorrow (and I mean that literally), you should know about these things and possibly adopt them for your own site [1].

OK, Let’s Hear the Acronyms Again

You have likely noticed that I have repeatedly used the acronyms of FOAF, SIOC, DOAP and SKOS (among many others!) in my recent postings. What is interesting about this alphabet soup is that they represent sort of “standard” ways to discuss people, communities, projects and “ontologies” within the semantic Web community. So, the first observation is that it is useful to have “standard” ways to describe anything.

Each of these standards is an RDF (Resource Description Framework) ontology or, perhaps in a more understood way, a common vocabulary and world view about how to discuss something. (By the way, don’t get all confused in such discussions with XML; XML is but one way of how you might describe something — which may or may not apply to any given semantic Web concept — versus what you are actually describing.)

Just as Wikipedia or Google have emerged as the “standards” within their domains, these RDF ontologies may or may not win their own survival-of-the-fittest contests. So, I’m not asserting whether any of these formats is going to make it. But I am asserting that such formats are part of the emerging structured Web.

In fact, in one back story to the recently completed WWW 2007 conference in Banff, Alberta (a rousing success by all accounts!), Tim Berners-Lee noted he only wanted to have his photo taken with those who already had a FOAF profile!

The Start of Good Nutrition

A typical concern about the semantic Web is, “Who pays the tax?” In other words, to get the advantage of the metadata and characterizations of content that allows retrieval and manipulation at the object level (as opposed to the document or page level), why do it? Who benefits? Why incur the effort and cost?

We all now understand the benefits of an interlinked document world. We see it many times daily; indeed, it is now such a part of our cultural and daily life as to go unquestioned. Wow! How could that have happened in a mere decade?

The same thing is true for these semantic Web and RDF ontology constructs. We need to keep twirling the stick until enough friction takes place to cause the fire’s spark.

So, anyone interested in this next phase of the Web — that is, the move to structure and linked data — needs to eat their greens. Even a small percentage of adopters, multiplied across the Web’s large numbers and amplified by network effects, will cause this fire to grow hot. The real point behind such efforts, of course, is that if no one listens it is unimportant; but if many listen, importance grows as a function of this network effect.

Of course, over time, the mechanisms to create the structure upon which the semantic Web works will occur automatically and in the background. (Just as today many no longer hand-code the HTML in their Web pages.) But learning about the general structure of RDF and spending some time hand-coding your own FOAF profile is a worthwhile start to good semantic nutrition.

Greens are Yummy

Do you remember your first attempt to learn HTML? What was it like to learn how to effectively query a search engine? What are the other myriad ways we have learned and adapted to this (now) constant Web environment?

So, assuming you want to get a bit exposed and expose your own Web site to these standards, what do you do next? Not to be definitive, but here are some approaches solely from my own standpoint as a blogger who uses WordPress. Other leading blog and CMS software have slightly different requirements; just search on your brand in combination with one of the alphabet soup acronyms.

FOAF

FOAF is kind of fun. FOAF is an RDF vocabulary for describing people, organizations, and relations among them (see here for specification). My own observation is that it is less used for the friend part and more as a self-description.

Conceptually, FOAF is simple: Who are you, and who are your friends? Yet it is broadly extensible to include any variety of details about personal life, family, work history, likes, wants and preferences (via the addition of more namespaces). And, unfortunately, it is also generally pretty crappy in how it is applied and installed. So, if you truly wanted to get your picture taken with TimBL, what the heck should you do?

First, learn about FOAF itself. Christopher Allen in his Life with Alacrity blog [2] wrote a good introduction to FOAF about three years ago. He explains online FOAF services to help you craft your first foaf.rdf file, validation services, and other nuances.

Next, you will need to publicize the availability of your FOAF profile. Perhaps the best known option for WordPress is the FOAF Output Plugin from Morten Frederiksen. This is the most full-featured option, but it is also abstract and difficult to follow from an installation and configuration standpoint [3]. (Pretty common for this stuff.) I personally chose the FOAF Header plug-in by Ben Hyde (with MIT’s Simile program), which is simple, direct and works.

Once you’ve got a basic profile working, it is then time to really tweak your profile according to your needs and desires. To really configure your profile, consult the FOAF vocabulary. Note that not all FOAF statements are stable. (That is, third-party apps may not parse or support the statements.) Please also note that FOAF has no mechanism to restrict or limit information to different recipients. Only put stuff in your profile that you want to be public.
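
For orientation only, here is a rough sketch of what a minimal profile might contain, built with Python’s rdflib and serialized to RDF/XML; the name, homepage and friend URIs are placeholders, not a recommended template:

```python
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF

FOAF = Namespace("http://xmlns.com/foaf/0.1/")

g = Graph()
g.bind("foaf", FOAF)

# a hypothetical URI identifying "me" within the profile document
me = URIRef("http://example.org/foaf.rdf#me")

g.add((me, RDF.type, FOAF.Person))
g.add((me, FOAF.name, Literal("Jane Example")))
g.add((me, FOAF.homepage, URIRef("http://example.org/")))

# the "friend" part: point at someone else's FOAF identifier
g.add((me, FOAF.knows, URIRef("http://example.org/friends/dick#me")))

# write out the profile as RDF/XML, the form most FOAF tools expect
g.serialize(destination="foaf.rdf", format="xml")
```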

You may also want to check out these personal FOAF profiles or the FOAF bulletin board to get some great examples of different FOAF content and styles.

SIOC

SIOC provides methods for interconnecting online discussion venues such as blogs, forums and mailing lists (see here for specification). There is a partial compilation of SIOC-enabled sites on the ESW wiki.
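
As a rough sketch of the kinds of statements a SIOC exporter emits (again in Python with rdflib; the weblog and post URIs are invented for illustration), a single post on a weblog might be described like this:

```python
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF

SIOC = Namespace("http://rdfs.org/sioc/ns#")
DC = Namespace("http://purl.org/dc/elements/1.1/")

g = Graph()
g.bind("sioc", SIOC)
g.bind("dc", DC)

# hypothetical URIs for a weblog and one of its posts
blog = URIRef("http://example.org/blog")
post = URIRef("http://example.org/blog/2007/05/hello-structured-web")

g.add((blog, RDF.type, SIOC.Forum))        # a weblog is modeled as a sioc:Forum
g.add((post, RDF.type, SIOC.Post))
g.add((post, SIOC.has_container, blog))    # the post lives in that container
g.add((post, DC["title"], Literal("Hello, structured Web")))
g.add((post, SIOC.content, Literal("The full text of the post goes here.")))

print(g.serialize(format="turtle"))
```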

Uldis Bojārs (nick CaptSolo) is doing a remarkable job across the board on these RDF standard ontology issues. First, he has written the definitive browser for these protocols, the SIOC browser. Second, he has contributed with others in the SIOC community to create general SIOC exporters for aggregators, DotClear, Drupal, mailing lists, a PHP API, phpBB and WordPress [4]. It is the latter version that I use on my own blog (the installation of which I initially documented last August).

Third, Uldis has written the Firefox add-on Semantic Radar. Semantic Radar inspects Web pages and detects links to semantic Web metadata from SIOC, FOAF or DOAP. When any of these forms are detected, its accompanying icon displays in the Firefox status bar and, if clicked, then displays the profile record in the SIOC browser. Very cool. (And, oh, BTW, Uldis is also one of the most helpful people around! 🙂 )

Another cool option of Semantic Radar is that it pings Ping the Semantic Web (PTSW) when it detects a compliant site. PTSW is itself a highly useful aggregator of the semantic Web community and is one of Frédérick Giasson’s innovations exploiting interlinked data. This is a good site to monitor RDF publishing on the Web and new instances of sites that comply with the alphabet standards.

Staying Healthy

So, with the addition of the icons above, I’m now eating my greens! And, you know what, they’re both fun and nutritious.

I will continue to add to and improve my various online profiles. In the immediate future, I also plan to add DOAP and SKOS characterizations of my site and work. Now, back to the hand coding . . . .


[1] If one looks to early bloggers or early podcasters or whatever, the truth is that first users tend to get more traffic and attention. If you are a new blogger today, for example, the sad truth is that your ability to find a large audience is much reduced. Of course, that statement does not mean that you cannot find a large audience (if that is your objective, and I don’t mean to suggest it is the only meaningful one either!), just that the percentage likelihood is lower today than yesterday.

[2] Chris has really scaled back on his online writing, which is a shame. His blog and the quality of his material were among the reasons I took up blogging myself two years ago.

[3] Too many semantic Web options are hard to understand, install, configure or use. Too bad; the FOAF Output plug-in has mostly really good stuff here that most casual users would never consider.

[4] Key tools have been written by John Breslin (phpBB, Drupal), Alexandre Passant (DotClear, PHP API, a bit of Drupal), Sergio Fdez (mailing lists) and Uldis (WordPress and help on the PHP API).

Posted:May 9, 2007


This Week Marks Two New Milestones on the Road to the Semantic Web

Two milestones of the structured Web are occurring as we speak. Let’s break out the RDF and party!

Encyclopedia of Life

Today, the Encyclopedia of Life (http://www.eol.org) was formally unveiled. See EOL’s amazing intro video.

EOL is meant to be the Wikipedia of all 1.8 million known living species, backed with real money and real prestige. The effort continues a growing list of impressive and open data sources being compiled and presented on the Web.

The Encyclopedia of Life is a planned 10-year effort to create a free Internet resource to catalog and describe every one of the planet’s organisms. Initially seeded with $12.5 million in backing from two US philanthropies, the John D. and Catherine T. MacArthur Foundation and the Alfred P. Sloan Foundation, the effort is anticipated to cost $100 million by completion.

EOL is based on an idea by Harvard biologist and noted author Edward O. Wilson, a 2007 TED winner for the idea (as is better seen and explained in his acceptance video). Other initial project backers are Harvard University, the Smithsonian Institution, the Missouri Botanical Garden, the Biodiversity Library Project with its international participants, and the Chicago Field Museum of Natural History.

Each species will get its own Web page with a structured record, similar to the “infoboxes” within Wikipedia. Information will include photos, technical name and phylogeny, lay description, range maps, and place within the Tree of Life. More prominent species will also get information on their genetics and behavior, along with compiled scientific research articles, some from centuries ago.

EOL is two and one-half orders of magnitude more ambitious than the earlier Tree of Life Web Project from the University of Arizona with its 5000 pages, and 20 times more ambitious than Wikipedia’s own Wikispecies.

Potential applications range from conservation to mapping to identifying the odd plant or animal species. But, as has been found with important predecessor data sets such as GenBank (65 billion genetic sequences for some 100,000 organisms), FishBase (30,000 species) or Conabio (biodiversity in Mexico), among literally hundreds of others, popularity and use exceed wildest expectations. Conservationists, researchers, school children, amateur naturalists, and outdoor lovers will all find use for the site.

Linking Open Data

EOL is but the newest poster child of massive and hugely important datasets being exposed on the Internet. One initiative whose sole purpose is to promote this trend, and the interoperability of the data via RDF, is the Linking Open Data project. The LOD project is part of the W3C’s Semantic Web Education and Outreach interest group (yes, the unwieldy SWEO-IG).

The LOD project is holding its first meetings ever this week at WWW2007 in Banff, Alberta. While only a mere six months old, the SWEO-IG is stimulating much broader interest in the semantic Web through sponsored products, FAQs (also announced yesterday, and really got digged!), use cases, supporting information, and growing and useful compilations such as datasets and semantic Web tools.

Anyone interested in the semantic Web should really be familiar with the ESW wiki (which unfortunately is in need of re-organization and clean-up; but keep poking, there are many riches hidden in the nooks and crannies!). You may also want to track the public mailing list of the Linking Open Data group.

I already wrote extensively on the exciting DBpedia initiative (Did you Blink? The Structured Web Just Arrived), itself one of the catalytic datasets at the core of the Linking Open Data efforts. Other datasets actively being pursued for inclusion by the group include:

  • Geonames data, including its ontology and 6 million geographical places and features, with an implementation of RDF data services
  • 700 million triples of U.S. Census data from Josh Tauberer
  • Revyu.com, the RDF-based reviewing and rating site, including its links to FOAF, the Review Vocab and Richard Newman's Tag Ontology
  • The "RDFizers" from MIT Simile Project (not to mention other tools), plus 50 million triples from the MIT Libraries catalog covering about a million books
  • GEMET, the GEneral Multilingual Environmental Thesaurus of the European Environment Agency
  • And, WordNet through the YAGO project, and its potential for an improved hierarchical ontology to the Wikipedia data.
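
To give a flavor of how one actually reaches such data once it is published, here is a minimal sketch of querying DBpedia’s public SPARQL endpoint with the SPARQLWrapper Python library (the endpoint URL is DBpedia’s published one; the query itself is only an illustration):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# DBpedia's public SPARQL endpoint
sparql = SPARQLWrapper("http://dbpedia.org/sparql")

# an illustrative query: fetch a handful of resources and their English labels
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?resource ?label
    WHERE {
        ?resource rdfs:label ?label .
        FILTER (lang(?label) = "en")
    }
    LIMIT 5
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["resource"]["value"], row["label"]["value"])
```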

Additional candidate datasets of interest have also been identified by the SWEO interest group on this page.

The SWEO-IG and its Linking Open Data initiative are catalyzing a new phase of excitement with relevant information, data and demos.

If anything, all I can say is: Set your sights higher! There’s data galore that needs to be “RDFized”.

So, folks, keep up the great work! And good luck with this week’s meetings in the beauty of Canada.

[BTW, please make sure that EOL makes its data available in RDF!]
