Ontology is one of the more daunting terms for those exposed for the first time to the semantic Web. Not only is the word long and without many common antecedents, but it is also a term that has widely divergent use and understanding within the community. It can be argued that this not-so-little word is one of the barriers to mainstream understanding of the semantic Web.
The root of the term is the Greek ontos, or being or the nature of things. Literally — and in classical philosophy — ontology was used in relation to the study of the nature of being or the world, the nature of existence. Tom Gruber, among others, made the term popular in relation to computer science and artificial intelligence about 15 years ago when he defined ontology as a “formal specification of a conceptualization.”
While there have been attempts to strap on more or less formal understandings or machinery around ontology, it still has very much the sense of a world view, a means of viewing and organizing and conceptualizing and defining a domain of interest. As is made clear below, I personally prefer a loose and embracing understanding of the term (consistent with Deborah McGuinness’s 2003 paper, Ontologies Come of Age ).
There has been a resurgence of interest in ontologies of late. Two reasons have been the emergence of Web 2.0, with its tagging and folksonomies, and the nascent emergence of the structured Web. In fact, on April 23-24 one of the noted communities of practice around ontologies, Ontolog, sponsored the Ontology Summit 2007, “Ontology, Taxonomy, Folksonomy: Understanding the Distinctions.”
These events have sparked my preparing this guide to ontologies. I have to admit this is a somewhat intrepid endeavor given the wealth of material and diversity of opinions.
Of course, a fancy name is not sufficient alone to warrant an interest in ontologies. There are reasons why understanding, using and manipulating ontologies can bring practical benefit:
Both structure and formalism are dimensions for classifying ontologies; combined, they are often referred to as an ontology’s “expressiveness.” How one describes this structure and formality differs. One recent attempt is this figure from the Ontology Summit 2007’s wrap-up communique:
Note the bridging role that an ontology plays between a domain and its content. (By its nature, every ontology attempts to “define” and bound a domain.) Also note that the Summit’s 50 or so participants were focused on the trade-off between semantics v. pragmatic considerations. This was a result of the ongoing attempts within the community to understand, embrace and (possibly) legitimize “less formal” Web 2.0 efforts such as tagging and the folksonomies that can result from them.
There is an M.C. Escher-like recursion of the lizard eating its tail when one observes ontologists creating an ontology to describe the ontological domain. The above diagram, which itself would be different with a slight change in Summit participation or editorship, is, of course, but one representative view of the world. Indeed, a tremendous variety of scientific and research disciplines concern themselves with classifying and organizing the “nature of things.” Their practitioners go by such names as logicians, taxonomists, philosophers, information architects, computer scientists, librarians, operations researchers, systematicists, statisticians, historians, and so forth. (In short, given our ontos, every area of human endeavor has the urge to classify, to organize.) Across these areas not only do the domains differ, but so do the structures and classification schemes typically adopted.
There are at least 40 terms or concepts across these various disciplines, most related to Web and general knowledge content, that have organizational or classificatory aspects that — loosely defined — could be called an “ontology” framework or approach:
Actual domains or subject coverage are then mostly orthogonal to these approaches.
Loosely defined, the number of possible ontologies is therefore close to infinite: domain X perspective X schema. (Just kidding — sort of! In fact, UMBC’s Swoogle ontology search service claims 10,000 ontologies presently on the Web; the actual data from August 2006 ranges from about 16,000 to 92,000 ontologies, depending on how “formal” the definition. These counts are also limited to OWL-based ontologies.)
Many have misunderstood the semantic Web because of this diversity and the slipperiness of the concept of an ontology. This misunderstanding becomes flat wrong when people claim the semantic Web implies one single grand ontology or organizational schema, One Ring to Rule Them All. Human and domain diversities make this viewpoint patently false.
The choice of an ontological approach to organize Web and structured content can be contentious. Publishers and authors perhaps have too many choices: from straight Atom or RSS feeds and feeds with tags to informal folksonomies and then Outline Processor Markup Language or microformats. From there, the formalism increases further to include the standard RDF ontologies such as SIOC (Semantically-Interlinked Online Communities), SKOS (Simple Knowledge Organization System), DOAP (Description of a Project), and FOAF (Friend of a Friend) and the still greater formalism of OWL’s various dialects.
Arguing which of these is the theoretical best method is doomed to failure, except possibly in a bounded enterprise environment. We live in the real world, where multiple options will always have their advocates and their applications. All of us should welcome whatever structure we can add to our information base, no matter where it comes from or how it’s done. The sooner we can embrace content in any of these formats and convert it to a canonical form, the sooner we can move on to needed developments in semantic mediation, the threshold condition for the semantic Web.
So, diversity is inevitable and should be accepted. But that observation need not also embrace chaos.
In my early training in biological systematics, Ernst Haeckel’s recapitulation theory that “ontogeny recapitulates phylogeny” (note the same ontos root, the difference from ontology being growth v. study) was losing favor fast. The theory was that the development of an organism through its embryological phases mirrors its evolutionary history. Today, modern biologists recognize numerous connections between ontogeny and phylogeny, explain them using evolutionary theory, or view them as supporting evidence for that theory.
Yet, as in the construction of phylogenetic trees, systematicists strive for their classifications of the relatedness of organisms to be “natural”, to reflect the true nature of the relationship. Thus, over time, that understanding of a “natural” system has progressed from appearance → embryology → embryology + detailed morphology → species and interbreeding → DNA. While details continue to be worked out, the degree of genetic relatedness is now widely accepted by biologists as a “natural” basis for organizing the Tree of Life.
It is not unrealistic to also seek “naturalness” in the organization of other knowledge domains, to seek “naturalness” in the organization of their underlying ontologies. Like natural systems in biology, this naturalness should emerge from the shared understandings and perceptions of the domain’s participants. While subject matter expertise and general and domain knowledge are essential to this development, they are not the only factors. As tagging systems on the Web are showing, common usage and broad acceptance by the community at hand are important as well.
While it may appear that a domain such as the biological relatedness of organisms is more empirical than the concepts and ambiguous words in most domains of human endeavor, these attempts at naturalness are still not foolish. The phylogeny example shows that understanding changes over time as knowledge is gained. We now accept DNA over the recapitulation theory.
As the formal SKOS organizational schema for knowledge systems recognizes (see below), the ideas of narrower and broader concepts can be readily embraced, as well as concepts of relatedness and aliases (synonyms). These simple constructs, I would argue, plus the application of knowledge being gained in related domains, will enable tomorrow’s understandings to be more “natural” than today’s, no matter the particular domain at hand.
So, in seeking a “naturalness” within our organizational schema, we can also see that change is a constant. We also see that the tools and ideas underlying the seemingly abstract cause of merging and relating existing ontologies to one another will further a greater “naturalness” within our organizations of the world.
According to the Summit, expressiveness is the extent and ease by which an ontology can describe domain semantics. Structure they define as the degree of organization or hierarchical extent of the ontology. They further define granularity as the level of detail in the ontology. And, as the diagram above alludes, they define other dimensions of use, logical basis, purpose and so forth of an ontology.
More than fifty respondents from 42 communities submitted some 70 different ontologies under about 40 terms to a survey that the Summit used to construct its diagram. These submissions included:
. . . formal ontologies (e.g., BFO, DOLCE, SUMO), biomedical ontologies (e.g., Gene Ontology, SNOMED CT, UMLS, ICD), thesauri (e.g., MeSH, National Agricultural Library Thesaurus), folksonomies (e.g., Social bookmarking tags), general ontologies (WordNet, OpenCyc) and specific ontologies (e.g., Process Specification Language). The list also includes markup languages (e.g., NeuroML), representation formalisms (e.g., Entity-Relation model, OWL, WSDL-S) and various ISO standards (e.g., ISO 11179). This [Ontolog] sample clearly illustrates the diversity of artifacts collected under “ontology”.
I think the simplest spectrum for such distinctions is the formalism of the ontology and its approach (and language or syntax, not further discussed here). More formal ontologies have greater expressiveness and structure and inferential power, less formal ones the opposite. Constructing more formal ontologies is more demanding, and takes more effort and rigor, resulting in an approach that is more powerful but also more rigid and less flexible. Like anything else, there are always trade-offs.
Note this diagram reflects the more conventional, practitioner’s view of the “formal” ontology, which does not include taxonomies or controlled vocabularies (for example) in the definition. This represents the more “closely defined” end of the ontology (semantic) spectrum.
However, since we are speaking here of ontologies and the structured Web or the semantic Web, I believe we need to embrace a concept of ontology aligned to actual practice. Not all content providers can or want to employ ontology engineers to enable formal inferencing of their content. Yet, on the other hand, their content in its various forms does have some meaningful structure, some organization. The trick is to extract this structure for more meaningful use such as data exchange or data merging.
Under such “loosely defined” bases we can thus see a spectrum of ontology approaches on the Web, proceeding from less structure and formalism to more so:
|Type or Schema||Examples||Comments on Structure and Formalism|
|Standard Web Page||entire Web||General metadata fields in the <head> and internal HTML codes and tags provide minimal, but useful sources of structure; other HTTP and retrieval data can also contribute|
|Blog / Wiki Page||examples from Technorati, Bloglines, Wikipedia||Provides still greater formalism for the organization and characterization of content (subjects, categories, posts, comments, date/time stamps, etc.). Importantly, with the addition of plug-ins, some of the basic software may also provide other structured characterizations or output (SIOC, FOAF, etc.; highly varied and site-specific given the diversity of publishing systems and plug-ins)|
|RSS / Atom feeds||most blogs and most news feeds||RSS extends basic XML schema for more robust syndication of content with a tightly controlled vocabulary for feed concepts and their relationships. Because of its ubiquity, this is becoming a useful baseline of structure and formalism; also, the nature of adoption shows much about how ontological structure is an artifact, not driver, for use|
|RSS / Atom feeds with tags or OPML||Grazr, most newsfeed aggregators can import and export OPML lists of RSS feeds||The OPML specification defines an outline as a hierarchical, ordered list of arbitrary elements. The specification is fairly open which makes it suitable for many types of list data. See also OML and XOXO|
|Hierarchical Faceted Metadata||XFML, Flamenco||These and related efforts from the information architecture (IA) community are geared more to library science. However, they directly contribute to faceted browsing, which is one of the first practical instantiations of the semantic Web|
|Folksonomies||Flickr, del.icio.us||Based on user-generated tags and informal organizations of the same; not linked to any “standard” Web protocols. Both tags and hierarchical structure are arbitrary, but some researchers now believe over large enough participant sets that structural consensus and value does emerge|
|Microformats||Example formats include hAtom, hCalendar, hCard, hReview, hResume, rel-directory, xFolk, XFN and XOXO||A microformat is HTML mark up to express semantics with strictly controlled vocabularies. This markup is embedded using specific HTML attributes such as class, rel, and rev. This method is easy to implement and understand, but is not free-form|
|Embedded RDF||RDFa, eRDF||An embedded format, like microformats, but free-form, and not subject to the approval strictures associated with microformats|
|Topic Maps||Infoloom, Topic Maps Search Engine||A topic map can represent information using topics (representing any concept, from people, countries, and organizations to software modules, individual files, and events), associations (which represent the relationships between them), and occurrences (which represent relationships between topics and information resources relevant to them)|
|RDF||Many; DBpedia, etc.||RDF has become the canonical data model since it represents a “universal” conversion format|
|RDF Schema||SKOS, SIOC, DOAP, FOAF||RDFS or RDF Schema is an extensible knowledge representation language, providing basic elements for the description of ontologies, otherwise called RDF vocabularies, intended to structure RDF resources. This becomes the canonical ontology common meeting ground|
|OWL Lite||Here are some existing OWL ontologies; also see Swoogle for OWL search facilities||The Web Ontology Language (OWL) is a language for defining and instantiating Web ontologies. An OWL ontology may include descriptions of classes, along with their related properties and instances. OWL is designed for use by applications that need to process the content of information instead of just presenting information to humans. It facilitates greater machine interpretability of Web content than that supported by XML, RDF, and RDF Schema (RDF-S) by providing additional vocabulary along with a formal semantics. The three language versions (OWL Lite, OWL DL and OWL Full) are in order of increasing expressiveness|
|Higher-order “formal” and “upper-level” ontologies||SUMO, DOLCE, PROTON, BFO, Cyc, OpenCyc||These provide comprehensive ontologies and often related knowledge bases, with the goal of enabling AI applications to perform human-like reasoning. Their reasoning languages often use higher-order logics|
As a rule of thumb, items that are less “formal” can be converted to a more formal expression, but the most formal forms can generally not be expressed in less formal forms.
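To illustrate this rule of thumb, here is a minimal sketch in plain Python. It lifts a flat list of folksonomy-style tags into SKOS-style triples, then flattens a slightly richer set of triples back into tags. The `ex:` prefix, the tag values and both helper functions are my own assumptions for illustration; only the `skos:` and `rdf:` property names come from the published vocabularies.

```python
def tags_to_triples(tags, scheme="ex:tags"):
    """Lift a flat list of tags into SKOS-style concept triples."""
    triples = []
    for tag in tags:
        # Derive a hypothetical concept identifier from the tag text.
        concept = "ex:" + tag.strip().lower().replace(" ", "_")
        triples.append((concept, "rdf:type", "skos:Concept"))
        triples.append((concept, "skos:prefLabel", tag.strip()))
        triples.append((concept, "skos:inScheme", scheme))
    return triples


def triples_to_tags(triples):
    """Flatten triples back to bare tags; every other relation is dropped."""
    return [o for s, p, o in triples if p == "skos:prefLabel"]


tags = ["Semantic Web", "ontology"]
lifted = tags_to_triples(tags)

# A more formal statement that a flat tag list has no way to hold.
enriched = lifted + [("ex:ontology", "skos:broader", "ex:semantic_web")]
flattened = triples_to_tags(enriched)
```

Going up, every tag survives as a labeled concept. Going back down, the labels return but the `skos:broader` statement has nowhere to live, which is the asymmetry the rule of thumb describes.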
As later sections elaborate, I see RDF as the universal data model for representing this structure in a common, canonical format, with RDF Schema (specifically SKOS, but also supplemented by FOAF, DOAP and SIOC) as the organizing ontology knowledge representation language (KRL).
This is not to say that the various dialects of OWL should be neglected. In bounded environments, they can provide superior reasoning power and are warranted if they can be sufficiently mandated or enforced. But the RDF and RDF-S systems represent the most tractable “meeting place” or “middle ground,” IMHO.
As if the formalism dimension were not complicated enough, there is also the practice within the ontology community to characterize ontologies by “levels”, specifically upper, middle and lower levels. For example, chances are that you have heard particularly of “upper-level” ontologies.
The following figure helps illustrate this “level” dimension. This diagram is also from Leo Obrst of Mitre, and was also used in another 2006 talk by Jack Park and Patrick Durusau (discussed further below for other reasons):
Examples of upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc and BFO (Basic Formal Ontology). Most of the content in their upper levels is more akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget’s Thesaurus; that is, real ontos stuff) than to “generic common knowledge.” Almost all of them have both a hierarchical and a networked structure, though their actual subject structure relating to concrete things is generally pretty weak.
The above diagram conveys a sense of how multiple ontologies can relate to one another both in terms of narrower and broader topic matter and at the same “levels” of generalization. Such “meta-structure” (if you will) can provide a reference structure for relating multiple ontologies to one another.
It is exactly in such bindings or relationships that we can foresee the promise of querying and relating multiple endpoints on the Web with accurate semantics in order to connect dots and combine knowledge bases. Thus, the understanding of the relationships and mappings amongst ontologies becomes a critical infrastructural component of the semantic Web.
We can better understand these mapping and inter-relationship concepts by using a concrete example with a formal ontology. We’ll choose to use the Suggested Upper Merged Ontology simply because it is one of the best known. We could have also selected another upper-level system such as PROTON or Cyc or one of the many with narrower concept or subject coverage.
SUMO is one of the formal ontologies that has been mapped to the WordNet lexicon, which adds to its semantic richness. SUMO is written in the SUO-KIF language. SUMO is free and owned by the IEEE. The ontologies that extend SUMO are available under GNU General Public License.
The abstract, conceptual organization of SUMO is shown by this diagram, which also points to its related MILO (MId-Level Ontology), which is being developed as a bridge between the abstract content of the SUMO and the richer detail of various domain ontologies:
At this level, the structure is quite abstract. But one can easily browse the SUMO structure. A nifty tool to do so is the KSMSA (Knowledge Support for Modeling and Simulation) ontology browser. Using a hierarchical tree representation, you can navigate through SUMO, MILO, WordNet, and (with the locally installed version) Wikipedia.
The figure below shows the upper-level entity concept on the left; the right-hand panel shows a drill-down into the example atom entity:
These views may be a bit misleading because the actual underlying structure, while it has hierarchical aspects as shown here, really is in the form of a directed acyclic graph (showing other relatedness options, not just hierarchical ones). So, alternate visualizations include traditional network graphs.
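The difference between a strict hierarchy and a directed acyclic graph can be sketched in a few lines of Python. The class names below are made up for illustration and are not actual SUMO content; the helpers are likewise my own, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter, CycleError

# node -> set of parent nodes (made-up classes, not actual SUMO content)
parents = {
    "Thing": set(),
    "Vehicle": {"Thing"},
    "Boat": {"Vehicle"},
    "Home": {"Thing"},
    "Houseboat": {"Boat", "Home"},  # two parents: not expressible in a strict tree
}


def is_acyclic(parents):
    """True when the parent links contain no cycle."""
    try:
        # static_order() raises CycleError while iterating if a cycle exists.
        list(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False


def multi_parent_nodes(parents):
    """Nodes that a tree view would have to duplicate or truncate."""
    return sorted(n for n, ps in parents.items() if len(ps) > 1)


shared = multi_parent_nodes(parents)
```

A tree browser has to pick one parent for `Houseboat` or show it twice, which is why a network graph can be the more faithful visualization.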
The other thing to note is that the “things” covered in the ontology, the entities, are also fairly abstract. That is because the intention of a standard “upper-level” ontology is to cover all relevant knowledge aspects of each entity’s domain. This approach results in a subject and topic coverage that feels less “concrete” than the coverage in, say, an encyclopedia, directory or card catalog.
According to Park and Durusau, upper ontologies are diverse, middle ontologies are even more diverse, and lower ontologies are more diverse still. A key observation is that ontological diversity is a given and increases as we approach real user interaction levels. Moreover, because of the “loose” nature of ontologies on the Web (now and into the future), diversity of approach is a further key factor.
Recall the initial discussion on the role and objectives of ontologies. About half of those roles involve effectively accessing or querying more than one ontology. The objective of “upper-level” ontologies, many with their own binding layers, is also expressly geared to ontology integration or federation. So, what are the possible mechanisms for such binding or integration?
A fundamental distinction within mechanisms to combine ontologies is whether it is a unified or centralized approach (often imposed or required by some party) or whether it is a schema mapping or binding approach. We can term this distinction centralized v. federated.
Centralized approaches can take a number of forms. At the most extreme, adherence to a centralized approach can be contractual. At the other end are reference models or standards. For example, illustrative reference models include:
Though I have argued that One Ring to Rule Them All is not appropriate to the general Web, there may be cases within certain enterprises, or where funding clout exists (such as government contracts), in which some form of centralized approach could be imposed. And, frankly, even where compliance cannot be assured, there are advantages in economy, efficiency and interoperability to attempting central ontologies. Certain industries (notably pharmaceuticals and petrochemicals) and certain disciplines (such as some areas of biology, among others) have, through trade associations or community consensus, done admirable jobs in adopting centralized approaches.
However, combining ontologies in the context of the broader Internet is more likely through federated approaches. (Though federated approaches can also be improved when there are consensual standards within specific communities.) The key aspect of a federated approach is to acknowledge that multiple schema need to be brought together, and that each contributing data set and its schema will not be altered directly and will likely remain in place.
Thus, the key distinctions within this category are the mechanisms by which those linkages may take place. An important goal in any federated approach is to achieve interoperability at the data or instance level without unacceptable loss of information or corruption of the semantics. Numerous specific approaches are possible, but three example areas (RDF-topic map interoperability, the use of “subject maps”, and binding layers) can illustrate some of the issues at hand.
In 2006 the W3C set up a working group to look at the issue of RDF and topic maps interoperability. Topic maps have been embraced by the library and information architecture community for some time, and have standards that have been adopted under ISO. Somewhat later but also in parallel was the development of the RDF standard by W3C. The interesting thing was that the conceptual underpinnings and objectives between these two efforts were quite similar. Also, because of the substantive thrust of topic maps and the substantive needs of its community, quite a few topic maps had been developed and implemented.
One of the first efforts of the W3C work group was to evaluate and compare five or six extant proposals for how to relate RDF and topic maps. That report is very interesting reading for anyone desirous of learning more about specific issues in combining ontologies and their interoperability. The result of that evaluation then led to some guidelines for best practices in how to complete this mapping. Evaluations such as these provide confidence that interoperability can be achieved between relatively formal schema definitions without unacceptable loss in meaning.
A different, “looser” approach, but one which also grew out of the topic map community, is the idea of “subject maps.” This effort, backed by Park and Durusau noted above, but also with the support of other topic map experts such as Steve Newcomb and Robert Barta via their proposed Topic Maps Reference Model (ISO 13250-5), seems to be one of the best attempts I’ve seen that both respects the reality of the actual Web and proposes a workable, effective scheme for federation.
The basic idea of a subject map is built around a set of subject “proxies.” A subject proxy is a computer representation of a subject that can be implemented as an object, must have an identity, and must be addressable (this point provides the URI connector to RDF). Each contributing schema thus defines its own subjects, with the mappings becoming meta-objects. These, in turn, would benefit from having some accepted subject reference schema (not specifically addressed by the proponents) to reduce the breadth of the ultimate mapped proxy “space.”
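As a rough illustration only, the proxy idea might be sketched in Python as follows. This is my own simplification and not the data model of the Topic Maps Reference Model: the class names, field names, `merge` function and example URIs are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubjectProxy:
    """Stand-in for a subject as one contributing schema sees it."""
    uri: str     # identity and address in one (the connector to RDF)
    schema: str  # which contributing schema defined this proxy
    label: str


@dataclass(frozen=True)
class ProxyMapping:
    """A meta-object asserting that two proxies stand for the same subject."""
    left: SubjectProxy
    right: SubjectProxy


def merge(proxies, mappings):
    """Group proxies that are mapped to one another, directly or indirectly."""
    groups = {p: {p} for p in proxies}
    for m in mappings:
        combined = groups[m.left] | groups[m.right]
        for p in combined:
            groups[p] = combined
    unique = []
    for g in groups.values():
        if g not in unique:
            unique.append(g)
    return unique


# Hypothetical proxies from two contributing schemas.
a = SubjectProxy("http://example.org/library/cats", "library", "Cats")
b = SubjectProxy("http://example.org/blog/tag/cat", "blog", "cat")
c = SubjectProxy("http://example.org/blog/tag/dog", "blog", "dog")

merged = merge([a, b, c], [ProxyMapping(a, b)])
```

Note that neither source schema is altered: the mapping lives alongside the proxies as its own object, which is what keeps the approach federated.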
I don’t have the expertise to judge further the specifics, but I find the presentation and papers by Park and Durusau, Avoiding Hobson’s Choice In Choosing An Ontology and Towards Subject-centric Merging of Ontologies to be worthwhile reading in any case. I highly recommend these papers for further background and clarity.
As the third example, “binding layers” are a comparatively newer concept. Leading upper-level ontologies such as SUMO or PROTON propose their own binding protocols to their “lower” domains, but that approach takes place within the construct of the parent upper ontology and language. Such designs are not yet generalized solutions. By far the most promising generalized binding solution is the SKOS (Simple Knowledge Organization System). Because of its importance, the next section is devoted to it.
Finally, with respect to federated approaches, there are quite a few software tools that have been developed to aid or promote some of these specific methods. For example, about twenty of the software applications in my Sweet Tools listing of 500+ semantic Web and -related tools could be interpreted as aiding ontology mapping or creation. You may want to check out some of these specific tools depending on your preferred approach.
SKOS, or the Simple Knowledge Organization System, is a formal language and schema designed to represent such structured information domains as thesauri, classification schemes, taxonomies, subject-heading systems, controlled vocabularies, or others; in short, most all of the “loosely defined” ontology approaches discussed herein. It is a W3C initiative more fully defined in its SKOS Core Guide.
SKOS is built upon the RDF data model of the subject-predicate-object “triple.” The subjects and objects are akin to nouns, the predicate a verb, in a simple Dick-sees-Jane sentence. Subjects and predicates by convention are related to a URI that provides the definitive reference to the item. Objects may be either a URI resource or a literal (in which case it might be some indexed text, an actual image, number to be used in a calculation, etc.).
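To make the triple idea concrete, here is a minimal sketch in plain Python with no RDF library. The `http://example.org/` namespace, the `sees` and `name` predicates and the helper types are all made up for illustration; a real application would more likely rely on an RDF toolkit.

```python
from typing import NamedTuple, Union


class URI(str):
    """A resource identified by a URI (illustrative wrapper around str)."""


class Literal(NamedTuple):
    """A literal value such as a text string or a number."""
    value: Union[str, int, float]


class Triple(NamedTuple):
    subject: URI
    predicate: URI
    obj: Union[URI, Literal]  # objects may be a resource or a literal


EX = "http://example.org/"  # hypothetical namespace, for illustration only

triples = [
    # "Dick sees Jane": the object is another resource.
    Triple(URI(EX + "Dick"), URI(EX + "sees"), URI(EX + "Jane")),
    # "Dick has the name 'Dick'": the object is a literal.
    Triple(URI(EX + "Dick"), URI(EX + "name"), Literal("Dick")),
]


def objects_of(subject, predicate, graph):
    """Return every object attached to a subject by a given predicate."""
    return [t.obj for t in graph
            if t.subject == subject and t.predicate == predicate]


seen = objects_of(EX + "Dick", EX + "sees", triples)
```

Because every statement has the same three-part shape, triples from different sources can simply be appended to one list and queried together, which is much of the appeal of RDF as a common format.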
Being an RDF Schema simply means that SKOS adds some language and defined relationships to this RDF baseline. This is a bit of recursive understanding, since RDFS is itself defined in RDF by virtue of adding some controlled vocabulary and relations. The power, though, is that these schema additions are also easily expressed and referenced.
This RDFS combination can thus be shown as a standard RDF triple graph, but with the addition of the extended vocabulary and relations:
The power of the approach arises from the ability of the triple to express virtually any concept, further extended via the RDFS language defined for SKOS. SKOS includes concepts such as “broader” and “narrower”, which enable hierarchical relations to be modeled, as well as “related” and “member” to support networks and arrays, respectively.
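A small sketch shows how these relations can be queried once they are expressed as triples. The concepts are invented for illustration and do not come from any real thesaurus; the helper functions are my own, and walking `skos:broader` links transitively is an application choice here rather than something the vocabulary itself mandates.

```python
# Each tuple is (subject, predicate, object), using prefixed names for brevity.
# The concepts are invented for illustration, not taken from a real thesaurus.
graph = [
    ("ex:Mammals", "skos:broader", "ex:Animals"),
    ("ex:Cats", "skos:broader", "ex:Mammals"),
    ("ex:Cats", "skos:related", "ex:Pets"),
    ("ex:Cats", "skos:altLabel", "Felines"),
]


def broader(concept, graph):
    """Direct broader concepts of `concept`."""
    return [o for s, p, o in graph if s == concept and p == "skos:broader"]


def narrower(concept, graph):
    """Direct narrower concepts, derived as the inverse of skos:broader."""
    return [s for s, p, o in graph if o == concept and p == "skos:broader"]


def all_broader(concept, graph):
    """Walk skos:broader links upward, collecting every ancestor once."""
    found, stack = [], list(broader(concept, graph))
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(broader(current, graph))
    return found


ancestors = all_broader("ex:Cats", graph)
```

The hierarchy falls out of the `skos:broader` statements alone, while `skos:related` and `skos:altLabel` sit in the same graph without disturbing it.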
We can visualize this transforming power by looking at how an “ontology” in a totally foreign scheme can be related to the canonical SKOS scheme. In the figure below the left-hand portion shows the native hierarchical taxonomy structure of the UK Archival Thesaurus (UKAT), next as converted to SKOS on the right (with the overlap of categories shown in dark purple). Note the hierarchical relationships visualize better via a taxonomy, but that the RDF graph model used by SKOS allows a richer set of additional relationships including related and alternative names:
SKOS also has a rich set of annotation and labeling properties to enhance human readability of schema developed in it. There is also a useful draft schema that the W3C’s SWEO (Semantic Web Education and Outreach) group is developing to organize semantic Web-related information.
Combined, these constructs provide powerful mechanisms for giving contributory ontologies a common conceptualization. When added to other sibling RDF schema such as FOAF or SIOC or DOAP, still additional concepts can be collated.
While not addressed directly in this piece, it is obviously of first importance to have content with structure before the questions of connecting that information can even arise. Then, that structure must also be available in a form suitable for merging or connection.
At that point, the subjects of this posting come into play.
We see that the daily Web has a diversity of schema or ontologies “loosely defined” for representing the structure of the content. These representations can be transferred to more complex schema, but not in the opposite direction. Moreover, the semantic basis for how to make these mappings also needs some common referents.
RDF provides the canonical data model for the data transfers and representations. RDFS, especially in the form of SKOS, appears to form one basis for the syntax and language for these transformations. And SKOS, with other schema, also appears to offer much of the appropriate “middle ground” for data relationships mapping.
However, lacking in this story is a referential structure for subject relationships. (Also lacking are the ultimately critical domain specifics required for actual implementation.)
Abstract concepts of interest to philosophers and deep thinkers have been given much attention. Sadly, to date, concrete subject structures in which tangible things and tangible actions can be shared are still very, very weak. We are stubbing our toes on the rocks while we gaze at the heavens.
Yet, despite this, simple and powerful infrastructures are well in hand to address all foreseeable syntactic and semantic issues. There appear to be no substantive limits to needed next steps.
 Deborah L. McGuinness. “Ontologies Come of Age”. In Dieter Fensel, Jim Hendler, Henry Lieberman, and Wolfgang Wahlster, editors. Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential. MIT Press, 2003. See http://www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-mit-press-(with-citation).htm
 I think it would be much clearer to refer to “upper level” ontologies as abstract or conceptual, “mid levels” as mapping or binding, and “lower levels” as domain (without any hierarchical distinctions such as lower or lowest or sub-domain), but current practice is probably too entrenched to change now.
 There are many aspects that make PROTON one of the more attractive reference ontologies. The PROTON ontology (PROTo ONtology), developed within the scope of the SEKT project, is attractive because of its understandability, relatively small size, modular architecture and a simple subsumption hierarchy. It is available in an OWL Lite form and is easy to adopt and extend. On the face of it, the Topic class within PROTON, which is meant to serve as a bridge between different ontologies, may also provide a binding layer to specific subject topics as sub-classes or class instances.
 See my earlier post on Cyc.
 Even with such clout, it is questionable whether anything close to complete adherence can be achieved, as Ada showed within the Federal government. However, where circumstances allow it, central schema and ontologies may be worth pursuing because of improved interoperability and lower costs, even where some portions do not adhere and are more chaotic like the standard Web.
 See, A Survey of RDF/Topic Maps Interoperability Proposals, W3C Working Group Note 10 February 2006, Pepper, Vitali, Garshol, Gessa, Presutti (eds.)
 See, Guidelines for RDF/Topic Maps Interoperability, W3C Editor’s Draft 30 June 2006, Pepper, Presutti, Garshol, Vitali (eds.)
 Here are some Sweet Tools that may have a usefulness to ontology federation and creation:
 The SWEO classification ontology is still under active development and has these draft classes. Note, however, the relative lack of actual subject or topic matter:
Classes are currently defined as:
If the page describes an organization, it can be tagged as:
If the page is a person’s home page or blog or similar, it could be:
The type of audience can also be tagged, for example:
 The OASIS Topic Maps Published Subjects Technical Committee was formed a number of years back to promote Topic Maps interoperability through the use of Published Subjects Indicators (PSIs). Their resulting report was a very interesting effort that unfortunately did not lead to wide adoption, perhaps because the effort was a bit ahead of its time or it was in advance of the broader acceptance of RDF. This general topic is the subject of a later posting by me.
 See further, Leo Obrst, “The Semantic Spectrum & Semantic Models,” a PowerPoint presentation (http://ontolog.cim3.net/file/resource/presentation/LeoObrst_20060112/OntologySpectrumSemanticModels–LeoObrst_20060112.ppt) made as part of an Ontolog Forum (http://ontolog.cim3.net/) presentation in two parts, “What is an Ontology? – A Briefing on the Range of Semantic Models” (see http://ontolog.cim3.net/cgi-bin/wiki.pl?ConferenceCall_2006_01_12), in January 2006. Leo Obrst is a principal artificial intelligence scientist at MITRE’s (http://www.mitre.org) Center for Innovative Computing and Informatics and a co-convener of the Ontolog Forum. His presentation is a rich source of practical overview information on ontologies.
 The actual diagram is an unattributed modification by Dan McCreary (see http://www.danmccreary.com/presentations/sem_int/sem_int.ppt) based on Obrst’s material in .
You may have observed some changes to my masthead thingeys. I have added a couple of new icons (stage right, upper quadrant) to (hopefully) more prominently display some important stuff. These icons are for some standard RDF ontologies that are becoming prevalent, FOAF (Friend of a Friend) and SIOC (Semantically Interlinked Online Communities, pronounced “shock”).
The truth is, today, these things are largely for the “insiders” working on semantic Web stuff. But the truth is also, tomorrow (and I mean that literally), you should know about these things and possibly adopt them for your own site.
You have likely noticed that I have repeatedly used the acronyms of FOAF, SIOC, DOAP and SKOS (among many others!) in my recent postings. What is interesting about this alphabet soup is that they represent sort of “standard” ways to discuss people, communities, projects and “ontologies” within the semantic Web community. So, the first observation is that it is useful to have “standard” ways to describe anything.
Each of these standards is an RDF (Resource Description Framework) ontology or, perhaps in a more understood way, a common vocabulary and world view about how to discuss something. (By the way, don’t get all confused in such discussions with XML; XML is but one way of how you might describe something — which may or may not apply to any given semantic Web concept — versus what you are actually describing.)
Just as Wikipedia or Google have emerged as the “standards” within their domains, these RDF ontologies may or may not survive the brutal survival of the fittest within their own domains. So, I’m not asserting whether any of these formats is going to make it. But I am asserting that such formats are part of the emerging structured Web.
In fact, in one back story to the recently completed WWW 2007 conference in Banff, Alberta (a rousing success by all accounts!), Tim Berners-Lee noted he only wanted to have his photo taken with those who already had a FOAF profile!
A typical concern about the semantic Web is, “Who pays the tax?” In other words, to get the advantage of the metadata and characterizations of content that allows retrieval and manipulation at the object level (as opposed to the document or page level), why do it? Who benefits? Why incur the effort and cost?
We all now understand the benefits of an interlinked document world. We see it many times daily; indeed, it is now such a part of our cultural and daily life as to go unquestioned. Wow! How could that have happened in a mere decade?
The same thing is true for these semantic Web and RDF ontology constructs. We need to keep twirling the stick until enough friction takes place to cause the fire’s spark.
So, anyone interested in this next phase of the Web (that is, the move to structure and linked data) needs to eat their greens. Even a small percentage of adopters, drawn from very large numbers of users and amplified by network effects, will cause this fire to grow hot. The real point behind such efforts, of course, is that if no one listens it is unimportant; but if many listen, importance grows as the function of this network effect.
Of course, over time, the mechanisms to create the structure upon which the semantic Web works will occur automatically and in the background. (Just as today many no longer hand-code the HTML in their Web pages.) But learning about the general structure of RDF and spending some time hand-coding your own FOAF profile is a worthwhile start to good semantic nutrition.
Do you remember your first attempt to learn HTML? What was it like to learn how to effectively query a search engine? What are the other myriad ways we have learned and adapted to this (now) constant Web environment?
So, assuming you want to get a bit exposed and expose your own Web site to these standards, what do you do next? Not to be definitive, but here are some approaches solely from my own standpoint as a blogger who uses WordPress. Other leading blog and CMS software have slightly different requirements; just search on your brand in combination with one of the alphabet soup acronyms.
FOAF is kind of fun. FOAF is an RDF vocabulary for describing people, organizations, and relations among them (see here for specification). My own observation is that it is less used for the friend part and more as a self-description.
Conceptually, FOAF is simple: Who are you, and who are your friends? Yet it is broadly extensible to include any variety or details in terms of personal, family, work history, likes, wants and preferences (via the addition of more namespaces). And, unfortunately, it is also generally pretty crappy in how it is applied and installed. So, if you truly wanted to get your picture taken with TimBL, what the heck should you do?
First, learn about FOAF itself. Christopher Allen in his Life with Alacrity blog wrote a good introduction to FOAF about three years ago. He explains online FOAF services to help you craft your first foaf.rdf file, validation services, and other nuances.
Next, you will need to publicize the availability of your FOAF profile. Perhaps the best known option for WordPress is the FOAF Output Plugin from . This is the most full-featured option, but it is also abstract and difficult to follow from an installation and configuration standpoint. (Pretty common for this stuff.) I personally chose the FOAF Header plug-in by Ben Hyde (with MIT’s Simile program), which was simple and direct, and it works.
Once you’ve got a basic profile working, it is then time to really tweak your profile according to your needs and desires. To really configure your profile, consult the FOAF vocabulary. Note that not all FOAF statements are stable. (That is, third-party apps may not parse or support the statements.) Please also note that FOAF has no mechanism to restrict or limit information to different recipients. Only put stuff in your profile that you want to be public.
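If you want a feel for what "hand-coding your own FOAF profile" amounts to, here is a minimal sketch that builds a bare-bones profile as RDF/XML using only the Python standard library. The function name `build_foaf`, the person names and the `example.org` homepage are placeholders of mine, not anything from the FOAF tools mentioned above. The only FOAF terms used are `foaf:Person`, `foaf:name`, `foaf:homepage` and `foaf:knows`.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF_NS = "http://xmlns.com/foaf/0.1/"


def build_foaf(name, homepage, friends=()):
    """Build a minimal FOAF profile and return it as RDF/XML text."""
    # Register prefixes so the output uses rdf: and foaf: rather than ns0:, ns1:
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("foaf", FOAF_NS)

    root = ET.Element(f"{{{RDF_NS}}}RDF")
    me = ET.SubElement(root, f"{{{FOAF_NS}}}Person")
    ET.SubElement(me, f"{{{FOAF_NS}}}name").text = name
    # The homepage is a resource (a URI), not a text literal
    ET.SubElement(me, f"{{{FOAF_NS}}}homepage",
                  {f"{{{RDF_NS}}}resource": homepage})

    # Each friend is a nested foaf:Person reached through foaf:knows
    for friend_name in friends:
        knows = ET.SubElement(me, f"{{{FOAF_NS}}}knows")
        friend = ET.SubElement(knows, f"{{{FOAF_NS}}}Person")
        ET.SubElement(friend, f"{{{FOAF_NS}}}name").text = friend_name

    return ET.tostring(root, encoding="unicode")


# Placeholder data; substitute your own details.
profile = build_foaf("Jane Example", "http://example.org/", ["John Sample"])
```

Saving the resulting text as `foaf.rdf` gives you something to run through a validator before adding more properties. Remember the caution above: whatever goes into the file is public.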
SIOC provides methods for interconnecting discussion methods such as blogs, forums and mailing lists to each other (see here for specification). There is a partial compilation of SIOC-enabled sites on the ESW wiki.
Uldis Bojārs (nick CaptSolo) is doing a remarkable job across the board on these RDF standard ontology issues. First, he has written the definitive browser for these protocols, the SIOC browser. Second, he has contributed with others in the SIOC community to create general SIOC exporters for aggregators, DotClear, Drupal, mailing lists, a PHP API, phpBB and WordPress. It is the latter version that I use on my own blog (the installation of which I initially documented last August).
Third, Uldis has written the Firefox add-on Semantic Radar. Semantic Radar inspects Web pages and detects links to semantic Web metadata from SIOC, FOAF or DOAP. When any of these forms are detected, its accompanying icon displays in the Firefox status bar and, if clicked, then displays the profile record in the SIOC browser. Very cool. (And, oh, BTW, Uldis is also one of the most helpful people around!)
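For the curious, the kind of detection described here can be sketched in a few lines. This is not Semantic Radar's actual code. It is a rough illustration that assumes a page advertises its metadata through autodiscovery `<link>` elements of type `application/rdf+xml` in the page head, with a `title` such as FOAF or SIOC. The class and function names and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class MetadataLinkFinder(HTMLParser):
    """Collect <link> elements that advertise RDF metadata."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (title, href) pairs

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = {key: (value or "") for key, value in attrs}
        rel = a.get("rel", "").lower().split()
        is_rdf = a.get("type", "").lower() == "application/rdf+xml"
        # Assumption: metadata links use rel="meta" or rel="alternate"
        if is_rdf and ("meta" in rel or "alternate" in rel):
            self.found.append((a.get("title", ""), a.get("href", "")))


def find_metadata_links(html):
    """Return (title, href) pairs for RDF autodiscovery links in the page."""
    finder = MetadataLinkFinder()
    finder.feed(html)
    return finder.found


# Hypothetical page head for illustration
sample = """<html><head>
<link rel="stylesheet" type="text/css" href="style.css" />
<link rel="meta" type="application/rdf+xml" title="FOAF" href="http://example.org/foaf.rdf" />
<link rel="meta" type="application/rdf+xml" title="SIOC" href="http://example.org/sioc.rdf" />
</head><body></body></html>"""

links = find_metadata_links(sample)
```

The point of the sketch is simply that publishing is cheap: one extra line in your page header is enough for a tool to find your profile.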
Another cool option of Semantic Radar is that it pings Ping the Semantic Web (PTSW) when it detects a compliant site. PTSW is itself a highly useful aggregator of the semantic Web community and is one of Frédérick Giasson’s innovations exploiting interlinked data. This is a good site to monitor RDF publishing on the Web and new instances of sites that comply with the alphabet standards.
So, with the addition of the icons above, I’m now eating my greens! And, you know what, they’re both fun and nutritious.
I will continue to add to and improve my various online profiles. In the immediate future, I also plan to add DOAP and SKOS characterizations of my site and work. Now, back to the hand coding . . . .
 If one looks to early bloggers or early podcasters or whatever, the truth is that first users tend to get more traffic and attention. If you are a new blogger today, for example, the sad truth is that your ability to find a large audience is much reduced. Of course, that statement does not mean that you cannot find a large audience (if that is your objective, and I don’t mean to suggest the only meaningful one either!), just that the likelihood is lower today than yesterday.
EOL is meant to be the Wikipedia of all 1.8 million known living species, backed with real money and real prestige. The effort continues a growing list of impressive and open data sources being compiled and presented on the Web.
The Encyclopedia of Life is a planned 10-year effort to create a free Internet resource to catalog and describe every one of the planet’s organisms. Initially seeded with $12.5 million in backing from two US philanthropies, the John D. and Catherine T. MacArthur Foundation and the Alfred P. Sloan Foundation, the effort is anticipated to cost $100 million by completion.
EOL is based on an idea by Harvard biologist and noted author Edward O. Wilson, a 2007 TED winner for the idea (as is better seen and explained in his acceptance video). Other initial project backers are Harvard University, the Smithsonian Institution, Missouri Botanical Garden, the Biodiversity Library Project with its international participants, and the Chicago Field Museum of Natural History.
Each species will get its own Web page with a structured record, similar to the “infoboxes” within Wikipedia. Information will include photos, technical name and phylogeny, lay description, range maps, and place within the Tree of Life. More prominent species will also get information on their genetics and behavior and compiled scientific research articles, some from centuries ago.
EOL is two and one-half orders of magnitude more ambitious than the earlier Tree of Life Web Project from the University of Arizona with its 5000 pages, and 20 times more ambitious than Wikipedia’s own Wikispecies.
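For those who like to check the arithmetic, the "two and one-half orders of magnitude" claim follows directly from the two figures quoted. The short sketch below does the calculation. The last line is only my inference: it assumes that "20 times more ambitious" is measured by page count, which the comparison does not state.

```python
import math

eol_species = 1_800_000  # known living species, as quoted above
tol_pages = 5_000        # Tree of Life Web Project pages, as quoted above

ratio = eol_species / tol_pages           # how many times larger EOL's scope is
orders_of_magnitude = math.log10(ratio)   # powers of ten in that ratio

# Assumption: "20 times" refers to page count, implying this Wikispecies size
implied_wikispecies_pages = eol_species / 20
```

The ratio works out to 360, and its base-10 logarithm is a little over 2.5, which is where the "two and one-half orders" comes from.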
Potential applications range from conservation to mapping to identifying the odd plant or animal species. But, as has been found with important predecessor data sets such as GenBank (65 billion genetic sequences for 100,000 organisms), FishBase (30,000 species) or Conabio (biodiversity in Mexico), among literally hundreds of others, popularity and use exceed wildest expectations. Conservationists, researchers, school children, amateur naturalists, and outdoor lovers will all find use for the site.
EOL is but the newest poster child of massive and hugely important datasets being exposed on the Internet. One initiative with the sole purpose to promote this trend and the interoperability of the data utilizing RDF is the Linking Open Data project. The LOD project is part of the W3C’s Semantic Web Education and Outreach interest group (yes, the unwieldy, SWEO-IG).
The LOD project is holding its first meetings ever this week at WWW2007 in Banff, Alberta. While only a mere six months old, the SWEO-IG is stimulating much broader interest in the semantic Web through sponsored products, FAQs (also announced yesterday and got really digged!), use cases, supporting information, and growing and useful compilations such as datasets and semantic Web tools.
Anyone interested in the semantic Web should really be familiar with the ESW wiki (which unfortunately is in need of re-organization and clean-up; but keep poking, there are many riches hidden in the nooks and crannies!). You may also want to track the public mailing list of the Linking Open Data group.
I already wrote extensively on the exciting DBpedia initiative (Did you Blink? The Structured Web Just Arrived), itself one of the catalytic datasets at the core of the Linking Open Data efforts. Other datasets actively being pursued for inclusion by the group include:
Additional candidate datasets of interest have also been identified by the SWEO interest group on this page:
The SWEO-IG and its Linking Open Data initiative are catalyzing a new phase of excitement with relevant information, data and demos.
If anything, all I can say is: Set your sights higher! There’s data galore that needs to be “RDFized”.
So, folks, keep up the great work! And good luck with this week’s meetings in the beauty of Canada.
[BTW, please make sure that EOL makes its data available in RDF!]
It is a “smash-up” of Java applets and next generation web concepts. You can create galleries, edit photos, rotate and create that cool 3D rotating photo cube effect we’ve been seeing lately, you name it. Jasper’s short online demo of Iris is really cool, too! (I think the live demo is to be presented in Jasper’s talk at JavaOne tomorrow.)
Hey! Shag me baby!
BTW, thanks to Henry Story for the link to this (he is always sniffing out the good Java stuff).
Colorado Interstate Construction – 1970
Courtesy National Archives
NOTE: I was pleased when Jim Hendler asked me to pen some thoughts on the semantic Web (as a vehement moderate, I always use the mixed case). I think both of us hoped that my background combining Internet business with Web science might bring a pragmatic perspective to the subject. The material below is to appear as a guest editorial in IEEE Intelligent Systems in the May/June issue. I thank the magazine for allowing early releases by authors, and Linda World, its super senior editor, for cleaning up my language. There are some other differences due to length considerations. MKB
For a dozen years, my career has been centered on Internet search, dynamic content and the deep Web. For the past few years, I have been somewhat obsessed by two topics. The first topic, a conviction really, is that implicit structure needs to be extracted from Web content to enable it to be disambiguated, organized, shared and re-purposed. The second topic, more an open question as a former academic married to a professor, is what might replace editorial selections and peer review to establish the authoritativeness of content. These topics naturally steer one to the semantic Web.
A Millennial Perspective
The semantic Web, by whatever name it comes to be called, is an inevitability. History tells us that as information content grows, so do the mechanisms for organizing and managing it. Over human history, innovations such as writing systems, alphabetization, pagination, tables of contents, indexes, concordances, reference look-ups, classification systems, tables, figures, and statistics have emerged in parallel with content growth.
When the Lycos search engine, one of the first profitable Internet ventures, was publicly released in 1994, it indexed a mere 54,000 pages. When Google wowed us with its page-ranking algorithm in 1998, it soon replaced my then favorite search engine, AltaVista. Now, tens of billions of indexed documents later, I often find Google’s results to be overwhelming dross, which is unfortunately true again for all of the major search engines. Faceted browsing, vertical search, and Web 2.0’s tagging and folksonomies demonstrate humanity’s natural penchant to fight this entropy, efforts that will next continue with the semantic Web and then mechanisms unforeseen to manage the chaos of burgeoning content.
An awful lot of hot air has been expelled over the false dichotomy of whether the semantic Web will fail or is on the verge of nirvana. Arguments extend from the epistemological versus ontological (classically defined) to Web 3.0 versus SemWeb or Web services (WS*) versus REST (Representational State Transfer). My RSS feed reader points to at least one such dust up every week.
Some set the difficulties of resolving semantic heterogeneities as absolutes, leading to an illogical and false rejection of semantic Web objectives. In contrast, some advocates set equally divisive arguments for semantic Web purity by insisting on formal ontologies and description logics. Meanwhile, studied leaks about "stealth" semantic Web ventures mean you should grab your wallet while simultaneously shaking your head.
A Decades-Long Perspective
My mental image of the semantic Web is a road from here to some achievable destination–say, Detroit. Parts of the road are well paved; indeed, portions are already superhighways with controlled on-ramps and off-ramps. Other portions are two lanes, some with way too many traffic lights and some with dangerous intersections. A few small portions remain unpaved gravel and rough going.
Wreck in Nebraska during the 1919 Transcontinental Motor Convoy
Courtesy National Archives
A lack of perspective makes things appear either too close or too far away. The automobile isn't yet a century old as a mass-produced item. It wasn't until 1919 that the US Army sent its first Transcontinental Motor Convoy across the United States. The 3,200-mile route roughly followed today's Lincoln Highway, US 30, from Washington, D.C. to San Francisco. The convoy took 62 days and 250 recorded accidents to complete the trip (see figure), half on dirt roads at an average speed of 6 miles per hour. A tank officer on that trip later observed Germany's autobahns during World War II. When he subsequently became President Dwight D. Eisenhower, he proposed and then signed the Interstate Highway Act. That was 50 years ago. Today, the US is crisscrossed with 50,000 miles of interstates, which have completely remade the nation's economy and culture.
Like the interstate system in its early years, today's semantic Web lets you link together a complete trip, but the going isn't as smooth or as fast as it could be. Nevertheless, making the trip is doable and keeps improving day by day, month by month.
My view of what's required to smooth the road begins with extracting structure and meaningful information according to understandable schema from mostly uncharacterized content. Then we store the now-structured content as RDF triples that can be further managed and manipulated at scale. By necessity, the journey embraces tools and requirements that, individually, might not constitute semantic Web technology as some strictly define it. These tools and requirements are nonetheless integral to reaching the destination. We are well into that journey's first leg, what I and others are calling the structured Web.
For the past six months or so I have been researching and assembling as many semantic Web and related tools as I can find. That Sweet Tools listing now exceeds 500 tools (with its presentation using the nifty lightweight Exhibit publication system from MIT’s Simile program). I've come to understand the importance of many ancillary tool sets to the entire semantic Web highway, such as natural language processing and information extraction. I've also found new categories of pragmatic tools that embody semantic Web and data mediation processes but don't label themselves as such.
In its entirety, the Sweet Tools listing provides a pretty good picture of the semantic Web's state. It's a surprisingly robust picture — though with some notable potholes — and includes impressive open source options in all categories. Content publishing, indexing, and retrieval at massive scales are largely solved problems. We also have the infrastructure, languages, and (yes!) standards for tying this content together meaningfully at the data and object levels.
I also think a degree of consensus has emerged on RDF as the canonical data model for semantic information. RDF triple stores are rapidly improving toward industrial strength, and RESTful designs enable massive scalability, as terabyte- and petabyte-scale full-text indexes prove.
Powerful and flexible middleware options, such as those from OpenLink, can transform and integrate diverse file formats with a variety of back ends. The World Wide Web Consortium's GRDDL standard and related tools, plus various "RDF-izers" from Massachusetts Institute of Technology and elsewhere, largely provide the conversion infrastructure for getting Web data into that canonical RDF form. Sure, some of these converters are still research-grade, but getting them to operational capabilities at scale now appears trivial.
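As a toy illustration of what such a converter does at its core, the sketch below turns a flat record into N-Triples lines. It is not how GRDDL or any of the named RDF-izers work internally. The function names, the `example.org` subject and vocabulary URIs, and the rule that values starting with `http://` or `https://` become resources are all assumptions of mine for the sake of the example.

```python
def escape_literal(text):
    """Escape a string for use inside an N-Triples plain literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def record_to_ntriples(subject_uri, record, vocab):
    """Turn a flat dict of strings into N-Triples lines, one per field.

    Values that look like http(s) URIs are emitted as resources;
    everything else is emitted as a plain literal.
    """
    lines = []
    for field, value in record.items():
        predicate = vocab + field
        if value.startswith(("http://", "https://")):
            obj = f"<{value}>"
        else:
            obj = f'"{escape_literal(value)}"'
        lines.append(f"<{subject_uri}> <{predicate}> {obj} .")
    return lines


# Hypothetical record and vocabulary for illustration
record = {
    "title": 'The "Structured" Web',
    "homepage": "http://example.org/post/1",
}
lines = record_to_ntriples("http://example.org/item/1", record,
                           "http://example.org/vocab#")
```

Real converters spend most of their effort on the part this sketch skips, namely deciding which vocabulary terms the source fields should map to.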
Things start getting shakier when trying to structure information into a semantic formalism. Controlled vocabularies and ontologies range broadly and remain a contentious area. Publishers and authors perhaps have too many choices: from straight Atom or RSS feeds and feeds with tags to informal folksonomies and then Outline Processor Markup Language or microformats. From there, the formalism increases further to include the standard RDF ontologies such as SIOC (Semantically-Interlinked Online Communities), SKOS (Simple Knowledge Organization System), DOAP (Description of a Project), and FOAF (Friend of a Friend) and the still greater formalism of OWL's various dialects.
Arguing which of these is the theoretical best method is doomed to failure, except possibly in a bounded enterprise environment. We live in the real world, where multiple options will always have their advocates and their applications. All of us should welcome whatever structure we can add to our information base, no matter where it comes from or how it's done. The sooner we can embrace content in any of these formats and convert it into canonical RDF form, the sooner we can move on to needed developments in semantic mediation, some of the roughest road on the journey.
Potholes on the Semantic Highway
Semantic mediation requires appropriate structured content. Many potholes on the road to the semantic Web exist because the content lacks structured markup; others arise because existing structure requires transformation. We need improved ways to address both problems. We also need more intuitive means for applying schema to structure. Some have referred to these issues as “who pays the tax.”
Recent experience with social software and collaboration proves that a portion of the Internet user community is willing to tag and characterize content. Furthermore, we can readily leverage that resulting structure, and free riders are welcomed. The real pothole is the lack of easy–even fun–data extractors and "structurizers." But we're tantalizingly close.
Tools such as Solvent and Sifter from MIT's Simile program and Marmite from Carnegie Mellon University are showing the way to match DOM (document object model) inspectors with automated structure extractors. DBpedia, the alpha version of Freebase, and System One now provide large-scale, open Web data sets in RDF, including all of Wikipedia. Browser extensions such as Zotero are showing how to integrate structure management into acceptable user interfaces, as are services such as ZoomInfo. Yet we still lack easy means to design the differing structures suitable for a plenitude of destinations.
Amazingly, a compelling road map for how all these pieces could truly fit together is also incomplete. How do we actually get from here to Detroit? Within specific components, architectural understandings are sometimes OK (although documentation is usually awful for open source projects, as most of the current tools are). Until our community better documents that vision, attracting new contributors will be needlessly slower, thus delaying the benefits of network effects.
So, let's create a road map and get on with paving the gaps and filling the potholes. It's not a matter of standards or technology; we have those in abundance. Let's stop the silly squabbles and commit to the journey in earnest. The structured Web's ability to reach Hyperland, Douglas Adams's prescient 1990 forecast of the semantic Web, now looks to be no further away than Detroit.
 Chris Sherman, “Happy Birthday, Lycos!,” Search Engine Watch, August 14, 2002. See http://searchenginewatch.com/showPage.html?page=2160551.
 David A. Pfeiffer, “Ike’s Interstates at 50: Anniversary of the Highway System Recalls Eisenhower’s Role as Catalyst,” Prologue Magazine, National Archives, Summer 2006, Vol. 38, No. 2. See: http://www.archives.gov/publications/prologue/2006/summer/interstates.html.
 Sweet Tools (SemWeb) listing; see http://www.mkbergman.com/?page_id=325 .
 See http://simile.mit.edu/exhibit/.
 OpenLink Software’s Virtuoso and Data Spaces products; see http://www.openlinksw.com/.
 W3C’s Gleaning Resource Descriptions from Dialects of Languages (GRDDL, pronounced “griddle”). See http://www.w3.org/2004/01/rdxh/spec.
 See http://simile.mit.edu/wiki/RDFizers.
 Outline Processor Markup Language (OPML); see http://www.opml.org/.
 Microformats; see http://microformats.org/.
 W3C’s Web Ontology Language (OWL). See http://www.w3.org/TR/owl-features/.
 Marmite (http://www.cs.cmu.edu/~jasonh/projects/marmite/) is from Carnegie Mellon University.
 DBpedia (http://dbpedia.org/docs/) and Freebase (in alpha, by invitation only at http://www.freebase.com/) are two of the first large-scale open datasets on the Web; Wikipedia has also been converted to RDF by System One (http://labs.systemone.at/wikipedia3).
 Zotero is produced by George Mason University’s Center for History and New Media; see http://www.zotero.org.
 ZoomInfo (http://www.zoominfo.com/) provides online structured search of companies and people, plus broader services to enterprises.
 The late Douglas Adams, of Doctor Who and The Hitchhiker's Guide to the Galaxy fame, produced a TV program for BBC2 presaging the Internet called Hyperland. This 50-min video can be seen in five parts via YouTube at Part 1 of 5, 2 of 5, 3 of 5, 4 of 5 and 5 of 5.