Posted:January 31, 2011

Making Connections Real

Refining UMBEL’s Linking and Mapping Predicates with Wikipedia

We are only days away from releasing the first commercial version 1.00 of UMBEL (Upper Mapping and Binding Exchange Layer) [1]. To recap, UMBEL has two purposes, both aimed to promote the interoperability of Web-accessible content. First, it provides a general vocabulary of classes and predicates for describing domain ontologies and external datasets. Second, UMBEL is a coherent framework of 28,000 broad subjects and topics (the “reference concepts”), which can act as binding nodes for mapping relevant content.

This last iteration of development has focused on the real-world test of mapping UMBEL to Wikipedia [2]. The result, to be more fully described upon release, has led to two major changes. It has acted to expand the size of the core UMBEL reference concepts to about 28,000. And it has led to adding to and refining the mapping predicates necessary for UMBEL to fulfill its purpose as a reference structure for external resources. This latter change is the focus of this post.

There is a huge diversity of organizational structure and world views on the Web; the linking and mapping predicates to fulfill this purpose must also capture that diversity. Relations between things on the Web can range from the exact and identity, to the approximate, descriptive and casual [3]. The 16 K direct mappings that have now been made between UMBEL and Wikipedia (resulting in the linkage of more than 2 million Wikipedia pages) provide a real-world test for how to capture this diversity. The need is to find the range of predicates that can reflect and capture quality, accurate mappings. Further, because mappings also can be aided with a variety of techniques from the manual to the automatic, it is important to characterize the specific mapping methods used whenever a linking predicate is assigned. Such qualifications can help to distinguish mapping trustworthiness, plus enable later segregation for the application of improved methods as they may arise.

As a result, the UMBEL Vocabulary now has a pretty well vetted and diverse set of linking and mapping predicates. Guidelines for how these differ, how they are used, and how they are qualified is described next.

A Comparison of Mapping Predicates

Properties for linking and mapping need to differ more than in name or intended use. They must represent differences that affect inferences and reasoners, and can be acted upon by specific utilities via user interfaces and other applications. Furthermore, the diversity of mapping predicates should capture the types of diverse mappings and linkages possible between disparate sources.

Sometimes things are individuals or instances; other times they are classes or groupings of similar things. Sometimes things are of the same kind, but not exactly aligned. Sometimes things are unlike, but related in a common way. (Everything in Britain, for example, is a British “thing” even though they may be as different as trees, dead kings or cathedrals.) Sometimes we want to say something about a thing, such as an animal’s fur color or age, as a way to further characterize it, and so on.

The OWL 2 language and existing semantic Web languages give us some tools and existing vocabulary to capture some of this diversity. How these options, plus new predicates defined for UMBEL’s purposes, compare is shown by this table:

Property	Relative Strength	Usage	Standard Reasoner?	Inverse Property?	Kind of Thing		Symmetrical?	Transitive?	Reflexive?
Property	Relative Strength	Usage	Standard Reasoner?	Inverse Property?	It is	It Relates to	Symmetrical?	Transitive?	Reflexive?
`owl:equivalentClass`	10	equivalence	X	N/A	class	class	yes	yes	yes
`owl:sameAs`	9	identity	X	N/A	individual	individual	yes	yes	yes
`rdfs:subClassOf`	8	subset	X		class	class	no	yes	yes
`umbel:correspondsTo`	7	~equivalence	+ / –		anything	RefConcept	yes	yes	yes
`skos:narrowerTransitive`	6	hierarchical		X	skos:Concept	skos:Concept	no	yes	no
`skos:broaderTransitive`	6	hierarchical		X	skos:Concept	skos:Concept	no	yes	no
`rdf:type`	5	membership	X		anything	class	no	no	no
`umbel:isAbout`	4	topical		X	anything	RefConcept	perhaps	not likely	not likely
`umbel:isLike`	3	similarity			anything	anything	yes	no	not likely
`umbel:relatesToXXX`	2	relationship			anything	SuperType	no	no	not likely
`umbel:isCharacteristicOf`	1	attribute		X	anything	RefConcept	no	no	no

I discuss each of these predicates below. But, first, let’s discuss what is in this table and how to interpret it [4].

Relative strength – an arbitrary value that is meant to capture the inferencing power (entailments) embodied in the predicate. Identity (equivalence), class implications, and specific predicate properties that can be acted upon by reasoners are given higher relative power
Standard reasoner? – indicates whether standard reasoners [5] draw inferences and entailments from the specific property. A “+ / -” indication indicates that reasoners do not recognize the specific property per se, but can act upon the predicates (such as symmetric, transitive or reflexive) used to define the predicate
Inverse property? – indicates whether there is an inverse property used within UMBEL that is not listed in the table. In such cases, the predicate shown is the one that treats the external entity as the subject
It is a kind of thing – is the same as domain; it means the kind of thing to which the subject belongs
It relates to a kind on thing – is the same as range; it means the kind of thing to which the object of the subject belongs
Symmetrical? – describes whether the predicate for an s – p – o (subject – predicate – object) relationship can also apply in the o – p – s manner
Transitive? – is whether the predicate interlinks two individuals A and C whenever it interlinks A with B and B with C for some individual B
Reflexive? – By that is meant whether the subject has itself as a member. In a reflexive closure between subject and object the subject is fully included as a member. Equivalence, subset, greater than or equal to, and less than or equal to relationships are reflexive; not equal, less than or greater than relationships are not.

The Usage metric is described for each property below.

Individual Predicates Discussion

To further aid the understanding of these properties, we can also group them into equivalence, membership, approximate or descriptive categories.

Equivalent Properties

Equivalent properties are the most powerful available since they entail all possible axioms between the resources.

owl:equivalentClass

Equivalent class means that two classes have the same members; each is a sub-class of the other. The classes may differ in terms of annotations defined for each of them, but otherwise they are axiomatically equivalent.

An owl:equivalentClass assertion is the most powerful available because of its ability to ‘Explode the Domain‘ [6]. Because of its entailments, owl:equivalentClass should be used with great care.

owl:sameAs

The owl:sameAs assertion claims two instances to be an identical individual. This assertion also carries with it strong entailments of symmetry and reflexivity.

owl:sameAs is often misapplied [7]. Because of its entailments, it too should be used with great care. When there are doubts about claiming this strong relationship, UMBEL has the umbel:isLike alternative (see below).

Membership and Hierarchical Properties

Membership properties assert that an instance is a member of a class.

rdfs:subClassOf

The rdfs:subClassOf asserts that one class is a subset of another class. This assertion is transitive and reflexive. It is a key means for asserting hierarchical or taxonomic structures in an ontology. This assertion also has strong entailments, particularly in the sense of members having consistent general or more specific relationships to one another.

Care must be exercised that full inclusivity of members occurs when asserting this relationship. When correctly asserted, however, this is one of the most powerful means to establish a reasoning structure in an ontology because of its transitivity.

skos:narrowerTransitive/skos:broaderTransitive

Both of these predicates work on skos:Concept (recall that umbel:RefConcept is itself a subClassOf a skos:Concept). The predicates state a hierarchical link between the two concepts that indicates one is in some way more general (“broader”) than the other (“narrower”) or vice versa. The particular application of skos:broaderTransitive (or its complement) is used to infer the transitive closure of the hierarchical links, which can then be used to access direct or indirect hierarchical links between concepts.

The transitive relationship means that there may be intervening concepts between the two stated resources, making the relationship an ancestral one, and not necessarily (though it is possible to be so) a direct parent-child one.

rdf:type

The rdf:type assertion assigns instances (individuals) to a class. While the idea is straightforward, it is important to understand the intensional nature of the target class to ensure that the assignment conforms to the intended class scope. When this determination can not be made, one of the more approximate UMBEL predicates (see below) should be used.

Approximation Properties

For one reason or another, the precise assertions of the equivalent or membership properties above may not be appropriate. For example, we might not know sufficiently an intended class scope, or there might be ambiguity as to the identity of a specific entity (is it Jimmy Johnson the football coach, race car driver, fighter, local plumber or someone else?). Among other options — along a spectrum of relatedness — is the desire to assign a predicate that is meant to represent the same kind of thing, yet without knowing if the relationship is an equivalence (identity, or sameAs), a subset, or merely just a member of relationship. Alternatively, we may recognize that we are dealing with different things, but want to assert a relationship of an uncertain nature.

This section presents the UMBEL alternatives for these different kinds of approximate predicates [4].

umbel:correspondsTo

The most powerful of these approximate predicates in terms of alignment and entailments is the umbel:correspondsTo property. This predicate is the recommended option if, after looking at the source and target knowledge bases [8], we believe we have found the best equivalent relationship, but do not have the information or assurance to assign one of the relationships above. So, while we are sure we are dealing with the same kind of thing, we may not have full confidence to be able to assign one of these alternatives:

   rdfs:subClassOf
   owl:equivalentClass
   owl:sameAs
   superClassOf

Thus, with respect to existing and commonly used predicates, we want an umbrella property that is generally equivalent or so in nature, and if perhaps known precisely might actually encompass one of the above relations, but we don’t have the certainty to choose one of them nor perhaps assert full “sameness”. This is not too dissimilar from the rationale being tested for the x:coref predicate in relation to owl:sameAs from the UMBC Ebiquity group [9,10].

The property umbel:correspondsTo is thus used to assert a close correspondence between an external class, named entity, individual or instance with a Reference Concept class. It asserts this correspondence through the basis of both its subject matter and intended scope.

This property may be reified with the umbel:hasMapping property to describe the “degree” of the assertion.

umbel:isAbout

In most uses, the most prevalent linking property to be used is the umbel:isAbout assertion. This predicate is useful when tagging external content with metadata for alignment with an UMBEL-based reference ontology. The reciprocal assertion, umbel:isRelatedTo is when an assertion within an UMBEL vocabulary is desired to an external ontology. Its application is where the reference vocabulary itself needs to refer to an external topic or concept.

The umbel:isAbout predicate does not have the same level of confidence or “sameness” as the umbel:correspondsTo property. It may also reflect an assertion that is more like rdf:type, but without the confidence of class membership.

The property umbel:isAbout is thus used to assert the relation between an external named entity, individual or instance with a Reference Concept class. It can be interpreted as providing a topical assertion between an individual and a Reference Concept.

This property may be reified with the umbel:hasMapping property to describe the “degree” of the assertion.

umbel:isLike

The property umbel:isLike is used to assert an associative link between similar individuals who may or may not be identical, but are believed to be so. This property is not intended as a general expression of similarity, but rather the likely but uncertain same identity of the two resources being related.

This property may be considered as an alternative to sameAs where there is not a certainty of sameness, and/or when it is desirable to assert a degree of overlap of sameness via the umbel:hasMapping reification predicate. This property can and should be changed if the certainty of the sameness of identity is subsequently determined.

It is appropriate to use this property when there is strong belief the two resources refer to the same individual with the same identity, but that association can not be asserted at the present time with full certitude.

This property may be reified with the umbel:hasMapping property to describe the “degree” of the assertion.

umbel:relatesToXXX

At a different point along this relatedness spectrum we have unlike things that we would like to relate to one another. It might be an attribute, a characteristic or a functional property about something that we care to describe. Further, by nature of the thing we are relating, we may also be able to describe the kind of thing we are relating. The UMBEL SuperTypes (among many other options) gives us one such means to characterize the thing being related.

UMBEL presently has 31 predicates for these assertions relating to a SuperType [11]. The various properties designated by umbel:relatesToXXX are used to assert a relationship between an external instance (object) and a particular (XXX) SuperType. The assertion of this property does not entail class membership with the asserted SuperType. Rather, the assertion may be based on particular attributes or characteristics of the object at hand. For example, a British person might have an umbel:relatesToXXX asserted relation to the SuperType of the geopolitical entity of Britain, though the actual thing at hand (person) is a member of the Person class SuperType.

This predicate is used for filtering or clustering, often within user interfaces. Multiple umbel:relatesToXXX assertions may be made for the same instance.

Each of the 32 UMBEL SuperTypes has a matching predicate for external topic assignments (relatesToOtherOrganism shares two SuperTypes, leading to 31 different predicates):

SuperType	Mapping Predicate	Comments
NaturalPhenomena	`relatesToPhenomenon`	This predicate relates an external entity to the `SuperType` (ST) shown. It indicates there is a relationship to the ST of a verifiable nature, but which is undetermined as to strength or a full rdf:type relationship
NaturalSubstances	`relatesToSubstance`	same as above
Earthscape	`relatesToEarth`	same as above
Extraterrestrial	`relatesToHeavens`	same as above
Prokaryotes	`relatesToOtherOrganism`	same as above
ProtistsFungus	`relatesToOtherOrganism`	same as above
Plants	`relatesToPlant`	same as above
Animals	`relatesToAnimal`	same as above
Diseases	`relatesToDisease`	same as above
PersonTypes	`relatesToPersonType`	same as above
Organizations	`relatesToOrganizationType`	same as above
FinanceEconomy	`relatesToFinanceEconomy`	same as above
Society	`relatesToSociety`	same as above
Activities	`relatesToActivity`	same as above
Events	`relatesToEvent`	same as above
Time	`relatesToTime`	same as above
Products	`relatesToProductType`	same as above
FoodorDrink	`relatesToFoodDrink`	same as above
Drugs	`relatesToDrug`	same as above
Facilities	`relatesToFacility`	same as above
Geopolitical	`relatesToGeoEntity`	same as above
Chemistry	`relatesToChemistry`	same as above
AudioInfo	`relatesToAudioMusic`	same as above
VisualInfo	`relatesToVisualInfo`	same as above
WrittenInfo	`relatesToWrittenInfo`	same as above
StructuredInfo	`relatesToStructuredInfo`	same as above
NotationsReferences	`relatesToNotation`	same as above
Numbers	`relatesToNumbers`	same as above
Attributes	`relatesToAttribute`	same as above
Abstract	`relatesToAbstraction`	same as above
TopicsCategories	`relatesToTopic`	same as above
MarketsIndustries	`relatesToMarketIndustry`	same as above

This property may be reified with the umbel:hasMapping property to describe the “degree” of the assertion.

Descriptive Properties

Descriptive properties are annotation properties.

umbel:isCharacteristicOf

Two annotation properties are used to describe the attribute characteristics of a RefConcept, namely umbel:hasCharacteristic and its reciprocal, umbel:isCharacteristicOf. These properties are the means by which the external properties to describe things are able to be brought in and used as lookup references (that is, metadata) to external data attributes. As annotation properties, they have weak semantics and are used for accounting as opposed to reasoning purposes.

These properties are designed to be used in external ontologies to characterize, describe, or provide attributes for data records associated with a given RefConcept. It is via this property or its inverse, umbel:hasCharacteristic, that external data characterizations may be incorporated and modeled within a domain ontology based on the UMBEL vocabulary.

Qualifying the Mappings

The choice of these mapping predicates may be aided with a variety of techniques from the manual to the automatic. It is thus important to characterize the specific mapping methods used whenever a linking predicate is assigned. Following this best practice allows us to distinguish mapping trustworthiness, plus to also enable later segregation for the application of improved methods as they may arise.

UMBEL, for its current mappings and purposes, has adopted the following controlled vocabulary for characterizing the umbel:hasMapping predicate; such listings may be readily modified for other domains and purposes when using the UMBEL vocabulary. This controlled vocabulary is based on instances of the Qualifier class. This class represents a set of descriptions to indicate the method used when applying an approximate mapping predicate (see above):

Qualifier	Description
Manual – Nearly Equivalent	The two mapped concepts are deemed to be nearly an `equivalentClass` or `sameAs` relationship, but not 100% so
Manual – Similar Sense	The two mapped concepts share much overlap, but are not the exact same sense, such as an action as related to the thing it acts upon
Heuristic – ListOf Basis	Type assignment based on Wikipedia ListOf category; not currently used
Heuristic – Not Specified	Heuristic mapping method applied; script or technique not otherwise specified
External – OpenCyc Mapping	Mapping based on existing OpenCyc assertion
External – DBOntology Mapping	Mapping based on existing DBOntology assertion
External – GeoNames Mapping	Mapping based on existing GeoNames assertion
Automatic – Inspected SV	Mapping based on automatic scoring of concepts using Semantic Vectors, with specific alignment choice based on hand selection
Automatic – Inspected S-Match	Mapping based on automatic scoring of concepts using S-Match, with specific alignment choice based on hand selection; not currently used
Automatic – Not Specified	Mapping based on automatic scoring of concepts using a script or technique not otherwise specified; not currently used

Again, as noted, for other domains and other purposes this listing can be modified at will.

Status of Mappings

Final aspects of these mappings are now undergoing a last round of review. A variety of sources and methods have been applied, to be more fully documented at time of release.

Some of the final specifics and counts may be modified slightly by the time of actual release of UMBEL v 1.00, which should occur in the next week or so. Nonetheless, here are some tentative counts for a select portion of these predicates in the internal draft version:

Item or Predicate	Count
Total UMBEL Reference Concepts	27,917
`owl:equivalentClass` (external OpenCyc, PROTON, DBpedia)	28,618
`umbel:correspondsTo` (direct mappings to Wikipedia)	16,884
`rdf:type`	876,125
`umbel:relatesToXXX (31 variations)`	3,059,023
Unique Wikipedia Pages Mapped	2,130,021

All of these assignments have also been hand inspected and vetted.

Major Progress Towards a Gold Standard

To date, in various steps and in various phases, the inspection of Wikipedia, its categories, and its match with UMBEL has perhaps incurred more than 5,000 hours (or nearly a three person-year equivalence) of expert domain and semantic technology review [12]. As noted, about 60% (16,884 of 27,917) of UMBEL concepts have now been directly mapped to Wikipedia and inspected for accuracy.

Wikipedia provides the most demanding and complete mapping target available for testing the coverage of UMBEL’s reference concepts and the adequacy of its vocabulary. As a result, we have added to and refined the mapping and linking predicates used in the UMBEL vocabulary, and added a Qualifier class to record the mapping process, as this post overviews. We have added the SuperType class to better organize and disambiguate large knowledge bases [13]. And, in this mapping process, we have expanded UMBEL’s reference concepts by about 33% to improve coverage, while remaining consistent with its origins as a faithful subset of the venerable Cyc knowledge structure [14].

A side benefit that has emerged from these efforts — with a huge potential upside — is the valuable combination of UMBEL and Wikipedia as a “gold standard” for aligning and mapping knowledge bases. Such a standard is critically needed. For example, in reviewing many of the existing Wikipedia mappings claimed as accurate, we found misplacement errors that averaged 15.8% [15]. Having a baseline of vetted mappings will aid future mappings. Moreover, having a complete conceptual infrastructure over Wikipedia will enable new and valuable reasoning and inference services.

The results from the UMBEL v 1.00 mapping are promising and very much useful today, but by no means complete. Future versions will extend the current mappings and continue to refine its accuracy and completeness [16]. What we can say, however, is that a coherent organization and conceptual schema — namely, UMBEL — overlaid on the richness of the instance data and content of Wikipedia, can produce immediate and useful benefits. These benefits apply to semantic search, semantic annotation and tagging, reasoning, discovery, inferencing, organization and comparisons.

[1] UMBEL has been under development since March 2007, with its first release in July 2008 and its last release (v 0.80) in November 2010. Throughout its releases we have reserved incrementing the vocabulary and its ontology to version 1.00 until it was deemed “commercial”. This is the first version to meet this test. We’d like to thank our partner, Ontotext, and the RENDER project for providing assistance and resources to bring the system to this point.

[2] The basic approach is to use the DBpedia representation of Wikipedia, since its extractors have already done a great job in preparing structured data.

[3] M.K. Bergman, 2010. “The Nature of Connectedness on the Web,” AI3:::Adaptive Information blog, November 22, 2010; see https://www.mkbergman.com/935/the-nature-of-connectedness-on-the-web/.

[4] A good starting reference for some of these concepts is Pascal Hitzler et al., eds., 2009. OWL 2 Web Ontology Language Primer, a W3C Recommendation, 27 October 2009; see http://www.w3.org/TR/owl2-primer/.

[5] Such as the semantic reasoners FaCT++, Racer, Pellet, Hermit, etc..

[6] Fred Giasson first coined this phrase; see F. Giasson, 2008. “Exploding the Domain: UMBEL Web Services by Zitgist,” blog posting on April 20, 2008; see http://fgiasson.com/blog/index.php/2008/04/20/exploding-the-domain-umbel-web-services-by-zitgist/.

[7] Among many, many references, see a fairly comprehensive discussion of this issue at http://ontologydesignpatterns.org/wiki/Community:Overloading_OWL_sameAs.

[8] This predicate is designed for the circumstance of aligning two different ontologies or knowledge bases based on node-level correspondences, but without entailing the actual ontological relationships and structure of the object source. For example, the umbel:correspondsTo predicate is used to assert close correspondence between UMBEL Reference Concepts and Wikipedia categories or pages, yet without entailing the actual Wikipedia category structure.

[9] Jennifer Sleeman and Tim Finin, 2010. “Learning Co-reference Relations for FOAF Instances,” Proceedings of the Poster and Demonstration Session at the 9th International Semantic Web Conference, November 2010; see http://ebiquity.umbc.edu/_file_directory_/papers/522.pdf.

[10] For example, in the words of Tim Finin of the Ebiquity group:

“The solution we are currently exploring is to define a new property to assert that two RDF instances are co-referential when they are believed to describe the same object in the world. The two RDF descriptions might be incompatible because they are true at different times, or the sources disagree about some of the facts, or any number of reasons, so merging them with owl:sameAs may lead to contradictions. However, virtually merging the descriptions in a co-reference engine is fine — both provide information that is useful in disambiguating future references as well as for many other purposes.”

See quote on http://www.semanticoverflow.com/questions/1095/alternatives-to-owlsameas-for-linked-data.

[11] The same vocabulary construct can be applied to other domain ontologies based on the UMBEL Vocabulary.

[12] The efforts with Wikipedia have been ongoing to a certain degree since the inception of UMBEL. As one example, we have been maintaining a comprehensive tracking of the use of Wikipedia for mapping and semantic technology purposes, called SWEETpedia, for many years. Applying these techniques to both UMBEL and Wikipedia has been most active over the past 18 months.

[13] See the UMBEL Annex G: UMBEL SuperTypes Documentation, which will also be slightly updated upon the new UMBEL v 1.00 release.

[14] See the ‘Use of OpenCyc’ section in the UMBEL Specifications.

[15] These sources and error rates will be detailed in a paper after the pending new release of UMBEL.

[16] Fortunately, returns on the time investment will accelerate since basic lessons and techniques have now been learned.

Posted:January 17, 2011

Declining IT Innovation in the Enterprise

Reasons for and Implications from Innovation Moving to Consumers

Today, the headlines and buzz for information technologies centers on smartphones, social networks, cloud computing, tablets and everything Internet. Very little is now discussed about IT in the enterprise. This declining trend began about 15 years ago, and has been accelerating over time. Letting the air out of the enterprise IT balloon has some profound reasons and implications. It also has some lessons and guidance related to semantic approaches and technologies and their adoption by enterprises.

A Brief Look at Sixty Years of Enterprise IT

One can probably clock the start of enterprise information technology (IT) to the first use of mainframe computers in the early 1950s [1], or sixty years ago. The earliest mainframes were huge and expensive machines that required their own specially air-conditioned rooms because of the heat they generated. The first use of “information technology” as a term occurred in a Harvard Business Review article from 1958 [2].

Until the late 1960s computers were usually supplied under lease, and were not purchased [3]. Service and all software were generally bundled into the lease amount without separate charge and with source code provided. Then, in 1969, IBM led an industry change by starting to charge separately for (mainframe) software and services, and ceasing to supply source code [3]. At about the same time integrated circuits enabled computer sizes to be reduced, with the minicomputers such as from DEC causing a marked expansion in number of potential customers. Enterprise apps became a huge business, with software licensing and maintenance fees achieving a peak of 70% of IT vendor total revenues by the mid-1990s [4]. However, since that peak, enterprise software as a portion of vendor revenues has been steadily eroding.

One of the earliest enterprise applications was in transaction systems and their underlying database management software. The relational database management system (RDBMS) was initially developed at IBM. Oracle, based on early work for the CIA in the late 1970s and its innovation to write in the C programming language, was able to port the RDBMS to multiple operating systems. These efforts, along with those of other notable vendors (most of which like Informix no longer exist), led to the RDBMS becoming more or less the de facto standard for data management within the enterprise by the 1980s. Today Oracle is the largest supplier of RDBMS software globally, and other earlier database system designs such as network databases or object databases fell out of favor [5].

In 1975, the Altair 8800 was introduced to electronics hobbyists as the first microcomputer, followed then by Apple II and the IBM PC in 1981, among others. Rapidly a slew of new applications became available to the individual, including spreadsheets, small databases, graphics programs and word processors. These apps were a boon to individual productivity and the IBM PC in particular brought credibility and acceptance within the enterprise (along with the growth of Microsoft). Novell and local area networks also pointed the way to a more distributed computing future. By the late 1980s virtually every knowledge worker within enterprises had some degree of computer literacy.

The apogee for enterprise software and apps occurred in the 1990s, with whole classes of new applications (most denoted by three-letter acronyms) such as enterprise resource planning (ERP), business intelligence (BI), customer relationship management (CRM), enterprise information systems (EIS) and the like coming to the fore. These systems also began as proprietary software, which resulted in the “stovepiping” or creating of information silos. In reaction and with great market acceptance, vendors such as SAP arose to provide comprehensive, enterprise-wide solutions, though often at high cost and with significant failure rates.

More significantly, the 1990s also saw the innovation of the World Wide Web with its basis in hypertext links on the Internet. Greatly facilitated by the Mosaic Web browser, the basis of the commercial Netscape browser, and the HTML markup language and HTTP transport protocol, millions began experiencing the benefit of creating Web pages and interconnecting. By the mid-1990s, enterprises were on the Web in force, bringing with them larger content volumes, dynamic databases and enterprise portals. The ability for anyone to become a publisher led to a focus and attention on the new medium that led to still further innovations in e-commerce and online advertising. New languages and uses of Web pages and applications emerged, creating a convergence of design, media, content and interactivity. Venture capital and new startups with valuations independent of revenues led to a frenzy of hype and eventually the dot com crash of 2000.

The growth companies of the past 15 years have not had the traditional focus on enterprises, but on the use and development of the Web. From search (Google) to social interactions (Facebook) to media and video (Flickr, YouTube) and to information (Wikipedia), the engines of growth have shifted away from the enterprise.

Meanwhile, the challenges of data integration and interoperability that were such a keen focus going back to initial enterprise computerization remain. Now, however, these challenges are even greater, as we see images, documents (unstructured data) and Web pages, markup and metadata (semi-structured data) become first-class information citizens. What was a challenge in integrating structured data in the 1980s and 1990s via data warehousing, has now become positively daunting for the enterprise with respect to scale and scope.

The paradox is that as these enterprise needs increased, the attractiveness of the enterprise from an IT perspective has greatly decreased. It is these factors we discuss below, with an eye to how Web architecture, design and opportunities may offer a new path through the maze of enterprise information interoperability.

The Current Landscape

Since 1995 the Gartner Group has been producing its annual Hype Cycle [6]. The clientele for this research is the enterprise, so Gartner’s presentation of what’s hot and what’s hype and what is being adopted is a good proxy for the IT state of affairs in enterprises. These graphs are reproduced below since 2006 (click to expand). Note how many of the items shown are not very specific to the enterprise:

References to architectures and content processing and related topics were somewhat prevalent in 2006, but have disappeared most recently. In comparison to the innovations noted under the History discussion, it appears that the items on Gartner’s radar are more related to consumer applications and uses. We no longer see whole new categories of enterprise-related apps or enterprise architectures.

The kinds of innovations that are being discussed as important to enterprises in the coming year [7,8] tend to mostly leverage existing innovations in other areas or to wrinkle existing approaches. One report from Constellation Research, for example, lists the five core disruptive technologies of social, mobile, cloud, analytics and unified communications [7]. Only analytics could be described as enterprise focused or driven.

And, even in analytics, the kinds of things being promoted are self-service reporting or analysis [8]. In essence, these opportunities represent the application of Web 2.0 techniques to bring reporting or analysis directly to the analyst. Though important and long overdue, such innovations are more derivative than fundamental.

Master data management (MDM) is another touted area. But, to read analyst’s predictions in these areas, it feels like one has stepped into a time warp of technologies and options from a decade ago. When has XML felt like an innovation?

Of course, there is a whole industry of analysts that makes their living prognosticating to enterprises about what to expect from information technologies and how to adopt and embrace them. The general observations — across the board — seem to center on items such as smartphones and mobile, moving to the cloud for software or platforms (SaaS, PaaS), and collaboration and social networks. As I note below, there is nothing inherently wrong or unexciting per se about these trends. But, what does appear true is that the locus of innovation has shifted from the enterprise to consumers or the Internet.

Seven Reasons for a Shift in Innovation

The shift in innovation away from the enterprise has been structural, not cyclical. That means that very fundamental forces are at work to cause this change in innovation focus. It does not mean that innovation has permanently shifted away from the enterprise (organizations), but that some form of countervailing structural changes would need to occur to see a return to the IT focus on the enterprise from prior decades.

I think we can point to seven structural reasons for this shift, many of which interact with one another. While all of them are bringing benefits (some yet to be foreseen) to the enterprise, and therefore are to be lauded, they are not strictly geared to address specific enterprise challenges.

#1: The Internet

As pundits say, “The Internet changes everything” [9]. For the reasons noted under the history above, the most important cause for the shift in innovation away from the enterprise has been the Internet.

One aspect that is quite interesting is the use of Internet-based technologies to provide “outsourced” enterprise applications hosted on Web servers. Such “cloud computing” leverages the technologies and protocols inherent to the Internet. It shifts hosting, maintenance and upgrade responsibilities for conventional apps to remote providers. Initially, of course, this simply shifts locus and responsibility from in-house to a virtual party. But, it is also the case that such changes will also promote more subtle shifts in collaboration and interaction possibilities. There is also the fact that quick upgrades of underlying infrastructure and application software can also occur.

The implications for existing enterprise IT staff, traditional providers, and licensing and maintenance approaches are profound. The Internet and cloud computing will perhaps have a greater effect on governance, staffing and management than application functionality per se.

#2: Consumer Innovations

The captivating IT-related innovations at present are mobile (smartphones) and their apps, tablets and e-book readers, Internet TV and video, and social networks of a variety of stripes. Somewhat like the phenomenon of when personal computers first appeared, many of these consumer innovations have applicability to the enterprise, though only as a side effect.

It is perhaps instructive to look back at the adoption of PCs in the enterprise to understand the possible effect of these new consumer innovations. Central IT was never able to control and manage the proliferation of personal computers, and only began to understand years later what benefits and new governance challenges they brought. Enterprise leaders will understand how to embrace and extend today’s new consumer technologies for the enterprise’s benefits; laggards will resist to no avail.

The ubiquity of computing will be enormously impactful on the enterprise. The understanding of what makes sense to do on a mobile basis with a small screen and what belongs on the desk or in the office is merely a glimmer in the current conversation. However, in the end, like most of the other innovations noted in this analysis, the enterprise will largely be a reactive player to these innovations. Yes, the implications will be profound, but their inherent basis are not grounded in unique enterprise challenges. Nonetheless, adapting to them and changing business practice will be critical to asserting enterprise leadership.

#3: Open Source

Ten years ago open source was largely dismissed in the enterprise. About five years ago VCs and others began funding new commercial open source ventures, even while there were still rear guard arguments from enterprises resisting open source. Meanwhile, as the figure to the right shows, open source projects were growing exponentially [10].

The shift to open source in the enterprise, still ongoing, has been rapid. Within 5 years, more than 50% of enterprise software will be open source [11] . According to an article in Fortune magazine last year [12], a Forrester Research survey found that 48% of enterprise respondents were using open source operating systems, and 57% were using open source code. A similar Accenture survey of 300 large public and private companies found that half are committed to open source software, with 38% saying they would begin using open-source software for “mission-critical” applications over the next 12 months.

There are likely many reasons for this shift, including the Internet itself and its basis in open source. Many of the most successful companies of the past 15 years including Amazon, Google, Facebook, and virtually any large Web site has shown excellent performance and scalability building their IT infrastructure around open source foundations. Most of the large, existing enterprise IT vendors, notably including IBM, Oracle, Nokia, Intel, Sun (prior to Oracle), Citrix, Novell (just acquired by Attachmate) and SAP have bought open source providers or have visible support for open source initiatives. Even two of the most vocal proprietary source proponents of the past — HP and Microsoft — have begun to make moves toward open source.

The age of proprietary software based on proprietary standards is dead. The monopoly rents formerly associated with unique, proprietary platforms and large-scale enterprise apps are over. Even where software remains proprietary, it is embracing open standards for data interchange and APIs. Traditional enterprise apps such as content management, business intelligence and ETL, among all others, are being penetrated by commercial open source offerings (as examples, Alfresco, Pentaho and Talend, respectively). The shift to services and new business models appears to be an inexorable force.

Declining profit margins, matched with the relatively high cost of marketing and sales to enterprises, means attention and focus have been shifting away from the enterprise. And with these shifts in focus has come a reduction in enterprise-focused innovation.

#4: Slow Development Cycles in Enterprise

It is not unusual to find deployed systems within enterprises as old as thirty years [13]. So long as they work reasonably well, systems once installed — along with their data — tend to remain in operation until their platforms or functionality become totally obsolete. This leads to rather lengthy turnover cycles, and slow development cycles.

Slow cycles in themselves slow innovation. But slow development cycles are also a disincentive to attract the most capable developers. When development tends to focus on maintenance and scripts and more routines of the same nature, the best developers tend to migrate elsewhere (see next).

Another aspect of slow development cycles is the imperative for new enterprise IT to relate to and accommodate legacy systems — again, including legacy data. This consideration is the source of one of the negative implications of a shift away from innovation in the enterprise: the orphaning of existing information assets.

#5: What’s Hot: Developers

Arguably the emphasis on consumer and Internet technologies means that is where the best developers gravitate. Developing apps for smartphones or working at one of the cool Internet companies or joining a passionate community of open source developers is now attracting the best developers. Open source and Web-based systems also lead to faster development cycles. The very best developers are often the founders of the next generation startups and Web and software companies [14].

While, of course, huge numbers of computer programmers and IT specialists are hired by enterprises each year, the motivations tend to be higher pay, better benefits and more job security. The nature of the work and the bureaucracy and routine of many IT functions require such compensation. And, because of the other shifts noted elsewhere, even the software startups that are able to attract the most innovative developers no longer tend to develop for enterprise purposes.

Computer science students have been declining in industrialized countries for some time and that is the category of slowest growth in IT [14]. Meanwhile, existing IT personnel often have expertise in older legacy systems or have been focused on bug fixes and more prosaic tasks like report writing. Narrow job descriptions and work activities also keep many existing IT personnel from getting exposed to or learning about new trends or innovations, such as the semantic Web.

Declining numbers of new talent, plus declining interest by that talent, combined with (often) narrow and legacy expertise of existing talent, creates a disappointing storm of energy and innovation to address enterprise IT challenges. Enterprises have it within their power to create more exciting career opportunities to overcome these limitations, but unfortunately IT management often also appears challenged to get on top of these structural forces.

#6: What’s Hot: Startups

Open source and Internet-based systems have reduced the capital necessary for a new startup by an order of magnitude or so over the past decade. It is now quite possible to get a new startup up and running for tens to hundreds of thousands of dollars, as opposed to the millions of years past. This is leading to more startups, more startups per innovator, and quicker startup and abandonment cycles. Ideas can be tried quickly and more easily thrown away [15].

These dynamics are acting to accelerate overall development cycles and to cause a shift in funding structures and funding amounts by VCs and angels. The kind of market and sales development typical for many enterprise sales does not fit well within these dynamics and is a countervailing force for more capital when all trends point the other way.

In short, all of this is saying that money goes to where the returns are, and returns are not of the same basis as decades past in the enterprise sector. Again, this means a hollowing out of innovation for enterprises.

#7: Declining Software Rents and Consolidation

As an earlier reference noted [4], software revenues as a percent of IT vendor revenues peaked in about the mid-1990s. As profitability for these entities began to decline, so did the overall attractiveness of the sector.

As the next chart shows, coincident with the peak in profitability was the onset of a consolidation trend in the enterprise IT vendor sector [16]. The chart below shows that three of the largest IT vendors today — Oracle, IBM and HP — began an acquisition spree in the mid-1990s that has continued until just recently, as many of the existing major players have already been acquired:

Notable acquisitions over this period include: Oracle — PeopleSoft, Siebel Systems, MySQL, Hyperion, BEA and Sun; HP — EDS, 3Com, VeriFone, Compaq, Palm and Mercury Interactive; IBM — Lotus, Rational, Informix, Ascential, FileNet, Cognos and SPSS. Published acquisition costs exceeded $130 billion, mostly for the larger deals. But terms for 75% of the 262 transactions were not disclosed [16]. The total value of these consolidations likely approaches $200 billion to $300 billion.

Clearly, the market is now favoring large players with large service components. This consolidation trend does belie one early criticism of open source v proprietary software: proprietary software is likely to be better supported. In theory this might be true, but vanishing suppliers does not bode well for support either. Over time, we may likely see successful open source projects showing greater longevity than many IT vendors.

Positive Implications from the Decline

This discussion is not a boo-hoo because the heyday of enterprise IT innovation is past. Much of that innovation was expensive, often failed to achieve successful adoption, and promoted walled gardens and silos. As someone who ran companies directly involved in enterprise software sales, I personally do not miss the meetings, the travel, the suits and the 18-month sales cycles.

The enterprise has gained much from outside innovation in the past, from the personal computer to LANs and browsers and the Internet. To be sure, what we are now seeing with mobile phones has more computing power than the original Space Shuttle [17], and continued mashup and social engagement innovations will have unforeseen and manifest benefits for enterprises. I think this is unalloyed goodness.

We can also see innovations based on the Internet such as the semantic Web and its languages and standards to promote interoperability. Breaking these barriers is critically needed by enterprises of the future. Data models such as RDF [18] and open world mindsets that better accommodate uncertainty and breadth of information [19] can only be seen as positive. The leverage that will come from these non-enterprise innovations may in the end prove to be as important as the enterprise-specific innovations of the past.

Negative Implications from the Decline

Yet a shift to Internet and consumer IT innovation leaves some implications. These concerns have to do with the unique demands and needs of enterprises. One negative implication is that a diminishing supplier base may not lead to actual deployments that are enterprise-ready or -responsive.

The first concern relates to quality and operational integrity. There is an immense gulf between ISO 9000 or Six Sigma and, for example, the “good enough” of standard search results on the Web. Consumer apps do not impose the same thresholds for quality as demanded by paying bosses or paying customers. This is not a value judgment; simply a reality. I see it reflected in the quality of tools and code for many new innovations today on the Web.

Proofs-of-concept and “cool” demos work well for academic theses or basic intros to new concepts. The 20% that gets you 80% goes a long way to point the way to new innovation; but the 80% to get to the last 20% is where enterprises bet their money. Unfortunately, in too many instances, that gap is not being filled. The last 20% is hard work, often boring, and certainly not as exciting as the next Big Thing. And, as the trends above try to explicate, there are also diminishing rewards for living in that territory.

A similar and second concern pervades data interoperability. Data interoperability has been the central challenge of enterprise IT for at least three decades. As soon as we were able to interconnect systems and bridge differences in operating systems and data schema, the Holy Grail has been breaking information barriers and silos. The initial attempts with proprietary data warehouses or enterprise-wide ERP systems were wrongly trying to apply closed solutions to inherently open problems. But, now, finally when we have the open approaches and standards in hand for bridging these gaps, the attractiveness of doing so for the enterprise seems to have vanished.

For example, we see demos, tools and algorithms being published all over the place that show promising advances or improvements in the semantic Web or linked data (among other areas; see [20]). Some of these automated techniques sound wonderful, but real systems require the hard slog of review and manual approval. Quality matters. If Technique A, say, shows an improvement over Technique B of 5%, that is worth touting. But even at 98% percent accuracy, we will still find 20,000 errors in a population of 1 million items. Such errors will simply not work in having trains run on time, seats be available on airplanes, or inventory get to their required destinations.

What can work from the standpoint of linkage or interoperability on the Web according to consumer standards will simply not fly for many enterprises. But, where are the rewards for tackling that hard slog?

Another concern is security and differential access. Open Web systems, bless their hearts, do not impose the same access and need to know restrictions as information systems within enterprises. If we are to adopt Web-based approaches to the next-generation enterprise — a position we strongly advocate — then we are also going to need to figure out how to marry these two world views. Again, there appears to be an effort-reward mismatch here.

What Lessons Might be Drawn?

These observations are not meant to be a polemic, but a statement of more-or-less current circumstances. Since its widescale adoption, the major challenge — and opportunity — of enterprise IT has been how to leverage the value within the enterprise’s existing digital information assets. That challenge is augmented today with the availability of literally a whole world of external digital knowledge. Yet, the energy and emphasis for innovation to address these challenges has seemingly shifted to consumers and away from the enterprise.

Economics abhors a vacuum. I think two responses may be likely to this circumstance. The first is that new vendors will emerge to address these gaps, but with different cost structures and business models. I’d like to think my own firm, Structured Dynamics, is one of these entities. How we are addressing this opportunity and differences in our business model we will discuss at a later time. In any case, any such new player will need to take account of some of the structural changes noted above.

Another response can come from enterprises themselves, using and working the same forces of change noted earlier. Via collaboration and open source, enterprises can band together to contribute resources, expertise and people to develop open source infrastructures and standards to address the challenges of interoperability. We already see exemplars of such responses in somewhat related areas via initiatives such as Eclipse, Apache, W3C, OASIS and others. By leveraging the same tools of collaboration and open data and systems and the Internet, enterprises can band together and ensure their own self-interests are being addressed.

One advantage of this open, collaborative approach is that it is consistent with the current innovation trends in IT. But the real advantage is that it works and is needed. Without it, it is unclear how the enterprise IT challenge — especially in data interoperability — will be met.

[1] Though calculating machines and others extend back to Charles Babbage and more relevant efforts during World War II, the first UNIVAC was delivered to the US Census Bureau in 1951, and the first IBM to the US Defense Department in 1953. Many installations followed thereafter. See, for example, Lectures in the History of Computing: Mainframes.

[2] As provided by “information technology” (subscription required), Oxford English Dictionary (2 ed.), Oxford University Press, 1989, http://dictionary.oed.com/, retrieved 12 January 2011.

[3] See further the Wikipedia entry on proprietary software.

[4] M.K. Bergman, 2006. “Redux: Enterprise Software Licensing on Life Support,” AI3:::Adaptive Information blog, June 2, 2006. See https://www.mkbergman.com/111/the-death-of-enterprise-software-licensing/.

[5] The combination of distributed network systems and table-oriented designs such as Google’s BigTable and related open source Hadoop, plus many scripting languages, is leading to the resurgence of new database designs including NoSQL, columnar, etc.

[6] The Gartner Hype Cycle is a graphical representation of the maturity, adoption and application of technologies. It proceed through five phases beginning with a technology trigger and then, if successful, ultimately adoption. The peak of the curve represents the biggest “hype” for the innovation.The information in these charts is courtesy of Gartner. The sources for the charts are summary Gartner reports for 2010, 2009, 2008, and 2006. 2007 was skipped to provide a bit longer time horizon for comparison purposes.

[7] As summarized by Klint Finley, 2011. “How Will Technology Disrupt the Enterprise in 2011?,” ReadWriteWeb Enterprise blog, January 4, 2011.

[8] Jaikumar Vijayan, 2011. “Self-service BI, SaaS, Analytics will Dominate in 2011,” in Computerworld Online, January 3, 2011.

[9] According to Google on January 12, 2011, there were 251,000 uses of this exact phrase on the Web.

[10] Amit Deshpande and Dirk Riehle, 2008. “The Total Growth of Open Source,” in Proceedings of the Fourth Conference on Open Source Systems (OSS 2008), Springer Verlag, pp 197-209; see http://dirkriehle.com/wp-content/uploads/2008/03/oss-2008-total-growth-final-web.pdf.

[11] See http://futureofopensource.drupalgardens.com/2010-survey-results.

[12] See http://tech.fortune.cnn.com/2010/08/16/how-corporate-america-went-open-source/.

[13] For example, according to James Mullarney in 2005, “How to Deal with the Legacy of Legacy Systems,” the average age of IT systems in the insurance industry was 23 years. In that same year, according to Logical Minds, a survey by HAL Knowledge Systems showed the average age of applications running core business processes to be 15 years old, with almost 30 per cent of companies maintaining software that is 25 years old or older.

[14] For general IT employment trends, see the Bureau of Labor Statistics; for example, http://www.bls.gov/oco/ocos303.htm.

[15] See, for example, Paul Graham, 2010. “The New Funding Landscape,” Blog post, October 2010.

[16] This chart was constructed from these sources: Oracle — http://en.wikipedia.org/wiki/List_of_acquisitions_by_Oracle; IBM — http://en.wikipedia.org/wiki/List_of_mergers_and_acquisitions_by_IBM; and HP — http://en.wikipedia.org/wiki/List_of_acquisitions_by_Hewlett-Packard. Of course, other acquisitions occurred by other players over this period as well.

[17] Current smartphones may have around 2 GHz in processing power and 1 GB of RAM; see for example, this Motorola press release. By comparison to the Shuttle, see http://en.wikipedia.org/wiki/Space_Shuttle#Flight_systems.

[18] M. K. Bergman, 2009. “Advantages and Myths of RDF,” AI3:::Adaptive Information blog, April 8, 2009.

[19] M. K. Bergman, 2009. “The Open World Assumption: Elephant in the Room,” AI3:::Adaptive Information blog, Dec. 21, 2009.

[20] See, for example, the Sweet Tools listing of 900 semantic Web and -related tools on this AI3:::Adaptive Information blog.

Posted:December 6, 2010

What is a Reference Concept?

And, Seven Guidelines for this Second of Two Semantic ‘Gaps’

Download PDF

I have been writing and speaking of late about next priorities to promote the interoperability of linked data and the semantic Web. In a talk a few weeks back to the Dublin Core (DCMI) annual conference, I summarized these priorities as the need to address two aspects of the semantic “gap”:

One aspect is the need for vetted reference sources that provide the entities and concepts for aligning disparate content sources on the Web, and
A second aspect is the need for accurate mapping predicates that can represent the often approximate matches and overlaps of this heterogeneous content.

I discussed the second aspect in an earlier post [1]. In today’s installment, we now focus on the “gap” relating to reference concepts.

The Web Increases the Need for Organization

Interoperability comes down to the nature of things and how we describe those things or quite similar things from different sources. Given the robust nature of semantic heterogeneities in diverse sources and datasets on the Web (or anywhere else, for that matter!) [2], how do we bring similar or related things into alignment? And, then, how can we describe the nature or basis of that alignment?

Of course, classifiers since Aristotle and librarians for time immemorial have been putting forward various classification schemes, controlled vocabularies and subject headings. When one wants to find related books, it is convenient to go to a central location where books about the same or related topics are clustered. And, if the book can be categorized in more than one way — as all are — then something like a card catalog is helpful to find additional cross-references. Every domain of human endeavor makes similar attempts to categorize things.

On the Web we have none of the limitations of physical books and physical libraries; locations are virtual and copies can be replicated or split apart endlessly because of the essentially zero cost of another electron. But, we still need to find things and we still want to gather related things together. According to Svenonius, “Organizing information if it means nothing else means bringing all the same information together” [3]. This sentiment and need remains unchanged whether we are talking about books, Web documents, chemical elements or linked data on the Web.

Like words or terms in human language that help us communicate about things, how we organize things on the Web needs to have an understood and definable meaning, hopefully bounded with some degree of precision, that enables us to have some confidence we are really communicating about the same something with one another. However, when applied to the Web and machine communications, the demands for how these definitions and precisions apply need to change. This makes the notion of a Web basis for organization both easier and harder than traditional approaches to classification.

It is easier because everything is virtual: we can apply multiple classification schema and can change those schema at will. We are not locked into historical anomalies like huge subject areas reserved for arcane or now historically less important topics, such as the Boer Wars or phrenology. We need not move physical books around on shelves in order to accommodate new or expanded classification schemes. We can add new branches to our classification of, say, nanotechnology as rapidly as the science advances.

Yet it is harder because we can no longer rely on the understanding of human language as a basis for naming and classifying things. Actually, of course, language has always been ambiguous, but it is manifestly more so when put through the grinder of machine processing and understanding. Machine processing of related information adds the new hurdles of no longer being able to rely on text labels (“names”) alone as the identifier of things and requires we be more explicit about our concept relationships and connections. Fortunately, here, too, much has been done in helping to organize human language through such lexical frameworks as WordNet and similar.

The Idea and Role of Reference Concepts

Many groups and individuals have been grappling with these questions of how to organize and describe information to aid interoperability in an Internet context. Among many, let me simply mention two because of the diversity their approaches show.

Bernard Vatant, for one, has with his colleagues been an advocate for some time for the need for what he calls “hubjects.” With an intellectual legacy from the Topic Maps community, the idea of “hubjects” is to have a flat space of reference subjects to which related information can link and refer. Each hubject is the hub of a spoked wheel of representations by which the same subject matter from different contexts may be linked. The idea of the flat space or neutrality in the system is to place the “hubject” identifier (referent) outside of other systems that attempt to organize and provide “meta-frameworks” of knowledge organization. In other words, there are no inherent suggested relationships in the reference “hubject” structure: just a large bin of defined subjects to which external systems may link.

A different and more formalized approach has been put forward by the FRSAD working group [4], dealing with subject authority data. Subject authority data is the type of classificatory information that deals with the subjects of various works, such as their concepts, objects, events, or places. As the group stated, the scope of this effort pertains to the “aboutness” of various conceptual works. The framework for this effort, as with the broader FRBR effort, are new standards and approaches appropriate to classifying electronic bibliographic records.

Besides one of the better summaries and introductions to the general problems of subject classification in general, the FRSAD approach makes its main contribution in clearly distinguishing the idea of something (which it calls a thema, or entity used as the subject of a work) from the name or label of something (which it calls nomen). For many in the logic community, steeped in the Peirce triad of sign–object–interpretant [5], this distinction seems rather obvious and straightforward. But, in library science, labels have been used interchangeably as identifiers, and making this distinction clean is a real contribution. The FRSAD effort does not itself really address how the thema are actually found or organized.

The notion of a reference concept used herein combines elements from both of these approaches. A reference concept is the idea of something, or a thema in the FRSAD sense. It is also a reference hub of sorts, similar to the idea of a “hubject”. But it is also much more and more fully defined.

So, let’s first begin by representing a reference concept in relation to its referers and possible linking predicates as follows:

A referer needs to link appropriately to its reference concept, with some illustrative examples shown on the arrows in the diagram. These links are the predicates, ranging from the exact to the approximate, discussed in the first semantic “gap” posting. (Note: see that earlier post for a longer listing of existing, candidate linking predicates. No further comment is made in this present article as to whether those in that earlier posting or the example ones above are “correct” or not; see the first post for that discussion.)

If properly constructed and used, a reference concept thus becomes a fixed point in an information space. As one or more external sources link to these fixed points, it is then possible to gather similar content together and to begin to organize the information space, in the sense of Svenonius. Further, and this is a key difference from the “hubject” approach, if the reference concept is itself part of a coherent structure, then additional value can be derived from these assignments, such as inference, consistence testing, and alignments. (More on this latter point is discussed below.)

Seven Guidelines for a Reference Concept

If the right factors are present, it should be possible to relate and interoperate multiple datasets and knowledge representations. If present, these factors can result in a series of fixed reference points to which external information can be linked. In turn, these reference nodes can form constellations to guide the traversal to desired information destinations on the Web.

Let’s look at the seven factors as to what constitutes guidelines for best practices.

Guideline #1: Persistent URI

By definition, a Web-based reference concept should adhere to linked data principles and should have a URI as its address and identifier. Also, by definition as a “reference”, the vocabulary or ontology in which the concept is a member should be given a permanent and persistent address. Steps should be taken to ensure 24×7 access to the reference concept’s URI, since external sources will be depending on it.

As a general rule, the concepts should also be stated as single nouns and use CamelCase notation (that is, class names should start with a capital letter and not contain any spaces, such as MyNewConcept).

Guideline #2: Preferred Label

Provide a preferred label annotation property that is used for human readable purposes and in user interfaces. For this purpose, a construct such as the SKOS property of skos:prefLabel works well. Note, this label is not the basis for deciding and making linkages, but it is essential for mouseovers, tooltips, interface labels, and other human use factors.

Guideline #3: Definition

Give all concepts and properties a definition. The matching and alignment of things is done on the basis of concepts (not simply labels), which means each concept must be defined [6]. Providing clear definitions (along with the coherency of its structure) gives an ontology its semantics. Remember not to confuse the label for a concept with its meaning. For this purpose, a property such as skos:definition works well, though others such as rdfs:comment or dc:description are also commonly used.

The definition is the most critical guideline for setting the concept’s meaning. Adequate text and content also aid semantic alignment or matching tasks.

Guideline #4: Tagset

Include explicit consideration for the idea of a “semset” or “tagset”, which means a series of alternate labels and terms to describe the concept. These alternatives include true synonyms, but may also be more expansive and include jargon, slang, acronyms or alternative terms that usage suggests refers to the same concept. The semset construct is similar to the “synsets” in Wordnet, but with a broader use understanding. Included in the semset construct is the single (per language) preferred (human-readable) label for the concept, the prefLabel, an embracing listing of alternative phrase and terms for the concept (including acronyms, synonyms, and matching jargon), the altLabels, and a listing of prominent or common misspellings for the concept or its alternatives, the hiddenLabels.

This tagset is an essential basis for tagging unstructured text documents with reference concepts, and for search not limited to keywords. The tagset, in combination with the definition, is also the basis for feeding many NLP-driven methods for concept or ontology alignment.

Guideline #5: Language Independent

The practice of using an identifier separate from label, and language qualified entries for definition, preferred label and tagset (alternative labels) means that multi-lingual versions can be prepared for each concept. Though this is a somewhat complicated best practice in its own right (for example, being attentive to the xml:lang=”en” tag for English), adhering to this practice provides language independence for reference concepts.

Sources such as Wikipedia, with its richness of concepts and multiple language versions, can then be a basis for creation of alternative language versions.

Guideline #6: Range and Domain

Use of domains and ranges assists testing, helps in disambiguation, and helps in external concept alignments. Domains apply to the subject (the left hand side of a triple); ranges to the object (the right hand side of the triple). Domains and ranges should not be understood as actual constraints, but as axioms to be used by reasoners. In general, domain for a property is the range for its inverse and the range for a property is the domain of its inverse.

Guideline #7: Part of Coherent Structure

When reference concepts, properly constructed as above, are also themselves part of a coherent structure, further benefits may be gained. These benefits include inferencing, consistency testing, discovery and navigation. For example, the sample at right shows that a retrieval for Saab cars can also inform that these are automobiles, a brand of automobile, and a Swedish kind of car.

To gain these advantages, the coherent structure need not be complicated. RDFS and SKOS-based lightweight vocabularies can meet this test. Properly constructed OWL ontologies can also provide these benefits.

When best practices are combined with being part of a coherent structure, we can refer to these structures as reference ontologies or domain ontologies.

The State of Reference Concepts

In part, these best practices are met to a greater or lesser extent by many current vocabularies. But few provide complete coverage, and across a broad swath of domain needs, major gaps remain. This unfortunate observation applies to upper-level ontologies, reference vocabularies, and domain ontologies alike.

Upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc and BFO (Basic Formal Ontology). While these have a coherency of construction, they are most often incomplete with respect to reference concept construction. With the exception of SUMO and Cyc, domain coverage is also very general.

Our own UMBEL reference ontology [7] is closest to meeting all criteria. The reference concepts are constructed to standard. But coverage is fairly general, and not directly applicable to most domains (though it can help to orient specific vocabularies).

Wikipedia, as accessed via the DBpedia expression, has good persistent URIs, labels, altLabels and proxy definitions (via the first sentences abstract). As a repository of reference concepts, it is extremely rich. But the organizational structure is weak and provides very few of the benefits for coherent structures noted above.

Going back to 1997, DCMI has been involved in putting forward possible vocabularies that may act as “qualifiers” to dc:subject [8]. Such reference vocabularies can extend from the global or comprehensive, such as the Universal Decimal Classification or Library of Congress Subject Headings, to the domain specific such as MeSH in medicine or Agrovoc in agriculture [9]. One or more concepts in such reference vocabularies can be the object of a dc:subject assertion, for example. While these vocabularies are also a rich source of reference concepts, they are not constructed to standards and at most provide hierarchical structures.

In the area of domain vocabularies, we are seeing some good pockets of practice, especially in the biomedical and life sciences arena [10]. Promising initiatives are also underway in library applications [11] and perhaps other areas unknown to the author.

In summary, I take the state of the art to be quite promising. We know what to do, and it is being done in some pockets. What is needed now is to more broadly operationalize these practices and to extend them across more domains. If we can bring attention to and publicize exemplar vocabularies, we can start to realize the benefits of actual data interoperability on the Web.

[1] See M. K. Bergman, 2010. “The Nature of Connectedness on the Web,” AI3:::Adaptive Information blog, November 22, 2010. See https://www.mkbergman.com/935/the-nature-of-connectedness-on-the-web/.

[2] See M. K. Bergman, 2006. “Sources and Classification of Semantic Heterogeneities,” AI3:::Adaptive Information blog, June 6, 2006. See https://www.mkbergman.com/232/sources-and-classification-of-semantic-heterogeneities/.

[3] From a quote on page 10 by Elaine Svenonius, 2000. The Intellectual Foundation of Information Organization, MIT Press, 2000, 255pp. I’d like to thank Karen Coyle for recently posting this quote on the Linked Library Data (LLD) mailing list.

[4] Marcia Lei Zeng, Maja Žumer, Athena Salaba, eds., 2010. Functional Requirements for Subject Authority Data (FRSAD): A Conceptual Model, prepared by the IFLA Working Group on the Functional Requirements for Subject Authority Records (FRSAR), June 2010, 75 pp. See http://www.ifla.org/files/classification-and-indexing/functional-requirements-for-subject-authority-data/frsad-final-report.pdf. This effort is part of the broader and well-known FRBR (Functional Requirements of Bibliographic Records) initiative.

[5] C.S. Peirce’s sign relations are covered under the discussion about Semiotic Elements under the Sign section on Peirce in Wikipedia. In the the context of this discussion, the sign corresponds to any of the labels or identifiers associated with the (reference concept) object, the meaning of which is provided by its interpretant definition and useful language labels. See also John Sowa, 2000. “Ontology, Metadata, and Semiotics,” presented at ICCS’2000 in Darmstadt, Germany, on August 14, 2000; see http://www.jfsowa.com/ontology/ontometa.htm.

[6] As another commentary on the importance of definitions, see http://ontologyblog.blogspot.com/2010/09/physician-decries-lack-of-definitions.html.

[7] UMBEL (Upper Mapping and Binding Exchange Layer) is an ontology of about 20,000 subject concepts that acts as a reference structure for inter-relating disparate datasets. It is also a general vocabulary of classes and predicates designed for the creation of domain-specific ontologies.

[8] Rebecca Guenther, 1997. Dublin Core Qualifiers/Substructure, October 15, 1997. See http://www.loc.gov/marc/dcqualif.html.

[9] Two, among many, metadata listings of potential reference vocabularies are http://www.jiscdigitalmedia.ac.uk/crossmedia/advice/controlling-your-language-links-to-metadata-vocabularies/ and http://hilt.cdlr.strath.ac.uk/hilt2web/Sources/thesauri.html.

[10] For example, see the Open Biological and Biomedical Ontologies (OBO) initiative and the W3C‘s Semantic Web Health Care and Life Sciences Interest Group.

[11] See the W3C’s Linked Library Data initiative, with particular attention to topics and use cases.

Posted:November 26, 2010

Brown Bag Lunch: An Intrepid Guide to Ontologies

There’s an Endless Variety of World Views, and Almost as Many Ways to Organize and Describe Them

Ontology is one of the more daunting terms for those exposed for the first time to the semantic Web. Not only is the word long and without many common antecedents, but it is also a term that has widely divergent use and understanding within the community. It can be argued that this not-so-little word is one of the barriers to mainstream understanding of the semantic Web.

The root of the term is the Greek ontos, or being or the nature of things. Literally — and in classical philosophy — ontology was used in relation to the study of the nature of being or the world, the nature of existence. Tom Gruber, among others, made the term popular in relation to computer science and artificial intelligence about 15 years ago when he defined ontology as a “formal specification of a conceptualization.”

While there have been attempts to strap on more or less formal understandings or machinery around ontology, it still has very much the sense of a world view, a means of viewing and organizing and conceptualizing and defining a domain of interest. As is made clear below, I personally prefer a loose and embracing understanding of the term (consistent with Deborah McGuinness’s 2003 paper, Ontologies Come of Age [1]).

There has been a resurgence of interest in ontologies of late. Two reasons have been the emergence of Web 2.0 and tagging and folksonomies, as well as the nascent emergence of the structured Web. In fact, on April 23-24 one of the noted communities of practice around ontologies, Ontolog, sponsored the Ontology Summit 2007 ,”Ontology, Taxonomy, Folksonomy: Understanding the Distinctions.”

These events have sparked my preparing this guide to ontologies. I have to admit this is a somewhat intrepid endeavor given the wealth of material and diversity of opinions.

This Friday brown bag leftover was first placed into the AI3 refrigerator more than three years ago on May 16, 2007. This reprise is unchanged since its original posting, though there is a more recent executive-level intro to ontologies on the OpenStructs‘ TechWiki.

Overview and Role of Ontologies

Of course, a fancy name is not sufficient alone to warrant an interest in ontologies. There are reasons why understanding, using and manipulating ontologies can bring practical benefit:

Depending on their degree of formalism (an important dimension), ontologies help make explicit the scope, definition, and language and meaning (semantics) of a given domain or world view
Ontologies may provide the power to generalize about their domains
Ontologies, if hierarchically structured in part (and not all are), can provide the power of inheritance
Ontologies provide guidance for how to correctly “place” information in relation to other information in that domain
Ontologies may provide the basis to reason or infer over its domain (again as a function of its formalism)
Ontologies can provide a more effective basis for information extraction or content clustering
Ontologies, again depending on their formalism, may be a source of structure and controlled vocabularies helpful for disambiguating context; they can inform and provide structure to the “lexicons” in particular domains
Ontologies can provide guiding structure for browsing or discovery within a domain, and
Ontologies can help relate and “place” other ontologies or world views in relation to one another; in other words, ontologies can organize ontologies from the most specific to the most abstract.

Both structure and formalism are dimensions for classifying ontologies, which combined are often referred to as an ontology’s “expressiveness.” How one describes this structure and formality differs. One recent attempt is this figure from the Ontology Summit 2007‘s wrap-up communique:

Note the bridging role that an ontology plays between a domain and its content. (By its nature, every ontology attempts to “define” and bound a domain.) Also note that the Summit’s 50 or so participants were focused on the trade-off between semantics v. pragmatic considerations. This was a result of the ongoing attempts within the community to understand, embrace and (possibly) legitimize “less formal” Web 2.0 efforts such as tagging and the folksonomies that can result from them.

There is an M.C. Escher-like recursion of the lizard eating its tail when one observes ontologists creating an ontology to describe the ontological domain. The above diagram, which itself would be different with a slight change in Summit participation or editorship, is, of course, but one representative view of the world. Indeed, a tremendous variety of scientific and research disciplines concern themselves with classifying and organizing the “nature of things.” Those disciplines go by such names as logicians, taxonomists, philosophers, information architects, computer scientists, librarians, operations researchers, systematicists, statisticians, historians, and so forth. (In short, given our ontos, every area of human endeavor has the urge to classify, to organize.) In each of these areas not only do their domains differ, but so do the adopted structures and classification schemes often used.

There are at least 40 terms or concepts across these various disciplines, most related to Web and general knowledge content, that have organizational or classificatory aspects that — loosely defined — could be called an “ontology” framework or approach:

Actual domains or subject coverage are then mostly orthogonal to these approaches.

Loosely defined, the number of possible ontologies is therefore close to infinite: domain X perspective X schema. (Just kidding — sort of! In fact, UMBC’s Swoogle ontology search service claims 10,000 ontologies presently on the Web; the actual data from August 2006 ranges from about 16,000 to 92,000 ontologies, depending on how “formal” the definition. These counts are also limited to OWL-based ontologies.)

Many have misunderstood the semantic Web because of this diversity and the slipperiness of the concept of an ontology. This misunderstanding becomes flat wrong when people claim the semantic Web implies one single grand ontology or organizational schema, One Ring to Rule Them All. Human and domain diversities makes this viewpoint patently false.

Diversity, ‘Naturalness’ and Change

The choice of an ontological approach to organize Web and structured content can be contentious. Publishers and authors perhaps have too many choices: from straight Atom or RSS feeds and feeds with tags to informal folksonomies and then Outline Processor Markup Language or microformats. From there, the formalism increases further to include the standard RDF ontologies such as SIOC (Semantically-Interlinked Online Communities), SKOS (Simple Knowledge Organizing System), DOAP (Description of a Project), and FOAF (Friend of a Friend) and the still greater formalism of OWL’s various dialects.

Arguing which of these is the theoretical best method is doomed to failure, except possibly in a bounded enterprise environment. We live in the real world, where multiple options will always have their advocates and their applications. All of us should welcome whatever structure we can add to our information base, no matter where it comes from or how it’s done. The sooner we can embrace content in any of these formats and convert it to a canonical form, we can then move on to needed developments in semantic mediation, the threshold condition for the semantic Web.

There are at least 40 concepts — loosely defined — that could be called an “ontology” framework or approach.

So, diversity is inevitable and should be accepted. But that observation need not also embrace chaos.

In my early training in biological systematics, Ernst Haeckel’s recapitulation theory that “ontogeny recapitulates phylogeny” (note the same ontos root, the difference from ontology being growth v. study) was losing favor fast. The theory was that the development of an organism through its embryological phases mirrors its evolutionary history. Today, modern biologists recognize numerous connections between ontogeny and phylogeny, explain them using evolutionary theory, or view them as supporting evidence for that theory.

Yet, like the construction of phylogenetic trees, systematicists strive for their classifications of the relatedness of organisms to be “natural”, to reflect the true nature of the relationship. Thus, over time, that understanding of a “natural” system has progressed from appearance → embryology → embryology + detailed morphology → species and interbreeding → DNA. While details continue to be worked out, the degree of genetic relatedness is now widely accepted by biologists as a “natural” basis for organizing the Tree of Life.

It is not unrealistic to also seek “naturalness” in the organization of other knowledge domains, to seek “naturalness” in the organization of their underlying ontologies. Like natural systems in biology, this naturalness should emerge from the shared understandings and perceptions of the domain’s participants. While subject matter expertise and general and domain knowledge are essential to this development, they are not the only factors. As tagging systems on the Web are showing, common usage and broad acceptance by the community at hand is important as well.

While it may appear that a domain such as the biological relatedness of organisms is more empirical than the concepts and ambiguous words in most domains of human endeavor, these attempts at naturalness are still not foolish. The phylogeny example shows that understanding changes over time as knowledge is gained. We now accept DNA over the recapitulation theory.

As the formal SKOS organizational schema for knowledge systems recognizes (see below), the ideas of narrower and broader concepts can be readily embraced, as well as concepts of relatedness and aliases (synonyms). These simple constructs, I would argue, plus the application of knowledge being gained in related domains, will enable tomorrow’s understandings to be more “natural” than today’s, no matter the particular domain at hand.

So, in seeking a “naturalness” within our organizational schema, we can also see that change is a constant. We also see that the tools and ideas underlying the seemingly abstract cause of merging and relating existing ontologies to one another will further a greater “naturalness” within our organizations of the world.

A Spectrum of Formalisms

According to the Summit, expressiveness is the extent and ease by which an ontology can describe domain semantics. Structure they define as the degree of organization or hierarchical extent of the ontology. They further define granularity as the level of detail in the ontology. And, as the diagram above alludes, they define other dimensions of use, logical basis, purpose and so forth of an ontology.

The over fifty respondents from 42 communities submitted some 70 different ontologies under about 40 terms to a survey that was used by the Summit to construct their diagram. These submissions included:

. . . formal ontologies (e.g., BFO, DOLCE, SUMO), biomedical ontologies (e.g., Gene Ontology, SNOMED CT, UMLS, ICD), thesauri (e.g., MeSH, National Agricultural Library Thesaurus), folksonomies (e.g., Social bookmarking tags), general ontologies (WordNet, OpenCyc) and specific ontologies (e.g., Process Specification Language). The list also includes markup languages (e.g., NeuroML), representation formalisms (e.g., Entity-Relation model, OWL, WSDL-S) and various ISO standards (e.g., ISO 11179). This [Ontolog] sample clearly illustrates the diversity of artifacts collected under “ontology”.

I think the simplest spectrum for such distinctions is the formalism of the ontology and its approach (and language or syntax, not further discussed here). More formal ontologies have greater expressiveness and structure and inferential power, less formal ones the opposite. Constructing more formal ontologies is more demanding, and takes more effort and rigor, resulting in an approach that is more powerful but also more rigid and less flexible. Like anything else, there are always trade-offs.

Based on work by Leo Obrst of Mitre as interpreted by Dan McCreary, we can view this as a trade-off as one of semantic clarity v. the time and money required to construct the formalism [12, 13]:

Structure and Formalism Increases Semantic Expressiveness

[Click on image for full-size pop-up]

Note this diagram reflects the more conventional, practitioner’s view of the “formal” ontology, which does not include taxonomies or controlled vocabularies (for example) in the definition. This represents the more “closely defined” end of the ontology (semantic) spectrum.

However, since we are speaking here of ontologies and the structured Web or the semantic Web, I believe we need to embrace a concept of ontology aligned to actual practice. Not all content providers can or want to employ ontology engineers to enable formal inferencing of their content. Yet, on the other hand, their content in its various forms does have some meaningful structure, some organization. The trick is to extract this structure for more meaningful use such as data exchange or data merging.

Ontology Approaches on the Web

Under such “loosely defined” bases we can thus see a spectrum of ontology approaches on the Web, proceeding from less structure and formalism to more so:

Type or Schema	Examples	Comments on Structure and Formalism
Standard Web Page	entire Web	General metadata fields in the and internal HTML codes and tags provide minimal, but useful sources of structure; other HTTP and retrieval data can also contribute
Blog / Wiki Page	examples from Technorati, Bloglines, Wikipedia	Provides still greater formalism for the organization and characterization of content (subjects, categories, posts, comments, date/time stamps, etc.). Importantly, with the addition of plug-ins, some of the basic software may also provide other structured characterizations or output (SIOC, FOAF, etc.; highly varied and site-specific given the diversity of publishing systems and plug-ins)
RSS / Atom feeds	most blogs and most news feeds	RSS extends basic XML schema for more robust syndication of content with a tightly controlled vocabulary for feed concepts and their relationships. Because of its ubiquity, this is becoming a useful baseline of structure and formalism; also, the nature of adoption shows much about how ontological structure is an artifact, not driver, for use
RSS / Atom feeds with tags or OPML	Grazr, most newsfeed aggregators can import and export OPML lists of RSS feeds	The OPML specification defines an outline as a hierarchical, ordered list of arbitrary elements. The specification is fairly open which makes it suitable for many types of list data. See also OML and XOXO
Hierarchical Faceted Metadata	XFML, Flamenco	These and related efforts from the information architecture (IA) community are geared more to library science. However, they directly contribute to faceted browsing, which is one of the first practical instantiations of the semantic Web
Folksonomies	Flickr, del.icio.us	Based on user-generated tags and informal organizations of the same; not linked to any “standard” Web protocols. Both tags and hierarchical structure are arbitrary, but some researchers now believe over large enough participant sets that structural consensus and value does emerge
Microformats	Example formats include hAtom, hCalendar, hCard, hReview, hResume, rel-directory, xFolk, XFN and XOXO	A microformat is HTML mark up to express semantics with strictly controlled vocabularies. This markup is embedded using specific HTML attributes such as class, rel, and rev. This method is easy to implement and understand, but is not free-form
Embedded RDF	RDFa, eRDF	An embedded format, like microformats, but free-form, and not subject to the approval strictures associated with microformats
Topic Maps	Infoloom, Topic Maps Search Engine	A topic map can represent information using topics (representing any concept, from people, countries, and organizations to software modules, individual files, and events), associations (which represent the relationships between them), and occurrences (which represent relationships between topics and information resources relevant to them)
RDF	Many; DBpedia, etc.	RDF has become the canonical data model since it represents a “universal” conversion format
RDF Schema	SKOS, SIOC, DOAP, FOAF	RDFS or RDF Schema is an extensible knowledge representation language, providing basic elements for the description of ontologies, otherwise called RDF vocabularies, intended to structure RDF resources. This becomes the canonical ontology common meeting ground
OWL Lite	Here are some existing OWL ontologies; also see Swoogle for OWL search facilities	The Web Ontology Language (OWL) is a language for defining and instantiating Web ontologies. An OWL ontology may include descriptions of classes, along with their related properties and instances. OWL is designed for use by applications that need to process the content of information instead of just presenting information to humans. It facilitates greater machine interpretability of Web content than that supported by XML, RDF, and RDF Schema (RDF-S) by providing additional vocabulary along with a formal semantics. The three language versions are in order of increasing expressiveness
OWL DL
OWL Full
Higher-order “formal” and “upper-level” ontologies	SUMO, DOLCE, PROTON, BFO, Cyc, OpenCyc	These provide comprehensive ontologies and often related knowledge bases, with the goal of enabling AI applications to perform human-like reasoning. Their reasoning languages often use higher-order logics

As a rule of thumb, items that are less “formal” can be converted to a more formal expression, but the most formal forms can generally not be expressed in less formal forms.

As latter sections elaborate, I see RDF as the universal data model for representing this structure into a common, canonical format, with RDF Schema (specifically SKOS, but also supplemented by FOAF, DOAP and SIOC) as the organizing ontology knowledge representation language (KRL).

This is not to say that the various dialects of OWL should be neglected. In bounded environments, they can provide superior reasoning power and are warranted if they can be sufficiently mandated or enforced. But the RDF and RDF-S systems represent the most tractable “meeting place” or “middle ground,” IMHO.

Still-Another “Level” of Ontologies

As if the formalism dimension were not complicated enough, there is also the practice within the ontology community to characterize ontologies by “levels”, specifically upper, middle and lower levels. For example, chances are that you have heard particularly of “upper-level” ontologies.

The following figure helps illustrate this “level” dimension. This diagram is also from Leo Obrst of Mitre [12], and was also used in another 2006 talk by Jack Park and Patrick Durusau (discussed further below for other reasons):

Examples of upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc and BFO (Basic Formal Ontology). Most of the content in their upper-levels is akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget’s Thesaurus — that is, real ontos stuff) than to “generic common knowledge.” Most all of them have both a hierarchical and networked structure, though their actual subject structure relating to concrete things is generally pretty weak [2].

The above diagram conveys a sense of how multiple ontologies can relate to one another both in terms of narrower and broader topic matter and at the same “levels” of generalization. Such “meta-structure” (if you will) can provide a reference structure for relating multiple ontologies to one another.

The relationships and mappings amongst ontologies is a critical infrastructure component of the semantic Web.

It resides exactly in such bindings or relationships that we can foresee the promise of querying and relating multiple endpoints on the Web with accurate semantics in order to connect dots and combine knowledge bases. Thus, the understanding of the relationships and mappings amongst ontologies becomes a critical infrastructural component of the semantic Web.

The SUMO Example

We can better understand these mapping and inter-relationship concepts by using a concrete example with a formal ontology. We’ll choose to use the Suggested Upper Merged Ontology simply because it is one of the best known. We could have also selected another upper-level system such as PROTON [3] or Cyc [4] or one of the many with narrower concept or subject coverage.

SUMO is one of the formal ontologies that has been mapped to the WordNet lexicon, which adds to its semantic richness. SUMO is written in the SUO-KIF language. SUMO is free and owned by the IEEE. The ontologies that extend SUMO are available under GNU General Public License.

The abstract, conceptual organization of SUMO is shown by this diagram, which also points to its related MILO (MId-Level Ontology), which is being developed as a bridge between the abstract content of the SUMO and the richer detail of various domain ontologies:

At this level, the structure is quite abstract. But one can easily browse the SUMO structure. A nifty tool to do so is the KSMSA (Knowledge Support for Modeling and Simulation) ontology browser. Using a hierarchical tree representation, you can navigate through SUMO, MILO, WordNet, and (with the locally installed version) Wikipedia.

The figure below shows the upper-level entity concept on the left; the right-hand panel shows a drill-down into the example atom entity:

[Click on image for full-size pop-up]

These views may be a bit misleading because the actual underlying structure, while it has hierarchical aspects as shown here, really is in the form of a directed acyclic graph (showing other relatedness options, not just hierarchical ones). So, alternate visualizations include traditional network graphs.

The other thing to note is that the “things” covered in the ontology, the entities, are also fairly abstract. That is because the intention of a standard “upper-level” ontology is to cover all relevant knowledge aspects of each entity’s domain. This approach results in a subject and topic coverage that feels less “concrete” than the coverage in, say, an encyclopedia, directory or card catalog.

Ontology Binding and Integration Mechanisms

According to Park and Durusau, upper ontologies are diverse, middle ontologies are even more diverse, and lower ontologies are more diverse still. A key observation is that ontological diversity is a given and increases as we approach real user interaction levels. Moreover, because of the “loose” nature of ontologies on the Web (now and into the future), diversity of approach is a further key factor.

Recall the initial discussion on the role and objectives of ontologies. About half of those roles involve effectively accessing or querying more than one ontology. The objective of “upper-level” ontologies, many with their own binding layers, is also expressly geared to ontology integration or federation. So, what are the possible mechanisms for such binding or integration?

A fundamental distinction within mechanisms to combine ontologies is whether it is a unified or centralized approach (often imposed or required by some party) or whether it is a schema mapping or binding approach. We can term this distinction centralized v. federated.

Centralized Approaches

Centralized approaches can take a number of forms. At the most extreme, adherence to a centralized approach can be contractual. At the other end are reference models or standards. For example, illustrative reference models include:

the Data Reference Model (DRM), one of the five reference models of the Federal Enterprise Architecture (FEA)
UDEF (Unified Data Element Framework), an approach toward a unified description framework, or
the eXtended MetaData Registry (XMDR) project.

Though I have argued that One Ring to Rule them All is not appropriate to the general Web, there may be cases within certain enterprises or where through funding clout (such as government contracts), some form of centralized approach could be imposed [5]. And, frankly, even where compliance can not be assured, there are advantages in economy, efficiency and interoperability to attempt central ontologies. Certain industries — notably pharmaceuticals and petrochemicals — and certain disciplines — such as some areas of biology among others — have through trade associations or community consensus done admirable jobs in adopting centralized approaches.

Federated Approaches

However, combining ontologies in the context of the broader Internet is more likely through federated approaches. (Though federated approaches can also be improved when there are consensual standards within specific communities.) The key aspect of a federated approach is to acknowledge that multiple schema need to be brought together, and that each contributing data set and its schema will not be altered directly and will likely remain in place.

Thus, the key distinctions within this category are the mechanisms by which those linkages may take place An important goal in any federated approach is to achieve interoperability at the data or instance level without unacceptable loss of information or corruption of the semantics. Numerous specific approaches are possible, but three example areas in RDF-topic map interoperability, the use of “subject maps”, and binding layers can illustrate some of the issues at hand.

In 2006 the W3C set up a working group to look at the issue of RDF and topic maps interoperability. Topic maps have been embraced by the library and information architecture community for some time, and have standards that have been adopted under ISO. Somewhat later but also in parallel was the development of the RDF standard by W3C. The interesting thing was that the conceptual underpinnings and objectives between these two efforts were quite similar. Also, because of the substantive thrust of topic maps and the substantive needs of its community, quite a few topic maps had been developed and implemented.

One of the first efforts of the W3C work group was to evaluate and compare five or six extant proposals for how to relate RDF and topic maps [6]. That report is very interesting reading for any one desirous of learning more about specific issues in combining ontologies and their interoperability. The result of that evaluation then led to some guidelines for best practices in how to complete this mapping [7]. Evaluations such as these provide confidence that interoperability can be achieved between relatively formal schema definitions without unacceptable loss in meaning.

A different, “looser” approach, but one which also grew out of the topic map community, is the idea of “subject maps.” This effort, backed by Park and Durusau noted above, but also with the support of other topic map experts such as Steve Newcomb and Robert Barta via their proposed Topic Maps Reference Model (ISO 13250-5), seems to be one of the best attempts I’ve seen that both respects the reality of the actual Web while proposing a workable, effective scheme for federation.

The basic idea of a subject map is built around a set of subject “proxies.” A subject proxy is a computer representation of a subject that can be implemented as an object, must have an identity, and must be addressable (this point provides the URI connector to RDF). Each contributing schema thus defines its own subjects, with the mappings becoming meta-objects. These, in turn, would benefit from having some accepted subject reference schema (not specifically addressed by the proponents) to reduce the breadth of the ultimate mapped proxy “space.”

I don’t have the expertise to judge further the specifics, but I find the presentation and papers by Park and Durusau, Avoiding Hobson’s Choice In Choosing An Ontology and Towards Subject-centric Merging of Ontologies to be worthwhile reading in any case. I highly recommend these papers for further background and clarity.

As the third example, “binding layers” are a comparatively newer concept. Leading upper-level ontologies such as SUMO or PROTON propose their own binding protocols to their “lower” domains, but that approach takes place within the construct of the parent upper ontology and language. Such designs are not yet generalized solutions. By far the most promising generalized binding solution is the SKOS (Simple Knowledge Organization System). Because of its importance, the next section is devoted to it.

Finally, with respect to federated approaches, there are quite a few software tools that have been developed to aid or promote some of these specific methods. For, example, about twenty of the software applications in my Sweet Tools listing of 500+ semantic Web and -related tools could be interpreted as aiding ontology mapping or creation. You may want to check out some of these specific tools depending on your preferred approach [8].

The Role of SKOS – the Simple Knowledge Organization System

SKOS, or the Simple Knowledge Organization System, is a formal language and schema designed to represent such structured information domains as thesauri, classification schemes, taxonomies, subject-heading systems, controlled vocabularies, or others; in short, most all of the “loosely defined” ontology approaches discussed herein. It is a W3C initiative more fully defined in its SKOS Core Guide [9].

SKOS is built upon the RDF data model of the subject-predicate-object “triple.” The subjects and objects are akin to nouns, the predicate a verb, in a simple Dick-sees-Jane sentence. Subjects and predicates by convention are related to a URI that provides the definitive reference to the item. Objects may be either a URI resource or a literal (in which case it might be some indexed text, an actual image, number to be used in a calculation, etc.).

Being an RDF Schema simply means that SKOS adds some language and defined relationships to this RDF baseline. This is a bit of recursive understanding, since RDFS is itself defined in RDF by virtue of adding some controlled vocabulary and relations. The power, though, is that these schema additions are also easily expressed and referenced.

This RDFS combination can thus be shown as a standard RDF triple graph, but with the addition of the extended vocabulary and relations:

The power of the approach arises from the ability of the triple to express virtually any concept, further extended via the RDFS language defined for SKOS. SKOS includes concepts such as “broader” and “narrower”, which enable hierarchical relations to be modeled, as well as “related” and “member” to support networks and arrays, respectively [9].

We can visualize this transforming power by looking at how an “ontology” in a totally foreign scheme can be related to the canonical SKOS scheme. In the figure below the left-hand portion shows the native hierarchical taxonomy structure of the UK Archival Thesaurus (UKAT), next as converted to SKOS on the right (with the overlap of categories shown in dark purple). Note the hierarchical relationships visualize better via a taxonomy, but that the RDF graph model used by SKOS allows a richer set of additional relationships including related and alternative names:

Example Structural Comparison of Hierarchical Taxonomy with Network Graph

[Click on image for full-size pop-up]

SKOS also has a rich set of annotation and labeling properties to enhance human readability of schema developed in it [9]. There is also a useful draft schema that the W3C’s SWEO (Semantic Web Education and Outreach) group is developing to organize semantic Web-related information [10].

Combined, these constructs provide powerful mechanisms for giving contributory ontologies a common conceptualization. When added to other sibling RDF schema such as FOAF or SIOC or DOAP, still additional concepts can be collated.

Conclusions

While not addressed directly in this piece, it is obviously of first importance to have content with structure before the questions of connecting that information can even arise. Then, that structure must also be available in a form suitable for merging or connection.

At that point, the subjects of this posting come into play.

We are stubbing our toes on the rocks while we gaze at the heavens.

We see that the daily Web has a diversity of schema or ontologies “loosely defined” for representing the structure of the content. These representations can be transferred to more complex schema, but not in the opposite direction. Moreover, the semantic basis for how to make these mappings also needs some common referents.

RDF provides the canonical data model for the data transfers and representations. RDFS, especially in the form of SKOS, appears to form one basis for the syntax and language for these transformations. And SKOS, with other schema, also appears to offer much of the appropriate “middle ground” for data relationships mapping.

However, lacking in this story is a referential structure for subject relationships [11]. (Also lacking are the ultimately critical domain specifics required for actual implementation.)

Abstract concepts of interest to philosophers and deep thinkers have been given much attention. Sadly, to date, concrete subject structures in which tangible things and tangible actions can be shared, is still very, very weak. We are stubbing our toes on the rocks while we gaze at the heavens.

Yet, despite this, simple and powerful infrastructures are well in-hand to address all foreseeable syntactic and semantic issues. There appear to be no substantive limits to needed next steps.

Lastly, many valuable resources for further reading and learning may be found within the Ontolog Community, W3C, TagCommons and Topics Maps groups. Enjoy! And be wary of ontology no longer.

[1] Deborah L. McGuinness. “Ontologies Come of Age”. In Dieter Fensel, Jim Hendler, Henry Lieberman, and Wolfgang Wahlster, editors. Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential. MIT Press, 2003. See http://www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-mit-press-(with-citation).htm

[2] I think it would be much clearer to refer to “upper level” ontologies as abstract or conceptual, “mid levels” as mapping or binding, and “lower levels” as domain (without any hierarchical distinctions such as lower or lowest or sub-domain), but current practice is probably too entrenched to change now.

[3] There are many aspects that make PROTON one of the more attractive reference ontologies. The PROTON ontology (PROTo ONtology), developed within the scope of the SEKT project, is attractive because of its understandability, relatively small size, modular architecture and a simple subsumption hierarchy. It is available in an OWL Lite form and is easy to adopt and extend. On the face of it, the Topic class within PROTON, which is meant to serve as a bridge between different ontologies, may also provide a binding layer to specific subject topics as sub-classes or class instances.

[4] See my earlier post on Cyc.

[5] Even with such clout, it is questionable to get rather complete adherence, as Ada showed within the Federal government. However, where circumstances allow it, central schema and ontologies may be worth pursuing because of improved interoperability and lower costs, even where some portions do not adhere and are more chaotic like the standard Web.

[6] See, A Survey of RDF/Topic Maps Interoperability Proposals, W3C Working Group Note 10 February 2006, Pepper, Vitali, Garshol, Gessa, Presutti (eds.)

[7] See, Guidelines for RDF/Topic Maps Interoperability, W3C Editor’s Draft 30 June 2006, Pepper, Presutti, Garshol, Vitali (eds.)

[8] Here are some Sweet Tools that may have a usefulness to ontology federation and creation:

Adaptiva — is a user-centered ontology building environment, based on using multiple strategies to construct an ontology, minimising user input by using adaptive information extraction
Altova SemanticWorks — is a visual RDF and OWL editor that auto-generates RDF/XML or nTriples based on visual ontology design
CMS — the CROSI Mapping System is a structure matching system that capitalizes on the rich semantics of the OWL constructs found in source ontologies and on its modular architecture that allows the system to consult external linguistic resources
ConcepTool — is a system to model, analyze, verify, validate, share, combine, and reuse domain knowledge bases and ontologies, reasoning about their implication
ConRef — is a service discovery system which uses ontology mapping techniques to support different user vocabularies
FOAM — is the Framework for Ontology Alignment and Mapping. It is based on heuristics (similarity) of the individual entities (concepts, relations, and instances)
hMAFRA (Harmonize Mapping Framework) — is a set of tools supporting semantic mapping definition and data reconciliation between ontologies. The targeted formats are XSD, RDFS and KAON
IF-Map — is an Information Flow based ontology mapping method. It is based on the theoretical grounds of logic of distributed systems and provides an automated streamlined process for generating mappings between ontologies of the same domain
IODT — is IBM’s toolkit for ontology-driven development. The toolkit includes EMF Ontology Definition Metamodel (EODM), EODM workbench, and an OWL Ontology Repository (named Minerva)
KAON — is an open-source ontology management infrastructure targeted for business applications. It includes a comprehensive tool suite allowing easy ontology creation and management and provides a framework for building ontology-based applications. An important focus of KAON is scalable and efficient reasoning with ontologies
LinKFactory — is Language & Computing’s ontology management tool. It provides an effective and user-friendly way to create, maintain and extend extensive multilingual terminology systems and ontologies (English, Spanish, French, etc.). It is designed to build, manage and maintain large, complex, language independent ontologies
M3t4.Studio Semantic Toolkit — is Metatomix’s free set of Eclipse plug-ins to allow developers to create and manage OWL ontologies and RDF documents
MAFRA Toolkit — the Ontology MApping FRAmework Toolkit allows to create semantic relations between two (source and target) ontologies, and apply such relations in translating source ontology instances into target ontology instances
OntoEngine — is a step toward allowing agents to communicate even though they use different formal languages (i.e., different ontologies). It translates data from a “source” ontology to a “target.”
OntoPortal — enables the authoring and navigation of large semantically-powered portals
OWLS-MX — the hybrid semantic Web service matchmaker OWLS-MX 1.0 utilizes both description logic reasoning, and token based IR similarity measures. It applies different filters to retrieve OWL-S services that are most relevant to a given query
pOWL — is a semantic Web development platform for ontologies in PHP. pOWL consists of a number of components, including RAP
Protege — is an open source visual ontology editor written in Java with many plug-in tools
Semantic Net Generator — is a utility for generating topic maps automatically from different data sources by using rules definitions specified with Jelly XML syntax. This Java library provides Jelly tags to access and modify data sources (also RDF) to create a semantic network
SOFA — is a Java API for modeling ontologies and Knowledge Bases in ontology and Semantic Web applications. It provides a simple, abstract and language neutral ontology object model, inferencing mechanism and representation of the model with OWL, DAML+OIL and RDFS languages
Terminator — is a tool for creating term to ontology resource mappings (documentation in Finnish)
WebOnto — supports the browsing, creation and editing of ontologies through coarse grained and fine grained visualizations and direct manipulation.

[9] The SKOS language has the following classes:

CollectableProperty — A property which can be used with a skos:Collection
Collection — A meaningful collection of concepts
Concept — An abstract idea or notion; a unit of thought
ConceptScheme — A set of concepts, optionally including statements about semantic relationships between those concepts. Thesauri, classification schemes, subject heading lists, taxonomies, ‘folksonomies’, and other types of controlled vocabulary are all examples of concept schemes. Concept schemes are also embedded in glossaries and terminologies.
OrderedCollection — An ordered collection of concepts, where both the grouping and the ordering are meaningful

. . . and the following properties:

altLabel — An alternative lexical label for a resource. Acronyms, abbreviations, spelling variants, and irregular plural/singular forms may be included among the alternative labels for a concept
altSymbol — An alternative symbolic label for a resource
broader — A concept that is more general in meaning. Broader concepts are typically rendered as parents in a concept hierarchy (tree)
changeNote — A note about a modification to a concept
definition — A statement or formal explanation of the meaning of a concept
editorialNote — A note for an editor, translator or maintainer of the vocabulary
example — An example of the use of a concept
hasTopConcept — A top level concept in the concept scheme
hiddenLabel — A lexical label for a resource that should be hidden when generating visual displays of the resource, but should still be accessible to free text search operations
historyNote — A note about the past state/use/meaning of a concept
inScheme — A concept scheme in which the concept is included. A concept may be a member of more than one concept scheme
isPrimarySubjectOf — A resource for which the concept is the primary subject
isSubjectOf –A resource for which the concept is a subject
member — A member of a collection
memberList — An RDF list containing the members of an ordered collection
narrower — A concept that is more specific in meaning. Narrower concepts are typically rendered as children in a concept hierarchy (tree)
note — A general note, for any purpose. The other human-readable properties of definition, scopeNote, example, historyNote, editorialNote and changeNote are all sub-properties of note
prefLabel — The preferred lexical label for a resource, in a given language. No two concepts in the same concept scheme may have the same value for skos:prefLabel in a given language
prefSymbol — The preferred symbolic label for a resource
primarySubject — A concept that is the primary subject of the resource. A resource may have only one primary subject per concept scheme
related — A concept with which there is an associative semantic relationship
scopeNote — A note that helps to clarify the meaning of a concept
semanticRelation — A concept related by meaning. This property should not be used directly, but as a super-property for all properties denoting a relationship of meaning between concepts
subject — A concept that is a subject of the resource
subjectIndicator — A subject indicator for a concept. [The notion of ‘subject indicator’ is defined here with reference to the latest definition endorsed by the OASIS Published Subjects Technical Committee]
symbol — An image that is a symbolic label for the resource. This property is roughly analagous to rdfs:label, but for labelling resources with images that have retrievable representations, rather than RDF literals. Symbolic labelling means labelling a concept with an image.

[10] The SWEO classification ontology is still under active development and has these draft classes. Note, however, the relative lack of actual subject or topic matter:

Classes are currently defined as:

article – magazine article
blog – blog discussing SW topics
book – indicates a textbook, applies to the book’s home page, review or listing in Amazon or such
casestudy – Article on a business case
conference/event – conferences or events where you can learn about the Semantic Web
demo/demonstration – interactive SW demo
forum – a forum on semantic web or related topics
presentation – Powerpoint or similar slide show
person – If this is a person’s home page or blog, see below
publication – a scientific publication
ontology – a formalisation of a shared conceptualization using OWL, RDFS, SKOS or something else based on RDF
organization – If the page is the home page of an organization, research, vendor etc, see below
portal – a portal website Semantic Web or related topics, usually hosting information items, mailinglists, community tools
project – a research (for example EU-IST) or other project that addresses Semantic Web issues
mailinglist – a mailinglist on semantic Web or related topics
person – ideally a person that is well known regarding the Semantic Web (people who can do keynote speakers), may also be any related person
press – a press release by a company or an article about Semantic Web
recommended – If the resource is seen to be in the top 10 of its kind
specification – a Semantic Web specification (RDF, RDF/S, OWL, etc)
categories – (perhaps using tags or other free form annotation
successstory – Article that can contain advertisment and clearly shows the benefit of semantic web
tutorial – a tutorial teaching some aspect of semantic web, an example
vocabulary – a RDF vocabulary
software project/tool – For product/project home pages

If the page describes an organization, it can be tagged as:

vendor
research
enduser

If the page is a person’s home page or blog or similar, it could be:

opinionleader
researcher
journalist
executive
geek

The type of audience can also be tagged, for example:

general public
beginners
technicians
researchers.

[11] The OASIS Topic Maps Published Subjects Technical Committee was formed a number of years back to promote Topic Maps interoperability through the use of Published Subjects Indicators (PSIs). Their resulting report was a very interesting effort that unfortunately did not lead to wide adoption, perhaps because the effort was a bit ahead of its time or it was in advance of the broader acceptance of RDF. This general topic is the subject of a later posting by me.

[12] See further, Leo Obrst, “The Semantic Spectrum & Semantic Models,” a Powerpoint presentation (http://ontolog.cim3.net/file/resource/presentation/LeoObrst_20060112/OntologySpectrumSemanticModels–LeoObrst_20060112.ppt)
made as part of an Ontolog Forum (http://ontolog.cim3.net/) presentation in two parts, “What is an Ontology? – A Briefing on the Range of Semantic Models” (see http://ontolog.cim3.net/cgi-bin/wiki.pl?ConferenceCall_2006_01_12), in January 2006. Leo Obrst is a principal artificial intelligence scientist at MITRE’s (http://www.mitre.org) Center for Innovative Computing and Informatics and a co-convener of the Ontolog Forum. His presentation is a rich source of practical overview information on ontologies.

[13] The actual diagram is an unattributed modification by Dan McCreary (see http://www.danmccreary.com/presentations/sem_int/sem_int.ppt) based on Obrst’s material in [12].

Posted:November 22, 2010

The Nature of Connectedness on the Web

The Reality is: Most Connections are Proximate

What does it mean to interoperate information on the Web? With linked data and other structured data now in abundance, why don’t we see more information effectively combined? Why express your information as linked data if no one is going to use it?

Interoperability comes down to the nature of things and how we describe those things or quite similar things from different sources. This was the major thrust of my recent keynote presentation to the Dublin Core annual conference. In that talk I described two aspects of the semantic “gap”:

One aspect is the need for vetted reference sources that provide the entities and concepts for aligning disparate content sources on the Web, and
A second aspect is the need for accurate mapping predicates that can represent the often approximate matches and overlaps of this heterogeneous content.

I’ll discuss the first “gap” in a later post. What we’ll discuss here is the fact that most relationships between putatively same things on the Web are rarely exact, and are most often approximate in nature.

“It Ain’t the Label, Mabel”

The use of labels for matching or descriptive purposes was the accepted practice in early libraries and library science. However, with the move to electronic records and machine bases for matching, appreciation for ambiguities and semantics have come to the fore. Labels are no longer an adequate — let alone a sufficient — basis for matching references.

The ambiguity point is pretty straightforward. Refer to Jimmy Johnson by his name, and you might be referring to a former football coach, a NASCAR driver, a former boxing champ, a blues guitarist, or perhaps even a plumber in your home town. Or perhaps none of these individuals. Clearly, the label “Jimmy Johnson” is insufficient to establish identity.

Of course, not all things are named entities such as a person’s name. Some are general things or concepts. But, here, semantic heterogeneities can also lead to confusion and mismatches. It is always helpful to revisit the sources and classification of semantic heterogeneities, which I first discussed at length nearly five years ago. Here is a schema classifying more than 40 categories of potential semantic mismatches [1]:

Class	Category	Subcategory
STRUCTURAL	Naming	Case Sensitivity
		Synonyms
		Acronyms
		Homonyms
	Generalization / Specialization
	Aggregation	Intra-aggregation
		Inter-aggregation
	Internal Path Discrepancy
	Missing Item	Content Discrepancy
		Attribute List Discrepancy
		Missing Attribute
		Missing Content
	Element Ordering
	Constraint Mismatch
	Type Mismatch
DOMAIN	Schematic Discrepancy	Element-value to Element-label Mapping
		Attribute-value to Element-label Mapping
		Element-value to Attribute-label Mapping
		Attribute-value to Attribute-label Mapping
	Scale or Units
	Precision
	Data Representation	Primitive Data Type
		Data Format
DATA	Naming	Case Sensitivity
		Synonyms
		Acronyms
		Homonyms
	ID Mismatch or Missing ID
	Missing Data
	Incorrect Spelling
LANGUAGE	Encoding	Ingest Encoding Mismatch
		Ingest Encoding Lacking
		Query Encoding Mismatch
		Query Encoding Lacking
	Languages	Script Mismatches
		Parsing / Morphological Analysis Errors (many)
		Syntactical Errors (many)
		Semantic Errors (many)

Even with the same label, two items in different information sources can refer generally to the same thing, but may not be the same thing or may define it with a different scope and content. In broad terms, these mismatches can be due to structure, domain, data or language, with many nuances within each type.

The sameAs approach used by many of the inter-dataset linkages in linked data ignores these heterogeneities. In a machine and reasoning sense, indeed even in a linking sense, these assertions can make as little or nonsensical sense as talking about the plumber with the facts about the blues guitarist.

Cats, Paul Newman and Great Britain

Let’s take three examples where putatively we are talking about the same thing and linking disparate sources on the Web.

The first example is the seemingly simple idea of “cats”. In one source, the focus might be on house cats, in another domestic cats, and in a third, cats as pets. Are these ideas the same thing? Now, let’s bring in some taxonomic information about the cat family, the Felidae. Now, the idea of “cats” includes lynx, tigers, lions, cougars and many other kinds of cats, domestic and wild (and, also extinct!). Clearly, the “cat” label used alone fails us miserably here.

Another example is one that Fred Giasson and I brought up one year ago in When Linked Data Rules Fail [2]. That piece discussed many poor practices within linked data, and used as one case the treatment of articles in the New York Times about the (deceased) actor Paul Newman. The NYT dataset is about various articles written about people historically in the newspaper. Their record about Paul Newman was about their pool of articles with attributes such as first published and so forth, with no direct attribute information about Paul Newman the person. Then, they asserted a sameAs relationship with external records in Freebase and DBpedia, which acts to commingle person attributes like birth, death and marriage with article attributes such as first and last published. Clearly, the NYT has confused the topic ( Paul Newman) of a record with the nature of that record (articles about topics). This misunderstanding of the “thing” at hand makes the entailed assertions from the multiple sources illogical and useless [3].

Our third example is the concept or idea or named entity of Great Britain. Depending on usage and context, Great Britain can refer to quite different scopes and things. In one sense, Great Britain is an island. In a political sense, Great Britain can comprise the territory of England, Scotland and Wales. But, even more precise understandings of that political grouping may include a number of outlying islands such as the Isle of Wight, Anglesey, the Isles of Scilly, the Hebrides, and the island groups of Orkney and Shetland. Sometimes the Isle of Man and the Channel Islands, which are not part of the United Kingdom, are fallaciously included in that political grouping. And, then, in a sporting context, Great Britain may also include Northern Ireland. Clearly, these, plus other confusions, can mean quite different things when referring to “Great Britain.” So, without definition, a seemingly simple question such as what the population of Great Britain is could legitimately return quite disparate values (not to mention the time dimension and how that has changed boundaries as well!).

These cases are quite usual for what “things” mean when provided from different sources with different perspectives and with different contexts. If we are to get meaningful interoperation or linkage of these things, we clearly need some different linking predicates.

Some Attempts at ‘Approximateness’

The realization that many connections across datasets on the Web need to be “approximate” is growing. Here is the result of an informal survey for leading predicates in this regard [4]:

skos:broadMatch
skos:related
ore:similarTo
dul:associatedWith
umbel:isAbout
skos:narrowMatch
vmf:isInVocabulary
skos:closeMatch
owl:equivalentClass
skos:mappingRelation
ov:similarTo
umbel:hasMapping
doape:similarThing
lvont:nearlySameAs
umbel:isRelatedTo

umbel:isLike
skos:exactMatch
sswap:hasMapping
umbel:hasCharacteristic
lvont:somewhatSameAs
dul:isAbout
skos:semanticRelation
rdfs:seeAlso
ore:describes
skos:narrowerTransitive
map:narrowerThan
dul:isConceptualizedBy
skos:narrower
umbel:isCharacteristicOf
prowl:defineUncertaintyOf

dc:subject
sumo:entails
link:uri
foaf:isPrimaryTopicOf
skos:broaderTransitive
dul:isComponentOf
foaf:focus
skos:relatedMatch
map:broaderThan
owl:sameAs
skos:broader
dul:isAssignedTo
wn:similarTo
sumo:refers
rdfs:subClassOf

Besides the standard OWL and RDFS predicates, SKOS, UMBEL and DOLCE [5] provide the largest number of choices above. In combination, these predicates probably provide a good scoping of “approximateness” in mappings.

Rationality and Reasoners

It is time for some leadership to emerge to provide a more canonical set of linking predicates for these real-world connection requirements. It would also be extremely useful to have such a canonical set adopted by some leading reasoners such that useful work could be done against these properties.

[1] See M. K. Bergman, 2006. “Sources and Classification of Semantic Heterogeneities,” AI3:::Adaptive Information blog, June 6, 2006. See https://www.mkbergman.com/232/sources-and-classification-of-semantic-heterogeneities/.

[2] See M. K. Bergman and F. Giasson, 2009. “When Linked Data Rules Fail,” AI3:::Adaptive Information blog, November 16, 2009. See https://www.mkbergman.com/846/when-linked-data-rules-fail/.

[3] On a different disappointing note, the critical errors that we noted a year ago and the NYT’s own acknowledgement on its site that:

“An RDFS description and English language documentation for the NYT namespace will be provided soon. Thanks for your patience.”

has still not been corrected, now a year later. Poor performance like this by a professional publisher gives linked data a bad name.

[4] These predicates have been obtained from personal knowledge and directed searches using the Falcons ontology search service. Simple Web searches on the namespace plus predicate name will provide more detail on any given predicate.

[5] UMBEL (Upper Mapping and Binding Exchange Layer) is an ontology of about 20,000 subject concepts that acts as a reference structure for inter-relating disparate datasets. It is also a general vocabulary of classes and predicates designed for the creation of domain-specific ontologies. For SKOS, see Alistair Miles and Sean Bechhofer, eds., 2009. SKOS Simple Knowledge Organization System Reference, W3C Recommendation, 18 August 2009; http://www.w3.org/TR/skos-reference/. The Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE) is one of the more popular upper ontologies.

Main Links

Search

Refining UMBEL’s Linking and Mapping Predicates with Wikipedia

A Comparison of Mapping Predicates

Individual Predicates Discussion

Equivalent Properties

owl:equivalentClass

owl:sameAs

Membership and Hierarchical Properties

rdfs:subClassOf

skos:narrowerTransitive/skos:broaderTransitive

rdf:type

Approximation Properties

umbel:correspondsTo

umbel:isAbout

umbel:isLike

umbel:relatesToXXX

Descriptive Properties

umbel:isCharacteristicOf

Qualifying the Mappings

Status of Mappings

Major Progress Towards a Gold Standard

Reasons for and Implications from Innovation Moving to Consumers

A Brief Look at Sixty Years of Enterprise IT

The Current Landscape

Seven Reasons for a Shift in Innovation

#1: The Internet

#2: Consumer Innovations

#3: Open Source

#4: Slow Development Cycles in Enterprise

#5: What’s Hot: Developers

#6: What’s Hot: Startups

#7: Declining Software Rents and Consolidation

Positive Implications from the Decline

Negative Implications from the Decline

What Lessons Might be Drawn?

And, Seven Guidelines for this Second of Two Semantic ‘Gaps’

The Web Increases the Need for Organization

The Idea and Role of Reference Concepts

Seven Guidelines for a Reference Concept

Guideline #1: Persistent URI

Guideline #2: Preferred Label

Guideline #3: Definition

Guideline #4: Tagset

Guideline #5: Language Independent

Guideline #6: Range and Domain

Guideline #7: Part of Coherent Structure

The State of Reference Concepts

There’s an Endless Variety of World Views, and Almost as Many Ways to Organize and Describe Them

Overview and Role of Ontologies

Diversity, ‘Naturalness’ and Change

A Spectrum of Formalisms

Ontology Approaches on the Web

Still-Another “Level” of Ontologies

The SUMO Example

Ontology Binding and Integration Mechanisms

Centralized Approaches

Federated Approaches

The Role of SKOS – the Simple Knowledge Organization System

Conclusions

The Reality is: Most Connections are Proximate

“It Ain’t the Label, Mabel”

Cats, Paul Newman and Great Britain

Some Attempts at ‘Approximateness’

Rationality and Reasoners