Posted: February 21, 2011

Hitting the Sweet Spot: Reference Structures Provide a Third Way

Since the first days of the Web there has been an ideal that its content could extend beyond documents and become a global, interoperating storehouse of data. This ideal has become what is known as the “semantic Web”. And within this ideal there has been a tension between two competing world views of how to achieve this vision. At the risk of being simplistic, we can describe these world views as informal v formal, sometimes expressed as “bottom up” v “top down” [1,2].

The informal view emphasizes freeform approaches and diversity, using more open tagging and a bottom-up approach to structuring data [3]. This group is not anarchic, but it does support the idea of open data, open standards and open contributions. This group tends to be oriented to RDF and is (paradoxically) often not very open to non-RDF structured data forms (such as microdata or microformats). Social networks and linked data are quite central to this group. RDFa, tagging, user-generated content and folksonomies are also key emphases and contributions.

The formal view tends to support more strongly the idea of shared vocabularies with more formalized semantics and design. This group uses and contributes to open standards, but is also open to proprietary data and structures. Enterprises and industry groups with standard controlled vocabularies and interchange languages (often XML-based) more typically reside in this group. OWL and rules languages are typically the basis for this group’s formalisms. The formal view also tends to split further into two camps: one that is more top down and engineering oriented, with typically a more closed world approach to schema and ontology development [4]; and a second that is more adaptive and incremental and relies on an open world approach [5].

Again, at the risk of being simplistic, the informal group tends to view many OWL and structured vocabularies, especially those that are large or complex, as over-engineered, constraining or limiting of freedom. This group often correctly points to the delays and lack of adoption associated with more formal efforts. The informal group rarely speaks of ontologies, preferring the term vocabularies. In contrast, the formal group tends to view bottom-up efforts as chaotic, poorly structured and too heterogeneous to allow machine reasoning or interoperability. Some in the formal group advocate certification or prescribed training programs for ontologists.

Readers of this blog and customers of Structured Dynamics know that we more often adopt the formal world view, and more specifically an open world perspective. But, like human tribes or different cultures, there is no one true or correct way. Peaceful coexistence resides in understanding the importance and strength of different world views.

Shared communication is the way in which we, as humans, learn to understand and bridge cultural and tribal differences. These very same bases can be used to bridge the differences of world views for the semantic Web. Shared concepts and a way to communicate them (via a common language) — what I call reference structures [6] — are one potential “sweet spot” for bridging these views of the semantic Web [7].

Referring to Referents as Reference

According to Merriam Webster and Wikipedia, a reference is the intentional use of one thing, a point of reference or reference state, to indicate something else. When reference is intended, what the reference points to is called the referent. References are indicated by sounds (like onomatopoeia), pictures (like roadsigns), text (like bibliographies), indexes (by number) and objects (a wedding ring), but many other methods can be used intentionally as references. In language and libraries, references may include dictionaries, thesauri and encyclopedias. In computer science, references may include pointers, addresses or linked lists. In semantics, reference is generally construed as the relationships between nouns or pronouns and objects that are named by them.

The Building Blocks of Language

Structures, or syntax, enable multiple referents to be combined into more complex and meaningful (interpretable) systems. Vocabularies refer to the set of tokens or words available to act as referents in these structures. Controlled vocabularies attempt to limit and precisely define these tokens as a means of reducing ambiguity and error. Larger vocabularies increase richness and nuance of meaning for the tokens. Combined, syntax, grammar and vocabularies are the building blocks for constructing understandable human languages.

Many researchers believe that language is an inherent human capability, one especially evident in children. Language acquisition is expressly understood to be the combined acquisition of syntax, vocabulary and phonetics (for spoken language). Language development occurs via use and repetition, in a social setting where errors are corrected and communication is a constant. Via communication and interaction we learn and discover nuance and differences, and acquire more complex understandings of syntax structures and vocabulary. The contact sport of communication is itself a prime source for acquiring the ability to communicate. Without the structure (syntax) and vocabulary acquired through this process, our language utterances are mere babblings.

Pidgin languages emerge when two parties try to communicate, but do not share the same language. Pidgin languages result in much simplified vocabularies and structure, which lead to frequent miscommunication. Small vocabularies and limited structure share many of these same limitations.

Communicating in an Evanescent Environment

Information theory, going back to Shannon, holds that the “fundamental problem of communication is that of reproducing at one point, either exactly or approximately, a message selected at another point” [8]. This assertion applies to all forms of communication, from the electronic to animal and human language and speech.

Every living language is undergoing constant growth and change. Current events and culture are one driver of new vocabulary and constructs. We all know the apocryphal observation that northern peoples have many more words for snow, for example. Jargon emerges because specific activities, professions, groups, or events (including technical change) often have their own ideas to communicate. Slang is local or cultural usage that provides context and communication, often outside of “formal” or accepted vocabularies. These sources of environmental and other changes cause living languages to be constantly changing in terms of vocabulary and (also, sometimes) structure.

Natural languages become rich in meaning and names for entities to describe and discern things, from plants to people. When richness is embedded in structure, contexts can emerge that greatly aid removing ambiguity (“disambiguating”). Contexts enable us to discern polysemous concepts (such as bank for river, money institution or pool shot) or similarly named entities (such as whether Jimmy Johnson is a race car driver, football coach, or a local plumber). As with vocabulary growth, contexts sometimes change in meaning and interpretation over time. It is likely the Gay ’90s would not be used again to describe a cultural decade (1890s) in American history.

All this affirms what all of us know about human languages:  they are dynamic and changing. Adaptable (living) languages require an openness to changing vocabulary and changing structure. The most dynamic languages also tend to be the most open to the coining of new terminology; English, for example, is estimated to have 25,000 new words coined each year [9].

The Semantic Web as a Human Language

One could argue that similar constructs must be present within the semantic Web to enable either machine or human understanding. At first blush this may sound a bit surprising:  Isn’t one premise of the semantic Web machine-to-machine communications with “artificial intelligence” acting on our behalf in the background? Well, hmmm, OK, let’s probe that thought.

Recall there are different visions about what constitutes the semantic Web. In the most machine-oriented version, the machines are posited to replace some of what we already do and anticipate what we already want. Like Watson on Jeopardy, machines still need to know that Toronto is not an American city [10]. So, even with its most extreme interpretation — and one that is more extreme than my own view of the near-term semantic Web — machine-based communication still has these imperatives:

  • Humans, too, interact with data and need to understand it
  • Much of the data to be understood and managed is based on human text (unstructured), and needs to be adequately captured and represented
  • There is no basis to think that machine languages can be any simpler in representing the world than human languages.

These points suggest that machine languages, even in the most extreme machine-to-machine sense, still need to have a considerable capability akin to human languages.  Of course, computer programming languages and data exchange languages as artificial languages need not read like a novel. In fact, most artificial languages have more constraints and structure limitations than human languages. They need to be read by machines with fixed instruction sets (that is, they tend to have fewer exceptions and heuristics).

But, even with software or data, people write and interact with these languages, and human readability is a key desirable aspect for modern artificial languages [11]. Further, there are some parts of software or data that also get expressed as labels in user interfaces or for other human factors. The admonition to Web page developers to “view source” is a frequent one. Any communication that is text based — as are all HTTP communications on the Web, including the semantic Web — has this readability component.

Though the form (structure) and vocabulary (tokens) of languages geared to machine use and understanding most certainly differ from that used by humans, that does not mean that the imperatives for reference and structure are excused. It seems evident that small vocabularies, differing vocabularies and small and incompatible structures have the same limiting effect on communications within the semantic Web as they do for human languages.

Yet, that being said, correcting today’s relative absence of reference and structure on the nascent semantic Web should not then mean an overreaction to a solution based on a single global structure. This is a false choice and a false dichotomy, belied by the continued diversity of human languages [12]. In fact, the best analog for an effective semantic Web might be human languages with their vocabularies, references and structures. Here is where we may find the clues for how we might improve the communications (interoperability) of the semantic Web.

A Call for Vehement Moderation

Freeform tagging and informal approaches are quick and adaptive. But, they lack context, coherence and a basis for interoperability. Highly engineered ontologies capture nuance and sophistication. But, they are difficult and expensive to create, lack adoption and can prove brittle. Neither of these polar opposites is “correct” and each has its uses and importance. Strident advocacy of either extreme alone is shortsighted and unsuited to today’s realities. There is not an ineluctable choice between freedom and formalism.

An inherently open and changing world with massive growth of information volumes demands a third way. Reference structures and vocabularies sufficient to guide (but not constrain) coherent communications are needed. Structure and vocabulary in an open and adaptable language can provide the communication medium. Depending on task, this language can be informal (RDF or data struct forms convertible to RDF) or formal (OWL). The connecting glue is provided by the reference vocabularies and structures that bound that adaptable language. This is the missing “sweet spot” for the semantic Web.
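To make the contrast concrete, here is a minimal N3 sketch (with purely hypothetical URIs, not drawn from any actual vocabulary) of how the same subject might be described informally with a simple tag and formally with a class assertion, with a shared reference concept acting as the common node between the two:

   @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
   @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
   @prefix dc:   <http://purl.org/dc/terms/> .
   @prefix ex:   <http://example.org/> .
   @prefix ref:  <http://example.org/reference/> .

   # Informal: a blog post loosely tagged with a free-text subject
   ex:blogPost42  dc:subject  "sparrows" .

   # Formal: a museum record typed against an engineered ontology class
   ex:specimen17  rdf:type  ex:Passeriformes .

   # A reference concept gives both sources a common node to map to
   ex:blogPost42     dc:subject       ref:Bird .
   ex:Passeriformes  rdfs:subClassOf  ref:Bird .

Neither source needs to adopt the other's style; both simply point at the same referent.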

Just like human languages, these reference structures must be adaptable ones that can accommodate new learning, new ideas and new terminology. Yet, they must also have sufficient internal consistency and structure to enable their role as referents. And, they need to have a richness of vocabulary (with defined references) sufficient to capture the domain at hand. Otherwise, we end up with pidgin communications.

We can thus see a pattern emerging where informal approaches are used for tagging and simple datasets; more formal approaches are used for bounded domains and the need for precise semantics; and reference structures are used when we want to get multiple, disparate sources to communicate and interoperate. So long as these reference structures are coherent and designed for vocabulary expansion and accommodation for synonyms and other means for terminology mapping, they can adapt to changing knowledge and demands.

For too long there has been a misunderstanding and mischaracterization of anything that smacks of structure and referenceability as an attempt to limit diversity, impose control, or suggest some form of “One Ring to rule them all” organization of the semantic Web. Maybe that was true of other suggestions in the past, but it is far from the enabling role of reference structures advocated herein. This reaction to structure has something of the feeling of school children averse to their writing lessons taking over the classroom and then saying No! to more lessons. Rather than Lord of the Rings we get Lord of the Flies.

To try to overcome this misunderstanding — and to embrace the idea of language and communication for the semantic Web — I and others have tried in the past to find various analogies or imagery to describe the roles of these reference structures. (Again, all of those vagaries of human language and communication!). Analogies for these reference structures have included [13]:

  • backbones, to signal their importance as dependable structures upon which we can put “meat on the bones”
  • scaffoldings, to emphasize their openness and infrastructural role
  • roadmaps, as orienting and navigational frameworks for information
  • docking ports, as connection points for diverse datasets on the Web
  • forest paths, to signal common traversals but with much to discover once we step off the paths
  • infoclines, to represent the information interface between different world views,
  • and others.

What this post has argued is the analogy of reference structures to human language and communication. In this role, reference structures should be seen as facilitating and enabling. This is hardly a vision of constraints and control. The ability to articulate positions and ideas in fact leads to more diversity and freedom, not less.

To be sure, there is extra work in using and applying reference structures. Every child comes to know there is work in learning languages and becoming articulate in them. But, as adults, we also come to learn from experience the frustration that individuals with speech or learning impairments have when trying to communicate. Knowing these things, why do we not see the same imperatives for the semantic Web? We can only get beyond incoherent babblings by making the commitment to learn and master rich languages grounded in appropriate reference structures. We are not compelled to be inchoate; nor are our machines.

Yet, because of this extra work, it is also important that we develop and put in place semi-automatic [14] ways to tag and provide linkages to such reference structures. We have the tools and information extraction techniques available that will allow us to reference and add structure to our content in quick and easy ways. Now is the time to get on with it, and stop babbling about how structure and reference vocabularies may limit our freedoms.

This is the first of a two-part series on the importance and choice of reference structures (Part I) and gold standards (Part II) on the semantic Web.

[1] This is reflected well in a presentation from the NSF Workshop on DB & IS Research for Semantic Web and Enterprises, April 3, 2002, entitled “The Emergent, Semantic Web: Top Down Design or Bottom Up Consensus?”. This report defines top down as design and committee-driven; bottom up is more decentralized and based on social processes. Also, see Ralf Klischewski, 2003. “Top Down or Bottom Up? How to Establish a Common Ground for Semantic Interoperability within e-Government Communities,” pp. 17-26, in R. Traunmüller and M. Palmirani, eds., E-Government: Modelling Norms and Concepts as Key Issues: Proceedings of 1st International Workshop on E-Government at ICAIL 2003, Bologna, Italy. Also, see David Weinberger, 2006. “The Case for Two Semantic Webs,” KM World, May 26, 2006; see http://www.kmworld.com/Articles/ReadArticle.aspx?ArticleID=15809.
[2] For a discussion about formalisms and the nature of the Web, see this early report by F.M. Shipman III and C.C. Marshall, 1994. “Formality Considered Harmful: Experiences, Emerging Themes, and Directions,” Xerox PARC Technical Report ISTL-CSA-94-08-02, 1994; see http://www.csdl.tamu.edu/~shipman/formality-paper/harmful.html.
[3] Others have posited contrasting styles, most often as “top down” v. “bottom up.” However, in one interpretation of that distinction, “top down” means a layer on top of the existing Web; see further, A. Iskold, 2007. “Top Down: A New Approach to the Semantic Web,” in ReadWrite Web, Sept. 20, 2007. The problem with this terminology is that it offers a completely different sense of “top down” to traditional uses. In Iskold’s argument, his “top down” is a layering on top of the existing Web. On the other hand, “top down” is more often understood in the sense of a “comprehensive, engineered” view, consistent with [1].
[4] See M. K. Bergman, 2009. The Open World Assumption: Elephant in the Room, December 21, 2009. The open world assumption (OWA) generally asserts that the lack of a given assertion or fact being available does not imply whether that possible assertion is true or false: it simply is not known. In other words, lack of knowledge does not imply falsity. Another way to say it is that everything is permitted until it is prohibited. OWA lends itself to incremental and incomplete approaches to various modeling problems.
The closed world assumption (CWA) is a key underpinning to most standard relational data systems and enterprise schema and logics. CWA is the logic assumption that what is not currently known to be true, is false. For semantics-related projects there is a corollary problem to the use of CWA which is the need for upfront agreement on what all predicates “mean”, which is difficult if not impossible in reality when different perspectives are the explicit purpose for the integration.
[5] See M.K. Bergman, 2010. “Two Contrasting Styles for the Semantic Enterprise,” AI3:::Adaptive Information blog post, February 15, 2010. See https://www.mkbergman.com/866/two-contrasting-styles-for-the-semantic-enterprise/.
[6] I first used the term in passing in M.K. Bergman, 2007. “An Intrepid Guide to Ontologies,” AI3:::Adaptive Information blog post, May 16, 2007. See https://www.mkbergman.com/374/an-intrepid-guide-to-ontologies/, then more fully elaborated the idea in “Where are the Road Signs for the Structured Web,” AI3:::Adaptive Information blog post, May 29, 2007. See https://www.mkbergman.com/375/where-are-the-road-signs-for-the-structured-web/.
[7] See Catherine C. Marshall and Frank M. Shipman, 2003. “Which Semantic Web?,” in Proceedings of ACM Hypertext 2003, pp. 57-66, August 26-30, 2003, Nottingham, United Kingdom; http://www.csdl.tamu.edu/~marshall/ht03-sw-4.pdf, for a very different (but still accurate and useful) way to characterize the “visions” for the semantic Web. In this early paper, the authors posit three competing visions: 1) the development of standards, akin to libraries, to bring order to digital documents; this is the vision they ascribe to the W3C and has been largely adopted via use of URIs as identifiers, and languages such as RDF and OWL; 2) a vision of a globally distributed knowledge base (which they characterize as being Tim Berners-Lee’s original vision, with examples being Cyc or Apple’s (now disbanded) Knowledge Navigator); and 3) a vision of an infrastructure for the coordinated sharing of data and knowledge.
[8] See Claude E. Shannon‘s classic paper “A Mathematical Theory of Communication” in the Bell System Technical Journal in July and October 1948.
[9] This reference is from the Wikipedia entry on the English language: Kister, Ken. “Dictionaries defined.” Library Journal, 6/15/92, Vol. 117 Issue 11, p 43.
[10] See http://www-943.ibm.com/innovation/us/watson/related-content/toronto.html, or simply do a Web search on “watson toronto jeopardy” (no quotes).
[11] Readability is important because programmers spend the majority of their time reading, trying to understand and modifying existing source code, rather than writing new source code. Unreadable code often leads to bugs, inefficiencies, and duplicated code. It has been known for at least three decades that a few simple readability transformations can make code shorter and drastically reduce the time to understand it. See James L. Elshoff and Michael Marcotty, 1982. “Improving Computer Program Readability to Aid Modification,” Communications of the ACM, v.25 n.8, p. 512-521, Aug 1982; see http://doi.acm.org/10.1145/358589.358596. From the Wikipedia entry on Readability.
[12] According to the Wikipedia entry on Language, there are an estimated 3000 to 6000 active human languages presently in existence.
[13] The forest path analogy comes from Atanas Kiryakov of Ontotext. The remaining analogies come from M.K. Bergman in his AI3:::Adaptive Innovation blog: “There’s Not Yet Enough Backbone,” May 1, 2007 (backbone); “The Role of UMBEL: Stuck in the Middle with you …,” May 11, 2008 (infocline, scaffolding and docking port); “Structure Paves the Way to the Semantic Web,” May 3, 2007 (roadmap).
[14] Semi-automatic methods attempt to apply as much automated screening and algorithmic- or rules-based scoring as possible, and then allow the final choices to be arbitrated by humans. Fully automated systems, particularly involving natural language processing, are not yet acceptable because of (small, but) unacceptably high error rates in precision. The best semi-automated approaches handle all tasks that are rote or error-free, and then limit the final choices to those areas where unacceptable errors are still prevalent. As time goes on, more of these areas can be automated as algorithms, heuristics and methodologies improve. Eventually, of course, this may lead to fully automated approaches.

Posted by AI3's author, Mike Bergman Posted on February 21, 2011 at 2:27 am in Semantic Web, Structured Web, UMBEL | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/946/seeking-a-semantic-web-sweet-spot/
Posted: February 15, 2011

UMBEL Vocabulary and Reference Concept Ontology: A Seminal Release by SD and Ontotext; Links to Wikipedia and PROTON

Structured Dynamics and Ontotext are pleased to announce — after four years of iterative refinement — the release of version 1.00 of UMBEL (Upper Mapping and Binding Exchange Layer). This version is the first production-grade release of the system. UMBEL’s current implementation is the result of much practical experience.

UMBEL is primarily a reference ontology, which contains 28,000 concepts (classes and relationships) derived from the Cyc knowledge base. The reference concepts of UMBEL are mapped to Wikipedia, DBpedia ontology classes, GeoNames and PROTON.

UMBEL is designed to facilitate the organization, linkage and presentation of heterogeneous datasets and information. It is meant to lower the time, effort and complexity of developing, maintaining and using ontologies, and aligning them to other content.

This release 1.00 builds on the five major changes introduced in the prior UMBEL v. 0.80, announced last November. It is open source, provided under the Creative Commons Attribution 3.0 license.

Profile of the Release

In broad terms, here is what is included in the new version 1.00:

  • A core structure of 27,917 reference concepts (RCs)
  • The clustering of those concepts into 33 mostly disjoint SuperTypes (STs)
  • Direct RC mapping to 444 PROTON classes
  • Direct RC mapping to 257 DBpedia ontology classes
  • An incomplete mapping to 671 GeoNames features
  • Direct mapping of 16,884 RCs to Wikipedia (categories and pages)
  • The linking of 2,130,021 unique Wikipedia pages via 3,935,148 predicate relations; all are characterized by one or more STs
    • 876,125 are assigned a specific rdf:type
  • The UMBEL RefConcepts have been re-organized, with most local, geolocational entities moved to a supplementary module. 577 prior (version 0.80) UMBEL RCs and a further 3204 new RCs have been added to this geolocational module. This module is not being released for the current version because testing is incomplete (watch for a pending version 1.0x)
  • Some vocabulary changes, including some new and some dropped predicates (see next), and
  • Added an Annex H that describes the version 1.00 changes and methods.

Vocabulary Summary

UMBEL’s basic vocabulary can also be used for constructing specific domain ontologies that can easily interoperate with other systems. This release sees a number of changes in the UMBEL vocabulary:

  • A new correspondsTo predicate has been added for near or approximate sameAs mappings (symmetric, transitive, reflexive)
  • A controlled vocabulary of qualifiers was developed for the hasMapping predicate
  • 31 new relatesToXXX predicates have been added to relate external entities or concepts to UMBEL SuperTypes
  • Some disjointedness assertions between SuperTypes were added or changed.

The UMBEL Vocabulary defines three classes:

The UMBEL Vocabulary defines these properties:

The UMBEL vocabulary also has a significant reliance on SKOS, among other external vocabularies.
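As illustration only (the URIs and namespace prefixes here are assumed, not taken from the released files), a reference concept described with this vocabulary might look roughly like the following N3, with SKOS supplying the preferred and alternate labels that make up its “semset”:

   @prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
   @prefix skos:  <http://www.w3.org/2004/02/skos/core#> .
   @prefix umbel: <http://umbel.org/umbel#> .
   @prefix exrc:  <http://example.org/refconcepts/> .

   # A hypothetical reference concept with its SKOS-based "semset" of labels
   exrc:Bird  rdf:type  umbel:RefConcept ;
       skos:prefLabel  "bird"@en ;
       skos:altLabel   "avian"@en , "aves"@en ;
       skos:definition "A feathered, egg-laying, warm-blooded vertebrate."@en .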

Access and More Information

Here are links to various downloads, specifications, communities and assistance.

Specifications and Documentation

All documentation from the prior v 0.80 has been updated, and some new documentation has been added:

Major updates were made to the specifications and Annex G; Annex H is new. Minor changes were also made to Annexes A and B. All remaining Annexes only had minor header changes. All spec documents with minor or major changes were also versioned, with the earlier archives now date stamped.

Files and Downloads

All UMBEL files are listed on the Downloads and SVN page on the UMBEL Web site. The reference concept and mapping files may also be obtained from http://code.google.com/p/umbel/source/browse/#svn/trunk.

Additional Information

To learn more about UMBEL or to participate, here are some additional links:

Acknowledgements

These latest improvements to UMBEL and its mappings have been undertaken by Structured Dynamics and Ontotext. Support has also been provided by the European Union research project RENDER, which aims to develop diversity-aware methods in the ways Web information is selected, ranked, aggregated, presented and used.

Next Steps

This release continues the path to establish a gold standard between UMBEL and Wikipedia to guide other ontological, semantic Web and disambiguation needs. For example, the number of UMBEL reference concepts was expanded by some 36% from 20,512 to 27,917 in order to provide a more balanced superstructure for organizing Wikipedia content. And across all mappings, 60% of all UMBEL reference concepts (or 16,884) are now linked directly to Wikipedia via the new umbel:correspondsTo property. A later post will describe the design and importance of this gold standard in greater detail.
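By way of illustration (the specific URIs are hypothetical, not quoted from the mapping files), a single such linkage reduces to a simple assertion:

   @prefix umbel:   <http://umbel.org/umbel#> .
   @prefix exrc:    <http://example.org/refconcepts/> .
   @prefix dbpedia: <http://dbpedia.org/resource/> .

   # One hypothetical mapping in the UMBEL-to-Wikipedia linkage
   exrc:Bird  umbel:correspondsTo  dbpedia:Bird .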

Next releases will expand this linkage and coverage, and bring in other important reference structures such as GeoNames and others. This version of UMBEL will also be incorporated into the next version of FactForge. We will also be re-invigorating the Web vocabulary access and Web services, and adding tagging services based on UMBEL.

We invite other players with an interest in reusable and broadly applicable vocabularies and reference concepts to join with us in these efforts.


[1] Note, for legacy reasons, you may still encounter reference to ‘subject concepts’ in earlier UMBEL documentation. Please consider that term as interchangeable with the current ‘reference concepts’.

Posted by AI3's author, Mike Bergman Posted on February 15, 2011 at 12:39 am in Ontologies, Open Source, Semantic Web, UMBEL | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/945/announcing-the-first-production-grade-umbel/
Posted: February 10, 2011

Changes Instigated by UMBEL May Benefit Others

In the semantic Web, arguably SKOS is the right vocabulary for representing simple knowledge structures [1] and OWL 2 is the right language for asserting axioms and ontological relationships. In the early days we chose a reliance on SKOS for the UMBEL reference concept ontology, because of UMBEL’s natural role as a knowledge structure. Most recently we also migrated UMBEL to OWL 2 to gain (among other reasons) the metamodeling advantages of “punning”, which allows us to treat things as either classes or instances depending on modeling needs, context and viewpoint [2].
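For readers unfamiliar with punning, a minimal sketch (N3, with hypothetical URIs) shows the idea: the same IRI is used both as a class and as an individual, and an OWL 2 reasoner treats the two uses as separate views of the same name:

   @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
   @prefix owl: <http://www.w3.org/2002/07/owl#> .
   @prefix ex:  <http://example.org/> .

   # ex:Eagle used as a class: it has instances
   ex:Eagle  rdf:type  owl:Class .
   ex:baldEagle42  rdf:type  ex:Eagle .

   # ex:Eagle also used as an individual (a member of another class);
   # OWL 2 punning treats the two uses as distinct views of one name
   ex:SpeciesConcept  rdf:type  owl:Class .
   ex:Eagle  rdf:type  ex:SpeciesConcept .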

But — until today — we could not get SKOS and OWL 2 to play together nicely. This meant we could not take full advantage of each language’s respective strengths.

Happily, that gap has now been closed. By action of a relatively minor change by the SKOS Work Group (WG) and the addition of a simple statement, SKOS more fully interoperates with OWL 2.

Brief Description of the Issue

SKOS was and is purposefully designed to be simple and to stick to its knowledge system scope. The authors of the language avoided many restrictions and kept axioms regarding the language to a minimum. As a result, since its inception the core SKOS has, in relation to OWL, been expressed as OWL Full (that is, undecidable and more free-form in interpretation).

However, also for some time it has been understood that, with some relatively minor changes, the core SKOS language could be modified to be OWL DL (decidable). In fact, since June 2009 there has been a DL version of SKOS, called by the editors the “DL prune” [3]. A useful discussion of the specific axioms that cause the core SKOS to require interpretation as OWL Full is provided by the Semantic Web and Interoperability Group at the University of Patras [4].

Most of the conditions in question relate to the various annotation properties in SKOS, such as skos:prefLabel, skos:altLabel and skos:hiddenLabel. These are core constructs within the use of “semsets” to describe and refer to reference concepts in UMBEL [5]. Among others, a simple explanation for one issue is that these labels within SKOS are defined as sub-properties of rdfs:label, which is not allowable in OWL 2. Where such conflicts exist, they are removed (“pruned”) from the SKOS DL version. There are only a few of these, and they are not (in our view) central to the purpose or usefulness of SKOS.
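The flavor of the axioms at issue can be seen in this short fragment, a sketch of the sub-property declarations described above that the DL prune removes:

   @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
   @prefix skos: <http://www.w3.org/2004/02/skos/core#> .

   # Core SKOS declares its labeling properties as sub-properties of
   # rdfs:label; axioms of this sort are dropped in the DL prune
   skos:prefLabel    rdfs:subPropertyOf  rdfs:label .
   skos:altLabel     rdfs:subPropertyOf  rdfs:label .
   skos:hiddenLabel  rdfs:subPropertyOf  rdfs:label .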

For UMBEL, and we think other vocabularies, the transition to OWL 2 is important for many reasons, including the “punning” of individuals and classes. Other reasons include better handling of annotation properties, a better emerging set of tools, and the use of inference and reasoning engines.

Thus, to square this circle when using SKOS in OWL 2, it is important to refer to the DL version of SKOS and not SKOS core. However, since there was no version acknowledgment to the core in the SKOS DL namespace, it was not possible to make this reference while retaining general SKOS namespace (skos:XXX) compatibility.

How the SKOS WG Resolved the Issue

My UMBEL co-editor, Fred Giasson, first brought this issue to the SKOS WG’s attention in November [6]. He further suggested a relatively simple means to resolve the problem. By including a version reference in the SKOS DL version to SKOS core, consistent with the current OWL 2 specification [7], OWL 2-compliant tools will now know how to discern the proper references. This fix looks as follows:

   <rdf:Description rdf:about="http://www.w3.org/2004/02/skos/core">
      <owl:versionIRI rdf:resource="http://www.w3.org/TR/skos-reference/skos-owl1-dl.rdf"/>
   </rdf:Description>

Now, in our ontology (UMBEL), we define the skos namespace (N3 format):

   @prefix skos: <http://www.w3.org/2004/02/skos/core#> .

and, then, import the SKOS DL version:

   <http://umbel.org/umbel> rdf:type owl:Ontology ;
      # ... other ontology statements ...
      owl:imports <http://www.w3.org/TR/skos-reference/skos-owl1-dl.rdf> .

Pretty simple and straightforward.

With the expeditious treatment of the SKOS group [8], this proposal was floated and commented upon and then passed and adopted today. The change was also posted as an erratum to the SKOS specification [9]. As Fred reported [10], his testing of the fix with available tools (such as Protégé 4.1 and reasoners) also appears to be working properly (on a 58 MB file with 28 K concepts and all of their annotations and relationships!).

Some Final Notes

This is a good example of how even simple changes may make major differences. It also shows how even simple changes may take some time and effort in a standards-making environment. But, even though simple, we think this will be highly useful in keeping SKOS a central vocabulary within OWL 2-based systems. Within the context of conceptual and domain vocabularies, such as represented by UMBEL or other ontologies based on the UMBEL vocabulary or similar approaches, we see this as a major win.

Finally, as for UMBEL itself, this change has allowed us to remove duplicate predicates in favor of those from SKOS. It has also caused a delay in our release of UMBEL v. 1.00, planned for today, until February 15 to complete testing and documentation revisions. But we’ll gladly take a change like this in trade for a minor delay any day.

Thanks, SKOS!


[1] SKOS, or Simplified Knowledge Organization System, is a family of formal languages designed for representation of thesauri, classification schemes, taxonomies, subject-heading systems, or any other type of structured controlled vocabulary. SKOS is built upon RDF and RDFS, and its main objective is to enable easy publication of controlled structured vocabularies for the semantic Web.
[2] For a full explanation of this topic and its rationale for our work, see M. K. Bergman, 2010. “Metamodeling in Domain Ontologies,” September 20, 2010 posting on the AI3:::Adaptive Information blog; https://www.mkbergman.com/913/metamodeling-in-domain-ontologies/.
[3] The DL-compliant listing, SKOS-RDF-OWL1-DL, first appears in the June 2009 version of the latest SKOS specifications. It is described as a “pruned” subset of the SKOS specifications that conforms to OWL DL. The RDF file itself helpfully annotates the specific issues in the core language; see next note.
[4] See http://swig.hpclab.ceid.upatras.gr/SKOS/Skos2Owl2. Also, there is other useful discussion on the SKOS mailing list by Antoine Isaac and the OWL Working Group’s comments to the SKOS WG on their last efforts.
[5] See this section in the UMBEL Specifications regarding the use of labels and “semsets.”
[8] We’d like to thank Tom Baker for leading the effort with the SKOS group, and also thank former group members Rinke Hoekstra, Uli Sattler and Bijan Parsia for their expressions of support.

Posted by AI3's author, Mike Bergman Posted on February 10, 2011 at 11:01 pm in Ontologies, Semantic Web, UMBEL | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/944/skos-now-interoperates-with-owl-2/
Posted: February 7, 2011

Sweet Tools Listing: Now Presented as a Semantic Component; Grows to 900+ Tools

Sweet Tools, AI3‘s listing of semantic Web and -related tools, has just been released with its 17th update. The listing now contains more than 900 tools, about a 10% increase over the last version. Significantly, the listing is also now presented via its own semantic tool, the structSearch sComponent, which is one of the growing parts of Structured Dynamics‘ open semantic framework (OSF).

So, we invite you to go ahead and try out this new Flex/Flash version with its improved search and filtering! We’re pretty sure you’ll like it.

Summary of Major Changes

Sweet Tools now lists 919 tools, an increase of 84 (or 10.1%) over the prior version of 835 tools. The most notable trend is the continued increase in capabilities and professionalism of (some of) the new tools.

This new release of Sweet Tools — available for direct play and shown in the screenshot to the right — is the first to be presented via Structured Dynamics’ Flex-based semantic component technology. The system has greatly improved search and filtering capabilities; it also shares the superior dataset management and import/export capabilities of its structWSF brethren.

As a result, moving forward, Sweet Tools updates will now be added on a more regular basis, reducing the big burps that have tended to accompany past releases. We will also see much expanded functionality over time as other pieces of the structWSF and sComponents stack get integrated and showcased using this dataset.

This release is the first in WordPress, and shows the broad capabilities of the OSF stack to be embedded in a variety of CMS or standalone systems. We have provided some updates on Structured Dynamics’ OSF TechWiki for how to modify, embed and customize these components with various Flex development frameworks (see one, two or three), such as Flash Builder or FlashDevelop.

We should mention that the OSF code group is also seeing external parties exposing these capabilities via JavaScript deployments as well. This recent release expands on the conStruct version with its capabilities described in a post about a year ago.

Retiring the Exhibit Version

However, this release does mark the retirement of the very fine Exhibit version of Sweet Tools (an archive version will be kept available until it gets too long in the tooth). I was one of the first to install a commercial Exhibit system, and the first to do so on WordPress, as I described in an article more than four years ago.

Exhibit has worked great and without a hitch, and through a couple of upgrades. It still has (I think) superior faceting and sorting capabilities to what we presently offer with our own sComponent alternative. However, the Exhibit version is really a display technology alone, and offers no search, access control or underlying data management capabilities (such as CRUD), all of which are integral to our current system. It is also not grounded in RDF or semantic technologies, though it does have good structural genes. And, Sweet Tools has about reached the limits of the size of datasets Exhibit can handle efficiently.

Exhibit has set a high bar for usability and lightweight design. As we move in a different direction, I’d like again to publicly thank David Huynh, Exhibit’s developer, and the MIT Simile program for when he was there, for putting forward one of the seminal structured data tools of the past five years.

Updated Statistics

The updated Sweet Tools listing now includes nearly 50 different tools categories. The most prevalent categories are browser tools (RDF, OWL), information extraction, ontology tools, parsers or converters, and general RDF tools. The relative share by category is shown in this diagram (click to expand):

Since the last listing, the fastest growing categories have been utilities (general and RDF) and visualization. Linked data listings have also grown by 200%, but are still a relatively small percentage of the total.

These values should be taken with a couple of grains of salt. First, not all of these additions are organic or new releases. Some are the result of our own tools efforts and investigations, which can often surface prior overlooked tools. Also, even with this large number of application categories, many tools defy characterization, and can reside in multiple categories at once or are even pointing to new ones. So, the splits are illustrative, but not defining.

General language percentages have been keeping pretty constant over the past couple of years. Java remains the leading language with nearly half of all applications, a percentage it has kept steady for four years. PHP continues to grow in popularity, and actually showed the largest percentage increase of any language over this past census. The current language splits are shown in the next diagram (click to expand):

C/C++ and C# have really not grown at all over the past year. Again, however, for the reasons noted, these trends should be interpreted with care.

Dogfood Never Tasted So Good

Tools development is hard and the open source nature of today’s development tends to require a certain critical mass of developer interest and commitment. There are some notable tools that have much use and focus and are clearly professional and industrial grade. Yet, unfortunately, too many of the tools on the Sweet Tools listing are either proofs-of-concept, academic demos, or largely abandoned because of lack of interest by the original developer, the community or the market as a whole.

There is a common statement within the community about how important it is for developers to “eat their own dogfood.” On the face of it, this makes some sense since it conveys a commitment to use and test applications as they are developed.

But looked at more closely, this sentiment carries with it a troublesome reflection of the state of (many) tools within the semantic Web: too much kibble that is neither attractive nor tasty. It is probably time to keep the dogfood in the closet and focus on well-cooked and attractive fare.

We at Structured Dynamics are not trying to hold ourselves up as exemplars or the best chefs of tasty food. We do, however, have a commitment to produce fare that is well prepared and professional. Let’s stop with the dogfood and get on with serving nutritious and balanced fare to the marketplace.

Posted by AI3's author, Mike Bergman Posted on February 7, 2011 at 1:47 am in Open Source, Semantic Web Tools, Structured Web | Comments (1)
The URI link reference to this post is: https://www.mkbergman.com/942/tasty-new-sweet-tools-release/
Posted: January 31, 2011

Refining UMBEL’s Linking and Mapping Predicates with Wikipedia

We are only days away from releasing the first commercial version 1.00 of UMBEL (Upper Mapping and Binding Exchange Layer) [1]. To recap, UMBEL has two purposes, both aimed to promote the interoperability of Web-accessible content. First, it provides a general vocabulary of classes and predicates for describing domain ontologies and external datasets. Second, UMBEL is a coherent framework of 28,000 broad subjects and topics (the “reference concepts”), which can act as binding nodes for mapping relevant content.

This last iteration of development has focused on the real-world test of mapping UMBEL to Wikipedia [2]. The result, to be more fully described upon release, has led to two major changes. It has acted to expand the size of the core UMBEL reference concepts to about 28,000. And it has led to adding to and refining the mapping predicates necessary for UMBEL to fulfill its purpose as a reference structure for external resources. This latter change is the focus of this post.

There is a huge diversity of organizational structure and world views on the Web; the linking and mapping predicates to fulfill this purpose must also capture that diversity. Relations between things on the Web can range from exact identity to the approximate, descriptive and casual [3]. The 16 K direct mappings that have now been made between UMBEL and Wikipedia (resulting in the linkage of more than 2 million Wikipedia pages) provide a real-world test for how to capture this diversity. The need is to find the range of predicates that can reflect and capture quality, accurate mappings. Further, because mappings also can be aided with a variety of techniques from the manual to the automatic, it is important to characterize the specific mapping methods used whenever a linking predicate is assigned. Such qualifications can help to distinguish mapping trustworthiness, plus enable later segregation for the application of improved methods as they may arise.

As a result, the UMBEL Vocabulary now has a pretty well vetted and diverse set of linking and mapping predicates. Guidelines for how these differ, how they are used, and how they are qualified are described next.

A Comparison of Mapping Predicates

Properties for linking and mapping need to differ more than in name or intended use. They must represent differences that affect inferences and reasoners, and can be acted upon by specific utilities via user interfaces and other applications. Furthermore, the diversity of mapping predicates should capture the types of diverse mappings and linkages possible between disparate sources.

Sometimes things are individuals or instances; other times they are classes or groupings of similar things. Sometimes things are of the same kind, but not exactly aligned. Sometimes things are unlike, but related in a common way. (Everything in Britain, for example, is a British “thing” even though they may be as different as trees, dead kings or cathedrals.) Sometimes we want to say something about a thing, such as an animal’s fur color or age, as a way to further characterize it, and so on.

The OWL 2 language and existing semantic Web languages give us some tools and existing vocabulary to capture some of this diversity. How these options, plus new predicates defined for UMBEL’s purposes, compare is shown by this table:

Property | Relative Strength | Usage | Standard Reasoner? | Inverse Property? | It is (kind of thing) | It Relates to (kind of thing) | Symmetrical? | Transitive? | Reflexive?
owl:equivalentClass | 10 | equivalence | X | N/A | class | class | yes | yes | yes
owl:sameAs | 9 | identity | X | N/A | individual | individual | yes | yes | yes
rdfs:subClassOf | 8 | subset | X | – | class | class | no | yes | yes
umbel:correspondsTo | 7 | ~equivalence | + / – | – | anything | RefConcept | yes | yes | yes
skos:narrowerTransitive | 6 | hierarchical | X | – | skos:Concept | skos:Concept | no | yes | no
skos:broaderTransitive | 6 | hierarchical | X | – | skos:Concept | skos:Concept | no | yes | no
rdf:type | 5 | membership | X | – | anything | class | no | no | no
umbel:isAbout | 4 | topical | – | X | anything | RefConcept | perhaps | not likely | not likely
umbel:isLike | 3 | similarity | – | – | anything | anything | yes | no | not likely
umbel:relatesToXXX | 2 | relationship | – | – | anything | SuperType | no | no | not likely
umbel:isCharacteristicOf | 1 | attribute | – | X | anything | RefConcept | no | no | no

I discuss each of these predicates below. But, first, let’s discuss what is in this table and how to interpret it [4].

  • Relative strength – an arbitrary value that is meant to capture the inferencing power (entailments) embodied in the predicate. Identity (equivalence), class implications, and specific predicate properties that can be acted upon by reasoners are given higher relative power
  • Standard reasoner? – indicates whether standard reasoners [5] draw inferences and entailments from the specific property. A “+ / -” entry indicates that reasoners do not recognize the specific property per se, but can act upon the predicates (such as symmetric, transitive or reflexive) used to define the predicate
  • Inverse property? – indicates whether there is an inverse property used within UMBEL that is not listed in the table. In such cases, the predicate shown is the one that treats the external entity as the subject
  • It is a kind of thing – is the same as domain; it means the kind of thing to which the subject belongs
  • It relates to a kind of thing – is the same as range; it means the kind of thing to which the object of the subject belongs
  • Symmetrical? – describes whether the predicate for an s – p – o (subject – predicate – object) relationship can also apply in the o – p – s manner
  • Transitive? – is whether the predicate interlinks two individuals A and C whenever it interlinks A with B and B with C for some individual B
  • Reflexive? – indicates whether the relation also applies from the subject to itself; that is, whether the subject has itself as a member. In a reflexive closure between subject and object the subject is fully included as a member. Equivalence, subset, greater than or equal to, and less than or equal to relationships are reflexive; not equal, less than or greater than relationships are not.

The Usage metric is described for each property below.

Individual Predicates Discussion

To further aid the understanding of these properties, we can also group them into equivalence, membership, approximate or descriptive categories.

Equivalent Properties

Equivalent properties are the most powerful available since they entail all possible axioms between the resources.

owl:equivalentClass

Equivalent class means that two classes have the same members; each is a sub-class of the other. The classes may differ in terms of annotations defined for each of them, but otherwise they are axiomatically equivalent.

An owl:equivalentClass assertion is the most powerful available because of its ability to ‘Explode the Domain’ [6]. Because of its entailments, owl:equivalentClass should be used with great care.

owl:sameAs

The owl:sameAs assertion claims two instances to be an identical individual. This assertion also carries with it strong entailments of symmetry and reflexivity.

owl:sameAs is often misapplied [7]. Because of its entailments, it too should be used with great care. When there are doubts about claiming this strong relationship, UMBEL has the umbel:isLike alternative (see below).
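A minimal sketch of the two assertions (N3, illustrative URIs only) may help show what is actually being claimed in each case:

   @prefix owl: <http://www.w3.org/2002/07/owl#> .
   @prefix ex:  <http://example.org/> .

   # Class-level equivalence: the two classes have exactly the same members
   ex:Automobile  owl:equivalentClass  ex:MotorCar .

   # Instance-level identity: the two URIs are claimed to name the same individual
   ex:JimmyJohnson  owl:sameAs  <http://example.org/other-dataset/Jimmy_Johnson> .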

Membership and Hierarchical Properties

Membership properties assert that an instance is a member of a class.

rdfs:subClassOf

The rdfs:subClassOf asserts that one class is a subset of another class. This assertion is transitive and reflexive. It is a key means for asserting hierarchical or taxonomic structures in an ontology. This assertion also has strong entailments, particularly in the sense of members having consistent general or more specific relationships to one another.

Care must be exercised that full inclusivity of members occurs when asserting this relationship. When correctly asserted, however, this is one of the most powerful means to establish a reasoning structure in an ontology because of its transitivity.

skos:narrowerTransitive/skos:broaderTransitive

Both of these predicates work on skos:Concept (recall that umbel:RefConcept is itself a subClassOf a skos:Concept). The predicates state a hierarchical link between the two concepts that indicates one is in some way more general (“broader”) than the other (“narrower”) or vice versa. The particular application of skos:broaderTransitive (or its complement) is used to infer the transitive closure of the hierarchical links, which can then be used to access direct or indirect hierarchical links between concepts.

The transitive relationship means that there may be intervening concepts between the two stated resources, making the relationship an ancestral one, and not necessarily (though it is possible to be so) a direct parent-child one.

rdf:type

The rdf:type assertion assigns instances (individuals) to a class. While the idea is straightforward, it is important to understand the intensional nature of the target class to ensure that the assignment conforms to the intended class scope. When this determination cannot be made, one of the more approximate UMBEL predicates (see below) should be used.
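Pulling the membership and hierarchical assertions together, a short N3 sketch (illustrative URIs only) looks like this:

   @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
   @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
   @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
   @prefix ex:   <http://example.org/> .

   # Subset (taxonomic) relation between two classes
   ex:RaceCarDriver  rdfs:subClassOf  ex:Athlete .

   # Hierarchical relation between two skos:Concepts
   ex:StockCarRacing  skos:broaderTransitive  ex:MotorSport .

   # Membership: an instance assigned to a class
   ex:JimmyJohnson  rdf:type  ex:RaceCarDriver .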

Approximation Properties

For one reason or another, the precise assertions of the equivalent or membership properties above may not be appropriate. For example, we might not sufficiently know an intended class scope, or there might be ambiguity as to the identity of a specific entity (is it Jimmy Johnson the football coach, race car driver, fighter, local plumber or someone else?). Among other options — along a spectrum of relatedness — is the desire to assign a predicate that is meant to represent the same kind of thing, yet without knowing if the relationship is an equivalence (identity, or sameAs), a subset, or merely a member-of relationship. Alternatively, we may recognize that we are dealing with different things, but want to assert a relationship of an uncertain nature.

This section presents the UMBEL alternatives for these different kinds of approximate predicates [4].

umbel:correspondsTo

The most powerful of these approximate predicates in terms of alignment and entailments is the umbel:correspondsTo property. This predicate is the recommended option if, after looking at the source and target knowledge bases [8], we believe we have found the best equivalent relationship, but do not have the information or assurance to assign one of the relationships above. So, while we are sure we are dealing with the same kind of thing, we may not have full confidence to be able to assign one of these alternatives:

   rdfs:subClassOf
   owl:equivalentClass
   owl:sameAs
   superClassOf

Thus, with respect to existing and commonly used predicates, we want an umbrella property that is generally equivalent or so in nature, and if perhaps known precisely might actually encompass one of the above relations, but we don’t have the certainty to choose one of them nor perhaps assert full “sameness”. This is not too dissimilar from the rationale being tested for the x:coref predicate in relation to owl:sameAs from the UMBC Ebiquity group [9,10].

The property umbel:correspondsTo is thus used to assert a close correspondence between an external class, named entity, individual or instance with a Reference Concept class. It asserts this correspondence through the basis of both its subject matter and intended scope.

This property may be reified with the umbel:hasMapping property to describe the “degree” of the assertion.
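As a rough sketch only (the exact reification pattern and the qualifier value shown are assumptions on our part, not taken from the specification), such a mapping and its qualification might be recorded along these lines:

   @prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
   @prefix umbel: <http://umbel.org/umbel#> .
   @prefix ex:    <http://example.org/> .
   @prefix exrc:  <http://example.org/refconcepts/> .

   # The mapping assertion itself
   ex:Automobile  umbel:correspondsTo  exrc:Automobile .

   # A reified view of that assertion, carrying a hypothetical
   # hasMapping qualifier that records how the mapping was made
   ex:mapping01  rdf:type  rdf:Statement ;
       rdf:subject       ex:Automobile ;
       rdf:predicate     umbel:correspondsTo ;
       rdf:object        exrc:Automobile ;
       umbel:hasMapping  ex:ManualMappingQualifier .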

umbel:isAbout

In most uses, the most prevalent linking property to be used is the umbel:isAbout assertion. This predicate is useful when tagging external content with metadata for alignment with an UMBEL-based reference ontology. The reciprocal assertion, umbel:isRelatedTo, is used when an assertion from an UMBEL-based vocabulary to an external ontology is desired; its application is where the reference vocabulary itself needs to refer to an external topic or concept.

The umbel:isAbout predicate does not have the same level of confidence or “sameness” as the umbel:correspondsTo property. It may also reflect an assertion that is more like rdf:type, but without the confidence of class membership.

The property umbel:isAbout is thus used to assert the relation between an external named entity, individual or instance with a Reference Concept class. It can be interpreted as providing a topical assertion between an individual and a Reference Concept.

This property may be reified with the umbel:hasMapping property to describe the “degree” of the assertion.
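A tagging use of this predicate might look like the following sketch (hypothetical URIs):

   @prefix umbel: <http://umbel.org/umbel#> .
   @prefix ex:    <http://example.org/> .
   @prefix exrc:  <http://example.org/refconcepts/> .

   # A Web page tagged as being topically about a reference concept
   ex:newsArticle789  umbel:isAbout  exrc:Economics .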

umbel:isLike

The property umbel:isLike is used to assert an associative link between similar individuals who may or may not be identical, but are believed to be so. This property is not intended as a general expression of similarity, but rather the likely but uncertain same identity of the two resources being related.

This property may be considered as an alternative to sameAs where there is not a certainty of sameness, and/or when it is desirable to assert a degree of overlap of sameness via the umbel:hasMapping reification predicate. This property can and should be changed if the certainty of the sameness of identity is subsequently determined.

It is appropriate to use this property when there is strong belief the two resources refer to the same individual with the same identity, but that association can not be asserted at the present time with full certitude.

This property may be reified with the umbel:hasMapping property to describe the “degree” of the assertion.
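Returning to the earlier Jimmy Johnson example, a sketch of how two records believed (but not yet known) to identify the same individual might be linked (all resource URIs here are illustrative):

    from rdflib import Graph, Namespace, URIRef

    UMBEL = Namespace("http://umbel.org/umbel#")      # assumed vocabulary URI

    g = Graph()
    g.bind("umbel", UMBEL)

    # Two records believed, but not yet known with certainty, to refer to the same
    # person; if identity is later confirmed, this assertion should be replaced
    local_record = URIRef("http://example.org/people/jimmy-johnson")
    external_record = URIRef("http://dbpedia.org/resource/Jimmy_Johnson_(American_football_coach)")
    g.add((local_record, UMBEL.isLike, external_record))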

umbel:relatesToXXX

At a different point along this relatedness spectrum we have unlike things that we would nonetheless like to relate to one another. The relation might be an attribute, a characteristic or a functional property of something we care to describe. Further, by the nature of the thing we are relating, we may also be able to describe the kind of thing being related. The UMBEL SuperTypes (among many other options) give us one such means to characterize the thing being related.

UMBEL presently has 31 predicates for these assertions, each tied to a SuperType [11]. The various properties designated by umbel:relatesToXXX are used to assert a relationship between an external instance (object) and a particular (XXX) SuperType. The assertion of this property does not entail class membership in the asserted SuperType. Rather, the assertion may be based on particular attributes or characteristics of the object at hand. For example, a British person might have an umbel:relatesToGeoEntity relation asserted to the Geopolitical SuperType (via Britain), even though the actual thing at hand (a person) is a member of the PersonTypes SuperType.

This predicate is used for filtering or clustering, often within user interfaces. Multiple umbel:relatesToXXX assertions may be made for the same instance.

Each of the 32 UMBEL SuperTypes has a matching predicate for external topic assignments (Prokaryotes and ProtistsFungus share relatesToOtherOrganism, leading to 31 distinct predicates); a sketch of such an assertion follows the table:

SuperType Mapping Predicate Comments
NaturalPhenomena relatesToPhenomenon This predicate relates an external entity to the SuperType (ST) shown. It indicates a relationship to the ST of a verifiable nature, but one that is undetermined as to strength or whether it amounts to a full rdf:type relationship
NaturalSubstances relatesToSubstance same as above
Earthscape relatesToEarth same as above
Extraterrestrial relatesToHeavens same as above
Prokaryotes, ProtistsFungus relatesToOtherOrganism same as above (this predicate is shared by both SuperTypes)
Plants relatesToPlant same as above
Animals relatesToAnimal same as above
Diseases relatesToDisease same as above
PersonTypes relatesToPersonType same as above
Organizations relatesToOrganizationType same as above
FinanceEconomy relatesToFinanceEconomy same as above
Society relatesToSociety same as above
Activities relatesToActivity same as above
Events relatesToEvent same as above
Time relatesToTime same as above
Products relatesToProductType same as above
FoodorDrink relatesToFoodDrink same as above
Drugs relatesToDrug same as above
Facilities relatesToFacility same as above
Geopolitical relatesToGeoEntity same as above
Chemistry relatesToChemistry same as above
AudioInfo relatesToAudioMusic same as above
VisualInfo relatesToVisualInfo same as above
WrittenInfo relatesToWrittenInfo same as above
StructuredInfo relatesToStructuredInfo same as above
NotationsReferences relatesToNotation same as above
Numbers relatesToNumbers same as above
Attributes relatesToAttribute same as above
Abstract relatesToAbstraction same as above
TopicsCategories relatesToTopic same as above
MarketsIndustries relatesToMarketIndustry same as above

This property may be reified with the umbel:hasMapping property to describe the “degree” of the assertion.
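Following the British-person example above, here is a sketch of such a SuperType-scoped assertion (again with rdflib; the resource URIs are illustrative, and the objects used here, DBpedia entities, are simply plausible targets since the SuperType itself is carried by the predicate name):

    from rdflib import Graph, Namespace, URIRef

    UMBEL = Namespace("http://umbel.org/umbel#")      # assumed vocabulary URI

    g = Graph()
    g.bind("umbel", UMBEL)

    # A person (a member of the PersonTypes SuperType) related to the Geopolitical
    # SuperType via relatesToGeoEntity; no class membership in Geopolitical is entailed
    person = URIRef("http://example.org/people/some-british-person")
    g.add((person, UMBEL.relatesToGeoEntity, URIRef("http://dbpedia.org/resource/United_Kingdom")))

    # Multiple relatesToXXX assertions may be made for the same instance
    g.add((person, UMBEL.relatesToActivity, URIRef("http://dbpedia.org/resource/Association_football")))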

Descriptive Properties

Descriptive properties are annotation properties.

umbel:isCharacteristicOf

Two annotation properties are used to describe the attribute characteristics of a RefConcept, namely umbel:hasCharacteristic and its reciprocal, umbel:isCharacteristicOf. These properties are the means by which external properties used to describe things can be brought in and used as lookup references (that is, metadata) to external data attributes. As annotation properties, they have weak semantics and are used for accounting rather than reasoning purposes.

These properties are designed to be used in external ontologies to characterize, describe, or provide attributes for data records associated with a given RefConcept. It is via this property or its inverse, umbel:hasCharacteristic, that external data characterizations may be incorporated and modeled within a domain ontology based on the UMBEL vocabulary.
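A sketch of how these annotation properties might record that an external property serves as a lookup characteristic for a Reference Concept (the FOAF property is real; the Person Reference Concept name and the namespace URIs are assumptions):

    from rdflib import Graph, Namespace

    UMBEL = Namespace("http://umbel.org/umbel#")      # assumed vocabulary URI
    RC = Namespace("http://umbel.org/umbel/rc/")      # assumed Reference Concept URI
    FOAF = Namespace("http://xmlns.com/foaf/0.1/")

    g = Graph()
    g.bind("umbel", UMBEL)

    # Annotate (weak semantics, no reasoning entailed) that foaf:birthday is a
    # descriptive characteristic of a hypothetical Person RefConcept, plus the reciprocal
    g.add((RC["Person"], UMBEL.hasCharacteristic, FOAF.birthday))
    g.add((FOAF.birthday, UMBEL.isCharacteristicOf, RC["Person"]))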

Qualifying the Mappings

The choice of these mapping predicates may be aided by a variety of techniques, from the manual to the automatic. It is thus important to characterize the specific mapping method used whenever a linking predicate is assigned. Following this best practice allows us to distinguish the trustworthiness of mappings, and it also enables later segregation so that improved methods can be applied as they arise.

UMBEL, for its current mappings and purposes, has adopted the following controlled vocabulary for characterizing the umbel:hasMapping predicate; such listings may be readily modified for other domains and purposes when using the UMBEL vocabulary. This controlled vocabulary is based on instances of the Qualifier class. This class represents a set of descriptions to indicate the method used when applying an approximate mapping predicate (see above):

Qualifier Description
Manual – Nearly Equivalent The two mapped concepts are deemed to be nearly an equivalentClass or sameAs relationship, but not 100% so
Manual – Similar Sense The two mapped concepts share much overlap, but are not the exact same sense, such as an action as related to the thing it acts upon
Heuristic – ListOf Basis Type assignment based on Wikipedia ListOf category; not currently used
Heuristic – Not Specified Heuristic mapping method applied; script or technique not otherwise specified
External – OpenCyc Mapping Mapping based on existing OpenCyc assertion
External – DBOntology Mapping Mapping based on existing DBOntology assertion
External – GeoNames Mapping Mapping based on existing GeoNames assertion
Automatic – Inspected SV Mapping based on automatic scoring of concepts using Semantic Vectors, with specific alignment choice based on hand selection
Automatic – Inspected S-Match Mapping based on automatic scoring of concepts using S-Match, with specific alignment choice based on hand selection; not currently used
Automatic – Not Specified Mapping based on automatic scoring of concepts using a script or technique not otherwise specified; not currently used

Again, as noted, for other domains and other purposes this listing can be modified at will.
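As a sketch of how one of these qualifiers might be attached to a mapping assertion, standard RDF reification is shown below. This is one possible mechanism only; the UMBEL specification's actual reification approach may differ, and the qualifier instance identifier used here is an assumption derived from the labels above.

    from rdflib import Graph, Namespace, BNode
    from rdflib.namespace import RDF

    UMBEL = Namespace("http://umbel.org/umbel#")      # assumed vocabulary URI
    RC = Namespace("http://umbel.org/umbel/rc/")      # assumed Reference Concept URI
    DBR = Namespace("http://dbpedia.org/resource/")

    g = Graph()
    g.bind("umbel", UMBEL)

    # The base mapping assertion
    s, p, o = DBR["Racing_driver"], UMBEL.correspondsTo, RC["RaceCarDriver"]
    g.add((s, p, o))

    # Reify the assertion and attach a hasMapping qualifier describing how it was made
    stmt = BNode()
    g.add((stmt, RDF.type, RDF.Statement))
    g.add((stmt, RDF.subject, s))
    g.add((stmt, RDF.predicate, p))
    g.add((stmt, RDF.object, o))
    g.add((stmt, UMBEL.hasMapping, UMBEL["Manual-NearlyEquivalent"]))

The same reification pattern applies to the umbel:isAbout, umbel:isLike and umbel:relatesToXXX assertions, each of which may likewise be qualified with umbel:hasMapping.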

Status of Mappings

Final aspects of these mappings are now undergoing a last round of review. A variety of sources and methods have been applied, which will be more fully documented at the time of release.

Some of the final specifics and counts may be modified slightly by the time of actual release of UMBEL v 1.00, which should occur in the next week or so. Nonetheless, here are some tentative counts for a select portion of these predicates in the internal draft version:

Item or Predicate Count
Total UMBEL Reference Concepts 27,917
owl:equivalentClass (external OpenCyc, PROTON, DBpedia) 28,618
umbel:correspondsTo (direct mappings to Wikipedia) 16,884
rdf:type 876,125
umbel:relatesToXXX (31 variations) 3,059,023
Unique Wikipedia Pages Mapped 2,130,021

All of these assignments have also been hand inspected and vetted.

Major Progress Towards a Gold Standard

To date, across various steps and phases, the inspection of Wikipedia, its categories, and its match with UMBEL has perhaps consumed more than 5,000 hours (nearly three person-years) of expert domain and semantic technology review [12]. As noted, about 60% (16,884 of 27,917) of UMBEL concepts have now been directly mapped to Wikipedia and inspected for accuracy.

Wikipedia provides the most demanding and complete mapping target available for testing the coverage of UMBEL’s reference concepts and the adequacy of its vocabulary. As a result, we have added to and refined the mapping and linking predicates used in the UMBEL vocabulary, and added a Qualifier class to record the mapping process, as this post overviews. We have added the SuperType class to better organize and disambiguate large knowledge bases [13]. And, in this mapping process, we have expanded UMBEL’s reference concepts by about 33% to improve coverage, while remaining consistent with its origins as a faithful subset of the venerable Cyc knowledge structure [14].

A side benefit that has emerged from these efforts — with a huge potential upside — is the valuable combination of UMBEL and Wikipedia as a “gold standard” for aligning and mapping knowledge bases. Such a standard is critically needed. For example, in reviewing many of the existing Wikipedia mappings claimed as accurate, we found misplacement errors that averaged 15.8% [15]. Having a baseline of vetted mappings will aid future mappings. Moreover, having a complete conceptual infrastructure over Wikipedia will enable new and valuable reasoning and inference services.

The results from the UMBEL v 1.00 mapping are promising and very much useful today, but by no means complete. Future versions will extend the current mappings and continue to refine their accuracy and completeness [16]. What we can say, however, is that a coherent organization and conceptual schema, namely UMBEL, overlaid on the richness of the instance data and content of Wikipedia, can produce immediate and useful benefits. These benefits apply to semantic search, semantic annotation and tagging, reasoning, discovery, inferencing, organization and comparisons.


[1] UMBEL has been under development since March 2007, with its first release in July 2008 and its last release (v 0.80) in November 2010. Throughout its releases we have reserved incrementing the vocabulary and its ontology to version 1.00 until it was deemed “commercial”. This is the first version to meet this test. We’d like to thank our partner, Ontotext, and the RENDER project for providing assistance and resources to bring the system to this point.
[2] The basic approach is to use the DBpedia representation of Wikipedia, since its extractors have already done a great job in preparing structured data.
[3] M.K. Bergman, 2010. “The Nature of Connectedness on the Web,” AI3:::Adaptive Information blog, November 22, 2010; see https://www.mkbergman.com/935/the-nature-of-connectedness-on-the-web/.
[4] A good starting reference for some of these concepts is Pascal Hitzler et al., eds., 2009. OWL 2 Web Ontology Language Primer, a W3C Recommendation, 27 October 2009; see http://www.w3.org/TR/owl2-primer/.
[5] Such as the semantic reasoners FaCT++, Racer, Pellet, HermiT, etc.
[6] Fred Giasson first coined this phrase; see F. Giasson, 2008. “Exploding the Domain: UMBEL Web Services by Zitgist,” blog posting on April 20, 2008; see http://fgiasson.com/blog/index.php/2008/04/20/exploding-the-domain-umbel-web-services-by-zitgist/.
[7] Among many, many references, see a fairly comprehensive discussion of this issue at http://ontologydesignpatterns.org/wiki/Community:Overloading_OWL_sameAs.
[8] This predicate is designed for the circumstance of aligning two different ontologies or knowledge bases based on node-level correspondences, but without entailing the actual ontological relationships and structure of the object source. For example, the umbel:correspondsTo predicate is used to assert close correspondence between UMBEL Reference Concepts and Wikipedia categories or pages, yet without entailing the actual Wikipedia category structure.
[9] Jennifer Sleeman and Tim Finin, 2010. “Learning Co-reference Relations for FOAF Instances,” Proceedings of the Poster and Demonstration Session at the 9th International Semantic Web Conference, November 2010; see http://ebiquity.umbc.edu/_file_directory_/papers/522.pdf.
[10] For example, in the words of Tim Finin of the Ebiquity group:
“The solution we are currently exploring is to define a new property to assert that two RDF instances are co-referential when they are believed to describe the same object in the world. The two RDF descriptions might be incompatible because they are true at different times, or the sources disagree about some of the facts, or any number of reasons, so merging them with owl:sameAs may lead to contradictions. However, virtually merging the descriptions in a co-reference engine is fine — both provide information that is useful in disambiguating future references as well as for many other purposes.”
[11] The same vocabulary construct can be applied to other domain ontologies based on the UMBEL Vocabulary.
[12] The efforts with Wikipedia have been ongoing to a certain degree since the inception of UMBEL. As one example, we have been maintaining a comprehensive tracking of the use of Wikipedia for mapping and semantic technology purposes, called SWEETpedia, for many years. Applying these techniques to both UMBEL and Wikipedia has been most active over the past 18 months.
[13] See the UMBEL Annex G: UMBEL SuperTypes Documentation, which will also be slightly updated upon the new UMBEL v 1.00 release.
[14] See the ‘Use of OpenCyc’ section in the UMBEL Specifications.
[15] These sources and error rates will be detailed in a paper after the pending new release of UMBEL.
[16] Fortunately, returns on the time investment will accelerate since basic lessons and techniques have now been learned.
