Posted: March 30, 2008

Part 1 of 4 on Foundations to UMBEL

UMBEL (Upper-level Mapping and Binding Exchange Layer) is a lightweight ontology for relating Web content and data to a standard set of subject concepts. It is being designed to apply to all types of data on the Web from RSS and Atom feeds to tagging to microformats and topic maps to RDF and OWL (among others). The project Web site is at http://www.umbel.org.

UMBEL was first announced in July 2007 and has been a direct subject of these prior posts:

However, much internal development and refinement has occurred, especially in the past few months [1]. Over the next few days, this posting, a re-introduction to UMBEL, will be followed by these additional parts:

These articles are lead-ins to the discussion of the actual UMBEL ontology that will soon follow.

Purpose

UMBEL has two purposes: 1) to provide a lightweight structure of subject concepts as a reference to what Web content or data “is about,” what is called a concept scheme in SKOS [4]; and 2) to define a variety of binding protocols for different Web data formats to map to this “backbone.” The project’s immediate priority is to create this reference backbone [2]. That is the focus of these current postings.

Think of the backbone as a set of roadsigns to help find related content. UMBEL is like a map of an interstate highway system, a way of getting from one big place to another. Once in the right vicinity, other maps (or ontologies), more akin to detailed street maps, are then necessary to get to specific locations or street addresses.

By definition, these more fine-grained maps are beyond UMBEL’s scope. But UMBEL can help provide the context for placing such detailed maps in relation to one another and in relation to the Big Picture of what related content is about.

These subject concepts also provide the mapping points for the many, many thousands (indeed, millions) of specific named entities that are the notable instances of these subject concepts. Examples might include the names of specific physicists, cities in a country, or a listing of financial stock exchanges. UMBEL mappings enable us to link a given named entity to the various subject classes of which it is a member.

And, because of relationships amongst subject concepts in the backbone, we can also relate that entity to other related entities and concepts. The UMBEL backbone traces the major pathways through the content graph of the Web. For some visualizations of this subject graph, see So, What Might The Web’s Subject Backbone Look Like?
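As a rough illustration of how backbone relationships can relate one entity to others, here is a minimal Python sketch. The concept names, the `BROADER` table and the entity mappings are hypothetical toy data invented for this example; they are not taken from the UMBEL ontology, and the traversal is just one plausible way to use such a structure.

```python
# Hypothetical toy backbone: each subject concept lists its broader concepts.
BROADER = {
    "Physicist": ["Scientist"],
    "Chemist": ["Scientist"],
    "Scientist": ["Person"],
    "City": ["Place"],
    "Person": [],
    "Place": [],
}

# Hypothetical mappings from named entities to the subject concepts
# (classes) of which they are members.
ENTITY_CLASSES = {
    "Albert Einstein": ["Physicist"],
    "Marie Curie": ["Physicist", "Chemist"],
    "Quebec City": ["City"],
}


def ancestors(concept):
    """Return the concept plus every broader concept reachable from it."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(BROADER.get(current, []))
    return seen


def concepts_of(entity):
    """Return all subject concepts an entity belongs to, directly or via the backbone."""
    result = set()
    for concept in ENTITY_CLASSES.get(entity, []):
        result |= ancestors(concept)
    return result


def related_entities(entity):
    """Return other entities that share at least one subject concept with this one."""
    mine = concepts_of(entity)
    return sorted(
        other
        for other in ENTITY_CLASSES
        if other != entity and mine & concepts_of(other)
    )
```

With this toy data, `related_entities("Albert Einstein")` finds "Marie Curie" through the shared "Physicist" concept, while "Quebec City" shares no concept with either person.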

Relation to Linked Data

Today, the actual linkages in Linked Data, the first meaningful expression of the semantic Web, occur almost exclusively via direct sameAs relationships between instances. An easy way to think of one of these notable instances is as the topic of a specific article in Wikipedia. People, places, important historical events, and so forth are all examples of such named entities.

Current Linked Data is therefore useful for linking data for given instances from different data sources (say, for combining political, demographic and mapping information for a geographic place like Quebec City). But these instance-level links lack context and a conceptual framework for inferencing or determining relatedness between concepts or in relation to other instances. For these purposes, Linked Data needs a class structure (Part 2).
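To make the limitation concrete, the following Python sketch merges records that are joined by sameAs links. The identifiers, prefixes and property values are hypothetical placeholders, and the union-find approach is simply one convenient way to group equivalent identifiers; it is not a description of how any Linked Data tool works.

```python
def find(parent, x):
    """Return the representative identifier for x, compressing paths as we go."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Record that identifiers a and b denote the same instance."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def merge_by_same_as(same_as, facts):
    """Combine the properties of all identifiers linked by sameAs pairs."""
    parent = {}
    for a, b in same_as:
        union(parent, a, b)
    merged = {}
    for identifier, properties in facts.items():
        root = find(parent, identifier)
        merged.setdefault(root, {}).update(properties)
    return merged


# Hypothetical sameAs links between three sources describing one place.
SAME_AS = [
    ("geo:QuebecCity", "demo:Quebec_City"),
    ("demo:Quebec_City", "pol:VilleDeQuebec"),
]

# Hypothetical per-source records; the values are placeholders, not real data.
FACTS = {
    "geo:QuebecCity": {"map": "placeholder-map-record"},
    "demo:Quebec_City": {"demographics": "placeholder-census-record"},
    "pol:VilleDeQuebec": {"politics": "placeholder-council-record"},
    "geo:Montreal": {"map": "placeholder-map-record-2"},
}

MERGED = merge_by_same_as(SAME_AS, FACTS)
```

The merge pulls the three Quebec City records together, but nothing in the result says that Quebec City and Montreal are both cities. That kind of relatedness is what a class structure would add.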

As noted, UMBEL’s class structure is based on subject concepts, which are a distinct subset of the more broadly understood notion of a concept [3], such as that used in the SKOS RDFS controlled vocabulary [4], conceptual graphs, formal concept analysis or the very general concepts common to many upper ontologies [5]. We define subject concepts as a special kind of concept: namely, ones that are concrete, subject-related and non-abstract [6].

UMBEL contrasts its subject concepts with abstract concepts and with named entities. Abstract concepts represent abstract or ephemeral notions such as truth, beauty, evil or justice, or are thought constructs useful to organizing or categorizing things but are not readily seen in the experiential world. Named entities are the real things or instances in the world that are themselves natural and notable instances (members) of subject concepts (classes). More detailed discussion of this terminology is provided in Part 3.

Basic Approach

UMBEL thus sets for itself objectives that include an identification of subject concepts and their relationships; a premise of emphasizing representational concepts over unattainable precision or exactitude; and a means for relating any notable thing of the world to that structure. Moreover, meeting these objectives should be based on best systems and practices, informed where possible by social acceptance and consensus.

W-O-W-Y is the shorthand we apply to the semantic framework for meeting these UMBEL objectives. W-O-W-Y is derived from the constituent UMBEL building blocks of WordNet (W) [7], OpenCyc (O), Wikipedia (W) [8] and YAGO (Y) [9]. Each resource contributes in a different way.

Via the WOWY framework, OpenCyc provides the basis for the reference subject backbone (Part 4), WordNet (supplemented by others) provides the “synsets” for relating natural language nouns and phrases to these concepts, and Wikipedia as processed by YAGO (among a growing list of other resources) provides the starting dictionary of relevant named entities important to the Web public.
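The role of synsets in relating natural language terms to subject concepts might be sketched as follows. The `SYNSETS` table and its terms are hypothetical examples made up for illustration; they are not drawn from WordNet or from the UMBEL mappings.

```python
# Hypothetical synsets: each subject concept is associated with a set of
# nouns and phrases that people might use to refer to it.
SYNSETS = {
    "Physicist": ["physicist", "physical scientist"],
    "City": ["city", "metropolis", "urban center"],
}


def build_term_index(synsets):
    """Invert the synsets into a lookup from lowercase term to subject concepts."""
    index = {}
    for concept, terms in synsets.items():
        for term in terms:
            index.setdefault(term.lower(), set()).add(concept)
    return index


def concepts_for(phrase, index):
    """Return the subject concepts a phrase may refer to, ignoring case and padding."""
    return sorted(index.get(phrase.strip().lower(), set()))
```

Because the lookup returns concept identifiers rather than the terms themselves, the same concept can be reached from different wordings, which is the "proxy" idea in miniature.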

The initial UMBEL ontology contains roughly 21,000 subject concepts distilled from OpenCyc and vetted for their relational structure. A further 1.5 million named entities are also currently mapped to that structure.

Coincident with the pending release of the draft UMBEL ontology of these subject concepts will be a multi-volume technical report that details the exact mapping and vetting procedures used.


[1] UMBEL’s editors are Michael Bergman and Frédérick Giasson.
[2] UMBEL’s second purpose for binding mechanisms is not further discussed in this series. That design includes binding mechanisms to work with HTML, tagging or other standard practices, including various RDF schema and more formal ontologies. At minimum, support is intended for Atom, microformats, OPML, OWL, RDF, RDFa, RDF Schema, RSS, tags (via Tag Commons or MOAT or SCOT), and topic maps. Finally, through fulfilling its two purposes, the UMBEL project will also provide a data set registration service, information and collaboration Web site, tools clearinghouse, and support for language translations and some tools development. These latter initiatives are part of the longer-term UMBEL project development plan.
[3] As the term is used in mainstream cognitive science and philosophy of mind, a concept is an abstract idea or a mental symbol, typically associated with a corresponding representation in a language or symbology. Definition is from Wikipedia; see further http://en.wikipedia.org/wiki/Concept.
[4] SKOS stands for Simple Knowledge Organization System; it is a controlled vocabulary based on RDF Schema designed to allow the creation of formal languages to represent thesauri, classification schemes, taxonomies, subject-heading systems, or any other type of structured information. See http://www.w3.org/2004/02/skos/.
[5] Examples of upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc and BFO (Basic Formal Ontology). Most of the content in their upper-levels is akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget’s Thesaurus), though Cyc is a clear exception with its stated emphasis on capturing “common knowledge.”
[6] A subject concept bears some resemblance to dc:subject or foaf:interest in other ontologies. However, unlike those approaches, UMBEL’s subject concepts: 1) come from a reference set to pick from, with synonym-like relationships similar to WordNet synsets; and 2) are not semantically literal descriptions for the terms, but rather “proxies” for the concepts they represent. This referential character for subject concepts makes them readily transferable to multiple human languages.
[7] WordNet is the oldest and largest database of English nouns, adjectives, verbs and adverbs and their linguistic relationships. Terms are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept, which are interlinked by means of conceptual, semantic and lexical relations. The database is the acknowledged authority for use in computational linguistics and natural language processing. UMBEL principally uses WordNet nouns; the most recent version 3.0 of the database has 117,798 unique nouns with 82,115 associated synsets. WordNet is available as open source; it is provided and maintained by the Cognitive Science Laboratory at Princeton University; see http://wordnet.princeton.edu/.
[8] Since about 2005 — and at an accelerating pace — Wikipedia has emerged as the leading online knowledge base for conducting semantic Web and related research; see here for a listing of about 100 academic papers. The system is being tapped for both data and structure. The tremendous growth of content and topics within Wikipedia is well documented (see, as examples, the internal Wikipedia statistics W1, W2, W3, and W4). As of early 2008, Wikipedia had about 2.25 million articles in English and versions in 256 languages and variants.
[9] YAGO (“yet another great ontology”) was developed by Fabian M. Suchanek, Gjergji Kasneci and Gerhard Weikum at the Max Planck Institute for Computer Science in Saarbrücken. A WWW 2007 paper and a more recent and detailed technical report describing the project are available, plus there is an online query point for the YAGO service and the data can be downloaded. YAGO combines Wikipedia and WordNet to derive a rich hierarchical ontology with high precision and recall. The system contains nearly 100 data relations (predicates).