Posted: April 1, 2008

Part 3 of 4 on Foundations to UMBEL

UMBEL (Upper-level Mapping and Binding Exchange Layer) is a lightweight ontology for relating Web content and data to a standard set of subject concepts. It is being designed to apply to all types of data on the Web from RSS and Atom feeds to tagging to microformats and topic maps to RDF and OWL (among others). The project Web site is at http://www.umbel.org.

The first portion and priority for UMBEL is to prepare the lightweight subject concept ontology, the focus of this four-part foundations series. After the UMBEL ontology is released in first draft, the project will then turn to the binding protocols for non-RDF formats.

The previous part in this series discussed at length RDF classes and instances or individuals. We are now tightening these terms down to reflect the specific intents and usage within UMBEL. UMBEL’s main classes categorize subject concepts; notable instances are specifically termed named entities.

UMBEL defines subject concepts as a distinct subset of the more broadly understood notion of a concept [1], such as is used in the SKOS RDFS controlled vocabulary [2], conceptual graphs, formal concept analysis, or the very general concepts common to many upper ontologies [3]. We define subject concepts as a special kind of concept: namely, ones that are concrete, subject-related and non-abstract [4].

UMBEL contrasts subject concepts with abstract concepts and with named entities. Abstract concepts represent abstract or ephemeral notions such as truth, beauty, evil or justice, or are thought constructs useful to organizing or categorizing things but are not readily seen in the experiential world. Named entities are the real things or instances in the world that are themselves natural and notable class members of subject concepts.

Subject Concepts

Subject concepts are a special kind of concept: namely, ones that are concrete, subject-related and non-abstract. Note in other systems or ontologies similar constructs may alternatively be called topics, subjects, concepts or perhaps interests. UMBEL has adopted the term subject concept to distinguish from these uses, which have different nuances of meaning and use, as well as to highlight the subject or topic nature of UMBEL's concrete concepts.

Each subject concept is a class. While subject concepts have a preferred label (using SKOS terminology), they are representative of, or a proxy for, that concept, and are not to be confused with the thing itself. Every UMBEL subject concept can be expressed and referred to by a different preferred label in alternate languages. Indeed, within a given language, different preferred labels may be swapped out without affecting the identity or use of the subject concept itself. The name for a subject concept is therefore merely a handle.
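The separation between a concept's identity and its preferred labels can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the class `SubjectConcept`, the identifier `Automobile`, and the labels are hypothetical and are not taken from the UMBEL specification. The sketch only shows that equality depends on the identifier, while labels vary by language and can be replaced.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SubjectConcept:
    """Illustrative stand-in for a subject concept; not an UMBEL API."""

    identifier: str  # stable handle that carries the concept's identity
    # Language code -> preferred label; excluded from equality on purpose.
    pref_labels: Dict[str, str] = field(default_factory=dict, compare=False)

    def label(self, lang: str, fallback: str = "en") -> str:
        """Return the preferred label for `lang`, else the label in the
        fallback language, else the bare identifier."""
        if lang in self.pref_labels:
            return self.pref_labels[lang]
        return self.pref_labels.get(fallback, self.identifier)


# Hypothetical concept with labels in two languages.
car = SubjectConcept("Automobile", {"en": "automobile", "fr": "voiture"})
same_concept = SubjectConcept("Automobile")

# Swap the English preferred label; the concept's identity is unchanged.
car.pref_labels["en"] = "car"

print(car.label("en"))        # car
print(car.label("fr"))        # voiture
print(car.label("de"))        # no German label, so the English one: car
print(car == same_concept)    # True
```

Keeping labels out of the equality check is the design choice that matters in this sketch: two records with the same identifier denote the same subject concept regardless of how they are displayed.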

Subject concepts are the core constituents of the UMBEL framework. All subject concepts are based on existing concepts in OpenCyc, the open source version of the Cyc knowledge base (see Part 4). About 21,000 of them have been distilled and are part of the UMBEL backbone.

Semsets

Semsets are semantically close terms or phrases that are synonymous, or nearly so, with the meaning of a subject concept or a named entity. Semsets are akin to WordNet synsets or Cyc aliases, but can also include more contemporary jargon or slang as may be drawn from Web tagging or folksonomies. The term semset has been chosen to distinguish this consolidated meaning.

Semsets may apply to either subject concepts or named entities. In the latter case, their use is closer to the sense of an alias (such as nicknames, or "great satan" or "uncle sam" for the "United States").
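One way to picture how a semset might be used is as an inverted index from terms to the concepts or entities they name. The following Python sketch makes several assumptions: the function names, the identifiers `UnitedStates` and `Automobile`, and most of the sample terms are hypothetical, and the normalization rule (case-folding plus whitespace collapsing) is just one plausible choice rather than anything UMBEL prescribes.

```python
from collections import defaultdict


def normalize(term):
    """Case-fold and collapse whitespace so 'Uncle  Sam' matches 'uncle sam'."""
    return " ".join(term.casefold().split())


def build_semset_index(semsets):
    """Invert {identifier: [terms]} into {normalized term: {identifiers}}."""
    index = defaultdict(set)
    for identifier, terms in semsets.items():
        for term in terms:
            index[normalize(term)].add(identifier)
    return index


def lookup(index, term):
    """Return every identifier whose semset contains `term` (possibly none)."""
    return set(index.get(normalize(term), ()))


# Hypothetical semsets; the identifiers are made up for illustration.
semsets = {
    "UnitedStates": ["United States", "Uncle Sam", "Great Satan", "USA"],
    "Automobile": ["automobile", "car", "motorcar", "auto"],
}
index = build_semset_index(semsets)

print(lookup(index, "uncle sam"))   # {'UnitedStates'}
print(lookup(index, "Car"))         # {'Automobile'}
print(lookup(index, "bicycle"))     # set()
```

The lookup returns a set rather than a single identifier because slang and aliases are often ambiguous, so one term may legitimately belong to more than one semset.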

Abstract Concepts

Abstract concepts represent abstract or ephemeral notions such as truth, beauty, evil or justice, or are thought constructs useful to organizing or categorizing things but are not readily seen in the experiential world. They are included in the UMBEL specification because they help maintain the integrity of the UMBEL subject concept graph.

Like subject concepts, abstract concepts are based strictly on those already in OpenCyc. Abstract concepts may be viewed in the UMBEL graph, and may be used for ontology mapping, but are not generally displayed when doing standard content mapping or concept look-ups via Web services. For various domain extraction or relatedness determinations, abstract concepts may be excluded from UMBEL's internal processing.
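The idea that abstract concepts stay in the graph but are hidden from standard look-ups can be sketched as a traversal followed by a filter. This is only an illustration under stated assumptions: the function names, the `broader` mapping, and the concepts `SportsCar`, `Automobile` and `RoadVehicle` are hypothetical, and UMBEL's actual Web services may work quite differently. `PartiallyTangibleThing` is used because it is the example of an abstract concept given in this article.

```python
def ancestors(concept, broader):
    """Collect every concept reachable by following `broader` links upward."""
    seen = set()
    stack = [concept]
    while stack:
        node = stack.pop()
        for parent in broader.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def visible_ancestors(concept, broader, abstract):
    """Walk the full graph, then hide abstract concepts from the result."""
    return ancestors(concept, broader) - abstract


# Hypothetical fragment of a subsumption graph: child -> broader concepts.
broader = {
    "SportsCar": ["Automobile"],
    "Automobile": ["RoadVehicle"],
    "RoadVehicle": ["PartiallyTangibleThing"],
}
abstract = {"PartiallyTangibleThing"}

print(sorted(ancestors("SportsCar", broader)))
# ['Automobile', 'PartiallyTangibleThing', 'RoadVehicle']
print(sorted(visible_ancestors("SportsCar", broader, abstract)))
# ['Automobile', 'RoadVehicle']
```

The traversal deliberately runs over the complete graph, abstract nodes included, and filters only at the end, so that hiding a concept from display never disconnects the structure it helps hold together.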

Named Entities

Named entities are the real things or instances in the world that are themselves natural and notable class members of subject concepts. The initial named entities are drawn from Wikipedia, as processed via YAGO, and from other online fact-based repositories. Named entities, in the standard definition of the term [5], are the instances of the subject concepts.

Named entities and the sources for them are also a major avenue for growth and expansion of UMBEL moving forward. Named entities are more contemporary and changing, while the reference subject concept backbone is more fixed and stable.

Each named entity is mapped to a governing subject concept for ontology purposes. There are no relations between named entities except as mediated through one or more subject concepts. As noted, named entities may also have semset aliases.
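The rule that named entities relate to one another only through subject concepts can be sketched in a few lines of Python. The function name `related_entities`, the mapping `governing`, the concept identifiers, and the entities other than Ford Motor Company are hypothetical and chosen purely for illustration.

```python
def related_entities(entity, governing):
    """Return entities that share at least one subject concept with `entity`.

    `governing` maps each named entity to the set of subject concepts it
    belongs to. No entity-to-entity links are stored anywhere.
    """
    concepts = governing.get(entity, set())
    return {
        other
        for other, other_concepts in governing.items()
        if other != entity and concepts & other_concepts
    }


# Hypothetical assignments of named entities to governing subject concepts.
governing = {
    "Ford Motor Company": {"AutomobileManufacturer"},
    "General Motors": {"AutomobileManufacturer"},
    "Henry Ford": {"Person"},
}

print(related_entities("Ford Motor Company", governing))  # {'General Motors'}
print(related_entities("Henry Ford", governing))          # set()
```

Because the only stored relation is entity to subject concept, new named entities can be added without touching any existing entity, which fits the article's picture of a stable backbone with a fluid, growing population of instances.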

Subject Concepts v. Abstract Concepts

The following table helps draw the distinction between subject concepts and abstract concepts. Technical documentation at the time of the UMBEL ontology release will list the 520 or so abstract concepts presently within UMBEL. Looking at those can help draw the distinction.

Subject Concepts:
  • Nouns or noun phrases
  • These are concrete kinds of things or ideas in the real world
  • Broad, collective, reference concepts, often hierarchically related
  • Similar to “topics” or “subjects”, these other terms are used in somewhat different ways in alternative schemas
  • Collections or classes of like “kinds” of items
  • Quite stable in scope, breadth and structure
  • Grounded in the OpenCyc knowledge base, which is the source of its relationships and graph structure
  • Named entities are members of subject concepts

Abstract Concepts:
  • These are either: 1) abstract (truth, beauty, evil) concepts, or 2) artificial thought constructs for organizing things but not encountered as standalone concepts in their own right (e.g., PartiallyTangibleThing)
  • Collections or classes of like “kinds” of items
  • Class members may be either other abstract concepts or subject concepts
  • Class members are never named entities
  • Tend to reside higher in the subsumption structure
  • Generally hidden from the UMBEL subject concept reference “backbone” structure
  • May be used for ontology mapping purposes
  • Grounded in the OpenCyc knowledge base, which is the source of its relationships and graph structure

Subject Concepts v. Named Entities

The following table helps draw the distinction between subject concepts and named entities. Technical documentation at the time of the UMBEL ontology release will describe certain "gray" categories and the determination as to whether they should be treated as one or the other.

For example, most geographical places clearly belong to the named entity category. But, on somewhat arbitrary grounds, all nations, countries, states and provinces were assigned as subject concepts so that they would act as classes with other entities mapped to them. It should also be noted that entities or concepts in the gray zone may be treated as both a named entity and a subject concept.

Subject Concepts:
  • Broad, collective, reference concepts. In a hierarchical category structure, subject concepts represent the “root” or “branch” nodes
  • Nouns or noun phrases
  • Called “subject concepts” (or sometimes, as a shorthand, “concepts”). These are similar to “topics” or “subjects”, but those terms are used in somewhat different ways in alternative schemas and are therefore not used interchangeably here
  • These are not abstract (truth, beauty, evil) concepts, but concrete kinds of things or ideas in the real world; abstract concepts are often properly part of what are known as “upper ontologies” but they are not applicable for UMBEL’s purposes
  • Collections or classes of like “kinds” of items
  • Quite stable in scope, breadth and structure
  • Grounded in the OpenCyc knowledge base, which is the source of its relationships and graph structure
  • Basis for the UMBEL subject concept reference “backbone” structure
  • Named entities are members of subject concepts

Named Entities:
  • Atomic, specific objects, often famous or well-known, that belong to reference “types” such as persons, places, organizations, events, products, time intervals, etc. In a hierarchical category structure, named entities represent the “leaves”
  • Nouns or noun phrases
  • Called “named entities”, not entities alone, to prevent confusion with other general senses of the term “entity” and in keeping with named entity recognition (NER)
  • Very concrete, atomic entities
  • The number and scope is fluid and growing, and potentially of huge size as specific objects are named
  • Often expressed as a proper noun (with some capitalization), but not necessarily so. Common animal, plant, object and substance names also can be named entities
  • Major sources are Wikipedia (YAGO) and similar sources such as Wikinvest, Wikicompanies, etc.
  • Named entities are maintained and treated separately from the UMBEL subject concept ontology
  • Every named entity belongs to at least one subject concept

Though there are shades of gray between subject concepts and named entities, we have found this distinction to be a powerful means for gaining clarity in UMBEL’s design. It provides a clean path for keeping an ontology lightweight while in essence providing infinite extensibility for all manner of named entities and the datasources that contain them. Moreover, the ability to classify named entities into types orthogonal to subject concepts also provides useful guidance for presentation templates that may be automatically invoked in data meshups. But, that is a topic for another day. :)

This is Part 3 of 4 on the foundations to UMBEL. This four-part series covers a Re-Introduction to UMBEL, UMBEL: Making Linked Data Classy, Subject Concepts and Named Entities, and Basing UMBEL’s Backbone on OpenCyc. These articles are lead-ins to the discussion of the actual UMBEL ontology that will soon follow.

[1] As the term is used in mainstream cognitive science and philosophy of mind, a concept is an abstract idea or a mental symbol, typically associated with a corresponding representation in a language or symbology. Definition is from Wikipedia; see further http://en.wikipedia.org/wiki/Concept.
[2] SKOS stands for Simple Knowledge Organization System; it is a controlled vocabulary based on RDF Schema designed to allow the creation of formal languages to represent thesauri, classification schemes, taxonomies, subject-heading systems, or any other type of structured information. See http://www.w3.org/2004/02/skos/.
[3] Examples of upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc and BFO (Basic Formal Ontology). Most of the content in their upper-levels is akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget's Thesaurus), though Cyc is a clear exception with its stated emphasis on capturing “common knowledge.”
[4] A subject concept bears some resemblance to dc:subject or foaf:interest in other ontologies. However, unlike those approaches, UMBEL subject concepts: 1) come from a reference set of subject concepts to pick from, with synonym-like relationships similar to WordNet synsets; and 2) are not semantically literal descriptions for the terms, but rather "proxies" for the concepts they represent. This referential character of subject concepts makes them readily transferable to multiple human languages.
[5] In a named entity, the word named applies to entities that have a "rigid designator", as defined by Kripke, for the referent. For instance, the automotive company created by Henry Ford in 1903 is referred to as Ford or Ford Motor Company. Rigid designators include proper names as well as certain natural kind terms like biological species and substances. The BBN categories, proposed in 2002, consist of 29 types and 64 subtypes; Sekine’s extended hierarchy, also proposed in 2002, is made up of 200 subtypes. We use Sekine (http://nlp.cs.nyu.edu/ene/version6_1_0eng.html) as our guide. For example, Sekine's top 15 named entity classes are: Name_Other, Person, Organization, Location, Facility, Product, Event, Natural_Object, Title, Unit, Vocation, Disease, God, Id_Number and Color; the remaining types are subsumed under these. See further http://en.wikipedia.org/wiki/Named_entity_recognition. Generally, named entities are the instances of UMBEL classes.