Posted: May 11, 2008

Unattributed cover drawing from The Economist, August 4, 2007; see http://www.economist.com/

Squeezed Between Two World Views at the Infocline

I remember one of my formative jobs at the American Public Power Association when we were running all of APPA’s technical activities. While mostly a lobbying outfit, we were after all in Washington, DC, and were perceived by many to be near some nexus of power and influence. And, because my group did technical stuff, we were a natural magnet for some inventors seeking an edge. And, believe me, some of those folks were real crackpots.

We’d hear claims that this person or that had invented radar, or perpetual energy machines, or bendable concrete, and, once, even, cold fusion. Hehe.

In monitoring various ontology and semantic Web mailing lists, claims sometimes arise about the “ontology of everything” or the single, universal ontology that cures cancer or walks on one leg. It makes me smile and think about those past wild claims about radar or perpetual energy.

Some, I believe, when we mention the UMBEL lightweight subject concept reference structure, conjure up similar visions of a universal “ontology of everything”. That is wrong and not our intent. But, UMBEL is trying to straddle two different worlds and world views, and that can often lead to misunderstandings and misperceptions.

This posting is not the first and surely will not be the last on the subject, but it is worthwhile again to try to explain the role of UMBEL from these different angles and with slightly different analogies.

UMBEL is an Infocline

In prior posts, we have described UMBEL as a backbone, as a roadmap to related content, as a lightweight ontology or a lightweight reference structure, and as middleware. In this post, I am going to concentrate on its role as middleware, in its role as residing at the infocline between two different worlds and world views.

The Greek base -cline is often applied to gradual transition layers or changes in gradients or slope. A thermocline, for example, represents the layer between the deep and surface ocean. While there is mixing across this layer, it is slower than within the two parts that it separates. Both parts and the thermocline layer itself have quite different properties and temperatures, even though all are ocean and salty water.

The UMBEL infocline acts in a similar manner. On one side of the UMBEL layer is the Cyc knowledge base, with its self-contained, more-or-less closed world of higher order logics, microtheories regarding thousands of knowledge domains, rich predicates, and coherence. It is venerable, solid and proven, but with its own language and world view. Its purpose is also directed to reasoning and inference, driven from a foundation of (generally not codified outside of Cyc) common sense. It was designed well in advance of the creation of RDF or OWL, indeed in advance of the Internet and Web itself.

On the other side of the UMBEL infocline is the entire Web. This is a chaotic, decentralized, distributed knowledge environment representing untold numbers of world views. The specifications of the semantic Web and its languages and vocabularies have been designed expressly with these differences in mind, and with the means and structures to link and interrelate them. The Web environment, though not exactly incoherent, is also not coherent in its ground state. Indeed, it is the very purpose of existing semantic Web standards and UMBEL to help provide that coherence.

A key aspect of the Web is its “open world” assumption, defined in SKOS [1] as:

"RDF and OWL Full are designed for systems in which data may be widely distributed ( e.g., the Web). As such a system becomes larger, it becomes both impractical and virtually impossible to "know" where all of the data in the system is located. Therefore, one cannot generally assume that data obtained from such a system is "complete", i.e., if some data appears to be "missing", one has to assume, in general, that the data might exist somewhere else in the system. This assumption, roughly speaking, is known as the "open world" assumption."

What this means for the Web is that we must assume that the system’s knowledge is incomplete, and that if a statement cannot be inferred from what is explicitly expressed, we still cannot infer it to be false. Adding new information never falsifies a previous conclusion, and most of what we can know about the world will remain unknown. Cyc, on the other hand, can make closed-world assumptions under appropriate conditions.
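The difference between the two assumptions can be illustrated with a minimal Python sketch. The toy fact set and the function names `ask_closed_world` and `ask_open_world` are hypothetical, invented here for illustration; they are not part of Cyc, RDF, OWL or any UMBEL tooling, and the sketch does simple lookup rather than real inference.

```python
from typing import Optional

# A toy set of explicitly asserted (subject, predicate, object) statements.
# These facts are illustrative assumptions, not drawn from Cyc or UMBEL.
KNOWN_FACTS = {
    ("Dog", "subClassOf", "Mammal"),
    ("Mammal", "subClassOf", "Animal"),
}


def ask_closed_world(fact: tuple, facts: set) -> bool:
    """Closed world: anything not asserted is treated as false."""
    return fact in facts


def ask_open_world(fact: tuple, facts: set) -> Optional[bool]:
    """Open world: anything not asserted is unknown (None), never false."""
    return True if fact in facts else None


query = ("Whale", "subClassOf", "Mammal")
print(ask_closed_world(query, KNOWN_FACTS))  # False: absence is read as falsity
print(ask_open_world(query, KNOWN_FACTS))    # None: absence is read as "unknown"

# New information extends what is known without retracting earlier answers.
more_facts = KNOWN_FACTS | {query}
print(ask_open_world(query, more_facts))     # True
```

Because the open-world answer to a missing statement is "unknown" rather than "false", later additions can only turn unknowns into knowns; nothing previously concluded has to be withdrawn.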

UMBEL thus must act as a mediator, or middleware, in its role as the interface between these world views. Contemplating or transiting this infocline layer can lead to tension and turbulence.

The Cyc Reference Knowledge Base

The central purpose of UMBEL is to provide a context for relating information. Once such a purpose for context is embraced, the natural next question is: And what shall be the basis for this context?

A previous post discussed why Cyc was chosen over alternatives as this contextual basis. Ultimately, the reasons for choosing Cyc come down to real practical tools and capabilities such as helping to disambiguate the identities of named entities, mapping ontologies and schema, doing natural language processing, and the sheer provenness of the concept relationships that are at the core of UMBEL.

(Also, as noted many times, others could just as reasonably choose other bases for providing context. The important point, again, is to provide some context over no context.)

We can view the Cyc knowledge base as a complete, albeit large, world unto itself. Like the Earth, it is complex and varied and self-contained. It has its own atmosphere and perspective on the broader universe:

But, like other planets or celestial bodies, Cyc is a world, not the world. There are many different possible worlds with different atmospheres and gravities and temperatures and compositions. And, of course, Cyc is not a physical world at all, but a conceptual “world” representing knowledge and its relationships. We will, however, represent it with the Earth image below.

There are 25 years and perhaps close to 300 person-years of development behind Cyc. It has thousands of able practitioners around the world and has been used in hundreds of meaningful projects and engagements. Since its release in 2002, there have been well in excess of 100,000 downloads of its open-source OpenCyc version.[2]

This legacy and history lead to distinct functional and terminology differences from current semantic Web perspectives. For example, the richness of Cyc predicates does not lead to simple mappings to existing OWL and RDFS properties. The notion of class differs from its closest analog in Cyc, the ‘collection’. The concept and treatment of individuals and types are different. The 1000 or so microtheory domains in Cyc are not easily transferred or mapped to OWL constructs. Cyc uses reification aggressively to functionally combine concepts from constituent elements, such as “apple tree”. Higher-order logic is not transferable in all cases to the first-order logic (FOL) of the semantic Web. And so forth . . . .

Perhaps most important, however, is that Cyc has been designed, built and extended by professional ontologists and related researchers. This brings a degree of consistency and quality control that Web-broad initiatives cannot hope to approach.

In our working with Cyc there has been nothing but good will and professionalism from the staff at Cycorp and the Cyc Foundation. But, there are clearly times when world view and terminology can differ, sometimes leading to translation problems and issues. Moreover, attempts to bridge from the Cyc world to the open world assumptions of the general Web mean the translation is “lossy”, much like what happens in moving from a palette of 16 million 24-bit colors to something less.

Cleaning Cyc

Here are some statistics showing the relative size and scope of the ResearchCyc and OpenCyc versions of Cyc, current as of the last official distributions [3]:

| Category | OpenCyc | ResearchCyc |
|---|---|---|
| Reified Terms (Constants and NARTs) | 263,332 | 303,340 |
| Assertions | 2,040,330 | 2,964,161 |
| Deductions | 323,751 | 1,305,354 |
| Unique Predicates (Properties) [2] | 17 (OWL) | ~16,000 |
| Disk Storage (KB) [4] | 495,104 | 566,400 |

The sheer size and sophistication of either version is too great for easy comprehension and linkage by standard Web resources. Thus, the UMBEL project set out to determine and derive the most fundamental concepts from within OpenCyc. What was desired was a tractable set of subject concept “hub” nodes from within OpenCyc. A further design criterion was to maintain a 100% consistency with OpenCyc for this subset of subject concepts in order for UMBEL to preserve linkage into the Cyc knowledge base.

A subsequent post will relate in detail the nine-month (and continuing!) vetting and extraction process applied to Cyc, the result of which is currently the identification of about 21,000 subject concepts. These are schematically illustrated by the yellow dots on our Cyc Earth representation:

Of course, these yellow dots are not really physical locations on a globe. Rather, they represent important “hub” locations within the virtual Cyc knowledge “space”.

UMBEL: A Lightweight Skein

Once removed from the broader knowledge base, we now have a simple skein of these 21,000 subject concepts and their interrelations. We can show this lightweight structure as a ball of subject concept nodes (in red) connected to one another via their graph edges. We can represent this lightweight skein as follows, which has similarities to a hairnet with the nodes represented by the knots (in red) in the net:

This simplistic wireframe representation has been presented before for all 21,000 UMBEL nodes via the Cytoscape graph visualization software (see figure).

The following table shows that the overall size and complexity of Cyc has been reduced by 1-2 orders of magnitude through this cleaning exercise, resulting in a lightweight UMBEL structure about 5-10% of the original size:

| Category | OpenCyc | ResearchCyc | UMBEL |
|---|---|---|---|
| Terms or Concepts | 263,332 | 303,340 | 21,057 |
| Assertions | 2,040,330 | 2,964,161 | 285,700 |
| Deductions | 323,751 | 1,305,354 | |
| Unique Predicates (Properties) [2,5] | 17 (OWL) | ~16,000 | 18 |
| Disk Storage (KB) [4] | 495,104 | 566,400 | 14,445 |

Very striking is the predicate reduction, which is both a key source of “lossiness” and a challenge in maintaining a meaningful OWL and RDFS correspondence with the original Cyc. However, since the purpose of UMBEL is context and not reasoning or inference, this reduction is appropriate and understandable.
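For readers who want to check the scale of the reduction themselves, here is a small Python sketch that computes UMBEL's share of each Cyc baseline. The numbers are copied from the table above; the dictionary layout and the helper name `umbel_share` are illustrative assumptions, not project tooling.

```python
# Figures copied from the comparison table; the UMBEL column has no
# deductions entry, so that row is omitted here.
SIZES = {
    "Terms or Concepts": {"OpenCyc": 263_332, "ResearchCyc": 303_340, "UMBEL": 21_057},
    "Assertions": {"OpenCyc": 2_040_330, "ResearchCyc": 2_964_161, "UMBEL": 285_700},
    "Disk Storage (KB)": {"OpenCyc": 495_104, "ResearchCyc": 566_400, "UMBEL": 14_445},
}


def umbel_share(category: str, baseline: str) -> float:
    """Return UMBEL's size as a percentage of the named Cyc baseline."""
    row = SIZES[category]
    return 100.0 * row["UMBEL"] / row[baseline]


for category in SIZES:
    for baseline in ("OpenCyc", "ResearchCyc"):
        share = umbel_share(category, baseline)
        print(f"{category:20s} vs {baseline:12s}: {share:5.1f}%")
```

By these figures, the concept count lands at roughly 7 to 8 percent of either Cyc version, assertions at roughly 10 to 14 percent, and disk storage at under 3 percent.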

The ‘Hairnet Over the Basketball’

Metaphorically, we can now re-apply this UMBEL skein over the Cyc knowledge base. We have described this visual metaphor as the “hairnet over the basketball,” with UMBEL being the hairnet, and Cyc (Earth) the basketball:

Note that the UMBEL skein can be used either fully independently of the underlying Cyc structure or in combination with it.

21,000 Docking Ports for an Open World

This UMBEL lightweight skein or wireframe structure now is ready to act as middleware, to play its role as an infocline. Each of UMBEL’s 21,000 subject concepts is, in effect, a “docking port” to which external Web data can “attach”. Once attached, this data can then be related to other Web data via the subject concept relationships in the UMBEL skein. This docking and attachment mechanism can be visualized as follows:

If you mentally remove the Earth figure (Cyc) above, the UMBEL skein acts solely as a context reference structure for other Web data through its lightweight SKOS taxonomy structure (narrowerTransitive and broaderTransitive). These are the internal edge relationships of the wireframe structure with the red nodes above.

Though lightweight, this structure is surprisingly powerful in that it also enables tie-ins with external ontology classes (what Fred Giasson has called ‘exploding the domain’) and provides a reference context for Web data. Without these docking ports via UMBEL’s subject concepts, there is no contextual frame of reference and these Web data bits essentially tumble aimlessly in a dark knowledge space.[6]
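The docking idea can be sketched in a few lines of Python. Everything named here is a hypothetical stand-in: the four-concept skein, the `example.org` URLs, and the helpers `broader_transitive` and `data_under` are assumptions for illustration, not actual UMBEL identifiers or services. The sketch only shows how following broader links transitively lets two unrelated pieces of Web data share a context.

```python
# Hypothetical mini-skein: each subject concept maps to its direct broader concepts.
BROADER = {
    "Dog": {"Mammal"},
    "Whale": {"Mammal"},
    "Mammal": {"Animal"},
    "Animal": set(),
}

# External Web data "docked" to subject concepts (placeholder URLs).
DOCKED = {
    "Dog": ["http://example.org/data/rex"],
    "Whale": ["http://example.org/data/moby"],
}


def broader_transitive(concept: str) -> set:
    """All ancestors of a concept, following broader links transitively."""
    seen = set()
    stack = list(BROADER.get(concept, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(BROADER.get(current, ()))
    return seen


def data_under(concept: str) -> list:
    """Docked items whose port is the concept itself or any narrower concept."""
    items = []
    for port, urls in DOCKED.items():
        if port == concept or concept in broader_transitive(port):
            items.extend(urls)
    return sorted(items)


print(broader_transitive("Dog"))  # ancestors of Dog in the toy skein
print(data_under("Mammal"))       # both docked items share the Mammal context
```

Neither docked item mentions the other, yet both become reachable from ‘Mammal’ purely through the skein's own edges, which is the sense in which the subject concepts supply context.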

But one need not stop at the infocline wireframe layer of UMBEL. Because each subject concept (“docking port”) has a direct correspondence to Cyc, we can dive more deeply into the Cyc knowledge environment. First through OpenCyc and then (via licensing or other arrangements) into ResearchCyc or the full Cyc, another dimension of tools and capabilities can become available. We now have backup and support to assess mappings and assignments and inferences and reasoning.

Will everyone want such capabilities? Most will not.

But it also surely does not hurt to have these value-added pathways so readily available for use and exploitation.

Some Context is Better than No Context at All

Some perhaps in the Cyc community may look at this picture and say, Whoa! We’re giving Web denizens loaded Cyc guns via the UMBEL infocline to harm themselves and others.

Perhaps so. But this is also why we have courses on firearms safety and practice ranges for gaining the experience. Ontology mapping of any nature in an open world requires attention and skill to maintain quality.

Open world circumstances have already exposed challenges with sameAs assignments, and these will certainly be exacerbated as we extend to class mappings in ontologies and inferencing. Quality and provenance will assert their prominence. Who do you trust and who is capable? But haven’t these always been operative questions?

Some perhaps in the broader Web community may go, Whoa! We are free and independent actors who hate any sniff of possible centralized Big Brother crap. Why UMBEL? Why Cyc? I want to free-form tag and twitter to my heart’s content.

OK, well, sure. But how can the Web of data meaningfully expand without reference points, structure and context? Though we may have foundational semantic Web standards in place, if we are going to meaningfully inter-relate data, we also need context and semantics.

UMBEL and Cyc offer one set of contexts, semantics and tools. Whether they are the best or not is a matter for the market to decide. But I think it will rapidly become clear that future Linked Data that is published without context will remain largely unused data. The question now going forward is not the rejection of context but deciding what contextual frameworks work better, are easy to implement, and are readily understood.

So, I think the game has changed and I’d like to believe for the better. UMBEL has placed a marker down — and it’s smack dab in the middle.

Yes I’m stuck in the middle with you,
And I’m wondering what it is I should do,
It’s so hard to keep this smile from my face,
And knowledge, yeah, is all over the place,
Cyc to the left of me, Open Web to the right,
Here I am, stuck in the middle with you. [7]


[1] W3C, SKOS Simple Knowledge Organization System Reference, W3C Working Draft, 25 January 2008; see http://www.w3.org/TR/2008/WD-skos-reference-20080125/#L881.
[2] See Priming the Pump and Threshold Conditions for the ResearchCyc estimate; the OpenCyc predicate count is from the OWL distribution version; the standard OpenCyc predicate count is not calculated.
[3] Obtained by running the CycL query, (kb-statistics), against a local instance of the distribution. The OpenCyc version is 5006; the ResearchCyc version 7117.
[4] Only the distribution size for the World model is reported; there are additional executables and supporting files not included.
[5] For UMBEL, 9 of the 18 are new properties, the rest are from existing OWL and RDFS vocabularies. These predicates include: type (RDF); subClassOf (RDFS); equivalentClass (OWL Full); language (DC, including all lingvoj instances); prefLabel, altLabel, definition, broaderTransitive, and narrowerTransitive (SKOS; also, there are some SKOS notes properties not listed); and the new properties of superClassOf, hasSemset, isAligned, withOverlap, linksConcept, isAbout, linksEntity, isLike, and withLikelihood (UMBEL). The forthcoming UMBEL technical documentation will explain this vocabulary in detail.
[6] While the current and common use of the sameAs relation provides linkage between instances or named entity identities in various datasets, this predicate does nothing to orient or provide frames of reference for the datasets themselves.
[7] Stuck in the Middle with You, with all due respect to Bob Dylan and Stealers Wheel.

Posted by AI3's author, Mike Bergman, on May 11, 2008 at 4:16 pm in Adaptive Information, Semantic Web, Structured Web, UMBEL
The URI link reference to this post is: http://www.mkbergman.com/441/the-role-of-umbel-stuck-in-the-middle-with-you/
Posted: April 27, 2008

“UMBEL’s Eleven” overviews the project’s first 11 semantic Web services and online demos. The brief slideshow has been posted to Slideshare.


UMBEL (Upper-level Mapping and Binding Exchange Layer) is a lightweight reference structure for placing Web content, named entities and data in context with other data. It is comprised of about 21,000 subject concepts and their relationships — with one another and with external vocabularies and named entities.

Recent postings by Fred Giasson and by me discussed these Web services in a bit more detail.

Posted: April 20, 2008

There’s Some Cool Tools in this Box of Crackerjacks

UMBEL is today releasing a new sandbox for its first iteration of Web services. The site is being hosted by Zitgist. All are welcome to visit and play.

And, UMBEL is What, Again?

UMBEL (Upper-level Mapping and Binding Exchange Layer) is a lightweight reference structure for placing Web content and data in context with other data. It is comprised of about 21,000 subject concepts and their relationships — with one another and with external vocabularies and named entities.

Each UMBEL subject concept represents a defined reference point for asserting what a given chunk of content is about. These fixed hubs enable similar content to be aggregated and then placed into context with other content. These subject context hubs also provide the aggregation points for tying in their class members, the named entities which are the people, places, events, and other specific things of the world.

The backbone to UMBEL is the relationships amongst these subject concepts. It is this backbone that provides the contextual graph for inter-relating content. UMBEL’s subject concepts and their relationships are derived from the OpenCyc version of the Cyc knowledge base.

The UMBEL ontology is based on RDF and written in the RDF Schema vocabulary of SKOS (Simple Knowledge Organization System) with some OWL Full constructs to aid interoperability.

UMBEL’s backbone is also a reference structure for more specific domains or ontologies, thereby enabling further context for inter-relating additional content. Much of the sandbox shows these external relationships.

UMBEL’s Eleven

This first set of Web services provides online demo sandboxes, descriptions of what they are about, and their API documentation. The first 11 services are:

A CLASSy Detailed Report

The single service that provides the best insight to what UMBEL is all about is the Subject Concept Detailed Report. (That is probably because this service is itself an amalgam of some of the others.)

Starting from a single concept amongst the 21,000, in this case ‘Mammal’, we can get descriptions or definitions (the proper basis for making semantic relationships, not the ‘Mammal’ label), aliases and semsets, equivalent classes (in OWL terms), named entities (for leaf concepts), more general or specific external classes, and domain and range relationships with other ontologies. Here is the sample report for ‘Mammal’:

The discerning eye likely observes that while there is a rich set of relationships to the internal UMBEL subject concepts, coverage is still light for external classes and named entities. This sandbox is, after all, a first release and we are early in the mapping process. :)

But, it should also start to become clear that the ability of this structure to map and tie in all forms of external concepts and class structures is phenomenal. Once such class relationships are mapped (to date, most other Linked Data only occurs at the instance level), all external relationships and properties can be inherited as well. And, vice versa.

So, for aficionados of the network effect, stand back! You ain’t seen nothing yet. If we have seen amazing emergent properties arising from the people and documents on the Web, with data we move to another quantum level, like moving from organisms to cells. The leverage of such concept and class structures to provide coherence to atomic data is literally primed to explode.

Bloomin’ Concepts!

To put it mildly, trying to get one’s mind around the idea of 21,000 concepts and all of their relationships and all of their possible tie in points and mappings to still further ontologies and all of their interactions with named entities and all of their various levels of aggregation or abstraction and all of their possible translations into other languages or all of their contextual descriptions or all of their aliases or synonyms or all of their clusterings or all of their spatial relationships or all of the still more detailed relationships and instances in specific domains or, well, whew! You get the idea.

It is all pretty complex and hard to grasp.

One great way to wrap one’s mind around such scope is through interactive visualization. The first UMBEL service to provide this type of view is the Subject Concept Explorer, a screenshot of which is shown here:

But really, to gain the true feel, go to the service and explore for yourself. It feels like snorkeling through those schools of billions of tiny silver fish. Very cool!

These amazing visualizations are being brought to us by Moritz Stefaner, imho one of the best visualization and Flash gurus around. We will be showcasing more about Moritz’s unbelievable work in some forthcoming posts, where some even cooler goodies will be on display. His work is also on display at a couple of other sites that you can spend hours drooling over. Thanks, Moritz!

Missing Endpoints and Next Steps

You should note that developer access to the actual endpoints and external exposure of the subject concepts as Linked Data are not yet available. The endpoints, Linked Data and further technical documentation will be forthcoming shortly.

The currently displayed services and demos provided on this UMBEL Web services site are a sandbox for where the project is going. The next releases will soon provide the following as open source under an attribution license:

  • The formal UMBEL ontology written in OWL Full and SKOS
  • Technical documentation for the ontology and its use and extension
  • Freely accessible Web services according to the documentation already provided
  • Technical documentation and reports for the derivation of the subject concepts from OpenCyc and the creation and extension of semsets and named entities related to that structure.

When we hit full stride, we expect to be releasing still further new Web services on a frequent basis.

BTW, for more technical details on this current release, see Fred Giasson’s accompanying post. Fred is the magician who has brought much of this forward.

Posted: April 2, 2008

Part 4 of 4 on Foundations to UMBEL

Just as DBpedia has provided the nucleating point for linking instance data (see Part 2), UMBEL is designed to provide a similar reference structure for concepts. These concepts provide some fixed positions in space to which other sources can link and relate. And, like references for instance data, the existence of reference concepts can greatly diminish the number of links necessary in the Linked Data environment.
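One rough way to see why shared reference points cut down on links is to compare two idealized layouts: every dataset linking directly to every other dataset, versus every dataset linking once to a common reference hub. This hub-and-spoke reading is an assumption made for illustration, not a calculation from the UMBEL project, and the function names below are hypothetical.

```python
def pairwise_links(n: int) -> int:
    """Links needed if every dataset links directly to every other one."""
    return n * (n - 1) // 2


def hub_links(n: int) -> int:
    """Links needed if every dataset links once to a shared reference hub."""
    return n


for n in (10, 100, 1000):
    print(f"{n:5d} datasets: {pairwise_links(n):7d} pairwise vs {hub_links(n):5d} via hub")
```

Under these assumptions the pairwise count grows quadratically while the hub count grows linearly, so the gap widens quickly as more sources join.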

Clearly, the combination of the representativeness of UMBEL’s subject concepts (the “scope” of the ontology) and their relationships (the “structure” of the backbone) is fundamental. These factors in turn express the functional capabilities of the system.

First Things First

The first fundamental point deserving emphasis is that a reference structure of almost any nature has value. We can argue later about what is the best reference structure, but the first task is to just get one in place and begin bootstrapping. Indeed, over time, it is likely that a few reference structures will emerge and compete and get supplemented by still further structures. This evolution is expected and natural and desirable in that it provides choice and options.

A reference structure of concepts has the further benefit of providing a logical reference structure for instances as well. While Wikipedia is perhaps the most comprehensive collection of humanity-wide instances, no single source can or will be complete in scope. Thus, we foresee specialty sources ranging from the companies in Wikicompany to plants and animals in the Encyclopedia of Life or thousands of other rich instance sources also acting as reference hubs.

How does each of these rich instance sources relate to the others? What is the subject concept or topical basis by which they overlap or complement? What is the framework and graph structure of knowledge to give this information context? These are the benefits brought by a structure of reference concepts, independent from the specifics of the reference structure itself.

Another key consideration is that broad-scale acceptance is important. An express purpose of UMBEL is to aid the interconnection of related content using broadly accepted foundations.

Alternative Approaches

Since the Web’s inception fifteen years ago, there have been various alternatives tried or in ascendance for organizing and bringing structure to Web content. Some of these may be too static and inflexible, others perhaps too arbitrary or parochial. All approaches to date have had little collective success.

There are also new and exciting developments in social networks and user-driven content and structure arising from areas such as tagging or Wikipedia (and wikis in general). But it is not clear that bottom-up contributions suitable to individual articles or topics can lead to coherent structural frameworks; arguably, they have not so far. And then there are sporadic government or corporate or trade association initiatives as well.

Here is a summary of alternate approaches:

  • Existing library systems — Dewey Decimal Classification, Library of Congress, UDC and many other library classification schemes have been touted for the Web and all have failed. Some reasons cited for this failure are that physical books are very different from free digital bits; that Web schema need to evolve quickly; and that stewards and curation are lacking
  • Market share — at various times certain successful vendors have held temporary minor ascendance with content organizational frameworks, generally directory structures. Examples include About, Yahoo!, Open Directory Project (DMOZ), Northern Light, etc. Yet even at their peaks, market shares were low, external adoption was rare, and scope was questioned and arbitrary; interest in directories is now nearly absent
  • WordNet — though of strong interest and use to computational linguists, and quite popular for many content analyses, WordNet has seen little consumer or commercial interest. However, the synset structure and its coverage is extremely valuable for concept disambiguation, and therefore has a role in UMBEL (as it does in many other online systems)
  • Standards efforts — some sporadic success and some notable failures have occurred in the standards arenas. Generally, the successful initiatives tend to be in close communities where there are clear financial benefits for adherence, such as in the exchange of financial or commerce data; broader and more ambitious efforts have tended to be less successful
  • Professional organizations and associations — areas such as finance, pharmaceuticals, biology, physics and many bounded communities have enjoyed sporadic and sometimes notable success in developing and using domain-specific schema; none have yet transferred beyond their beginning boundaries to the broader Web
  • Government initiatives — there are episodic successes for government-sponsored content organizational initiatives, mostly in metadata, controlled vocabularies and ontologies, often where contractors or suppliers may be compelled to comply. NIH’s National Library of Medicine (and other NIH branches) have also seen significant domain successes, due to their foresight and their receptive biology, genetics and medical communities
  • Upper ontologies — UMBEL investigated this area considerably in the early months of the project. Most of the upper ontologies have relatively sparse subject concept content, being geared to smaller, abstract-oriented “upper” structures. Some, such as SUMO and DOLCE and now PROTON, have concerted initiatives to extend to middle- and domain-level ontologies [1]. To date, penetration of these systems into general Web or commercial realms has been quite limited
  • Wikipedia — a clear and phenomenal success, Wikipedia and related initiatives like Wikinvest and Wikicompany and scores more have proven to be a rich fount for named entities and article-length content, but not for the category and content organization structures in which that content is embedded. This is an area of keen academic and collective interest [2] and it may still result in useful organizational schema as these popular wikis continue to evolve and mature. However, they have not yet done so, and while a rich source for entities and data, UMBEL decided to pass on their use for “backbone” structure at this time
  • No collective structure — tagging or folksonomies or doing nothing have perhaps the greatest market share at present.

Since inception, the stated intent of the UMBEL project was to base its subject structure on extant systems. To minimize development time, the structure needed to be drawn from one of the categories above. Possible development of a de novo structure was rejected because of development time and the low probability of gaining acceptance in the face of so many competing alternatives.

Rationale for OpenCyc

The granddaddy of knowledge bases suitable to all human content and knowledge is Cyc. Because of its more than 20-year history, Cyc brings with it considerable strengths and some weaknesses.

Amongst all alternatives, Cyc rapidly emerged as the leading candidate. While its strengths warranted close attention, its weaknesses also suggested a considerable effort to overcome them. This combination compelled the need for a significant investigation and due diligence.

First, here are OpenCyc’s strengths:

  • Venerable and solid — through an estimated 200 person years of engineering and effort, the Cyc structure has been tested and refined through many projects and applications. While a few years back such groundings were unparalleled in the field, we are also now seeing some Internet-wide projects tap into the law of large numbers to get significant inputs of human labor. Cyc has also tapped this venue for ongoing expansion of its KB using the online FACTory game [3]
  • Community — there is a large community of Cyc users and supporters from academic, government, commercial and non-profit realms. Moreover, the formation of The Cyc Foundation has also served as a vehicle for tapping into volunteer effort as well
  • Upgrade Path — OpenCyc has an upgrade path to the more capable ResearchCyc, full Cyc and the services of Cycorp
  • Comprehensive — no existing system has the scope, breadth and coverage of human concepts to match that of Cyc (however, sources for named entities such as Wikipedia have recently passed Cyc in scope; see next section)
  • Common sense — since its founding as a project, and later with the backing of the standalone Cycorp, Cyc has set for itself a challenge that is both more pragmatic and harder than that of other knowledge systems. Cyc has set out to capture the common sense at the heart of human reasoning. This objective means codifying generally unstated logic and rules-of-thumb — not unlike teaching a baby to walk and talk and read — all of which are lengthy tasks of trial and error. However, as Cyc has gained this foundation, it has also led to a more solid basis for its reasoning and conceptual relationships
  • Power and inference — ultimately the purpose of a knowledge base is to support reasoning and inference by computer when presented with an (often small) starting set of assertions or facts. Cyc has literally thousands of microtheories now governing its inference domains, giving it a scope and power unmatched by other systems. The importance of such reasoning lies not in the silly science fiction of autonomous intelligent robots, but in achievable aids to make connections, determine relationships and filter and order results
  • Robust supporting capabilities — such knowledge base-wide capabilities can also be deeply leveraged in such areas as entity extraction, machine translation, natural language processing, risk analysis or one of the other dozens of specialty modules available in Cyc, and
  • Free and open — last, but not least, is the fact that a mostly complete Cyc was released as a free and open source version in 2002. OpenCyc has now been downloaded more than 100,000 times and is in production use for many applications. Non-profits and academics can also obtain access to the full capabilities of the Cyc system through ResearchCyc. This open character is absolutely essential because leading Web applications and leading innovators of the Web eschew proprietary systems.

After months of investigation and involvement, the richness of practical uses to which the OpenCyc knowledge base can be applied is still revealing itself.

Drawbacks to OpenCyc

But there are weaknesses and problems with Cyc.

To be sure, some individuals, and perhaps some historical criticisms of Cyc, have voiced fears of Big Brother or objected to grandiose claims about artificial intelligence or machine reasoning. These are criticisms of hype, immaturity or ignorance; they are different from the drawbacks observed by our UMBEL project and are not discussed further here.

In UMBEL’s investigation of Cyc, we observed these drawbacks:

  • Obscure upper ontology — the Cyc upper ontology, shown in the figure below, is perhaps not as clean as more recent upper ontologies (Proton, [4] for example, is a very clean system). The various sub-classifications of ‘Thing’ and degrees of “tangibility” seem particularly problematic. However, since these are not direct binding concepts for UMBEL and provide appropriate “glue” for the upper portions of the graph, these criticisms can be easily overlooked
    [Figure: Cyc Upper Ontology]
  • Cruft — twenty years of projects and forays into obscure domains (many for the military or intelligence communities) have left a significant degree of cruft marbled through the knowledge base. Indeed, as our vetting showed, perhaps about 30% of the concepts in Cyc are holdovers from prior projects or relate to internal Cyc-only topics
  • Reasoning concepts — another 15% or so of Cyc concepts are abstract or for reasoning purposes, such as reasoning over colors, beliefs, the sizes of objects, their orientations in space, and so forth. These are certainly legitimate concepts and appropriate to Cyc’s purposes, but are not needed or desired for UMBEL’s purposes
  • Greater expressivity — Cyc is grounded in the LISP language and has many higher-order logic constructs. Paradoxically, this greater expressiveness may make translation to UMBEL more difficult
  • Older conventions — also related to these groundings in an earlier era is the reliance on functions and functional predicates for many relations, and the absence of the current triple data model underlying RDF. While it is true that OWL versions of OpenCyc have been made and are the basis for UMBEL’s work to date, there are also errors in these translations, perhaps in some instances due to the lesser expressiveness of RDF and OWL
  • Documentation — while complete reference materials can ultimately be found, they are difficult to locate, and introductory and entry-level tutorials could stand to be augmented
  • Named entities — for many years, but now especially with the emergence of Wikipedia, Cyc has been criticized for its relative paucity of named entity coverage and imbalances of what it does contain. While from UMBEL’s perspective this appears to be strictly correct, such criticism misses the mark of Cyc’s special purpose and contributions as a solid conceptual and common sense framework. Those common-sense portions of the system are more immutable, and can be readily mapped to named entity sources. Indeed, perhaps Cyc will now see new vigor as the Web becomes a superior source for contemporary named entity coverage while Cyc fulfills its more natural (and needed) structural role.

Surprisingly, for a system of its age and evolution, Cyc seems to have adhered well to naming conventions and other standards.

UMBEL’s project diligence thus found the biggest issue going forward to be the cruft in the system. There is a solid structure underneath Cyc, but one that is too often obscured and not made as shiny and clean as it deserves.

The Decision and Design

Five months of nearly full-time due diligence was devoted to this question of the suitability of Cyc as the intellectual grounding for UMBEL.

On balance, OpenCyc’s benefits significantly outweighed its weaknesses. This balance also stands considerably superior to all potential alternatives.

An important factor through this deliberation was the commitment of Cycorp and The Cyc Foundation to the aims of UMBEL, and the willingness of those organizations to lend time and effort to promote UMBEL’s aims. Twenty years of development and the investment of decades of human effort and scrutiny provide a foundation of immense solidity.

Though perhaps Wikipedia (or something like it also based on broad Web input) might emerge with the scope and completeness of Cyc, that prospect is at minimum some years away and by no means certain. No current framework other than Cyc can meet UMBEL’s immediate purposes. Moreover, as stated at the outset, UMBEL’s purpose is pragmatic. We will leave it to others to argue the philosophical nuances of ontology design and “truth” while we get on with the task of creating context of real value.

The next decision was to base all UMBEL subject concepts on existing concepts in OpenCyc.

This means that UMBEL inherits all of the structural relations already in OpenCyc. It also means that UMBEL can act as a sort of contextual middleware between unstructured Web content and the inferential and tools infrastructure within OpenCyc (and beyond into ResearchCyc and Cyc for commercial purposes) and back again to the Web. We term this “roundtripping” and the capability is available for any of the 21,000 subject concepts vetted from OpenCyc within UMBEL.

Having made these commitments, our next effort was to break out the brushes, roll up the sleeves, and plunge into a Spring session of deep cleaning. This effort to vet and clean OpenCyc will be documented in the Technical Report to accompany the first release of the UMBEL ontology. We think you’ll like its shiny new look. :)

This is Part 4 of 4 on the foundations to UMBEL. This four-part series covers a Re-Introduction to UMBEL, UMBEL: Making Linked Data Classy, Subject Concepts and Named Entities, and Basing UMBEL’s Backbone on OpenCyc. These articles are lead-ins to the discussion of the actual UMBEL ontology. That series will begin next.

[1] Examples of upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc, John Sowa’s Top-Level Categories and BFO (Basic Formal Ontology). Most of the content in their upper-levels is akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget’s Thesaurus), though Cyc is a clear exception with its stated emphasis on capturing “common knowledge.”
[2] See, for example, this listing of about 100 academic articles devoted to structure and linguistic uses of Wikipedia: http://www.mkbergman.com/?p=417.
[3] FACTory is a game that lets people enter knowledge into the Cyc knowledge base. Via this online game, Cyc tries to determine the truth or falsehood of a series of facts. When enough people have agreed that a fact is true or not, Cyc considers it confirmed and stops asking about it. See http://game.cyc.com/helpfiles/HowToPlay.html.
[4] There are many aspects that make PROTON one of the more attractive reference ontologies. The PROTON ontology (PROTo ONtology), developed within the scope of the SEKT project, is attractive because of its understandability, relatively small size, modular architecture and a simple subsumption hierarchy. It is available in an OWL Lite form and is easy to adopt and extend. On the face of it, the Topic class within PROTON, which is meant to serve as a bridge between different ontologies, may also provide a binding layer to specific subject topics as sub-classes or class instances.
Posted:April 1, 2008

Part 3 of 4 on Foundations to UMBEL

UMBEL (Upper-level Mapping and Binding Exchange Layer) is a lightweight ontology for relating Web content and data to a standard set of subject concepts. It is being designed to apply to all types of data on the Web from RSS and Atom feeds to tagging to microformats and topic maps to RDF and OWL (among others). The project Web site is at http://www.umbel.org.

The first portion and priority for UMBEL is to prepare the lightweight subject concept ontology, the focus of this four-part foundations series. After the UMBEL ontology is released in first draft, the project will then turn to the binding protocols for non-RDF formats.

The previous part in this series discussed at length RDF classes and instances or individuals. We are now tightening these terms down to reflect the specific intents and usage within UMBEL. UMBEL’s main classes categorize subject concepts; notable instances are specifically termed named entities.

UMBEL defines subject concepts as a distinct subset of the more broadly understood concept [1] such as used in the SKOS RDFS controlled vocabulary [2], conceptual graphs, formal concept analysis or the very general concepts common to many upper ontologies [3]. We define subject concepts as a special kind of concept: namely, ones that are concrete, subject-related and non-abstract [4].

UMBEL contrasts subject concepts with abstract concepts and with named entities. Abstract concepts represent abstract or ephemeral notions such as truth, beauty, evil or justice, or are thought constructs useful to organizing or categorizing things but are not readily seen in the experiential world. Named entities are the real things or instances in the world that are themselves natural and notable class members of subject concepts.

Subject Concepts

Subject concepts are a special kind of concept: namely, ones that are concrete, subject-related and non-abstract. Note that in other systems or ontologies similar constructs may alternatively be called topics, subjects, concepts or perhaps interests. UMBEL has adopted the term subject concept to distinguish it from these uses, which have different nuances of meaning and use, as well as to highlight the subject or topic nature of UMBEL's concrete concepts.

Each subject concept is a class. While each subject concept has a preferred label (using SKOS terminology), that label is only a representative of, or a proxy for, the concept, and is not to be confused with the thing itself. Every UMBEL subject concept can be expressed and referred to by a different preferred label in alternate languages. Indeed, in a given language, different preferred labels may be swapped out without affecting the identity or use of the subject concept itself. The name for a subject concept is therefore merely a handle.
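
The separation between a concept's identity and its labels can be illustrated with a minimal Python sketch. This is not UMBEL code: the class name `SubjectConcept`, its fields and the sample labels are all hypothetical and only show the idea that labels can be swapped per language without touching the identity.

```python
from dataclasses import dataclass, field


@dataclass
class SubjectConcept:
    """Hypothetical model: identity lives in concept_id, never in a label."""
    concept_id: str
    # language code -> preferred label (one per language)
    pref_labels: dict = field(default_factory=dict)

    def label(self, lang: str, fallback: str = "en") -> str:
        """Preferred label for lang, else the fallback language, else the id."""
        return self.pref_labels.get(lang) or self.pref_labels.get(
            fallback, self.concept_id
        )

    def relabel(self, lang: str, new_label: str) -> None:
        """Swap the preferred label for one language; the identity is untouched."""
        self.pref_labels[lang] = new_label


# Illustrative data only; the identifier and labels are assumptions.
mammal = SubjectConcept("Mammal", {"en": "mammal", "fr": "mammifère"})
id_before = mammal.concept_id
mammal.relabel("en", "mammals")
```

After the relabeling, `mammal.concept_id` is unchanged, which is the sense in which the name is "merely a handle".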

Subject concepts are the core constituents to the UMBEL framework. All subject concepts are based on existing concepts in OpenCyc, the open source version of the Cyc knowledge base (see Part 4). About 21,000 of them have been distilled and are part of the UMBEL backbone.

Semsets

Semsets are semantically close terms or phrases synonymous or nearly so with the meanings of a subject concept or a named entity. Semsets are akin to WordNet synsets or Cyc aliases, but can also include more contemporary jargon or slang as may be drawn from Web tagging or folksonomies. The term semset has been chosen to distinguish this consolidated meaning.

Semsets may apply to either subject concepts or named entities. In the latter case, their use is closer to the sense of an alias (such as nicknames, or "great satan" or "uncle sam" for the "United States").
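
As a rough illustration, a semset can be thought of as a many-to-one lookup from surface terms to a concept or entity. The Python sketch below assumes a simple case- and whitespace-insensitive match; the `SemsetIndex` class, the identifiers and the `normalize` rule are hypothetical and are not part of the UMBEL specification.

```python
from collections import defaultdict


def normalize(term: str) -> str:
    """Assumed matching rule: lower-case and collapse runs of whitespace."""
    return " ".join(term.lower().split())


class SemsetIndex:
    """Hypothetical index from semset terms to concept or entity identifiers."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, target_id: str, terms) -> None:
        """Register every term in a semset as pointing at target_id."""
        for term in terms:
            self._index[normalize(term)].add(target_id)

    def lookup(self, term: str) -> set:
        """Return the identifiers whose semsets contain the term (may be empty)."""
        return set(self._index.get(normalize(term), set()))


index = SemsetIndex()
# A named entity with alias-like semset members, as in the text above
index.add("UnitedStates", ["United States", "Uncle Sam", "Great Satan"])
# A subject concept with near-synonyms (illustrative terms)
index.add("Automobile", ["car", "auto", "motorcar"])
```

A lookup returns a set because the same slang term could plausibly appear in more than one semset.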

Abstract Concepts

Abstract concepts represent abstract or ephemeral notions such as truth, beauty, evil or justice, or are thought constructs useful to organizing or categorizing things but are not readily seen in the experiential world. They are included in the UMBEL specification because they help maintain the integrity of the UMBEL subject concept graph.

Like subject concepts, abstract concepts are based strictly on those already in OpenCyc. Abstract concepts may be viewed in the UMBEL graph, and may be used for ontology mapping, but are not generally displayed when doing standard content mapping or concept look-ups via Web services. For various domain extraction or relatedness determinations, abstract concepts may be excluded from UMBEL's internal processing.
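
One way to picture this behavior is a subsumption graph in which abstract concepts stay in place as connecting nodes but are filtered out of what a look-up returns. The Python sketch below uses a tiny hypothetical graph fragment; the parent links shown are illustrative assumptions and are not taken from OpenCyc.

```python
# child -> direct parents; a hypothetical fragment, not actual OpenCyc structure
PARENTS = {
    "Dog": ["Mammal"],
    "Mammal": ["Animal"],
    "Animal": ["PartiallyTangibleThing"],
    "PartiallyTangibleThing": ["Thing"],
    "Thing": [],
}

# Concepts flagged as abstract: kept in the graph, hidden from look-ups
ABSTRACT = {"PartiallyTangibleThing", "Thing"}


def ancestors(concept: str) -> list:
    """All ancestors, abstract ones included, nearest first (breadth-first)."""
    seen, out = set(), []
    queue = list(PARENTS.get(concept, []))
    while queue:
        node = queue.pop(0)
        if node in seen:
            continue
        seen.add(node)
        out.append(node)
        queue.extend(PARENTS.get(node, []))
    return out


def visible_ancestors(concept: str) -> list:
    """Ancestors as a standard look-up would show them: abstract ones removed."""
    return [c for c in ancestors(concept) if c not in ABSTRACT]
```

The traversal still walks through the abstract nodes, so the graph stays connected; only the reported result omits them.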

Named Entities

Named entities are the real things or instances in the world that are themselves natural and notable class members of subject concepts. The initial named entities are drawn from Wikipedia as processed via YAGO, and other online fact-based repositories. Named entities are the instances of the subject concepts in the standard definition of the term [5].

Named entities and the sources for them are also a major avenue for growth and expansion of UMBEL moving forward. Named entities are more contemporary and changing, while the reference subject concept backbone is more fixed and stable.

Each named entity is mapped to a governing subject concept for ontology purposes. There are no relations between named entities except as mediated through a subject concept(s). As noted, named entities may also have semset aliases.
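
The rule that named entities relate to one another only through subject concepts can be sketched as follows. This Python example is an illustration under assumptions: the `EntityRegistry` class and the concept names are hypothetical, and the entities are sample data.

```python
from collections import defaultdict


class EntityRegistry:
    """Hypothetical registry: entities link only to concepts, never to each other."""

    def __init__(self):
        self._concepts_of = defaultdict(set)  # entity -> subject concepts
        self._entities_of = defaultdict(set)  # subject concept -> entities

    def assign(self, entity: str, concept: str) -> None:
        """Map a named entity to a governing subject concept."""
        self._concepts_of[entity].add(concept)
        self._entities_of[concept].add(entity)

    def related(self, entity: str) -> dict:
        """Other entities reachable via a shared concept, with the concepts shared."""
        links = defaultdict(set)
        for concept in self._concepts_of.get(entity, ()):
            for other in self._entities_of[concept]:
                if other != entity:
                    links[other].add(concept)
        return dict(links)


registry = EntityRegistry()
# Concept names below are illustrative, not actual UMBEL identifiers.
registry.assign("Ford Motor Company", "AutomobileManufacturer")
registry.assign("Toyota", "AutomobileManufacturer")
registry.assign("Henry Ford", "Person")
```

Here `registry.related("Ford Motor Company")` reports Toyota, and only because both are members of the same subject concept; no direct entity-to-entity link is ever stored.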

Subject Concepts v. Abstract Concepts

The following table helps draw the distinction between subject concepts and abstract concepts. Technical documentation at the time of the UMBEL ontology release will list the 520 or so abstract concepts presently within UMBEL. Looking at those can help draw the distinction.

Subject concepts:
  • Nouns or noun phrases
  • These are concrete kinds of things or ideas in the real world
  • Broad, collective, reference concepts, often hierarchically related
  • Similar to “topics” or “subjects”, though these other terms are used in somewhat different ways in alternative schemas
  • Collections or classes of like “kinds” of items
  • Quite stable in scope, breadth and structure
  • Grounded in the OpenCyc knowledge base, which is the source of its relationships and graph structure
  • Named entities are members of subject concepts

Abstract concepts:
  • These are either: 1) abstract (truth, beauty, evil) concepts, or 2) artificial thought constructs for organizing things but not encountered as standalone concepts in their own right (e.g., PartiallyTangibleThing)
  • Collections or classes of like “kinds” of items
  • Class members may be either other abstract concepts or subject concepts
  • Class members are never named entities
  • Tend to reside higher in the subsumption structure
  • Generally hidden from the UMBEL subject concept reference “backbone” structure
  • May be used for ontology mapping purposes
  • Grounded in the OpenCyc knowledge base, which is the source of its relationships and graph structure

Subject Concepts v. Named Entities

The following table helps draw the distinction between subject concepts and named entities. Technical documentation at the time of the UMBEL ontology release will describe certain "gray" categories and the determination as to whether they should be treated as one or the other.

For example, most geographical places clearly belong to the named entity category. But, on somewhat arbitrary grounds, all nations, countries, states and provinces were assigned as subject concepts so that they would act as classes with other entities mapped to them. It should also be noted that entities or concepts in the gray zone may be treated both as a named entity and a subject concept.

Subject concepts:
  • Broad, collective, reference concepts. In a hierarchical category structure, subject concepts represent the “root” or “branch” nodes
  • Nouns or noun phrases
  • Called “subject concepts” (or sometimes as a shorthand, “concepts”). Similar to “topics” or “subjects”, though these other terms are used in somewhat different ways in alternative schemas and are therefore not used interchangeably here
  • These are not abstract (truth, beauty, evil) concepts, but concrete kinds of things or ideas in the real world; abstract concepts are often properly part of what are known as “upper ontologies” but they are not applicable for UMBEL’s purposes
  • Collections or classes of like “kinds” of items
  • Quite stable in scope, breadth and structure
  • Grounded in the OpenCyc knowledge base, which is the source of its relationships and graph structure
  • Basis for the UMBEL subject concept reference “backbone” structure
  • Named entities are members of subject concepts

Named entities:
  • Atomic, specific objects, often famous or well-known, that belong to reference “types” such as persons, places, organizations, events, products, time intervals, etc. In a hierarchical category structure, named entities represent the “leaves”
  • Nouns or noun phrases
  • Called “named entities”, not entities alone, to prevent confusion with other general senses of the term “entity” and in keeping with named entity recognition (NER)
  • Very concrete, atomic entities
  • The number and scope is fluid and growing, and potentially of huge size as specific objects are named
  • Often expressed as a proper noun (with some capitalization), but not necessarily so. Common animal, plant, object, substance names also can be named entities
  • Major sources are Wikipedia (YAGO), and similar such as Wikinvest, Wikicompanies, etc.
  • Named entities are maintained and treated separately from the UMBEL subject concept ontology
  • Every named entity belongs to at least one subject concept

Though there are shades of gray between subject concepts and named entities, we have found this distinction to be a powerful means for gaining clarity in UMBEL’s design. It provides a clean path for keeping an ontology lightweight while in essence providing infinite extensibility for all manner of named entities and the datasources that contain them. Moreover, the ability to classify named entities into types orthogonal to subject concepts also provides useful guidance for presentation templates that may be automatically invoked in data meshups. But, that is a topic for another day. :)

This is Part 3 of 4 on the foundations to UMBEL. This four-part series covers a Re-Introduction to UMBEL, UMBEL: Making Linked Data Classy, Subject Concepts and Named Entities, and Basing UMBEL’s Backbone on OpenCyc. These articles are lead-ins to the discussion of the actual UMBEL ontology that will soon follow.

[1] As the term is used in mainstream cognitive science and philosophy of mind, a concept is an abstract idea or a mental symbol, typically associated with a corresponding representation in a language or symbology. Definition is from Wikipedia; see further http://en.wikipedia.org/wiki/Concept.
[2] SKOS stands for Simple Knowledge Organization System; it is a controlled vocabulary based on RDF Schema designed to allow the creation of formal languages to represent thesauri, classification schemes, taxonomies, subject-heading systems, or any other type of structured information. See http://www.w3.org/2004/02/skos/.
[3] Examples of upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc and BFO (Basic Formal Ontology). Most of the content in their upper-levels is akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget's Thesaurus), though Cyc is a clear exception with its stated emphasis on capturing “common knowledge.”
[4] A subject concept bears some resemblance to dc:subject or foaf:interest in other ontologies. However, unlike those approaches, UMBEL subject concepts: 1) come from a reference set to pick from, with synonym-like relationships similar to WordNet synsets; and 2) are not semantically literal descriptions for the terms, but rather "proxies" for the concepts they represent. This referential character of subject concepts makes them readily transferable to multiple human languages.
[5] In a named entity, the word named applies to entities that have "rigid designators", as defined by Kripke, for the referent. For instance, the automotive company created by Henry Ford in 1903 is referred to as Ford or Ford Motor Company. Rigid designators include proper names as well as certain natural kind terms like biological species and substances. The BBN categories proposed in 2002 consist of 29 types and 64 subtypes; Sekine’s extended hierarchy, also proposed in 2002, is made up of 200 subtypes. We use Sekine (http://nlp.cs.nyu.edu/ene/version6_1_0eng.html) as our guide. For example, Sekine's top 15 named entity classes are: Name_Other, Person, Organization, Location, Facility, Product, Event, Natural_Object, Title, Unit, Vocation, Disease, God, Id_Number and Color; the remaining types are subsumed under these. See further http://en.wikipedia.org/wiki/Named_entity_recognition. Generally, named entities are the instances of UMBEL classes.