Part 4 of 4 on Foundations to UMBEL
Just as DBpedia has provided the nucleating point for linking instance data (see Part 2), UMBEL is designed to provide a similar reference structure for concepts. These concepts provide some fixed positions in space to which other sources can link and relate. And, like references for instance data, the existence of reference concepts can greatly diminish the number of links necessary in the Linked Data environment.
Clearly, the combination of the representativeness of UMBEL’s subject concepts (the “scope” of the ontology) and their relationships (the “structure” of the backbone) is fundamental. These factors in turn express the functional capabilities of the system.
First Things First
The first fundamental point deserving emphasis is that a reference structure of almost any nature has value. We can argue later about what is the best reference structure, but the first task is to just get one in place and begin bootstrapping. Indeed, over time, it is likely that a few reference structures will emerge and compete and get supplemented by still further structures. This evolution is expected and natural and desirable in that it provides choice and options.
A reference structure of concepts has the further benefit of providing a logical reference structure for instances as well. While Wikipedia is perhaps the most comprehensive collection of humanity-wide instances, no single source can or will be complete in scope. Thus, we foresee specialty sources ranging from the companies in Wikicompany to plants and animals in the Encyclopedia of Life or thousands of other rich instance sources also acting as reference hubs.
How do each of these rich instance sources relate to one another? What is the subject concept or topical basis by which they overlap or complement? What is the framework and graph structure of knowledge to give this information context? These are the benefits brought by a structure of reference concepts, independent from the specifics of the reference structure itself.
Another key consideration is that broad-scale acceptance is important. An express purpose of UMBEL is to aid the interconnection of related content using broadly accepted foundations.
Since the Web’s inception fifteen years ago, there have been various alternatives tried or in ascendance for organizing and bringing structure to Web content. Some of these may be too static and inflexible, others perhaps too arbitrary or parochial. All approaches to date have had little collective success.
There are also new and exciting developments in social networks and user-driven content and structure arising from areas such as tagging or Wikipedia (and wikis in general). But it is not clear that bottom-up contributions suitable to individual articles or topics can lead to coherent structural frameworks; arguably, they have not yet so far. And then there are sporadic government or corporate or trade association initiatives as well.
Here is a summary of alternate approaches:
- Existing library systems — Dewey Decimal Classification, Library of Congress, UDC and many other library classification schemes have been touted for the Web and all have failed. Some reasons cited for this failure are physical books are very different from free digital bits; Web schema need to evolve quickly; and lack of stewards and curation
- Market share — at various times certain successful vendors have held temporary minor ascendance with content organizational frameworks, generally directory structures. Examples include About, Yahoo!, Open Directory Project (DMOZ), Northern Light, etc. Yet even at their peaks, market shares were low, external adoption was rare, scope was questioned and arbitrary, with interest in directories now nearly absent
- WordNet — though of strong interest and use to computational linguists, and quite popular for many content analyses, WordNet has seen little consumer or commercial interest. However, the synset structure and its coverage is extremely valuable for concept disambiguation, and therefore has a role in UMBEL (as it does in many other online systems)
- Standards efforts — some sporadic success and some notable failures have occurred in the standards arenas. Generally, the successful initiatives tend to be in close communities where there are clear financial benefits for adherence, such as in the exchange of financial or commerce data; broader and more ambitious efforts have tended to be less successful
- Professional organizations and associations — areas such a finance, pharmaceuticals, biologists, physicists and many bounded communities have enjoyed sporadic and sometimes notable success in developing and using domain-specific schema; none have yet transferred beyond their beginning boundaries to the broader Web
- Government initiatives — there are episodic successes for government-sponsored content organizational initiatives, mostly in metadata, controlled vocabularies and ontologies, often where contractors or suppliers may be compelled to comply. NIH’s National Library of Medicine (and other NIH branches) have also seen significant domain successes, due to its foresight and its receptive biology, genetics and medical communities
- Upper ontologies — UMBEL investigated this area considerably in the early months of the project. Most of the upper ontologies have relatively sparse subject concept content, being geared to smaller, abstract-oriented “upper” structures. Some such as SUMO and DOLCE and now PROTON, have concerted initiatives to extend to middle- and domain-level ontologies . To date, penetration of these systems into general Web or commercial realms has been quite limited
- Wikipedia — a clear and phenomenal success, Wikipedia and related initiatives like Wikinvest and Wikicompany and scores more have proven to be a rich fount for named entities and article-length content, but not for the category and content organization structures in which that content is embedded. This is an area of keen academic and collective interest  and it may still result is useful organizational schema as these popular wikis continue to evolve and mature. However, they have not yet done so, and while a rich source for entities and data, UMBEL decided to pass on their use for “backbone” structure at this time
- No collective structure — tagging or folksonomies or doing nothing have perhaps the greatest market share at present.
Since inception, the stated intent of the UMBEL project was to base its subject structure on extant systems. To minimize development time, the structure needed to be drawn from one of the categories above. Possible development of a de novo structure was rejected because of development time and the low probability of gaining acceptance in the face of so many competing alternatives.
Rationale for OpenCyc
The granddaddy of knowledge bases suitable to all human content and knowledge is Cyc. Because of its more than 20-year history, Cyc brings with it considerable strengths and some weaknesses.
Amongst all alternatives, Cyc rapidly emerged as the leading candidate. While its strengths warranted close attention, its weaknesses also suggested a considerable effort to overcome them. This combination compelled the need for a significant investigation and due diligence.
First, here are OpenCyc’s strengths:
- Venerable and solid — through an estimated 200 person years of engineering and effort, the Cyc structure has been tested and refined through many projects and applications. While a few years back such groundings were unparalleled in the field, we are also now seeing some Internet-wide projects tap into the law of large numbers to get significant inputs of human labor. Cyc has also tapped this venue for ongoing expansion of its KB using the online FACTory game 
- Community — there is a large community of Cyc users and supporters from academic, government, commercial and non-profit realms. Moreover, the formation of The Cyc Foundation has also served as a vehicle for tapping into volunteer effort as well
- Upgrade Path — OpenCyc has an upgrade path to the more capable ResearchCyc, full Cyc and the services of Cycorp
- Comprehensive — no existing system has the scope, breadth and coverage of human concepts to match that of Cyc (however, sources for named entities such as Wikipedia have recently passed Cyc in scope; see next section)
- Common sense — since its founding as a project and then backed by the standalone Cycorp, Cyc has set for itself both a more pragmatic but harder challenge than other knowledge systems. Cyc has set out to capture the common sense at the heart of human reasoning. This objective means codifying generally unstated logic and rules-of-thumb — not unlike teaching a baby to walk and talk and read — all of which are lengthy tasks of trial and error. However, as Cyc has gained this foundation, it has also led to a more solid basis for its reasoning and conceptual relationships
- Power and inference — ultimately the purpose of a knowledge base is to support reasoning and inference by computer when presented with a (often small) starting set of assertions or facts. Cyc has literally thousands of microtheories now governing its inference domains, giving it a scope and power unmatched by other systems. The importance of such reasoning is not the silly science fiction of autonomous intelligent robots, but as achievable aids to make connections, determine relationships and filter and order results
- Robust supporting capabilities — such knowledge base-wide capabilities can also be deeply leveraged in such areas as entity extraction, machine translation, natural language processing, risk analysis or one of the other dozens of specialty modules available in Cyc, and
- Free and open — last, but not least, is the fact that a mostly complete Cyc was released as a free and open source version in 2002. OpenCyc has now been downloaded more than 100,000 times and is in production use for many applications. Non-profits and academics can also obtain access to the full capabilities of the Cyc system through ResearchCyc. This open character is an absolute essential because leading Web applications and leading innovators of the Web eschew proprietary systems.
Literally, after months of investigation and involvement, the richness of practical uses to which the OpenCyc knowledge base can be applied are still revealing themselves.
Drawbacks to OpenCyc
But there are weaknesses and problems with Cyc.
To be sure, there are some individuals and perhaps some historical criticisms of Cyc that involved fears of Big Brother or grandiose claims about artificial intelligence or machine reasoning. These are criticisms of hype, immaturity or ignorance; they are different than the drawbacks observed by our UMBEL project and not further discussed here.
In UMBEL’s investigation of Cyc, we observed these drawbacks:
- Obscure upper ontology — the Cyc upper ontology, shown in the figure below, is perhaps not as clean as more recent upper ontologies (Proton,  for example, is a very clean system). The various sub-classifications of ‘Thing’ and degrees of “tangibility” seem particularly problematic. However, since these are not direct binding concepts for UMBEL and provide appropriate “glue” for the upper portions of the graph, these criticisms can be easily overlooked
- Cruft — twenty years of projects and forays into obscure domains (many for the military or intelligence communities) have left a significant degree of cruft marbled through the knowledge base. Indeed, as our vetting showed, perhaps about 30% of the concepts in Cyc are holdovers from prior projects or relate to internal Cyc-only topics
- Reasoning concepts — another 15% or so of Cyc concepts are abstract or for reasoning purposes, such as reasoning over colors, beliefs, the sizes of objects, their orientations in space, and so forth. These are certainly legitimate concepts and appropriate to Cyc’s purposes, but are not needed or desired for UMBEL’s purposes
- Greater expressivity — Cyc is grounded in the LISP language and has many higher-order logic constructs. Paradoxically, this greater expressiveness may make translation to UMBEL more difficult
- Older conventions — also related to these groundings in an earlier era are the reliance on functions and functional predicates for many relations, and the absence of the current triple data model underlying RDF. While it is true that OWL versions of OpenCyc have been made and are the basis for UMBEL’s work to date, there are also errors in these translations perhaps in some instances due to the lesser expressiveness of RDF and OWL
- Documentation — while complete reference materials can ultimately be found, it is difficult to do so and introductory and entry-level tutorials could stand to be augmented
- Named entities — for many years, but now especially with the emergence of Wikipedia, Cyc has been criticized for its relative paucity of named entity coverage and imbalances of what it does contain. While from UMBEL’s perspective this appears to be strictly correct, such criticism misses the mark of Cyc’s special purpose and contributions as a solid conceptual and common sense framework. Those common-sense portions of the system are more immutable, and can be readily mapped to named entity sources. Indeed, perhaps Cyc will now see new vigor as the Web becomes a superior source for contemporary named entity coverage while Cyc fulfills its more natural (and needed) structural role.
Surprisingly, for a system of its age and evolution, Cyc seems to have adhered well to naming conventions and other standards.
UMBEL’s project diligence thus found the biggest issue going forward to be the cruft in the system. There is a solid structure underneath Cyc, but one that is too often obscured and not made as shiny and clean as it deserves.
The Decision and Design
Five months of nearly full-time due diligence was devoted to this question of the suitability of Cyc as the intellectual grounding for UMBEL.
On balance, OpenCyc’s benefits significantly outweighed its weaknesses. This balance also stands considerably superior to all potential alternatives.
An important factor through this deliberation was the commitment of Cycorp and The Cyc Foundation to the aims of UMBEL, and the willingness of those organizations to lend time and effort to promote UMBEL’s aims. Twenty years of development and the investment of decades of human effort and scrutiny provides a foundation of immense solidity.
Though perhaps Wikipedia (or something like it also based on broad Web input) might emerge with the scope and completeness of Cyc, that prospect is at minimum some years away and by no means certain. No other current framework than Cyc can meet UMBEL’s immediate purposes. Moreover, as stated at the outset, UMBEL’s purpose is pragmatic. We will leave it to others to argue the philosophical nuances of ontology design and “truth” while we get on with the task of creating context of real value.
The next decision was to base all UMBEL subject concepts on existing concepts in OpenCyc.
This means that UMBEL inherits all of the structural relations already in OpenCyc. It also means that UMBEL can act as a sort of contextual middleware between unstructured Web content and the inferential and tools infrastructure within OpenCyc (and beyond into ResearchCyc and Cyc for commercial purposes) and back again to the Web. We term this “roundtripping” and the capability is available for any of the 21,000 subject concepts vetted from OpenCyc within UMBEL.
Having made these commitments, our next effort was to break out the brushes, roll up the sleeves, and plunge into a Spring session of deep cleaning. This effort to vet and clean OpenCyc will be documented in the Technical Report to accompany the first release of the UMBEL ontology. We think you’ll like its shiny new look.