Posted:April 27, 2008

A brief slideshow, “UMBEL’s Eleven,” overviews the project’s first 11 semantic Web services and online demos. It has been posted to SlideShare.

UMBEL (Upper-level Mapping and Binding Exchange Layer) is a lightweight reference structure for placing Web content, named entities and data in context with other data. It comprises about 21,000 subject concepts and their relationships, both with one another and with external vocabularies and named entities.

Recent postings by Fred Giasson and by me discussed these Web services in a bit more detail.

Posted:April 20, 2008

There’s Some Cool Tools in this Box of Crackerjacks

UMBEL is today releasing a new sandbox for its first iteration of Web services. The site is being hosted by Zitgist. All are welcome to visit and play.

And, UMBEL is What, Again?

UMBEL (Upper-level Mapping and Binding Exchange Layer) is a lightweight reference structure for placing Web content and data in context with other data. It comprises about 21,000 subject concepts and their relationships, both with one another and with external vocabularies and named entities.

Each UMBEL subject concept represents a defined reference point for asserting what a given chunk of content is about. These fixed hubs enable similar content to be aggregated and then placed into context with other content. These subject context hubs also provide the aggregation points for tying in their class members, the named entities which are the people, places, events, and other specific things of the world.

The backbone to UMBEL is the relationships amongst these subject concepts. It is this backbone that provides the contextual graph for inter-relating content. UMBEL’s subject concepts and their relationships are derived from the OpenCyc version of the Cyc knowledge base.

The UMBEL ontology is based on RDF and written in the RDF Schema vocabulary of SKOS (Simple Knowledge Organization System) with some OWL Full constructs to aid interoperability.

UMBEL’s backbone is also a reference structure for more specific domains or ontologies, thereby enabling further context for inter-relating additional content. Much of the sandbox shows these external relationships.
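To make the idea of a concept backbone a little more concrete, here is a minimal sketch in Python. It is only an illustration: the `ex:` identifiers and the tiny graph are invented for this example and are not taken from the actual UMBEL ontology. The only real vocabulary terms used are the SKOS properties `skos:prefLabel` and `skos:broader`.

```python
# Hypothetical mini-graph of (subject, predicate, object) triples.
# The "ex:" concepts are made up for illustration only.
SKOS_BROADER = "skos:broader"
SKOS_PREF_LABEL = "skos:prefLabel"

triples = [
    ("ex:Mammal", SKOS_PREF_LABEL, "Mammal"),
    ("ex:Mammal", SKOS_BROADER, "ex:Vertebrate"),
    ("ex:Vertebrate", SKOS_BROADER, "ex:Animal"),
    ("ex:Dog", SKOS_BROADER, "ex:Mammal"),
]


def broader_concepts(concept, graph):
    """Collect every concept reached by repeatedly following skos:broader."""
    found = []
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in graph:
            if subject == current and predicate == SKOS_BROADER and obj not in found:
                found.append(obj)
                frontier.append(obj)
    return found


# The broader concepts give the "context" for a piece of content about dogs.
context = broader_concepts("ex:Dog", triples)
print(context)
```

Content tagged with the hypothetical `ex:Dog` concept can then be placed in context with anything else attached to the broader concepts that the walk returns.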

UMBEL’s Eleven

This first set of Web services provides online demo sandboxes, descriptions of what the services are about, and their API documentation. The first 11 services are:

A CLASSy Detailed Report

The single service that provides the best insight to what UMBEL is all about is the Subject Concept Detailed Report. (That is probably because this service is itself an amalgam of some of the others.)

Starting from a single concept amongst the 21,000, in this case ‘Mammal’, we can get descriptions or definitions (the proper basis for making semantic relationships, not the ‘Mammal’ label), aliases and semsets, equivalent classes (in OWL terms), named entities (for leaf concepts), more general or specific external classes, and domain and range relationships with other ontologies. Here is the sample report for ‘Mammal’:

The discerning eye will likely observe that while there is a rich set of relationships to the internal UMBEL subject concepts, coverage is still light for external classes and named entities. This sandbox is, after all, a first release and we are early in the mapping process. :)

But, it should also start to become clear that the ability of this structure to map and tie in all forms of external concepts and class structures is phenomenal. Once such class relationships are mapped (to date, most other Linked Data only occurs at the instance level), all external relationships and properties can be inherited as well. And, vice versa.
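A small sketch may help show why class-level mappings carry so much leverage. The class names, the mapping and the instance below are all hypothetical, and the lookup assumes a single parent per class purely to keep the example short; it is not a description of how the UMBEL services are implemented.

```python
# Hypothetical class-level mapping: an external class is declared a
# subclass of a reference concept (names are illustrative only).
SUBCLASS_OF = {
    "ext:Whale": "umbel:Mammal",   # assumed external-to-reference mapping
    "umbel:Mammal": "umbel:Animal",
}

# Hypothetical instance data published by the external source.
INSTANCE_TYPE = {"ext:MobyDick": "ext:Whale"}


def inferred_classes(instance):
    """Return the instance's asserted class plus every inherited superclass."""
    classes = []
    current = INSTANCE_TYPE.get(instance)
    while current is not None:
        classes.append(current)
        current = SUBCLASS_OF.get(current)
    return classes


print(inferred_classes("ext:MobyDick"))
```

One mapping at the class level (the first entry of `SUBCLASS_OF`) is enough for every instance of the external class to pick up the reference concepts above it, with no instance-by-instance linking.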

So, for aficionados of the network effect, stand back! You ain’t seen nothing yet. If we have seen amazing emergent properties arising from the people and documents on the Web, with data we move to another quantum level, like moving from organisms to cells. The leverage of such concept and class structures to provide coherence to atomic data is literally primed to explode.

Bloomin’ Concepts!

To put it mildly, trying to get one’s mind around the idea of 21,000 concepts and all of their relationships and all of their possible tie in points and mappings to still further ontologies and all of their interactions with named entities and all of their various levels of aggregation or abstraction and all of their possible translations into other languages or all of their contextual descriptions or all of their aliases or synonyms or all of their clusterings or all of their spatial relationships or all of the still more detailed relationships and instances in specific domains or, well, whew! You get the idea.

It is all pretty complex and hard to grasp.

One great way to wrap one’s mind around such scope is through interactive visualization. The first UMBEL service to provide this type of view is the Subject Concept Explorer, a screenshot of which is shown here:

But really, to gain the true feel, go to the service and explore for yourself. It feels like snorkeling through those schools of billions of tiny silver fish. Very cool!

These amazing visualizations are being brought to us by Moritz Stefaner, imho one of the best visualization and Flash gurus around. We will be showcasing more about Moritz’s unbelievable work in some forthcoming posts, where some even cooler goodies will be on display. His work is also on display at a couple of other sites that you can spend hours drooling over. Thanks, Moritz!

Missing Endpoints and Next Steps

You should note that developer access to the actual endpoints and external exposure of the subject concepts as Linked Data are not yet available. The endpoints, Linked Data and further technical documentation will be forthcoming shortly.

The services and demos currently displayed on this UMBEL Web services site are a sandbox for where the project is going. Upcoming releases will provide the following as open source under an attribution license:

  • The formal UMBEL ontology written in OWL Full and SKOS
  • Technical documentation for the ontology and its use and extension
  • Freely accessible Web services according to the documentation already provided
  • Technical documentation and reports for the derivation of the subject concepts from OpenCyc and the creation and extension of semsets and named entities related to that structure.

When we hit full stride, we expect to be releasing still further new Web services on a frequent basis.

BTW, for more technical details on this current release, see Fred Giasson’s accompanying post. Fred is the magician who has brought much of this forward.

Posted:April 13, 2008

'Impossible Image' from Wikipedia under GNU Free Documentation License

A New Entrant into the Lion’s Den

Exactly one month ago I wrote in The Shaky Semantics of the Semantic Web, “The time is now and sorely needed to get the issues of representation, resources and reference cleaned up once and for all.”

The piece was prompted by growing rumblings on semantic Web mailing lists and elsewhere about semantic Web terminology, plus concerns that lack of clarity was opening the door for re-branding or appropriating the semantic Web ‘space.’ I observed these issues were “complex and vexing boils just ready to erupt through the surface.”

My own post was little noticed but the essential observations, I think, were correct. In the past month the rumblings have become a distinct growl and aspects of the debate are now coming into direct focus. I think the perspective has (thankfully) shifted from not wanting to re-open the arcanely named “httpRange-14” debate of three years past (Wikipedia has a less technical explanation of its role in bringing the concept of “information resource” to the Web) to perhaps finally lancing the boil.

A Determined Protagonist

Many of us monitor multiple mailing lists; they seem to have their own ebb and flow, most often quiet, but sometimes rat-a-tat-tat furious. In and of itself, it is fascinating to see which topics and threads catch fire while others remain fallow.

One mailing list that I monitor is the W3C’s Technical Architecture Group, in essence the key deliberation body for technical aspects of the Web. Key authors of the Web such as Tim Berners-Lee, Roy Fielding and many, many others of stature and knowledge either are on the TAG or participate in its deliberations. The TAG’s public mailing list is immensely helpful to learn about technical aspects of the Web and to get a bit of early warning regarding upcoming issues. The W3C and its TAG are exemplars of open community process and governance in the Internet era.

I assume many hundreds monitor the TAG list; most, like me, comment rarely or not at all. The matters can indeed be quite technical and there is much history and well-thought rationale behind the architecture of the Web.

Xiaoshu Wang has recently been a quite active participant. English is not Xiaoshu’s native language, but because of his passion he has nonetheless been a determined protagonist to probe the basis and rationale behind the use of resources, representations and descriptions on the Web. These are difficult concepts under the best of circumstances, made all the harder due to language differences and special technical senses that have been adopted by the TAG in its prior deliberations.

These concerns were first and most formally expressed in a technical report, URI Identity and Web Architecture Revisited, by Xiaoshu and colleagues in November 2007.

My layman’s explanation of Xiaoshu’s concerns is that the earlier httpRange-14 decision to establish a technical category of “information resources” begs and leaves open the question of the inverse — what has been called a “non-information resource” — and actually violates prior semantics and understandings of what should be better understood as representations.

A Respected Interlocutor

This discussion arose in relation to the Uniform Access to Descriptions [1], a thread begun by Jonathan Rees of the TAG to assemble use cases related to HttpRedirections-57, a proposal to standardize the description of URI things, such as documents, by rejuvenating the link header. Because of its topic, discussion of httpRange-14 was discouraged since putatively the core definition of “information resource” was not at issue.

However, after introduction of a most interesting pre-print, In Defense of Ambiguity [2], co-author Harry Halpin perhaps inadvertently opened the door to the httpRange-14 question again. Then, Xiaoshu began submitting and commenting in earnest, and Stuart Williams of the TAG, in particular, was helpful and patient in drawing out and articulating the points.

My observation is that Xiaoshu was never advocating a change in the basic or current architecture of the Web, but perhaps that was not apparent or readily clear. Again, the frailty of human communications, compounded by language and perspective, has been much in evidence.

Pat Hayes, the editor of the excellent RDF Semantics W3C recommendation, then intervened as interlocutor for Xiaoshu’s basic positions. Many, many others, notably including Berners-Lee and Fielding, have also joined the fray. The entire thread [4] is worth reading and study.

Since Xiaoshu has publicly endorsed Hayes’ interpretation, here are some important snippets from Pat’s articulation [3]:

The central point is that now that we have the technology and ideas of the semantic web available, we have a wider range of ways of representing, and a richer notion of what words like “metadata” mean. If we are willing to take fuller advantage of this new richness, we make available new ways to do semantic things within the same overall design of the pre-semantic web.

. . .

There simply is no other word [than 'represents'] that will do. And the size, history and, I’m sorry, but scholarly and intellectual authority of the community which uses a wider sense of ‘represent’ so greatly exceeds the AWWW [W3C Web] community that I don’t think you can reasonably claim possession of such a basic and central term for such a very narrow, arcane and special (and, by the way, under-defined) sense.

. . .

If AWWW had used a technical word in a new technical way, then this would likely have been harmless. Mathematics re-used ‘field’ without getting confused with agriculture. But the AWWW/semantics clash over the meaning of ‘represent’ is harmful because the senses are not independent: the AWWW usage is a (very) special case of the original meaning, so it is inherently ambiguous every time it is used; and, still worse, we need the broader meaning in these very discussions, because the TAG has decreed that URIs can denote anything: so we are here discussing semantics in a broad sense whether we like it or not. And if the word ‘represent’ is to be co-opted to be used only in one very narrow sense, then we have no word left for the ordinary semantic sense. To adopt a usage like this is almost pathological in the way it is likely to generate confusion (as it already has, and continues to do so, in spades.)

. . .

The way we name Web pages is a special case of this picture, where the ‘storyteller’ is the same thing as the resource. Things that can be their own storytellers fit nicely within current AWWW, with its official understanding of words like ‘represent’. (In fact, capable of being ones own storyteller might be a way to define ‘information resource’.) But the nice thing about this picture [as presented by Xiaoshu] is that other kinds of resource, which do not fit at all within the AWWW – things that aren’t documents, ‘non-information resources’ – also fit within it; still, ironically, using the AWWW language, but with a semantic rather than AWWW sense of ‘represent’.

Right now, the semantic web really does not have a coherent story to tell about how it works with non-information resources, other than it should use RDF (plus whatever is sitting on it in higher levels) to describe them; which says nothing, since RDF can describe anything. URIs in RDF are just names, their Web role as http entities semantically irrelevant. Http-range-14 connects access and denotation for document-ish things, but for other things we have no account of how they should or should not be related, or what anything a URI might access via http has got to do with what it denotes.

The way that the three participants (denoted-thing, URI-name and Web-information-resource ‘storyteller’) interact must be basically different when the denoted-thing isn’t an information resource from when it is. All that being suggested here is that there is an account that we could give about this, one that works in both cases and which fits the language of AWWW quite, er, nicely.

. . .

A person exists and has properties entirely separate from the Web. Many people have nothing to do with the Web in their entire lives. People are not Web objects. And when the URI is being used in an RDF graph to refer to a person, the fact that it starts with http: is nothing more than a lexical accident, which has no bearing whatever on the role of the URI as a name denoting a person.

. . .

I think this particular shoe is on the other foot. If you can actually say, clearly enough to prevent continual trails of endless email debate, what AWWW actually means by ‘represent’, then I’d be delighted if you would use some technical word to refer to that elusive notion. But the word ‘represent’ and its cognates has been a technical word in far larger and more precisely stated forums for over a century; and since the day that Web science has included the semantic web, AWWW has taken an irrevocable step into the same academy. You are using the language of semantics now. If you want to be understood, you have to learn to use it correctly.

. . .

All it would do is move the responsibility of deciding what a URI denotes from a rather messy and widely ill-understood distinction based on http codes, to a matter of content negotiation. This would allow phenomena which violate http-range-14, but it would by no means insist on such violations in all cases. In fact, if we were to agree on some simple protocols for content negotiation which themselves referred to http codes, it could provide a uniform mechanism for implementing the http-range decision.

. . .

Moreover, this approach would put ‘information resources’ on exactly the same footing as all other things in the matter of how to choose representations of them for various purposes, a uniformity which means little at present but is likely to increase in value in the future.

. . .

But right now, for the case where a URI is understood to denote something other than an information resource, we have a completely blank slate. There is nothing which tells our software how to interoperate in this case. Our situation is not a kind of paradise of reference-determination from which Xiaoshu and I are threatening to have everyone banished. Right now for the semantic web, things are about as bad as they can get.

. . .

. . . we, as a society, can use [the conventions we decide] for whatever we decide and find convenient. The Web and the Internet are replete with mechanisms which are being used for purposes not intended by their original designers, and which are alien to their original purpose. For a pertinent example, the http-range-14 decision uses http codes in this way. That isn’t what http codes are for.

I have repeated much of this material because I believe it to be of wide import to the semantic Web’s development and future. Obviously, for better understanding, the full thread [4] plus its generous sprinkling of excellent prior documents and discussions is most recommended.
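For readers who have not followed the httpRange-14 decision that these excerpts keep returning to, its core rule ties what a URI identifies to the HTTP status code returned for a GET request. The sketch below is my own paraphrase of that rule as a lookup function; the function name and the returned wording are assumptions for illustration, not text from the TAG resolution.

```python
def nature_of_resource(status: int) -> str:
    """Paraphrase the httpRange-14 rule for the status code of a GET response."""
    if 200 <= status < 300:
        # A 2xx response: the URI identifies an information resource.
        return "information resource"
    if status == 303:
        # A 303 (See Other) response: the URI could identify any resource.
        return "could be any resource"
    if 400 <= status < 500:
        # A 4xx response: nothing can be concluded.
        return "unknown"
    # Other codes are outside what this sketch covers.
    return "not covered by this sketch"


for code in (200, 303, 404):
    print(code, nature_of_resource(code))
```

Seen this way, the complaint voiced in the excerpts is easy to state: the rule says something definite only for document-like things, and for everything else it yields no more than "could be any resource".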

My Take

There are certainly technical aspects to this debate that go well beyond my ken. I strongly suspect there are edge cases for which more complicated technical guidance is warranted.

And, it is true, I have been selective in which sides of this debate I am highlighting and therefore supporting. This is not accidental.

While some in this debate have claimed the need to conform to existing doctrine in order to ensure interoperability or the integrity of software systems, from my different perspective as someone desiring to help build a market by extending reach into the broader public, that argument is false. Let’s take the existing architecture we have, but make our best practices recipes simple, our language clear, and our semantics correct. How can we really promote and grow the semantic Web when our own semantics are so patently challenged?

Our community faces a challenge of poor terminology and muddled concepts (or, perhaps more precisely, concepts defined in relation to the semantic Web that are not in conformance with standard understandings). My strong suspicion is that we risk at present over-specification and just plain confusion in the broader public.

This mailing list debate is hugely important, informative and thought provoking. Xiaoshu deserves thanks for his courage and tenacity in engaging this debate in a non-native language; Pat Hayes deserves thanks for trying to capture the arguments in language and terminology more easily understandable to the rest of us and for adding his own considerable experience to the debate; and many of the mailing list regulars deserve sincere thanks for being patient and engaged to allow the nuances of these arguments to unfold.

From my standpoint there is real pragmatic value to these arguments that would bring the terminology and semantics of the semantic Web into better understood and more easily communicated usage, all without affecting or changing the underlying architecture of the Web. (Or, so, to my naïve viewpoint, the argument seems to suggest.)

So long as the semantic Web’s practitioners still number in the hundreds, and those with nuanced understanding of these arcane matters likely only in the scores, the time is ripe to get the language and concepts right. Doing so can help our enterprise reach millions and much more quickly.


[2] Patrick J. Hayes and Harry Halpin, 2008. “In Defense of Ambiguity,” preprint for the International Journal on Semantic Web and Information Systems 4(3), to appear later this year. See http://www.ibiblio.org/hhalpin/homepage/publications/indefenseofambiguity.html.

Posted by AI3's author, Mike Bergman, on April 13, 2008 at 9:53 pm in Semantic Web
The URI link reference to this post is: http://www.mkbergman.com/437/semantic-web-semantics-arcane-but-important/
Posted:April 11, 2008

As late as 2002, no single search engine indexed the entire surface Web. Much has been written about that time, but the emergence of Google (and of others; it was a key battle at the time) worked to extend full search coverage to the Web. That ended the need for so-called desktop metasearchers, until then the only option for getting full Web search coverage.

Strangely, though full coverage of document indexing had been conquered for the Web, dynamic Web sites and database-backed sites fronted by search forms were also emerging. Estimates as of about 2001, made by myself and others, suggested such ‘deep Web’ content was many, many times larger than the indexable document Web and was found in literally hundreds of thousands of sites.

Standard Web crawling is a different technique and technology than “probing” the contents of searchable databases, which requires a query to be issued to a site’s search form. BrightPlanet, a company I founded, and many others such as Copernic and Intelliseek, many of which no longer exist, were formed with the specific aim of probing these thousands of valuable content sites.

From those companies’ standpoints, and mine at that time as well, there was always the threat that the major search engines would draw a bead on deep Web content and use their resources and clout to appropriate this market. Yahoo, for example, struck arrangements with some publishers of deep content to index their content directly, but that still fell short of the different technology that deep Web retrieval requires.

It was always a bit surprising that this rich storehouse of deep Web content was being neglected. In retrospect, perhaps it was understandable: there was still the standard Web document content to index and conquer.

Today, however, Google posted Crawling through HTML forms on one of its developer blog sites. Written by Jayant Madhavan and Alon Halevy, a noted search and semantic Web researcher, the post announces Google’s new deep Web search:

In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn’t find and index for users who search on Google. Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.
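The step described in that excerpt, choosing values for each form input and then generating URLs to crawl, can be sketched in a few lines of Python. This is emphatically not Google's implementation: the function name, the example form fields and the `example.com` address are all assumptions for illustration, and the sketch only handles forms submitted by GET.

```python
from itertools import islice, product
from urllib.parse import urlencode


def candidate_query_urls(action_url, field_values, limit=10):
    """Build up to `limit` GET URLs from combinations of form field values."""
    names = sorted(field_values)
    combos = product(*(field_values[name] for name in names))
    return [
        f"{action_url}?{urlencode(dict(zip(names, combo)))}"
        for combo in islice(combos, limit)
    ]


# Hypothetical form: one text box ("q") and one select menu ("region").
urls = candidate_query_urls(
    "http://example.com/search",
    {"q": ["mammal", "reptile"], "region": ["us", "eu"]},
    limit=3,
)
for url in urls:
    print(url)
```

The `limit` parameter reflects the "small number of queries" the excerpt mentions; deciding which resulting pages are valid and worth indexing is a separate step that this sketch does not attempt.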

To be sure, there are differences and nuances to retrieval from the deep Web. What is described here is not truly directed nor comprehensive. But, the barrier has fallen. With time, and enough servers, the more inaccessible aspects of the deep Web will fall to the services of major engines such as Google.

And, this is a good thing for all consumers desiring full access to the Web of documents.

So, an era is coming to a close. And this, too, is appropriate. For we are also now transitioning into the complementary era of the Web of data.

Posted by AI3's author, Mike Bergman, on April 11, 2008 at 11:51 pm in Deep Web
The URI link reference to this post is: http://www.mkbergman.com/436/another-deep-web-barrier-falls/
Posted:April 2, 2008

Part 4 of 4 on Foundations to UMBEL

Just as DBpedia has provided the nucleating point for linking instance data (see Part 2), UMBEL is designed to provide a similar reference structure for concepts. These concepts provide some fixed positions in space to which other sources can link and relate. And, like references for instance data, the existence of reference concepts can greatly diminish the number of links necessary in the Linked Data environment.
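One way to read the claim about diminishing the number of links is as simple arithmetic. The sketch below assumes a simplified model of my own: either every source links directly to every other source, or each source makes a single link to a shared reference structure. Real Linked Data is messier than either extreme, so treat the numbers as an illustration of the trend only.

```python
def pairwise_links(n_sources: int) -> int:
    """Links needed if every source connects directly to every other source."""
    return n_sources * (n_sources - 1) // 2


def hub_links(n_sources: int) -> int:
    """Links needed if every source connects once to a shared reference hub."""
    return n_sources


for n in (10, 100, 1000):
    print(n, pairwise_links(n), hub_links(n))
```

Under this assumed model, 100 sources need 4,950 pairwise links but only 100 links to a common reference, and the gap widens quickly as sources are added.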

Clearly, the combination of the representativeness of UMBEL’s subject concepts (the “scope” of the ontology) and their relationships (the “structure” of the backbone) is fundamental. These factors in turn express the functional capabilities of the system.

First Things First

The first fundamental point deserving emphasis is that a reference structure of almost any nature has value. We can argue later about what is the best reference structure, but the first task is to just get one in place and begin bootstrapping. Indeed, over time, it is likely that a few reference structures will emerge and compete and get supplemented by still further structures. This evolution is expected and natural and desirable in that it provides choice and options.

A reference structure of concepts has the further benefit of providing a logical reference structure for instances as well. While Wikipedia is perhaps the most comprehensive collection of humanity-wide instances, no single source can or will be complete in scope. Thus, we foresee specialty sources ranging from the companies in Wikicompany to plants and animals in the Encyclopedia of Life or thousands of other rich instance sources also acting as reference hubs.

How does each of these rich instance sources relate to the others? What is the subject concept or topical basis by which they overlap or complement one another? What is the framework and graph structure of knowledge to give this information context? These are the benefits brought by a structure of reference concepts, independent of the specifics of the reference structure itself.

Another key consideration is that broad-scale acceptance is important. An express purpose of UMBEL is to aid the interconnection of related content using broadly accepted foundations.

Alternative Approaches

Since the Web’s inception fifteen years ago, there have been various alternatives tried or in ascendance for organizing and bringing structure to Web content. Some of these may be too static and inflexible, others perhaps too arbitrary or parochial. All approaches to date have had little collective success.

There are also new and exciting developments in social networks and user-driven content and structure arising from areas such as tagging or Wikipedia (and wikis in general). But it is not clear that bottom-up contributions suitable to individual articles or topics can lead to coherent structural frameworks; arguably, they have not done so yet. And then there are sporadic government or corporate or trade association initiatives as well.

Here is a summary of alternate approaches:

  • Existing library systems — Dewey Decimal Classification, Library of Congress, UDC and many other library classification schemes have been touted for the Web and all have failed. Some reasons cited for this failure are that physical books are very different from free digital bits, that Web schema need to evolve quickly, and that stewards and curation are lacking
  • Market share — at various times certain successful vendors have held temporary minor ascendance with content organizational frameworks, generally directory structures. Examples include About, Yahoo!, Open Directory Project (DMOZ), Northern Light, etc. Yet even at their peaks, market shares were low, external adoption was rare, scope was questioned and arbitrary, with interest in directories now nearly absent
  • WordNet — though of strong interest and use to computational linguists, and quite popular for many content analyses, WordNet has seen little consumer or commercial interest. However, the synset structure and its coverage is extremely valuable for concept disambiguation, and therefore has a role in UMBEL (as it does in many other online systems)
  • Standards efforts — some sporadic success and some notable failures have occurred in the standards arenas. Generally, the successful initiatives tend to be in close communities where there are clear financial benefits for adherence, such as in the exchange of financial or commerce data; broader and more ambitious efforts have tended to be less successful
  • Professional organizations and associations — areas such as finance, pharmaceuticals, biology, physics and many bounded communities have enjoyed sporadic and sometimes notable success in developing and using domain-specific schema; none have yet transferred beyond their beginning boundaries to the broader Web
  • Government initiatives — there are episodic successes for government-sponsored content organizational initiatives, mostly in metadata, controlled vocabularies and ontologies, often where contractors or suppliers may be compelled to comply. NIH’s National Library of Medicine (and other NIH branches) has also seen significant domain successes, due to its foresight and its receptive biology, genetics and medical communities
  • Upper ontologies — UMBEL investigated this area considerably in the early months of the project. Most of the upper ontologies have relatively sparse subject concept content, being geared to smaller, abstract-oriented “upper” structures. Some such as SUMO and DOLCE and now PROTON, have concerted initiatives to extend to middle- and domain-level ontologies [1]. To date, penetration of these systems into general Web or commercial realms has been quite limited
  • Wikipedia — a clear and phenomenal success, Wikipedia and related initiatives like Wikinvest and Wikicompany and scores more have proven to be a rich fount for named entities and article-length content, but not for the category and content organization structures in which that content is embedded. This is an area of keen academic and collective interest [2] and it may still result in useful organizational schema as these popular wikis continue to evolve and mature. However, they have not yet done so, and while they are a rich source for entities and data, UMBEL decided to pass on their use for “backbone” structure at this time
  • No collective structure — tagging or folksonomies or doing nothing have perhaps the greatest market share at present.

Since inception, the stated intent of the UMBEL project was to base its subject structure on extant systems. To minimize development time, the structure needed to be drawn from one of the categories above. Possible development of a de novo structure was rejected because of development time and the low probability of gaining acceptance in the face of so many competing alternatives.

Rationale for OpenCyc

The granddaddy of knowledge bases suitable to all human content and knowledge is Cyc. Because of its more than 20-year history, Cyc brings with it considerable strengths and some weaknesses.

Amongst all alternatives, Cyc rapidly emerged as the leading candidate. While its strengths warranted close attention, its weaknesses also suggested a considerable effort to overcome them. This combination compelled the need for a significant investigation and due diligence.

First, here are OpenCyc’s strengths:

  • Venerable and solid — through an estimated 200 person-years of engineering and effort, the Cyc structure has been tested and refined through many projects and applications. While a few years back such groundings were unparalleled in the field, we are now also seeing some Internet-wide projects tap into the law of large numbers to get significant inputs of human labor. Cyc has also tapped this avenue for ongoing expansion of its KB using the online FACTory game [3]
  • Community — there is a large community of Cyc users and supporters from academic, government, commercial and non-profit realms. Moreover, the formation of The Cyc Foundation has served as a vehicle for tapping into volunteer effort
  • Upgrade Path — OpenCyc has an upgrade path to the more capable ResearchCyc, full Cyc and the services of Cycorp
  • Comprehensive — no existing system has the scope, breadth and coverage of human concepts to match that of Cyc (however, sources for named entities such as Wikipedia have recently passed Cyc in scope; see next section)
  • Common sense — since its founding as a project, later backed by the standalone Cycorp, Cyc has set for itself a challenge that is both more pragmatic and harder than that of other knowledge systems. Cyc has set out to capture the common sense at the heart of human reasoning. This objective means codifying generally unstated logic and rules-of-thumb (not unlike teaching a baby to walk and talk and read), all of which are lengthy tasks of trial and error. However, as Cyc has gained this foundation, it has also gained a more solid basis for its reasoning and conceptual relationships
  • Power and inference — ultimately the purpose of a knowledge base is to support reasoning and inference by computer when presented with an (often small) starting set of assertions or facts. Cyc has literally thousands of microtheories now governing its inference domains, giving it a scope and power unmatched by other systems. The importance of such reasoning lies not in the silly science fiction of autonomous intelligent robots, but in achievable aids to make connections, determine relationships, and filter and order results
  • Robust supporting capabilities — such knowledge base-wide capabilities can also be deeply leveraged in such areas as entity extraction, machine translation, natural language processing, risk analysis or one of the other dozens of specialty modules available in Cyc, and
  • Free and open — last, but not least, is the fact that a mostly complete Cyc was released as a free and open source version in 2002. OpenCyc has now been downloaded more than 100,000 times and is in production use for many applications. Non-profits and academics can also obtain access to the full capabilities of the Cyc system through ResearchCyc. This open character is absolutely essential because leading Web applications and leading innovators of the Web eschew proprietary systems.

Even after months of investigation and involvement, the rich practical uses to which the OpenCyc knowledge base can be applied are still revealing themselves.

Drawbacks to OpenCyc

But there are weaknesses and problems with Cyc.

To be sure, some individuals have criticized Cyc, and some of the historical criticisms involved fears of Big Brother or grandiose claims about artificial intelligence or machine reasoning. These are criticisms of hype, immaturity or ignorance; they are different from the drawbacks observed by our UMBEL project and are not discussed further here.

In UMBEL’s investigation of Cyc, we observed these drawbacks:

  • Obscure upper ontology — the Cyc upper ontology, shown in the figure below, is perhaps not as clean as more recent upper ontologies (Proton, [4] for example, is a very clean system). The various sub-classifications of ‘Thing’ and degrees of “tangibility” seem particularly problematic. However, since these are not direct binding concepts for UMBEL and provide appropriate “glue” for the upper portions of the graph, these criticisms can be easily overlooked
[Figure: Cyc Upper Ontology]
  • Cruft — twenty years of projects and forays into obscure domains (many for the military or intelligence communities) have left a significant degree of cruft marbled through the knowledge base. Indeed, as our vetting showed, roughly 30% of the concepts in Cyc are holdovers from prior projects or relate to internal Cyc-only topics
  • Reasoning concepts — another 15% or so of Cyc concepts are abstract or for reasoning purposes, such as reasoning over colors, beliefs, the sizes of objects, their orientations in space, and so forth. These are certainly legitimate concepts and appropriate to Cyc’s purposes, but are not needed or desired for UMBEL’s purposes
  • Greater expressivity — Cyc is grounded in the LISP language and has many higher-order logic constructs. Paradoxically, this greater expressiveness may make translation to UMBEL more difficult
  • Older conventions — also related to these groundings in an earlier era are the reliance on functions and functional predicates for many relations, and the absence of the current triple data model underlying RDF. While it is true that OWL versions of OpenCyc have been made and are the basis for UMBEL’s work to date, there are also errors in these translations, perhaps due in some instances to the lesser expressiveness of RDF and OWL
  • Documentation — while complete reference materials can ultimately be found, it is difficult to do so and introductory and entry-level tutorials could stand to be augmented
  • Named entities — for many years, but now especially with the emergence of Wikipedia, Cyc has been criticized for its relative paucity of named entity coverage and for imbalances in what it does contain. While from UMBEL’s perspective this appears to be strictly correct, such criticism misses the mark of Cyc’s special purpose and contributions as a solid conceptual and common sense framework. Those common-sense portions of the system change far less over time, and can be readily mapped to named entity sources. Indeed, perhaps Cyc will now see new vigor as the Web becomes a superior source for contemporary named entity coverage while Cyc fulfills its more natural (and needed) structural role.

Surprisingly, for a system of its age and evolution, Cyc seems to have adhered well to naming conventions and other standards.

UMBEL’s project diligence thus found the biggest issue going forward to be the cruft in the system. There is a solid structure underneath Cyc, but one that is too often obscured and not made as shiny and clean as it deserves.

The Decision and Design

Five months of nearly full-time due diligence was devoted to this question of the suitability of Cyc as the intellectual grounding for UMBEL.

On balance, OpenCyc’s benefits significantly outweighed its weaknesses. This balance is also considerably superior to that of any potential alternative.

An important factor through this deliberation was the commitment of Cycorp and The Cyc Foundation to the aims of UMBEL, and the willingness of those organizations to lend time and effort to promote UMBEL’s aims. Twenty years of development and the investment of decades of human effort and scrutiny provide a foundation of immense solidity.

Though perhaps Wikipedia (or something like it also based on broad Web input) might emerge with the scope and completeness of Cyc, that prospect is at minimum some years away and by no means certain. No current framework other than Cyc can meet UMBEL’s immediate purposes. Moreover, as stated at the outset, UMBEL’s purpose is pragmatic. We will leave it to others to argue the philosophical nuances of ontology design and “truth” while we get on with the task of creating context of real value.

The next decision was to base all UMBEL subject concepts on existing concepts in OpenCyc.

This means that UMBEL inherits all of the structural relations already in OpenCyc. It also means that UMBEL can act as a sort of contextual middleware between unstructured Web content and the inferential and tools infrastructure within OpenCyc (and beyond into ResearchCyc and Cyc for commercial purposes) and back again to the Web. We term this “roundtripping” and the capability is available for any of the 21,000 subject concepts vetted from OpenCyc within UMBEL.
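To make the idea of inheriting structure from a vetted subset more concrete, here is a minimal sketch in Python. It is not UMBEL’s actual vetting procedure, which is not described in this post. It only illustrates one plausible way a reduced set of concepts could keep the hierarchy of its source: each retained concept is linked to its nearest retained ancestor, climbing past any concept that was dropped. The concept names, the `BROADER` map and the `VETTED` set are all invented for illustration and are not real OpenCyc or UMBEL identifiers.

```python
# Toy "broader" links between concepts. All names are invented for illustration
# and are not actual OpenCyc or UMBEL identifiers.
BROADER = {
    "Dog": ["CanineAnimal"],
    "CanineAnimal": ["LegacyProjectConcept"],
    "LegacyProjectConcept": ["Mammal"],
    "Mammal": ["Animal"],
    "Animal": [],
}

# Concepts assumed to survive vetting (a hypothetical selection).
VETTED = {"Dog", "Mammal", "Animal"}


def nearest_vetted_ancestors(concept, broader, vetted):
    """Return the closest vetted ancestors of `concept`, skipping removed concepts."""
    found, seen = set(), set()
    stack = list(broader.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent in seen:
            continue
        seen.add(parent)
        if parent in vetted:
            found.add(parent)  # stop climbing at the first vetted ancestor
        else:
            stack.extend(broader.get(parent, []))  # climb past removed concepts
    return found


# The reduced hierarchy: every vetted concept mapped to its nearest vetted parents.
backbone = {c: nearest_vetted_ancestors(c, BROADER, VETTED) for c in VETTED}
```

In this toy example, "Dog" ends up directly under "Mammal" even though two intermediate concepts were removed, so the relation that existed in the source hierarchy is preserved in the smaller one.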

Having made these commitments, our next effort was to break out the brushes, roll up the sleeves, and plunge into a Spring session of deep cleaning. This effort to vet and clean OpenCyc will be documented in the Technical Report to accompany the first release of the UMBEL ontology. We think you’ll like its shiny new look. :)

This is Part 4 of 4 on the foundations to UMBEL. This four-part series covers a Re-Introduction to UMBEL, UMBEL: Making Linked Data Classy, Subject Concepts and Named Entities, and Basing UMBEL’s Backbone on OpenCyc. These articles are lead-ins to the discussion of the actual UMBEL ontology. That series will begin next.

[1] Examples of upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc, John Sowa’s Top-Level Categories and BFO (Basic Formal Ontology). Most of the content in their upper-levels is akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget’s Thesaurus), though Cyc is a clear exception with its stated emphasis on capturing “common knowledge.”
[2] See, for example, this listing of about 100 academic articles devoted to structure and linguistic uses of Wikipedia: http://www.mkbergman.com/?p=417.
[3] FACTory is a game that lets people enter knowledge into the Cyc knowledge base. Via this online game, Cyc tries to determine the truth or falsehood of a series of facts. When enough people have agreed that a fact is true or not, Cyc considers it confirmed and stops asking about it. See http://game.cyc.com/helpfiles/HowToPlay.html.
[4] There are many aspects that make PROTON one of the more attractive reference ontologies. The PROTON ontology (PROTo ONtology), developed within the scope of the SEKT project, is attractive because of its understandability, relatively small size, modular architecture and a simple subsumption hierarchy. It is available in an OWL Lite form and is easy to adopt and extend. On the face of it, the Topic class within PROTON, which is meant to serve as a bridge between different ontologies, may also provide a binding layer to specific subject topics as sub-classes or class instances.