Posted:May 9, 2007

The Encyclopedia of Life and Linking Open Data

This Week Marks Two New Milestones on the Road to the Semantic Web

Two milestones of the structured Web are occurring as we speak. Let’s break out the RDF and party!

Encyclopedia of Life

Today, the Encyclopedia of Life (http://www.eol.org) was formally unveiled. See EOL’s amazing intro video.

EOL is meant to be the Wikipedia of all 1.8 million known living species, backed with real money and real prestige. The effort continues a growing list of impressive and open data sources being compiled and presented on the Web.

The Encyclopedia of Life is a planned 10-year effort to create a free Internet resource to catalog and describe every one of the planet’s organisms. Initially seeded with $12.5 million in backing from two US philanthropies, the John D. and Catherine T. MacArthur Foundation and the Alfred P. Sloan Foundation, the effort is anticipated to cost $100 million by completion.

EOL is based on an idea by Harvard biologist and noted author Edward O. Wilson, a 2007 TED winner for the idea (as is better seen and explained in his acceptance video). Other initial project backers are Harvard University, the Smithsonian Institution, Missouri Botantical Garden, the Biodiversity Library Project with its international participants, and the Chicago Field Museum of Natural History.

Each species will get its own Web page with a structured record, similar to the “infoboxes” within Wikipedia. Information will include photos, technical name and phylogeny, lay description, range maps, and place within the Tree of Life. More prominent species will also get information on its genetics and behavior and compiled scientific research articles, some from centuries ago.

EOL is two and one-half orders of magnitude more ambitious than the earlier Tree of Life Web Project from the University of Arizona with its 5000 pages, and 20 times more ambitious than Wikipedia’s own Wikispecies.

Potential applications range from conservation to mapping to identifying the odd plant or animal species. But, has been found with important predecessor data sets such as GenBank (65 billion genetic sequences for a 100,000 organisms), FishBase (30,000 species) or Conabio (biodiversity in Mexico), among literally hundreds of others, popularity and use exceeds wildest expectations. Conservationists, researchers, school children, amateur naturalists, and outdoor lovers will all find use for the site.

Linking Open Data

EOL is but the newest poster child of massive and hugely important datasets being exposed on the Internet. One initiative with the sole purpose to promote this trend and the interoperability of the data utilizing RDF is the Linking Open Data project. The LOD project is part of the W3C’s Semantic Web Education and Outreach interest group (yes, the unwieldy, SWEO-IG).

The LOD project is holding its first meetings ever this week at WWW2007 in Banff, Alberta. While only a mere six-months old, the SWEO-IG is stimulating much broader interest in the semantic Web through sponsored products, FAQs (also announced yesterday and got really digged!), use cases, supporting information, and growing and useful compilations such as datasets and semantic Web tools.

Anyone interested in the semantic Web should really be familiar with the ESW wiki (which unfortunately is in need of re-organization and clean-up; but keep poking, there are many riches hidden in the nooks and crannies!). You may also want to track the public mailing list of the Linking Open Data group.

I already wrote extensively on the exciting DBpedia intiative (Did you Blink? The Structured Web Just Arrived), itself one of the catalytic datasets at the core of the Linking Open Data efforts. Other datasets actively being pursued for inclusion by the group include:

Geonames data, including its ontology and 6 million geographical places and features, including implementation of RDF data services
700 million triples of U.S. Census data from Josh Tauberer
Revyu.com, the RDF-based reviewing and rating site, including its links to FOAF, the Review Vocab and Richard Newman's Tag Ontology
The "RDFizers" from MIT Simile Project (not to mention other tools), plus 50 million triples from the MIT Libraries catalog covering about a million books
GEMET, the GEneral Multilingual Environmental Thesaurus of the European Environment Agency
And, WordNet through the YAGO project, and its potential for an improved hierarchical ontology to the Wikipedia data.

Additional candidate datasets of interest have also been identified by the SWEO interest group on this page:

Gene Ontology database and its 6 million annotations
Gene fruitfly embryogenesis images from the Berkeley Drosophila Genome Project
Roller Blog entries using the Atom/OWL vocabulary
Various semantic Web interest group and conference materials
Various FOAF-enabled profiles
The UniProt protein database with its 300 million triples
OpenGuides, a network of wiki-based city guides
dbtune, an RDF-enabled version of the Magnatune music database using the Music Ontology
The SKOS Data Zone
Other MIT Simile data collections, including the CIA's World Factbook, Library of Congress' Thesaurus of Graphic Materials, National Cancer Institute's cancer thesaurus, W3C's technical reports
The RDF version of the DMOZ Open Directory Project
GovTrack.us of U.S. Congress legislator and voting records
Chef Moz restaurant and review guides from DMOZ
DOAP Store and its DOAP project descriptions, and
MusicBrainz.

The SWEO-IG and its Linking Open Data initiative are catalyzing a new phase of excitement with relevant information, data and demos.

If anything, all I can say is: Set your sights higher! There’s data galore that needs to be “RDFized”.

So, folks, keep up the great work! And good luck with this week’s meetings in the beauty of Canada.

[BTW, please make sure that EOL makes its data available in RDF!]

Posted:May 8, 2007

Iris: Smashing Online Photo Smash-ups

Now, that’s way cool!

Jasper Potts and his team at Sun have worked up some very nice magic with photo display and editing. Iris is an online photo browsing, editing and slide show application.

It is a “smash-up” of Java applets and next generation web concepts. You can create galleries, edit photos, rotate and create that cool 3D rotating photo cube effect we’ve been seeing lately, you name it. Jasper’s short online demo of Iris is really cool, too! (I think the live demo is to be presented in Jasper’s talk at JavaOne tomorrow.)

Iris works with the Flickr online photo service. What's going on under the hood is the use of a Java applet that receives JavaScript events when clicking on the Iris images to contact the Flickr web service; it actually provides the on screen ability to do remote Flickr operations. (Probably I shouldn’t mention it, shhhh, but this is what semantic Web interfaces can do once they become cool with remote data.).

Hey! Shag me baby!

BTW, thanks to Henry Story for the link to this (he is always sniffing out the good Java stuff).

Posted:May 3, 2007

Structure Paves the Way to the Semantic Web

[Click on image for full-size pop-up]

Colorado Interstate Construction – 1970
Courtesy National Archives

NOTE: I was pleased when Jim Hendler asked me to pen some thoughts on the semantic Web (as a vehement moderate, I always use the mixed case). I think both of us hoped that my background combining Internet business with Web science might bring a pragmatic perspective to the subject. The material below is to appear as a guest editorial in IEEE Intelligent Systems in the May/June issue. I thank it for allowing early releases by authors, and to Linda World, its super senior editor, for cleaning up my language. There are some other differences due to length considerations. MKB

For a dozen years, my career has been centered on Internet search, dynamic content and the deep Web. For the past few years, I have been somewhat obsessed by two topics. The first topic, a conviction really, is that implicit structure needs to be extracted from Web content to enable it to be disambiguated, organized, shared and re-purposed. The second topic, more an open question as a former academic married to a professor, is what might replace editorial selections and peer review to establish the authoritativeness of content. These topics naturally steer one to the semantic Web.

A Millennial Perspective

The semantic Web, by whatever name it comes to be called, is an inevitability. History tells us that as information content grows, so do the mechanisms for organizing and managing it. Over human history, innovations such as writing systems, alphabetization, pagination, tables of contents, indexes, concordances, reference look-ups, classification systems, tables, figures, and statistics have emerged in parallel with content growth.

When the Lycos search engine, one of the first profitable Internet ventures, was publicly released in 1994, it indexed a mere 54,000 pages [1]. When Google wowed us with its page-ranking algorithm in 1998, it soon replaced my then favorite search engine, AltaVista. Now, tens of billions of indexed documents later, I often find Google’s results to be overwhelming dross — unfortunately true again for all of the major search engines. Faceted browsing, vertical search, and Web 2.0’s tagging and folksonomies demonstrate humanity’s natural penchant to fight this entropy, efforts that will next continue with the semantic Web and then mechanisms unforeseen to manage the chaos of burgeoning content.

An awful lot of hot air has been expelled over the false dichotomy of whether the semantic Web will fail or is on the verge of nirvana. Arguments extend from the epistemological versus ontological (classically defined) to Web 3.0 versus SemWeb or Web services (WS*) versus REST (Representational State Transfer). My RSS feed reader points to at least one such dust up every week.

Some set the difficulties of resolving semantic heterogeneities as absolutes, leading to an illogical and false rejection of semantic Web objectives. In contrast, some advocates set equally divisive arguments for semantic Web purity by insisting on formal ontologies and descriptive logics. Meanwhile, studied leaks about “stealth” semantic Web ventures mean you should grab your wallet while simultaneously shaking your head.

A Decades-Long Perspective

My mental image of the semantic Web is a road from here to some achievable destination–say, Detroit. Parts of the road are well paved; indeed, portions are already superhighways with controlled on-ramps and off-ramps. Other portions are two lanes, some with way too many traffic lights and some with dangerous intersections. A few small portions remain unpaved gravel and rough going.

[Click on image for full-size pop-up]

Wreck in Nebraska during the 1919 Transcontinental Motor Convoy
Courtesy National Archives

A lack of perspective makes things appear either too close or too far away. The automobile isn’t yet a century old as a mass-produced item. It wasn’t until 1919 that the US Army Transcontinental Motor Convoy made the first automobile trip across the United States. The 3,200 mile route roughly followed today’s Lincoln Highway, US 30, from Washington, D.C. to San Francisco. The convoy took 62 days and 250 recorded accidents to complete the trip (see figure), half on dirt roads at an average speed of 6 miles per hour. A tank officer on that trip later observed Germany’s autobahns during World War II. When he subsequently became President Dwight D. Eisenhower, he proposed and then signed the Interstate Highway Act. That was 50 years ago. Today, the US is crisscrossed with 50,000 miles of interstates, which have completely remade the nation’s economy and culture [2].

Today’s Perspective

Like the interstate system in its early years, today’s semantic Web lets you link together a complete trip, but the going isn’t as smooth or as fast as it could be. Nevertheless, making the trip is doable and keeps improving day by day, month by month.

My view of what’s required to smooth the road begins with extracting structure and meaningful information according to understandable schema from mostly uncharacterized content. Then we store the now-structured content as RDF triples that can be further managed and manipulated at scale. By necessity, the journey embraces tools and requirements that, individually, might not constitute semantic Web technology as some strictly define it. These tools and requirements are nonetheless integral to reaching the destination. We are well into that journey’s first leg, what I and others are calling the structured Web.

For the past six months or so I have been researching and assembling as many semantic Web and related tools as I can find [3]. That Sweet Tools listing now exceeds 500 tools [4] (with its presentation using the nifty lightweight Exhibit publication system from MIT’s Simile program [5]). I’ve come to understand the importance of many ancillary tool sets to the entire semantic Web highway, such as natural language processing and information extraction. I’ve also found new categories of pragmatic tools that embody semantic Web and data mediation processes but don’t label themselves as such.

In its entirety, the Sweet Tools listing provides a pretty good picture of the semantic Web’s state. It’s a surprisingly robust picture — though with some notable potholes — and includes impressive open source options in all categories. Content publishing, indexing, and retrieval at massive scales are largely solved problems. We also have the infrastructure, languages, and (yes!) standards for tying this content together meaningfully at the data and object levels.

I also think a degree of consensus has emerged on RDF as the canonical data model for semantic information. RDF triple stores are rapidly improving toward industrial strength, and RESTful designs enable massive scalability, as terabyte- and petabyte-scale full-text indexes prove.

Powerful and flexible middleware options, such as those from OpenLink [6], can transform and integrate diverse file formats with a variety of back ends. The World Wide Web Consortium’s GRDDL standard [7] and related tools, plus various “RDF-izers” from Massachusetts Institute of Technology and elsewhere [8], largely provide the conversion infrastructure for getting Web data into that canonical RDF form. Sure, some of these converters are still research-grade, but getting them to operational capabilities at scale now appears trivial.

Things start getting shakier when trying to structure information into a semantic formalism. Controlled vocabularies and ontologies range broadly and remain a contentious area. Publishers and authors perhaps have too many choices: from straight Atom or RSS feeds and feeds with tags to informal folksonomies and then Outline Processor Markup Language [9] or microformats [10]. From there, the formalism increases further to include the standard RDF ontologies such as SIOC (Semantically-Interlinked Online Communities), SKOS (Simple Knowledge Organizing System), DOAP (Description of a Project), and FOAF (Friend of a Friend) [11] and the still greater formalism of OWL’s various dialects [12].

If we compare the semantic Web to the US interstate highway system, we’re still in the early stages of a journey that will remake our economy and culture.

Many potholes on the road to the semantic Web exist.

One ready task is to transform existing structure to RDF. Another priority is to refine tools to extract structure and meaningful information from uncharacterized content.

Arguing which of these is the theoretical best method is doomed to failure, except possibly in a bounded enterprise environment. We live in the real world, where multiple options will always have their advocates and their applications. All of us should welcome whatever structure we can add to our information base, no matter where it comes from or how it’s done. The sooner we can embrace content in any of these formats and convert it into canonical RDF form, we can then move on to needed developments in semantic mediation, some of the roughest road on the journey.

Potholes on the Semantic Highway

Semantic mediation requires appropriate structured content. Many potholes on the road to the semantic Web exist because the content lacks structured markup; others arise because existing structure requires transformation. We need improved ways to address both problems. We also need more intuitive means for applying schema to structure. Some have referred to these issues as “who pays the tax.”

Recent experience with social software and collaboration proves that a portion of the Internet user community is willing to tag and characterize content. Furthermore, we can readily leverage that resulting structure, and free riders are welcomed. The real pothole is the lack of easy–even fun–data extractors and “structurizers.” But we’re tantalizingly close.

Tools such as Solvent and Sifter from MIT’s Simile program [13] and Marmite from Carnegie Mellon University [14] are showing the way to match DOM (document object model) inspectors with automated structure extractors. DBpedia, the alpha version of Freebase, and System One now provide large-scale, open Web data sets in RDF [15], including all of Wikipedia. Browser extensions such as Zotero [16] are showing how to integrate structure management into acceptable user interfaces, as are services such as Zoominfo [17]. Yet we still lack easy means to design the differing structures suitable for a plenitude of destinations.

Amazingly, a compelling road map for how all these pieces could truly fit together is also incomplete. How do we actually get from here to Detroit? Within specific components, architectural understandings are sometimes OK (although documentation is usually awful for open source projects, as most of the current tools are). Until our community better documents that vision, attracting new contributors will be needlessly slower, thus delaying the benefits of network effects.

So, let’s create a road map and get on with paving the gaps and filling the potholes. It’s not a matter of standards or technology–we have those in abundance. Let’s stop the silly squabbles and commit to the journey in earnest. The structured Web‘s ability to reach Hyperland [18], Douglas Adam’s prescient 1990 forecast of the semantic Web, now looks to be no further away than Detroit.

[1] Chris Sherman, “Happy Birthday, Lycos!,” Search Engine Watch, August 14, 2002. See http://searchenginewatch.com/showPage.html?page=2160551.

[2] David A. Pfeiffer, “Ike’s Interstates at 50: Anniversary of the Highway System Recalls Eisenhower’s Role as Catalyst,” Prologue Magazine, National Archives, Summer 2006, Vol. 38, No. 2. See: http://www.archives.gov/publications/prologue/2006/summer/interstates.html.

[3] The mention of specific tool names is meant to be illustrative and not necessarily a recommendation.

[4] Sweet Tools (SemWeb) listing; see https://www.mkbergman.com/?page_id=325 .

[5] See http://simile.mit.edu/exhibit/.

[6] OpenLink Software’s Virtuoso and Data Spaces products; see http://www.openlinksw.com/.

[7] W3C’s Gleaning Resource Descriptions from Dialects of Languages (GRDDL, pronounced “griddle”). See http://www.w3.org/2004/01/rdxh/spec.

[8] See http://simile.mit.edu/wiki/RDFizers.

[9] Outline Processor Markup Language (OPML); see http://www.opml.org/.

[10] Microformats; see http://microformats.org/.

[11] DOAP (Description of a Project), FOAF (Friend of a Friend), SIOC (Semantically-Interlinked Online Communities) and SKOS (Simple Knowledge Organizing System)..

[12] W3C’s Web Ontology Language (OWL). See http://www.w3.org/TR/owl-features/.

[13] Solvent (http://simile.mit.edu/wiki/Solvent) and Sifter (http://simile.mit.edu/wiki/Sifter) are from MIT’s Simile program.

[14] Marmite (http://www.cs.cmu.edu/~jasonh/projects/marmite/) is from Carnegie Mellon University.

[15] DBpedia (http://dbpedia.org/docs/) and Freebase (in alpha, by invitation only at http://www.freebase.com/) are two of the first large-scale open datasets on the Web; Wikipedia has also been converted to RDF by System One (http://labs.systemone.at/wikipedia3).

[16] Zotero is produced by George Mason University’s Center for History and New Media; see http://www.zotero.org.

[17] ZoomInfo (http://www.zoominfo.com/) provides online structured search of companies and people, plus broader services to enterprises.

[18] The late Douglas Adams, of Doctor Who and A Hitchhiker’s Guide to the Galaxy fame, produced a TV program for BBC2 presaging the Internet called Hyperland. This 50-min video can be seen in five parts via YouTube at Part 1 of 5, 2 of 5, 3 of 5, 4 of 5 and 5 of 5.

Posted:May 1, 2007

There’s Not Yet Enough Backbone

Wikipedia is an Essential — but Insufficient Alone — Organizing Subject Basis for the Structured Web

There are some really, really exciting events converging around the idea of RDF and exposed meaningful data and the ways (OK, yes, ontologies) to organize it. We have seen important announcements in recent weeks by Freebase and DBpedia, among others, that show how RDF and related data forms are being exposed on the Web for large and meaningful datasets. These are the structured Web precursors to the semantic Web.

There are also some really cool browsers and data navigators that are being tested and floating around at present.

Four Current Examples

What is shown below are four current alternatives for accessing and querying Wikipedia content in various structured ways. Each shares many of the same aspects and each has slight differences from the others. All four are experimental to some degree and most have somewhat unrefined interfaces:

Search DBpedia	Metaweb Explorer
System One Wikipedia³	YAGO Experimental

These four systems in clockwise order from upper left are:

Search DBpedia — I earlier covered DBpedia in some detail, which is mostly a conversion of Wikipedia to RDF, though it does include some other datasets. The new search interface from Georgi Kobilarov is what is shown above
Metaweb Explorer — this tool is from the developers of Freebase and works directly against Wikipedia categories and data; it is used to semi-automatically select and clean-up many of the categories and name-value pairs in templates and infoboxes prior to uploading en masse to Freebase (see further)
YAGO Experimental — this innovative approach using a conceptual variant to RDF involves “yet another great ontology” (YAGO) which is a combination of subject structure from Wikipedia as refined by WordNet synsets. A very interesting paper on this approach is found at the authors’ Max Planck Institute SaarbrÃ¼cken Web site, and
System One’s Wikipedia³ — this is the first of the Wikipedia conversions to RDF. It also uses a more structured ontology, largely based on WikiOnt, but also elements of SKOS and Dublin Core (no direct online demo; must have access via System One).

In each of the cases above, a general query on the subject of automobiles was posed to the services. Queries around such topics, while producing many additional appropriate and related topics, also failed to produce a “natural” feeling organizational structure, such as within a subject tree, that would aid browsing or discovery. For example, getting a simple listing of automobile nameplates is generally quite difficult with these systems.

Still Missing Some Vertebrae

These four examples thus point to a real problem: the lack of a referential subject or topic structure around which to organize and access all of this emerging online data. Current attempts to do so based solely on Wikipedia fall significantly short (IMHO). (In fact, the Metaweb Explorer is designed expressly to help overcome this problem.)

That is because the starting basis of Wikipedia information has been built entirely from the bottom up — namely, what is a deserving topic. This has served Wikipedia and the world extremely well, with now nearly 7 million articles online. As socially-driven and -evolving, I foresee Wikipedia to continue to be the substantive core at the center of a knowledge organizational framework for some time to come. To use the backbone analogy of this posting, Wikipedia forms the spinal cord.

But to complete the backbone, more structure is needed.

Wikipedia itself provides much useful structure. There is an internal categorization system (which is the subject organizational basis for much of the four examples noted above), plus its templates and infoboxes. My earlier article described many of these.

Yet I find it interesting that the group at the Max Planck Institute layered on WordNet to provide greater semantic richness and structure to Wikipedia to derive its YAGO ontology, while System One embraced the specific RDFS framework of the Simple Knowledge Organization Systems (SKOS) to provide hierarchical and other structure. I believe both of these attempts are right on target and are adding more vertebrae — more strength — to this backbone.

The W+W+S+? Equation

I thus believe that a suitable subject structure for organizing knowledge is both needed and must be adaptable and self-defining. These criteria reflect expressions of actual social usage and practice, which of course changes over time as knowledge increases and technologies evolve.

Wikipedia provides the bottom-up basis for this subject structure; WordNet provides a contextual richness based on the evolving nature of language, and SKOS provides a representational schema for communicating this structure in RDF. Thus, W + W + S are part of the vertebrae in this subject structure.

Yet there still seems to be a missing piece. Namely, the remaining question mark is the actual top-down superstructure of subject and topic organization. Unlike historical systems (such as the Dewey Decimal System or the Library of Congress Subject Headings), and unlike grand schemas developed by committees or standards bodies, I think that such structure must also evolve from the global community of practice, as has Wikipedia and WordNet. (That does not mean that everyone votes on such things or that the process is democratic with a small d, just simply that there is an open process whereby anyone of interest may contribute or challenge. This can be a general contributory process such as how Wikipedia developed, or a derivation from actual usage as is the case with WordNet.)

In classical plant or animal phylogenies developed by systematicists, classification systems are called “natural” that reflect the actual nature of relationships among organisms. I think it should also be possible to discover more “natural” (as opposed to imposed, arbitrary or “artificial”) systems for classifying knowledge as subjects as well. In fact, we likely already have the raw grist to do so based on folksonomies, large numbers of Web searches, further processing of WordNet and Wikipedia, or similar primary data. (The Freebase approach may also show promise.)

Like WordNet itself, such starting data could be analyzed for subject hierarchies and relationships. Such ontology learning methods from text are well-advanced (or, rather, sufficiently advanced to provide a first-generation hierarchical subject structure). Perhaps a grand challenge against a large contextual Web term set could provide the basis for choosing a representative basis for such a top-down subject structure. This could be the missing vertebra in the W + W + S + ? backbone.

Finally, as with all of the other pieces of the backbone, no one is looking for the best and final answer. All we are looking for today is a satisficing answer that can gain the trust and acceptance of the global online community. Something is better than nothing.

There’s plenty of time to adapt and refine such methods into the future.

Posted:April 26, 2007

Priming the Pump and Threshold Conditions

Image from http://newsimg.bbc.co.uk/media/images/41466000/jpg/_41466772_sotherton300.jpg

The Structured Web is But an Early Hurdle to the Semantic Web

About a year ago (May 30, 2006), Dr. Douglas Lenat, the president and CEO of Cycorp, Inc., gave a great talk at Google called Computers Versus Common Sense. Doug, a former professor at Stanford and Carnegie Mellon, has been working in artificial intelligence and machine learning his entire career. Since 1984 he has been working on the project Cyc, which subsequently formed the basis for starting Cycorp in 1994. Cycorp, as may become apparent below, does much work for the defense and intelligence communities.

Google Research selected this 70-min video as one of its best of 2006, and I have to heartily concur. Doug is very informative and is also an an entertaining and humorous speaker.

But that is not the main reason for my recommendation. For what Doug presents in this video are some of the real common sense challenges of semantic matching and reasoning by computer. These are the threshold hurdles of intelligent agents and real-time answering and (perhaps) forecasting, the ultimate objectives that some equate to the “Semantic Web” (title case):

Because of the reasoning objectives Cycorp and its clients have set for the system, threshold conditions require not only more direct deductive reasoning (logical inference from known facts or premises), but also inductive (inferred based on the likelihood or belief of multiple assertions) and even abductive reasoning (likely or most probabilistic explanation or hypothesis given available facts or assertions). Doug makes clear the devilishly difficult challenges of determining semantic relevance when complete machine-based reasoning is an objective.

As I listened to the video, I interpreted the attempt to reach these objectives as bringing at least four major, and linked, design implications.

The first design implication is that the reasoning basis requires many facts and assertions about the world, the basis of the “common sense” in the knowledge base. An early lesson that some AI practitioners in the “common sense” camp came to hold was that learning systems that did not know much, could not learn much. When Cyc was started more than 20 years ago, Lenat and Marvin Minsky estimated on the back of an envelope that it would take on the order of 1000 person-years to create a knowledge base with sufficient world knowledge to enable meaningful reasoning thereafter. This is what Lenat has called “priming the pump” of the knowledge base with common sense to resolve so many classes of semantic and contextual ambiguities.

However, this large number of assertions has a second design implication, if I understood Lenat correctly, for the need for higher-order predicate calculus. These higher orders for quantification over subsets and relations or even relations over relations are designed to reduce the number of potential “facts” that need to be queried for certain questions. This makes the knowledge base (KB) as a whole more computationally tractable, and able to provide second or sub-second response times.

A third implication is that, again to maintain computational tractability, reasoning should be local (with local ontologies) and with specialized reasoning modules. Today, Cyc has more than 1000 such modules and reasoning in some local areas may not actually infer correctly across the global KB. Lenat likens this to the observation that local geography appears flat even though we know the entire Earth is a globe. This enables local simplifications to make the inferences and reasoning tractable.

Finally, the fourth implication is a very much larger number of predicates than in other knowledge bases or ontologies, actually more than 16,000 in the current Cyc KB. This large number comes about because of:

simplifying some of the more complex expressions that frequently repeat themselves in some of the higher-order patterns noted above as new predicates, again to speed processing time, and
providing more precise meanings to certain language verbs that humans can disambiguate because of context, but which pose problems to computer processing. For example, Cyc contains 23 different predicates relating to the word “in”.

Growing the Knowledge Base

Roughly about the year 2000, sufficient “pump priming” had taken place such that Cyc could itself be used to extend its knowledge base through machine learning. A couple of critical enablers for this process were the querying of Web search engines and the engagement of volunteers and others to test the reasonableness of new assertions for addition to the KB. The basic learning and expansion process is now generally:

formal predicate calculus language -> natural language queries -> issue to Web search engine -> translate results back to predicate language -> present inferences for human review -> accept / reject result (50%) ->’ add to KB (knowledge base)

The engagement of volunteers is also coming about through the use of online “games”. Open source (see below) is another recent tool. (BTW, the 50% acceptance rate is not that there are so many wrong “facts” on the Web, but that context, humor, sarcasm or effect can be used in human language that does not actually lead to a “correct” common-sense assertion. Such observations do give pause to unbounded information extraction.)

Thus, today, Cyc has grown to have a very significant scope, as some of these metrics indicate:

16,000 predicates
1,000 reasoning modules
300,000 concepts
4,000 physical devices
400 event-participant relationships
11,000 event types
171,000 “names” (chemicals, persons, places, etc.)
1,100 geospatial classes, 500 goespatial predicates
3.2 million assertions

As might be expected, the overall KB continues to grow, and on an accelerating rate.

Threshold Conditions and the Structured Web

This all appears daunting. And, when viewed through the lens of the decades to get the knowledge base to this scale and in terms of the ambitiousness of its objectives, it is.

But semantic advantage and the semantic Web is not an either/or proposition. It is a spectrum of use and benefit, with Cyc representing a relatively extreme end of that spectrum.

In my recent post on Structurizing the Web with RDF, I noted a number of key areas in which the structured Web would bring benefit, including more targeted, filtered search results, better results presentation, and the ability to search on objects and entities not simply documents. Structurizing the Web, short of full reasoning, is both an essential first step and will bring its own significant benefits.

The use of RDF is also not unnecessarily limiting compared to Cyc’s internal predicate calculus language. The subject–predicate–object “triple” of RDF and its reliance on first-order logic for deduced inferences can also be used to express propositions about nested contexts resulting in metalanguages for modal and higher-order logic. OWL itself is expressed as a metalanguage of RDF encoded in XML; and virtually all mathematics can be described through RDF.

Thus, Cyc, like DBPedia and related efforts to express Wikipedia as RDF, can itself be a rich source of structure for this evolving Web. Like other formats and frameworks, parts can be used in greater or lesser scope and complexity depending on circumstances. The sheer scope of Cyc’s world view in its knowledge base is a tremendous asset.

OpenCyc as a Useful Knowledge Base

The value of this asset increased enormously with the release of OpenCyc in early 2002. The OpenCyc knowledge base and APIs are available under the Apache License. There are presently about 100,000 users of this open source KB, which has the same ontology as the commercial Cyc, but is limited to about 1 million assertions. The most recent release is v. 1.02. An OWL version is also available. The Cyc Foundation has also been formed to help promote further extensions and development around OpenCyc.

There has been some hint on the OpenCyc mailing list of others interested in creating RDF versions of OpenCyc or portions thereof, with the similar benefits to RDF versions of Wikipedia via SPARQL endpoints and retrievals, mashups and RDF browsing. OpenCyc would also obviously be a tremendous boon to better inferencing for the datasets that are rapidly becoming exposed via RDF.

The idea of the local ontologies and reasoners that Lenat discusses — and the broader collaboration mechanisms emerging around other semantic Web issues and tools — bode well for a resurgence of interest in Cyc. After more than two decades, perhaps its time has truly come.

BTW, some interesting slides with pretty good overlap with Lenat’s Google talk can be found here from the Cyc Foundation.

Main Links

Search

Author: Mike Bergman