Posted:May 11, 2008

The Role of UMBEL: Stuck in the Middle with You . . .

Squeezed Between Two World Views at the Infocline

I remember one of my formative jobs at the American Public Power Association when we were running all of APPA’s technical activities. While mostly a lobbying outfit, we were after all in Washington, DC, and were perceived by many to be near some nexus of power and influence. And, because my group did technical stuff, we were a natural magnet for some inventors seeking an edge. And, believe me, some of those folks were real crackpots.

We’d hear claims how this person or that invented radar, or perpetual energy machines, or bendable concrete, and, once, even, cold fusion. Hehe.

In monitoring various ontology and semantic Web mailing lists, claims sometimes arise about the “ontology of everything” or the single, universal ontology that cures cancer or walks on one leg. It makes me smile and think about those past wild claims about radar or perpetual energy.

Some, I believe, when we mention the UMBEL lightweight subject concept reference structure, conjure up similar visions of a universal “ontology of everything”. That is wrong and not our intent. But, UMBEL is trying to straddle two different worlds and world views, and that can often lead to misunderstandings and misperceptions.

This posting is not the first and surely will not be the last on the subject, but it is worthwhile again to try to explain the role of UMBEL from these different angles and with slightly different analogies.

UMBEL is an Infocline

In prior posts, we have described UMBEL as a backbone, as a roadmap to related content, as a lightweight ontology or a lightweight reference structure, and as middleware. In this post, I am going to concentrate on its role as middleware, in its role as residing at the infocline between two different worlds and world views.

The Greek base -cline is often applied to gradual transition layers or changes in gradients or slope. A thermocline, for example, represents the layer between the deep and surface ocean. While there is mixing across this layer, it is slower than within the two parts that it separates. Both parts and the thermocline layer itself have quite different properties and temperatures, even though all are ocean and salty water.

The UMBEL infocline acts in a similar manner. On one side of the UMBEL layer is the Cyc knowledge base, with its self-contained, more-or-less closed world of higher order logics, microtheories regarding thousands of knowledge domains, rich predicates, and coherence. It is venerable, solid and proven, but with its own language and world view. Its purpose is also directed to reasoning and inference, driven from a foundation of (generally not codified outside of Cyc) common sense. It was designed well in advance of the creation of RDF or OWL, indeed in advance of the Internet and Web itself.

On the other side of the UMBEL infocline is the entire Web. This is a chaotic, decentralized, distributed knowledge environment representing untold numbers of world views. The specifications of the semantic Web and its languages and vocabularies have been designed expressly with these differences in mind and the means and structures to link and interrelate them. The Web environment — though not exactly incoherent — is also not in its ground state coherent. Indeed, it is the very purpose of existing semantic Web standards and UMBEL to help provide that coherence.

A key aspect of the Web is its “open world” assumption, defined in SKOS [1] as:

"RDF and OWL Full are designed for systems in which data may be widely distributed ( e.g., the Web). As such a system becomes larger, it becomes both impractical and virtually impossible to "know" where all of the data in the system is located. Therefore, one cannot generally assume that data obtained from such a system is "complete", i.e., if some data appears to be "missing", one has to assume, in general, that the data might exist somewhere else in the system. This assumption, roughly speaking, is known as the "open world" assumption."

What this means for the Web is that we must assume that the system’s knowledge is incomplete, and that if a statement cannot be inferred from what is explicitly expressed, we still cannot infer it to be false. Adding new information never falsifies a previous conclusion, and most of what we can know about the world will remain unknown. Cyc, on the other hand, can make closed-world assumptions under appropriate conditions.

UMBEL thus must act as a mediator, or middleware, in its role as the interface between these world views. It can lead to tension and turbulence when contemplating or transiting this infocline layer.

The Cyc Reference Knowledge Base

The central purpose of UMBEL is to provide a context for relating information. Once such a purpose for context is embraced, the natural next question is: And what shall be the basis for this context?

A previous post discussed why Cyc was chosen over alternatives as this contextual basis. Ultimately, the reasons for choosing Cyc come down to real practical tools and capabilities such as helping to disambiguate the identities of named entities, mapping ontologies and schema, doing natural language processing, and the sheer provenness of the concept relationships that are at the core of UMBEL.

(Also, as noted many times, others could just as reasonably chose other bases for providing context. The important point, again, is to provide some context over no context.)

We can view the Cyc knowledge base as a complete, albeit large, world unto itself. Like the Earth, it is complex and varied and self-contained. It has its own atmosphere and perspective on the broader universe:

But, like other planets or celestial bodies, Cyc is a world, not the world. There are many different possible worlds with different atmospheres and gravities and temperatures and compositions. And, of course, Cyc is not a physical world at all, but a conceptual “world” representing knowledge and its relationships. We will, however, represent it with the Earth image below.

There is 25 years and perhaps close to 300 person-years of development behind Cyc. It has thousands of able practitioners around the world and has been used in hundreds of meaningful projects and engagements. Since its release in 2002, there have been well in excess of 100,000 downloads of its open-source OpenCyc version.[2]

This legacy and history leads to distinct functional and terminology differences from current semantic Web perspectives. For example, the richness of Cyc predicates does not lead to simple mappings to existing OWL and RDFS properties. The notion of class is different than the closest analog in Cyc, the ‘collection‘. The concept and treatment of individuals and types is different. The 1000 or so microtheory domains in Cyc are not easily transferred or mapped to OWL constructs. Cyc uses reification aggressively to functionally combine concepts from constituent elements, such as “apple tree”. Higher-order logic is not transferable in all cases to the first-order logic (FOL) of the semantic Web. And so forth . . . .

Perhaps most importantly, however, is that Cyc has been designed, built and extended by professional ontologists and related researchers. This brings a degree of consistency and quality control that Web-broad initiatives can not hope to approach.

In our working with Cyc there has been nothing but good will and professionalism from the staff at Cycorp and the Cyc Foundation. But, there are clearly times when world view and terminology can differ, sometimes leading to translation problems and issues. Moreover, attempts to bridge from the Cyc world to the open world assumptions of the general Web means the translation is “lossy”, much like what happens in moving from a 16 million palette of 24-bit colors to something less.

Cleaning Cyc

Here are some statistics showing the relative size and scope of the ResearchCyc and OpenCyc versions of Cyc, current as of the last official distributions [3]:

Category	OpenCyc	ResearchCyc
Reifiied Terms (Constants and NARTs)	263,332	303,340
Assertions	2,040,330	2,964,161
Deductions	323,751	1,305,354
Unique Predicates (Properties) [2]	17 (OWL)	~16,000
Disk Storage (KBs) [4]	495,104	566,400

The sheer size and sophistication of either version is too great for easy comprehension and linkage by standard Web resources. Thus, the UMBEL project set out to determine and derive the most fundamental concepts from within OpenCyc. What was desired was a tractable set of subject concept “hub” nodes from within OpenCyc. A further design criterion was to maintain a 100% consistency with OpenCyc for this subset of subject concepts in order for UMBEL to preserve linkage into the Cyc knowledge base.

A subsequent post will relate in detail the nine-month (and continuing!) vetting and extraction process applied to Cyc, the result of which is currently the identification of about 21,000 subject concepts. These are schematically illustrated by the yellow dots on our Cyc Earth representation:

Of course, these yellow dots are not really physical locations on a globe. Rather, they represent important “hub” locations within the virtual Cyc knowledge “space”.

UMBEL: A Lightweight Skein

Once removed from the broader knowledge base, we now have a simple skein of these 21,000 subject concepts and their interrelations. We can show this lightweight structure as a ball of subject concept nodes (in red) connected to one another via their graph edges. We can represent this lightweight skein as follows, which has similarities to a hairnet with the nodes represented by the knots (in red) in the net:

This simplistic wireframe representation has been presented before for all 21,000 UMBEL nodes via the Cytoscape graph visualization software (see figure right; click for larger size).

The following table shows that the overall size and complexity of Cyc has been reduced by 1-2 orders of magnitude through this cleaning exercise, resulting in a lightweight UMBEL structure about 5-10% of the original size:

Category	OpenCyc	ResearchCyc	UMBEL
Terms or Concepts	263,332	303,340	21,057
Assertions	2,040,330	2,964,161	285,700
Deductions	323,751	1,305,354	—
Unique Predicates (Properties) [2,5]	17 (OWL)	~16,000	18
Disk Storage (KBs) [4]	495,104	566,400	14,445

Very striking is the predicate reduction, which is both a key source of “lossiness” and a challenge in maintaining a meaningful OWL and RDFS correspondence with the original Cyc. However, since the purpose of UMBEL is context and not reasoning or inference, this reduction is appropriate and understandable.

The ‘Hairnet Over the Basketball’

Metaphorically, we can now re-apply this UMBEL skein over the Cyc knowledge base. We have described this visual metaphor as the “hairnet over the basketball,” with UMBEL being the hairnet, and Cyc (Earth) the basketball:

Note that the UMBEL skein can act and be used fully independently from the underlying Cyc structure or not.

21,000 Docking Ports for an Open World

This UMBEL lightweight skein or wireframe structure now is ready to act as middleware, to play its role as an infocline. Each of UMBEL’s 21,000 subject concepts is, in effect, a “docking port” to which external Web data can “attach”. Once attached, this data can then be related to other Web data via the subject concept relationships in the UMBEL skein. This docking and attachment mechanism can be visualized as follows (click to enlarge):

If you mentally remove the Earth figure (Cyc) above, the UMBEL skein acts solely as a context reference structure for other Web data through its lightweight SKOS taxonomy structure (narrowerTransitive and broaderTransitive). These are the internal edge relationships of the wireframe structure with the red nodes above.

Though lightweight, this structure is surprisingly powerful in that it also enables tie-ins with external ontology classes — what Fred Giasson has called ‘exploding the domain‘ — and provides a reference context for Web data. Without these docking ports via UMBEL’s subject concepts, there is no contextual frame of reference and these Web data bits essentially tumble aimlessly in a dark knowledge space.[6]

But one need not stop at the infocline wireframe layer of UMBEL. Because each subject concept (“docking port”) has a direct correspondence to Cyc, we can dive more deeply into the Cyc knowledge environment. First through OpenCyc and then (via licensing or other arrangements) into ResearchCyc or the full Cyc, another dimension of tools and capabilities can become available. We now have backup and support to assess mappings and assignments and inferences and reasoning.

Will everyone want such capabilities? Most will not.

But it also surely does not hurt to have these value-added pathways so readily available for use and exploitation.

Some Context is Better than No Context at All

Some perhaps in the Cyc community may look at this picture and say, Whoa!: We’re giving Web denizens loaded Cyc guns via the UMBEL infocline to harm themselves and others.

Perhaps so. But this is also why we have courses on firearms safety and practice ranges for gaining the experience. Ontology mapping of any nature in an open world requires attention and skill to maintain quality.

The open world circumstances have already shown challenges with sameAs assignments and will certainly be exacerbated as we extend to class mappings in ontologies and inferencing. Quality and provenance will assert their prominence. Who do you trust and who is capable? But haven’t these always been operative questions?

Some perhaps in the broader Web community may go, Whoa! We are free and independent actors who hate any sniff of possible centralized Big Brother crap. Why UMBEL? Why Cyc? I want to free-form tag and twitter to my heart’s content.

OK, well, sure. But how can the Web of data meaningfully expand without reference points, structure and context? Though we may have foundational semantic Web standards in place, if we are going to meaningfully inter-relate data, we also need context and semantics.

UMBEL and Cyc offer one set of contexts, semantics and tools. Whether they are the best or not is a matter for the market to decide. But I think it will rapidly become clear that future Linked Data that is published without context will remain largely unused data. The question now going forward is not the rejection of context but deciding what contextual frameworks work better, are easy to implement, and are readily understood.

So, I think the game has changed and I’d like to believe for the better. UMBEL has placed a marker down — and it’s smack dab in the middle.

Yes I’m stuck in the middle with you,
And I’m wondering what it is I should do,
It’s so hard to keep this smile from my face,
And knowledge, yeah, is all over the place,
Cyc to the left of me, Open Web to the right,
Here I am, stuck in the middle with you. [7]

[1] W3C, SKOS Simple Knowledge Organization System Reference, W3C Working Draft, 25 January 2008; see http://www.w3.org/TR/2008/WD-skos-reference-20080125/#L881.

[2] See Priming the Pump and Threshold Conditions for the ResearchCyc estimate; the OpenCyc predicate count is from the OWL distribution version; the standard OpenCyc predicate count is not calculated.

[3] Obtained by running the CycL query, (kb-statistics), to a local instance of the distribution. The OpenCyc version is 5006; the ResearchCyc version 7117.

[4] Only the distribution size for the World model is reported; there are additional executables and supporting files not included.

[5] For UMBEL, 9 of the 18 are new properties, the rest are from existing OWL and RDFS vocabularies. These predicates include: type (RDF); subClassOf (RDFS); equivalentClass (OWL Full); language (DC, including all lingvoj instances); prefLabel, altLabel, definition, broaderTransitive, and narrowerTransitive (SKOS; also, there are some SKOS notes properties not listed); and the new properties of superClassOf, hasSemset, isAligned, withOverlap, linksConcept, isAbout, linksEntity, isLike, and withLikelihood (UMBEL). The forthcoming UMBEL technical documentation will explain this vocabulary in detail.

[6] While the current and common use of the sameAs relation provides linkage between instances or named entity identities in various datasets, this predicate does nothing to orient or provide frames of reference for the datasets themselves.

[7] Stuck in the Middle with You, with all due respect to Bob Dylan and Stealers Wheel.

Posted:May 6, 2008

The Semantics of Context

Many Kinds of RDF Links Can Provide Linked Data ‘Glue’

In a recent blog post, Kingsley Idehen picked up on the UMBEL project’s mantra of “context, Context, CONTEXT!” as contained in our recent slideshow. He likened context to the real estate phrase of “location, location, location”. Just so. I like Kingsley’s association because it reinforces the idea that context places concepts and things into some form of referential road map with respect to other things and concepts.

To me, context describes the relationships and environmental proximities of what UMBEL calls subject concepts and their instance sub-concepts and named entity members, the whole of which might be visualized as a graph of reference nodes in the firmament of a global knowledge space.

Indeed, it is this very ‘cloud’ of subject concept nodes that we tried to convey in an earlier piece on what UMBEL’s backbone structure of 21,000 subject concepts might look like, shown at right. (Of course, this visualization results from the combination of UMBEL’s OpenCyc contextual framework and specific modeling algorithms; the graph would vary considerably if based on other frameworks or models.)

Yet in a comment to Kingsley’s post, Giovanni Tummarello said, “If you believe in context so much then the linking open data idea goes bananas. Why? Because ‘sameAs’ is fundamentally wrong.. an entity on DBpedia IS NOT sameAs one on GeoNames because the context is different and bla bla… so it all crumbles.” [1]

Well, hmmm. I must beg to differ.

I suspect as we now are seeing Linked Data actually enter into practice, new implications and understandings are coming to the fore. And, as we try new approaches, we also sometimes suffer from the sheer difficulty of explicating those new understandings in the context of the shaky semantics of the semantic Web.

Giovanni’s comment raises two issues:

What the context or meaning of context is, and
The dominant RDF link ‘glue’ for Linked Data that has been used to date, the owl:sameAs predicate.

Therefore, since UMBEL is putting forth the argument for the importance of context in Linked Data, it is appropriate to be precise about the semantics of what is meant.

Context in Context

What is context? The tenth edition of Merriam-Websters Collegiate Dictionary (and the online version) defines it as:

context \ˈkän-ˌtekst\ n.;ME, weaving together of words, Latin contextus connection of words, coherence, from contexere to weave together, from com- + texere to weave ( ca.1586)

1: the parts of a discourse that surround a word or passage and can throw light on its meanings

2: the interrelated conditions in which something exists or occurs: environment, setting <the historical context of the war>.

Another online source I like for visualization purposes is Visuwords, which displays the accompanying graph relationships view based on WordNet.

Both of these references, of course, base their perspective on language and language relationships. But, both also provide the useful perspective that context also conveys the senses of environment, surroundings, interrelationships, connections and coherence.

Context has itself been a focus of much research from linguistics to philosophy and computer science. Each field has its specific take on the concept, but I believe it fair to say that context is consensually used as a holistic reference structure that tries to put all worlds and views, including that of the observer and observed, into a consistent framework. Indeed, when that framework and its assertions fit and make sense, we give that a word, too: coherent.

Hu Yijun [2], for example, intersects the interplay of language, semantics and the behavior and circumstances of human actors to frame context. Yijun observes that an invariably-applied research principle is that meaning is determined by context. Context refers to environmental conditions surrounding a discourse and its parts which are related with it, and provides the framework to interpret that discourse. There are world views, relationships and interrelationships, and assertions by human actors that combine to establish the context of those assertions and the means to interpret them.

In the concept of context, therefore, we see all of the components and building blocks of RDF itself. We have things or concepts (subjects or objects) that are related to one another (via properties or predicates) to form the basic assertions (triples). These are combined together and related in still more complex structures attempting to capture a world view or domain (ontology). These assertions have trust and credibility based on the actors (provenance) that make them.

In short, context is the essence of the semantic Web and Linked Data, not somehow in variance or conflict with it.

Without context, there is no meaning.

While one interpretation might be that the characteristics of one individual (say, Quebec City) might be oriented to latitude and longitude in a GeoNames source, while the characteristics of that individual may have a different context (say, population or municipal government) in the different DBpedia (Wikipedia) source, we need to be very careful of what is meant by context here. The identity of the individual (Quebec City) remains the same in both sources. The context does not change the individual nor its identity, only the nature of the characteristics used to provide different coherent information about it.

Not the Same Old sameAs

With the growth in Linked Data, we are starting to hear the rumblings around possible misuse and misapplication of the sameAs predicate [3]. Frankly, this is good, because I share the view there has been some confusion regarding the predicate and misapplications given its semantics.

The built-in OWL property owl:sameAs links an individual to an individual [4]. Such an owl:sameAs statement indicates that two URI references actually refer to the same thing: the individuals have the same “identity”.

A link is a predicate is an assertion. It by nature ties (“glues”) two resources to one another. Such an assertion can either: (1) be helpful and “correct”; (2) be made incorrectly; (3) assert the wrong or perhaps semantically poor relationship; or (4) be used maliciously or to deceive.

(Unlike email spam, #4 above has not occurred anywhere to my knowledge for Linked Data. Unfortunately, and most sadly, deceitful links will occur at some point, however. This inevitability is a contingency the community must be cognizant of as it moves forward.)

To date, almost all inter-source Linked Data links have occurred via owl:sameAs. If we liken this situation to early child language acquisition, it is like we only have one verb to describe the world. And because our vocabulary is relatively spare, we have tended to apply sameAs to situations and relations that, comparatively, have a bit of semblance to baby-talk.

So long as we have high confidence two disparate sources are referring to the same individual with the same identity, sameAs is the semantically correct RDF link. In all other cases, the use of this predicate should be suspect.

Simple string or label matches are insufficient to make a sameAs assertion. If sameAs can not be confidently asserted, as might be the case where the relation of individual referents is perhaps likely but uncertain, we need to invoke new predicates or make no assertion at all. And, if the resources at hand are not individuals at all but classes, the need for new semantics increases still further.

As we increase the size of the Linked Data ‘cloud’ or show rapid growth in Linked Data, we should be aware that quality, not size, may be the most important metric powering acceptance. The community has made unbelievable progress in finally putting real data behind the semantic Web promise. The challenge now is to add to our vocabulary and ensure quality assertions for the linkages we publish.

Many Predicates Can Richen the RDF Link ‘Glue’

One of UMBEL’s purposes, for example, is to broaden our relations to the class level of subject concepts. As we move beyond the early days of FOAF and other early vocabularies, we will see further richening of our predicates. We also need predicates and predicate language that reflects the open-world nature [5] of public Linked Data and the semantic Web.

So, while sameAs helps us aggregate related information about the same identifiable individual, the predicates of class relations in context to other classes helps to put all information into context. And, if done right — that is, if the semantics and assertions are relatively correct — these desired contextual relations and interlinkages can blossom.

The new predicates forthcoming from the UMBEL project, to be published with technical documentation this month, and related to these purposes will include:

isAligned — the predicate for aligning external ontology classes to UMBEL subject concepts
isAbout — the predicate for relating individuals and instances to their contextual subject concepts, and
isLikely — the predicate for likely relations between the same identifiable individual, but where there is some ambiguity or uncertainty short of a sameAs assertion.

Assertions such as these that are open to ambiguity or uncertainty, while appropriate for much of the open-world nature of the semantic Web, may also be difficult predicates for the community to achieve consensus. Like our early experience with sameAs, these predicates — or others that can just as easily arise in their stead — will certainly prove subject to some growing pains. 🙂

Any Context is Better than No Context at All

Most people active in the semantic Web and Linked Data communities believe a decentralized Web environment leads to innovation and initiative. Open software, standards activities, and vigorous community participation affirm these beliefs daily.

The idea of context and global frames of reference, such as represented by UMBEL or perhaps any contextual ontology, could appear to be at odds with those ideals of decentralization. But one paradox is that without context, the basis for RDF linkages is made much poorer and therefore the potential for the benefits (and thus adoption) of Linked Data lessen.

The object lesson should therefore not be a rejection of context. Indeed, any context is better than no context at all.

Of course, whether that context gets provided by UMBEL or by some other framework(s) remains to be seen. This is for the market to decide. But the ability of contextual frameworks to richen our semantics should be clear.

The past year with the growth and acceptance of Linked Data have affirmed that the mechanisms for linking and relating data are now largely in place. We have a simple, yet powerful and extensible data model in RDF. We have beginning vocabularies and constructs for conducting the data discourse. We have means for moving legacy data and information into this promising new environment.

Context and Linked Data are not in any way at odds, nor are context and sameAs. Indeed, context itself is an essential framework for how we can orient and grow our semantics. Human language required its referents in the real world in order to grow and blossom. Context is just as essential to derive and grow the semantics and meaning of the semantic Web.

The early innovators of the Linked Open Data community are the very individuals best placed to continue this innovation. Let’s accept sameAs for what it is — one kind of link in a growing menagerie of RDF link predicates — and get on with the mission of putting our enterprise in context. I think we’ll find our data has a lot more meaningfully to say — and with more coherence.

[1] See the original post for the full comment; shown with some minor re-formatting.

[2] Hu Yijun, 2006. On the Essence of Discourse: Context Coherence, see http://www.paper.edu.cn/en/downloadpaper.php?serial_number=200606-221&type=1.

[3] For example, two papers presented at the the Linked Data on the Web (LDOW2008) Workshop at WWW2008, April 22, 2008, Beijing, China, two weeks ago highlight this issue. In the first, Bouquet et al. (Paolo Bouquet, Heiko Stoermer, Daniele Cordioli and Giovanni Tummarello, 2008. An Entity Name System for Linking Semantic Web Data, paper presented at LDOW2008, see http://events.linkeddata.org/ldow2008/papers/23-bouquet-stoermer-entity-name-system.pdf) state, “In fact it appears that the current use of owl:sameAs in the Linked Data community is not entirely in line with this definition and uses owl:sameAs more like a Semantic Web substitute forx [sp] a hyperlink instead of realizing the full logical consequences.”

Also Jaffri et al. (Afraz Jaffri, Hugh Glaser and Ian Millard, 2008. URI Disambiguation in the Context of Linked Data, paper presented at LDOW2008, see http://events.linkeddata.org/ldow2008/papers/19-jaffri-glaser-uri-disambiguation.pdf) note that misuses of sameAs may result in a large percentage of entities being improperly conflated, entity references may be incorrect, and there is potential error propagation when mislabeled sameAs are then applied to new instances. As the authors state, “This will have a major impact on the Semantic Web when such repositories are used as data sources without any attempt to manage the inconsistencies or ‘clean’ the data.” The authors also pose thoughtful mechanisms for addressing these issues built on the hard-earned co-referencing experience gained within the information extraction (IE) community.

These observations are not in any way meant to be critical or alarmist. They simply point to the need for quality control and accurate semantics when asserting relationships. These growing pains are a natural circumstance of rapid growth.

[4] W3C, OWL Web Ontology Language Reference, W3C Recommendation, 10 February 2004. See http://www.w3.org/TR/owl-ref/. Note that in OWL Full sameAs can also be applied to classes, but that is a special case, not applicable to how Linked Data has been practiced to date, and is not further discussed here.

[5] The “open world” assumption is defined in the SKOS ontology reference documentation (W3C, SKOS Simple Knowledge Organization System Reference, W3C Working Draft, 25 January 2008; see http://www.w3.org/TR/2008/WD-skos-reference-20080125/#L881 ) as:

“RDF and OWL Full are designed for systems in which data may be widely distributed ( e.g., the Web). As such a system becomes larger, it becomes both impractical and virtually impossible to “know” where all of the data in the system is located. Therefore, one cannot generally assume that data obtained from such a system is “complete”, i.e., if some data appears to be “missing”, one has to assume, in general, that the data might exist somewhere else in the system. This assumption, roughly speaking, is known as the “open world” assumption.”

Posted:April 27, 2008

Slideshow on UMBEL Web Services

“UMBEL’s Eleven,” overviews the project’s first 11 semantic Web services and online demos. The brief slideshow has been posted to Slideshare:

| View

UMBEL (Upper-level Mapping and Binding Exchange Layer) is a lightweight reference structure for placing Web content, named entities and data in context with other data. It is comprised of about 21,000 subject concepts and their relationships — with one another and with external vocabularies and named entities.

Recent postings by Fred Giasson and by me discussed these Web services in a bit more detail.

Posted:April 20, 2008

Announcing the UMBEL Web Services Sandbox

There’s Some Cool Tools in this Box of Crackerjacks

UMBEL is today releasing a new sandbox for its first iteration of Web services. The site is being hosted by Zitgist. All are welcomed to visit and play.

And, UMBEL is What, Again?

UMBEL (Upper-level Mapping and Binding Exchange Layer) is a lightweight reference structure for placing Web content and data in context with other data. It is comprised of about 21,000 subject concepts and their relationships — with one another and with external vocabularies and named entities.

Each UMBEL subject concept represents a defined reference point for asserting what a given chunk of content is about. These fixed hubs enable similar content to be aggregated and then placed into context with other content. These subject context hubs also provide the aggregation points for tying in their class members, the named entities which are the people, places, events, and other specific things of the world.

The backbone to UMBEL is the relationships amongst these subject concepts. It is this backbone that provides the contextual graph for inter-relating content. UMBEL’s subject concepts and their relationships are derived from the OpenCyc version of the Cyc knowledge base.

The UMBEL ontology is based on RDF and written in the RDF Schema vocabulary of SKOS (Simple Knowledge Organization System) with some OWL Full constructs to aid interoperability.

UMBEL’s backbone is also a reference structure for more specific domains or ontologies, thereby enabling further context for inter-relating additional content. Much of the sandbox shows these external relationships.

UMBEL’s Eleven

These first set of Web services provide online demo sandboxes, and descriptions of what they are about and their API documentation. The first 11 services are:

Find Subject Concepts — look up via name, URI or alias one of the 21,000 UMBEL subject concepts
Subject Concept Report — basic information about the subject concept; description, canonical name, and the like
Subject Concept Detailed Report — very cool! see below
List Sub-Concepts & Sub-Classes — for developers and ontology geeks
List Super-Concepts & Super-Classes — for developers and ontology geeks
List Equivalent External Classes — for developers and ontology geeks
Verify Sub-Class Relationship — for developers and ontology geeks
Verify Super-Class Relationship — for developers and ontology geeks
Verify Equivalent Class Relationship — for developers and ontology geeks
Subject Concepts Explorer — good golly, Miss Molly, we have visuals!
Yago Ontology — a little help from our friends.

A CLASSy Detailed Report

The single service that provides the best insight to what UMBEL is all about is the Subject Concept Detailed Report. (That is probably because this service is itself an amalgam of some of the others.)

Starting from a single concept amongst the 21,000, in this case ‘Mammal’, we can get descriptions or definitions (the proper basis for making semantic relationships, not the ‘Mammal’ label), aliases and semsets, equivalent classes (in OWL terms), named entities (for leaf concepts), more general or specific external classes, and domain and range relationships with other ontologies. Here is the sample report for ‘Mammal’:

The discerning eye likely observes that while there are a rich set of relationships to the internal UMBEL subject concepts, coverage is still light for external classes and named entities. This sandbox is, after all, a first release and we are early in the mapping process. 🙂

But, it should also start to become clear that the ability of this structure to map and tie in all forms of external concepts and class structures is phenomenal. Once such class relationships are mapped (to date, most other Linked Data only occurs at the instance level), all external relationships and properties can be inherited as well. And, vice versa.

So, for aficionados of the network effect, stand back! You ain’t seen nothing yet. If we have seen amazing emergent properties arising from the people and documents on the Web, with data we move to another quantum level, like moving from organisms to cells. The leverage of such concept and class structures to provide coherence to atomic data is literally primed to explode.

Bloomin’ Concepts!

To put it mildly, trying to get one’s mind around the idea of 21,000 concepts and all of their relationships and all of their possible tie in points and mappings to still further ontologies and all of their interactions with named entities and all of their various levels of aggregation or abstraction and all of their possible translations into other languages or all of their contextual descriptions or all of their aliases or synonyms or all of their clusterings or all of their spatial relationships or all of the still more detailed relationships and instances in specific domains or, well, whew! You get the idea.

It is all pretty complex and hard to grasp.

One great way to wrap one’s mind around such scope is through interactive visualization. The first UMBEL service to provide this type of view is the Subject Concept Explorer, a screenshot of which is shown here:

But really, to gain the true feel, go to the service and explore for yourself. It feels like snorkeling through those schools of billions of tiny silver fish. Very cool!

These amazing visualizations are being brought to us by Moritz Stefaner, imho one of the best visualization and Flash gurus around. We will be showcasing more about Moritz’s unbelievable work in some forthcoming posts, where some even cooler goodies will be on display. His work is also on display at a couple of other sites that you can spend hours drooling over. Thanks, Moritz!

Missing Endpoints and Next Steps

You should note that developer access to the actual endpoints and external exposure of the subject concepts as Linked Data are not yet available. The endpoints, Linked Data and further technical documentation will be forthcoming shortly.

The currently displayed services and demos provided on this UMBEL Web services site are a sandbox for where the project is going. Next releases will soon provide as open source under attribution license:

The formal UMBEL ontology written in OWL Full and SKOS
Technical documentation for the ontology and its use and extension
Freely accessible Web services according to the documentation already provided
Technical documentation and reports for the derivation of the subject concepts from OpenCyc and the creation and extension of semsets and named entities related to that structure.

When we hit full stride, we expect to be releasing still further new Web services on a frequent basis.

BTW, for more technical details on this current release, see Fred Giasson’s accompanying post. Fred is the magician who has brought much of this forward.

Posted:April 13, 2008

Semantic Web Semantics: Arcane, but Important

A New Entrant into the Lion’s Den

Exactly one month ago I wrote in The Shaky Semantics of the Semantic Web, “The time is now and sorely needed to get the issues of representation, resources and reference cleaned up once and for all.”

The piece was prompted by growing rumblings on semantic Web mailing lists and elsewhere about semantic Web terminology, plus concerns that lack of clarity was opening the door for re-branding or appropriating the semantic Web ‘space.’ I observed these issues were “complex and vexing boils just ready to erupt through the surface.”

My own post was little noticed but the essential observations, I think, were correct. In the past month the rumblings have become a distinct growl and aspects of the debate are now coming into direct focus. I think the perspective has (thankfully) shifted from wanting to not re-open the so-called arcanely named “httpRange-14” debate (a less technical explanation is in Wikipedia on its role in bringing the concept of “information resource” to the Web) of three years past to perhaps finally lancing the boil.

A Determined Protagonist

Many of us monitor multiple mailing lists; they seem to have their own ebb and flow, most often quiet, but sometimes rat-a-tat-tat furious. In and of itself, it is fascinating to see which topics and threads catch fire while others remain fallow.

One mail list that I monitor is the W3C‘s Technical Architecture Group, in essence the key deliberation body for technical aspects of the Web. Key authors of the Web such as Tim Berners-Lee, Roy Fielding and many, many others of stature and knowledge either are on the TAG or participate in its deliberations. The TAG’s public mailing list is immensely helpful to learn about technical aspects of the Web and to get a bit of early warning regarding upcoming issues. The W3C and its TAG are exemplars of open community process and governance in the Internet era.

I assume many hundreds monitor the TAG list; most, like me, comment rarely or not at all. The matters can indeed be quite technical and there is much history and well-thought rationale behind the architecture of the Web.

Xiaoshu Wang has recently been a quite active participant. English is not Xiaoshu’s native language, but because of his passion he has nonetheless been a determined protagonist to probe the basis and rationale behind the use of resources, representations and descriptions on the Web. These are difficult concepts under the best of circumstances, made all the harder due to language differences and special technical senses that have been adopted by the TAG in its prior deliberations.

These concerns were first and most formally expressed in a technical report, URI Identity and Web Architecture Revisited, by Xiaoshu and colleagues in November 2007.

My layman’s explanation of Xiaoshu’s concerns is that the earlier httpRange-14 decision to establish a technical category of “information resources” begs and leaves open the question of the inverse — what has been called a “non-information resource” — and actually violates prior semantics and understandings of what should be better understood as representations.

A Respected Interlocutor

This discussion arose in relation to the Uniform Access to Descriptions [1], a thread begun by Jonathan Rees of the TAG to assemble use cases related to HttpRedirections-57, a proposal to standardize the description of URI things, such as documents, by rejuvenating the link header. Because of its topic, discussion of httpRange-14 was discouraged since putatively the core definition of “information resource” was not at issue.

However, after introduction of a most interesting pre-print, In Defense of Ambiguity [2], co-author Harry Halpin perhaps inadvertenly opened the door to the httpRange-14 question again. Then, Xiaoshu began submitting and commenting in earnest, and Stuart Williams of the TAG, in particular, was helpful and patient to help draw out and articulate the points.

My observation is that Xiaoshu was never advocating a change in the basic or current architecture of the Web, but perhaps that was not apparent or readily clear. Again, the frailty of human communications compounded by language and perspective have been much in evidence.

Pat Hayes, the editor of the excellent RDF Semantics W3C recommendation, then intervened as interlocuter for Xiaoshu’s basic positions. Many, many others, notably including Berners-Lee and Fielding, have also joined the fray. The entire thread [4] is worth reading and study.

Since Xiaoshu has publicly endorsed Hayes’ interpretation, here are some important snippets from Pat’s articulation [3]:

The central point is that now that we have the technology and ideas of the semantic web available, we have a wider range of ways of representing, and a richer notion of what words like ”metadata” mean. If we are willing to take fuller advantage of this new richness, we make available new ways to do semantic things within the same overall design of the pre-semantic web.

. . .

There simply is no other word [than ‘represents’] that will do. And the size, history and, I’m sorry, but scholarly and intellectual authority of the community which uses a wider sense of ‘represent’ so greatly exceeds the AWWW [W3C Web] community that I don’t think you can reasonably claim possession of such a basic and central term for such a very narrow, arcane and special (and, by the way, under-defined) sense.

. . .

If AWWW had used a technical word in a new technical way, then this would likely have been harmless. Mathematics re-used ‘field’ without getting confused with agriculture. But the AWWW/semantics clash over the meaning of ‘represent’ is harmful because the senses are not independent: the AWWW usage is a (very) special case of the original meaning, so it is inherently ambiguous every time it is used; and, still worse, we need the broader meaning in these very discussions, because the TAG has decreed that URIs can denote anything: so we are here discussing semantics in a broad sense whether we like it or not. And if the word ‘represent’ is to be co-opted to be used only in one very narrow sense, then we have no word left for the ordinary semantic sense. To adopt a usage like this is almost pathological in the way it is likely to generate confusion (as it already has, and continues to do so, in spades.)

. . .

The way we name Web pages is a special case of this picture, where the ‘storyteller’ is the same thing as the resource. Things that can be their own storytellers fit nicely within current AWWW, with its official understanding of words like ‘represent’. (In fact, capable of being ones own storyteller might be a way to define ‘information resource’.) But the nice thing about this picture [as presented by Xiaoshu] is that other kinds of resource, which do not fit at all within the AWWW – things that aren’t documents, ‘non-information resources’ – also fit within it; still, ironically, using the AWWW language, but with a semantic rather than AWWW sense of ‘represent’.

Right now, the semantic web really does not have a coherent story to tell about how it works with non-information resources, other than it should use RDF (plus whatever is sitting on it in higher levels) to describe them; which says nothing, since RDF can describe anything. URIs in RDF are just names, their Web role as http entities semantically irrelevant. Http-range-14 connects access and denotation for document-ish things, but for other things we have no account of how they should or should not be related, or what anything a URI might access via http has got to do with what it denotes.

The way that the three participants (denoted-thing, URI-name and Web-information-resource ‘storyteller’) interact must be basically different when the denoted-thing isn’t an information resource from when it is. All that being suggested here is that there is an account that we could give about this, one that works in both cases and which fits the language of AWWW quite, er, nicely.

. . .

A person exists and has properties entirely separate from the Web. Many people have nothing to do with the Web in their entire lives. People are not Web objects. And when the URI is being used in an RDF graph to refer to a person, the fact that it starts with http: is nothing more than a lexical accident, which has no bearing whatever on the role of the URI as a name denoting a person.

. . .

I think this particular shoe is on the other foot. If you can actually say, clearly enough to prevent continual trails of endless email debate, what AWWW actually means by ‘represent’, then I’d be delighted if you would use some technical word to refer to that elusive notion. But the word ‘represent’ and its cognates has been a technical word in far larger and more precisely stated forums for over a century; and since the day that Web science has included the semantic web, AWWW has taken an irrevocable step into the same academy. You are using the language of semantics now. If you want to be understood, you have to learn to use it correctly.

. . .

All it would do is move the responsibility of deciding what a URI denotes from a rather messy and widely ill-understood distinction based on http codes, to a matter of content negotiation. This would allow phenomena which violate http-range-14, but it would by no means insist on such violations in all cases. In fact, if we were to agree on some simple protocols for content negotiation which themselves referred to http codes, it could provide a uniform mechanism for implementing the http-range decision.

. . .

Moreover, this approach would put ‘information resources’ on exactly the same footing as all other things in the matter of how to choose representations of them for various purposes, a uniformity which means little at present but is likely to increase in value in the future.

. . .

But right now, for the case where a URI is understood to denote something other than an information resource, we have a completely blank slate. There is nothing which tells our software how to interoperate in this case. Our situation is not a kind of paradise of reference-determination from which Xiaoshu and I are threatening to have everyone banished. Right now for the semantic web, things are about as bad as they can get.

. . .

. . . we, as a society, can use [the conventions we decide] for whatever we decide and find convenient. The Web and the Internet are replete with mechanisms which are being used for purposes not intended by their original designers, and which are alien to their original purpose. For a pertinent example, the http-range-14 decision uses http codes in this way. That isn’t what http codes are for.

I have repeated much of this material because I believe it to be of wide import to the semantic Web’s development and future. Obviously, for better understanding, the full thread [4] plus its generous sprinkling of excellent prior documents and discussions is most recommended.

My Take

There are certainly technical aspects to this debate that go well beyond my ken. I strongly suspect there are edge cases for which more complicated technical guidance is warranted.

And, it is true, I have been selective in which sides of this debate I am highlighting and therefore supporting. This is not accidental.

While some in this debate have claimed the need to conform to existing doctrine in order to ensure interoperability or the integrity of software systems, from my different perspective as someone desiring to help build a market by extending reach into the broader public, that argument is false. Let’s take the existing architecture we have, but make our best practices recipes simple, our language clear, and our semantics correct. How can we really promote and grow the semantic Web when our own semantics are so patently challenged?

Our community faces a challenge of poor terminology and muddled concepts (or, perhaps more precisely, concepts defined in relation to the semantic Web that are not in conformance with standard understandings). My strong suspicion is that we risk at present over-specification and just plain confusion in the broader public.

This mailing list debate is hugely important, informative and thought provoking. Xiaoshu deserves thanks for his courage and tenacity in engaging this debate in a non-native language; Pat Hayes deserves thanks for trying to capture the arguments in language and terminology more easily understandable to the rest of us and to add his own considerable experience to the debate, and many of the mail list regulars deserve sincere thanks for being patient and engaged to allow the nuances of these arguments to unfold.

From my standpoint there is real pragmatic value to these arguments that would bring the terminology and semantics of the semantic Web into better understood and more easily communicated usage, all without affecting or changing the underlying architecture of the Web. (Or, so, to my naÃ¯ve viewpoint, the argument seems to suggest.)

So long as the semantic Web’s practitioners still number in the hundreds, and those with nuanced understanding of these arcane matters likely only in the scores, the time is ripe to get the language and concepts right. Doing so can help our enterprise reach millions and much more quickly.

[1] See this beginning thread, http://lists.w3.org/Archives/Public/www-tag/2008Mar/0024.html, and its supporting refererences: http://lists.w3.org/Archives/Public/www-tag/2008Feb/0013.html; http://www.w3.org/Mail/Request; http://esw.w3.org/topic/FindingResourceDescriptions; http://esw.w3.org/topic/LinkHeader

[2] Patrick J. Hayes and Harry Halpin, 2008. “In Defense of Ambiguity,” preprint for the International Journal on Semantic Web and Information Systems 4(3), to appear later this year. See http://www.ibiblio.org/hhalpin/homepage/publications/indefenseofambiguity.html.

[3] Extracts from these specific posts: http://lists.w3.org/Archives/Public/www-tag/2008Apr/0139.html; http://lists.w3.org/Archives/Public/www-tag/2008Apr/0166.html; and http://lists.w3.org/Archives/Public/www-tag/2008Apr/0173.html.

[4] See http://lists.w3.org/Archives/Public/www-tag/2008Apr/thread.html, and then select ‘Uniform access to descriptions ‘.

Main Links

Search

Author: Mike Bergman

Posted:May 11, 2008

The Role of UMBEL: Stuck in the Middle with You . . .

Squeezed Between Two World Views at the Infocline

UMBEL is an Infocline

The Cyc Reference Knowledge Base

Cleaning Cyc

UMBEL: A Lightweight Skein

The ‘Hairnet Over the Basketball’

21,000 Docking Ports for an Open World

Some Context is Better than No Context at All

Posted:May 6, 2008

The Semantics of Context

Many Kinds of RDF Links Can Provide Linked Data ‘Glue’

Context in Context

Not the Same Old sameAs

Many Predicates Can Richen the RDF Link ‘Glue’

Any Context is Better than No Context at All

Posted:April 27, 2008

Slideshow on UMBEL Web Services

Posted:April 20, 2008

Announcing the UMBEL Web Services Sandbox

There’s Some Cool Tools in this Box of Crackerjacks

And, UMBEL is What, Again?

UMBEL’s Eleven

A CLASSy Detailed Report

Bloomin’ Concepts!

Missing Endpoints and Next Steps

Posted:April 13, 2008

Semantic Web Semantics: Arcane, but Important

A New Entrant into the Lion’s Den

A Determined Protagonist

A Respected Interlocutor

My Take