Posted: September 24, 2008

Fitting the Pieces

Coherence is Needed for Continued Sustainability

Since early in 2008 my colleague, Fred Giasson, has been authoring a series of important blog posts on ‘exploding the domain.’ Exploding the domain means to make class-to-class mappings between a reference ontology and external ontologies, which allows properties, domains and ranges of applicability to be inherited under appropriate circumstances. Exploding the domain expands inferencing power to this newly mapped information.

Fred first used the phrase in an April post that introduced the concept:

“. . . once the linkages between UMBEL subject concepts and external ontologies classes are made, the following becomes possible: 1) the UMBEL subject concept structure can be used to describe (instantiate) things using the UMBEL data structure; 2) external ontology properties can be re-used to describe these new instances since external ontologies classes are linked to UMBEL subject concept classes; and 3) in some cases, the properties defined in these ontologies can be used in relation with UMBEL subject concept classes.”

Since that time, Fred has continued to explore these implications. In an August posting looking at UMBEL as a possible reference framework for mapping and exploding domains, Fred stated,

“. . . once we have the context in place, we are on our way to achieve coherence. UMBEL is 100% based on OpenCyc and Cyc, which are internally consistent and coherent within themselves. We thus use these coherent frameworks to make the mappings to external ontologies coherent, too.

The equation is simple:

a coherent framework + ontologies contextualized by this framework = more coherent ontologies.”

The thrust of this analysis was to show how UMBEL subject concepts can act to create context (his emphasis) for linked classes defined in external ontologies. Where the individuals in a dataset are instances of classes, and some of these classes are linked to UMBEL (or a similar contextual structure), these subject concept classes also give context to those individuals.

Finally, and most recently, Fred demonstrated how the use of UMBEL could explode DBpedia’s domain by linking classes using only three properties: rdfs:subClassOf, owl:equivalentClass and umbel:isAligned. (And, as I noted in an earlier posting this week, those mappings have now been made bi-directional from DBpedia to UMBEL as well.)
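To make the idea concrete, here is a minimal Python sketch of what class-level links allow. It follows subclass and equivalence links from an external class to reference classes, then lists the properties whose declared domain is reached. All class and property names are hypothetical, umbel:isAligned is left out because its semantics are not spelled out here, and treating a property as "applicable" when its domain is implied is a simplification of rdfs:domain.

```python
# Hypothetical mappings and property domains; none of these names come from
# UMBEL, DBpedia or any real ontology.
SUBCLASS_OF = {
    "ext:SportsCar": {"ref:Automobile"},
    "ref:Automobile": {"ref:RoadVehicle"},
}
EQUIVALENT = [("ext:Car", "ref:Automobile")]
PROPERTY_DOMAIN = {
    "ext2:engineType": "ref:Automobile",
    "ext2:wheelCount": "ref:RoadVehicle",
    "ext2:author": "ref:Book",
}


def implied_classes(start):
    """Collect every class reachable from `start` over subclass and equivalence links."""
    links = {cls: set(parents) for cls, parents in SUBCLASS_OF.items()}
    for a, b in EQUIVALENT:
        # Equivalence implies membership in both directions.
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    seen, stack = {start}, [start]
    while stack:
        for nxt in links.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def applicable_properties(cls):
    """Properties whose declared domain is implied by membership in `cls`."""
    classes = implied_classes(cls)
    return sorted(p for p, domain in PROPERTY_DOMAIN.items() if domain in classes)


print(applicable_properties("ext:Car"))
```

In this toy data, the single equivalence link on ext:Car is what makes properties declared against the reference classes reachable from the external class.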

As we discuss and apply these concepts we are starting to see some further guidelines emerge. Presenting these is the purpose of this post in this ongoing exploding the domain series.

The Mere Existence of Classes is Not Enough

Since its inception, DBpedia has had a class structure of sorts, first from the native Wikipedia categories from which it was derived and then with the incorporation of the YAGO structure (based on WordNet concept relationships). Yet we have claimed that class structure has truly only recently been brought to DBpedia with the mappings to UMBEL. Why? Does not DBpedia’s initial class structure meet the test?

(BTW, these same questions may be applied to some of the other large data structures beginning to emerge on the semantic Web such as Freebase. But, those are stories for another day.)

There are really two answers to these questions. First, the mere existence of classes is not enough; they must actually be used. And, second, the nature of those classes and their structure and coherence are absolutely fundamental. This subsection addresses the first point and the following subsection the second; both are aided by the table below.

There has been a class structure within DBpedia from inception, which was then supplemented a few months after release with the YAGO structure. The starting Wikipedia structure showed early issues which began to be addressed through a cleaned Wikipedia category class (CWCC) hierarchy. These relationships were established with the rdf:type predicate that relates an instance to a class. The classes themselves were related to one another through the rdfs:subClassOf predicate. These class relationships allowed the linked classes to be shown in a hierarchical (taxonomy-like) structure.

Initially, in the case of the beginning Wikipedia categories, the internal class relationships were weak. This was somewhat improved with the addition of YAGO and its WordNet-based concept relationships (with better semantics).

However, these class relationships were (to my knowledge) never mapped to any external structures or ontologies. If used at all, they were only implied for ad hoc navigation within the internal instance data.

Really, anyone could have approached DBpedia at this point and begun an effort of mapping its existing class structures to external data. Indeed, we (as editors of UMBEL) considered doing so, but chose a different structure (see next section) for reasons of context and coherence.

The net result is that DBpedia and the other instances of the linking open data (LOD) cloud have remained focused and useful at the instance level, and not yet at the class level.

As we brought in UMBEL to provide a class structure to DBpedia and linked data, this circumstance began to change, as this table indicates:

| | YAGO | UMBEL |
|---|---|---|
| Predicate Differences | subClassOf | subClassOf; equivalentClass; superClassOf; isAligned; and entity-to-class predicates |
| Mapping/Application Differences | No external mappings made | Aggressive use of external mappings (‘exploding the domain’); consistent internal structure |
| Structural Differences | Based on WordNet concept relationships | Based on Cyc common sense structural relationships; Cyc inferencing and reasoning tools for testing coherence; microtheories framework for domain differences; extendable structure |
| Unique Class Count | ~ 55,000 | ~ 20,000 |

Though shown for comparison reasons, the number of classes probably has no real importance.

The key argument in this subsection is that classes matter. Indeed, one of the current challenges before the linked data community is to understand the issues of instances as distinct from those of classes, and to treat them differently. But, the question of whether one class structure is better than another is moot if class mappings are neglected altogether.

Sustainability Requires Coherence

UMBEL’s reasons for not taking up the Wikipedia structure or the WordNet structure (that is, the initial structures within DBpedia) for its lightweight subject ontology were based on lack of coherence. I have spoken earlier about these arguments in When is Content Coherent? Other analyses support the conclusions in various ways [1].

A central (or “upper”) reference framework should be one that is solid and venerable. Over time, many subsidiary ontologies and structures will relate to it. Like a steel superstructure to a skyscraper or a structural framework to a large ocean-going tanker, this beginning structure needs to withstand many stresses and maintain its integrity as various subsidiary structures hang from it.

So long as we are still in “toy” mode with relatively few external mappings and relatively few ontologies, simple class-to-class mappings without respect to the coherence of the underlying ontologies may be OK. But, we will soon (if not already) see that structural flaws, like slight perturbations at the Big Bang, may propagate to create huge discontinuities.

At the pace of development we are now seeing, there will be tens to hundreds of thousands of ontologies within the foreseeable future. Granted, for any given circumstance, only a minor few of those may be applicable. But the potential combinations still can defy imagination in terms of complexity and potential scope at widely varying scales.

At the scale of the Web, of course, there will never be a central authority (nor should there be) for “official” vocabularies or structures. Yet, just as certainly, those ontologies and structures that do share some conceptual and structural coherence and are therefore more likely to easily integrate and interoperate will (I believe) win the Darwinian race. Without some degree of coherence, these disparate structures become like ill-fitting jigsaw pieces from different puzzles.

As we watch structures and relationships accrete like layers in a pearl, we should begin with a solid grain of common sense and coherence. That is why we chose the Cyc structure as the basis of UMBEL — it provides one such solid, coherent framework for moving forward.

I am not sanguine that ad hoc, free-form ontological structures, created in the same manner as topic-specific articles in Wikipedia or as informal tags in bookmarking systems, will bring such coherence. But who knows? Perhaps on the Web where novelty and the joy of creation and exploration can trump usefulness, such could transpire.

But, when we look to linked data and semantic Web constructs finally achieving their potential in the enterprise to overcome decades of data silos, I suspect purposeful coherence will win the day.

Sustainable ontologies, which themselves can host and interoperate with still further ontologies and structures, will require coherent underpinnings to not collapse from the weight of keeping consistency. Just as our current highways and interstates follow the earlier roads before them, as early trailblazers we have a responsibility to follow the natural contours of our applicable information spaces. And that requires coherence and consistency; in other words, logic and design.

Vocabularies are Not Free Form

In the past few months there has been a remarkable emergence of interest in vocabularies and semantics (as traditionally understood by linguists). Today, for example, marks the kick-off of the first VoCamp get-together in Oxford, England, with interest and discussion active about potentially many others to follow. Peter Mika, Matthias Samwald and Tom Heath have each outlined their desires for this meeting.

I hope the participants in this meeting and others to follow look seriously at the issues of coherence, interoperability and sustainability. My caution is as follows: as with tagging and Wikipedia, we have seen amazing contributions from user-generated content that have totally re-shaped our information world. However, we have not yet seen such processes work for structure and coherent conceptual relationships.

I believe participation and UGC have real roles to play in the emergence of coherent structures and vocabularies to enable interoperability. But I also believe they have not done so to date, and useful approaches to those will not emerge in a free-form fashion and without consideration for sustainability and coherence testing.

Putting this Context into Context

In these observations there is absolutely no criticism intended or implied of DBpedia or prior linked data practice. A natural and understandable progression has been followed here: first make connections between things, then begin to surface knowledge through the exploration of relationships. We are just now beginning that exploration through the use of classes, vocabularies and ontologies to explicate relationships. The fact that linked data and DBpedia first emphasized linking things and publishing things is a major milestone. It is now time to move on to the new challenges of structure and relationships.

There is much to be learned from other pathbreaking efforts such as the Open Biomedical Ontologies efforts and their attempts at coordination and standardization. As the demands and interests in interoperability increase, interfaces, consistency and coordination will continue, I believe, to come to the fore.

In another 18 months we will likely look back at today’s issues and thoughts as also naïve in light of new understandings. The pace of discovery and learning is such that I believe best practices will remain fluid for quite some time.


[1] An exact analysis related to our arguments of coherence has not been made, but related studies point to these observations in part or in slightly different contexts.

Wordnets tend to be star-like in structure, with sparse relations, and dominated by a few hub clusters. c.f., Holger Wunsch, 2008. Exploiting Graph Structure for Accelerating the Calculation of Shortest Paths in Wordnets, in Proceedings of Coling 2008, Manchester, UK, August 2008. See http://www.sfs.uni-tuebingen.de/~wunsch/wn-shortest-paths.pdf.

The strict uncorrected structure of Wikipedia categories can also be inconsistent, inaccurate, populated with administrative categories, demonstrate cycles, and lack uniform coverage. c.f., Jonathan Yu, James A. Thom and Audrey Tam, 2007. Ontology Evaluation Using Wikipedia Categories for Browsing, in Proceedings of 16th Conference on Information and Knowledge Management (CIKM), 2007; see http://goanna.cs.rmit.edu.au/~jyu/publications/YuEtal07.pdf. This paper also presents a comprehensive framework for ontology evaluation.

This reference describes a new way to calculate semantic relatedness (not the same as coherence) in relation to Wikipedia, ConceptNet and WordNet: Sander Wubben, 2008. Using Free Link Structure to Calculate Semantic Relatedness. ILK Research Group Technical Report Series no. 08-01, July 2008, 61 pp. See http://ilk.uvt.nl/downloads/pub/papers/wubben2008-techrep.pdf.

Table 3 in this citation presents an interesting contrast between what the authors call collaborative knowledge bases (CKBs, like Wikipedia) and linguistic knowledge bases (LKBs, like WordNet), again however not really addressing the coherence issue: Torsten Zesch, Christof Müller and Iryna Gurevych, 2008. Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary, in Proceedings of the Sixth International Language Resources and Evaluation (LREC’08), May 28-30, Marrakech, Morocco. See http://elara.tk.informatik.tu-darmstadt.de/publications/2008/lrec08_camera_ready.pdf. Also see Torsten Zesch and Iryna Gurevych, 2007. Analysis of the Wikipedia Category Graph for NLP Applications, in Proceedings of the Workshop TextGraphs-2: Graph-Based Algorithms for Natural Language Processing at HLT-NAACL 2007, 26 April, 2007, Rochester, NY, pp. 1-8. See http://www.aclweb.org/anthology-new/W/W07/W07-02.pdf.

Posted by AI3's author, Mike Bergman Posted on September 24, 2008 at 6:03 pm in Ontologies, Semantic Web, Structured Web, UMBEL | Comments (2)
The URI link reference to this post is: http://www.mkbergman.com/454/exploding-the-domain-in-context/
Posted: September 22, 2008

The Linkage of UMBEL’s 20,000 Subject Concepts and Inferencing Brings New Capabilities

Thanks to Kingsley Idehen and OpenLink Software, DBpedia has been much enriched with its mapping to UMBEL’s 20,000 class-based subject concepts. DBpedia is the structured data version of Wikipedia that I (among many) wrote about in depth in April of last year shortly after its release.

We have also recently gotten an updated estimate of the size of the semantic Web and a new release of the linking open data (LOD) cloud diagram.

A New Instance of the LOD Cloud Diagram

Since DBpedia’s release, it has become the central hub of linked open data as shown by this now-famous (and recently updated!) LOD diagram [1]:

[Figure: LOD cloud diagram]

Each version of the diagram adds new bubbles (datasets) and new connections. The use of linked data, which is based on the RDF data model and uses Web protocols to name and access data, is proving to be a powerful framework for interconnecting disparate and heterogeneous information. As the diagram above shows, all types of information from a variety of public sources now make up the LOD cloud [2].

A Beginning Basis for Estimating the Size of the Semantic Web

The most recent analysis of this LOD cloud is by Michael Hausenblas and colleagues as presented at I-Semantics08 in September [3]. About 50 major datasets comprising roughly two billion triples and three million interlinks were contained in the cloud at the time of their analysis. They partitioned their analysis into two distinct types: 1) single-point-of-access datasets (akin to conventional databases), such as DBpedia or Geonames, and 2) distributed records characterized by RDF ontologies such as FOAF or SIOC. Their paper [3] should be reviewed for its own conclusions. In general, though, most links appear to be of low value (though a minority are quite useful).

Simple measures such as triples or links have little meaning in themselves. Moreover, and this is most telling, all of the LOD relationships in the diagram above and the general nature of linked data to date have based their connections on instance-level data. Often this takes the form that a specific person, place or thing in one dataset is related to that very same thing in another dataset using the owl:sameAs property; sometimes it is that one person knows another person; or, it may be in other examples that one entry has an associated photo. Entities are related to other entities and their attributes, but little is provided about the conceptual or structural relationships amongst those entities.

Instance-level mapping is highly useful to aggregate various attributes or facts about given entities or things. But it only scratches the surface of the structure that can be made available through linked data and the conceptual relationships between and amongst all of those things. For those relationships to be drawn or inferred, a different level of linkage needs to be made: the class, collection or schema view of the data.
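As a rough illustration of what instance-level linking does give you, the Python sketch below groups resources connected by owl:sameAs and pools the facts stated about any of them. The identifiers, predicates and values are hypothetical, and a small union-find structure is simply one convenient way to do the grouping.

```python
from collections import defaultdict

# Hypothetical triples from three datasets; identifiers and values are
# illustrative only.
TRIPLES = [
    ("a:Berlin", "owl:sameAs", "b:city42"),
    ("b:city42", "owl:sameAs", "c:berlin"),
    ("a:Berlin", "ex:country", "Germany"),
    ("b:city42", "ex:label", "Berlin"),
    ("c:berlin", "ex:photo", "c:berlin.jpg"),
    ("a:Paris", "ex:country", "France"),
]


def find(parent, node):
    """Return the representative of node's sameAs group, shortening the path."""
    parent.setdefault(node, node)
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def merge_same_as(triples):
    """Pool all non-sameAs facts under one representative per sameAs group."""
    parent = {}
    for s, p, o in triples:
        if p == "owl:sameAs":
            parent[find(parent, s)] = find(parent, o)
    merged = defaultdict(set)
    for s, p, o in triples:
        if p != "owl:sameAs":
            merged[find(parent, s)].add((p, o))
    return dict(merged)


for entity, facts in merge_same_as(TRIPLES).items():
    print(entity, sorted(facts))
```

The result aggregates attributes per entity, but nothing in it says what kind of thing each entity is or how the kinds relate, which is the gap that class-level links address.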

The UMBEL Subject Concept ‘Backbone’

UMBEL, or similar conceptual frameworks, can provide this structural backbone.

UMBEL (Upper Mapping and Binding Exchange Layer; see http://www.umbel.org) is a lightweight reference ontology of about 20,000 subject concepts and their logical and semantic relationships. The UMBEL ontology is a direct derivation of the proven Cyc knowledge base from Cycorp, Inc. (see http://www.cyc.com).

UMBEL’s subject concepts provide mapping points for the many (indeed, millions of) named entities that are their notable instances. Examples might include the names of specific physicists, cities in a country, or a listing of financial stock exchanges. UMBEL mappings enable us to link a given named entity to the various subject classes of which it is a member.

And, because of relationships amongst subject concepts in the backbone, we can also relate that entity to other related entities and concepts. The UMBEL backbone traces the major pathways through the content graph of the Web.

The UMBEL backbone provides structure and relationships at large or small scale. For example, in its full extent, UMBEL’s complete structure resembles:

UMBEL Big Graph

But, we can dive into that structure with respect to automobiles or related concepts . . .

UMBEL Big Saab

. . . all the way down to seeing the relationships to Saab cars:

UMBEL Saab Neighborhood

It is this ability to provide context through structure and relations that can help organize and navigate large datasets of instances such as DBpedia. Until the application of UMBEL — or any subject or class structure like it — most of the true value within DBpedia has remained hidden.

But no longer.

Some Example Queries

UMBEL already had mapped most DBpedia instances to its own internal classes. By a simple mapping of files and then inferencing against the UMBEL classes, this structure has now been brought to DBpedia itself. Any SPARQL queries applied against DBpedia can now take advantage of these relationships.

Below are some sample queries Kingsley used to announce these UMBEL capabilities to the LOD mailing list [4]. You can test these queries yourself or try alternative ones by using a standard SPARQL query.

For example, go to one of DBpedia’s query endpoints such as http://dbpedia.org/sparql and cut-and-paste one of these highlighted code snippets into the ‘Query text’ box:

Example Query 1

define input:inference 'http://dbpedia.org/resource/inference/rules/umbel#'
prefix umbel: <http://umbel.org/umbel/sc/>
select ?s
where
{
  ?s a umbel:RoadVehicle
}

Example Query 2

define input:inference 'http://dbpedia.org/resource/inference/rules/umbel#'
prefix umbel: <http://umbel.org/umbel/sc/>
select ?s
where
{
  ?s a umbel:Automobile_GasolineEngine
}

Example Query 3

define input:inference 'http://dbpedia.org/resource/inference/rules/umbel#'
prefix umbel: <http://umbel.org/umbel/sc/>
select ?s
where
{
  ?s a umbel:Project
}

Example Query 4

define input:inference 'http://dbpedia.org/resource/inference/rules/umbel#'
prefix umbel: <http://umbel.org/umbel/sc/>
select ?s
where
{
  ?s a umbel:Person
}

Example Query 5

define input:inference 'http://dbpedia.org/resource/inference/rules/umbel#'
prefix umbel: <http://umbel.org/umbel/sc/>
select ?s
where
{
  ?s a umbel:Graduate ;
     a umbel:Boxer .
}

Example Query 6

define input:inference 'http://dbpedia.org/resource/inference/rules/umbel#'
prefix umbel: <http://umbel.org/umbel/sc/>
prefix yago: <http://dbpedia.org/class/yago/>
select ?s
where
{
  ?s a yago:FemaleBoxers ;
     a umbel:Graduate ;
     a umbel:Boxer .
}
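To see why the inference rule set matters for queries like these, here is a small Python sketch with hypothetical classes and instances. It adds inferred type statements by following subclass links until nothing new appears, then selects the instances that carry every requested class. How the endpoint itself applies its rules is not described here; the forward materialization below is only an illustrative stand-in.

```python
# Hypothetical data; the class hierarchy and instances are illustrative only.
SUBCLASS_OF = {("ex:FemaleBoxer", "ex:Boxer"), ("ex:LawGraduate", "ex:Graduate")}
TYPES = {
    ("ex:alice", "ex:FemaleBoxer"),
    ("ex:alice", "ex:LawGraduate"),
    ("ex:bob", "ex:Boxer"),
}


def materialize(types, subclass_of):
    """Add inferred (instance, class) pairs until no new pair appears."""
    inferred = set(types)
    changed = True
    while changed:
        changed = False
        for inst, cls in list(inferred):
            for child, parent in subclass_of:
                if child == cls and (inst, parent) not in inferred:
                    inferred.add((inst, parent))
                    changed = True
    return inferred


def instances_of_all(classes, types, subclass_of):
    """Instances typed, directly or by inference, with every class given."""
    inferred = materialize(types, subclass_of)
    instances = {inst for inst, _ in inferred}
    return sorted(i for i in instances if all((i, c) in inferred for c in classes))


wanted = ["ex:Graduate", "ex:Boxer"]
print(instances_of_all(wanted, TYPES, SUBCLASS_OF))  # with subclass inference
print(instances_of_all(wanted, TYPES, set()))        # asserted types only
```

With the subclass links in place the two-class pattern matches ex:alice; with asserted types alone it matches nothing, which mirrors the difference the `define input:inference` line is meant to make.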

Creating Your Own Mapping

By going to UMBEL’s technical documentation page at http://umbel.org/documentation.html, you can download the files to create your own mappings (assuming you have a local instance of DBpedia).

The example below also assumes you are using the OpenLink Virtuoso server as your triple store. If you are using a different system, you will need to adjust your commands accordingly.

1. Load linkages (owl:sameAs) between UMBEL named entities and DBpedia resources

File: umbel_dbpedia_linkage.n3

select ttlp_mt (file_to_string_output ('umbel_dbpedia_linkage.n3'), '', 'http://dbpedia.org');

2. Load inferred DBpedia types (rdf:type) based on UMBEL named entities

File: umbel_dbpedia_types.n3

select ttlp_mt (file_to_string_output ('umbel_dbpedia_types.n3'), '', 'http://dbpedia.org');

3. Load Virtuoso-specific file containing the rules for inferencing

File: umbel_virtuoso_inference_rules.n3

select ttlp_mt (file_to_string_output ('umbel_virtuoso_inference_rules.n3'), '', 'http://dbpedia.org/resource/classes/umbel#');

4. Load UMBEL External Ontology Mapping into a Named Graph (owl:equivalentClass)

File: umbel_external_ontologies_linkage.n3

select ttlp_mt (file_to_string_output ('umbel_external_ontologies_linkage.n3'), '', 'http://dbpedia.org/resource/classes/umbel#');

5. Create UMBEL Inference Rules

rdfs_rule_set ('http://dbpedia.org/resource/inference/rules/umbel#', 'http://dbpedia.org/resource/classes/umbel#');
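If you repeat these steps often, the five statements can be generated rather than retyped. The Python sketch below is a hypothetical helper, not part of UMBEL or Virtuoso: it only builds the statement text from the file names and graph IRIs listed in steps 1 to 5, and assumes you will paste or pipe the output into your own Virtuoso SQL session.

```python
# File names and graph IRIs are those listed in steps 1-4; the helper functions
# themselves are hypothetical conveniences.
LOADS = [
    ("umbel_dbpedia_linkage.n3", "http://dbpedia.org"),
    ("umbel_dbpedia_types.n3", "http://dbpedia.org"),
    ("umbel_virtuoso_inference_rules.n3", "http://dbpedia.org/resource/classes/umbel#"),
    ("umbel_external_ontologies_linkage.n3", "http://dbpedia.org/resource/classes/umbel#"),
]
RULE_SET = "http://dbpedia.org/resource/inference/rules/umbel#"
RULE_GRAPH = "http://dbpedia.org/resource/classes/umbel#"


def sql_quote(value):
    """Quote a string as a SQL literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_statements(loads=LOADS):
    """Return the load statements followed by the rule-set statement, in order."""
    statements = [
        f"select ttlp_mt (file_to_string_output ({sql_quote(name)}), '', {sql_quote(graph)});"
        for name, graph in loads
    ]
    statements.append(f"rdfs_rule_set ({sql_quote(RULE_SET)}, {sql_quote(RULE_GRAPH)});")
    return statements


if __name__ == "__main__":
    print("\n".join(build_statements()))
```

Keeping the file-to-graph pairs in one list makes it easier to adjust the graph IRIs if your local DBpedia instance uses different ones.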

Conclusion

A new era of interacting with DBpedia is at hand. Within a period of just more than a year, the infrastructure and data are now available to show the advantages of the semantic Web based on a linked Web of data. DBpedia has been a major reason for showing these benefits; it is now positioned to continue to do so.


[1] This new LOD diagram is still being somewhat updated based on review. The version shown above is based on the one posted at the W3C’s SWEO wiki with my own updates of the two-way UMBEL links and the blue highlighting of DBpedia and UMBEL. There is also a clickable version of the diagram that will take you to the home references for the constituent data sources in this diagram; see http://www4.wiwiss.fu-berlin.de/bizer/pub/lod-datasets_2008-09-18.html.
[2] The objective of the Linking Open Data community is to extend the Web with a data commons by publishing various open datasets as RDF on the Web and by setting RDF links between data items from different data sources. All of the sources on the LOD diagram are such open data. However, the best practices of linked data can also be applied to proprietary or intranet information as well; see this FAQ.
[3] See, Michael Hausenblas, Wolfgang Halb, Yves Raimond and Tom Heath, 2008. What is the Size of the Semantic Web?, paper presented at the International Conference on Semantic Systems (I-Semantics08) at TRIPLE-I, Sept. 2008. See http://sw-app.org/pub/isemantics08-sotsw.pdf.

Posted by AI3's author, Mike Bergman Posted on September 22, 2008 at 11:47 pm in Open Source, Semantic Web, Structured Web, UMBEL | Comments (2)
The URI link reference to this post is: http://www.mkbergman.com/453/dbpedia-gains-a-subject-class-structure-lod-cloud-diagram-updated/
Posted: August 20, 2008

In a recent posting on the Ontolog forum, Toby Considine discussed the difficulty of describing to several business CEOs the concept of an ontology. He noted that when one of the CEOs finally got it, he explained it thus to the others:

“Ontology is always a value proposition, how a company makes money. Each company, and perhaps each sales professional must be able to define his own ontology and explain it to his customers. We need semantic alignment to create a common basis for discussing value. If it is a good semantic set, then the ontologies that each sales director creates will be better; better to produce sales differentiation, and better to produce true long-term value for the company.
“A general purpose ontology gives us a framework to develop and discuss our own value propositions. But those value propositions, and their underlying ontologies must remain proprietary, or else every company is just building to the lowest common denominator, and innovation and value creation end.”

BTW, Toby is chair of the OASIS Open Building Information Exchange (oBIX) Technical Committee (see http://www.oasis-open.org), and is used to conversing about standards and technical matters to business audiences.

This discussion came up in relation to the use of the Cyc knowledge base and the possible role of “lightweight” or “foundational” reference ontologies.

A number of interesting points are embedded and implied in this discussion. At the risk of reading too much into them, they include:

  • Foundational, reference ontologies have an important role, but as frameworks and for external interoperability
  • Each enterprise has its own world view, which can be expressed as an ontology and represents its “value proposition”; in this regard, internal ontologies work similarly to current legacy schema
  • Semantic “alignment” (and therefore interoperability) is important to discuss value
  • For a business enterprise, the real focus of its ontologies is to express its value proposition, how it makes money.

I think these sentiments are just about right, with the last point especially profound.

We have supported UMBEL as an important reference structure, and see the role for ever more specific ones. But, at the other end of the spectrum, ontologies are also specific world views, and can and should be private for proprietary enterprises. Yet this is not in any way in conflict with the interoperation — with increasingly widening circles — using shared structure (ontologies).

The balance and integration of the private and public in semantic Web ontologies is still being worked out. But, I truly believe it is appropriate and necessary that both the public and the private be embraced.

Toby’s CEO got it almost right: innovation depends on reserving some proprietary aspects. But the complete story, I also think, is that embracing ontologies themselves and interoperable linked data frameworks in that context is also a key source of innovation and added value.

Posted: August 4, 2008

Updated Slide Show is Now Available

An updated UMBEL slide show was recently posted to Slideshare.

Posted by AI3's author, Mike Bergman Posted on August 4, 2008 at 11:23 am in Semantic Web, Structured Web, UMBEL | Comments (0)
The URI link reference to this post is: http://www.mkbergman.com/451/umbel-subject-concepts-layer-for-the-web/
Posted: July 25, 2008

'Dust Motes Dancing in Sunlight, Interior from the Artist's Home, Strandgade 30,' Vilhelm Hammershøi, 1900; courtesy of http://www.thecityreview.com/copen.html

Structure Demands Context; But, is that Enough?

Last week marked a red-letter day in my professional life with release of the UMBEL subject concept structure. UMBEL began as a gleam in the eye more than a year ago when I observed that semantic Web techniques, while powerful — especially with regard to the RDF data model as a universal and simple (at its basics) means for representing any information and its structure — still lacked something. It took me a while to recognize that the first something was context.

Now, I have written and talked much about context before on this blog, with The Semantics of Context being the most salient article for the present discussion.

This is my mental image of Web content without context: unconnected dust motes floating through a sunlit space, moving slowly, randomly, and without connections, sort of like Brownian motion. Think of the sunlight on dust shown by the picture to the left.

By providing context, I envisioned that we could freeze these moving dust motes and place them into a fixed structure, perhaps something like constellations in the summer sky. Or, at least, make them more stable, floating less aimlessly and unconnected.

So, my natural response was to look for structural frameworks to provide that context. And that was the quest I set forward at UMBEL’s initiation.

At the time of UMBEL’s genesis, the impact of Wikipedia and other sources of user-generated content (UGC) such as del.icio.us or Flickr or many, many others was becoming clear. The usefulness of tags, folksonomies, microformats and other forms of “bottom-up” structure was proven.

The evident — and to me, exciting — aspect of globally-provided UGC was that this was the ultimate democratic voice: the World has spoken, and the article about this or the tag about that had been vetted in the most interactive, exposed, participatory and open framework possible. Moreover, as the World changed and grew, these new realizations would also be fed back into the system in a self-correcting goodness. Final dot.

Through participation and collective wisdom, therefore, we could gain consensus and acceptance and avoid the fragility and arbitrariness of “wise man” or imposed from the “top-down” answers. The people have spoken. All voices have been heard. The give and take of competing views have found their natural resting point. Again, I thought, final dot.

Thus, when I first announced UMBEL, my stated desire (and hope) was that something like Wikipedia could or would provide that structural context. Here is a quote from the announcement of UMBEL, nearly one year ago to this day:

The selection of the actual subject proxies within the UMBEL core are to be based on consensus use. The subjects of existing and popular Web subject portals such as Wikipedia and the Open Directory Project (among others) will be intersected with other widely accepted subject reference systems such as WordNet and library classification systems (among others) in order to derive the candidate pool of UMBEL subject proxies.

Yet, that is not the basis of the structure announced last week for UMBEL. Why?

The Strengths of User-Generated Content

Before we probe the negative, let’s rejoice in the positive.

User-generated content (UGC) works. It has rapidly proven itself in venues ranging from authoritative subjects (Wikipedia), photos (Flickr), bookmarking and tagging (del.icio.us), blogs and video (YouTube) to every Web space imaginable. This is new, was not foreseen by most a few years ago, and has totally remade our perception of content and how it can be generated. Wow!

The nature of this user-generated content, of course, as is true for the Web itself, is that it has arisen from a million voices without coercion, coordination or a plan, spontaneously in relation to chosen platforms and portals. Yet, still, today, as to what makes one venue more successful than others, we are mostly clueless. My suspicion is that — akin to financial markets — when Web portals or properties are successful, they readily lend themselves to retrospective books and learned analysis explaining that success. But, just try to put down that “recipe” in advance, and you will most likely fail.

So, prognostication is risky business around these parts.

There is a reason why both the head and sub-head of this article are stated as questions: I don’t know. For the reasons stated above, I would still prefer to see user-generated structure (UGS) emerge in the same way that topic- and entity-specific content has on Wikipedia. However, what I can say is this: for the present, this structure has not yet emerged in a coherent way.

Might it? Actually, I hope so. But, I also think it will not arise from systems or environments exactly like Wikipedia and, if it does arise, it will take considerable time. I truly hope such new environments emerge, because user-mediated structure will also have legitimacy and wisdom that no “expert” approach may ever achieve.

But these are what-ifs, nice-to-haves and wouldn’t-it-be-nices. For my purposes, and for the clients my company serves, what is needed must be pragmatic and doable today, all with acceptable risk, time to delivery and cost.

So, I think it safe to say that UGC works well today at the atomic level of the individual topic or data object, what might be called the nodes in global content, but not for the connections between those nodes, which form its structure. And the key to why user-generated structure (UGS) has not emerged in a bottom-up way resides in that pivotal word above: coherence.

Coherence was the second something, alongside context, that I found to be a missing piece for the semantic Web.

Coherence in Context

What is it to be coherent? The tenth edition of Merriam-Webster’s Collegiate Dictionary (and the online version) defines it as:

coherent \kō-ˈhir-ənt\ adj.; Middle French or Latin; Middle French cohérent, from Latin cohaerent-, cohaerens, present participle of cohaerēre Date: (ca. 1555)

1: a: logically or aesthetically ordered or integrated : consistent <coherent style> <a coherent argument> b: having clarity or intelligibility : understandable <a coherent person> <a coherent passage>
2: having the quality of cohering; especially : cohesive, coordinated <a coherent plan for action>
3: a : relating to or composed of waves having a constant difference in phase <coherent light> b: producing coherent light <a coherent source>.

Another online source I like for visualization purposes is Visuwords, which displays the accompanying graph relationships view based on WordNet.

Of course, coherent is just the adjectival property of having coherence. Again, the Merriam-Webster dictionary defines coherence as 1: the quality or state of cohering: as a: systematic or logical connection or consistency b: integration of diverse elements, relationships, or values.

Decomposing even further, we can see that coherence is itself the state named by the verb, cohere. Cohere, like its variants above, derives from the Latin cohaerēre, from co- + haerēre (to stick), namely “to stick with”. Again, the Merriam-Webster dictionary defines cohere as 1: a: to hold together firmly as parts of the same mass; broadly: stick, adhere b: to display cohesion of plant parts 2: to hold together as a mass of parts that cohere 3: a: to become united in principles, relationships, or interests b: to be logically or aesthetically consistent.

These definitions capture the essence of coherence in that it is a state of logical, consistent connections, a logical framework for integrating diverse elements in an intelligent way. In the sense of a content graph, this means that the right connections (edges or predicates) have been drawn between the object nodes (or content) in the graph.

Bottom-up UGC: The Hip Bone is Connected to the Arm Bone

Structure without coherence is where connections are being drawn between object nodes, but those connections are incomplete or wrong (or, at least, inconsistent or unintelligible). The nature of the content graph lacks logic. The hip bone is not connected to the thigh bone, but perhaps to something wrong or silly, like the arm or cheek bone.

Ambiguity is one source for such error, as when, for example, the object “bank” is unclear as to whether it is a financial institution, billiard shot, or edge of a river. If we understand the object to be the wrong thing, then connections can get drawn that are in obvious error. This is why disambiguation is such a big deal in semantic systems.

However, ambiguity tends not to be a major source of error in user-generated content (UGC) systems because the humans making the connections can see the context and resolve the meanings. Context is thus a very important basis for resolving ambiguities.

A second source of possible incoherence is the organizational structure or schema of the actual concept relationships. This is the source that poses the most difficulty to UGC systems such as folksonomies or Wikipedia.

Remember from the definitions above that logic, consistency and intelligibility were some of the key criteria for a coherent system. Bottom-up UGS (user-generated structure) is prone to fail the test in all three areas.

“In the context of an information organization framework, a structure is a cohesive whole or ‘container’ that establishes qualified, meaningful relationships among those activities, events, objects, concepts which, taken together, comprise the ‘bounded space’ of the universe of interest.”

– J.T. Tennis and E.K. Jacob [1]

Logic and consistency almost by definition imply the application of a uniform perspective, a single world view. Multiple authors and contributors working without a common frame of reference or viewpoint are unable to bring this consistency of perspective. For example, how time is treated with regard to famous people’s birth dates in Wikipedia is very different from how it is treated in topics on geological eras, and Wikipedia contains no mechanisms for relating those time dimensions or making them consistent.

Logic and intelligibility suggest that the structure should be testable and internally consistent. Is the hip bone connected with the arm bone? No? And why not? In UGC systems, individual connections are made by consensus and at the object-to-object level. There are no mechanisms, at least in present systems, for resolving inconsistencies as these individual connections get aggregated. We can assign dogs as mammals and dogs as pets, but does that mean that all pets are mammals? The connections can get complicated fast, and such higher-order relationships remain unstated or, more often than not, are wrong.
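
The dogs question can be made concrete with a small Python sketch. Everything in it is a hypothetical illustration: the assignment data and the helper names `members` and `subset_of` are assumptions of mine, not part of Wikipedia, UMBEL or any real system. It shows only that item-to-category assignments, taken alone, do not settle how the categories relate to one another.

```python
# Hypothetical item-to-category assignments, as contributors might make them
# one object at a time. The data is invented for illustration.
assignments = {
    "dog": {"mammal", "pet"},
    "cat": {"mammal", "pet"},
}


def members(category):
    """Return every item that has been assigned to the category."""
    return {item for item, cats in assignments.items() if category in cats}


def subset_of(narrower, broader):
    """Check whether every known member of one category is in the other."""
    return members(narrower) <= members(broader)


# With only dogs and cats assigned, naive aggregation suggests a relation
# between the categories that nobody ever asserted.
early_guess = subset_of("pet", "mammal")

# One more object-level assignment silently overturns that guess.
assignments["goldfish"] = {"pet"}
later_guess = subset_of("pet", "mammal")

print(early_guess, later_guess)
```

The category-to-category relation flips as assignments accumulate, and nothing in the data records which answer was intended. That higher-order relationship would have to be stated explicitly by someone with a consistent view of the whole structure.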

Note as well that in UGC systems items may be connected (“assigned”) to categories, but their “factual” relation is not being asserted. Again, without a consistency of how relations are treated and the ability to test assertions, the structures may not only be wrong in their topology, but totally lack any inference power. Is the hip bone connected with the cheek bone? UGC structures lack such fundamental logic underpinnings to test that, or any other, assertion.
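
By way of contrast, here is a minimal Python sketch of what it means for a structure to be testable. The class names, the `SUBCLASS_OF` and `DISJOINT` tables and the helper functions are all hypothetical and are not drawn from Cyc or UMBEL. The sketch assumes two explicitly defined relations, a transitive subclass relation and a disjointness relation, and uses them to infer unstated facts and to flag a contradictory assertion.

```python
# Hypothetical asserted relations; the class names are illustrative only.
SUBCLASS_OF = {
    "dog": {"mammal"},
    "mammal": {"animal"},
    "goldfish": {"fish"},
    "fish": {"animal"},
}

# Pairs of classes declared to share no members.
DISJOINT = {frozenset({"mammal", "fish"})}


def ancestors(cls):
    """Collect every class reachable by following subclass-of links."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in SUBCLASS_OF.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_a(cls, candidate):
    """Test an assertion by inference rather than by direct lookup."""
    return candidate in ancestors(cls)


def violations(cls):
    """Return the disjoint pairs that cls falls under at the same time."""
    found = ancestors(cls)
    return [pair for pair in DISJOINT if pair <= found]


print(is_a("dog", "animal"))  # inferred through "mammal", never stated directly
print(is_a("dog", "fish"))
print(violations("dog"))
```

Because the relations carry defined semantics, a mistaken assertion such as adding "fish" as a parent of "dog" would show up in `violations("dog")` instead of passing unnoticed. Simple category assignment offers no comparable check.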

From the first days of the Web, notably with Yahoo! in its beginnings but with many other portals as well, we have seen many taxonomies and organizational structures emerge. As simple heuristic devices for clustering large amounts of content, these are fine (though, even there, some structures are better at organizing along “natural” lines than others). Wikipedia itself, in its own structure, has useful organizational clustering.

But once a system is proposed, such as UMBEL, with the purpose of providing broad referenceability to virtually any Web content, the threshold condition changes. It is no longer sufficient to merely organize. The structure must now be more fully graphed, with intelligent, testable, consistent and defensible relations.

Full Circle to Cyc and UGC

Once the seemingly innocent objective of being a lightweight subject reference structure was established for UMBEL, the die was cast. Only a coherent structure would work, since anything else would be fragile and rapidly break in the attempt to connect disparate content. Relating content coherently itself demands a coherent framework.

As noted in the lead-in, this was not a starting premise. But, it became an unavoidable requirement once the UMBEL effort began in earnest.

I have spoken elsewhere about other potential candidates as possibly providing the coherent underlying structure demanded by UMBEL. We have also discussed why Cyc, while by no means perfect, was chosen as the best starting framework for contributing this coherent structure.

I anticipate we will see many alternative structures proposed to UMBEL based on other frameworks and premises. This is, of course, natural and the nature of competition and different needs and world views.

However, it will be most interesting to see if either ad hoc structures or those derived from bottom-up UGC systems like Wikipedia can be robust and coherent enough to support data interoperability at Web scale.

I strongly suspect not.


[1] Joseph T. Tennis and Elin K. Jacob, 2008. “Toward a Theory of Structure in Information Organization Frameworks,” upcoming presentation at the 10th International Conference of the International Society for Knowledge Organization (ISKO 10), in Montréal, Canada, August 5th-8th, 2008. See http://www.ebsi.umontreal.ca/isko2008/documents/abstracts/tennis.pdf.