Last week marked a red-letter day in my professional life with the release of the UMBEL subject concept structure. UMBEL began as a gleam in the eye more than a year ago when I observed that semantic Web techniques, while powerful (especially with regard to the RDF data model as a universal and, at its basics, simple means for representing any information and its structure), still lacked something. It took me a while to recognize that the first something was context.
Now, I have written and talked much about context before on this blog, with The Semantics of Context being the most salient article for the present discussion.
This is my mental image of Web content without context: unconnected dust motes floating through a sunlit space, moving slowly and randomly, sort of like Brownian motion. Think of the sunlight on dust shown by the picture to the left.
By providing context, I envisioned that we could freeze these moving dust motes and place them into a fixed structure, perhaps something like constellations in the summer sky. Or, at least, make them more stable, floating less aimlessly and less unconnected.
So, my natural response was to look for structural frameworks to provide that context. And that was the quest I set forward at UMBEL’s initiation.
At the time of UMBEL’s genesis, the impact of Wikipedia and other sources of user-generated content (UGC) such as del.icio.us or Flickr or many, many others was becoming clear. The usefulness of tags, folksonomies, microformats and other forms of “bottom-up” structure was proven.
The evident — and to me, exciting — aspect of globally-provided UGC was that this was the ultimate democratic voice: the World has spoken, and the article about this or the tag about that had been vetted in the most interactive, exposed, participatory and open framework possible. Moreover, as the World changed and grew, these new realizations would also be fed back into the system in a self-correcting goodness. Final dot.
Through participation and collective wisdom, therefore, we could gain consensus and acceptance and avoid the fragility and arbitrariness of “wise man” or imposed “top-down” answers. The people have spoken. All voices have been heard. The give and take of competing views has found its natural resting point. Again, I thought, final dot.
Thus, when I first announced UMBEL, my stated desire (and hope) was that something like Wikipedia could or would provide that structural context. Here is a quote from the announcement of UMBEL, nearly one year ago to this day:
The selection of the actual subject proxies within the UMBEL core are to be based on consensus use. The subjects of existing and popular Web subject portals such as Wikipedia and the Open Directory Project (among others) will be intersected with other widely accepted subject reference systems such as WordNet and library classification systems (among others) in order to derive the candidate pool of UMBEL subject proxies.
Yet, that is not the basis of the structure announced last week for UMBEL. Why?
Before we probe the negative, let’s rejoice in the positive.
User-generated content (UGC) works. It has rapidly proven itself in venues ranging from authoritative subjects (Wikipedia), photos (Flickr), and bookmarking and tagging (del.icio.us) to blogs, video (YouTube) and every Web space imaginable. This is new, was not foreseen by most a few years ago, and has totally remade our perception of content and how it can be generated. Wow!
The nature of this user-generated content, of course, as is true for the Web itself, is that it has arisen from a million voices without coercion, coordination or a plan, spontaneously in relation to chosen platforms and portals. Yet, still, today, as to what makes one venue more successful than others, we are mostly clueless. My suspicion is that — akin to financial markets — when Web portals or properties are successful, they readily lend themselves to retrospective books and learned analysis explaining that success. But, just try to put down that “recipe” in advance, and you will most likely fail.
So, prognostication is risky business around these parts.
There is a reason why both the head and sub-head of this article are stated as questions: I don’t know. For the reasons stated above, I would still prefer to see user-generated structure (UGS) emerge in the same way that topic- and entity-specific content has on Wikipedia. However, what I can say is this: for the present, this structure has not yet emerged in a coherent way.
Might it? Actually, I hope so. But, I also think it will not arise from systems or environments exactly like Wikipedia and, if it does arise, it will take considerable time. I truly hope such new environments emerge, because user-mediated structure will also have legitimacy and wisdom that no “expert” approach may ever achieve.
But these are what if’s, nice to have’s and wouldn’t it be nice’s. For my purposes, and those of the clients my company serves, what is needed must be pragmatic and doable today, all with acceptable risk, time to delivery and cost.
So, I think it safe to say that UGC works well today at the atomic level of the individual topic or data object, what might be called the nodes in global content, but not in the connections between those nodes, its structure. And, the key to the answer of why user-generated structure (UGS) has not emerged in a bottom-up way resides in that pivotal word above: coherence.
Coherence was the second something, alongside context, that was a missing piece for the semantic Web.
What is it to be coherent? The tenth edition of Merriam-Webster’s Collegiate Dictionary (and the online version) defines it as:
1: a: logically or aesthetically ordered or integrated : consistent <coherent style> <a coherent argument> b: having clarity or intelligibility : understandable <a coherent person> <a coherent passage>
2: having the quality of cohering; especially : cohesive, coordinated <a coherent plan for action>
3: a : relating to or composed of waves having a constant difference in phase <coherent light> b: producing coherent light <a coherent source>.
Of course, coherent is just the adjectival property of having coherence. Again, the Merriam-Webster dictionary defines coherence as 1: the quality or state of cohering: as a: systematic or logical connection or consistency b: integration of diverse elements, relationships, or values.
Decomposing even further, we can see that coherence is itself the state of the verb, cohere. Cohere, as in its variants above, derives from the Latin cohaerēre, from co- + haerēre (to stick), namely “to stick with”. Again, the Merriam-Webster dictionary defines cohere as 1: a: to hold together firmly as parts of the same mass; broadly: stick, adhere b: to display cohesion of plant parts 2: to hold together as a mass of parts that cohere 3: a: to become united in principles, relationships, or interests b: to be logically or aesthetically consistent.
These definitions capture the essence of coherence in that it is a state of logical, consistent connections, a logical framework for integrating diverse elements in an intelligent way. In the sense of a content graph, this means that the right connections (edges or predicates) have been drawn between the object nodes (or content) in the graph.
Structure without coherence is where connections are being drawn between object nodes, but those connections are incomplete or wrong (or, at least, inconsistent or unintelligible). The nature of the content graph lacks logic. The hip bone is not connected to the thigh bone, but perhaps to something wrong or silly, like the arm or cheek bone.
Ambiguity is one source for such error, as when, for example, the object “bank” is unclear as to whether it is a financial institution, billiard shot, or edge of a river. If we understand the object to be the wrong thing, then connections can get drawn that are in obvious error. This is why disambiguation is such a big deal in semantic systems.
However, ambiguity tends not to be a major source of error in user-generated content (UGC) systems because the humans making the connections can see the context and resolve the meanings. Context is thus a very important basis for resolving ambiguities.
A second source of possible incoherence is the organizational structure or schema of the actual concept relationships. This is the source that poses the most difficulty to UGC systems such as folksonomies or Wikipedia.
Remember in the definitions above that logic, consistency and intelligibility were some of the key criteria for a coherent system. Bottom-up UGS (user-generated structure) is prone to fail the test in all three areas.
Logic and consistency almost by definition imply the application of a uniform perspective, a single world view. Multiple authors and contributors working without a common frame of reference or viewpoint are unable to bring this consistency of perspective. For example, how time is treated with regard to famous people’s birth dates in Wikipedia is very different from its treatment of time with respect to topics on geological eras, and Wikipedia contains no mechanisms for relating those time dimensions or making them consistent.
Logic and intelligibility suggest that the structure should be testable and internally consistent. Is the hip bone connected with the arm bone? No? And why not? In UGC systems, individual connections are made by consensus and at the object-to-object level. There are no mechanisms, at least in present systems, for resolving inconsistencies as these individual connections get aggregated. We can assign dogs as mammals and dogs as pets, but does that mean that all pets are mammals? The connections can get complicated fast, and such higher-order relationships remain unstated or, more often than not, are wrong.
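To make the dogs, mammals and pets question concrete, here is a minimal sketch of what testing a higher-order relationship against aggregated assignments might look like. The items, categories and function names are all hypothetical, invented for illustration; they are not drawn from UMBEL, Cyc or any real folksonomy.

```python
# Hypothetical item-to-category assignments of the kind a bottom-up
# system might collect one connection at a time.
assignments = {
    ("dog", "mammal"),
    ("dog", "pet"),
    ("goldfish", "pet"),
    ("goldfish", "fish"),
}

def members(category):
    """Return every item assigned to the given category."""
    return {item for item, cat in assignments if cat == category}

def all_are(sub, sup):
    """Test the claim 'every member of sub is a member of sup'."""
    return members(sub) <= members(sup)

def counterexamples(sub, sup):
    """List the items that break the claim, if any."""
    return sorted(members(sub) - members(sup))

print(all_are("pet", "mammal"))          # False
print(counterexamples("pet", "mammal"))  # ['goldfish']
```

Note that the same test would report that all mammals are pets, simply because the only mammal recorded here happens to be a dog. A test over incomplete assignments can only be as sound as the assignments themselves, which is one way the aggregated structure ends up wrong.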
Note as well that in UGC systems items may be connected (“assigned”) to categories, but their “factual” relation is not being asserted. Again, without a consistency of how relations are treated and the ability to test assertions, the structures may not only be wrong in their topology, but totally lack any inference power. Is the hip bone connected with the cheek bone? UGC structures lack such fundamental logic underpinnings to test that, or any other, assertion.
From the first days of the Web, notably Yahoo! in its beginnings but many other portals as well, we have seen many taxonomies and organizational structures emerge. As simple heuristic devices for clustering large amounts of content, these are fine (though even there, some structures are better at organizing along “natural” lines than others). Wikipedia itself, in its own structure, has useful organizational clustering.
But once a system is proposed, such as UMBEL, with the purpose of providing broad referenceability to virtually any Web content, the threshold condition changes. It is no longer sufficient to merely organize. The structure must now be more fully graphed, with intelligent, testable, consistent and defensible relations.
Once the seemingly innocent objective of being a lightweight subject reference structure was established for UMBEL, the die was cast. Only a coherent structure would work, since anything else would be fragile and rapidly break in the attempt to connect disparate content. Relating content coherently itself demands a coherent framework.
As noted in the lead-in, this was not a starting premise. But, it became an unavoidable requirement once the UMBEL effort began in earnest.
I have spoken elsewhere about other potential candidates as possibly providing the coherent underlying structure demanded by UMBEL. We have also discussed why Cyc, while by no means perfect, was chosen as the best starting framework for contributing this coherent structure.
I anticipate we will see many alternative structures proposed to UMBEL based on other frameworks and premises. This is, of course, natural and the nature of competition and different needs and world views.
However, it will be most interesting to see if either ad hoc structures or those derived from bottom-up UGC systems like Wikipedia can be robust and coherent enough to support data interoperability at Web scale.
I strongly suspect not.
Today marks the first public release of UMBEL, a lightweight subject concept reference structure for the Web. This version 0.70 release required a full 12 months and many person-years of development effort.
UMBEL (Upper Mapping and Binding Exchange Layer) is a lightweight ontology structure for relating Web content and data to a standard set of 20,000 subject concepts. Its purpose is to provide a fixed set of common reference points in the global knowledge space. These subject concepts have defined relationships between them, and can act as semantic binding nodes for any Web content or data. The UMBEL reference structure is a large, inclusive, linked concept graph.
Connecting to the UMBEL structure gives context and coherence to Web data. In this manner, Web data can be linked, made interoperable, and more easily navigated and discovered. UMBEL is a great vehicle for interconnecting content metadata.
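As a rough illustration of what "binding" content to a fixed set of related reference concepts buys you, consider the sketch below. It is a toy under stated assumptions: the concept names, the broader-concept links and the example URLs are all made up, and it does not use the UMBEL vocabulary or its actual predicates.

```python
# Hypothetical subject concepts, each pointing to its broader concept.
BROADER = {
    "Dog": "Mammal",
    "Cat": "Mammal",
    "Mammal": "Animal",
}

# Hypothetical Web content, each item bound to one subject concept.
BINDINGS = {
    "http://example.org/post/beagles": "Dog",
    "http://example.org/photo/tabby": "Cat",
    "http://example.org/article/zoo": "Animal",
}

def ancestors(concept):
    """Walk the broader-concept links upward from a concept."""
    chain = []
    while concept in BROADER:
        concept = BROADER[concept]
        chain.append(concept)
    return chain

def content_under(concept):
    """Find content bound to the concept or to any narrower concept."""
    return sorted(
        url for url, bound in BINDINGS.items()
        if bound == concept or concept in ancestors(bound)
    )

print(content_under("Mammal"))
```

Here the beagle post and the tabby photo were never linked to each other, yet both surface under "Mammal" because each was bound to a concept with a defined place in the shared structure.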
The UMBEL vocabulary defines some important new predicates and leverages existing semantic Web standards. The ontology is provided as Linked Data with Web services access (and pending SPARQL endpoints). Besides its 20,000 subject concepts and relationships distilled from OpenCyc, a further 1.5 million named entities are mapped to that structure. The system is easily extendable.
Fred Giasson, UMBEL’s co-editor, posts separately on how the UMBEL vocabulary can enrich existing semantic Web ontologies and techniques. Also, see the project’s Web site for additional background and explanatory information on the project.
UMBEL is provided as open source under the Creative Commons 3.0 Attribution-Share Alike license; the complete ontology with all subject concepts, definitions, terms and relationships can be freely downloaded. All subject concepts are Web-accessible as Linked Data URIs.
Five volumes of technical documentation are available. The two key volumes explaining the UMBEL project and process are UMBEL Ontology, Vol. A1: Technical Documentation (also online) and Distilling Subject Concepts from OpenCyc, Vol. B1: Overview and Methodology.
A new overview slideshow is also available.
There are two input files for Cytoscape, the open source program used for certain large-scale UMBEL visualization and analysis:
The two complete references to all current and archived files and access procedures in the UMBEL project are UMBEL Ontology, Vol. A2: Subject Concepts and Named Entities Instantiation and Distilling Subject Concepts from OpenCyc, Vol. B2: Files Documentation. Finally, the fifth documentation volume accompanying the release is Distilling Subject Concepts from OpenCyc, Vol. B3: Appendices, which provides supporting materials and detailed backup.
As discussed on the Web site on UMBEL’s role, the project currently has adopted two pivotal positions with respect to OpenCyc and its use:
For these positions to be effective, we are putting in place mechanisms for UMBEL to collect and forward community comments regarding the suitability of the subject concept structure, and for Cycorp to deliberate on that input and respond as appropriate to maintain the coherence of the knowledge base.
Fortunately, Cycorp has been supremely responsive to date and made changes to the OpenCyc concept structure and its conversion to OWL in support of needs and observations brought forth by the UMBEL project. We anticipate this excellent working relationship to continue.
This version 0.70 release is based on versioning and numbering as presented in the supporting documentation. Releasing with a version number below 1.0 also signals the newness and relative immaturity of the system.
This release is the first one in which the UMBEL subject concepts and ontology will be applied as a real vocabulary in public settings. Some areas are known to be weaker and less complete than others. Some areas, such as the coverage of Internet and the Web topics particular to domain experts, are relatively sparse. Other areas, such as organizing science and academic disciplines, have seen much improvement, but more is necessary. Still additional areas will certainly surface as warranting better subject concept coverage.
Input mechanisms are being put in place for user feedback, and input and discussion are always welcome at the project’s discussion forum and mailing list. We anticipate rapid changes and versioning over the next six months or so, which is also roughly the forecasted horizon for the first production-grade version 1.0.
A number of individuals and organizations have contributed significantly to this release, for which the project offers hearty thanks.
|Zitgist LLC has been the major source of staff time and hosting services to the project. Two of Zitgist’s principals, Mike Bergman and Fred Giasson, have acted as editors on the UMBEL project. Zitgist also has contributed nearly two person-years of effort to the project. Zitgist intends to continue to lead and manage the project with a substantial future commitment of time and effort.|
|OpenLink Software has been the major source of infrastructure, financing and software for the project. OpenLink’s Virtuoso virtual data management system is the hosting software environment for UMBEL and its Web services. Kingsley Idehen, CEO and President of OpenLink, has been a key source of inspiration for the project.|
|Cycorp is the developer of the Cyc knowledge base, with more than 1,000 person-years of effort behind it, from which the OpenCyc open source version is derived. Since the initial selection of OpenCyc for UMBEL, Cycorp staff have devoted many person-months of effort to help explain the underlying system and, most recently, to make improvements and revisions to OpenCyc and its OWL version in response to project input. Larry Lefkowitz, VP of business development, has been a very effective interface with the project.|
|YAGO is a project from Fabian Suchanek, Gjergji Kasneci and Gerhard Weikum of the Max-Planck-Institute for Computer Science, Saarbruecken, Germany. It is based on extracting and organizing entities from Wikipedia according to the WordNet concept structure. YAGO demonstrated the methodology for how to replace the native Wikipedia structure with alternate external structures and provided the starting set of named entities used within UMBEL. Fabian has been especially helpful in data, software and methodology support to the project.|
|The Cyc Foundation and its members have been devoted to Web exposure of OpenCyc and have provided great guidance to the project in learning and navigating the knowledge base. Their concepts browser and other Web services have also been extremely helpful to the project’s initial ideas and testing. Mark Baltzegar and John De Oliveira, the two lead directors of the Cyc Foundation, have been particularly helpful.|
|Moritz Stefaner is one of the innovators and rising stars in large-scale data visualization. Moritz has kindly contributed his cool Flash explorer implementation used in UMBEL’s Subject Concept Explorer and continues to make ongoing improvements to UMBEL’s visualization. Moritz’s Web site and separate blog are each worth perusing for neat graphics and ideas.|
Thanks, all of you! This is a day we have worked long and hard to see come to reality. As Fred puts it, let the fun begin!
I’m pleased to present a timeline of 100 or so of the most significant events and developments in the innovation and management of information and documents from cave paintings (ca. 30,000 BC) to the present. Click on the link to the left or on the screen capture below to go to the actual interactive timeline.
This timeline has fast and slow scroll bands — including bubble popups with more information and pictures for each of the entries offered. (See the bottom of this posting for other usage tips.)
Note the timeline only presents non-electronic innovations and developments from alphabets to writing to printing and information organization and conventions. Because there are so many innovations and they are concentrated in the last 100 years or fewer, digital and electronic communications are somewhat arbitrarily excluded from the listing.
I present below some brief comments on why I created this timeline, some caveats about its contents, and some basic use tips. I conclude with thanks to the kind contributors.
Readers of this AI3 blog or my detailed bio know that information — biological embodied in genes, or cultural embodied in human artefacts — has been my lifelong passion. I enjoy making connections between the biological and cultural with respect to human adaptivity and future prospects and I like to dabble on occasion as an amateur economic or information science historian.
About 18 months ago I came across David Huynh’s nifty Exhibit lightweight data display widget, gave it a glowing review, and then proceeded to convert my growing Sweet Tools listing of semantic Web and related tools to that format. Exhibit still powers the listing (which I just updated yesterday for the twelfth time or so).
At the time of first rolling out Exhibit I also noted that David had earlier created another lightweight timeline display widget that looked similarly cool (and which was also the first API for rendering interactive timelines in Web pages). (In fact, Exhibit and Timeline are but two of the growing roster of excellent lightweight tools from David.) Once I completed adopting Exhibit, I decided to find an appropriate set of chronological or time-series data to play next with Timeline.
I had earlier been ruminating on one of the great intellectual mysteries of human development: Why, roughly beginning in 1820 to 1850 or so, did the historical economic growth patterns of all prior history suddenly take off? I first wrote on this about two years ago in The Biggest Disruption in History: Massively Accelerated Growth Since the Industrial Revolution, with a couple of follow-ups and expansions since then.
I realized that in developing my thesis that wood pulp paper and mechanized printing were the key drivers for this major inflection change in growth (as they affected literacy and the broad-scale access to written information), I already had the beginnings of a listing of various information innovations throughout history. So, a bit more than a year ago, I began adding to that list in terms of how humans learned to write, print, share, organize, collate, reproduce and distribute information and when those innovations occurred.
There are now about 100 items in this listing (I’m still looking for and researching others; please send suggestions at any time). Here are some of the current items in chronological order from upper left to lower right:
|calendars||tree diagram||encyclopedia||pencil (mass produced)|
|cuneiform||quill pen||capitalization||rotary perfection press|
|papyrus (paper)||library catalog||magazines||catalogues|
|hieroglyphs||movable type||taxonomy (binomial classification)||typewriter|
|alphabet||paper (rag)||timeline||chemical pulp (sulfite)|
|Phaistos Disc||word spaces||data graphs||classification (Dewey)|
|scrolls||printing press||punch cards||kraft process (pulp)|
|manuscripts||advertising (poster)||steam-powered (mechanized) papermaking||flexography|
|glossaries||bookbinding||book (machine-paper)||classification (LoC)|
|dictionaries||pagination||chemical symbols||classification (UDC)|
|parchment (paper)||punctuation||mechanical pencil||offset press|
|bibliographies||library catalog (printed)||chromolithography||screenprinting|
|concept of categories||public lending library||paper (wood pulp)||ballpoint pen|
|library||dictionaries (alphabetic)||rotary press||xerographic copier|
|classification system (library)||newspapers||mail-order catalog||hyperlink|
|zero||Information graphics||fountain pen||metadata (MARC)|
So, off and on, I have been working with and updating the data and display of this timeline in draft. (I may someday also post my notes about how to effectively work with the Timeline widget.)
With the listing above, the timeline was complete enough to finally post this version. One of the neat things with Timeline is the ability to drive the display from a simple XML listing. I will update the timeline when I next have an opportunity to fill in some of the missing items still remaining on my innovations list such as alphabetization, citations, and table of contents, among many others.
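For readers curious about what such an XML listing might look like, here is a minimal sketch that generates one. The `data` and `event` element names and the `start` and `title` attributes are my assumption of the event format Timeline reads and should be checked against the widget's documentation, as should the date format it expects. The two entries are illustrative only and are not copied from the data file behind this timeline.

```python
import xml.etree.ElementTree as ET

# Illustrative entries (title, year, description); not the timeline's own data.
ENTRIES = [
    ("classification (Dewey)", 1876, "Dewey Decimal Classification first published"),
    ("kraft process (pulp)", 1879, "Chemical pulping process for papermaking"),
]

def build_event_listing(entries):
    """Build a <data> element holding one <event> per entry (assumed format)."""
    root = ET.Element("data")
    for title, year, description in entries:
        # The attribute names here are assumptions about the widget's schema.
        event = ET.SubElement(root, "event", start=str(year), title=title)
        event.text = description
    return root

xml_text = ET.tostring(build_event_listing(ENTRIES), encoding="unicode")
print(xml_text)
```

Keeping the entries in a plain list like this means new innovations can be added in one place and the XML regenerated, rather than edited by hand.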
Of course, rarely can an innovation be traced to a single individual or a single moment in time. Historians are increasingly documenting the cultural milieu and multiple individuals that affect innovation.
In these regards, then, a timeline such as this one is simplistic and prone to much error and uncertainty. We have no real knowledge, for example, of the precise time certain historical innovations occurred, and others (the ballpoint pen being one case in point) are a matter of interpretation as to what constituted the first expression and when. For instances where the record indicated multiple dates, I chose to use the date of release to the public.
Nonetheless, given the time scales here of more than 30,000 years, I do think broad trends and rough time frames can be discerned. As long as one interprets this timeline as indicative and not meant as definitive in any scholarly sense, I believe this timeline can inform and provide some insight and guidance for how information has evolved over human history.
The operation of Timeline is pretty straightforward and intuitive. Here are a couple of tips to get a bit more out of playing with it:
For the sake of consistency, nearly all entries and pictures on the timeline are drawn from the respective entries within Wikipedia. Subsequent updates may add to this listing by reference to original sources, at which time all sources will be documented.
The fantastic Timeline was developed by David Huynh while he was a graduate student at MIT. Timeline and its sibling widgets were developed under funding from MIT’s Simile program. Thanks to all in the program and best wishes for continued funding and innovation.
Finally, my sincere thanks go to Professor Michael Buckland of the School of Information at the University of California, Berkeley, for his kind suggestions, input and provision of additional references and sources. Of course, any errors or omissions are mine alone. I also thank Professor Buckland for his admonitions about use and interpretation of the timeline dates.
AI3's listing of semantic Web and related tools has now crossed the barrier to 702 tools in total. There are 10 new tools since the last posting on this listing, with a few older ones retired.
A parallel listing is maintained by the Semantic Web Company with a very attractive presentation. They have also been aiding greatly in the general maintenance of the list.
Background on prior listings and earlier statistics may be found on these previous posts:
With interim updates periodically over that period.
Please use ‘Comments’ on this post for suggestions or additions to the listing.