Posted: October 15, 2008


SWEETpedia Listing of 163 Research Articles; NZ Technical Report Affirms Trend

An earlier popular entry of this AI3 blog was “99 Wikipedia Sources Aiding the Semantic Web”. Each academic paper or research article in that compilation used Wikipedia as a basis for semantic Web-related research. Many of you suggested additions to that listing. Thanks!

Wikipedia continues to be an effective and unique source for many information extraction and semantic Web purposes. Recently, I needed to update my own research and found that many valuable new papers have been added to the literature.

I thus decided to make a compilation of such papers a permanent feature — which I’ve named SWEETpedia — and to update it on a periodic basis. You can now find the most recent version under the permanent SWEETpedia page link.

Hint, hint: Check out this link to see the 163 Wikipedia research sources!

NOTE: If you know of a paper that I’ve overlooked, please suggest it as a comment to this posting and I will add it to the next update.

Status of Wikipedia

Meanwhile, a complementary technical report, Mining Meaning from Wikipedia [1], was just released from the University of Waikato in New Zealand. It is a fantastic resource for anyone in this field.

For starters, it summarizes the size and status of the English-version Wikipedia with a more discerning eye than usual:

Categories                              390,000
Articles and related pages            5,460,000
    redirects                         2,970,000
    disambiguation pages                110,000
    lists and stubs                     620,000
    bona-fide articles                1,760,000
Templates                               174,000
    infoboxes                             9,000
    other                               165,000
Links
    between articles                 62,000,000
    between category and subcategory    740,000
    between category and article      7,270,000

The size, scope and structure of Wikipedia make it an unprecedented resource for researchers engaged in natural language processing (NLP), information extraction (IE) and semantic Web-related tasks. Further, the more than 250 language versions of Wikipedia also make it a great resource for multi-lingual and translation studies.

Growth of SWEETpedia

In the eight months since posting that earlier compilation of semantic Web-related research papers using Wikipedia, the SWEETpedia listing has grown by about 65%. There are now 63 new papers, bringing the total to 163.

Of course, these are not the only academic papers published about or using Wikipedia. The SWEETpedia listing is specifically related to structure, term, or semantic extractions from Wikipedia. Other research about frequency of updates or collaboration or growth or comparisons with standard encyclopedias may also be found under Wikipedia’s own listing of academic studies.

Wikipedia Research Papers by Year

This graph indicates the growth in use of Wikipedia as a source for semantic Web research. It is hard to tell whether the effort is plateauing; the apparent slight dip in 2008 is too recent a signal from which to conclude that.

For example, the current SWEETpedia listing adds another 35% more listings for 2007 over the earlier compilation. It is likely that many 2008 papers will similarly only be discovered during 2009. Many of the venues at which these papers get presented can be somewhat obscure, and new researchers keep entering the field.

However, we can conclude that Wikipedia is assuming a role in semantic Web and natural language research that no other framework has previously played.

Kinds of Semantic Web-related Research

As noted, the new 82-page technical report by Olena Medelyan et al. from the University of Waikato in New Zealand, Mining Meaning from Wikipedia [1], is now the must-have reference for all things related to the use of Wikipedia for semantic Web and natural language research.

Olena and her co-authors, Catherine Legg, David Milne and Ian Witten, have each published much in this field and were some of the earliest researchers tapping into the wealth of Wikipedia.

They first note the many uses to which Wikipedia is now being put:

  • Wikipedia as an encyclopedia — the standard use familiar to the general public
  • Wikipedia as corpus — large text collections for testing and modeling NLP tasks
  • Wikipedia as a thesaurus — equivalent and hierarchical relationships between terms and related or synonymous terms
  • Wikipedia as a database — the extraction and codification of structure and structural relationships
  • Wikipedia as an ontology — the formal expression of relationships in semantic Web and logical constructs, and
  • Wikipedia as a network structure — relationship analysis and mining through Wikipedia’s representation as a network graph.

These types of use then enable the authors to place various research efforts and papers into context. They do so via four major clusters of relevant tasks related to language processing and the semantic Web:

Natural Language Processing (NLP) Tasks:

  • Semantic relatedness
  • Word sense disambiguation — words and phrases; named entities; thesaurus and ontology terms
  • Co-reference resolution
  • Multilingual alignment

Information Retrieval Tasks:

  • Query expansion
  • Multilingual retrieval
  • Question answering
  • Entity ranking
  • Text categorization
  • Topic indexing

Information Extraction (IE) Tasks:

  • Semantic relations in raw (unstructured) text
  • Semantic relations in structure
  • Typing (classifying) named entities

Ontology Building Tasks:

  • Knowledge organization
  • Named entities
  • Thesaurus information
  • Ontology alignment
  • Fact extraction and assertion

There are many interesting observations throughout this report. There are also useful links to related tools, supporting and annotated datasets, and key researchers in the field.

I highly recommend this report as the essential starting point for anyone first getting into these research topics. Many of the newly added references in the SWEETpedia listing arose from this report. Reading the report also provides useful grounding for knowing where to look for specific papers in a given task area.

Though clearly the authors have their own perspectives and research emphases, they do an admirable job of being complete and even-handed in their coverage. Basic review reports such as this play an important role in helping to focus new research and make it productive.

Excellent job, folks! And, thanks!


[1] Olena Medelyan, Catherine Legg, David Milne and Ian H. Witten, 2008. Mining Meaning from Wikipedia, Working Paper Series ISSN 1177-777X, Department of Computer Science, The University of Waikato (New Zealand), September 2008, 82 pp. See http://arxiv.org/ftp/arxiv/papers/0809/0809.4530.pdf.

Posted: October 5, 2008

LOD Cloud Diagram

Class-level Mappings Now Generalize Semantic Web Connectivity

We are pleased to present a complementary view to the now-famous linking open data (LOD) cloud diagram (shown to the left; click on it for a full-sized view) [1]. This new diagram (shown below) — what we call the LOD constellation to distinguish it from its notable LOD cloud sibling — presents the current class-level structure within LOD datasets.

This new LOD constellation complements the instance-level view in the LOD cloud that has been the dominant perspective to date. The LOD cloud is hubbed centrally on DBpedia, the linked data structured representation of Wikipedia. The connections shown in the cloud diagram mostly reflect owl:sameAs relations, which means that the same individual things or instances are referenced and then linked between the datasets. Across all datasets, linking open data (LOD) now comprises some two billion RDF triples, interlinked by around 3 million RDF links [2]. This instance-level view of the LOD cloud, shown to the left, was updated a bit over a week ago [1].

The objective of the Linking Open Data community is to extend the Web with a data commons by publishing various open datasets as RDF on the Web and by setting RDF links between data items from different data sources. All of the sources on these LOD diagrams are open data [3].

So, Tell me Again Why Classes Are Important?

In prior postings, Fred Giasson and I have explained the phenomenon of ‘exploding the domain’. Exploding the domain means making class-to-class mappings between a reference ontology and external ontologies, which allows properties, domains and ranges of applicability to be inherited under appropriate circumstances [4]. Exploding the domain extends inferencing power to this newly mapped information. Importantly, too, exploding the domain also means that instances or individuals that are members of these mapped classes inherit or assume the structural relations (schema, if you will) of their mapped sources as well.

Trying to think through the statements above, however, is guaranteed to make your head hurt. When just reading the words, these sound like fairly esoteric or abstract ideas.

So, to draw out the distinctions, let’s discuss linked data that is based on instance (individual) mappings versus those mapped on the basis of classes. Like all things, there are exceptions and edge cases, but let us simply premise our example using basic set theory. Our individual instances are truly discrete individuals, in this case some famous dogs, and our classes are the conceptual sets by which these individuals might be characterized.

To make our example simple, we will use two datasets (A and B) about dogs and their possible relations, each characterized by their structure (classes) or their individuals (instances):

                               Dataset A (organisms)     Dataset B (pets)
Classes (structure)            mammal                    pet
                               canid                     dog
                               wolf                      breed (list)
                               dog
Instances (individuals)        Rin Tin Tin (dog)         Rin Tin Tin (German shepherd)
and class assignments                                    Lassie (collie)
                                                         Clifford (Vizsla)
                                                         Old Yeller (mutt)

When datasets are linked based on instance mappings alone, as is generally the case with current practice using sameAs, and there are no class mappings, we can say that Rin Tin Tin is a dog, a pet and a mammal. However, we cannot say that Lassie, for example, is a mammal, because there is no record for Lassie in Dataset A.

So, we thus see our first lesson: drawing an inference about instances using sameAs in the absence of class mappings requires each record (instance) to exist in an external dataset in order to make that assertion. Instances can inherit the properties and structure of the datasets in which they specifically occur, but only there. Thus, what can be said about a given individual (linked via owl:sameAs) is at most the intersection of what is contained in only those datasets in which that individual appears and is mapped. Assertions are thus totally specific and cannot be made without the presence of a matching instance record. We can call this scenario the intersection model: only where there is an intersection of matching instance records can the structure of their source datasets be inferred.

However, when mappings can be made at the class level, then inferences can be drawn about all of the members of those sets. By asserting equivalentClass for dog between Datasets A and B, we can now infer that Lassie, Clifford and Old Yeller are canids and mammals as well as Rin Tin Tin, even though their instance records are not part of Dataset A. To complete the closure we can also now infer that Rin Tin Tin (Dataset A) is a pet and a German shepherd from Dataset B. We can call this scenario the union model. The mappings have become generalized and our inferencing power now extends to all instances of mapped classes whether there are records or not for them in other datasets.
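
These two scenarios are easy to verify mechanically. Below is a minimal sketch in Python using the rdflib library; the namespaces, URIs and the small closure routine are hypothetical constructs for illustration, not the actual LOD datasets:

from rdflib import Graph, Namespace, RDF, RDFS, OWL

A = Namespace("http://example.org/organisms/")  # hypothetical Dataset A
B = Namespace("http://example.org/pets/")       # hypothetical Dataset B

g = Graph()
# Dataset A class structure: dog -> canid -> mammal
g.add((A.dog, RDFS.subClassOf, A.canid))
g.add((A.canid, RDFS.subClassOf, A.mammal))
# Dataset B instance records; note Lassie has no record in Dataset A
g.add((B.RinTinTin, RDF.type, B.dog))
g.add((B.Lassie, RDF.type, B.dog))
# The single class-level mapping that generalizes the two datasets
g.add((B.dog, OWL.equivalentClass, A.dog))

def ancestors(cls):
    # All classes reachable via subClassOf or equivalentClass (either direction)
    seen, frontier = set(), {cls}
    while frontier:
        c = frontier.pop()
        if c in seen:
            continue
        seen.add(c)
        frontier |= set(g.objects(c, RDFS.subClassOf))
        frontier |= set(g.objects(c, OWL.equivalentClass))
        frontier |= set(g.subjects(OWL.equivalentClass, c))
    return seen

# Every member of B:dog is now inferable as a canid and a mammal, even
# though Lassie never appears in Dataset A -- the union model at work
for dog in g.subjects(RDF.type, B.dog):
    assert A.mammal in ancestors(B.dog)
    print(dog, "is a mammal")

Under the intersection model, by contrast, deleting the equivalentClass triple leaves nothing in Dataset A that can be asserted about Lassie.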

This power of generalizability, plus the inheritance of structure, properties and domain and range attributes, is why class mappings are truly essential for the semantic Web. Exploding the domain is real and powerful. Thus, to truly understand the power of linked data, it is necessary to view its entirety from a class perspective [5].

Thus, to summarize our answer to the rhetorical question, class mappings are important because they can:

  • Generalize the understanding of individual instances
  • Expand the description of things in the world by inheriting and reusing other class properties, domains and ranges, and
  • Place and contextualize things by inheriting class structure and hierarchical relationships.

The LOD Constellation

So, here is the new LOD constellation of class-level linkages. The definition of class-level linkages is based on one of four possible predicates (rdfs:subClassOf, owl:equivalentClass, umbel:superClassOf or umbel:isAligned). Because of the newness of UMBEL as a vocabulary, only a few of the sources linked to UMBEL have the umbel:superClassOf relationship and one (bibo) has isAligned.

Note that some of the sources are combined vocabularies (ontologies) and instance representations (e.g., UMBEL, GeoNames), others are strict ontologies (e.g., event, bibo), and still others are ontologies used to characterize distributed instances (e.g., foaf, sioc, doap). Other distinctions might be applied as well:

[click for full size]

The current 21 LOD datasets and ontologies that contribute to these class-level mappings are (with each introduced by its namespace designation):

  • bibo — Bibliographic ontology
  • cc — Creative Commons ontology
  • damltime — Time Zone ontology
  • doap — Description of a Project ontology
  • event — Event ontology
  • foaf — Friend-of-a-Friend ontology
  • frbr — Functional Requirements for Bibliographic Records
  • geo — Geo wgs84 ontology
  • geonames — GeoNames ontology
  • mo — Music Ontology
  • opencyc — OpenCyc knowledge base
  • owl — Web Ontology Language
  • pim_contact — PIM (personal information management) Contacts ontology
  • po — Programmes Ontology (BBC)
  • rss — Really Simple Syndication (1.0) ontology
  • sioc — Semantically-Interlinked Online Communities ontology
  • sioc_types — SIOC extension
  • skos — Simple Knowledge Organization System
  • umbel — Upper Mapping and Binding Exchange Layer ontology
  • wordnet — WordNet lexical ontology
  • yandex_foaf — FOAF (Friend-of-a-Friend) Yandex extension ontology

The diagram was programmatically generated using Cytoscape (see below) [6], with some minor adjustments in bubble position to improve layout separation. The bubble sizes reflect the number of linked structures (ontologies) to which the node has class linkages. The arrow thicknesses reflect the number of linking predicates between the nodes. Two-way arrows are shown as darker and indicate equivalentClass or matching superClassOf and subClassOf; single arrows represent subClassOf relationships only.

Note we are not presenting any rdf:type relations because those are not structural, but rather deal with the assignment of instances to classes [7]. More background is provided in the discussion of the construction methodology [6].

At this time, we have not calculated how many individuals or instances might be directly included in these class-level mappings. The data and files used in constructing this diagram are available for download without restriction [8].

Finally, we have expended significant effort to discover class-level mappings of which we may not have been directly aware (see next). Please bring any missing, erroneous or additional linkages to our attention. We will be pleased to incorporate those updates into future releases of the diagram.

How the LOD Constellation Was Constructed

Our diligence has not been exhaustive since not all LOD datasets are indexed locally and others do not have SPARQL endpoints. The general method was to query the datasets to check which ontologies used external classes to instantiate their individuals using the rdf:type predicate. The externally referenced ontology was then checked to determine its own external class mappings.

Here is the basic SPARQL query to discover the rdf:type predicate for non-Virtuoso resources:

select ?o where
{
?s a ?o.
}

And here is the SPARQL query for Virtuoso-hosted datasets (the distinct modifier is a more efficient way to get a deduplicated listing):

select distinct ?o where
{
?s a ?o.
}

We then created a simple script to visit all of the ontology namespaces so listed in these external mappings. If there was an external class mapping in the source with one of the four possible predicates of rdfs:subClassOf, owl:equivalentClass, umbel:superClassOf or umbel:isAligned, we noted the source and predicate and wrote it to a simple CSV (comma-delimited) file. This formed the input file to the Cytoscape program that created the network graph [6].
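
For the curious, the harvesting step can be approximated with a short script. Here is a hedged sketch in Python using the SPARQLWrapper library; the endpoint URL, function name and sample CSV row are illustrative assumptions, not the exact script we ran:

import csv
from SPARQLWrapper import SPARQLWrapper, JSON

def external_classes(endpoint_url):
    # The distinct classes used as rdf:type targets at a SPARQL endpoint
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setQuery("select distinct ?o where { ?s a ?o . }")
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [row["o"]["value"] for row in results["results"]["bindings"]]

# List the externally referenced ontologies at one endpoint; each such
# namespace is then inspected for the four class-level predicates
for cls in external_classes("http://dbpedia.org/sparql"):
    print(cls)

# Any hits are recorded as subject-predicate-object rows (plus a count of
# predicate types), the edge-list form that Cytoscape imports directly
with open("class_level_lod_constellation.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["umbel", "owl:equivalentClass", "foaf", 1])  # hypothetical row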

There are possibly oversights and omissions in this first-release diagram since not all bubbles in the LOD cloud were exhaustively inspected. Please notify us with updates or new class linkages. Alternatively, you can also download and modify the diagram yourself [8].

Conspicuous by Their Absence

We gave particular diligence to a few of the more dominant sources in the LOD instance cloud that showed no class mappings. These include DBpedia and YAGO. While these have numerous and useful rdf:type and owl:sameAs relationships, and all have rich internal class structures, none apparently map at the class level to external sources.

However, UMBEL does have extensive external class mappings, and its named entities uniquely overlap with and have been mapped to DBpedia instances. It is therefore possible to infer some of these external class linkages.

For example, go to the DBpedia SPARQL endpoint:

http://dbpedia.org/sparql/

And try out some sample queries by pasting the following into the Query text box and running the query:

define input:inference 'http://dbpedia.org/resource/inference/rules/umbel#'
prefix umbel: <http://umbel.org/umbel/sc/>
select ?s
where
{
?s a umbel:Person
}

This example query applies the external class structure of UMBEL to individual person instances in DBpedia because of the prior specification of some mapping rules used for inferencing [9]. The result set is limited to 1000 results.

Alternatively, since UMBEL has also been mapped to the external FOAF ontology, we can also now invoke the FOAF class structure directly to produce the exact same result set (since umbel:Person owl:equivalentClass foaf:Person). We do this by applying the same inferencing rules in a different way:

define input:inference 'http://dbpedia.org/resource/inference/rules/umbel#'
select ?s
where
{
?s a <http://xmlns.com/foaf/0.1/Person>.
}

UMBEL can thus effectively act as a class structure bridge to the DBpedia instances.

Since DBpedia is an instance hub, this bridging effect is quite effective between UMBEL and other DBpedia instances in the LOD cloud. However, because there is not the same degree of overlap of instances with, say, GeoNames, this technique would be less effective there.

Explicit class-level mappings between datasets will always be more powerful than instance-level ones with class mediators. And, in all cases, both of those techniques that explicitly invoke classes are more powerful than instance-level links alone.

The Linked Data Infrastructure is Now Complete

Though not all of the available linkages have yet been made in the LOD datasets, we can now see that all essential pieces of the linkage infrastructure are in place and ready to be exploited. Of course, new datasets can take advantage of this infrastructure as well.

UMBEL is one of the essential pieces that provides the bridging “glue” to these two perspectives or “worlds” of the instances in the LOD cloud and the classes in the LOD constellation. This “glue” becomes possible because of UMBEL’s unique combination of three components or roles:

  • UMBEL provides a rich set of 20,000 subject concept classes and their relations (a reference structure “backbone”) that facilitates class-level mappings with virtually any external ontology, with the benefits described above
  • UMBEL contains a named entity dictionary drawn from Wikipedia that is also mapped to these classes, which therefore strongly intersects with DBpedia and YAGO and helps provide the individual instances <-> classes bridging “glue”, and
  • UMBEL is also a vocabulary that enhances the lightweight SKOS vocabulary to explicitly facilitate linkages to external ontologies at the subject concept layer.

In fact, it is the latter vocabulary sense, in combination with the reference subject concepts, that enables us to draw the LOD class constellation.

So, we can now see a merger of the LOD cloud and the LOD constellation producing all of the needed parts of the LOD infrastructure going forward:

  • A hub of instances (DBpedia)
  • A hub of subject-oriented (“is about”) reference classes (Cyc-UMBEL), and
  • A vocabulary for gluing it all together (SKOS-UMBEL).

This infrastructure is ready today and available to be exploited for those who can grasp its game-changing significance.

And, from UMBEL’s standpoint, we can also point to the direct tie-ins to the Cyc knowledge base and structure for conceptual and relationship coherence testing. This infrastructure is an important enabler to extend these powerful frameworks to new domains and for new purposes. But that is a story for another day. ;)


[1] The current version of the LOD cloud may be found at the World Wide Web Consortium’s (W3C) SWEO wiki page. There is also a clickable version of the diagram that will take you to the home references for the constituent data sources in this diagram; see http://www4.wiwiss.fu-berlin.de/bizer/pub/lod-datasets_2008-09-18.html.
[2] According to [1], this estimate was last updated one year ago in October 2007. The numbers today are surely much larger, since the number of datasets has also nearly doubled in the interim.
[3] Open data has many definitions, but a common one with a badge is often seen. However, the best practices of linked data can also be applied to proprietary or intranet information as well; see this FAQ.
[4] For further information about exploding the domain, see these postings: F. Giasson, Exploding the Domain: UMBEL Web Services by Zitgist (April 20, 2008), UMBEL as a Coherent Framework to Support Ontology Development (August 27, 2008), Exploding DBpedia’s Domain using UMBEL (September 4, 2008); M. Bergman, ‘ Exploding the Domain’ in Context (September 24, 2008).
[5] Of course, instance-level mappings with sameAs have value as well, as the usefulness of linked data to date demonstrates. The point is not that class-level mappings are the “right” way to construct linked data. Both instance mappings and class mappings complement one another, in combination bringing both specificity and generalizability. The polemic here is merely to redress today’s common oversight of the class-level perspective.
[6] See the text for how the listing of class relationships was first assembled. After removal of duplicates, a simple comma delimited file was produced, class_level_lod_constellation.csv, with four columns. The first three columns show the subject-predicate-object linking the datasets by the class-level predicate. The fourth column presents the count of the number of types of class-level predicates used between the dataset pairs; the maximum value is 4.

This CSV file is the import basis for Cytoscape. After import, the spring algorithm was applied to set the initial distribution of nodes (bubbles) in the diagram. Styles were developed to approximate the style of the LOD cloud diagram, and each of the class-linkage predicates was assigned different views and arrows. (The scaling of arrow width allowed the chart to be cleaned up, with repeat linkages removed and simplified to a single arrow, with the strongest link type used as the remaining assignment. For example, equivalentClass is favored over subClassOf, which is favored over superClassOf.)
In addition, each node was scaled according to the number of external dataset connections, with the assignments as shown in the file. Prior to finalizing, the node bubbles were adjusted for clear spacing and presentation. The resulting output is provided as the Cytoscape file, lod_class_constellation.cys. For the published diagram, the diagram was also exported as an SVG file.
This SVG file was lastly imported into Inkscape for final clean up. The method for constructing the final diagram, including discussion about how the shading effect was added, is available upon request.
[7] rdf:type is a predicate for assigning an instance to a class, which is often an external class. It is an important source of linkages between datasets at the instance level and provides a structural understanding of the instance within its dataset. While this adds structural richness for instances, rdf:type is by definition not class-level and provides no generalizability.
[8] The CSV file and the Cytoscape files may be downloaded from http://www.umbel.org/lod_constellation.html.
[9] See the explanation of the external linkage file in M. Bergman, ‘ Exploding the Domain’ in Context (September 24, 2008).

Posted: August 20, 2008

In a recent posting on the Ontolog forum, Toby Considine discussed the difficulty of describing to several business CEOs the concept of an ontology. He noted that when one of the CEOs finally got it, he explained it thus to the others:

“Ontology is always a value proposition, how a company makes money. Each company, and perhaps each sales professional must be able to define his own ontology and explain it to his customers. We need semantic alignment to create a common basis for discussing value. If it is a good semantic set, then the ontologies that each sales director creates will be better; better to produce sales differentiation, and better to produce true long-term value for the company.
“A general purpose ontology gives us a framework to develop and discuss our own value propositions. But those value propositions, and their underlying ontologies must remain proprietary, or else every company is just building to the lowest common denominator, and innovation and value creation end.”

BTW, Toby is chair of the OASIS Open Building Information Exchange (oBIX) Technical Committee (see http://www.oasis-open.org), and is used to conversing about standards and technical matters with business audiences.

This discussion came up in relation to the use of the Cyc knowledge base and the possible role of “lightweight” or “foundational” reference ontologies.

There are a number of interesting points embedded and implied in this discussion; at the risk of reading too much into them, they include:

  • Foundational, reference ontologies have an important role, but as frameworks and for external interoperability
  • Each enterprise has its own world view, which can be expressed as an ontology and represents its “value proposition”; in this regard, internal ontologies work similarly to current legacy schema
  • Semantic “alignment” (and therefore interoperability) is important for discussing value
  • For a business enterprise, the real focus of its ontologies is to express its value proposition, how it makes money.

I think these sentiments are just about right, with the last point especially profound.

We have supported UMBEL as an important reference structure, and see the role for ever more specific ones. But, at the other end of the spectrum, ontologies are also specific world views, and can and should be private for proprietary enterprises. Yet this is not in any way in conflict with interoperation — across increasingly widening circles — using shared structures (ontologies).

The balance and integration of the private and public in semantic Web ontologies is still being worked out. But, I truly believe it is appropriate and necessary that both the public and the private be embraced.

Toby’s CEO got it almost right: innovation depends on reserving some proprietary aspects. But the complete story, I also think, is that embracing ontologies themselves and interoperable linked data frameworks in that context is also a key source of innovation and added value.

Posted: July 25, 2008

'Dust Motes Dancing in Sunlight, Interior from the Artist's Home, Strandgade 30,' Vilhelm Hammershøi, 1900; courtesy of http://www.thecityreview.com/copen.html

Structure Demands Context; But, is that Enough?

Last week marked a red-letter day in my professional life with release of the UMBEL subject concept structure. UMBEL began as a gleam in the eye more than a year ago when I observed that semantic Web techniques, while powerful — especially with regard to the RDF data model as a universal and simple (at its basics) means for representing any information and its structure — still lacked something. It took me a while to recognize that the first something was context.

Now, I have written and talked much about context before on this blog, with The Semantics of Context being the most salient article for the present discussion.

This is my mental image of Web content without context: unconnected dust motes floating through a sunlit space, moving slowly, randomly, and without connections, sort of like Brownian motion. Think of the sunlight on dust shown by the picture to the left.

By providing context, my vision was that we could freeze these moving dust motes and place them into a fixed structure, perhaps something like constellations in the summer sky. Or, at least, make them more stable, floating less aimlessly and unconnected.

So, my natural response was to look for structural frameworks to provide that context. And that was the quest I set forward at UMBEL’s initiation.

At the time of UMBEL’s genesis, the impact of Wikipedia and other sources of user-generated content (UGC) such as del.icio.us or Flickr or many, many others was becoming clear. The usefulness of tags, folksonomies, microformats and other forms of “bottom-up” structure was proven.

The evident — and to me, exciting — aspect of globally-provided UGC was that this was the ultimate democratic voice: the World has spoken, and the article about this or the tag about that had been vetted in the most interactive, exposed, participatory and open framework possible. Moreover, as the World changed and grew, these new realizations would also be fed back into the system in a self-correcting goodness. Final dot.

Through participation and collective wisdom, therefore, we could gain consensus and acceptance and avoid the fragility and arbitrariness of “wise man” or imposed from the “top-down” answers. The people have spoken. All voices have been heard. The give and take of competing views have found their natural resting point. Again, I thought, final dot.

Thus, when I first announced UMBEL, my stated desire (and hope) was that something like Wikipedia could or would provide that structural context. Here is a quote from the announcement of UMBEL, nearly one year ago to this day:

The selection of the actual subject proxies within the UMBEL core are to be based on consensus use. The subjects of existing and popular Web subject portals such as Wikipedia and the Open Directory Project (among others) will be intersected with other widely accepted subject reference systems such as WordNet and library classification systems (among others) in order to derive the candidate pool of UMBEL subject proxies.

Yet, that is not the basis of the structure announced last week for UMBEL. Why?

The Strengths of User-Generated Content

Before we probe the negative, let’s rejoice in the positive.

User-generated content (UGC) works; it has rapidly proven itself in venues from authoritative subjects (Wikipedia) to photos (Flickr), bookmarking and tagging (del.icio.us), blogs, video (YouTube) and every Web space imaginable. This is new, was not foreseen by most a few years ago, and has totally remade our perception of content and how it can be generated. Wow!

The nature of this user-generated content, of course, as is true for the Web itself, is that it has arisen from a million voices without coercion, coordination or a plan, spontaneously in relation to chosen platforms and portals. Yet, still, today, as to what makes one venue more successful than others, we are mostly clueless. My suspicion is that — akin to financial markets — when Web portals or properties are successful, they readily lend themselves to retrospective books and learned analysis explaining that success. But, just try to put down that “recipe” in advance, and you will most likely fail.

So, prognostication is risky business around these parts.

There is a reason why both the head and sub-head of this article are stated as questions: I don’t know. For the reasons stated above, I would still prefer to see user-generated structure (UGS) emerge in the same way that topic- and entity-specific content has on Wikipedia. However, what I can say is this: for the present, this structure has not yet emerged in a coherent way.

Might it? Actually, I hope so. But, I also think it will not arise from systems or environments exactly like Wikipedia and, if it does arise, it will take considerable time. I truly hope such new environments emerge, because user-mediated structure will also have legitimacy and wisdom that no “expert” approach may ever achieve.

But these are what if‘s, and nice to have‘s and wouldn’t it be nice‘s. For my purposes, and the clients my company serves, what is needed must be pragmatic and doable today — all with acceptable risk, time to delivery and cost.

So, I think it safe to say that UGC works well today at the atomic level of the individual topic or data object, what might be called the nodes in global content, but not in the connections between those nodes, its structure. And, the key to the answer of why user-generated structure (UGS) has not emerged in a bottom-up way resides in that pivotal word above: coherence.

Coherence was the second something to accompany context among the missing pieces for the semantic Web.

Coherence in Context

What is it to be coherent? The tenth edition of Merriam-Webster’s Collegiate Dictionary (and the online version) defines it as:

coherent \kō-ˈhir-ənt\ adj.; Middle French or Latin; Middle French cohérent, from Latin cohaerent-, cohaerens, present participle of cohaerēre. Date: ca. 1555.
1: a: logically or aesthetically ordered or integrated : consistent <coherent style> <a coherent argument> b: having clarity or intelligibility : understandable <a coherent person> <a coherent passage>
2: having the quality of cohering; especially : cohesive, coordinated <a coherent plan for action>
3: a: relating to or composed of waves having a constant difference in phase <coherent light> b: producing coherent light <a coherent source>.

Another online source I like for visualization purposes is Visuwords, which displays the accompanying graph relationships view based on WordNet.

Of course, coherent is just the adjectival property of having coherence. Again, the Merriam Webster dictionary defines coherence as 1: the quality or state of cohering: as a: systematic or logical connection or consistency b: integration of diverse elements, relationships, or values.

Decomposing even further, we can see that coherence is itself the state of the verb, cohere. Cohere, as in its variants above, has as its etymology a derivation from the Latin cohaerēre, from co- + haerēre to stick, namely “to stick with”. Again, the Merriam Webster dictionary defines cohere as 1: a: to hold together firmly as parts of the same mass; broadly: stick, adhere b: to display cohesion of plant parts 2: to hold together as a mass of parts that cohere 3: a: to become united in principles, relationships, or interests b: to be logically or aesthetically consistent.

These definitions capture the essence of coherence in that it is a state of logical, consistent connections, a logical framework for integrating diverse elements in an intelligent way. In the sense of a content graph, this means that the right connections (edges or predicates) have been drawn between the object nodes (or content) in the graph.

Bottom-up UGC: The Hip Bone is Connected to the Arm Bone

Structure without coherence is where connections are being drawn between object nodes, but those connections are incomplete or wrong (or, at least, inconsistent or unintelligible). The nature of the content graph lacks logic. The hip bone is not connected to the thigh bone, but perhaps to something wrong or silly, like the arm or cheek bone.

Ambiguity is one source for such error, as when, for example, the object “bank” is unclear as to whether it is a financial institution, billiard shot, or edge of a river. If we understand the object to be the wrong thing, then connections can get drawn that are in obvious error. This is why disambiguation is such a big deal in semantic systems.

However, ambiguity tends not to be a major source of error in user-generated content (UGC) systems because the humans making the connections can see the context and resolve the meanings. Context is thus a very important basis for resolving disambiguities.

A second source of possible incoherence is the organizational structure or schema of the actual concept relationships. This is the source that poses the most difficulty to UGC systems such as folksonomies or Wikipedia.

Remember in the definitions above that logic, consistency and intelligibility were some of the key criteria for a coherent system. Bottom-up UGS (user-generated structure) is prone to not meet the test in all three areas.

“In the context of an information organization framework, a structure is a cohesive whole or ‘container’ that establishes qualified, meaningful relationships among those activities, events, objects, concepts which, taken together, comprise the ‘bounded space’ of the universe of interest.” – J.T. Tennis and E.K. Jacob [1]

Logic and consistency almost by definition imply the application of a uniform perspective, a single world view. Multiple authors and contributors working without a common frame of reference or viewpoint are unable to bring this consistency of perspective. For example, how time is treated with regard to famous people’s birth dates in Wikipedia is very different from its discussion of time with respect to topics on geological eras, and Wikipedia contains no mechanisms for relating those time dimensions or making them consistent.

Logic and intelligibility suggest that the structure should be testable and internally consistent. Is the hip bone connected with the arm bone? No? and why not? In UGC systems, individual connections are made by consensus and at the object-to-object level. There are no mechanisms, at least in present systems, for resolving inconsistencies as these individual connections get aggregated. We can assign dogs as mammals and dogs as pets, but does that mean that all pets are mammals? The connections can get complicated fast and such higher-order relationships remain unstated or more often than not wrong.

Note as well that in UGC systems items may be connected (“assigned”) to categories, but their “factual” relation is not being asserted. Again, without a consistency of how relations are treated and the ability to test assertions, the structures may not only be wrong in their topology, but totally lack any inference power. Is the hip bone connected with the cheek bone? UGC structures lack such fundamental logic underpinnings to test that, or any other, assertion.
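
To see how small the missing piece is, consider a tiny, purely illustrative sketch in Python. The category assignments below are invented; the point is that a formal structure supports exactly this kind of subsumption test, while UGC category assignments do not:

category_members = {
    "mammal": {"dog", "cat", "human"},
    "pet": {"dog", "cat", "goldfish"},
}

def subsumes(sup, sub):
    # Does "every member of sub is also a member of sup" hold in this data?
    return category_members[sub] <= category_members[sup]

# Dogs are assigned to both categories, yet it does not follow that all
# pets are mammals; the goldfish blocks the inference
print(subsumes("mammal", "pet"))  # False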

From the first days of the Web, notably Yahoo! in its beginnings but many other portals as well, we have seen many taxonomies and organizational structures emerge. As simple heuristic devices for clustering large amounts of content, this is fine (though certainly there, too, there are some structures that are better at organizing along “natural” lines than others). Wikipedia itself, in its own structure, has useful organizational clustering.

But once a system is proposed, such as UMBEL, with the purpose of providing broad referenceability to virtually any Web content, the threshold condition changes. It is no longer sufficient to merely organize. The structure must now be more fully graphed, with intelligent, testable, consistent and defensible relations.

Full Circle to Cyc and UGC

Once the seemingly innocent objective of being a lightweight subject reference structure was established for UMBEL, the die was cast. Only a coherent structure would work, since anything else would be fragile and rapidly break in the attempt to connect disparate content. Relating content coherently itself demands a coherent framework.

As noted in the lead-in, this was not a starting premise. But, it became an unavoidable requirement once the UMBEL effort began in earnest.

I have spoken elsewhere about other potential candidates as possibly providing the coherent underlying structure demanded by UMBEL. We have also discussed why Cyc, while by no means perfect, was chosen as the best starting framework for contributing this coherent structure.

I anticipate we will see many alternative structures proposed to UMBEL based on other frameworks and premises. This is, of course, natural and the nature of competition and different needs and world views.

However, it will be most interesting to see if either ad hoc structures or those derived from bottom-up UGC systems like Wikipedia can be robust and coherent enough to support data interoperability at Web scale.

I strongly suspect not.


[1] Joseph T. Tennis and Elin K. Jacob, 2008. “Toward a Theory of Structure in Information Organization Frameworks,” upcoming presentation at the 10th International Conference of the International Society for Knowledge Organization (ISKO 10), in Montréal, Canada, August 5th-8th, 2008. See http://www.ebsi.umontreal.ca/isko2008/documents/abstracts/tennis.pdf.
Posted: July 6, 2008

Breakthroughs in the Basis, Nature and Organization of Information Across Human History

I’m pleased to present a timeline of 100 or so of the most significant events and developments in the innovation and management of information and documents, from cave paintings (ca. 30,000 BC) to the present. Click on the link to the left or on the screen capture below to go to the actual interactive timeline.

This timeline has fast and slow scroll bands — including bubble popups with more information and pictures for each of the entries offered. (See the bottom of this posting for other usage tips.)

Note the timeline only presents non-electronic innovations and developments, from alphabets to writing to printing and information organization and conventions. Because there are so many innovations and they are concentrated in the last 100 years, digital and electronic communications are somewhat arbitrarily excluded from the listing.

I present below some brief comments on why I created this timeline, some caveats about its contents, and some basic use tips. I conclude with thanks to the kind contributors.

Why This Timeline?

Readers of this AI3 blog or my detailed bio know that information — biological, embodied in genes, or cultural, embodied in human artefacts — has been my lifelong passion. I enjoy making connections between the biological and cultural with respect to human adaptivity and future prospects, and I like to dabble on occasion as an amateur economic or information science historian.

About 18 months ago I came across David Huynh‘s nifty Exhibit lightweight data display widget, gave it a glowing review, and then proceeded to convert my growing Sweet Tools listing of semantic Web and related tools to that format. Exhibit still powers the listing (which I just updated yesterday for the twelfth time or so).

At the time of first rolling out Exhibit I also noted that David had earlier created another lightweight timeline display widget that looked similarly cool (and which was also the first API for rendering interactive timelines in Web pages). (In fact, Exhibit and Timeline are but two of the growing roster of excellent lightweight tools from David.) Once I completed adopting Exhibit, I decided to find an appropriate set of chronological or time-series data to play next with Timeline.

I had earlier been ruminating on one of the great intellectual mysteries of human development: Why, roughly beginning in 1820 to 1850 or so, did the historical economic growth patterns of all prior history suddenly take off? I first wrote on this about two years ago in The Biggest Disruption in History: Massively Accelerated Growth Since the Industrial Revolution, with a couple of follow-ups and expansions since then.

I realized that in developing my thesis that wood pulp paper and mechanized printing were the key drivers for this major inflection change in growth (as they affected literacy and broadscale access to written information) I already had the beginnings of a listing of various information innovations throughout history. So, a bit more than a year ago, I began adding to that list in terms of how humans learned to write, print, share, organize, collate, reproduce and distribute information, and when those innovations occurred.

There are now about 100 items in this listing (I’m still looking for and researching others; please send suggestions at any time. ;) ). Here are some of the current items in chronological order from upper left to lower right:

cave paintings · codex · footnotes · microforms
ideographs · woodblock printing · copyrights · thesaurus
calendars · tree diagram · encyclopedia · pencil (mass produced)
cuneiform · quill pen · capitalization · rotary perfection press
papyrus (paper) · library catalog · magazines · catalogues
hieroglyphs · movable type · taxonomy (binomial classification) · typewriter
ink · almanacs · statistics · periodic table
alphabet · paper (rag) · timeline · chemical pulp (sulfite)
Phaistos Disc · word spaces · data graphs · classification (Dewey)
logographs · registers · card catalogs · linotype
maps · intaglio · lithography · mimeograph machine
scrolls · printing press · punch cards · kraft process (pulp)
manuscripts · advertising (poster) · steam-powered (mechanized) papermaking · flexography
glossaries · bookbinding · book (machine-paper) · classification (LoC)
dictionaries · pagination · chemical symbols · classification (UDC)
parchment (paper) · punctuation · mechanical pencil · offset press
bibliographies · library catalog (printed) · chromolithography · screenprinting
concept of categories · public lending library · paper (wood pulp) · ballpoint pen
library · dictionaries (alphabetic) · rotary press · xerographic copier
classification system (library) · newspapers · mail-order catalog · hyperlink
zero · Information graphics · fountain pen · metadata (MARC)
paper · scientific journal

So, off and on, I have been working with and updating the data and display of this timeline in draft. (I may someday also post my notes about how to effectively work with the Timeline widget.)

With the listing above, completion was sufficient to finally post this version. One of the neat things with Timeline is the ability to drive the display from a simple XML listing. I will update the timeline when I next have an opportunity to fill in some of the missing items still remaining on my innovations list, such as alphabetization, citations, and tables of contents, among many others.
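
For those curious about the data format, here is a hedged sketch in Python of how such an XML event listing can be generated. The two sample entries and the date strings are illustrative only; the exact date formats Timeline accepts should be checked against the SIMILE documentation:

import xml.etree.ElementTree as ET

innovations = [
    # (start date, title, bubble description) -- sample entries only
    ("1455", "printing press", "Gutenberg's movable-type press."),
    ("1876", "classification (Dewey)", "Dewey Decimal Classification first published."),
]

root = ET.Element("data")
for start, title, desc in innovations:
    event = ET.SubElement(root, "event", start=start, title=title)
    event.text = desc  # becomes the body of the bubble popup

ET.ElementTree(root).write("timeline_data.xml", encoding="utf-8")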

Some Interpretation Caveats

Of course, rarely can an innovation be traced to a single individual or a single moment in time. Historians are increasingly documenting the cultural milieu and multiple individuals that affect innovation.

In these regards, then, a timeline such as this one is simplistic and prone to much error and uncertainty. We have no real knowledge, for example, of the precise time certain historical innovations occurred, and others (the ballpoint pen being one case in point) are a matter of interpretation as to what and when constituted the first expression. For instances where the record indicated multiple dates, I chose to use the date when the innovation was released to the public.

Nonetheless, given the time scales here of more than 30,000 years, I do think broad trends and rough time frames can be discerned. As long as one interprets this timeline as indicative, and not as definitive in any scholarly sense, I believe it can inform and provide some insight and guidance for how information has evolved over human history.

Some Use Tips

The operation of Timeline is pretty straightforward and intuitive. Here are a couple of tips to get a bit more out of playing with it:

  • The timeline has two scrolling panels, fast and slow. For rapid scrolling, use mouse down and left or right movement on the lower panel
  • The lower panel also shows small ticks for each innovation in the upper panel
  • Clicking any icon or label in the upper panel will cause a bubble popup to appear with a bit more detail and a picture for the item; click the ‘X’ to close the bubble
  • Each entry is placed in one or more categories keyed by icon. You may “filter” results by using keywords such as: alphabets, book, calendars, libraries, maps, mechanization, paper, papermaking, printing, organizing, scripts, standardization, statistics, timelines, or typography. Partial strings also match
  • Similarly, you may enter one of those same terms into one of the four color highlight boxes. Partial strings also match.

Sources, Contributions and Thanks

For the sake of consistency, nearly all entries and pictures on the timeline are drawn from the respective entries within Wikipedia. Subsequent updates may add to this listing by reference to original sources, at which time all sources will be documented.

The timeline icons are from David Vignoni’s Nuvola set, available under the LGPL license. Thanks David!

The fantastic Timeline was developed by David Huynh while he was a graduate student at MIT. Timeline and its sibling widgets were developed under funding from MIT’s Simile program. Thanks to all in the program and best wishes for continued funding and innovation.

Finally, my sincere thanks go to Professor Michael Buckland of the School of Information at the University of California, Berkeley, for his kind suggestions, input and provision of additional references and sources. Of course, any errors or omissions are mine alone. I also thank Professor Buckland for his admonitions about use and interpretation of the timeline dates.