Posted: May 21, 2012

Modularization Also Leads to Big Graph Visualization

We are pleased to announce the release of version 1.05 of UMBEL, which now has linkages to schema.org [6] and  GeoNames [1]. UMBEL has also been split into ‘core’ and ‘geo’ modules. The resulting smaller size of UMBEL ‘core’ — now some 26,000 reference concepts — has also enabled us to create a full visualization of UMBEL’s content graph.

The first notable change in UMBEL v. 1.05 is its mapping to schema.org. schema.org is a collection of schemas (usable as HTML markup) that webmasters can use to mark up their pages in ways recognized by major search providers. schema.org was first developed and organized by the major search engines Bing, Google and Yahoo!; later Yandex joined as a sponsor. Now many groups are supporting schema.org and contributing vocabularies and schemas.

I was one of the first to hail schema.org hours after its announcement [7]. It seemed only fair that we put our money where our mouth is and map UMBEL to it as well.

The UMBEL-schema.org mapping was done manually. We first searched and inspected the current UMBEL concept base for appropriate matches. Where that failed to find a fairly direct correspondence between existing UMBEL concepts and the types in schema.org, the source concept reference of OpenCyc was inspected in a similar manner. Failing a match from either of these two sources, a new concept was added to UMBEL ‘core’ and placed appropriately within the UMBEL reference concept subject structure.

The net result of this process was 298 schema.org types mapped into UMBEL. To accommodate them, three further concepts from OpenCyc and 78 new reference concepts were added to UMBEL. The section of Key Files below provides further explanatory links for these updates and mappings. We are reserving the addition of schema.org properties for a later time, when we plan to re-organize the Attributes SuperType within UMBEL.

Modularization of the UMBEL Vocabulary

Even in the early development of UMBEL there was tension over the scope and level of geographic information to include in its concept base. The initial decision was to support countries, the provinces and states of leading countries, and some leading cities. This decision was in the spirit of a general reference structure, but still felt arbitrary.

GeoNames is devoted to geographical information and concepts — both natural and human artifacts — and has become the go-to resource for geo-locational information. The decision was thus made to split out the initial geo-locational information in UMBEL and replace it with mappings to GeoNames. This decision also had the advantage of beginning a process of modularization of UMBEL.

Two sets of reference concepts were identified as candidates for splitting out of UMBEL ‘core’ on geo-locational grounds:

  1. Geopolitical places and places of human activities and facilities
  2. Natural geographical places and features.

These removed concepts were then placed into a separate ‘geo’ module of UMBEL, including all existing annotations and relations, resulting in a module of 1,854 concepts. That left 26,046 concepts in UMBEL ‘core’. Because of some shared parent concepts, there is some minor overlap between the two modules. These are now the modular splits in UMBEL version 1.05.

Mapping to GeoNames

GeoNames has a different structure to UMBEL. It has few classes and distinguishes its geographic information on the basis of some 671 feature codes. These codes span from geopolitical divisions — such as countries, states or provinces, cities, or other administrative districts — to splits and aggregations by natural and human features. Types of physical terrain — above ground and underwater — are denoted, as well as regions and landscape features governed by human activities (such as vineyards or lighthouses) [1]. We wanted to retain this richness in our mappings.

We needed a bridge between feature codes and classes, a sort of umbrella property generally equivalent to owl:sameAs in nature, but with some possible inexactitude or degree of approximation. The appropriate choice here is umbel:correspondsTo, which was designed specifically for this purpose [2]. This predicate is thus the basis for the mappings.

The 671 GeoNames feature codes were manually mapped to corresponding classes among the UMBEL concepts, in a manner identical to what was described for schema.org above. The result was to add another three OpenCyc concepts and 88 new UMBEL reference concepts to accommodate the full set of GeoNames feature codes. We thus now have a complete representation of the full structure and scope of GeoNames in UMBEL.
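
To make the shape of such a mapping concrete, here is a minimal sketch in Python (using rdflib) of a single feature-code assertion. The specific URIs below are illustrative assumptions only, not excerpts from the released linkage file:

    # Minimal sketch of one GeoNames-to-UMBEL mapping assertion.
    # The specific URIs are illustrative assumptions, not the released mapping.
    from rdflib import Graph, Namespace

    UMBEL = Namespace("http://umbel.org/umbel#")
    UMBEL_RC = Namespace("http://umbel.org/umbel/rc/")
    GN = Namespace("http://www.geonames.org/ontology#")

    g = Graph()
    g.bind("umbel", UMBEL)
    g.bind("gn", GN)

    # e.g., a vineyard feature code corresponds (approximately, not exactly)
    # to a vineyard reference concept
    g.add((GN["S.VIN"], UMBEL.correspondsTo, UMBEL_RC["Vineyard"]))

    print(g.serialize(format="turtle"))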

There are three modes in which one can now work with UMBEL:

  1. With UMBEL ‘core’ alone, recommended when your concept space is not concerned with geographical information
  2. UMBEL ‘core’ plus the UMBEL ‘geo’ module — equivalent to prior versions of UMBEL, or
  3. UMBEL ‘core’ plus GeoNames, recommended where geographical information is important to your concept space.

In this third case, you may use SPARQL queries with the umbel:correspondsTo predicate to achieve the desired retrievals. If more logic is required, you will likely need to look to a rules-based addition such as SWRL [3] or RIF [4] to capture the umbel:correspondsTo semantics.
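
As a rough illustration of the kind of retrieval involved, the sketch below loads UMBEL ‘core’ plus a linkage file into rdflib and asks for whatever corresponds to each concept. The file names are assumptions for illustration; substitute the actual distribution files:

    # Sketch only: the file names are assumed for illustration, not the
    # actual distribution artifacts.
    from rdflib import Graph

    g = Graph()
    g.parse("umbel_core.n3", format="n3")               # UMBEL 'core' (assumed local copy)
    g.parse("umbel_geonames_linkage.n3", format="n3")   # correspondsTo mappings (assumed)

    query = """
    PREFIX umbel: <http://umbel.org/umbel#>
    PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?feature ?conceptLabel
    WHERE {
      ?feature umbel:correspondsTo ?concept .
      ?concept rdfs:label ?conceptLabel .
    }
    LIMIT 25
    """

    for row in g.query(query):
        print(row.feature, row.conceptLabel)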

New Big Graph Visualization

Because of the UMBEL modularization, it has now become tractable to graph the main ontology in its entirety. The core UMBEL ontology contains about 26,000 reference concepts organized according to 33 super types. There are more than 60,000 relationships amongst these concepts, resulting in a graph structure of very large size.

It is difficult to grasp this graph in the abstract. Thus, using methods earlier described in our use of the Gephi visualization software [5], we present below a dynamic, navigable rendering of this graph of UMBEL core:

Note: at standard resolution, if this graph were rendered at actual size, it would be larger than 34 feet by 34 feet at full zoom! Hint: that is about 1,200 square feet, or half the size of a typical American house!


This UMBEL graph displays:

  • All 26,000 concepts (“nodes”) with labels, and with connections shown (though you must zoom in to see them)
  • The color-coded relation of these nodes to the 33 major SuperTypes in UMBEL, as well as the relative position of these clusters with respect to one another, and
  • When zooming (use the scroll wheel or the + icon) or panning (by dragging with the mouse), wait a couple of seconds for the clearest image refresh.

You may also want to inspect a static version of this big graph by downloading a PDF.

Key Files and Links

Lastly, we fully updated the UMBEL Web site and re-released the UMBEL wiki.


[1] For more information on GeoNames, see http://www.geonames.org/. The complete mapping to GeoNames is based on its 671 feature codes, which describe natural, geopolitical, and human activity geo-locational information; see further http://www.geonames.org/statistics/total.html

[2] Approximate relationships are discussed in M.K. Bergman, 2010. “The Nature of Connectedness on the Web,” AI3:::Adaptive Information blog, November 22, 2010; see http://www.mkbergman.com/935/the-nature-of-connectedness-on-the-web/. One option, for example, is the x:coref predicate from the UMBC Ebiquity group; see further Jennifer Sleeman and Tim Finin, 2010. “Learning Co-reference Relations for FOAF Instances,” Proceedings of the Poster and Demonstration Session at the 9th International Semantic Web Conference, November 2010; see http://ebiquity.umbc.edu/_file_directory_/papers/522.pdf. In the words of Tim Finin of the Ebiquity group:

The solution we are currently exploring is to define a new property to assert that two RDF instances are co-referential when they are believed to describe the same object in the world. The two RDF descriptions might be incompatible because they are true at different times, or the sources disagree about some of the facts, or any number of reasons, so merging them with owl:sameAs may lead to contradictions. However, virtually merging the descriptions in a co-reference engine is fine — both provide information that is useful in disambiguating future references as well as for many other purposes. Our property (:coref) is a transitive, symmetric property that is a super-property of owl:sameAs and is paired with another, :notCoref that is symmetric and generalizes owl:differentFrom.

When we look at the analogous properties noted above, we see that they tend to share reflexivity, symmetry and transitivity. We specifically designed the umbel:correspondsTo predicate to capture these close, nearly equivalent, but still uncertain relationships.

[3] SWRL (Semantic Web Rule Language) combines sublanguages of the OWL Web Ontology Language (OWL DL and Lite) with those of the Rule Markup Language (Unary/Binary Datalog). SWRL has the full power of OWL DL, but at the price of decidability and practical implementations. See further http://www.w3.org/Submission/SWRL/.
[4] The Rule Interchange Format (RIF) is a W3C Recommendation. RIF is based on the observation that there are many “rules languages” in existence, and what is needed is to exchange rules between them. RIF includes three dialects, a Core dialect which is extended into a Basic Logic Dialect (BLD) and Production Rule Dialect (PRD). See further http://www.w3.org/2005/rules/wiki/RIF_FAQ.
[5] See further, M.K. Bergman, 2011. “A New Best Friend: Gephi for Large-scale Networks,” AI3:::Adaptive Information blog, August 8, 2011.
[6] schema.org lists its various contributing schema and also provides an OWL ontology of the system.
[7] See further, M.K. Bergman, 2011. “Structured Web Gets Massive Boost,” AI3:::Adaptive Information blog, June 2, 2011.

Posted by AI3's author, Mike Bergman Posted on May 21, 2012 at 12:26 am in Ontologies, Structured Web, UMBEL | Comments (0)
The URI link reference to this post is: http://www.mkbergman.com/999/new-umbel-release-gains-schema-org-geonames-capabilities/
The URI to trackback this post is: http://www.mkbergman.com/999/new-umbel-release-gains-schema-org-geonames-capabilities/trackback/
Posted: May 18, 2012

Some Quick Investigations Point to Promise, Disappointments

It has been clear for some time that Google has been assembling a war chest of entities and attributes. It first began to appear as structured results in its standard results listings, a trend I commented upon more than three years ago in Massive Muscle on the ABox at Google. Its purchase of Metaweb and its Freebase service in July 2010 only affirmed that trend.

This week, perhaps a bit rushed due to attention to the Facebook IPO, Google announced its addition of the Knowledge Graph (GKG) to its search results. It has been releasing this capability in a staged manner. Since I was fortunately one of the first to be able to see these structured results (due to luck of the draw and no special “ins”), I have spent a bit of time deconstructing what I have found.

Apparent Coverage

What you get (see below) when you search on particular kinds of entities is in essence an “infobox“, similar in structure to what is found on Wikipedia. This infobox is a tabular presentation of key-value pairs, or attributes, for the kind of entity in the search. A ‘people’ search, for example, turns up birth and death dates and locations, some vital statistics, spouse or famous relations, pictures, and links to other relevant structured entities. The types of attributes shown vary by entity type. Here is an example for Stephen King, the writer (all links from here forward provide GKG results), which shows the infobox and its key-value pairs in the righthand column:

Google ‘Stephen King’ structured result

Reportedly these results are drawn from Freebase, Wikipedia, the CIA World Factbook and other unidentified sources. Some of the results may indeed be coming from Freebase, but I saw none as such. Most entities I found were from Wikipedia, though it is important to know that Freebase in its first major growth incarnation springboarded from a Wikipedia baseline. These early results may have been what was carried forward (since other data contributed to Freebase is known to be of highly variable quality).

The entity coverage appears to be spotty and somewhat disappointing in this first release. Entity types that I observed were in these categories:

  • People
  • Chemical Compounds
  • Directors
  • Some Companies
  • Some National Parks
  • Places
  • Musicians/Musical Groups
  • Actors
  • Some Government Agencies
  • Many Local Businesses
  • Animals
  • Movies
  • Albums
  • Notable Landmarks
  • Who Knows What Else


Entity types that I expected to see, but did not find include:

  • Products
  • Most Companies
  • Who Knows What Else
  • Songs
  • Most Government Agencies
  • Concepts
  • Non-government Organizations

This is clearly not rigorous testing, but it would appear that entity types along the lines of those in schema.org are what should be expected over time.

I have no way to gauge the accuracy of Google’s claims that it has offered up structured data on some 500 million entities (and 3.5 billion facts). However, given the lack of coverage in key areas of Wikipedia (which itself has about 3 million entities in the English version), I suspect much of that number comes from the local businesses and restaurants and such that Google has been rapidly adding to its listings in recent years. Coverage of broadly interesting stuff still seems quite lacking.

The much-touted knowledge graph is also kind of disappointing. Related links are provided, but they are close and obvious. So, an actor will have links to films she was in, or a person may have links to famous spouses, but anything approaching a conceptual or knowledge relationship is missing. I think, though, we can see such links and types and entity expansions to steadily creep in over time. Google certainly has the data basis for making these expansions. And, constant, incremental improvement has been Google’s modus operandi.

Deconstructing the URL

For some time, at various meetings I attend, I have made a point of asking Google representatives whether there is some unique, internal ID for entities within its databases. Sometimes the reps I have questioned just don’t know, and sometimes they are cagey.

But, clearly, anything like the structured data that Google has been working toward has to have some internal identifier. To see if some of this might now have surfaced with the Knowledge Graph, I did a bit of poking of the URLs shown in the searches and the affiliated entities in the infoboxes. Under most standard searches, the infobox appears directly if there is one for that object. But, by inspecting cross-referenced entities from the infoboxes themselves, it is possible to discern the internal key.

The first thing one can do in such inspections is to remove the parameters that are local or session-related, such as those tied to one’s own preferences or browser settings. Other tests suggest still other removals. So, using the Stephen King example above, here is the full URL from which those aspects can be eliminated:

https://www.google.com/search?num=100&hl=en&safe=off&q=stephen+king&stick=H4sIAAAAAAAAAONgVuLQz9U3MCvIKwQAcT1q1AwAAAA&sa=X&ei=Q-S2T6e1HYOw2wXR2OXOCQ&ved=0CPMGEJsTKAA&biw=1280&bih=827

This actually conformed to my intuition, because the ‘&stick’ aspect was a new parameter for me. (Typically, in these dynamic URLs, the various parameters are separated from one another by a set designator character. In the case of Google, that is the ampersand &.)

By simply doing repeated searches that result in the same entity references, I was able to confirm that the &stick parameter is what invokes the unique ID and infobox for each entity. We can decompose that value further, but the critical, entity-specific portion appears to be what falls in the gaps of the common pattern &stick=H4sIAAAAAAAAAONg . . [VuLQz9U3] . . AAAA. The portion in brackets varies less, and I suspect it might be related to the source, rather than the entity.
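
A quick way to check this is to parse the URL with nothing more than Python’s standard library; the sketch below isolates the &stick value from the Stephen King URL shown above:

    # Isolate the &stick value from the Stephen King results URL shown above,
    # using only the Python standard library.
    from urllib.parse import urlparse, parse_qs

    url = ("https://www.google.com/search?num=100&hl=en&safe=off&q=stephen+king"
           "&stick=H4sIAAAAAAAAAONgVuLQz9U3MCvIKwQAcT1q1AwAAAA"
           "&sa=X&ei=Q-S2T6e1HYOw2wXR2OXOCQ&ved=0CPMGEJsTKAA&biw=1280&bih=827")

    params = parse_qs(urlparse(url).query)
    print(params["q"][0])       # the query text
    print(params["stick"][0])   # the entity/infobox identifier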

I started to do some investigation on types and possible sources, but ran out of time. Listed below are some &stick identifiers for various types of entities (each is a live link):

Type | Entity | Infobox Identifier
Place | Los Angeles | &stick=H4sIAAAAAAAAAONgVuLSz9U3MDYoTDIuAQD9ON7KDgAAAA
Place | Brentwood | &stick=H4sIAAAAAAAAAONgVuLUz9U3MKwsNEkDAN6nm-sNAAAA
Person | Arthur Miller | &stick=H4sIAAAAAAAAAONgVmLXz9U3qEwrAADRsxaaCwAAAA
Person | Joe DiMaggio | &stick=H4sIAAAAAAAAAONgVuLQz9U3SDFMqwAAAy8zdQwAAAA
Person | Marilyn Monroe | &stick=H4sIAAAAAAAAAONgVuLQz9U3MCkvLAIAW7x_LwwAAAA
Movie | Shawshank Redemption | &stick=H4sIAAAAAAAAAONgVuLQz9U3MM_KKwEAPYgNDQwAAAA
Movie | The Green Mile | &stick=H4sIAAAAAAAAAONgVuLUz9U3MC62zC4AAGg8mEkNAAAA
Movie | Dr. Strangelove | &stick=H4sIAAAAAAAAAONgVuLQz9U3MEopzwIAfsFCUQwAAAA
Animal | Toucan | &stick=H4sIAAAAAAAAAONgVuLQz9U3MM8wNAcA4g1_3AwAAAA
Animal | Wolverine | &stick=H4sIAAAAAAAAAONgUeLQz9U3sDAryQIAhJ3RUwwAAAA
Musicians/Groups | The Beatles | &stick=H4sIAAAAAAAAAONgVuLQz9U3ME82yAIAC_7r3AwAAAA
Albums | The White Album | &stick=H4sIAAAAAAAAAONgVuLSz9U3MMxIN0nKAADnd5clDgAAAA


You can verify that this ‘&stick‘ reference is what is pulling in the infobox by looking at this modified query, which substitutes Marilyn Monroe’s &stick into the Stephen King URL string:

Google ‘Stephen King’ + ‘Marilyn Monroe’ structured results

Note the standard search results in the lefthand results panel are the same as for Stephen King, but we have now fooled the Google engine into displaying Marilyn Monroe’s infobox.
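
The same substitution can be scripted. The sketch below swaps Marilyn Monroe’s &stick value (taken from the table above) into the Stephen King query, reproducing by hand the trick just described:

    # Swap a different entity's &stick identifier into the same query string.
    # The identifiers come from the table above; everything else is standard library.
    from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

    url = ("https://www.google.com/search?q=stephen+king"
           "&stick=H4sIAAAAAAAAAONgVuLQz9U3MCvIKwQAcT1q1AwAAAA")

    def swap_stick(u, new_stick):
        parts = urlparse(u)
        params = parse_qs(parts.query)
        params["stick"] = [new_stick]
        return urlunparse(parts._replace(query=urlencode(params, doseq=True)))

    marilyn = "H4sIAAAAAAAAAONgVuLQz9U3MCkvLAIAW7x_LwwAAAA"  # Marilyn Monroe, per the table
    print(swap_stick(url, marilyn))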

I’m sure over time that others will deconstruct this framework to a very precise degree. What would really be great, of course, as noted on many recent mailing lists, is for Google to expose all of this via an API. The Google listing could become the de facto source for Webby entity identifiers.

Some Concluding Thoughts

Sort of like when schema.org was first announced, there have been complaints from some in the semantic Web community that Google released this stuff without once saying the word “semantic”, that much had been ripped off from the original researchers and community without attribution, that a gorilla commercial entity like Google could only be expected to milk this stuff for profit, etc., etc.

That all sounds like sour grapes to me.

What we have here is what we are seeing across the board: the inexorable integration of semantic technology approaches into many existing products. Siri did it with voice commands; Bing, and now Google, are doing it too with search.

We should welcome these continued adoptions. The fact is, semWeb community, that what we are seeing in all of these endeavors is the right and proper role for these technologies: in the background, enhancing our search and information experience, and not something front and center or rammed down our throats. These are the natural roles of semantic technologies, and they are happening at a breakneck pace.

Welcome to the semantic technology space, Google! I look forward to learning much from you.

Posted by AI3's author, Mike Bergman Posted on May 18, 2012 at 7:43 pm in Adaptive Information, Structured Web | Comments (3)
The URI link reference to this post is: http://www.mkbergman.com/1009/deconstructing-the-google-knowledge-graph/
The URI to trackback this post is: http://www.mkbergman.com/1009/deconstructing-the-google-knowledge-graph/trackback/
Posted: April 30, 2012

Linked Data is Sometimes a Useful Technique, but is an Inadequate Focus

While in Australia on other business, I had the great fortune to be invited by Adam Bell of the Australian War Memorial to be the featured speaker at the Canberra Semantic Web Meetup on April 23. The talk was held within the impressive BAE Systems Theatre of the Memorial and was very well attended. My talk was preceded by an excellent introduction to the semantic Web by David Ratcliffe and Armin Haller of CSIRO.  They have kindly provided their useful slides online.

Many of the attendees came from the perspective of libraries, archives or museums. They naturally had an interest in the linked data activities in this area, a growing initiative that is now known under the acronym of LOD-LAM. Though I have been an advocate of linked data going back to 2006, one of my main theses was that linked data was an inadequate focus to achieve interoperability. The key emphases of my talk were that the pragmatic contributions of semantic technologies reside more in mindsets, information models and architectures than in ‘linked data’ as currently practiced.

Disappointments and Successes

The semantic Web and its most recent branding of linked data have antecedents going back to 1945 via Vannevar Bush’s memex and Ted Nelson’s hypertext of the early 1960s. The most powerful portrayal of the potential of the semantic Web comes in Douglas Adams’ 1990 Hyperland special for the BBC, a full decade before Tim Berners-Lee and colleagues first coined the term ‘semantic web’ [1]. The Hyperland vision of obsequious intelligent agents doing our very bidding has, of course, not been fully realized. The lack of visible uptake of this full vision has caused some proponents to back away from the idea of the semantic Web. Linked data, in fact, was a term coined by Berners-Lee himself, arguably in part to re-brand the idea and to focus on a more immediate, achievable vision. In its first formulation linked data emphasized the RDF (Resource Description Framework) data model, though others, notably Kingsley Idehen, have attempted to put forward a revisionist definition of linked data that includes any form of structured data involving entity-attribute values (EAV).

No matter how expressed, the idea behind all of these various terms has in essence been to make meaningful connections, to provide the frameworks for interoperability. Interoperability means getting disparate sources of data to relate to each other, as a means of moving from data to information. Interoperability requires that source and receiver share a vocabulary about what things mean, as well as shared understandings about the associations or degree of relationship between the items being linked.

The current concept of linked data attempts to place these burdens mostly on the way data is published. While apparently “simpler” than earlier versions of the semantic Web (since linked data de-emphasizes shared vocabularies and nuanced associations), linked data places onerous burdens on how publishers express their data. Though many in the advocacy community point to the “billions” of RDF triples expressed as a success, actual consumers of linked data are rare. I know of no meaningful application or example where the consumption of linked data is an essential component.

However, there are a few areas of success in linked data. DBpedia, Freebase (now owned by Google), and GeoNames have been notable in providing identifiers (URIs) for common concepts, things, entities and places. There has also been success in the biomedical community with linked data.

Meanwhile, other aspects of the semantic Web have also shown success, but been quite hidden. Apple’s spoken Siri service is driven by an ontological back-end; schema.org is beginning to provide shared ways for tagging key entities and concepts, as promoted by the leading search engines of Google, Bing, Yahoo! and Yandex; Bing itself has been improved as a search service by the incorporation of the semantic search technologies of its earlier Powerset acquisition; and Google is further showing how NLP (natural language processing) techniques can be used to extract meaningful structure for characterizing entities in search results and in search completion and machine language translation. These services are here today and widely used. All operate in the background.

What Lessons Can We Derive?

These failures and successes help provide some pragmatic lessons going forward.

While I disagree with Kingsley’s revisionist approach to re-defining linked data, I very much agree with his underlying premise:  effective data exchange does not require RDF. Most instance records are already expressed as simple entity-value pairs, and any data transfer serialization — from key-value pairs to JSON to CSV spreadsheets — can be readily transformed to RDF.


This understanding is important because the fundamental contribution of RDF is not as a data exchange format, but as a foundational data model. The simple triple model of RDF can easily express the information assertions in any form of content, from completely unstructured text (after information extraction or metadata characterization) to the most structured data sources. Triples can themselves be built up into complete languages (such as OWL) that also capture the expressiveness necessary to represent any extant data or information schema [2].
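
As a small illustration of that point, the sketch below takes a plain key-value instance record and re-expresses it as RDF triples with rdflib; the namespace and the record itself are invented purely for illustration:

    # A plain key-value instance record re-expressed as RDF triples.
    # The namespace and record contents are invented for illustration only.
    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")

    record = {"name": "Canberra", "country": "Australia", "venue": "BAE Systems Theatre"}

    g = Graph()
    subject = EX["talk-location"]
    for key, value in record.items():
        g.add((subject, EX[key], Literal(value)))   # one triple per key-value pair

    print(g.serialize(format="turtle"))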

The ability of RDF to capture any form of data or any existing schema makes it a “universal solvent” for information. This means that the real role of RDF is as a canonical data model at the core of the entire information architecture. Linked data, with its emphasis on data publishing and exchange, gets this focus exactly wrong. Linked data emphasizes RDF at the wrong end of the telescope.

The idea of common schema and representations is at the core of the semantic Web successes that do exist. In fact, when we look at Siri, emerging search, or some of the other successes noted above, we see that their semantic technology components are quite hidden. Successful semantics tend to work in the background, not in the foreground in terms of how data is either published or consumed. Semantic technologies are fundamentally about knowledge representation, not data transfer.

Where linked data is being consumed, it is within communities such as the life sciences where much work has gone into deriving shared vocabularies and semantics for linking and mapping data. These bases for community sharing express themselves as ontologies, which are really just formalized understandings of these shared languages in the applicable domain (life sciences, in this case). In these cases, curation and community processes for deriving shared languages are much more important to emphasize than how data gets exposed and published.

Linked data as presently advocated has the wrong focus. The techniques of publishing data and de-referencing URIs are given prominence over data quality, meaningful linkages (witness the appalling misuse of owl:sameAs [3]), and shared vocabularies. These are the reasons we see little meaningful consumption of linked data. It is also the reason that the much-touted FYN (“follow your nose”) approach plays no meaningful information role today, other than as a somewhat amusing diversion.

Shifting the Focus

In our own applications Structured Dynamics promotes seven pillars to pragmatic semantic technologies [4]. Linked data is one of those pillars, because where the other foundations are in place, including shared understandings, linked data is the most efficient data transfer format. But, as noted, linked data alone is insufficient.

Linked data is thus the wrong starting focus for new communities and users wishing to gain the advantages of interoperability. The benefits of interoperability must first derive from a core (or canonical) data model — RDF — that is able to capture any extant data or schema. As these external representations get boiled down to a canonical form, there must be shared understandings and vocabularies to capture the meaning in this information. This puts community involvement and processes at the forefront of the semantic enterprise. Only after the community has derived these shared understandings should linked data be considered as the most efficient way to interchange data amongst the community members.

Identifying and solving the “wrong” problems is a recipe for disappointment. The challenges of the semantic Web are not in branding or messaging. The challenges of the semantic enterprise and Web reside more in mindsets, approaches and architecture. Linked data is merely a technique that contributes little — perhaps worse by providing the wrong focus — to solving the fundamental issue of information interoperability.

Once this focus shifts, a number of new insights emerge. Structure is good in any form; arguments over serializations or data formats are silly and divert focus. The role of semantic technologies is likely to be a more hidden one, residing in the background, as current successes are now showing us. Building communities with trusted provenance and shared vocabularies (ontologies) is the essential starting point. Embracing and learning about NLP will be important for incorporating the 80% of content currently in unstructured text and for disambiguating reference conflicts. End users, subject matter experts and librarians are much more important contributors to this process than developers or computer scientists. We largely now have the necessary specifications and technologies in place; it is time for content and semantic reconciliation to guide the process.

It is great that the abiding interest in interoperability is leading to the creation of more and more communities, such as LOD-LAM, forming around the idea of linked data. What is important moving forward is to use these interests as springboards, and not boxes, for exploring the breadth of available semantic technologies.

For More on the Talk

Below is a link to my slides used in Canberra:

View more presentations from Mike Bergman.

Also, as mentioned, the intro slides are online, a video recording of the presentations is also available, and some other blog postings occasioned by the talks are also online.


[1] Tim Berners-Lee, James Hendler and Ora Lassila, 2001. “The Semantic Web”. Scientific American Magazine; see http://www.scientificamerican.com/article.cfm?id=the-semantic-web.
[2] See further, M.K. Bergman, 2009. “Advantages and Myths of RDF,” AI3:::Adaptive Information blog, April 8, 2009. See http://www.mkbergman.com/483/advantages-and-myths-of-rdf/.
[3] See, among many, M.K. Bergman, 2010. “Practical P-P-P-Problems with Linked Data,” AI3:::Adaptive Information blog, October 4, 2010. See http://www.mkbergman.com/917/practical-p-p-p-problems-with-linked-data/.
[4] M.K. Bergman, 2010. “Seven Pillars of the Open Semantic Enterprise,” AI3:::Adaptive Information blog, January 12, 2010. See http://www.mkbergman.com/859/seven-pillars-of-the-open-semantic-enterprise/.

Posted by AI3's author, Mike Bergman Posted on April 30, 2012 at 2:46 am in Linked Data, Semantic Web | Comments (1)
The URI link reference to this post is: http://www.mkbergman.com/1006/pragmatic-approaches-to-the-semantic-web/
The URI to trackback this post is: http://www.mkbergman.com/1006/pragmatic-approaches-to-the-semantic-web/trackback/
Posted: April 4, 2012

Adaptive Information is a Hammer, but Genes are Not a Nail

Since Richard Dawkins first put forward the idea of the “meme” in his book The Selfish Gene some 35 years ago [1], the premise has stuck in my craw. I, like Dawkins, was trained as an evolutionary biologist. I understand the idea of the gene and its essential role as a vehicle for organic evolution. And, all of us clearly understand that “ideas” themselves have a certain competitive and adaptive nature. Some go viral; some spread like wildfire and take prominence; and some go nowhere or fall on deaf ears. Culture and human communications and ideas play complementary — perhaps even dominant — roles in comparison to the biological information contained within DNA (genes).

I think there are two bases for why the “meme” idea sticks in my craw. The first harkens back to Dawkins. In formulating the concept of the “meme”, Dawkins falls into the trap of many professionals, what the French call déformation professionnelle. This is the idea of professionals framing problems from the confines of their own points of view. This is also known as the Law of the Instrument, or (Abraham) Maslow‘s hammer, or what all of us know colloquially as “if all you have is a hammer, everything looks like a nail” [2]. Human or cultural information is not genetics.

The second — and more fundamental — basis for why this idea sticks in my craw is its mis-characterization of what is adaptive information, the title and theme of this blog. Sure, adaptive information can be found in the types of information structures at the basis of organic life and organic evolution. But, adaptive information is much, much more. Adaptive information is any structure that provides arrangements of energy and matter that maximizes entropy production. In inanimate terms, such structures include chemical chirality and proteins. It includes the bases for organic life, inheritance and organic evolution. For some life forms, it might include communications such as pheromones or bird or whale songs or the primitive use of tools or communicated behaviors such as nest building. For humans with their unique abilities to manipulate and communicate symbols, adaptive information embraces such structures as languages, books and technology artifacts. These structures don’t look or act like genes and are not replicators in any fashion of the term. To hammer them as “memes” significantly distorts their fundamental nature as information structures and glosses over what factors might — or might not — make them adaptive.

I have been thinking about these concepts for much of the past few decades. Recently, though, there has been a spate of uses of the “meme” term, particularly on the semantic Web mailing lists to which I subscribe. This spewing has caused me to outline some basic ideas about what I find so problematic in the use of the “meme” concept.

A Brief Disquisition on Memes

As defined by Dawkins and expanded upon by others, a “meme” is an idea, behavior or style that spreads from person to person within a culture. It is proposed as being able to be transmitted through writing, speech, gestures or rituals. Dawkins specifically cited melodies, catch-phrases, fashion and the technology of building arches as examples of memes. A meme is postulated as a cultural analogue to genes in that memes are assumed to be able to self-replicate, mutate or respond to selective pressures. Thus, as proposed, memes may evolve by natural selection in a manner analogous to that of biological evolution.

However, unlike a gene, a structure corresponding to a “meme” has never been discovered or observed. There is no evidence for it as a unit of replication, or indeed as any kind of coherent unit at all. In its sloppy use, it is hard to see how “meme” differs in its scope from concepts, ideas or any form of cultural information or transmission, yet it is imbued with properties analogous to animate evolution for which there is not a shred of empirical evidence.

One might say, so what, the idea of a “meme” is merely a metaphor, what is the harm? Well, the harm comes about when it is taken seriously as a means of explaining human behavior and cultural changes, a field of study called memetics. It becomes a pseudo-scientific term that sets a boundary condition for understanding the nature of information and what makes it adaptive or not [3]. Mechanisms and structures appropriate to animate life are not universal information structures; they are simply the structures that have evolved in the organic realm. In the human realm of signs and symbols and digital information and media, information is the universal, not the genetic structure of organic evolution.

The noted evolutionary geneticist, R.C. Lewontin, one of my key influences as a student, has also been harshly critical of the idea of memetics [4]:

 “The selectionist paradigm requires the reduction of society and culture to inheritance systems that consist of randomly varying, individual units, some of which are selected, and some not; and with society and culture thus reduced to inheritance systems, history can be reduced to ‘evolution.’ . . . we conclude that while historical phenomena can always be modeled selectionistically, selectionist explanations do not work, nor do they contribute anything new except a misleading vocabulary that anesthetizes history.”

Consistent with my recent writings about Charles S. Peirce [5], many logicians and semiotic theorists are also critical of the idea of “memes”, but on different grounds. The criticism here is that “memes” distort Peirce’s ideas about signs and the reification of signs and symbols via a triadic nature. Notable in this camp is Terrence Deacon [6].

Information is a First Principle

It is not surprising that the concept of “memes” arose in the first place. It is understandable to seek universal principles consistent with natural laws and observations. The mechanism of natural evolution works on the information embodied in DNA, so why not look to genes as some form of universal model?

The problem here, I think, was to confuse mechanisms with first principles. Genes are a mechanism — a “structure” if you will — that, along with other forms of natural selection such as the entire organism and even kin selection [7], has evolved as a means of adaptation in the animate world. But the fundamental thing to be looked for here is the idea of information, not the mechanism of genes and how they replicate. The idea of information holds the key for drilling down to universal principles that may find commonality between information for humans in a cultural sense and information conveyed through natural evolution for life forms. It is the search for this commonality that has driven my professional interests for decades, spanning from population genetics and evolution to computers, information theory and semantics [8].

But before we can tackle these connections head on, it is important to address a couple of important misconceptions (as I see them).

Segue #1: Information is (Not!) Entropy

In looking to information as a first principle, Claude Shannon‘s seminal work in 1948 on information theory must be taken as the essential point of departure [9]. The motivation of Shannon’s paper and work by others preceding him was to understand information losses in communication systems or networks. Much of the impetus for this came about because of issues in wartime communications and early ciphers and cryptography. (As a result, the Shannon paper is also intimately related to data patterns and data compression, not further discussed here.)

In a strict sense, Shannon’s paper was really talking about the amount of information that could be theoretically and predictably communicated between a sender and a receiver. No context or semantics were implied in this communication, only the amount of information (for which Shannon introduced the term “bits” [10]) and what might be subject to losses (or uncertainty in the accurate communication of the message). In this regard, what Shannon called “information” is what we would best term “data” in today’s parlance.

The form of the uncertainty (unpredictability) calculation that Shannon derived:

 H(X) = - \sum_{i=1}^{n} p(x_i) \log_b p(x_i)

very much resembled the mathematical form of Boltzmann‘s original definition of entropy (as elaborated upon by Gibbs, and denoted as S, for Gibbs entropy):

S = - k_B \sum_i p_i \ln p_i

and thus Shannon also labelled his measure of unpredictability, H, as entropy [10].

After Shannon, and nearly a century after Boltzmann, work by individuals such as Jaynes in the field of statistical mechanics came to show that thermodynamic entropy can indeed be seen as an application of Shannon’s information theory, so there are close parallels [11]. This parallel of mathematical form and terminology has led many to assert that information is entropy.

I believe this assertion is a misconception on two grounds.

First, as noted, what is actually being measured here is data (or bits), not information embodying any semantic meaning or context. Thus, the formula and terminology are not accurate for discussing “information” in a conventional sense.

Second, the Shannon methods are based on the communication (transmittal) between a sender and a receiver. Thus the Shannon entropy measure is actually a measure of the uncertainty for either one of these states. The actual information that gets transmitted and predictably received was formulated by Shannon as R (which he called rate), which he expressed basically as:

R = H_before - H_after

R, then, becomes a proxy for the amount of information accurately communicated. R can never be zero (because all communication systems have losses). H_before and H_after are both state functions for the message, so this also makes R a function of state. So while there is Shannon entropy (unpredictability) for any given sending or receiving state, the actual amount of information (that is, data) that is transmitted is a change in state as measured by a change in uncertainty between sender (H_before) and receiver (H_after). In the words of Thomas Schneider, who provides a very clear discussion of this distinction [12]:

Information is always a measure of the decrease of uncertainty at a receiver.
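
A toy calculation helps make the distinction concrete. In the sketch below, H is computed for a made-up “before” and “after” distribution, and R is the decrease in uncertainty between them; the probabilities are illustrative values only, not data from this post:

    # Toy illustration: H measures uncertainty; R is the decrease in uncertainty
    # between sender and receiver. The probabilities are made-up values.
    from math import log2

    def shannon_entropy(probs):
        """H(X) = -sum p(x) * log2 p(x), in bits."""
        return -sum(p * log2(p) for p in probs if p > 0)

    h_before = shannon_entropy([0.25, 0.25, 0.25, 0.25])  # four equally likely messages
    h_after = shannon_entropy([0.9, 0.05, 0.03, 0.02])    # residual uncertainty after receipt

    r = h_before - h_after   # R = H_before - H_after
    print(round(h_before, 3), round(h_after, 3), round(r, 3))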

These points do not directly bear on the basis of information as discussed below, but they help remove misunderstandings that might undercut those points. Further, these clarifications make the theoretical foundations of information (data) consistent with natural evolution, while remaining logically consistent with the 2nd law of thermodynamics (see next).

Segue #2: Entropy is (Not!) Disorder

The 2nd law of thermodynamics expresses the tendency that, over time, differences in temperature, pressure, or chemical potential equilibrate in an isolated physical system. Entropy is a measure of this equilibration: for a given physical system, the highest entropy state is one at equilibrium. Fluxes or gradients arise when there are differences in state potentials in these systems. (In physical systems, these are known as sources and sinks; in information theory, they are sender and receiver.) Fluxes go from low to high entropy, and are non-reversible — the “arrow of time” — without the addition of external energy. Heat, for example, is a by product of fluxes in thermal energy. Because these fluxes are directional in isolation, a perpetual motion machine is shown as impossible.

In a closed system (namely, the entire cosmos), one can see this gradient as spanning from order to disorder, with the equilibrium state being the random distribution of all things. This perspective, and much schooling regarding these concepts, tends to present the idea of entropy as a “disordered” state. Life is seen as the “ordered” state in this mindset. Hewing to this perspective, some prominent philosophers, scientists and others have sometimes tried to present the “force” representing life and “order” as an opposite one to entropy. One common term for this opposite “force” is “negentropy” [13].

But, in the real conditions common to our lives, our environment is distinctly open, not closed. We experience massive influxes of energy via sunlight, and have learned as well how to harness stored energy from eons past in further sources of fossil and nuclear energy. Our open world is indeed a high energy one, and one that increases that high-energy state as our knowledge leads us to exploit still further resources of higher and higher quality. As Buckminster Fuller once famously noted, electricity consumption (one of the highest quality energy resources found to date) has become a telling metric about the well-being and wealth of human societies [14].

The high-energy environments fostering life on earth and more recently human evolution establish a local (in a cosmic sense) gradient that promotes fluxes to more ordered states, not lesser unordered ones. These fluxes remain faithful to basic physical laws and are non-deterministic [15]. Indeed, such local gradients can themselves be seen as consistent with the conditions initially leading to life, favoring the random event in the early primordial soup that led to chemical structures such as chirality, auto-catalytic reactions, enzymes, and then proteins, which became the eventual building blocks for animate life [16].

These events did not have preordained outcomes (that is, they were non-deterministic), but were the result of time and variation in the face of external energy inputs to favor the marginal combinatorial improvement. The favoring of the new marginal improvement also arises consistent with entropy principles, by giving a competitive edge to those structures that produce faster movements across the existing energy gradient. According to Annila and Annila [16]:

“According to the thermodynamics of open systems, every entity, simple or sophisticated, is considered as a catalyst to increase entropy, i.e., to diminish free energy. Catalysis calls for structures. Therefore, the spontaneous rise of structural diversity is inevitably biased toward functional complexity to attain and maintain high-entropy states.”

Via this analysis we see that life is not at odds with entropy, but is consistent with it. Further, we see that incremental improvements in structure that are consistent with the maximum entropy production principle will be favored [17]. Of course, absent the external inputs of energy, these gradients would reverse. Under those conditions, the 2nd law would promote a breakdown to a less ordered system, what most of us have been taught in schools.

With these understandings we can now see that the dichotomy of life representing order and entropy representing disorder is false. Further, we can see a guiding set of principles that is consistent across the broad span of evolution, from primordial chemicals and enzymes to basic life and on to human knowledge and artifacts. This insight provides the fundamental “unit” we need to be looking toward, and it is neither the gene nor the “meme”.

Information is Structure

Of course, the fundamental “unit” we are talking about here is information, and not limited, as is Shannon’s concept, to data. The quality that changes data to information is structure, and structure of a particular sort. Like all structure, there is order, or patterns, often of a hierarchical or fractal or graph nature. But the real aspect of the structure that is important is the marginal ability of that structure to lead to improvements in entropy production. That is, the processes that are most adaptive (and therefore selected) are those that maximize entropy production. Any structure that emerges that is able to reduce the energy gradient faster will be favored.

However, remember, these are probabilistic, statistical processes. Uncertainties in state may favor one structure at one time versus another at a different time. The types of chemical compounds favored in the primordial soup were likely greatly influenced by thermal and light cycles and drying and wet conditions. In biological ecosystems, there are huge differences in seed or offspring production or in overall species diversity and ecological complexity based on the stability (say, tropics) or instability (say, disturbance) of local environments. As noted, these processes are inherently non-deterministic.

As we climb up the chain from the primordial ooze to life and then to humans and our many information mechanisms and technology artifacts (which are themselves embodiments of information), we see increasing complexity and structure. But we do not see uniformity of mechanisms or vehicles.

The general mechanisms of information transfer in living organisms occur (generally) via DNA in genes, mediated by sex in higher organisms, subject to random mutations, and then kept or lost entirely as their host organisms survive to procreate or not. Those are harsh conditions: the information survives or not (on a population basis) with high concentrations of information in DNA and with a priority placed on remixing for new combinations via sex. Information exchange (generally) only occurs at each generational event.

Human cultural information, however, is of an entirely different nature. Information can be made persistent, can be recorded and shared across individuals or generations, extended with new innovations like written language or digital computers, or combined in ways that defy the limits of sex. Occasionally, of course, loss of living languages due to certain cultures or populations dying out or horrendous catastrophes like the Spanish burning (nearly all of) the Mayan’s existing books can also occur [18]. The environment will also be uncertain.

So, while we can define DNA in genes or the ideas of a “meme” all as information, in fact we now see how very unlike the dynamics and structures of these two forms really are. We can be awestruck with the elegance and sublimity of organic evolution. We can also be inspired by song or poem or moved to action through ideals such as truth and justice. But organic evolution does not transpire like reading a book or hearing a sermon, just like human ideas and innovations don’t act like genes. The “meme” is a totally false analogy. The only constant is information.

Some Tentative Implications

The closer we come to finding true universals, the better we will be able to create maximum entropy producing structures. This, in turn, has some pretty profound implications. The insight that keys these implications begins with an understanding of the fundamental nature — and importance — of information. According to Karnani et al [19]:

“. . . the common contemporary consent, the second law of thermodynamics, is perceived to drive disorder. Therefore, it may appear, at first sight, inconceivable that this universal law could possibly account for the existence and orderly characteristics of information, as well as for its meaningful content. However, the second law, or equivalently the principle of increasing entropy, merely states that difference among energy densities tends to vanish. When the surrounding energy density is high, the system will evolve toward a stationary state by increasing its energy content, e.g, by devising orderly machinery for energy transduction to acquire energy. . . . Syntax of information, when described by thermodynamics, is associated with the entropy of the physical representation, and significance of information is associated with the entropy increase in the receiver system when it executes the encoded information.”

All would agree that the evolution of life over the past few billion years is truly wondrous. But, what is equally wondrous is that the human species has come to learn and master symbols. That mastery, in turn, has broken the bounds of organic evolution and has put into our hands the very means and structure of information itself. Via this entirely new — and incredibly accelerated — path to information structures, we are only now beginning to see some of its implications:

  • Unlike all other organisms, we dominate our environment and have experienced increasing wealth and freedom. Wealth, and its universal applicability, continues to increase at an exponential rate [20]
  • We no longer depend on the random variant to maximize our entropy producing structures. We can now do so purposefully and with symbologies and bit streams of our own devising
  • Potentially all information variants can be recorded and shared across all human individuals and generations, a complete decoupling from organic boundaries
  • Key ideas and abstractions, such as truth, justice and equality, can operate on a species-wide basis and become adopted without massive die-offs of individuals
  • We are actively moving ourselves into higher-level energy states, further increasing the potential for wealth and new structures
  • We are actively impacting our local environment, potentially creating the conditions for our species’ demise
  • We are increasingly engaging all individuals of the human species in these endeavors through literacy, education and access to global information sources. This provides a still further multiplier effect on humanity’s ability to devise and manipulate information structures into more adaptive and highly-ordered states.

The idea of a “meme” actually cheapens our understanding of these potentials.

Ideas matter and terminology matters. These are the symbols by which we define and communicate potentials. If we choose the wrong analogies or symbols — as “meme” is in this case — we are picking the option with the lower entropy potential. Whether I assert it to be so or not, the “meme” concept is an information structure doomed for extinction.


[1] Richard Dawkins, 1976. The Selfish Gene, Oxford University Press, New York City, ISBN 0-19-286092-5.
Posted:March 27, 2012

Casting My Vote on Revising httpRange-14

The httpRange-14 issue and its predecessor “identity crisis” debate have been active for more than a decade on the Web [1]. It has been around so long that most acknowledge “fatigue,” and it has acquired the rarefied status of a permathread. Many want to throw up their hands when they hear of it again, and some feel — because of its duration and lack of resolution — that there never will be closure on the question. Yet everyone continues to argue, and then everyone wonders why actual consumption of linked data remains so problematic.

Jonathan Rees is to be thanked for refusing to let this sleeping dog lie. This issue is not going to go away so long as its basis and existing prescriptions are, in essence, incoherent. As a member of the W3C’s TAG (Technical Architecture Group), Rees has worked diligently to re-surface and re-frame the discussion. While I don’t agree with some of the specifics and especially with the constrained approach proposed for resolving this question [2], the sleeping dog has indeed been poked and is awake. For that we can thank Jonathan. Maybe now we can get it right and move on.

I don’t agree with how this issue has been re-framed and I don’t agree that responses to it must be constrained to the prescriptive approach specified in the TAG’s call for comments. Yet, that being said, as someone who has been vocal for years about the poor semantics of the semantic Web community, I feel I have an obligation to comment on this official call.

Thus, I am casting my vote behind David Booth’s alternative proposal [3], with one major caveat. I first explain the caveat and then my reasons for supporting Booth’s proposal. I have chosen not to submit a separate alternative in order to not add further to the noise, as Bernard Vatant (and, I’m sure, many, many others) has also chosen to do [4].

Bury the Notion of ‘Information Resource’ Once and for All

I first commented on the absurdity of the ‘information resource’ terminology about five years ago [5]. Going back to Claude Shannon [6], we have come to understand information as entropy (or, more precisely, as differences in energy state). One need not get that theoretical to see that this terminology is confusing. “Information resource” is a term that defies understanding (meaning) or precision. It is also a distinction that leads to a natural counter-distinction, the “non-information resource”, which is also an imprecise absurdity.

What the confusing term is meant to encompass is web-accessible content (“documents”), as opposed to descriptions of (or statements about) things. This distinction then triggers a different understanding of a URI (locator vs. identifier alone) and different treatments of how to process and interpret that URI. But the term is so vague and easily misinterpreted that all of the guidance behind the machinery to be followed gets muddied, too. Even in the current chapter of the debate, key interlocutors confuse and disagree as to whether a book is an “information resource” or not. If we can’t basically separate the black balls from the white balls, how are we to know what to do with them?
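
To make the machinery concrete, here is a minimal sketch of how a consumer might apply the current httpRange-14 rule when dereferencing an http URI. It assumes the Python requests library and a purely illustrative URI; real linked data clients handle content negotiation, redirect chains and errors far more carefully.

    import requests  # assumed available third-party HTTP library

    def classify_uri(uri):
        """Illustrative only: how a consumer might read the httpRange-14 rule."""
        if "#" in uri:
            # A hash URI: the fragment never reaches the server, so the URI can
            # name a thing; its description lives in the document before the '#'.
            return ("thing", uri.split("#", 1)[0])

        resp = requests.get(uri, allow_redirects=False)
        if resp.status_code == 303:
            # 303 See Other: the URI may identify anything at all; the Location
            # header points to a document that describes it.
            return ("thing", resp.headers.get("Location"))
        if resp.status_code == 200:
            # Per the httpRange-14 resolution, a 200 says the URI identifies an
            # "information resource", i.e., the retrieved document itself.
            return ("document", uri)
        return ("unknown", uri)

    # Hypothetical identifier, shown for illustration only.
    print(classify_uri("http://example.org/id/moby-dick"))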

If there must be a distinction, it should be based on the idea of the actual content of a thing — or perhaps more precisely web-accessible content or web-retrievable content — as opposed to the description of a thing. If there is a need to name this class of content things (a position that David Booth prefers, pers. comm.), then let’s use one of these more relevant terms and drop “information resource” (and its associated IR and NIR acronyms) entirely.

The motivation behind the “information resource” terminology also appears to be a desire that somehow a URI alone can convey the name of what a thing is or what it means. I recently tried to blow this notion to smithereens by using Peirce’s discussion of signs [1]. We should understand that naming and meaning may only be provided by the owner of a URI through additional explication, and then through what is understood by the recipient; the string of the URI itself conveys very little (or no) meaning in any semantic sense.

We should ban the notion of “information resource” forever. If the first exposure a potential new publisher or consumer of linked data encounters is “information resource”, we have immediately lost the game. Unresolvable abstractions lead to incomprehension and confusion.

The approach taken by the TAG in requesting new comments on httpRange-14 only compounds this problem. First, the guidance is to not allow any questioning of the “information resource” terminology within the prescribed comment framework [7]. Then, in the suggested framework for response, still further terminology such as “probe URIs”, “URI documentation carrier” or “nominal URI documentation carrier for a URI” is introduced. Aaaaarggghh! This only furthers the labored and artificial terminology common to this particular standards effort.

While Booth’s proposal does not call for an outright rejection of the “information resource” terminology (my one major qualification in supporting it), I like it because it purposefully sidesteps the question of the need to define “information resource” (see his Section 2.7). Booth’s proposal is also explicit in its rejection of implied meaning in URIs and in its embrace of the idea of a protocol. Remember, all that is being put forward in any of these proposals is a mechanism for distinguishing between retrievable content obtainable at a given URL and a description of something found at a URI. By ratcheting down the implied intent, Booth’s proposal is more consistent with the purpose of the guidance and is not guilty of overreach.

Keep It Simple

One of the real strengths of Booth’s proposal is its rejection of the prescriptive method proposed by the TAG for suggesting an alternative to httpRange-14 [7]. The objective should be parsimony: be simple, be clear, and remain somewhat relaxed in terms of mechanisms and prescriptions. I believe use patterns — negotiated via adoption between publishers and consumers — will tell us over time what the “right” solutions may be.

Amongst the proposals put forward so far, David Booth’s is the most “neutral” with respect to imposed meanings or mechanisms, and is the simplest. Though I quibble in some respects, I offer qualified support for his alternative because it:

  • Sidesteps the “information resource” definition (though weaker than I would want; see above)
  • Addresses only the specific HTTP and HTTPS cases
  • Avoids the constrained response format suggested by the TAG
  • Explicitly rejects assigning innate meanings to URIs
  • Poses the solution as a protocol (an understanding between publisher and consumer) rather than defining or establishing a meaning via naming
  • Provides multiple “cow paths” by which resource definitions can be conveyed (see the sketch after this list), which gives publishers and consumers choice and offers the best chance for more well-trodden paths to emerge
  • Does not call for an outright repeal of the httpRange-14 rule, but retains it as one of multiple options for URI owners to describe resources
  • Permits the use of an HTTP 200 response with RDF content as a means of conveying a URI definition
  • Retains the use of the hash URI as an option
  • Provides alternatives to those who cannot easily (or at all) use the 303 See Other redirect mechanism, and
  • Simplifies the language and the presentation.
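
To show what these multiple paths might look like in practice, here is a rough, hypothetical sketch of a consumer that accepts a URI definition however the publisher chose to convey it: via a hash URI, a 303 See Other redirect, or a direct 200 response carrying RDF. This is not Booth’s uddp specification itself, only an illustration of the kind of publisher choice and consumer tolerance the proposal permits; the helper name, media-type list and URI handling are assumptions, and it again relies on the Python requests library.

    import requests  # assumed available third-party HTTP library

    RDF_TYPES = ("text/turtle", "application/rdf+xml", "application/ld+json")

    def find_definition(uri):
        """Hypothetical consumer tolerant of several publishing patterns."""
        # Cow path 1: hash URI -- the defining document is the part before '#'.
        if "#" in uri:
            return uri.split("#", 1)[0]

        resp = requests.get(uri, allow_redirects=False,
                            headers={"Accept": ", ".join(RDF_TYPES)})

        # Cow path 2: the classic 303 See Other redirect to a describing document.
        if resp.status_code == 303:
            return resp.headers.get("Location")

        # Cow path 3: a direct 200 whose body is RDF that defines the URI.
        if resp.status_code == 200:
            ctype = resp.headers.get("Content-Type", "").split(";")[0].strip()
            if ctype in RDF_TYPES:
                return uri  # the response itself carries the URI's definition

        return None  # no definition conveyed by any of the recognized paths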

I would wholeheartedly support this approach were two things to be added: 1) the complete abandonment of all “information resource” terminology; and 2) an official demotion of the httpRange-14 rule, replacing it with a slash 303 option on an equal footing with the other options. I suspect that, if the TAG adopts this option, subsequent scrutiny and input might address these issues and improve its clarity even further.

There are other alternatives submitted, prominently the one by Jeni Tennison with many co-signatories [8]. This one, too, embraces multiple options and cow paths. However, it has the disadvantage of embedding itself in the same flawed terminology and structure offered by httpRange-14.


[1] For my recent discussion about the history of these issues, see M.K. Bergman, 2012. “Give Me a Sign: What Do Things Mean on the Semantic Web?,” in AI3:::Adaptive Information blog, January 24, 2012; see http://www.mkbergman.com/994/give-me-a-sign-what-do-things-mean-on-the-semantic-web/.
[2] In all fairness, this call was the result of ISSUE-57, which had its own constraints. Not knowing all of the background that led to the httpRange-14 Pandora’s Box being opened again, the benefit of the doubt would be that the form and approach prescribed by the TAG dictated the current approach. In any event, now that the Box is open, all pertinent issues should be addressed, and the form of the final resolution should not be constrained away from what makes the best sense and is most pragmatic.
[3] David Booth’s alternative proposal is for the “URI Definition and Discovery Protocol” (uddp). The actual submission according to form is found here.
[4] See Bernard Vatant, 2012. “Beyond httpRange-14 Addiction,” the wheel and the hub blog, March 27, 2012. See http://blog.hubjects.com/2012/03/beyond-httprange-14-addiction.html.
[5] M.K. Bergman, 2007. “More Structure, More Terminology and (hopefully) More Clarity,” in AI3:::Adaptive Information blog, July 27, 2007; see http://www.mkbergman.com/391/more-structure-more-terminology-and-hopefully-more-clarity/. Subsequent to that piece, I have written further on semantic Web semantics in “The Semantic Web and Industry Standards” (January 26, 2008), “The Shaky Semantics of the Semantic Web” (March 12, 2008), “Semantic Web Semantics: Arcane, but Important” (April 8, 2008), “The Semantics of Context” (May 6, 2008), “When Linked Data Rules Fail” (November 16, 2009), “The Semantic ‘Gap’” (October 24, 2010) and [1].
[6] Claude E. Shannon, 1948. “A Mathematical Theory of Communication,” Bell System Technical Journal, Vol. 27, pp. 379–423, 623–656, 1948.

[7] In the “Call for proposals to amend the ‘httpRange-14 resolution’” (February 29, 2012), Jonathan Rees (presumably on behalf of the TAG) stated this as one of the rules of engagement: “9. Kindly avoid arguing in the change proposals over the terminology that is used in the baseline document. Please use the terminology that it uses. If necessary discuss terminology questions on the list as document issues independent of the 303 question.” The specific template form for alternative proposals was also prescribed. In response to interactions on this question on the mailing list, Jonathan stated:

If it were up to me I’d purge “information resource” from the document, since I don’t want to argue about what it means, and strengthen the (a) clause to be about content or instantiation or something. But the document had to reflect the status quo, not things as I would have liked them to be.
I have not submitted this as a change proposal because it doesn’t address ISSUE-57, but it is impossible to address ISSUE-57 with a 200-related change unless this issue is addressed, as you say, head on. This is what I’ve written in my TAG F2F preparation materials.
[8] Jeni Tennison, 2012. “httpRange-14 Change Proposal,” submitted March 25, 2012. See the mailing list notice and actual proposal.
