Posted: March 27, 2012

Casting My Vote on Revising httpRange-14

The httpRange-14 issue and its predecessor “identity crisis” debate have been active for more than a decade on the Web [1]. It has been around so long that most acknowledge “fatigue” and it has acquired the rarefied status of a permathread. Many want to throw up their hands when they hear of it again, and some feel — because of its duration and lack of resolution — that there will never be closure on the question. Yet everyone continues to argue, and then wonders why actual consumption of linked data remains so problematic.

Jonathan Rees is to be thanked for refusing to let this sleeping dog lie. This issue is not going to go away so long as its basis and existing prescriptions are, in essence, incoherent. As a member of the W3C’s TAG (Technical Architecture Group), Rees has worked diligently to re-surface and re-frame the discussion. While I don’t agree with some of the specifics and especially with the constrained approach proposed for resolving this question [2], the sleeping dog has indeed been poked and is awake. For that we can thank Jonathan. Maybe now we can get it right and move on.

I don’t agree with how this issue has been re-framed and I don’t agree that responses to it must be constrained to the prescriptive approach specified in the TAG’s call for comments. Yet, that being said, as someone who has been vocal for years about the poor semantics of the semantic Web community, I feel I have an obligation to comment on this official call.

Thus, I am casting my vote behind David Booth’s alternative proposal [3], with one major caveat. I first explain the caveat and then my reasons for supporting Booth’s proposal. I have chosen not to submit a separate alternative in order not to add further to the noise, a path Bernard Vatant (and, I’m sure, many, many others) has also chosen [4].

Bury the Notion of ‘Information Resource’ Once and for All

I first commented on the absurdity of the ‘information resource’ terminology about five years ago [5]. Going back to Claude Shannon [6] we have come to understand information as entropy (or, more precisely, as differences in energy state). One need not get that theoretical to see that this terminology is confusing. “Information resource” is a term that defies understanding (meaning) or precision. It is also a distinction that leads to a natural counter-distinction, the “non-information resource”, which is also an imprecise absurdity.

What the confusing term is meant to encompass is web-accessible content (“documents”), as opposed to descriptions of (or statements about) things. This distinction then triggers a different understanding of a URI (locator v identifier alone) and different treatments of how to process and interpret that URI. But the term is so vague and easily misinterpreted that all of the guidance behind the machinery to be followed gets muddied, too. Even in the current chapter of the debate, key interlocutors confuse and disagree as to whether a book is an “information resource” or not. If we can’t basically separate the black balls from the white balls, how are we to know what to do with them?

If there must be a distinction, it should be based on the idea of the actual content of a thing — or perhaps more precisely web-accessible content or web-retrievable content — as opposed to the description of a thing. If there is a need to name this class of content things (a position that David Booth prefers, pers. comm.), then let’s use one of these more relevant terms and drop “information resource” (and its associated IR and NIR acronyms) entirely.

The motivation behind the “information resource” terminology also appears to be a desire that somehow a URI alone can convey the name of what a thing is or what it means. I recently tried to blow this notion to smithereens by using Peirce’s discussion of signs [1]. We should understand that naming and meaning may only be provided by the owner of a URI through additional explication, and then through what is understood by the recipient; the string of the URI itself conveys very little (or no) meaning in any semantic sense.

We should ban the notion of “information resource” forever. If the first exposure a potential new publisher or consumer of linked data encounters is “information resource”, we have immediately lost the game. Unresolvable abstractions lead to incomprehension and confusion.

The approach taken by the TAG in requesting new comments on httpRange-14 only compounds this problem. First, the guidance is to not allow any questioning of the “information resource” terminology within the prescribed comment framework [7]. Then, in the suggested framework for response, still further terminology such as “probe URIs”, “URI documentation carrier” or “nominal URI documentation carrier for a URI” is introduced. Aaaaarggghh! This only furthers the labored and artificial terminology common to this particular standards effort.

While Booth’s proposal does not call for an outright rejection of the “information resource” terminology (my one major qualification in supporting it), I like it because it purposefully sidesteps the question of the need to define “information resource” (see his Section 2.7). Booth’s proposal is also explicit in its rejection of implied meaning in URIs and through its embrace of the idea of a protocol. Remember, all that is being put forward in any of these proposals is a mechanism for distinguishing between retrievable content obtainable at a given URL and a description of something found at a URI. By ratcheting down the implied intent, Booth’s proposal is more consistent with the purpose of the guidance and is not guilty of overreach.

Keep It Simple

One of the real strengths of Booth’s proposal is its rejection of the prescriptive method proposed by the TAG for suggesting an alternative to httpRange-14 [7]. The parsimonious objective should be to be simple, be clear, and be somewhat relaxed in terms of mechanisms and prescriptions. I believe use patterns — negotiated via adoption between publishers and consumers — will tell us over time what the “right” solutions may be.

Amongst the proposals put forward so far, David Booth’s is the most “neutral” with respect to imposed meanings or mechanisms, and is the simplest. Though I quibble in some respects, I offer qualified support for his alternative because it:

  • Sidesteps the “information resource” definition (though weaker than I would want; see above)
  • Addresses only the specific HTTP and HTTPS cases
  • Avoids the constrained response format suggested by the TAG
  • Explicitly rejects assigning innate meanings to URIs
  • Poses the solution as a protocol (an understanding between publisher and consumer) rather than defining or establishing a meaning via naming
  • Provides multiple “cow paths” by which resource definitions can be conveyed, which gives publishers and consumers choice and offers the best chance for more well-trodden paths to emerge
  • Does not call for an outright repeal of the httpRange-14 rule, but retains it as one of multiple options for URI owners to describe resources
  • Permits the use of an HTTP 200 response with RDF content as a means of conveying a URI definition
  • Retains the use of the hash URI as an option
  • Provides alternatives for those who can not easily (or at all) use the 303 See Other redirect mechanism (a brief code sketch of these mechanisms follows this list), and
  • Simplifies the language and the presentation.
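To make the mechanics of these options concrete, here is a minimal, illustrative sketch (in Python, using the requests library) of how a consumer might probe a URI under the main options listed above — the hash URI, the 303 See Other redirect, and a 200 response carrying RDF. It is a sketch under assumptions, not part of Booth’s actual submission, and the example URI is hypothetical.

    import requests

    def fetch_description(uri):
        """Retrieve the document that conveys a URI's definition."""
        # Hash URI option: the fragment is stripped before the HTTP request,
        # so the retrieved document describes the fragment-identified thing.
        request_url = uri.split("#")[0]

        resp = requests.get(
            request_url,
            headers={"Accept": "text/turtle, application/rdf+xml;q=0.9"},
            allow_redirects=False,
        )

        if resp.status_code == 303:
            # httpRange-14 option: follow the 303 See Other redirect to the
            # document that describes the originally requested resource.
            resp = requests.get(resp.headers["Location"],
                                headers={"Accept": "text/turtle"})

        # Otherwise, a 200 response whose body is RDF is itself treated as
        # conveying the URI definition (another of the options retained above).
        return resp

    resp = fetch_description("http://example.org/id/toucan")
    print(resp.status_code, resp.headers.get("Content-Type"))

Whichever path the publisher chooses, the consumer ends up with a describing document; that negotiated understanding is the whole of the protocol at issue.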

I would wholeheartedly support this approach were two things to be added: 1) the complete abandonment of all “information resource” terminology; and 2) an official demotion of the httpRange-14 rule, replacing it with a slash-303 option on an equal footing with the other options. I suspect that if the TAG adopts this option, subsequent scrutiny and input might address these issues and improve its clarity even further.

There are other alternatives submitted, prominently the one by Jeni Tennison with many co-signatories [8]. This one, too, embraces multiple options and cow paths. However, it has the disadvantage of embedding itself into the same flawed terminology and structure as offered by httpRange-14.


[1] For my recent discussion about the history of these issues, see M.K. Bergman, 2012. “Give Me a Sign: What Do Things Mean on the Semantic Web?,” in AI3:::Adaptive Information blog, January 24, 2012; see http://www.mkbergman.com/994/give-me-a-sign-what-do-things-mean-on-the-semantic-web/.
[2] In all fairness, this call was the result of ISSUE-57, which had its own constraints. Not knowing all of the background that led to the httpRange-14 Pandora’s Box being opened again, the benefit of the doubt would be that the form and approach prescribed by the TAG dictated the current approach. In any event, now that the Box is open, all pertinent issues should be addressed and the form of the final resolution should also not be constrained from what makes best sense and is most pragmatic.
[3] David Booth‘s alternative proposal is for the “URI Definition and Discovery Protocol” (uddp). The actual submission according to form is found here.
[4] See Bernard Vatant, 2012. “Beyond httpRange-14 Addiction,” the wheel and the hub blog, March 27, 2012. See http://blog.hubjects.com/2012/03/beyond-httprange-14-addiction.html.
[5] M.K. Bergman, 2007. “More Structure, More Terminology and (hopefully) More Clarity,” in AI3:::Adaptive Information blog, July 27, 2007; see http://www.mkbergman.com/391/more-structure-more-terminology-and-hopefully-more-clarity/. Subsequent to that piece, I have written further on semantic Web semantics in “The Semantic Web and Industry Standards” (January 26, 2008), “The Shaky Semantics of the Semantic Web” (March 12, 2008), “Semantic Web Semantics: Arcane, but Important” (April 8, 2008), “The Semantics of Context” (May 6, 2008), “When Linked Data Rules Fail” (November 16, 2009), “The Semantic ‘Gap’” (October 24, 2010) and [1].
[6] Claude E. Shannon, 1948. “A Mathematical Theory of Communication,” Bell System Technical Journal, Vol. 27, pp. 379–423, 623–656, 1948.

[7] In the “Call for proposals to amend the ‘httpRange-14 resolution’” (February 29, 2012), Jonathan Rees (presumably on behalf of the TAG) stated this as one of the rules of engagement: “9. Kindly avoid arguing in the change proposals over the terminology that is used in the baseline document. Please use the terminology that it uses. If necessary discuss terminology questions on the list as document issues independent of the 303 question.” The specific template form for alternative proposals was also prescribed. In response to interactions on this question on the mailing list, Jonathan stated:

If it were up to me I’d purge “information resource” from the document, since I don’t want to argue about what it means, and strengthen the (a) clause to be about content or instantiation or something. But the document had to reflect the status quo, not things as I would have liked them to be.
I have not submitted this as a change proposal because it doesn’t address ISSUE-57, but it is impossible to address ISSUE-57 with a 200-related change unless this issue is addressed, as you say, head on. This is what I’ve written in my TAG F2F preparation materials.
[8] Jeni Tennison, 2012. “httpRange-14 Change Proposal,” submitted March 25, 2012. See the mailing list notice and actual proposal.

Posted: January 24, 2012

Coca-Cola, Toucans and Charles Sanders Peirce

The crowning achievement of the semantic Web is the simple use of URIs to identify data. Further, if the URI identifier can resolve to a representation of that data, it now becomes an integral part of the HTTP access protocol of the Web while providing a unique identifier for the data. These innovations provide the basis for distributed data at global scale, all accessible via Web devices such as browsers and smartphones that are now a ubiquitous part of our daily lives.

Yet, despite these profound and simple innovations, the semantic Web’s designers and early practitioners and advocates have been mired for at least a decade in a muddled, metaphysical argument over what these URIs mean, what they reference, and what their actual true identity is. These muddles about naming and identity, it might be argued, are due to computer scientists and programmers trying to grapple with issues more properly the domain of philosophers and linguists. But that would be unfair. For philosophers and linguists themselves have for centuries also grappled with these same conundrums [1].

As I argue in this piece, part of the muddle results from attempting to do too much with URIs while another part results from not doing enough. I am also not trying to directly enter the fray of current standards deliberations. (Despite a decade of controversy, I optimistically believe that the messy process of argument and consensus building will work itself out [2].) What I am trying to do in this piece, however, is to look to one of America’s pre-eminent philosophers and logicians, Charles Sanders Peirce (pronounced “purse”), to inform how these controversies of naming, identity and meaning may be dissected and resolved.

‘Identity Crisis’, httpRange-14, and Issue 57

The Web began as a way to hyperlink between documents, generally Web pages expressed in the HTML markup language. These initial links were called URLs (uniform resource locators), and each pointed to various kinds of electronic resources (documents) that could be accessed and retrieved on the Web. These resources could be documents written in HTML or other encodings (PDFs, other electronic formats), images, streaming media like audio or videos, and the like [3].

All was well and good until the idea of the semantic Web, which postulated that information about the real world — concepts, people and things — could also be referenced and made available for reasoning and discussion on the Web. With this idea, the scope of the Web was massively expanded from electronic resources that could be downloaded and accessed via the Web to now include virtually any topic of human discourse. The rub, of course, was that ideas such as abstract concepts or people or things could not be “dereferenced” nor downloaded from the Web.

One of the first things that needed to change was to define a broader concept of a URI “identifier” above the more limited concept of a URL “locator”, since many of these new things that could be referenced on the Web went beyond electronic resources that could be accessed and viewed [3]. But, since what the referent of the URI now actually might be became uncertain — was it a concept or a Web page that could be viewed or something else? — a number of commentators began to note this uncertainty as the “identity crisis” of the Web [4]. The topic took on much fervor and metaphysical argument, such that by 2003, Sandro Hawke, a staffer of the standards-setting W3C (World Wide Web Consortium), was able to say, “This is an old issue, and people are tired of it” [5].

Yet, for many of the reasons described more fully below, the issue refused to go away. The Technical Architecture Group (TAG) of the W3C took up the issue, under a rubric that came to be known as httpRange-14 [6]. The issue was first raised in March 2002 by Tim Berners-Lee, accepted for TAG deliberations in February 2003, with a resolution then offered in June 2005 [7]. (Refer to the original resolution and other information [6] to understand the nuances of this resolution, since particular commentary on that approach is not the focus of this article.) Suffice it to say here, however, that this resolution posited an entirely new distinction of Web content into “information resources” and “non-information resources”, and also recommended the use of the HTTP 303 redirect code for cases when agents requesting a URI should be directed to concepts versus viewable documents.

This “resolution” has been anything but. Not only can no one clearly distinguish these de novo classes of “information resources” [19], but the whole approach felt arbitrary and kludgy.

Meanwhile, the confusions caused by the “identity crisis” and httpRange-14 continued to perpetuate themselves. In 2006, a major workshop on “Identity, Reference and the Web” (IRW 2006) was held in conjunction with the Web’s major WWW2006 conference in Edinburgh, Scotland, on May 23, 2006 [8]. The various presentations and its summary (by Harry Halpin) are very useful to understand these issues. What was starting to jell at this time was the understanding that the basis of identity and meaning on the Web posed new questions, and ones that philosophers, logicians and linguists needed to be consulted to help inform.

The fiat of the TAG’s 2005 resolution has failed to take hold. Over the ensuing years, various eruptions have occurred on mailing lists and within the TAG itself (now expressed as Issue 57) to revisit these questions and bring matters forward into some coherent new understanding. Though linked data has been premised on best-practice implementation of these resolutions [9], and has been a qualified success, many (myself included) would claim that the extra steps and inefficiencies required by the TAG’s httpRange-14 guidance have been hindrances, not facilitators, of the uptake of linked data (or the semantic Web).

Today, despite the efforts of some to claim the issue closed, it is not. Issue 57 and the periodic bursts from notable semantic Web advocates such as Ian Davis [10], Pat Hayes and Harry Halpin [11], Ed Summers [12], Xiaoshu Wang [13], David Booth [14] and TAG members themselves, such as Larry Masinter [15] and Jonathan Rees [16], point to continued irresolution and discontent within the advocate community. Issue 57 currently remains open. Meanwhile, I think, all of us interested in such matters can express concern that linked data, the semantic Web and interoperable structured data have seen less uptake than any of us had hoped or wanted over the past decade. As I have stated elsewhere, unclear semantics and muddled guidelines help to undercut potential use.

As each of the eruptions over these identity issues has occurred, the competing camps have often been characterized as “talking past one another”; that is, not communicating in such a way as to help resolve to consensus. While it is hardly my position to do so, I try to encapsulate below the various positions and prejudices as I see them in this decades-long debate. I also try to share my own learning that may help inform some common ground. Forgive me if I overly simplify these vexing issues by returning to what I see as some first principles . . . .

What’s in a Name?

Original Coca-Cola bottle

One legacy of the initial document Web is the perception that Web addresses have meaning. We have all heard of the multi-million dollar purchasing of domains [17] and the adjudication that may occur when domains are hijacked from their known brands or trademark owners. This legacy has tended to imbue URIs with a perceived value. It is not by accident, I believe, that many within the semantic Web and linked data communities still refer to “minting” URIs. Some believe that ownership and control over URIs may be equivalent to grabbing up valuable real estate. It is also the case that many believe the “name” given to a URI acts to name the referent to which it refers.

This perception is partially true, partially false, but moreover incomplete in all cases. We can illustrate these points with the global icon, “Coca-Cola”.

As for the naming aspects, let’s dissect what we mean when we use the label “Coca-Cola” (in a URI or otherwise). Perhaps the first thing that comes to mind is “Coca-Cola,” the beverage (which has a description on Wikipedia, among other references). Because of its ubiquity, we may also recognize the image of the Coca-Cola bottle to the left as a symbol for this same beverage. (Though, in the hilarious movie The Gods Must Be Crazy, Kalahari Bushmen, who had no prior experience of Coca-Cola, took the bottle to be magical with evil powers [18].) Yet even as a reference to the beverage, the naming aspects are a bit cloudy since we could also use the fully qualified synonyms of “Coke”, “Coca-cola” (small C), “Classic Coke” and the hundreds of language variants worldwide.

On the other hand, the label “Coca-Cola” could just as easily conjure The Coca-Cola Company itself. Indeed, the company web site is the location pointed to by the URI of http://www.thecoca-colacompany.com/. But, even that URI, which points to the home Web page of the company, does not do justice to conveying an understanding or description of the company. For that, additional URIs may need to be invoked, such as the description at Wikipedia, the company’s own company description page, plus perhaps the company’s similar heritage page.

Of course, even these links and references only begin to scratch the surface of what the company Coca-Cola actually is: headquarters, manufacturing facilities, 140,000 employees, shareholders, management, legal entities, patents and Coke recipe, and the like. Whether in human languages or URIs, in any attempt to signify something via symbols or words (themselves another form of symbol), we risk ambiguity and incompleteness.

URI shorteners also undercut the idea that a URI necessarily “names” something. Using the service bitly, we can shorten the link to the Wikipedia description of the Coke beverage to http://bit.ly/xnbA6 and we can shorten the link to The Coca-Cola Company Web site to http://bit.ly/9ojUpL. I think we can fairly say that neither of these shortened links “name” their referents. The most we can say about a URI is that it points to something. With the vagaries of meaning in human languages, we might also say that URIs refer to something, denote something or identify (but not in the sense of completely define) something.

From this discussion, we can assert with respect to the use of URIs as “names” that:

  1. In all cases, URIs are pointers to a particular referent
  2. In some cases, URIs do act to “name” some things
  3. Yet, even when used as “names,” there can be ambiguity as to what exactly the referent is that is denoted by the name
  4. Resolving what such “names” mean is a matter of context and reference to further information or links, and
  5. Because URIs may act as “names”, it is appropriate to consider social conventions and contracts (e.g., trademarks, brands, legal status) in adjudicating who can own the URI.

In summary, I think we can say that URIs may act as names, but not in all or even most cases, and when used as such are often ambiguous. Treating URIs absolutely as names is way too heavy a burden, and incorrect in most cases.

What is a Resource?

The “name” discussion above masks that in some cases we are talking about a readable Web document or image (such as the Wikipedia description of the Coke beverage or its image) versus the “actual” thing in the real world (the Coke beverage itself or even the company). This distinction is what led to the so-called “identity crisis”, for which Ian Davis has used a toucan as his illustrative thing [10].
Keel-billed Toucan

As I note in the conclusion, I like Davis’ approach to the identity conundrum insofar as Web architecture and linked data guidance are concerned. But here my purpose is more subtle: I want to tease apart still further the apparent distinction between an electronic description of something on the Web and the “actual” something. Like Davis, let’s use the toucan.

In our strawman case, we too use a description of the toucan (on Wikipedia) to represent our “information resource” (the accessible, downloadable electronic document). We contrast to that a URI that we mean to convey the actual physical bird (a “non-information resource” in the jumbled jargon of httpRange-14), which we will designate via the URI of http://example.com/toucan.

Despite the tortured (and newly conjured) distinction between “information resource” and “non-information resource”, the first blush reaction is that, sure, there is a difference between an electronic representation that can be accessed and viewed on the Web and its true, “actual” thing. Of course people can not actually be rendered and downloaded on the Web, but their bios and descriptions and portrait images may. While in the abstract such distinctions appear true and obvious, in the specifics that get presented to experts, there is surprising disagreement as to what is actually an “information resource” v. a “non-information resource” [19]. Moreover, as we inspect the real toucan further, even that distinction is quite ambiguous.

When we inspect what might be a definitive description of “toucan” on Wikipedia, we see that the term more broadly represents the family of Ramphastidae, which contains five genera and forty different species. The picture we are showing to the right is but of one of those forty species, that of the keel-billed toucan (Ramphastos sulfuratus). Viewing the images of the full list of toucan species shows just how divergent these various “physical birds” are from one another. Across all species, average sizes vary by more than a factor of three with great variation in bill sizes, coloration and range. Further, if I assert that the picture to the right is actually that of my pet keel-billed toucan, Pretty Bird, then we can also understand that this representation is for a specific individual bird, and not the physical keel-billed toucan species as a whole.

The point of this diversion is not a lecture on toucans, but an affirmation that distinctions between “resources” occur at multiple levels and dimensions. Just as there is no self-evident criteria as to what constitutes an “information resource”, there is also not a self-evident and fully defining set of criteria as to what is the physical “toucan” bird. The meaning of what we call a “toucan” bird is not embodied in its label or even its name, but in the context and accompanying referential information that place the given referent into a context that can be communicated and understood. A URI points to (“refers to”) something that causes us to conjure up an understanding of that thing, be it a general description of a toucan, a picture of a toucan, an understanding of a species of toucan, or a specific toucan bird. Our understanding or interpretation results from the context and surrounding information accompanying the reference.

In other words, a “resource” may be anything, which is just the way the W3C has defined it. There is not a single dimension which, magically, like “information” and “non-information,” can cleanly and definitively place a referent into some state of absolute understanding. To assert that such magic distinctions exist is a flaw of Cartesian logic, which can only be reconciled by looking to more defensible bases in logic [20].

Peirce and the Logic of Signs

The logic behind these distinctions and nuances leads us to Charles Sanders Peirce (1839 – 1914). Peirce was an American logician, philosopher and polymath of the first rank. Along with Frege, he is acknowledged as the father of predicate calculus and the notation system that formed the basis of first-order logic. His symbology and approach arguably provide the logical basis for description logics and other aspects underlying the semantic Web building blocks of the RDF data model and, eventually, the OWL language. Peirce is the acknowledged founder of pragmatism, the philosophy of linking practice and theory in a process akin to the scientific method. He was also the first formulator of existential graphs, an essential basis to the whole field now known as model theory. Though often overlooked in the 20th century, Peirce has lately been enjoying a renaissance with his voluminous writings still being deciphered and published.

The core of Peirce’s world view is based in semiotics, the study and logic of signs. In his seminal writing on this, “What is in a Sign?” [21], he wrote that “every intellectual operation involves a triad of symbols” and “all reasoning is an interpretation of signs of some kind”. Peirce had a predilection for expressing his ideas in “threes” throughout his writings.

Semiotics is often split into three branches: 1) syntactics – relations among signs in formal structures; 2) semantics – relations between signs and the things to which they refer; and 3) pragmatics – relations between signs and the effects they have on the people or agents who use them.

Peirce’s logic of signs in fact is a taxonomy of sign relations, in which signs get reified and expanded via still further signs, ultimately leading to communication, understanding and an approximation of “canonical” truth. Peirce saw the scientific method as itself an example of this process.

A given sign is a representation amongst the triad of the sign itself (which Peirce called a representamen, the actual signifying item that stands in a well-defined kind of relation to the two other things), its object and its interpretant. The object is the actual thing itself. The interpretant is how the agent or the perceiver of the sign understands and interprets the sign. Depending on the context and use, a sign (or representamen) may be either an icon (a likeness), an indicator or index (a pointer or physical linkage to the object) or a symbol (understood convention that represents the object, such as a word or other meaningful signifier).

An interpretant in its barest form is a sign’s meaning, implication, or ramification. For a sign to be effective, it must represent an object in such a way that it is understood and used again. This makes the assignment and use of signs a community process of understanding and acceptance [20], as well as a truth-verifying exercise of testing and confirming accepted associations.

John Sowa has done much to help make some of Peirce’s obscure language and terminology more accessible to lay readers [22]. He has expressed Peirce’s basic triad of sign relations as follows, based around the Yojo animist cat figure used by the character Queequeg in Herman Melville’s Moby-Dick:

The Triangle of Meaning

In this figure, object and symbol are the same as the Peirce triad; concept is the interpretant in this case. The use of the word ‘Yojo’ conjures the concept of cat.

This basic triad representation has been used in many contexts, with various replacements or terms at the nodes. Its basic form is known as the Meaning Triangle, as was popularized by Ogden and Richards in 1923 [23].

The key aspect of signs for Peirce, though, is the ongoing process of interpretation and reference to further signs, a process he called semiosis. A sign of an object leads to interpretants, which, as signs, then lead to further interpretants. In the Sowa example below, we show how meaning triangles can be linked to one another, in this case by abstracting that the triangles themselves are concepts of representation; we can abstract the ideas of both concept and symbol:

Representing an Object by a Concept

We can apply this same cascade of interpretation to the idea of the sign (or representamen), which in this case shows that a name can be related to a word symbol, which in itself is a combination of characters in a string called ‘Yojo’:

Representing Signs of Signs of Signs

According to Sowa [22]:

“What is revolutionary about Peirce’s logic is the explicit recognition of multiple universes of discourse, contexts for enclosing statements about them, and metalanguage for talking about the contexts, how they relate to one another, and how they relate to the world and all its events, states, and inhabitants.
“The advantage of Peircean semiotics is that it firmly situates language and logic within the broader study of signs of all types. The highly disciplined patterns of mathematics and logic, important as they may be for science, lie on a continuum with the looser patterns of everyday speech and with the perceptual and motor patterns, which are organized on geometrical principles that are very different from the syntactic patterns of language or logic.”

Catherine Legg [20] notes that the semiotic process is really one of community involvement and consensus. Each understanding of a sign and each subsequent interpretation helps come to a consensus of what a sign means. It is a way of building a shared understanding that aids communication and effective interpretation. In Peirce’s own writings, the process of interpretation can lead to validation and an eventual “canonical” or normative interpretation. The scientific method itself is an extreme form of the semiotic process, leading ultimately to what might be called accepted “truths”.

Peircean Semiotics of URIs

So, how do Peircean semiotics help inform us about the role and use of URIs? Does this logic help provide guidance on the “identity crisis”?

The Peircean taxonomy of signs has three levels with three possible sign roles at each level, leading to a possible 27 combinations of sign representations. However, because not all sign roles are applicable at all levels, Peirce actually postulated only ten distinct sign representations.

Common to all roles, the URI “sign” is best seen as an index: the URI is a pointer to a representation of some form, be it electronic or otherwise. This representation bears a relation to the actual thing that this referent represents, as is true for all triadic sign relationships. However, in some contexts, again in keeping with additional signs interpreting signs in other roles, the URI “sign” may also play the role of a symbolic “name” or even as a signal that the resource can be downloaded or accessed in electronic form. In other words, by virtue of the conventions that we choose to assign to our signs, we can supply additional information that augments our understanding of what the URI is, what it means, and how it is accessed.

Of course, in these regards, a URI is no different than any other sign in the Peircean world view: it must reside in a triadic relationship to its actual object and an interpretation of that object, with further understanding only coming about by the addition of further signs and interpretations.

In shortened form, this means that a URI, acting alone, can at most play the role of a pointer between an object and its referent. A URI alone, without further signs (information), can not inform us well about names or even what type of resource may be at hand. For these interpretations to be reliable, more information must be layered on, either by accepted convention of the current signs or by the addition of still further signs and their interpretations. Since the attempt to deal with the nature of a URI resource by fiat, as stipulated by httpRange-14, meets neither the standard of consensus nor that of empirical validity, it can not by definition become “canonical”. This does not mean that httpRange-14 and its recommended practices can not help in providing more information and aiding interpretation of what the nature of a resource may be. But it does mean that httpRange-14 acting alone is insufficient to resolve ambiguity.

Moreover, what we see in the general nature of Peirce’s logic of signs is the usefulness of adding more “triads” of representation as the process to increase understanding through further interpretation. Kind of sounds like adding on more RDF triples, does it not?
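That observation can be made concrete with a small sketch, assuming the rdflib Python library. The snippet layers additional triples — a type, a label, and a link to a describing document — onto a bare URI; each added statement acts as a further interpretant of the same pointer. The URIs and the choice of FOAF and Dublin Core terms are purely illustrative, not anything prescribed by the standards discussed here.

    from rdflib import Graph, URIRef, Literal
    from rdflib.namespace import RDF, RDFS, FOAF, DCTERMS

    g = Graph()
    thing = URIRef("http://example.com/toucan")            # the referent (the bird)
    page  = URIRef("http://en.wikipedia.org/wiki/Toucan")  # a document about it

    # Each triple is a further "sign about the sign": it narrows how the bare
    # pointer http://example.com/toucan is to be interpreted.
    g.add((thing, RDF.type, URIRef("http://dbpedia.org/ontology/Bird")))
    g.add((thing, RDFS.label, Literal("keel-billed toucan", lang="en")))
    g.add((thing, FOAF.isPrimaryTopicOf, page))
    g.add((page, DCTERMS.description, Literal("An encyclopedia article about toucans.")))

    print(g.serialize(format="turtle"))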

Global is Neither Indiscriminate Nor Unambiguous

Names, references, identity and meaning are not absolutes. They are not philosophically, and they are not in human language. To expect machine communications to hold to different standards and laws than human communications is naive. To effect machine communications our challenge is not to devise new rules, but to observe and apply the best rules and practices that human communications instruct.

There has been an unstated hope at the heart of the semantic Web enterprise that simply expressing statements in the right way (syntax) and in the right form (RDF) is sufficient to facilitate machine communications. But this hope, too, is naive and silly. Just as we do not accept all human utterances as truth, neither will we accept all machine transmissions as reliable. Some of the information will be posted in error; some will be wrong or ill-fitting to our world view; some will be malicious or intended to deceive. Spam and occasionally lousy search results on the Web tell us that Web documents are subject to these sources of unsuitability; why would the same not be true of data?

Thus, global data access via the semantic Web is not — and can never be — indiscriminate nor unambiguous. We need to understand and come to trust sources and provenance; we need interpretation and context to decide appropriateness and validity; and we need testing and validation to ensure messages as received are indeed correct. Humans need to do these things in their normal courses of interaction and communication; our machine systems will need to do the same.

These confirmations and decisions as to whether the information we receive is actionable or not will come about via still more information. Some of this information may come about via shared conventions. But most will come about because we choose to provide more context and interpretation for the core messages we hope to communicate.

A Go-Forward Approach

Nearly five years ago Hayes and Halpin put forth a proposal to add ex:refersTo and ex:describedBy to the standard RDF vocabulary as a way for authors to provide context and explanation for what constituted a specific RDF resource [11]. In various ways, many of the other individuals cited in this article have come to similar conclusions. The simple redirect suggestions of both Ian Davis [10] and Ed Summers [12] appear particularly helpful.

Over time, we will likely need further representations about resources regarding such things as source, provenance, context and other interpretations that would help remove ambiguities as to how the information provided by that resource should be consumed or used. These additional interpretations can mechanically be provided via referenced ontologies or embedded RDFa (or similar). These additional interpretations can also be aided by judicious, limited additions of new predicates to basic language specifications for RDF (such as the Hayes and Halpin suggestions).
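As an illustration only — the ex: namespace is a placeholder, just as it was in the Hayes and Halpin suggestion, none of these predicates is standardized, and the URIs are hypothetical — a publisher using such a vocabulary might annotate a URI along these lines (again assuming rdflib):

    from rdflib import Graph, URIRef, Namespace

    EX = Namespace("http://example.org/vocab#")   # placeholder vocabulary
    g = Graph()
    g.bind("ex", EX)

    uri  = URIRef("http://example.com/toucan")
    bird = URIRef("http://example.com/toucan#thing")
    page = URIRef("http://en.wikipedia.org/wiki/Toucan")

    # The publisher states outright what the URI refers to and where it is
    # described, instead of leaving that distinction to response codes alone.
    g.add((uri, EX.refersTo, bird))
    g.add((uri, EX.describedBy, page))

    print(g.serialize(format="turtle"))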

In the end, of course, any frameworks that achieve consensus and become widely adopted will be simple to use, easy to understand, and straightforward to deploy. The beauty of best practices in predicates and annotations is that failures to provide are easy to test. Parties that wish to have their data consumed have incentive to provide sufficient information so as to enable interpretation.

There is absolutely no reason that these additions can not co-exist with the current httpRange-14 approach. By adding a few other options and making clear the optional use of httpRange-14, we would be very Peirce-like in our go-forward approach: we would be pragmatic while adding more means to improve our interpretations of what a Web resource is and is meant to be.


[1] Throughout intellectual history, a number of prominent philosophers and logicians have attempted to describe naming, identity and reference of objects and entities. Here are a few that you may likely encounter in various discussions of these topics in reference to the semantic Web; many are noted philosophers of language:

  • Aristotle (384 BC – 322 BC) – founder of formal logic; formulator and proponent of categorization; believed in the innate “universals” of various things in the natural world
  • Rudolf Carnap (1891 – 1970) – proposed a logical syntax that provided a system of concepts, a language, to enable logical analysis via exact formulas; a basis for natural language processing; rejected the idea and use of metaphysics
  • René Descartes (1596 – 1650) – posited a boundary between mind and the world; the meaning of a sign is the intension of its producer, and is private and incorrigible
  • Friedrich Ludwig Gottlob Frege (1848 – 1925) – one of the formulators of first-order logic, though syntax not adopted; advocated shared senses, which can be objective and sharable
  • Kurt Gödel (1906 – 1978) – his two incompleteness theorems are some of the most important logic contributions of all time; they establish inherent limitations of all but the most trivial axiomatic systems capable of doing arithmetic, as well as for computer programs
  • David Hume (1711 – 1776) – embraced natural empiricism, but kept the Descartes concept of an “idea”
  • Immanuel Kant (1724 – 1804) – one of the major philosophers in history, argued that experience is purely subjective without first being processed by pure reason; a major influence on Peirce
  • Saul Kripke (1940 – ) – proposed the causal theory of reference and what proper names mean via a “baptism” by the namer
  • Gottfried Wilhelm Leibniz (1646 – 1716) – the classic definition of identity is Leibniz’s Law, which states that if two objects have all of their properties in common, they are identical and so only one object
  • Richard Montague (1930 – 1971) – wrote much on logic and set theory; student of Tarski; pioneered a logical approach to natural language semantics; associated with model theory, model-theoretic semantics
  • Charles Sanders Peirce (1839 – 1914) – see main text
  • Willard Van Orman Quine (1908 – 2000) – noted analytical philosopher, advocated the “radical indeterminacy of translation” (can never really know)
  • Bertrand Russell (1872 – 1970) – proposed the direct theory of reference and what it means to “ground in references”; adopted many Peirce arguments without attribution
  • Ferdinand de Saussure (1857 – 1913) – also proposed an alternative view to Peirce of semiotics, one grounded in sociology and linguistics
  • John Rogers Searle (1932 – ) – argues that consciousness is a real physical process in the brain and is subjective; has argued against strong AI (artificial intelligence)
  • Alfred Tarski (1901 – 1983) – analytic philosopher focused on definitions of models and truth; great admirer of Peirce; associated with model theory, model-theoretic semantics
  • Ludwig Josef Johann Wittgenstein (1889 – 1951) – he disavowed his earlier work, arguing that philosophy needed to be grounded in ordinary language, recognizing that the meaning of words is dependent on context, usage, and grammar.
Also, Umberto Eco has been a noted proponent and popularizer of semiotics.
[2] As any practitioner ultimately notes, standards development is a messy, lengthy and trying process. Not all individuals can handle the messiness and polemics involved. Personally, I prefer to try to write cogent articles on specific issues of interest, and then leave it to others to slug it out in the back rooms of standards making. Where the process works well, standards get created that are accepted and adopted. Where the process does not work well, the standards are not embraced as exhibited by real-world use.
[3] Tim Berners-Lee, 2007. What Do HTTP URIs Identify?
This article does not discuss the other sub-category of URIs, URNs (for names). URNs may refer to any standard naming scheme (such as ISBNs for books) and have no direct bearing on any network access protocol, unlike URLs and URIs when they are referenceable. Further, URNs are little used in practice.
[4] Kendall Clark was one of the first to question “resource” and other identity ambiguities, noting the tautology between URI and resource as “anything that has identity.” See Kendall Clark, 2002. “Identity Crisis,” in XML.com, September 11, 2002; see http://www.xml.com/pub/a/2002/09/11/deviant.html. From the topic map community, one notable contribution was from Steve Pepper and Sylvia Schwab, 2003. “Curing the Web’s Identity Crisis,” found at http://www.ontopia.net/topicmaps/materials/identitycrisis.html.
[5] Sandro Hawke, 2003. Disambiguating RDF Identifiers. W3C, January 2003. See http://www.w3.org/2002/12/rdf-identifiers/.
[6] The issue was framed as what is the proper “range” for HTTP referrals and was also the 14th major TAG issue recorded, hence the name. See further the httpRange-14 Webography.
[7] See W3C, “httpRange-14: What is the range of the HTTP dereference function?”; see http://www.w3.org/2001/tag/issues.html#httpRange-14.
[9] Leo Sauermann and Richard Cyganiak, eds., 2008. Cool URIs for the Semantic Web, W3C Interest Group Note, December 3, 2008. See http://www.w3.org/TR/cooluris/.
[10] Ian Davis, 2010. Is 303 Really Necessary? Blog post, November 2010, accessed 20 January 2012. (See http://blog.iandavis.com/2010/11/04/is-303-really-necessary/.) A considerable thread resulted from this post; see http://markmail.org/thread/mkoc5kxll6bbjbxk.
[11] See first Harry Halpin, 2006. “Identity, Reference and Meaning on the Web,” presented at WWW 2006, May 23, 2006. See http://www.ibiblio.org/hhalpin/irw2006/hhalpin.pdf. This was then followed up with greater elaboration by Patrick J. Hayes and Harry Halpin, 2007. “In Defense of Ambiguity,” http://www.ibiblio.org/hhalpin/homepage/publications/indefenseofambiguity.html.
[12] Ed Summers, 2010. Linking Things and Common Sense, blog post of July 7, 2010. See http://inkdroid.org/journal/2010/07/07/linking-things-and-common-sense/.
[13] Xiaoshu Wang, 2007. URI Identity and Web Architecture Revisited, Word document posted on posterous.com, November 2007. (Former Web documents have been removed.)
[14] David Booth, 2006. “URIs and the Myth of Resource Identity,” see http://dbooth.org/2006/identity/.
[15] See Larry Masinter, 2012. “The ‘tdb’ and ‘duri’ URI Schemes, Based on Dated URIs,” 10th version, IETF Network Working Group Internet-Draft, January 12, 2012. See http://tools.ietf.org/html/draft-masinter-dated-uri-10.
[16] Jonathan Rees has been the scribe and author for many of the background documents related to Issue 57. A recent mailing list entry provides pointers to four relevant documents in this entire discussion. See Jonathan A. Rees, 2012. Guide to ISSUE-57 (httpRange-14) document suite, January 21, 2012.
[17] At least twenty domain names, led by insure.com, have sold for more than $2 million each; see this Wikipedia listing.
[18] In the wonderful movie The Gods Must Be Crazy, Bushmen in the Kalahari Desert one day find an unbroken glass Coke bottle that had been thrown out of an airplane. Initially, this strange artifact seems to be another boon from the gods, and the Bushmen find many uses for it. But unlike anything that they have had before, there is only one bottle to go around. This creates jealousy, envy, anger, hatred, even violence. The protagonist, Xi, decides that the bottle is an evil thing and must be thrown off of the edge of the world. The hilarity of the movie comes from that premise and Xi’s encounters with the modern world as he pursues his quest with the magic bottle.

[19] Wang [13] rhetorically asked which of the following things would be categorized as an “information resource”:

  1. A book
  2. A clock
  3. The clock on the wall of my bedroom
  4. A gene
  5. The sequence of a gene
  6. A software
  7. A service
  8. A namespace
  9. An ontology
  10. A language
  11. A number
  12. A concept, such as Dublin Core’s creator.

See the 2007 thread on this issue, mostly by Sean Palmer and Noah Mendelsohn, the latter acknowledging that various experts may only agree on 85% of the items.

[20] See further Catherine Legg, 2010. “Pragmaticism on the Semantic Web,” in Bergman, M., Paavola, S., Pietarinen, A.-V., & Rydenfelt, H., eds., Ideas in Action: Proceedings of the Applying Peirce Conference, pp. 173–188. Nordic Studies in Pragmatism 1. Helsinki: Nordic Pragmatism Network. See http://www.nordprag.org/nsp/1/Legg.pdf.
[21] Charles Sanders Peirce, 1894. “What is in a Sign?”, see http://www.iupui.edu/~peirce/ep/ep2/ep2book/ch02/ep2ch2.htm.
[22] The figures in particular are from John F. Sowa, 2000. “Ontology, Metadata, and Semiotics,” presented at ICCS 2000 in Darmstadt, Germany, on August 14, 2000; published in B. Ganter & G. W. Mineau, eds., Conceptual Structures: Logical, Linguistic, and Computational Issues, Lecture Notes in AI #1867, Springer-Verlag, Berlin, 2000, pp. 55-81. May be found at http://www.jfsowa.com/ontology/ontometa.htm. Also see John F. Sowa, 2006. “Peirce’s Contributions to the 21st Century,” presented at International Conference on Conceptual Structures, Aalborg, Denmark, July 17, 2006. See http://www.jfsowa.com/pubs/csp21st.pdf.
[23] C.K. Ogden and I. A. Richards, 1923. The Meaning of Meaning, Harcourt, Brace, and World, New York, 8th edition 1946.

Posted: September 12, 2011

Five Unique Advantages for the Enterprise

There have been some notable attempts of late to make elevator pitches [1] for semantic technologies, as well as Lee Feigenbaum’s recent series on Are We Asking the Wrong Question? about semantic technologies [2]. Some have attempted to downplay semantic Web connotations entirely and to replace the pitch with Linked Data (capitalized). These are part of a history of various ways to try to make a business case around semantic approaches [3].

What all of these attempts have in common is a view — an angst, if you will — that somehow semantic approaches have not fulfilled their promise. Marketing has failed semantic approaches. Killer apps have not appeared. The public has not embraced the semantic Web consonant with its destiny. Academics and researchers can not make the semantic argument like entrepreneurs can.

Such hand wringing, I believe, is misplaced on two grounds. First, if one looks to end user apps that solely distinguish themselves by the sizzle they offer, semantic technologies are clearly not essential. There are very effective mash-up and data-intensive sites such as many of the investment sites (Fidelity, TDAmeritrade, Morningstar, among many), real estate sites (Trulia, Zillow, among many), community data sites (American FactFinder, CensusScope, City-Data.com, among many), shopping sites (Amazon, Kayak, among many), data visualization sites (Tableau, Factual, among many), etc., etc., that work well, are intuitive and integrate much disparate information. For the most part, these sites rely on conventional relational database backends and have little semantic grounding. Effective data-intensive sites do not require semantics per se [4].

Second, despite common perceptions, semantics are in fact becoming pervasive components of many common and conventional Web sites. We see natural language processing (NLP) and extraction technologies becoming common for most search services. Google and Bing sprinkle semantic results and characterizations across their standard search results. Recommendation engines and targeted ad technologies now routinely use semantic approaches. Ontologies are creeping into the commercial spaces once occupied by taxonomies and controlled vocabularies. Semantics-based suggestion systems are now the common technology used. A surprising number of smartphone apps have semantics at their core.

So, I agree with Lee Feigenbaum that we are asking the wrong question. But I would also add that we are not even looking in the right places when we try to understand the role and place of semantic technologies.

The unwise attempt to supplant the idea of semantic technologies with linked data is only furthering this confusion. Linked data is merely a means for publishing and exposing structured data. While linked data can lead to easier automatic consumption of data, it is not necessary to effective semantic approaches and is actually a burden on data publishers [5]. While that burden may be willingly taken by publishers because of its consumption advantages, linked data is by no means an essential precursor to semantic approaches. None of the unique advantages for semantic technologies noted below rely on or need to be preceded by linked data. In semantic speak, linked data is not the same as semantic technologies.

The essential thing to know about semantic technologies is that they are a conceptual and logical foundation to how information is modeled and interrelated. In these senses, semantic technologies are infrastructural groundings, not applications per se. There is a mindset and worldview associated with the use of semantic technologies that is far more essential to understand than linked data techniques and is certainly more fundamental than elevator pitches or “killer apps.”

Five Unique Advantages

Thus, the argument for semantic technologies needs to be grounded in their foundations. It is within the five unique advantages of semantic technologies described below that the benefits to enterprises ultimately reside.

#1: Modern, Back-end Data Federation

The RDF data model — with its ability to represent the simplest of data up through complicated domain schema and vocabularies via the OWL ontology language — means that any extant data source, schema or structure can be represented via RDF and its extensions. This expressiveness and flexibility means that a common representation for any existing schema may be expressed and, in turn, that any and all data representations can be described in a canonical way.

A shared, canonical representation of all existing schema and data types means that all of that information can now be federated and interrelated. The canonical means of federating information via the RDF data model is the foundational benefit of semantic technologies. Further, the practice of giving URIs as unique identifiers to all of the constituent items in this approach makes it perfectly suitable to today’s reality of distributed data accessible via the Web [6].
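A minimal sketch, assuming the rdflib Python library, of what this federation looks like in practice: two sources that describe the same customer with their own vocabularies are combined simply by parsing them into one graph, because both use the same URI for the same thing. All names, namespaces and URIs here are invented for illustration.

    from rdflib import Graph, URIRef

    crm_data = """
    @prefix crm:  <http://example.org/crm/> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    <http://example.org/id/customer/acme> a foaf:Organization ;
        crm:accountManager "J. Smith" .
    """

    erp_data = """
    @prefix erp: <http://example.org/erp/> .
    <http://example.org/id/customer/acme>
        erp:outstandingInvoices 12 ;
        erp:creditLimit 50000 .
    """

    # Sharing the URI is what federates the two schemas; no mapping tables needed.
    g = Graph()
    g.parse(data=crm_data, format="turtle")
    g.parse(data=erp_data, format="turtle")

    acme = URIRef("http://example.org/id/customer/acme")
    for p, o in g.predicate_objects(acme):
        print(p, o)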

#2: Universal Solvent for Structure

I have stated many times that I have not met a form of structured data I did not like [7]. Any extant data structure or format can be represented as RDF. RDF can readily express information contained within structured (conventional databases), semi-structured (Web page or XML data streams), or unstructured (documents and images) information sources. Indeed, the use of ontologies and entity instance records in RDF is a powerful basis for driving the extraction systems now common for tagging unstructured sources.

(One of the disservices perpetuated by an insistence on linked data is to undercut this representational flexibility of RDF. Since most linked data is merely communicating value-attribute pairs for instance data, virtually any common data format can be used as the transmittal form.)

The ease of representing any existing data format or structure and the ability to extract meaningful structure from unstructured sources makes RDF a “universal solvent” for any and all information. Thus, with only minor conversion or extraction penalties, all information in its extant form can be staged and related together via RDF.
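To illustrate the “universal solvent” point (a sketch assuming rdflib; the vocabulary namespace and the record itself are invented), instance data that arrives as plain attribute-value pairs — a dict parsed from JSON, a CSV row, a spreadsheet line — maps onto RDF almost mechanically:

    from rdflib import Graph, URIRef, Literal, Namespace

    EX = Namespace("http://example.org/vocab/")

    def record_to_rdf(graph, subject_uri, record):
        """Turn each attribute-value pair into one triple about the subject."""
        s = URIRef(subject_uri)
        for attr, value in record.items():
            graph.add((s, EX[attr], Literal(value)))

    g = Graph()
    record_to_rdf(g, "http://example.org/id/product/42",
                  {"name": "Widget", "weightKg": 1.2, "inStock": True})
    print(g.serialize(format="turtle"))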

#3: Adaptive, Resilient Schema

A singular difference between semantic technologies (as we practice them) and conventional relational data systems is the use of an open world approach [8]. The relational model is a paradigm where the information must be complete and it must be described by a schema defined in advance. The relational model assumes that the only objects and relationships that exist in the domain are those that are explicitly represented in the database. This makes the closed world of relational systems a very poor choice when attempting to combine information from multiple sources, to deal with uncertainty or incompleteness in the world, or to try to integrate internal, proprietary information with external data.

Semantic technologies, on the other hand, allow domains to be captured and modeled in an incremental manner. As new knowledge is gained or new integrations occur, the underlying schema can be added to and modified without affecting the information that already exists in the system. This adaptability is generally the biggest source of economic benefits to the enterprise from semantic technologies. It is also a benefit that enables experimentation and lowers risk.
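A small sketch of that open-world behavior, again assuming rdflib (the classes and properties are invented for illustration): instance data can be loaded before the schema is complete, and the schema can later be extended simply by asserting more triples, with nothing migrated or reloaded.

    from rdflib import Graph, URIRef, Literal, Namespace
    from rdflib.namespace import RDF, RDFS

    EX = Namespace("http://example.org/vocab/")
    g = Graph()

    # Instance data loaded first, before the domain model is complete.
    acct = URIRef("http://example.org/id/account/7")
    g.add((acct, RDF.type, EX.Account))
    g.add((acct, EX.balance, Literal(1500)))

    # Later, the schema grows: a new superclass and a new property are simply
    # asserted alongside the existing data.
    g.add((EX.Account, RDFS.subClassOf, EX.FinancialInstrument))
    g.add((EX.riskRating, RDFS.domain, EX.Account))

    print(len(g), "triples; nothing already in the store had to change")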

#4: Unmatched Productivity

Having all information in a canonical form means that generic tools and applications can be designed to work against that form. That, in turn, leads to user productivity and developer productivity. New datasets, structure and relationships can be added at any time to the system, but how the tools that manipulate that information behave remains unchanged.

User productivity arises from only needing to learn and master a limited number of toolsets. The relationships in the constituent datasets are modeled at the schema (that is, ontology) level. Since manipulation of the information at the user interface level consists of generic paradigms regarding the selection, view or modification of the simple constructs of datasets, types and instances, adding or changing out new data does not change the interface behavior whatsoever. The same bases for manipulating information can be applied no matter the datasets, the types of things within them, or the relationships between things. The behavior of semantic technology applications is very much akin to having generic mashups.

Developer productivity results from leveraging generic interfaces and APIs and not bespoke ones that change every time a new dataset is added to the system. In this regard, ontology-driven applications [9] arising from a properly designed semantic technology framework also work on the simple constructs of datasets, types and instances. The resulting generalization enables the developer to focus on creating logical “packages” of functionality (mapping, viewing, editing, filtering, etc.) designed to operate at the construct level, and not the level of the atomic data.
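
A tiny sketch of what such generic, construct-level tooling looks like, assuming rdflib and invented URIs: the same “view” routine works for any dataset, type or instance because it is written against the data model rather than against any particular schema.

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    EX = Namespace("http://example.com/ns/")

    def view_instances(graph, rdf_type):
        """Generic display: list every instance of a type with all its properties."""
        for instance in graph.subjects(RDF.type, rdf_type):
            print(instance)
            for predicate, value in graph.predicate_objects(instance):
                print("   ", predicate, "=", value)

    g = Graph()
    person = URIRef("http://example.com/id/person/1")
    g.add((person, RDF.type, EX.Person))
    g.add((person, EX.name, Literal("Ada")))

    # Loading a new dataset with new types requires no change to this routine
    view_instances(g, EX.Person)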

#5: Natural, Connected Knowledge Systems

All of these factors combine to enable more and disparate information to be assembled and related to one another. That, in turn, supports the idea of capturing entire knowledge domains, which can then be expanded and shifted in direction and emphasis at will. These combinations begin to finally achieve knowledge capture and representation in its desired form.

Any kind of information, any relationship between information, and any perspective on that information can be captured and modeled. When done, the information remains amenable to inspection and manipulation through a set of generic tools. Rather simple and direct converters can move that canonical information to other external forms for use by existing external tools. Similarly, external information in its various forms can be readily converted to the internal canonical form.

These capabilities are the direct opposite of today’s information silos. From their very foundations, semantic technologies are perfectly suited to capture the natural connections and nature of relevant knowledge systems.

A Summary of Advantages Greater than the Parts

There are no other IT approaches available to the enterprise that can come close to matching these unique advantages. The ideal of total information integration, both public and private, with the potential for incremental changes to how that information is captured, manipulated and combined, is exciting. And, it is achievable today.

With semantic technologies, more can be done with less and done faster. It can be done with less risk. And, it can be implemented on a pay-as-you-benefit basis [10] responsive to the current economic climate.

But awareness of this reality is not yet widespread. This lack of awareness is the result of a couple of factors. One factor is that semantic technologies are relatively new and embody a different mindset. Enterprises are only beginning to get acquainted with these potentials. Semantic technologies require both new concepts to be learned, and old prejudices and practices to be questioned.

A second factor is the semantic community itself. The early idea of autonomic agents and the heavy AI emphasis of the initial semantic Web advocacy now feels dated and premature at best. Then, the community hardly improved matters with its shift in emphasis to linked data, which is merely a technique and which completely overlooks the advantages noted above.

However, none of this likely matters. The five unique advantages for enterprises from semantic technologies are real and demonstrable today. While my crystal ball is cloudy as to how fast these realities will become understood and widely embraced, I have no question they will be. The foundational benefits of semantic technologies are compelling.

I think I’ll take this to the bank while others ride the elevator.


[1] This series was called for by Eric Franzon of SemanticWeb.com. Contributions to date have been provided by Sandro Hawke, David Wood, and Mark Montgomery.
[2] See Lee Feigenbaum, 2011. “Why Semantic Web Technologies: Are We Asking the Wrong Question?,” TechnicaLee Speaking blog, August 22, 2011; see http://www.thefigtrees.net/lee/blog/2011/08/why_semantic_web_technologies.html, and its follow up on “The Magic Crank,” August 29, 2011; see http://www.thefigtrees.net/lee/blog/2011/08/the_magic_crank.html. For a further perspective on this issue from Lee’s firm, Cambridge Semantics, see Sean Martin, 2010. “Taking the Tech Out of SemTech,” presentation at the 2010 Semantic Technology Conference, June 23, 2010. See http://www.slideshare.net/LeeFeigenbaum/taking-the-tech-out-of-semtech.
[3] See, for example, Jeff Pollock, 2008. “A Semantic Web Business Case,” Oracle Corporation; see http://www.w3.org/2001/sw/sweo/public/BusinessCase/BusinessCase.pdf.
[4] Indeed, many semantics-based sites are disappointingly ugly with data and triples and URIs shoved in the user’s face rather than sizzle.
[5] Linked data and its linking predicates are also all too often misused or misapplied, leading to poor quality of integrations. See, for example, M.K. Bergman and F. Giasson, 2009. “When Linked Data Rules Fail,” AI3:::Adaptive Innovation blog, November 16, 2009. See http://www.mkbergman.com/846/when-linked-data-rules-fail/.
[6] Greater elaboration on all of these advantages is provided in M. K. Bergman, 2009. “Advantages and Myths of RDF,” AI3:::Adaptive Innovation blog, April 8, 2009. See http://www.mkbergman.com/483/advantages-and-myths-of-rdf/.
[7] See M.K. Bergman, 2009. “‘Structs’: Naïve Data Formats and the ABox,” AI3:::Adaptive Innovation blog, January 22, 2009. See http://www.mkbergman.com/471/structs-naive-data-formats-and-the-abox/.
[8] A considerable expansion on this theme is provided in M.K. Bergman, 2009. “The Open World Assumption: Elephant in the Room,” AI3:::Adaptive Innovation blog, December 21, 2009. See http://www.mkbergman.com/852/the-open-world-assumption-elephant-in-the-room/.
[9] For a full expansion on this topic, see M.K. Bergman, 2011. “Ontology-driven Apps Using Generic Applications,” AI3:::Adaptive Innovation blog, March 7, 2011. See http://www.mkbergman.com/948/ontology-driven-apps-using-generic-applications/.
[10] See M.K. Bergman, 2010. “‘Pay as You Benefit’: A New Enterprise IT Strategy,” AI3:::Adaptive Innovation blog, July 12, 2010. See http://www.mkbergman.com/896/pay-as-you-benefit-a-new-enterprise-it-strategy/.

Posted by AI3's author, Mike Bergman Posted on September 12, 2011 at 3:11 am in Linked Data, Semantic Enterprise, Semantic Web | Comments (4)
The URI link reference to this post is: http://www.mkbergman.com/974/making-the-argument-for-semantic-technologies/
Posted:July 18, 2011

Photo courtesy of levelofhealth.com
A Decade of Remarkable Advances in Ten Grand IT Challenges

I’ve been in the information theory and technology game for quite some time, but I believe nothing has matched the pace of advances of the past ten years. As one example, it was a mere eight years ago that I was sitting in a room with language translation vendors contemplating automated translation techniques for US intelligence agencies. The prospects finally looked doable, but the success of large-scale translation was not assured.

At about that same time, and the years until just recently, a whole slew of Grand Challenges [1] in computing hung out there: tantalizing yet not proven. These areas ranged from information extraction and natural language understanding to speech recognition and automated reasoning.

But things have been changing fast, and with a subtle steadiness that has caused these changes to go largely unremarked. Sure, all of us have been aware of the huge changes on the Web and search engine ubiquity and social networking. But some of the fundamentally hard problems in computing have also gone through some remarkable (but largely unremarked) advances.

We now have smart phones that speak instructions to us while we instruct them by voice in turn. Virtually all information conceivable is now indexed and made available through the Web; structure is now rapidly characterizing that information, making it even more useful to discover and organize. We can translate documents online with acceptable accuracy into more than 60 languages [2]. We can get directions to or see satellite views of virtually any place on earth. We have in fact become accustomed to new technology magic on a nearly daily basis, so much so that the pace of these advances seems to be a constant, blunting our perspective of just how rapidly these advances have been progressing.

These advances are perhaps not the realization of artificial intelligence as articulated in the 1950s to 1980s, but are contributing to a machine-based ability to do tasks useful to humans heretofore impossible and at scales unimaginable. As Google and IBM’s Watson are showing, statistics (among other techniques) applied to massive knowledge bases or text corpora are breaking down all of the Grand Challenges of symbolic computing. The image that is emerging is less one of intelligent machines working autonomously than it is of computers working interactively or semi-automatically with humans to address previously unsolvable problems.

A perspective of the decade past also marks the seminal paper on the semantic Web by Berners-Lee, Hendler and Lassila from May 2001 [3]. Yet, while this semantic Web vision has been a contributor to the success of the Grand Challenge advances of the past ten years, I think we can also say that it has not been the key or even a primary driver. That day may still yet come. Rather, I think we have to look to natural language and statistics surrounding large-scale corpora as the more telling drivers.

Ten Grand Challenge Advances

Over the past ten years there have been significant advances on at least ten Grand Challenges in symbolic computation. As the concluding section notes, these advances can be traced for the most part to broader advances in natural language processing, the logical and semiotic bases for interoperability, and standards (nominally in the semantic Web) for embracing them. Here are these ten areas of advance, all achieved over the past ten years:

#1 Information Extraction

Information extraction (IE) uses various forms of natural language processing (NLP) to identify structured information within unstructured or semi-structured documents. These documents are presented in machine-readable form (including straight text, various document formats or HTML) with the various types of information “tagged” or prompted for inclusion. Information types that can be extracted with one of the various techniques include entities, relations, topics, categories, and so forth. Once tagged or extracted, the information in the documents can now be included and linked to standard structured information (as might come from conventional databases) or to structure in other documents.

Most recently, a large number of online services and open source systems have also become available with strengths in one or more of these extraction types [4]. Some current examples include Yahoo! Term Extraction, OpenCalais, BeliefNetworks, OpenAmplify, Alchemy API, Evri, Extractiv, Illinois Tagger, and about 80 others [4].
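
The simplest flavor of this, dictionary (gazetteer) tagging, can be sketched in a few lines; the entity list and text below are made up, and real extraction systems layer statistical models and linguistic analysis on top of this basic idea:

    import re

    GAZETTEER = {
        "Acme Corp": "Organization",
        "Jane Doe": "Person",
        "Chicago": "Place",
    }

    def extract_entities(text):
        """Return (surface form, entity type, character offset) for each match."""
        hits = []
        for surface, entity_type in GAZETTEER.items():
            for match in re.finditer(re.escape(surface), text):
                hits.append((surface, entity_type, match.start()))
        return sorted(hits, key=lambda hit: hit[2])

    document = "Jane Doe joined Acme Corp after relocating from Chicago."
    for surface, entity_type, offset in extract_entities(document):
        print(f"{entity_type:<12} {surface!r} at offset {offset}")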

#2 Machine Translation

Machine translation is the automatic translation of machine-readable text from one human language to another. Accurate and acceptable machine translation requires applying different types of knowledge including grammar, semantics, facts about the real world, etc. Various approaches have been developed and refined over time.

Especially helpful has been the availability of huge corpora in multiple languages to which large-scale statistical analysis may be applied (as is the case of Google’s machine translation) or human editing and refinement (as is the case with the more than 280 language versions of Wikipedia).

While it is true none of these systems have 100% accuracy (even human translators show much variation), the more advanced ones are truly impressive with remaining ambiguities flagged for resolution by semi-automatic means.

#3 Sentiment Analysis

Though sentiment analysis is strictly speaking a subset of information extraction, it has the more demanding and useful task of extracting subjective information, often across a group of documents or texts. Sentiment analysis can be applied to online reviews to determine the “polarity” about specific objects, and it is especially useful for identifying public opinion trends or evaluating social media for ranking, polling or marketing purposes.

Because of its greater difficulty and potential high value, many of the leading sentiment analysis capabilities remain proprietary. Some capable open source versions are nonetheless available. There is also an interesting online application using Twitter feeds.
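
At its simplest, the idea can be sketched as lexicon-based polarity scoring; the word lists and review below are made up, and production systems add negation handling, weighting and machine learning on top:

    POSITIVE = {"excellent", "good", "love", "impressive"}
    NEGATIVE = {"poor", "bad", "hate", "disappointing"}

    def polarity(text):
        """Score a text by counting positive and negative lexicon hits."""
        words = text.lower().split()
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        if score > 0:
            return "positive"
        if score < 0:
            return "negative"
        return "neutral"

    print(polarity("an impressive product overall and a good value"))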

#4 Disambiguation

Many words have more than one meaning. Word sense disambiguation uses either machine learning, dictionaries (gazetteers) of known entities and concepts, ontologies or linguistic databases such as WordNet, or combinations thereof to evaluate ambiguous terms or phrases and resolve them based on context. Some systems need to be “trained”, some work automatically, and others rely on evaluation and prompting (semi-automatic approaches) to complete the disambiguation process.

State-of-the-art systems have greater than 90% precision [5]. Most of the leading open source NLP toolkits have quite capable disambiguation modules, and even better proprietary systems exist.
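
A toy version of the dictionary-based approach, a simplified Lesk-style overlap between the context and candidate sense glosses, shows the basic mechanics; the two-sense inventory for “bank” below is invented:

    SENSES = {
        "bank (finance)": "an institution that accepts deposits and lends money",
        "bank (river)": "sloping land beside a body of water such as a river",
    }

    def disambiguate(context, senses):
        """Pick the sense whose gloss shares the most words with the context."""
        context_words = set(context.lower().split())
        best_sense, best_overlap = None, -1
        for sense, gloss in senses.items():
            overlap = len(context_words & set(gloss.lower().split()))
            if overlap > best_overlap:
                best_sense, best_overlap = sense, overlap
        return best_sense

    print(disambiguate("she sat on the bank of the river and watched the water", SENSES))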

#5 Speech Synthesis and Recognition

Speech synthesis is the conversion of text to spoken speech and has been around for quite some time. Speech recognition is a far more difficult task in that a given sound clip or real-time spoken speech of a person must be converted to a textual representation, which itself can then be acted upon such as navigating or making selections. Speech recognition is made difficult because of individual voice differences, the variations of human languages and speech patterns, and the need to segment speech into a sequence of words. (In most spoken languages, the sounds representing successive letters blend into each other, so the conversion of the modulated wave form to discrete characters or tokens can be a very difficult process.)

Crude systems of a decade ago required much training with a specific speaker’s voice to show much effectiveness. Today, the range and ability to use these systems without training has markedly improved.

Until recently, improvements largely were driven by military and intelligence requirements. Today, however, with the ubiquity of smart phones and speech interfaces, the consumer market is greatly accelerating progress.

#6 Image Recognition

Image recognition is the ability to determine whether or not an electronic image contains some specific object, feature, or activity, and then to extract the image data associated with it. Today, under specific circumstances and for specific tasks, this can be done by computer. However, for the general case of arbitrary objects in arbitrary situations this challenge has not yet been fully met. The systems of today work best for simple geometric objects (e.g., polyhedra), human faces, printed or hand-written characters, or vehicles, and in specific situations, typically described in terms of well-defined illumination, background, and orientation of the object relative to the camera.

Automated license plate recognition at intersections, face recognition by security cameras, and greatly expanded and improved character recognition systems (machine vision) represent some of the current state of the art. Again, smart phone apps are helping to drive advances.

#7 Interoperability Standards and Methods


Rapid Progress in Climbing the Data Federation Pyramid

Most of the previous advances are related to extracting structured information or mapping or deriving additional structured information. Once obtained, of course, the next challenge is in how to relate that information together; that is, how to make it interoperate.

We have been steadily climbing a data federation pyramid [6] — and at an impressively accelerating rate since the adoption of the Internet and Web. These network innovations gave us a common basis and protocols for connecting distributed devices. That, in turn, has freed us to concentrate on the standards for data representation and interoperability.

XML first provided a means for a common data serialization that encouraged various communities and industries to devise exchange vocabularies. RDF provided a means for a common data model, one that was both simple and extensible at the same time [7]. OWL built upon that basis to enable us to build common domain models (see next).

There are alternatives to the semantic Web standards of RDF and OWL, such as Common Logic, and there are many data exchange formats that compete with XML. None of these standards is essential on its own, and all have their communities and advocates. However, because they are standards and they share common network bases, it has also been relatively easy to convert amongst the various available protocols. We are nearly at a global level where everything is connected, machine-readable, and in structured form.
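
A small sketch of that convertibility, assuming rdflib 6+ and made-up URIs: one graph in the common data model, emitted in three different wire formats without any per-format code.

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    EX = Namespace("http://example.com/ns/")

    g = Graph()
    g.bind("ex", EX)
    item = URIRef("http://example.com/id/item/1")
    g.add((item, RDF.type, EX.Item))
    g.add((item, EX.label, Literal("Sample item")))

    # The same graph, three serializations (rdflib 6+ returns strings)
    print(g.serialize(format="turtle"))   # Turtle
    print(g.serialize(format="xml"))      # RDF/XML
    print(g.serialize(format="nt"))       # N-Triples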

#8 Common Domain Models

Semantics in machine-readable form means that we can more confidently link and combine available information. We are seeing a veritable explosion of domain models to represent various domains and viewpoints in consensual, interoperable form. What this means is that we are now gaining the computing vocabularies and grammars — along with shared community models (world views) — to get this stuff to work together.

Five years ago we called this phenomenon mashups, but no one uses that term any longer because these information brewpots are everywhere, including in our very hands when we interact with the apps on our smart phones. This glue of domain models is generally as invisible to us as is the glue in laminates or the resin in plastics. But these models are nonetheless the strength and foundation that enable much of the computing magic unfolding around us.

#9 Virtual Apps (Cloud Computing)

Once the tyranny of physical separation was shattered between data and machine by the network, the rationale for keeping the data with the app or even the user with the app disappeared. Cloud computing may seem mysterious or sound to have some high-octave hum, but it really is nothing more than saying that the Web enables us to treat all of our computing resources as virtual. Data can be anywhere; machines and hard drives can be anywhere; and applications can be anywhere.

And, virtualness brings benefits in and of itself. Whole computing environments can be installed or removed nearly instantaneously. Peak computing demands can be met with virtual headrooms. Backup and rollover and redundancy practices and strategies can change. Web services mean tailored capabilities can be invoked from anywhere and integrated for local needs. Massive computing resources and server farms can be as accessible to the individual as they are to prior computing behemoths. Combined with continued advances in underlying computing hardware and chips, the computing power available to any user is rising exponentially. There is now even more power in the power curve.

#10 Big Data

One hears stories of Google or the National Security Agency operating and managing servers numbering in the hundreds of thousands. Entirely new operating systems and computing environments, many with roots in open source, such as virtual operating systems and MapReduce approaches like Hadoop, have been developed to deal with the current era of “big data”.

MapReduce is a framework for processing huge datasets using a large number of servers. The “map” step partitions the problem into tractable sub-problems, organized in a tree structure. The “reduce” step then takes the answers to all the sub-problems and combines them to produce the final output.
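
A toy, single-machine sketch of the same two steps, here counting words; frameworks such as Hadoop run the identical pattern across thousands of servers, with a shuffle stage distributing the intermediate keys among them.

    from collections import defaultdict

    def map_step(document):
        """Emit (key, value) pairs for one input record."""
        return [(word.lower(), 1) for word in document.split()]

    def reduce_step(key, values):
        """Combine all the values seen for one key."""
        return key, sum(values)

    documents = ["big data needs big machines",
                 "map and reduce split up the work"]

    # "Shuffle": group the intermediate pairs by key
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_step(doc):
            groups[key].append(value)

    results = [reduce_step(key, values) for key, values in groups.items()]
    print(sorted(results))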

Such techniques enable analysis of datasets of a size impossible before. This has enabled the development of statistics and analytical techniques that have been able to make correlations and find patterns for some of the Grand Challenge tasks noted before that simply could not be addressed within previous limits. The “big data” approach is providing a brute force alternative to previously intractable problems.

Why Such Progress?

Declining hardware costs and increasing performance (such as from Moore’s Law), combined with the adoption of the Internet + Web network, set the fertile conditions for these unprecedented advances in computing’s Grand Challenges. But the adaptive radiation in innovations now occurring has its own dynamics. In computing terms, we are seeing the equivalent of the Cambrian explosion in evolutionary history.

The dynamics driving this computing explosion are based largely, I believe, on the statistics of information retrieval and extraction needed to cope with the scale of documents on the Web. That, in turn, has impelled innovations in big data and distributed architectures and designs that have pried open previously closed computing lockboxes. As data from everywhere and from every provenance pours into the system, means for handling and interoperating with it have become imperatives. These forces, in turn, have been channeled and are being met through the open and standards-based approaches that helped lead to the development of the Internet and its infrastructure in the first place.

These powerful evolutionary forces in computing are clearly evident in the ten Grand Challenge advances above. But the challenges above are also silent on another factor, underpinning the interoperability initiatives, that is only now just becoming evident and exerting its own powerful force. That is the workable, intellectual foundations for interoperability itself.

Clearly, as the advances in the Grand Challenges show, we are seeing immense exposures of new structured information and impressive means for accessing and managing it on a global, distributed scale. Yet all of this data and structure raises the question of how to get the information to work together. Further, the sources, viewpoints and methods by which all of this data has been created also put a huge premium on means to deal with the diversity. Though not evident, and perhaps not even known to many of the innovators and practitioners, there has been a growing intellectual force shaping our foundational views about the nature of things and their representations. This force has been, I believe, one of those root cause drivers helping to show the way to interoperability.

John Sowa, despite his unending criticism of the semantic Web in favor of common logic, has nonetheless been a very positive evangelist for the 19th century American logician and philosopher, Charles Sanders Peirce. Sowa points out that the entire 20th century largely neglected Peirce’s significant contributions in many areas and that some philosophers appropriated Peircean insights without proper attribution [8]. Indeed, Peirce has only come to wider attention within the past decade or so. Much of his voluminous lifetime writing has still not been committed to publication.

Among many notable contributions, Peirce was passionate about signs and their triadic representations, in a field known as semiotics. The philosophical and logical basis of his triangle of signs deserves your attention; it cannot be adequately treated here [9]. However, as summarized by Sowa [8], “A semiotic view of language and logic gets to the heart of the philosophical controversies and their practical implications for linguistics, artificial intelligence, and related subjects.”

In essence, Peirce’s triadic logic of semiotics helps clarify philosophical questions about things, how they are perceived and how they are named, questions that have vexed philosophers at least since the time of Aristotle. What Peirce was able to put forward was a testable logic for how things and the names of things can be understood and related to one another, via logical statements or structures. These, in turn, can be symbolized and formalized into logical constructs that can capture the structure of natural language as well as more structured data.

The clarity of Peirce’s logic of signs is an underlying factor, I believe, for why we are finally seeing our way clear to how to capture, represent and relate information from a diversity of sources and viewpoints that is defensible and interoperable [10]. As we plumb Peircean logics further, I believe we will continue to gain additional insights and methods for combining and relating information. The next phase of our advances on these Grand Challenges is likely to be fueled more by connections and interoperability than in basic extraction or representation.

The Widening Explosion

We are not seeing the vision of artificial intelligence unfold as posed three decades ago. Nor are we seeing the AI-complete type of problems being solved in their entirety [11]. Rather, we are seeing impressive but incomplete approaches. Full automation and autonomy are not yet at hand, and may be so far in the future as to never be. But we are nevertheless seeing advances across the board in all Grand Challenge areas.

What is emerging is a practical achievement of the Grand Challenges, the scale and scope of which is unprecedented in symbolic computing. As we see Peircean logic continue to take hold and interoperability grow in usefulness and stature, I think it fair to say we can look back in ten years to describe where we stand today as having been in the midst of an evolutionary explosion.


[1] Grand Challenges were United States policy objectives for high-performance computing and communications research set in the late 1980s. According to “A Research and Development Strategy for High Performance Computing”, Executive Office of the President, Office of Science and Technology Policy, 29 pp., November 20, 1987, “A grand challenge is a fundamental problem in science or engineering, with broad applications, whose solution would be enabled by the application of high performance computing resources that could become available in the near future.”
[2] For example, as of July 17, 2011, Google offered 63 different source or target languages for translation.
[3] Tim Berners-Lee, James Hendler and Ora Lassila, 2001. “The Semantic Web”. Scientific American Magazine; see http://www.scientificamerican.com/article.cfm?id=the-semantic-web.
[4] Go to Sweet Tools, and enter the search ‘information extraction’ to see a list of about 85 tools.
[5] See, for example, Roberto Navigli, 2009. “Word Sense Disambiguation: A Survey,” ACM Computing Surveys, 41(2), 2009, pp. 1–69. See http://www.dsi.uniroma1.it/~navigli/pubs/ACM_Survey_2009_Navigli.pdf.
[6] M.K. Bergman, 2006. “Climbing the Data Federation Pyramid,” AI3:::Adaptive Information blog, May 25, 2006; see http://www.mkbergman.com/229/climbing-the-data-federation-pyramid/.
[7] M. K. Bergman, 2009. “Advantages and Myths of RDF,” AI3:::Adaptive Information blog, April 8, 2009. See http://www.mkbergman.com/483/advantages-and-myths-of-rdf/
[8] John Sowa, 2006. “Peirce’s Contributions to the 21st Century”, in H. Schärfe, P. Hitzler, & P. Øhrstrøm, eds., Conceptual Structures: Inspiration and Application, LNAI 4068, Springer, Berlin, 2006, pp. 54-69. See http://www.jfsowa.com/pubs/csp21st.pdf.
[9] See, as a start, the Wikipedia article on Charles Sanders Peirce (pronounced “purse”), as well as the Arisbe collection of his assembled papers (to date). Also see John Sowa, 2010. “The Role of Logic and Ontology in Language and Reasoning,” from Chapter 11 of Theory and Applications of Ontology: Philosophical Perspectives, edited by R. Poli & J. Seibt, Berlin: Springer, 2010, pp. 231-263. See http://www.jfsowa.com/pubs/rolelog.pdf. Sowa also says, “Although formal logic can be studied independently of natural language semantics, no formal ontology that has any practical application can ever be developed and used without acknowledging its intimate connection with NL semantics.”
[10] While Peirce’s logic and clarity of conceptual relationships is compelling, I find reading his writings quite demanding.
[11] In the field of artificial intelligence, the most difficult problems are informally known as AI-complete or AI-hard, meaning that the difficulty of these computational problems is equivalent to solving the central artificial intelligence problem of making computers as intelligent as people. Computer vision, autonomous robots and understanding natural language are amongst challenges recognized by consensus as being AI-complete. However, practical advances on the Grand Challenges were never defined as needing to meet the AI-complete criterion. Indeed, it is even questionable whether such a hurdle is even worthwhile or meaningful on its own.

Posted by AI3's author, Mike Bergman Posted on July 18, 2011 at 10:00 pm in Adaptive Innovation, Semantic Web, Structured Web | Comments (3)
The URI link reference to this post is: http://www.mkbergman.com/965/in-the-midst-of-an-evolutionary-explosion/
Posted:May 10, 2011

Deciphering Information Assets: Exposing $4.7 Trillion Annually in Undervalued Information

Something strange began to happen with company valuations beginning twenty to thirty years ago. Book values increasingly began to diverge — go lower — from stock prices or acquisition prices. Between 1982 and 1992 the ratio of book value to market value decreased from 62% to 38% for public US companies [1]. The why of this mystery has largely been solved, but what to do about it has not. Significantly, semantic technologies and approaches offer both a rationale and an imperative for how to get the enterprises’ books back in order. In the process, semantics may also provide a basis for more productive management and increased valuations for enterprises as well.

The mystery of diverging value resides in the importance of information in an information economy. Unlike the historical and traditional ways of measuring a company’s assets — based on the tangible factors of labor, capital, land and equipment — information is an intangible asset. As such, it is harder to see, understand and evaluate than other assets. Conventionally, and still the more common accounting practice, intangible assets are divided into goodwill, legal (intellectual property and trade secrets) and competitive (know-how) intangibles. But — given that intangibles now equal or exceed the value of tangible assets in advanced economies — we will focus instead on the information component of these assets.

As used herein, information is taken to be any data that is presented in a form useful to recipients (as contrasted to the more technical definition of Shannon and Weaver [2]). While it is true that there is always a question of whether the collection or development of information is a cost or represents an investment, that “information” is of growing importance and value to the enterprise is certain.

The importance of this information focus can be demonstrated by two telling facts, which I elaborate below. First, only five to seven percent of existing information is adequately used by most enterprises. And, second, the global value of this information amounts to perhaps a range of $2.0 trillion to $7.4 trillion annually (yes, trillions with a T)! It is frankly unbelievable that assets of such enormous magnitude are so poorly understood, exploited or managed.

Amongst all corporate resources and assets, information is surely the least understood and certainly the least managed. We value what we measure, and measure what we value. To say that we little measure information — its generation, its use (or lack thereof) or its value — means we are attempting to manage our enterprises with one eye closed and one arm tied behind our backs. Semantic approaches offer us one way, perhaps the best way, to bring understanding to this asset and then to leverage its value.

The Seven “Laws” of Information

More than a decade ago Moody and Walsh put forward a seminal paper on the seven “laws” of information [3]. Unlike other assets, information has some unique characteristics that make understanding its importance and valuing it much more difficult than other assets. Since I think it a shame that this excellent paper has received little attention and few citations, let me devote some space to covering these “laws”.

Like traditional factors of production — land, labor, capital — it is critical to understand the nature of this asset of “information”. As the laws below make clear, the nature of “information” is totally unique with respect to other factors of production. Note I have taken some liberty and done some updating on the wording and emphasis of the Moody and Walsh “laws” to accommodate recent learnings and understandings.

Law #1: Information Is (Infinitely) Shareable

Information is not friable and can not be depleted. Using or consuming it has no direct effect on others using or consuming it, and using only portions of it does not undermine the whole of it. Using it does not cause a degradation or loss of function from its original state. Indeed, information is actually not “consumed” at all (in the conventional sense of the term); rather, it is “shared”. And, absent other barriers, information can be shared infinitely. The access to and use of information on the Web demonstrate this truth daily.

Thus, perhaps the most salient characteristic of information as an asset is that it can be shared between any number of people, business areas and organizations without loss of value to any party (absent the importance of confidentiality or secrecy, which is another information factor altogether). The sharability or maintenance of value irrespective of use makes information quite different to how other assets behave. There is no dilution from use. As Moody and Walsh put it, “from the firm’s perspective, value is therefore cumulative rather than apportioned across different users.”

In practice, however, this very uniqueness is also a cause of other organizational challenges. Both personal and institutional barriers get erected to limit sharing since “knowledge is power.” One perverse effect of information hoarding or lack of institutional support for sharing is to force the development anew of similar information. When not shared, existing information becomes a cost, and one that can get duplicated many times over.

Law #2: The Value of Information Increases With Use

Most resources degrade with use, such as equipment wearing out. In contrast, the per unit value of information increases with use. The major cost of information is in its capture, storage and maintenance. The actual variable costs of using the information (particularly digital information) are, in essence, zero. Thus, with greater use, the per unit cost of information drops.

There is a corollary to this that also goes to the heart of the question of information as an asset. From an accounting point of view, something can only be an asset if it provides future economic value. If information is not used, it cannot possibly result in such benefits and is therefore not an asset. Unused information is really a liability, because no value is extracted from it. In such cases the costs of capture, storage and maintenance are incurred, but with no realized value. Without use, information is solely a cost to the enterprise.

The additional corollary is that awareness of the information’s existence is an essential requirement in order to obtain this value. As Moody and Walsh state, “information is at its highest ‘potential’ when everyone in the organization knows where it is, has access to it and knows how to use it. Information is at its lowest ‘potential’ when people don’t even know it is there.”

A still further corollary is the importance of information literacy. Awareness of information without an understanding of where it fits or how to take advantage of it also means its value is hidden to potential users. Thus, in addition to awareness, training and documentation are important factors to help ensure adequate use. Both of these factors may seem like additional costs to the enterprise beyond capture, storage and maintenance, but, without them, no or little value will be leveraged and the information will remain a sunk cost.

Law #3: Information is Perishable

Like most other assets, the value of information tends to depreciate over time [4]. Some information has a short shelf life (such as Web visitations); other information has a long shelf life (patents, contracts and many trade secrets). Proper valuation of information should take into account such differences in operational life, analysis or decision life, and statutory life. Operational shelf life tends to be the shortest.

In these regards, information is not too dissimilar from other asset types. The most important point is to be cognizant of use and shelf differences amongst different kinds of information. This consideration is also traded off against the declining costs of digital information storage.

Law #4: The Value of Information Increases With Accuracy

A standard dictum is that the value of information increases with accuracy. The caveat, however, is that some information, because the firm does not depend on it operationally and it is not critical to the firm’s strategic interests, can actually become a cost when capture costs exceed value. Understanding such Pareto principles is an important criterion in evaluating information approaches. Generally, information closest to the transactional or business purpose of the organization will demand higher accuracy.

Such statements may sound like platitudes — and are — in the absence of an understanding of information dependencies within the firm. But, when certain kinds of information are critical to the enterprise, it is just as important to know the accuracy of the information feeding that “engine” as it is for oil changes or maintenance schedules for physical engines. Thus an understanding of accuracy requirements in information should be a deliberate management focus for critical business functions. It is the rare firm that attends to such imperatives today.

Law #5: The Value of Information Increases in Combination

A unique contribution from semantic approaches, and perhaps the one resulting in the highest valuation benefit, arises from the increased value of connecting the information. We have come to understand this intimately as the “network effect” from interconnected nodes on a network. The same effect arises when existing information is newly connected.

Today’s enterprise information environment is often described by many as unconnected “silos”. Scattered databases and spreadsheets and other information repositories litter the information landscape. Not only are these sources unconnected and isolated, but they also may describe similar information in different and inconsistent ways.

As I have described previously in The Law of Linked Data [5], existing information can act as nodes that — once connected to one another — tend to produce a similar network effect to what physical networks exhibit with increasing numbers of users. Of course, the nature of the information that is being connected and its centrality to the mission of the enterprise will greatly affect the value of new connections. But, based on evidence to date, the value of information appears to go up somewhere between a quadratic and exponential function for the number of new connections. This value is especially evident in know-how and competitive areas.

Law #6: More Is Not Necessarily Better

Information overload is a well-known problem. On the other hand, lack of appropriate information is also a compelling problem. The question of information is thus one of relevancy. Too much irrelevant information is a bad thing, as is too little relevant information.

These observations lead to two use considerations. First, means to understand and focus information capture on relevant information is critical. And, second, information management systems should be purposefully designed with user interfaces for easy filtering of irrelevant information.

The latter point sounds straightforward but, in actuality, requires a semantic underpinning to the enterprise’s information assets. This requirement arises because relevancy is in the eye of the beholder, and different users have different terms, perspectives and world views by which information evaluation occurs. For filtering to be useful, information must be presented in terms and perspectives relevant to those users. Since multiple studies affirm that decision-makers seek more information even beyond their overload points [3], it is thus more useful to provide relevant access and filtering methods that can be tailored by user rather than “top down” information restrictions.

Law #7: Information is Self-propagating

With access and connections, information tends to beget more information. This propagation results from summations, analysis, unique combinations and other ways that basic data get recombined into new data. Thus, while the first law noted that information can not be consumed (or depleted) by virtue of its use, we can also say that information tends to reproduce and expand itself via use and inspection.

Indeed, knowledge itself is the result of how information in its native state can be combined and re-organized to derive new insights. From a valuation standpoint, it is this very understanding that leads to such things as competitive intelligence or new know-how. In combination with insights from connections, this propagating factor of information is the other leading source of intangible asset valuations.

This law also points to the fact that information per se is not a scarce resource. (Though its availability may be scarce.) Once available, techniques like data mining, analysis, visualization and so forth can be rich sources for generating new information from existing holdings of data.

Information as an Asset and How to Value

These “laws” — or perspectives — help to frame the imperatives for how to judge information as an asset and its resulting value. The methodological and conceptual issues of how to explicitly account for information on a company’s books are, of course, matters best left to economists and professional accountants. With the growing share of information in relation to intangible assets, this would appear to be a matter of great importance to national policy. Accounting for R&D efforts as an asset versus a cost, for example, has been estimated to add on the order of 11 percent to US national GDP estimates [9].

The mere generation of information is not necessarily an asset, as the “laws” above indicate. Some of the information has no value and some indeed represents a net sunk cost. What we can say, however, is that valuable information that is created by the enterprise but remains unused or is duplicated means that what was potentially an asset has now been turned into a cost — sometimes a cost repeated many-fold.

Information that is used is an asset, intangible or not. Here, depending on the nature of the information and its use, it can be valued on the basis of cost (historical cost or what it cost to develop it), market value (what others will pay for it), or utility (what is its present value as benefits accrue into the future). Traditionally the historical cost method has been applied to information. Yet, since information can both be sold and still retained by the organization, it may have both market value and utility value, with its total value being the sum.

In looking at these factors, Moody and Walsh propose a number of new guidelines in keeping with the “laws” noted above [3]:

  • Operational information should be measured as the cost of collection using data entry costs
  • Management information should be valued based on what it cost to extract the data from operational systems
  • Redundant data should be considered to have zero value (Law #1)
  • Unused data should be considered to have zero value (Law #2)
  • The number of users and number of accesses to the data should be used to multiply the value of the information (Law #2). When information is used for the first time, it should be valued at its cost of collection; subsequent uses should add to this value (perhaps on a depreciated basis; see below)
  • The value of information should be depreciated based on its “shelf life” (Law #3)
  • The value of information should be discounted by its accuracy relative to what is considered to be acceptable (Law #4)
  • And, as an added factor, information that is effectively linked or combined should have its value multiplied (Law #5), though the actual multiplier may be uncertain [5].
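
Strung together, these guidelines suggest a simple valuation recipe. The sketch below is my own illustrative rendering, not a formula from Moody and Walsh, and every figure in it is a made-up placeholder:

    def information_value(collection_cost, uses, shelf_life_years, age_years,
                          accuracy, required_accuracy, linkage_multiplier=1.0,
                          redundant=False):
        """Illustrative valuation following the guidelines above."""
        if redundant or uses == 0:
            return 0.0                                       # redundant or unused: zero value
        value = collection_cost * uses                       # value accumulates with use
        value *= max(0.0, 1 - age_years / shelf_life_years)  # depreciate by shelf life
        value *= min(1.0, accuracy / required_accuracy)      # discount by accuracy shortfall
        value *= linkage_multiplier                          # premium for linked, combined data
        return value

    print(information_value(collection_cost=10_000, uses=25, shelf_life_years=5,
                            age_years=1, accuracy=0.90, required_accuracy=0.95,
                            linkage_multiplier=2.0))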

The net result of thinking about information in this more purposeful way is to encourage more accurate valuation methods, and to provide incentives for more use and re-use, particularly in combined ways. Such methods can also help distinguish what information is of more value to the organization, and therefore worthy of more attention and investment.

The Growing Importance of Intangible Information

The emerging discrepancy between market capitalizations and book values began to get concerted academic attention in the 1990s. To be sure, perceptions by the market and of future earnings potential can always  color these differences. The simple occurrence of a discrepancy is not itself proof of erroneous or inaccurate valuations. (And, the corollary is that the degree of the discrepancy is not sufficient alone to estimate the intangible asset “gap”, a logical error made by many proponents.) But, the fact that these discrepancies had been growing and appeared to be based (in part) on structural changes linked to intangibles was creating attention.

Leonard Nakamura, an economist with the Federal Reserve Board in Philadelphia, published a working paper in 2001 entitled, “What is the U.S. Gross investment in Intangibles?  (At Least) One Trillion Dollars a Year!” [6]. This was one of the first attempts to measure intangible investments, which he defined as private expenditures on assets that are intangible and necessary to the creation and sale of new or improved products and processes, including designs, software, blueprints, ideas, artistic expressions, recipes, and the like. Nakamura acknowledged his work as being preliminary. But he did find direct and indirect empirical evidence to show that US private firms were investing at least $1 trillion annually (as of 2000, the basis year for the data) in intangible assets.  Private expenditures, labor and corporate operating margins were the three measurement methods.  The study also suggested that the capital stock of intangibles in the US has an equilibrium market value of at least $5 trillion.

Another key group, Carol Corrado, Charles Hulten, and Daniel Sichel (known as “CHS” across their many studies), also began to systematically evaluate the extent and basis for intangible assets and the discrepancy [7]. They estimated that spending on long-lasting knowledge capital, not just intangibles broadly, grew relative to other major components of aggregate demand during the 1990s. CHS was the first to show that, by the turn of the millennium, fixed US investment in intangibles was at least as large as business investment in traditional, tangible capital.

By later in the decade, Nakamura was able to gather and analyze time series data that showed the steady increase in the contributions of intangibles [8]:

One can see the cross-over point late in the decade. He now estimates investment in intangibles to be on the order of 8% to 10% of US GDP annually.

Roughly at the same time the National Academies in the US was commissioned to investigate the policy questions of intangible assets. The resulting major study [9] contains much relevant information. But it, too, contained an update by CHS on their slightly different approach to analyzing the growing role of intangible assets:

This CHS analysis shows similar trends to what Nakamura found, though the degree of intangible contributions is estimated as higher (~14% of annual GDP today), with investments in intangibles exceeding tangible assets somewhat earlier.

Surveys of more than 5,000 companies in 25 countries confirmed these trends from a different perspective, and also showed that most of these assets did not get reflected in financial statements. A large portion of this value was due to “brands” and other market intangibles [10]. The total “undisclosed” portion appeared to equal or exceed total reported assets. Figures for the US indicated there might be a cumulative basis of intangible assets of $9.2 trillion [11].

In parallel, these groups and others began to decompose the intangible asset growth by country, sector, or asset type. The specific component of “information” received a great deal of attention. Uday Apte, Uday Karmarkar and Hiranya Nath, in particular, conducted a couple of important studies during this decade [12,13]. For example, they found nearly two-thirds of recent US GDP was due to information or knowledge industry contributions, a percentage that had been growing over time. They also found that a secondary sector of information internal to firms itself constituted well over 40% of the information economy, or some 28% of the entire economy. So the information activities internal to organizations and institutions represent a very large part of the economy.

The specific components that can constitute the informational portion of intangible assets have also been looked at by many investigators, importantly including key accounting groups. FASB, for example, has specific guidance on treatment of intangible assets in SFAS 141 [14]. Two-thirds of the 90 specific intangible items listed by the American Institute of Certified Public Accountants are directly related to information (as opposed to contracts, brands or goodwill), as shown in [15]. There has also been some good analysis by CHS on breakdowns by intangible asset categories [16]. There are also considerable differences by country on various aspects of these measures (see, for example, [10]). According to OECD figures from 2002, expenditures for knowledge (R&D, education and software) ranged from nearly 7 percent (Sweden) to below 2 percent (Greece) in OECD countries, with an average of about 4 percent and the US at over 6 percent [17].

. . . Plus Too Much Information Goes Unused

The common view is that a typical organization only uses 5 to 7 percent of the information it already has on hand [18], and 20% to 25% of a knowledge worker’s time is spent simply trying to find information [19]. To probe these issues more deeply, I began a series of analyses in 2004 looking at how much money was being spent on preparing documents within US companies, and how much of that investment was being wasted or not re-used [20]. One key finding from that study was that the information within documents in the US represents about a third of total gross domestic product, or an amount equal at the time of the study to about $3.3 trillion annually (in 2010 figures, that would be closer to $4.7 trillion). This level of investment is consistent with the results of Apte et al. and others as noted above.

However, for various reasons, mostly due to lack of awareness and re-use, some 25% of those trillions of dollars spent annually on document creation is wasted. If we could just find the information and re-use it, massive benefits could accrue, as these breakdowns in key areas show:

U.S. FIRMS                                            $ Million    % of Benefits
Cost to Create Documents                             $3,261,091
Benefits to Finding Missed or Overlooked Documents     $489,164              63%
Benefits to Improved Document Access                    $81,360              10%
Benefits of Re-finding Web Documents                    $32,967               4%
Benefits of Proposal Preparation and Wins                $6,798               1%
Benefits of Paperwork Requirements and Compliance      $119,868              15%
Benefits of Reducing Unauthorized Disclosures           $51,187               7%
Total Annual Benefits                                  $781,314             100%

PER LARGE FIRM                                        $ Million
Cost to Create Documents                                 $955.6
Benefits to Finding Missed or Overlooked Documents       $143.3
Benefits to Improving Document Access                     $23.8
Benefits of Re-finding Web Documents                       $9.7
Benefits of Proposal Preparation and Wins                  $2.0
Benefits of Paperwork Requirements and Compliance         $35.1
Benefits of Reducing Unauthorized Disclosures             $15.0
Total Annual Benefits                                    $229.0

Table. Mid-range Estimates for the Annual Value of Documents, U.S. Firms, 2002 [20]

The total benefit from improved document access and use to the U.S. economy is on the order of 8% of GDP. For the 1,000 largest U.S. firms, benefits from these improvements can approach nearly $250 million annually per firm (2002 basis). About three-quarters of these benefits arise from not re-creating the intellectual capital already invested in prior document creation. About one-quarter of the benefits are due to reduced regulatory non-compliance or paperwork, or better competitiveness in obtaining solicited grants and contracts.

This overall value of document use and creation is quite in line with the analyses of intangible assets noted above, and which arose from totally different analytical bases and data. This triangulation brings confidence that true trends in the growing importance of information have been identified.

How Big is the Information Asset Gap?

These various estimates can now be combined to provide an assessment of just how large the “gap” is for the overlooked accounting and use of information assets:

                     GDP ($T)   Intangible %    Info Contrib %    Info Assets ($T)    Unused Info ($T)    Total ($T)
                                 Lo      Hi       Lo       Hi       Lo        Hi        Lo        Hi        Lo      Hi
US                    $14.72     9%     14%      33%      67%      $0.44     $1.38     $0.30     $1.21     $0.74   $2.60
European Union        $15.25     8%     12%      33%      50%      $0.40     $0.92     $0.31     $1.26     $0.72   $2.17
Remaining Advanced    $10.17     8%     12%      33%      50%      $0.27     $0.61     $0.21     $0.84     $0.48   $1.45
Rest of World         $34.32     2%      6%      10%      25%      $0.07     $0.51     $0.00     $0.71     $0.07   $1.22
Total                 $74.46                                       $1.18     $3.42     $0.83     $4.02     $2.00   $7.44
Notes (see endnotes): [21] [22] [23]
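
For what it is worth, the information asset columns in the table are consistent with a simple product of GDP, the intangible share and the information contribution; that is my reading of the layout, not a formula stated in the underlying sources. A quick check:

    # GDP ($T), (intangible share lo, hi), (information contribution lo, hi)
    rows = {
        "US":                 (14.72, (0.09, 0.14), (0.33, 0.67)),
        "European Union":     (15.25, (0.08, 0.12), (0.33, 0.50)),
        "Remaining Advanced": (10.17, (0.08, 0.12), (0.33, 0.50)),
        "Rest of World":      (34.32, (0.02, 0.06), (0.10, 0.25)),
    }

    for region, (gdp, intangible, info) in rows.items():
        lo = gdp * intangible[0] * info[0]
        hi = gdp * intangible[1] * info[1]
        print(f"{region:<20} info assets ${lo:.2f}T to ${hi:.2f}T")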

Depending on one’s perspective, these estimates can be viewed either as too optimistic about the importance of information assets [25] or as too conservative [26]. The breadth of the ranges of these values is itself an expression of the uncertainty in the numbers and the analysis.

The analysis shows that, globally, the value of unused and unaccounted information assets may be on the order of  $2.0 trillion to $7.4 trillion annually, with a mid-range value of $4.7 trillion. Even considering uncertainties, these are huge, huge numbers by any account. For the US alone, this range is $750 billion to $2.6 trillion annually. The analysis from the prior studies [20] would strongly suggest the higher end of this range is more likely than the lower. Similarly large gaps likely occur within the European Union and within other advanced nations. For individual firms, depending on size, the benefits of understanding and closing these gaps can readily be measured in the millions to billions [27].

At the high end, these estimates suggest that perhaps as much as 10% of global expenditures is wasted and unaccounted for due to information-related activities. This is roughly equivalent to adding a half of the US economy to the global picture.

In the concluding section, we touch on why such huge holes may appear in the world’s financial books. Clearly, though, even with uncertain and heroic assumptions, the magnitude of this gap is huge, with compelling needs to understand and close it as soon as possible.

The Relationship to Semantic Technologies

The seven Moody and Walsh information “laws” provide the clues to the reasons why we are not properly accounting for information and why we inadequately use it:

  • We don’t know what information we have and cannot find it
  • What we do have, we don’t connect
  • We misallocate resources for generating, capturing and storing information, because we don’t understand its value and potential
  • We don’t manage the use of information or its re-use
  • We duplicate efforts
  • We inadequately leverage the information we have, and so miss valuable (that is, valuatable) insights that could be gained.

Fundamentally, because information is not understood in our bones as central to the well-being of our enterprises, we continue to view the generation, capture and maintenance of information as a “cost” and not an “asset”.

I have maintained for some time an interactive information timeline [28] that attempts to encompass the entire human history of information innovations. For tens of thousands of years steady — yet slow — progress in the ways to express and manage information can be seen in this timeline. But, then, beginning with electricity and then digitization, the pace of innovation explodes.

The same timeframe that sees the importance of intangible assets appear on national and firm accounts is also the period in which we see the full digitization of information and its ability to be communicated and linked over digital networks. A very insightful figure by Rama Hoetzlein, prepared for his 2007 thesis and which I have modified and enhanced, captures this evolution with some estimated dates, as shown below [29]:

The first insight this figure provides is that all forms of information are now available in digital form. This includes unstructured (images and documents), semi-structured (mark-up and “tagged” information) and structured (database and spreadsheet) information. This information can now be stored and communicated over digital networks with broadly accepted protocols.

But the most salient insight is that we now have the means, through semantic technologies and approaches, to interrelate all of this information. Tagging and extraction methods enable us to generate metadata for unstructured documents and content. Data models based on predicate logic and related semantic logics give us flexible means to express the relationships and connections among information. All of this can be stored in graph-based datastores and manipulated with graph query languages, so that we can draw inferences and gain insights. And, since all of this is now accessible via the Web and browsers, virtually any user can access, use and leverage this information.
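To make this concrete, here is a minimal sketch, using Python and the rdflib library as one representative graph toolkit, of how metadata extracted from a document and a structured record about the same entity can live in a single graph and be queried together. The namespaces, properties and data values are illustrative assumptions only, not a prescribed vocabulary.

# A minimal sketch (using the rdflib library) of interrelating tagged document
# metadata and structured records in one graph, then querying across both.
from rdflib import Graph, Literal, Namespace, RDF, URIRef

EX = Namespace("http://example.org/")          # hypothetical namespace
DCT = Namespace("http://purl.org/dc/terms/")   # Dublin Core terms

g = Graph()
g.bind("ex", EX)
g.bind("dct", DCT)

doc = URIRef("http://example.org/docs/annual-report-2011")
customer = URIRef("http://example.org/customers/acme")

# Metadata extracted ("tagged") from an unstructured document
g.add((doc, RDF.type, EX.Document))
g.add((doc, DCT.title, Literal("2011 Annual Report")))
g.add((doc, EX.mentions, customer))

# A structured record about the same entity, now linked in the same graph
g.add((customer, RDF.type, EX.Customer))
g.add((customer, EX.annualRevenue, Literal(1200000)))

# One query now spans both the document metadata and the structured data
results = g.query("""
    PREFIX ex: <http://example.org/>
    PREFIX dct: <http://purl.org/dc/terms/>
    SELECT ?title ?revenue WHERE {
        ?doc ex:mentions ?c ;
             dct:title ?title .
        ?c ex:annualRevenue ?revenue .
    }
""")
for title, revenue in results:
    print(f"{title}: customer revenue {revenue}")

The point is not the particular toolkit but the pattern: once both kinds of information are expressed as triples, a single query language spans them.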

That figure and its dates not only show how far we have come as a species in our use of and sophistication with information, but also how we need to bring it all together using semantics to complete our transition to a knowledge economy.

The very same metadata and semantic tagging capabilities that enable us to interrelate the information itself also provide the techniques by which we can monitor and track usage and provenance. It is through these additional semantic methods that we can finally begin to gain insight into which information is of what value, and to whom. Tapping this information will complete the circle, letting us properly valuate and then manage and optimize our information assets.
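The same pattern extends naturally to provenance and usage. The brief sketch below, again in Python with rdflib and purely illustrative data, records who created an asset, when, what it was derived from, and who has used it, so that questions of value and usage become ordinary graph queries. The choice of the PROV and Dublin Core vocabularies here is mine, one reasonable option among several.

# A minimal sketch of provenance and usage tracking with the same graph approach.
# Vocabulary choices (PROV, Dublin Core) and all data are illustrative assumptions.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/")
DCT = Namespace("http://purl.org/dc/terms/")
PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
report = URIRef("http://example.org/docs/market-forecast")
source = URIRef("http://example.org/datasets/sales-2011")
analyst = URIRef("http://example.org/staff/jsmith")

g.add((report, DCT.creator, analyst))
g.add((report, DCT.created, Literal("2012-02-15")))
g.add((report, PROV.wasDerivedFrom, source))
g.add((analyst, EX.accessed, report))   # a usage event, recorded as just another triple

# "Which assets were derived from the 2011 sales dataset, and who has used them?"
q = """
    PREFIX ex: <http://example.org/>
    PREFIX prov: <http://www.w3.org/ns/prov#>
    SELECT ?asset ?user WHERE {
        ?asset prov:wasDerivedFrom <http://example.org/datasets/sales-2011> .
        OPTIONAL { ?user ex:accessed ?asset . }
    }
"""
for asset, user in g.query(q):
    print(asset, user)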

Conclusion

With our transition to an information economy, we now see that intangible assets exceed the value of tangible ones. We also see that information represents one-third to two-thirds of those intangibles. In other words, with intangibles accounting for one-half or more of firm value and information for one-third to two-thirds of intangibles, information makes up from roughly 17% to more than one-third of an individual firm’s value in modern economies. Further, at least 25% of firm expenditures on information is wasted, keeping it as a cost and negating its value as an asset.

The “factories” of the modern information economy no longer produce pins with the fixed inputs of labor and capital, as in the time of Adam Smith. They produce information, knowledge and know-how. Yet our management and accounting systems remain fixed in the techniques of yesteryear. The quaint idea of total factor productivity as a “residual” merely betrays our ignorance about the causes of economic growth and firm value. These are issues that should rightly occupy the attention of practitioners in the disciplines of accounting and management.

Why industrial-era accounting methods have been maintained in the present information age is for students of corporate power politics to debate. It should suffice to remind us that when industrialization induced a shift from the extraction of funds from feudal land possessions to earning profits on invested capital, most of the assumptions about how to measure performance had to change. When the expenses for acquiring information capabilities cease to be an arbitrary budget allocation and become the means for gaining Knowledge Capital, much of what is presently accepted as management of information will have to shift from a largely technological view of efficiency to an asset management perspective [30].

Accounting methods grounded in the early 1800s, premised on capital assets alone as the means to increase the productivity of labor, no longer work. Our engines of growth are not physical devices, but ideas, innovation and knowledge; in short, information. Capable executives recognize these trends, but have yet to change management practices to address them [31].

As managers and executives of firms, we need not await wholesale modernization of accounting practices to begin to make a difference. The first step is to understand the role, use and importance of information to our organizations. Looking clearly at the seven information “laws” and what they imply about tracking and monitoring is an immediate way to take this step. The second step is to understand and seriously evaluate the prospects for semantic approaches to make a difference today.

We have now sufficiently climbed the data federation pyramid [32]: all of our information assets are digital; we have network protocols to link them; we have natural language and extraction techniques for making documents first-class citizens alongside structured data; and we have logical data models and sound semantic technologies for tying it all together.

We need to reorganize our “factory” floors around these principles, just as prime movers and unit electric drives altered the factories of the past. We need to re-think our work processes and what we measure and value in order to compete in the 21st century. It is time to treat information with the seriousness its integral role in our enterprises warrants. Semantic technologies and approaches provide just the path to do so.


[1] Baruch Lev and Jürgen H. Daum, 2003. “Intangible Assets and the Need for a Holistic and More Future-oriented Approach to Enterprise Management and Corporate Reporting,” prepared for the 2003 PMA Intellectual Capital Symposium, 2nd October 2003, Cranfield Management Development Centre, Cranfield University, UK; see http://www.juergendaum.de/articles/pma_ic_symp_jdaum_final.pdf.
[2] Claude E. Shannon and Warren Weaver, 1949. The Mathematical Theory of Communication. The University of Illinois Press, Urbana, Illinois, 1949. ISBN 0-252-72548-4.
[3] Daniel Moody and Peter Walsh, 1999. “Measuring The Value Of Information: An Asset Valuation Approach,” paper presented at the Seventh European Conference on Information Systems (ECIS’99), Copenhagen Business School, Frederiksberg, Denmark, 23-25 June, 1999. See http://wwwinfo.deis.unical.it/zumpano/2004-2005/PSI/lezione2/ValueOfInformation.pdf. A precursor paper, also quite helpful and cited frequently by Moody and Walsh, is R. Glazer, 1993. “Measuring the Value of Information: The Information Intensive Organisation,” IBM Systems Journal, Vol. 32, No. 1, 1993.
[4] Some trade secrets could buck this trend if the value of the underlying enterprise that relies on them increases.
[5] M.K. Bergman, 2009. “The Law of Linked Data,” post in AI3:::Adaptive Information blog, October 11, 2009. See http://www.mkbergman.com/837/the-law-of-linked-data/.
[6] Leonard Nakamura, 2001. What is the U.S. Gross Investment in Intangibles? (At Least) One Trillion Dollars a Year!, Working Paper No. 01-15, Federal Reserve Bank of Philadelphia, October 2001; see http://www.phil.frb.org/files/wps/2001/wp01-15.pdf.
[7] Carol A. Corrado, Charles R. Hulten, and Daniel E. Sichel, 2004. Measuring Capital and Technology: An Expanded Framework. Federal Reserve Board, August 2004. http://www.federalreserve.gov/pubs/feds/2004/200465/200465pap.pdf.
[8] Leonard I. Nakamura, 2009. Intangible Assets and National Income Accounting: Measuring a Scientific Revolution, Working Paper No. 09-11, Federal Reserve Bank of Philadelphia, May 8, 2009; see http://www.philadelphiafed.org/research-and-data/publications/working-papers/2009/wp09-11.pdf.
[9] Christopher Mackie, Rapporteur, 2009. Intangible Assets: Measuring and Enhancing Their Contribution to Corporate Value and Economic Growth: Summary of a Workshop, prepared by the Board on Science, Technology, and Economic Policy (STEP) Committee on National Statistics (CNSTAT), ISBN: 0-309-14415-9, 124 pages; see http://www.nap.edu/openbook.php?record_id=1274 (available for PDF download with sign-in).
[10] Brand Finance, 2006. Global Intangible Tracker 2006: An Annual Review of the World’s Intangible Value, paper published by Brand Finance and The Institute of Practitioners in Advertising, London, UK, December 2006. See  http://www.brandfinance.com/images/upload/9.pdf.
[11] Kenan Patrick Jarboe and Roland Furrow, 2008. Intangible Asset Monetization: The Promise and the Reality, Working Paper #03 from the Athena Alliance, April 2008. See http://www.athenaalliance.org/pdf/IntangibleAssetMonetization.pdf.
[12] Uday M. Apte and Hiranya K. Nath, 2004. “Size, Structure and Growth of the US Information Economy,” UCLA Anderson School of Management, Business and Information Technologies, December 2004; see http://www.anderson.ucla.edu/documents/areas/ctr/bit/ApteNath.pdf.pdf.
[13] Uday M. Apte, Uday S. Karmarkar and Hiranya K. Nath, 2008. “Information Services in the US Economy: Value, Jobs and Management Implications,” California Management Review, Vol. 50, No. 3, pp. 12-30, 2008.
[14] See the Financial Accounting Standards Board—SFAS 141; see http://www.gasb.org/pdf/fas141r.pdf.

[15] See further, AICPA Special Committee on Financial Reporting, 1994. Improving Business Reporting—A Customer Focus: Meeting the Information Needs of Investors and Creditors. See http://www.aicpa.org/InterestAreas/AccountingAndAuditing/Resources/EBR/DownloadableDocuments/Jenkins%20Committee%20Report.pdf. The intangible asset categories referred to in [23] include: Blueprints; Book libraries; Broadcast licenses; Buy-sell agreements; Certificates of need; Chemical formulas; Computer software; Computerized databases; Contracts; Cooperative agreements; Copyrights; Credit information files; Customer contracts; Customer and client lists; Customer relationships; Designs and drawings; Development rights; Employment contracts; Engineering drawings; Environmental rights; Film libraries; Food flavorings and recipes; Franchise agreements; Historical documents; Health maintenance organization enrollment lists; Know-how; Laboratory notebooks; Literary works; Management contracts; Manual databases; Manuscripts; Medical charts and records; Musical compositions; Newspaper morgue files; Noncompete covenants; Patent applications; Patents (both product and process); Patterns; Prescription drug files; Prizes and awards; Procedural manuals; Product designs; Proposals outstanding; Proprietary computer software; Proprietary processes; Proprietary products; Proprietary technology; Publications; Royalty agreements; Schematics and diagrams; Shareholder agreements; Solicitation rights; Subscription lists; Supplier contracts; Technical and specialty libraries; Technical documentation; Technology-sharing agreements; Trade secrets; Trained and assembled workforce; and Training manuals.
[16] See, for example, Carol Corrado, Charles Hulten and Daniel Sichel, 2009. “Intangible Capital and U.S. Economic Growth,” Review of Income and Wealth Series 55, Number 3, September 2009; see http://www.conference-board.org/pdf_free/IntangibleCapital_USEconomy.pdf.
[17] As stated in Kenan Patrick Jarboe, 2007. Measuring Intangibles: A Summary of Recent Activity, Working Paper #02 from the Athena Alliance, April 2007. See http://www.athenaalliance.org/pdf/MeasuringIntangibles.pdf.
[18] The 5% estimate comes from Graham G. Rong, Chair at MIT Sloan CIO Symposium, as reported on SemanticWeb.com on May 5, 2011. (Rong also touted the use of semantic technologies to overcome this lack of use.) A similar 7% estimate comes from Pushpak Sarkar, 2002. “Information Quality in the Knowledge-Driven Enterprise,” InfoManagement Direct, November 2002. See http://www.information-management.com/infodirect/20021115/6045-1.html.
[19] M.K. Bergman, 2005. “Search and the ’25% Solution’,” AI3:::Adaptive Information blog, September 14, 2005. See http://www.mkbergman.com/121/search-and-the-25-solution/.
[20] M.K. Bergman, 2005. “Untapped Assets: the $3 Trillion Value of U.S. Documents,” BrightPlanet Corporation White Paper, July 2005, 42 pp. Also available online and in PDF.
[21] From the CIA, 2011. The World Factbook; accessed online at https://www.cia.gov/library/publications/the-world-factbook/index.html on May 9, 2011. The “remaining advanced” countries are Australia, Canada, Iceland, Israel, Japan, Liechtenstein, Monaco, New Zealand, Norway, Puerto Rico, Singapore, South Korea, Switzerland, Taiwan.
[22] The range of estimates is drawn from the Nakamura [8] and CHS [9] studies, with each respectively providing the lower and upper bounds. These values have been slightly decremented for non-US advanced countries, and greatly reduced for non-advanced ones.
[23] The high range is based on the categorical share of intangible asset categories (60 of 90) from the AICPA work [15]; the lower range is from the one-third of GDP estimates from [20]. These values have been slightly decremented for non-US advanced countries, and greatly reduced for non-advanced ones.
[24] For unused information assets, the high range is based on the one-third of GDP and 25% “waste” estimates from [20]; the low range halves each of those figures. These values have been slightly decremented for non-US advanced countries, and greatly reduced for non-advanced ones (and zero for the low range).
[25] Reasons the estimates may be too optimistic: whether information is truly as important as goodwill or branding; whether the intellectual basis of the cited resources is indeed real; and the considerable differences by country and sector (see [10] and [16]).
[26] Reasons the estimates may be too conservative: no network effects are included; non-advanced countries are greatly discounted; the information share is growing (but older estimates were used); and there are considerable differences by country and sector (see [10] and [16]).
[27] For some discussion of individual firm impacts and use cases see [10] and [20], among others.
[28] See the Timeline of Information History, and its supporting documentation at M.K. Bergman, 2008. “Announcing the ‘Innovations in Information’ Timeline,” AI3:::Adaptive Information blog, July 6, 2008; see  http://www.mkbergman.com/421/announcing-the-innovations-in-information-timeline/.
[29] This figure is a modification of the original published by Rama C. Hoetzlein, 2007. Quanta – The Organization of Human Knowledge: Systems for Interdisciplinary Research, Master’s Thesis, University of California Santa Barbara, June 2007; see http://www.rchoetzlein.com/quanta/ (p 112). I adapted this figure to add logics, data and metadata to the basic approach, with color coding also added.
[30] From Paul A. Strassmann, 1998. “The Value of Knowledge Capital,” American Programmer, March 1998. See  http://www.strassmann.com/pubs/valuekc/.
[31] For example, according to [11], in a 2003 Accenture survey of senior managers across industries, 49 percent of respondents said that intangible assets are their primary focus for delivering long-term shareholder value, but only 5 percent stated that they had an organized system to track the performance of these assets. Also, according to sources cited in Gio Wiederhold, Shirley Tessler, Amar Gupta and David Branson Smith, 2009. “The Valuation of Technology-Based Intellectual Property in Offshoring Decisions,” in Communications for the Association of Information Systems (CAIS) 24, May 2009 (see http://ilpubs.stanford.edu:8090/951/2/Article_07-270.pdf): Owners and stockholders acknowledge that IP valuation of technological assets is not routine within many organizations. A 2007 study performed by Micro Focus and INSEAD highlights the current state of affairs: Of the 250 chief information officers (CIOs) and chief finance officers (CFOs) surveyed from companies in the U.S., UK, France, Germany, and Italy, less than 50 percent had attempted to value their IT assets, and more than 60 percent did not assess the value of their software.
[32] M.K. Bergman, 2006. “Climbing the Data Federation Pyramid,” AI3:::Adaptive Information blog, May 25, 2006; see  http://www.mkbergman.com/229/climbing-the-data-federation-pyramid/.