One of the main reasons I am such a big fan of RDF as a canonical data model is its ability to capture information in structured, semi-structured and unstructured form. These sources are conventionally defined as:
A major trend I have written about for some time is the emergence of the structured Web: that is, the exposing of structure from these varied sources in order for more information to be interconnected and made interoperable. I have posited (really a view shared by many) that the structured Web is an intermediate point in the evolution of the Web from one of documents to one where meaningful semantics occurs.
It is clear in my writings — indeed in the very name of my company, Structured Dynamics — that structure plays a major role in our thinking. The use and reliance on this term, though, begs the question: just what is structure in an informational sense? We’ll find it helpful to get at the question of What is structure? from a basis using first principles. And this, in turn, may also provide insight into how structure and information are in fact inextricably entwined.
According to Wikipedia, structure is a fundamental notion, of tangible or intangible character, that refers to the recognition, observation, nature, or permanence of patterns and relationships of entities. The concept may refer to an object, such as a built structure, or an attribute, such as the structure of society.
Structure may be abstract, or it may be concrete. Its realm ranges from the physical to ideas and concepts. As a term, “structure” seems to be ubiquitous to every domain. Structure may be found across every conceivable scale, from the most minute and minuscule to the cosmic. Even realms without any physical aspect at all — such as ideas and languages and beliefs — are perceived by many to have structure. We apply the term to any circumstance in which things are arranged or connected to one another, as a means to describe the organization or relationships of things. We seem to know structure when we see it, and to be able to discern structure of very many kinds against unstructured or random backgrounds.
In this way structure quite resembles patterns, perhaps could even be used synonymously with that term. Other closely related concepts include order, organization, design or form. When expressed, structure, particularly that of a recognizably ordered or symmetrical nature, is often called beautiful.
One aspect of structure, I think, that provides the key to its roles and importance is that it can be expressed in shortened form as a mathematical statement. One could even be so bold as to say that mathematics is the language of structure. This observation is one of the threads that will help us tie structure to information.
The natural world is replete with structure. Patterns in nature are regularities of visual form found in the natural world. Each such pattern can be modeled mathematically. Typical mathematical forms in nature include fractals, spirals, flows, waves, lattices, arrays, Golden ratios, tilings, Fibonacci sequences, and power laws. We see them in such structures as clouds, trees, leaves, river networks, fault lines, mountain ranges, craters, animal spots and stripes, shells, lightning bolts, coastlines, flowers, fruits, skeletons, cracks, growth rings, heartbeats and rates, earthquakes, veining, snow flakes, crystals, blood and pulmonary vessels, ocean waves, turbulence, bee hives, dunes and DNA.
The mathematical expression of structures in nature is frequently repeated or recursive, often in a self-organizing manner. The swirls of a snail’s shell reflect a Fibonacci sequence, while natural landscapes or lifeforms often have a fractal aspect (as expressed by some of the figures in this article). Fractals are typically self-similar patterns, generally involving some fractional or ratioed formula that is recursively applied. Another way to define a fractal is as a detailed pattern repeating itself.
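To make the idea of a recursively applied rule concrete, here is a minimal Python sketch. The function name `fibonacci` and the choice of 20 terms are illustrative assumptions, not anything from a library. It generates the Fibonacci sequence and shows the ratio of successive terms settling toward the Golden ratio:

```python
def fibonacci(n):
    """Return the first n Fibonacci numbers, starting 1, 1."""
    seq = []
    a, b = 1, 1
    for _ in range(n):
        seq.append(a)
        a, b = b, a + b  # each term is the sum of the two before it
    return seq


terms = fibonacci(20)
# Ratio of each term to its predecessor
ratios = [terms[i + 1] / terms[i] for i in range(len(terms) - 1)]
golden = (1 + 5 ** 0.5) / 2

print(terms[:10])
print(ratios[-1], golden)
```

One short rule, applied over and over, is enough to produce the whole pattern, which is the sense in which such structures can be stated compactly.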
Even though these patterns can often be expressed simply and mathematically, and they often repeat themselves, their starting conditions can lead to tremendous variability and a lack of predictability. This makes them chaotic, as studied under chaos theory, though their patterns are often discernible.
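A standard textbook illustration of this sensitivity to starting conditions is the logistic map, x → r·x·(1 − x). It is my choice of example here, not one taken from the discussion above, and the names `logistic_orbit`, `r` and `steps` are illustrative. A minimal sketch, assuming r = 4 (a commonly used chaotic setting):

```python
def logistic_orbit(x0, r=4.0, steps=50):
    """Iterate the logistic map from x0 and return every value visited."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


a = logistic_orbit(0.2)
b = logistic_orbit(0.2 + 1e-10)  # nearly identical starting condition

# Distance between the two orbits at each step
gaps = [abs(p - q) for p, q in zip(a, b)]
print(gaps[0], max(gaps))
```

The rule is a single line, yet two starting points that differ by one part in ten billion end up on visibly different paths within a few dozen iterations.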
While we certainly see randomness in statistics, quantum physics and Brownian motion, it is also striking how what gives nature its beauty is structure. As a force separate and apart from the random, there appears to be something in structure that guides the expression of what is nature and what is so pleasing to behold. Self-similar and repeated structures across the widest variety of spatial scales seem to be an abiding aspect of nature.
Such forms of repeated patterns or structure are also inherent in that most unique of human capabilities, language. As a symbolic species, we first used symbols as a way to represent the ideas of things. Simple markings, drawings and ideograms grew into more complicated structures such as alphabets and languages. The languages themselves came to embrace still further structure via sentence structures, document structures, and structures for organizing and categorizing multiple documents. In fact, one of the most popular aspects of this blog site is its Timeline of Information History (worth your look), which shows the progression of structure in information throughout human history.
Grammar is often understood as the rules or structure that governs language. It is composed of syntax, including punctuation, traditionally understood as the sentence structure of languages, and morphology, which is the structural understanding of a language’s linguistic units, such as words, affixes, parts of speech, intonation or context. There is a whole field of linguistic typology that studies and classifies languages according to their structural features. But grammar is hardly the limit to language structure.
Semantics, the meaning of language, used to be held separate from grammar or structure. But via the advent of the thesaurus, and then linguistic databases such as WordNet and more recently concept graphs that relate words and terms into connected understandings, we have now come to understand that semantics also has structure. Indeed, these very structural aspects are now opening up to us techniques and methods, generally classified under the heading of natural language processing (NLP), for extracting meaningful structure from the very basis of written or spoken language.
It is the marriage of the computer with language that is illuminating these understandings of structure in language. And that opening, in turn, is enabling us to capture and model the basis of human language discourse in ways that can be codified, characterized, shared and analyzed. Machine learning and processing is now enabling us to complete the virtual circle of language. From its roots in symbols, we are now able to extract and understand those very same symbols in order to derive information and knowledge from our daily discourse. We are doing this by gleaning the structure of language, which in turn enables us to relate it to all other forms of structured information.
The continuation of structure from nature to language extends across all aspects of human endeavor. I remember excitedly describing to a colleague more than a decade ago what likely is a pedestrian observation: pattern matching is a common task in many fields. (I had observed that pattern matching in very different forms was a standard practice in most areas of industry and commerce.) My “insight” was that this commonality was not widely understood, which meant that pattern matching techniques in one field were not often exploited or seen as transferable to other domains.
In computer science, pattern matching is the act of checking some sequence of tokens for the presence of the constituents of some pattern. It is closely related to the idea of pattern recognition, which is the characterization of some discernible and repeated sequence. These techniques, as noted, are widely applied, with each field tending to have its own favorite algorithms. Common applications that one sees for such pattern-based calculations include communications, encoding and coding theory, file compression, data compression, machine learning, video compression, mathematics (including engineering and signal processing via such techniques as Fourier transforms), cryptography, NLP, speech recognition, image recognition, OCR, image analysis, search, sound cleaning (that is, error detection, such as Dolby) and gene sequence searching and alignment, among many others.
To better understand what is happening here and the commonalities, let’s look at the idea of compression. Data compression is valuable for transmitting any form of content in wired or wireless manners because we can transmit the same (or closely similar) message faster and with less bandwidth. There are both lossless (no loss of information) and lossy compression methods. Lossless data compression algorithms usually exploit statistical redundancy (that is, a pattern match) to represent data more concisely without losing information. Redundancy in information theory is the number of bits used to transmit a message minus the number of bits of actual information in the message. Lossless compression is possible because most real-world data has statistical redundancy. In lossy data compression, some loss of information is acceptable by dropping detail from the data to save storage space. These methods are guided by research that indicates, say, how certain frequencies may not be heard or seen by people and can be removed from the source data.
On a different level, there is a close connection between machine learning and compression: a system that predicts the posterior probabilities of a sequence given its entire history can be used for optimal data compression (by using arithmetic coding on the output distribution), while an optimal compressor can be used for prediction (by finding the symbol that compresses best, given the previous history). This equivalence has been used as justification for data compression as a benchmark for “general intelligence.” On a still different level, one major part of cryptography is the exact opposite of these objectives: constructing messages that pattern matching fails against or is extremely costly or time-consuming to analyze.
When one stands back from any observable phenomenon, be it natural or human communications, one can see that the “information” being conveyed often has patterns, recursion or other structure that enables it to be represented more simply and compactly in mathematical form. This brings me back to my two favorite protagonists in my recent writings: Claude Shannon and Charles S. Peirce.
Claude Shannon’s seminal work in 1948 on information theory dealt with the amount of information that could be theoretically and predictably communicated between a sender and a receiver. No context or semantics were implied in this communication, only the amount of information (for which Shannon introduced the term “bits”) and what might be subject to losses (or uncertainty in the accurate communication of the message). In this regard, what Shannon called “information” is what we would best term “data” in today’s parlance.
The context of Shannon’s paper and work by others preceding him was to understand information losses in communication systems or networks. Much of the impetus for this came about because of issues in wartime communications and early ciphers and cryptography and the emerging advent of digital computers. But the insights from Shannon’s paper also relate closely to the issues of data patterns and data compression.
A key measure of Shannon’s theory is what he referred to as information entropy, which is usually expressed by the average number of bits needed to store or communicate one symbol in a message. Entropy quantifies the uncertainty involved in predicting the value of a random variable. The Shannon entropy measure is actually a measure of the uncertainty based on the communication (transmittal) between a sender and a receiver; the actual information that gets transmitted and predictably received was formulated by Shannon as R, which can never be zero because all communication systems have losses.
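The "average number of bits per symbol" can be sketched in a few lines of Python. This is a simplified illustration under a stated assumption: it estimates entropy only from how often each symbol occurs in one message, ignoring the order of symbols. The function name `shannon_entropy` is my own label.

```python
import math
from collections import Counter


def shannon_entropy(message):
    """Estimate entropy in bits per symbol from symbol frequencies alone."""
    counts = Counter(message)
    total = len(message)
    # H = -sum(p * log2(p)) over the observed symbol probabilities
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(shannon_entropy("aaaaaaaa"))  # one symbol only: no uncertainty
print(shannon_entropy("abababab"))  # two equally likely symbols
print(shannon_entropy("abcdabcd"))  # four equally likely symbols
```

Because this frequency-only estimate ignores ordering, it scores "abababab" at one bit per symbol even though the sequence is perfectly predictable. Capturing that kind of pattern requires conditioning on the preceding symbols.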
A simple intuition can show how this formulation relates to patterns or data compression. Let’s take a message of completely random digits. In order to accurately communicate that message, all digits (bits) would have to be transmitted in their original state and form. Absolutely no compression of this message is possible. If, however, there are patterns within the message (which, of course, now ceases to make the message random), these can be represented algorithmically in shortened form, so that we only need communicate the algorithm and not the full bits of the original message. If this “compression” algorithm can then be used to reconstruct the bit stream of the original message, the data compression method is deemed to be lossless. The algorithm so derived is also the expression of the pattern that enabled us to compress the message in the first place (such as a*2+1).
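This intuition can be checked with Python's standard `zlib` module. The sketch below is only an illustration: the message sizes and the repeating ten-digit pattern are arbitrary choices of mine. It compresses one message of random bytes and one of equal length built from a repeating pattern, then confirms that the patterned message is reconstructed exactly:

```python
import os
import zlib

random_bytes = os.urandom(10_000)         # no pattern to exploit
patterned_bytes = b"0123456789" * 1_000   # same length, one repeated pattern

random_packed = zlib.compress(random_bytes, 9)
patterned_packed = zlib.compress(patterned_bytes, 9)

print(len(random_bytes), len(random_packed))
print(len(patterned_bytes), len(patterned_packed))

# Lossless: decompression reconstructs the original bit stream exactly
assert zlib.decompress(patterned_packed) == patterned_bytes
```

The random message should come out no smaller than it went in, while the patterned one shrinks to a small fraction of its original size.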
We can apply this same type of intuition to human language. In order to improve communication efficiency, the most common words (e.g., “a”, “the”, “I”) should be shorter than less common words (e.g., “disambiguation”, “semantics”, “ontology”), so that sentences will not be too long. And so they are. This is an equivalent principle to data compression. In fact, such repeats and patterns apply to the natural world as well.
Shannon’s idea of information entropy has come to inform the even broader subject of entropy in physics and the 2nd Law of Thermodynamics. According to Koelman, “the entropy of a physical system is the minimum number of bits you need to fully describe the detailed state of the system.” Very random (uncertain) states have high entropy, patterned ones low entropy. As I noted recently, in open systems, structures (patterns) are a means to speed the tendency to equilibrate across energy gradients. This observation helps provide insight into structure in natural systems, and why life and human communications tend toward less randomness. Structure will always continue to emerge because it is adaptive to speed the deltas across these gradients; structure provides the fundamental commonality between biological information (life) and human information.
In the words of Thomas Schneider, “Information is always a measure of the decrease of uncertainty at a receiver.” Of course, in Shannon’s context, what is actually being measured here is data (or bits), not information embodying any semantic meaning or context. Thus, the terminology may not be accurate for discussing “information” in a contemporary sense. But it does show that “structure” (that is, the basis for shortening the length of a message while still retaining its accuracy) is information (in the Shannon context). In this information there is order or patterns, often of a hierarchical or fractal or graph nature. Any structure that emerges that is better able to reduce the energy gradient faster will be favored according to the 2nd Law.
The data that constitutes “information” in the Shannon sense still lacks context and meaning. In communications terms, it is data; it has not yet met the threshold of actionable information. It is in this next step that we can look to Charles Sanders Peirce (1839 – 1914) for guidance.
The core of Peirce’s world view is based in semiotics, the study and logic of signs. In his seminal writing on this, “What is in a Sign?”, he wrote that “every intellectual operation involves a triad of symbols” and “all reasoning is an interpretation of signs of some kind”. A sign of an object leads to interpretants, which, as signs, then lead to further interpretants. Peirce’s triadic logic of signs in fact is a taxonomy of sign relations, in which signs get reified and expanded via still further signs, ultimately leading to communication, understanding and an approximation of “canonical” truth. Peirce saw the scientific method as itself an ultimate example of this process. The key aspect of signs for Peirce is the ongoing process of interpretation and reference to further signs.
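Huffman coding is one classic compression technique built on exactly this principle: frequent symbols receive short codes and rare symbols long ones. It is my choice of illustration, not a method named above, and the word counts below are invented for the example. A minimal sketch that computes only the code lengths:

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over symbols."""
    counts = Counter(symbols)
    if not counts:
        return {}
    if len(counts) == 1:
        return {sym: 1 for sym in counts}
    # Heap entries: (weight, tie_breaker, {symbol: depth in the code tree})
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol in them sinks one level
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


# Hypothetical word frequencies, chosen only for illustration
words = ["the"] * 8 + ["of"] * 4 + ["ontology"] * 2 + ["disambiguation"]
lengths = huffman_code_lengths(words)
print(lengths)
```

With these counts, the most frequent word gets a one-bit code and the two rarest get three bits each, mirroring the way common words in natural language tend to be the short ones.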
Ideograms leading to characters, that get combined into sentences and other syntax, and then get embedded within contexts of shared meanings show how these structures compound themselves and lead to clearer understandings (that is, accurate messages) in the context of human languages. While the Shannon understanding of “information” lacked context and meaning, we can see how still higher-order structures may be imposed through these reifications of symbols and signs that improve the accuracy and efficiency of our messages. Though Peirce did not couch his theory of semiosis on structure nor information, we can see it as a natural extension of the same structural premises in Shannon’s formulation.
In fact, today, we now see the “structure” in the semantic relationships of language through the graph structures of ontologies and linguistic databases such as WordNet. The understanding and explication of these structures are having a similarly beneficial effect on how still more refined and contextual messages can be composed, transmitted and received. Human-to-machine communications is (merely!) the challenge of codifying and making explicit the implicit structures in our languages.
The Peirceian ideas of interpretation (context) and compounding and reifying structure are a major intellectual breakthrough for extending the Shannon “information” theory to information in the modern sense. These insights also occur within a testable logic for how things and the names of things can be understood and related to one another, via logical statements or structures. These, in turn, can be symbolized and formalized into logical constructs that can capture the structure of natural language as well as more structured data (or even nature, as some of the earlier Peirce speculation asserts).
According to this interpretation of Peirce, the nature of information is the process of communicating a form from the object to the interpretant through the sign. The clarity of Peirce’s logic of signs is an underlying factor, I believe, for why we are finally seeing our way clear to how to capture, represent and relate information from a diversity of sources and viewpoints that is defensible and interoperable.
Common to all of these perspectives (from patterns to nature and on to life and then animal and human communications), we see that structure is information. Human artifacts and technology, though not “messages” in a conventional sense, also embody the information of how they are built within their structures. We also see the interplay of patterns and information in many processes of the natural world, from complexity theory, to emergence, to autopoiesis, and on to autocatalysis, self-organization, stratification and cellular automata. Structure in its many guises is ubiquitous.
We, as beings who can symbolically record our perceptions, seem to innately recognize patterns. We see beauty in symmetry. Bilateral symmetry seems to be deeply ingrained in the inherent perception by humans of the likely health or fitness of other living creatures. We see beauty in the patterned, repeated variability of nature. We see beauty in the clever turn of phrase, or in music, or art, or the expressiveness of a mathematical formulation.
We also seem to recognize beauty in the simple. Seemingly complex bit streams that can be reduced to the short, algorithmic expression are always viewed as more elegant than lengthier, more complex alternatives. The simple laws of motion and Newtonian physics fit this pattern, as does Einstein’s E=mc². This preference for the simple is a preference for the greater adaptiveness of the shorter, more universal pattern to messages, an insight indicated by Shannon’s information theory.
In the more prosaic terms of my vocation in the Web and information technology, these insights point to the importance of finding and deriving structured representations of information — including meaning (semantics) — that can be simply expressed and efficiently conveyed. Building upon the accretions of structure in human and computer languages, the semantic Web and semantic technologies offer just such a prospect. These insights provide a guidepost for how and where to look for the next structural innovations. We find them in the algorithms of nature and language, and in making connections that provide the basis for still more structure and patterned commonalities.
Ideas and algorithms around lossless compression and graph theory and network analysis are, I believe, the next fruitful hunting grounds for finding still higher orders of structure, which can be simply expressed. The patterns of nature, which have emerged incrementally and randomly over the eons of cosmological time, look to be an excellent laboratory.
So, as we see across examples from nature and life to language and all manner of communications, information is structure and structure is information. And it is simply beautiful.
It has been clear for some time that Google has been assembling a war chest of entities and attributes. It first began to appear as structured results in its standard results listings, a trend I commented upon more than three years ago in Massive Muscle on the ABox at Google. Its purchase of Metaweb and its Freebase service in July 2010 only affirmed that trend.
This week, perhaps a bit rushed due to attention to the Facebook IPO, Google announced its addition of the Knowledge Graph (GKG) to its search results. It has been releasing this capability in a staged manner. Since I was fortunately one of the first to be able to see these structured results (due to luck of the draw and no special “ins”), I have spent a bit of time deconstructing what I have found.
What you get (see below) when you search on particular kinds of entities is in essence an “infobox”, similar in structure to what is found on Wikipedia. This infobox is a tabular presentation of key-value pairs, or attributes, for the kind of entity in the search. A ‘people’ search, for example, turns up birth and death dates and locations, some vital statistics, spouse or famous relations, pictures, and links to other relevant structured entities. The types of attributes shown vary by entity type. Here is an example for Stephen King, the writer (all links from here forward provide GKG results), which shows the infobox and its key-value pairs in the right-hand column:
Reportedly these results are drawn from Freebase, Wikipedia, the CIA World Factbook and other unidentified sources. Some of the results may indeed be coming from Freebase, but I saw none as such. Most entities I found were from Wikipedia, though it is important to know that Freebase in its first major growth incarnation springboarded from a Wikipedia baseline. These early results may have been what was carried forward (since other contributed data to Freebase is known to be of highly variable quality).
The entity coverage appears to be spotty and somewhat disappointing in this first release. Entity types that I observed were in these categories:
Entity types that I expected to see, but did not find include:
This is clearly not rigorous testing, but it would appear that entity types along the lines of those in schema.org are what should be expected over time.
I have no way to gauge the accuracy of Google’s claims that it has offered up structured data on some 500 million entities (and 3.5 billion facts). However, given the lack of coverage in key areas of Wikipedia (which itself has about 3 million entities in the English version), I suspect much of that number comes from the local businesses and restaurants and such that Google has been rapidly adding to its listings in recent years. Coverage of broadly interesting stuff still seems quite lacking.
The much-touted knowledge graph is also kind of disappointing. Related links are provided, but they are close and obvious. So, an actor will have links to films she was in, or a person may have links to famous spouses, but anything approaching a conceptual or knowledge relationship is missing. I think, though, we can expect such links and types and entity expansions to creep in steadily over time. Google certainly has the data basis for making these expansions. And, constant, incremental improvement has been Google’s modus operandi.
For some time, and at various meetings I attend, I have always been at pains to question Google representatives whether there is some unique, internal ID for entities within its databases. Sometimes the reps I have questioned just don’t know, and sometimes they are cagey.
But, clearly, anything like the structured data that Google has been working toward has to have some internal identifier. To see if some of this might now have surfaced with the Knowledge Graph, I did a bit of poking of the URLs shown in the searches and the affiliated entities in the infoboxes. Under most standard searches, the infobox appears directly if there is one for that object. But, by inspecting cross-referenced entities from the infoboxes themselves, it is possible to discern the internal key.
The first thing one can do in such inspections is to remove the parts of the URL that are local or cookie-related, tied to one’s own use preferences or browser settings. Other tests can reveal further removals. So, using the Stephen King example above, we can eliminate these aspects of the URL:
This actually conformed to my intuition, because the ‘&stick’ aspect was a new parameter for me. (Typically, in many of these dynamic URLs, the various parameters are separated from one another by a set designator character. In the case of Google, that is the ampersand &.)
By simply doing repeated searches that result in the same entity references, I was able to confirm that the &stick parameter is what invokes the unique ID and infobox for each entity. We can decompose the parameter further, and the critical aspect seems to be what is not included within the following: &stick=H4sIAAAAAAAAAONg . . [VuLQz9U3] . . AAAA. The stuff in the brackets varies less, and I suspect might be related to the source, rather than the entity.
I started to do some investigation on types and possible sources, but ran out of time. Listed below are some &stick identifiers for various types of entities (each is a live link):
|Type|Entity|&stick identifier|
|---|---|---|
|Movie|The Green Mile|&stick=H4sIAAAAAAAAAONgVuLUz9U3MC62zC4AAGg8mEkNAAAA|
|Albums|The White Album|&stick=H4sIAAAAAAAAAONgVuLSz9U3MMxIN0nKAADnd5clDgAAAA|
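For anyone wanting to poke at these identifiers programmatically, here is a minimal Python sketch using only the standard library. The query string is hypothetical (only the stick values come from the listing above), and the helper name `stick_bytes` is my own. One observation that is easy to check: the shared `H4sI` prefix is what the two-byte gzip signature looks like once base64-encoded. That suggests the value may be a gzip-compressed payload, though the sketch only inspects the leading bytes and does not try to interpret the contents.

```python
import base64
from urllib.parse import parse_qs

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def stick_bytes(stick):
    """Decode a stick value as URL-safe base64, restoring stripped '=' padding."""
    padded = stick + "=" * (-len(stick) % 4)
    return base64.urlsafe_b64decode(padded)


# Hypothetical query string; parameters are separated by the ampersand
query = "q=the+green+mile&stick=H4sIAAAAAAAAAONgVuLUz9U3MC62zC4AAGg8mEkNAAAA"
params = parse_qs(query)

raw = stick_bytes(params["stick"][0])
print(params["q"][0])
print(raw[:2] == GZIP_MAGIC)
```

If the gzip reading is right, it would also account for why the beginning and end of every identifier look the same, since a gzip stream starts with a fixed header and ends with a short checksum-and-length trailer. That remains an assumption until someone decodes the payload itself.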
You can verify that this ‘&stick’ reference is what is pulling in the infobox by looking at this modified query that has substituted Marilyn Monroe’s &stick in the Stephen King URL string. Note the standard search results in the lefthand results panel are the same as for Stephen King, but we now have fooled the Google engine into displaying Marilyn Monroe’s infobox.
I’m sure over time that others will deconstruct this framework to a very precise degree. What would really be great, of course, as noted on many recent mailing lists, is for Google to expose all of this via an API. The Google listing could become the de facto source for Webby entity identifiers.
Sort of like when schema.org was first announced, there have been complaints from some in the semantic Web community that Google released this stuff without once saying the word “semantic”, that much had been ripped off from the original researchers and community without attribution, that a gorilla commercial entity like Google could only be expected to milk this stuff for profit, etc., etc.
That all sounds like sour grapes to me.
What we have here is what we are seeing across the board: the inexorable integration of semantic technology approaches into many existing products. Siri did it with voice commands; Bing, and now Google, are doing it too with search.
We should welcome these continued adoptions. The fact is, semWeb community, that what we are seeing in all of these endeavors is the right and proper role for these technologies: in the background, enhancing our search and information experience, and not something front and center or rammed down our throats. These are the natural roles of semantic technologies, and they are happening at a breakneck pace.
Welcome to the semantic technology space, Google! I look forward to learning much from you.
Since Richard Dawkins first put forward the idea of the “meme” in his book The Selfish Gene some 35 years ago, the premise has stuck in my craw. I, like Dawkins, was trained as an evolutionary biologist. I understand the idea of the gene and its essential role as a vehicle for organic evolution. And, all of us clearly understand that “ideas” themselves have a certain competitive and adaptive nature. Some go viral; some run like wildfire and take prominence; and some go nowhere or fall on deaf ears. Culture and human communications and ideas play complementary (perhaps even dominant) roles in comparison to the biological information contained within DNA (genes).
I think there are two bases for why the “meme” idea sticks in my craw. The first harkens back to Dawkins. In formulating the concept of the “meme”, Dawkins falls into the trap of many professionals, what the French call déformation professionnelle. This is the idea of professionals framing problems from the confines of their own points of view. This is also known as the Law of the Instrument, or (Abraham) Maslow’s hammer, or what all of us know colloquially as “if all you have is a hammer, everything looks like a nail”. Human or cultural information is not genetics.
The second, and more fundamental, basis for why this idea sticks in my craw is its mis-characterization of what is adaptive information, the title and theme of this blog. Sure, adaptive information can be found in the types of information structures at the basis of organic life and organic evolution. But, adaptive information is much, much more. Adaptive information is any structure that provides arrangements of energy and matter that maximize entropy production. In inanimate terms, such structures include chemical chirality and proteins. It includes the bases for organic life, inheritance and organic evolution. For some life forms, it might include communications such as pheromones or bird or whale songs or the primitive use of tools or communicated behaviors such as nest building. For humans with their unique abilities to manipulate and communicate symbols, adaptive information embraces such structures as languages, books and technology artifacts. These structures don’t look or act like genes and are not replicators in any fashion of the term. To hammer them as “memes” significantly distorts their fundamental nature as information structures and glosses over what factors might (or might not) make them adaptive.
I have been thinking of these concepts much over the past few decades. Recently, though, there has been a spate of the “meme” term, particularly on the semantic Web mailing lists to which I subscribe. This spewing has caused me to outline some basic ideas about what I find so problematic in the use of the “meme” concept.
As defined by Dawkins and expanded upon by others, a “meme” is an idea, behavior or style that spreads from person to person within a culture. It is proposed as being able to be transmitted through writing, speech, gestures or rituals. Dawkins specifically cited melodies, catch-phrases, fashion and the technology of building arches as examples of memes. Memes are postulated as a cultural analogue to genes in that they are assumed to be able to self-replicate, mutate or respond to selective pressures. Thus, as proposed, memes may evolve by natural selection in a manner analogous to that of biological evolution.
However, unlike a gene, a structure corresponding to a “meme” has never been discovered or observed. There is no evidence for it as a unit of replication, or indeed as any kind of coherent unit at all. In its sloppy use, it is hard to see how “meme” differs in its scope from concepts, ideas or any form of cultural information or transmission, yet it is imbued with properties analogous to animate evolution for which there is not a shred of empirical evidence.
One might say, so what, the idea of a “meme” is merely a metaphor, what is the harm? Well, the harm comes about when it is taken seriously as a means of explaining human behavior and cultural changes, a field of study called memetics. It becomes a pseudo-scientific term that sets a boundary condition for understanding the nature of information and what makes it adaptive or not. Mechanisms and structures appropriate to animate life are not universal information structures; they are simply the structures that have evolved in the organic realm. In the human realm of signs and symbols and digital information and media, information is the universal, not the genetic structure of organic evolution.
The noted evolutionary geneticist, R.C. Lewontin, one of my key influences as a student, has also been harshly critical of the idea of memetics:
Consistent with my recent writings about Charles S. Peirce, many logicians and semiotic theorists are also critical of the idea of “memes”, but on different grounds. The criticism here is that “memes” distort Peirce’s ideas about signs and the reification of signs and symbols via a triadic nature. Notable in this camp is Terrence Deacon.
It is not surprising that the concept of “memes” arose in the first place. It is understandable to seek universal principles consistent with natural laws and observations. The mechanism of natural evolution works on the information embodied in DNA, so why not look to genes as some form of universal model?
The problem here, I think, was to confuse mechanisms with first principles. Genes are a mechanism (a “structure”, if you will) that, along with other forms of natural selection such as the entire organism and even kin selection, have evolved as means of adaptation in the animate world. But the fundamental thing to be looked for here is the idea of information, not the mechanism of genes and how they replicate. The idea of information holds the key for drilling down to universal principles that may find commonality between information for humans in a cultural sense and information conveyed through natural evolution for life forms. It is the search for this commonality that has driven my professional interests for decades, spanning from population genetics and evolution to computers, information theory and semantics.
But before we can tackle these connections head on, it is important to address a couple of important misconceptions (as I see them).
In looking to information as a first principle, Claude Shannon’s seminal work in 1948 on information theory must be taken as the essential point of departure. The motivation of Shannon’s paper and work by others preceding him was to understand information losses in communication systems or networks. Much of the impetus for this came about because of issues in wartime communications and early ciphers and cryptography. (As a result, the Shannon paper is also intimately related to data patterns and data compression, not further discussed here.)
In a strict sense, Shannon’s paper was really talking about the amount of information that could be theoretically and predictably communicated between a sender and a receiver. No context or semantics were implied in this communication, only the amount of information (for which Shannon introduced the term “bits”) and what might be subject to losses (or uncertainty in the accurate communication of the message). In this regard, what Shannon called “information” is what we would best term “data” in today’s parlance.
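The uncertainty (unpredictability) calculation that Shannon derived takes the form:

$$H = -\sum_{i} p_i \log_2 p_i$$

where $p_i$ is the probability of the $i$-th possible message. This closely resembles the mathematical form of entropy in statistical mechanics, and thus Shannon also labelled his measure of unpredictability, H, as entropy.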
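As a rough illustration of Shannon's measure H, the short Python sketch below computes the entropy in bits of a discrete probability distribution. The function name `shannon_entropy` and the coin examples are assumptions chosen for illustration; they do not come from Shannon's paper.

```python
import math

def shannon_entropy(probabilities):
    """Return H = -sum(p * log2(p)), in bits, for a discrete distribution."""
    if any(p < 0 for p in probabilities):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probabilities), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p == 0 are skipped: they contribute nothing to H.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical examples: a fair coin is maximally unpredictable,
# a biased coin less so, and a certain outcome not at all.
fair_coin = shannon_entropy([0.5, 0.5])
biased_coin = shannon_entropy([0.9, 0.1])
certain = shannon_entropy([1.0])

print(f"fair coin:   {fair_coin:.3f} bits")
print(f"biased coin: {biased_coin:.3f} bits")
print(f"certain:     {certain:.3f} bits")
```

Note that nothing in the calculation refers to what the messages mean; only their probabilities enter, which is why the measure is better thought of as applying to data.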
After Shannon, and nearly a century after Boltzmann, work by individuals such as Jaynes in the field of statistical mechanics came to show that thermodynamic entropy can indeed be seen as an application of Shannon’s information theory, so there are close parallels. This parallel of mathematical form and terminology has led many to assert that information is entropy.
I believe this assertion is a misconception on two grounds.
First, as noted, what is actually being measured here is data (or bits), not information embodying any semantic meaning or context. Thus, the formula and terminology are not accurate for discussing “information” in a conventional sense.
Second, the Shannon methods are based on the communication (transmittal) between a sender and a receiver. Thus the Shannon entropy measure is actually a measure of the uncertainty for either one of these states. The actual information that gets transmitted and predictably received was formulated by Shannon as R (which he called rate), expressed basically as:
$R = H_{before} - H_{after}$
R, then, becomes a proxy for the amount of information accurately communicated. R can never be zero (because all communication systems have losses). $H_{before}$ and $H_{after}$ are both state functions for the message, so this also makes R a function of state. So while there is Shannon entropy (unpredictability) for any given sending or receiving state, the actual amount of information (that is, data) that is transmitted is a change in state as measured by a change in uncertainty between sender ($H_{before}$) and receiver ($H_{after}$). In the words of Thomas Schneider, who provides a very clear discussion of this distinction:
Information is always a measure of the decrease of uncertainty at a receiver.
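To make the distinction concrete, here is a minimal Python sketch of R as the difference between the uncertainty before and after a message is received. The eight-message scenario and the probabilities are hypothetical values chosen for illustration, not figures from Shannon or Schneider.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def information_received(before, after):
    """R: uncertainty before receipt minus the uncertainty that remains after."""
    return entropy_bits(before) - entropy_bits(after)

# Hypothetical scenario: the receiver expects one of 8 equally likely messages.
before = [1 / 8] * 8
# After a noisy transmission, the receiver is 90% sure of one message
# and assigns the remaining 10% to a second candidate.
after = [0.9, 0.1] + [0.0] * 6

h_before = entropy_bits(before)
h_after = entropy_bits(after)
r = information_received(before, after)

print(f"uncertainty before: {h_before:.3f} bits")
print(f"uncertainty after:  {h_after:.3f} bits")
print(f"R (received):       {r:.3f} bits")
```

The entropy of either state alone says nothing about what was communicated; only the drop between the two states does.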
These points do not directly bear on the basis of information as discussed below, but help remove misunderstandings that might undercut those points. Further, these clarifications make consistent theoretical foundations of information (data) with natural evolution while being logically consistent with the 2nd law of thermodynamics (see next).
The 2nd law of thermodynamics expresses the tendency that, over time, differences in temperature, pressure, or chemical potential equilibrate in an isolated physical system. Entropy is a measure of this equilibration: for a given physical system, the highest entropy state is one at equilibrium. Fluxes or gradients arise when there are differences in state potentials in these systems. (In physical systems, these are known as sources and sinks; in information theory, they are sender and receiver.) Fluxes go from low to high entropy, and are non-reversible (the “arrow of time”) without the addition of external energy. Heat, for example, is a by-product of fluxes in thermal energy. Because these fluxes are directional in isolation, a perpetual motion machine is shown as impossible.
In a closed system (namely, the entire cosmos), one can see this gradient as spanning from order to disorder, with the equilibrium state being the random distribution of all things. This perspective, and much schooling regarding these concepts, tends to present the idea of entropy as a “disordered” state. Life is seen as the “ordered” state in this mindset. Hewing to this perspective, some prominent philosophers, scientists and others have sometimes tried to present the “force” representing life and “order” as an opposite one to entropy. One common term for this opposite “force” is “negentropy”.
But, in the real conditions common to our lives, our environment is distinctly open, not closed. We experience massive influxes of energy via sunlight, and have learned as well how to harness stored energy from eons past in further sources of fossil and nuclear energy. Our open world is indeed a high energy one, and one that increases that high-energy state as our knowledge leads us to exploit still further resources of higher and higher quality. As Buckminster Fuller once famously noted, electricity consumption (one of the highest quality energy resources found to date) has become a telling metric about the well-being and wealth of human societies.
The high-energy environments fostering life on earth and more recently human evolution establish a local (in a cosmic sense) gradient that promotes fluxes to more ordered states, not less ordered ones. These fluxes remain faithful to basic physical laws and are non-deterministic. Indeed, such local gradients can themselves be seen as consistent with the conditions initially leading to life, favoring the random event in the early primordial soup that led to chemical structures such as chirality, auto-catalytic reactions, enzymes, and then proteins, which became the eventual building blocks for animate life.
These events did not have preordained outcomes (that is, they were non-deterministic), but were the result of time and variation in the face of external energy inputs to favor the marginal combinatorial improvement. The favoring of the new marginal improvement also arises consistent with entropy principles, by giving a competitive edge to those structures that produce faster movements across the existing energy gradient. According to Annila and Annila:
Via this analysis we see that life is not at odds with entropy, but is consistent with it. Further, we see that incremental improvements in structure that are consistent with the maximum entropy production principle will be favored. Of course, absent the external inputs of energy, these gradients would reverse. Under those conditions, the 2nd law would promote a breakdown to a less ordered system, which is what most of us were taught in school.
With these understandings we can now see as false the dichotomy of life representing order and entropy representing disorder. Further, we can see a guiding set of principles that is consistent across the broad span of evolution from primordial chemicals and enzymes to basic life and on to human knowledge and artifacts. This insight provides the fundamental “unit” we need to be looking toward, and not the gene nor the “meme”.
Of course, the fundamental “unit” we are talking about here is information, and not limited as is Shannon’s concept to data. The quality that changes data to information is structure, and structure of a particular sort. Like all structure, there is order or patterns, often of a hierarchical or fractal or graph nature. But the real aspect of the structure that is important is the marginal ability of that structure to lead to improvements in entropy production. That is, processes are most adaptive (and therefore selected) that maximize entropy production. Any structure that emerges that is able to reduce the energy gradient faster will be favored.
However, remember, these are probabilistic, statistical processes. Uncertainties in state may favor one structure at one time versus another at a different time. The types of chemical compounds favored in the primordial soup were likely greatly influenced by thermal and light cycles and drying and wet conditions. In biological ecosystems, there are huge differences in seed or offspring production or in overall species diversity and ecological complexity based on the stability (say, tropics) or instability (say, disturbance) of local environments. As noted, these processes are inherently non-deterministic.
As we climb up the chain from the primordial ooze to life and then to humans and our many information mechanisms and technology artifacts (which are themselves embodiments of information), we see increasing complexity and structure. But we do not see uniformity of mechanisms or vehicles.
The general mechanisms of information transfer in living organisms occur (generally) via DNA in genes, mediated by sex in higher organisms, subject to random mutations, and then kept or lost entirely as their host organisms survive to procreate or not. Those are harsh conditions: the information survives or not (on a population basis) with high concentrations of information in DNA and with a priority placed on remixing for new combinations via sex. Information exchange (generally) only occurs at each generational event.
Human cultural information, however, is of an entirely different nature. Information can be made persistent, can be recorded and shared across individuals or generations, extended with new innovations like written language or digital computers, or combined in ways that defy the limits of sex. Occasionally, of course, losses can also occur, such as living languages disappearing when certain cultures or populations die out, or horrendous catastrophes like the Spanish burning (nearly all of) the Mayans’ existing books. The environment will also be uncertain.
So, while we can define DNA in genes or the ideas of a “meme” all as information, in fact we now see how very unlike the dynamics and structures of these two forms really are. We can be awestruck with the elegance and sublimity of organic evolution. We can also be inspired by song or poem or moved to action through ideals such as truth and justice. But organic evolution does not transpire like reading a book or hearing a sermon, just like human ideas and innovations don’t act like genes. The “meme” is a totally false analogy. The only constant is information.
The closer we come to finding true universals, the better we will be able to create maximum entropy producing structures. This, in turn, has some pretty profound implications. The insight that keys these implications begins with an understanding of the fundamental nature (and importance) of information. According to Karnani et al.:
All would agree that the evolution of life over the past few billion years is truly wondrous. But, what is equally wondrous is that the human species has come to learn and master symbols. That mastery, in turn, has broken the bounds of organic evolution and has put into our hands the very means and structure of information itself. Via this entirely new — and incredibly accelerated — path to information structures, we are only now beginning to see some of its implications:
The idea of a “meme” actually cheapens our understanding of these potentials.
Ideas matter and terminology matters. These are the symbols by which we define and communicate potentials. If we choose the wrong analogies or symbols — as “meme” is in this case — we are picking the option with the lower entropy potential. Whether I assert it to be so or not, the “meme” concept is an information structure doomed for extinction.
The crowning achievement of the semantic Web is the simple use of URIs to identify data. Further, if the URI identifier can resolve to a representation of that data, it now becomes an integral part of the HTTP access protocol of the Web while providing a unique identifier for the data. These innovations provide the basis for distributed data at global scale, all accessible via Web devices such as browsers and smartphones that are now a ubiquitous part of our daily lives.
Yet, despite these profound and simple innovations, the semantic Web’s designers and early practitioners and advocates have been mired in a muddled, metaphysical argument of at least a decade over what these URIs mean, what they reference, and what their actual true identity is. These muddles about naming and identity, it might be argued, are due to computer scientists and programmers trying to grapple with issues more properly the domain of philosophers and linguists. But that would be unfair. For philosophers and linguists themselves have for centuries also grappled with these same conundrums.
As I argue in this piece, part of the muddle results from attempting to do too much with URIs while another part results from not doing enough. I am also not trying to directly enter the fray of current standards deliberations. (Despite a decade of controversy, I optimistically believe that the messy process of argument and consensus building will work itself out.) What I am trying to do in this piece, however, is to look to one of America’s pre-eminent philosophers and logicians, Charles Sanders Peirce (pronounced “purse”), to inform how these controversies of naming, identity and meaning may be dissected and resolved.
The Web began as a way to hyperlink between documents, generally Web pages expressed in the HTML markup language. These initial links were called URLs (uniform resource locators), and each pointed to various kinds of electronic resources (documents) that could be accessed and retrieved on the Web. These resources could be documents written in HTML or other encodings (PDFs, other electronic formats), images, streaming media like audio or videos, and the like.
All was well and good until the idea of the semantic Web, which postulated that information about the real world — concepts, people and things — could also be referenced and made available for reasoning and discussion on the Web. With this idea, the scope of the Web was massively expanded from electronic resources that could be downloaded and accessed via the Web to now include virtually any topic of human discourse. The rub, of course, was that ideas such as abstract concepts or people or things could not be “dereferenced” nor downloaded from the Web.
One of the first things that needed to change was to define a broader concept of a URI “identifier” above the more limited concept of a URL “locator”, since many of these new things that could be referenced on the Web went beyond electronic resources that could be accessed and viewed. But, since what the referent of the URI now actually might be became uncertain (was it a concept or a Web page that could be viewed or something else?), a number of commentators began to note this uncertainty as the “identity crisis” of the Web. The topic took on much fervor and metaphysical argument, such that by 2003, Sandro Hawke, a staffer of the standards-setting W3C (World Wide Web Consortium), was able to say, “This is an old issue, and people are tired of it”.
Yet, for many of the reasons described more fully below, the issue refused to go away. The Technical Architecture Group (TAG) of the W3C took up the issue, under a rubric that came to be known as httpRange-14. The issue was first raised in March 2002 by Tim Berners-Lee, accepted for TAG deliberations in February 2003, with then a resolution offered in June 2005. (Refer to the original resolution and other information to understand the nuances of this resolution, since particular commentary on that approach is not the focus of this article.) Suffice it to say here, however, that this resolution posited an entirely new distinction of Web content into “information resources” and “non-information resources”, and also recommended the use of the HTTP 303 redirect code for when agents requesting a URI should be directed to concepts versus viewable documents.
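For readers unfamiliar with the mechanics, the following Python sketch shows roughly how a server following this guidance might answer requests: a 200 response for a viewable document, and a 303 redirect pointing to a describing document when the URI stands for a thing that cannot itself be downloaded. The URIs, the two lookup tables and the `respond` function are hypothetical, invented here for illustration; this is not code from the TAG or from any actual server.

```python
# Hypothetical lookup tables (assumptions for illustration only).
# URIs that stand for things which cannot be downloaded, mapped to
# the document that describes each one.
DESCRIBED_BY = {
    "http://example.com/toucan": "http://example.com/doc/toucan",
}
# URIs that identify retrievable documents, mapped to their content.
DOCUMENTS = {
    "http://example.com/doc/toucan": "<html>A page describing toucans.</html>",
}

def respond(uri):
    """Return (status, headers, body) for a GET request on the given URI."""
    if uri in DOCUMENTS:
        # A 2xx response signals a viewable "information resource".
        return 200, {"Content-Type": "text/html"}, DOCUMENTS[uri]
    if uri in DESCRIBED_BY:
        # A 303 "See Other" sends the agent to a document about the thing.
        return 303, {"Location": DESCRIBED_BY[uri]}, ""
    return 404, {}, ""

status, headers, _ = respond("http://example.com/toucan")
print(status, headers)          # the thing itself: redirect to its description
status2, _, body2 = respond(headers["Location"])
print(status2, body2)           # the description: an ordinary document
```

The extra round trip for every such URI is one concrete source of the inefficiencies that critics of this guidance point to.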
This “resolution” has been anything but. Not only can no one clearly distinguish these de novo classes of “information resources”, but the whole approach feels arbitrary and kludgy.
Meanwhile, the confusions caused by the “identity crisis” and httpRange-14 continued to perpetuate themselves. In 2006, a major workshop on “Identity, Reference and the Web” (IRW 2006) was held in conjunction with the Web’s major WWW2006 conference in Edinburgh, Scotland, on May 23. The various presentations and its summary (by Harry Halpin) are very useful to understand these issues. What was starting to jell at this time was the understanding that the basis of identity and meaning on the Web posed new questions, and that philosophers, logicians and linguists needed to be consulted to help answer them.
The fiat of the TAG’s 2005 resolution has failed to take hold. Over the ensuing years, various eruptions have occurred on mailing lists and within the TAG itself (now expressed as Issue 57) to revisit these questions and bring the steps moving forward into some coherent new understanding. Though linked data has been premised on best-practice implementation of these resolutions, and has been a qualified success, many (myself included) would claim that the extra steps and inefficiencies required from the TAG’s httpRange-14 guidance have been hindrances, not facilitators, of the uptake of linked data (or the semantic Web).
Today, despite the efforts of some to claim the issue closed, it is not. Issue 57 and the periodic bursts from notable semantic Web advocates such as Ian Davis, Pat Hayes and Harry Halpin, Ed Summers, Xiaoshu Wang, David Booth and TAG members themselves, such as Larry Masinter and Jonathan Rees, point to continued irresolution and discontent within the advocate community. Issue 57 currently remains open. Meanwhile, I think, all of us interested in such matters can express concern that linked data, the semantic Web and interoperable structured data have seen less uptake than any of us had hoped or wanted over the past decade. As I have stated elsewhere, unclear semantics and muddled guidelines help to undercut potential use.
As each of the eruptions over these identity issues has occurred, the competing camps have often been characterized as “talking past one another”; that is, not communicating in such a way as to help resolve to consensus. While it is hardly my position to do so, I try to encapsulate below the various positions and prejudices as I see them in this decades-long debate. I also try to share my own learning that may help inform some common ground. Forgive me if I overly simplify these vexing issues by returning to what I see as some first principles . . . .
One legacy of the initial document Web is the perception that Web addresses have meaning. We have all heard of the multi-million dollar purchasing of domains and the adjudication that may occur when domains are hijacked from their known brands or trademark owners. This legacy has tended to imbue URIs with a perceived value. It is not by accident, I believe, that many within the semantic Web and linked data communities still refer to “minting” URIs. Some believe that ownership and control over URIs may be equivalent to grabbing up valuable real estate. It is also the case that many believe the “name” given to a URI acts to name the referent to which it refers.
This perception is partially true, partially false, but moreover incomplete in all cases. We can illustrate these points with the global icon, “Coca-Cola”.
As for the naming aspects, let’s dissect what we mean when we use the label “Coca-Cola” (in a URI or otherwise). Perhaps the first thing that comes to mind is “Coca-Cola,” the beverage (which has a description on Wikipedia, among other references). Because of its ubiquity, we may also recognize the image of the Coca-Cola bottle to the left as a symbol for this same beverage. (Though, in the hilarious movie, The Gods Must Be Crazy, Kalahari Bushmen, who had no prior experience of Coca-Cola, took the bottle to be magical with evil powers.) Yet even as reference to the beverage, the naming aspects are a bit cloudy since we could also use the fully qualified synonyms of “Coke”, “Coca-cola” (small C), “Classic Coke” and the hundreds of language variants worldwide.
On the other hand, the label “Coca-Cola” could just as easily conjure The Coca-Cola Company itself. Indeed, the company web site is the location pointed to by the URI of http://www.thecoca-colacompany.com/. But, even that URI, which points to the home Web page of the company, does not do justice to conveying an understanding or description of the company. For that, additional URIs may need to be invoked, such as the description at Wikipedia, the company’s own company description page, plus perhaps the company’s similar heritage page.
Of course, even these links and references only begin to scratch the surface of what the company Coca-Cola actually is: headquarters, manufacturing facilities, 140,000 employees, shareholders, management, legal entities, patents and Coke recipe, and the like. Whether in human languages or URIs, in any attempt to signify something via symbols or words (themselves another form of symbol), we risk ambiguity and incompleteness.
URI shorteners also undercut the idea that a URI necessarily “names” something. Using the service bitly, we can shorten the link to the Wikipedia description of the Coke beverage to http://bit.ly/xnbA6 and we can shorten the link to The Coca-Cola Company Web site to http://bit.ly/9ojUpL. I think we can fairly say that neither of these shortened links “names” its referent. The most we can say about a URI is that it points to something. With the vagaries of meaning in human languages, we might also say that URIs refer to something, denote something or identify (but not in the sense of completely define) something.
From this discussion, we can assert with respect to the use of URIs as “names” that:
In summary, I think we can say that URIs may act as names, but not in all or most cases, and when used as such are often ambiguous. Absolutely associating URIs as names is way too heavy a burden, and incorrect in most cases.
The “name” discussion above masks that in some cases we are talking about a readable Web document or image (such as the Wikipedia description of the Coke beverage or its image) versus the “actual” thing in the real world (the Coke beverage itself or even the company). This distinction is what led to the so-called “identity crisis”, for which Ian Davis has used a toucan as his illustrative thing.
As I note in the conclusion, I like Davis’ approach to the identity conundrum insofar as Web architecture and linked data guidance are concerned. But here my purpose is more subtle: I want to tease apart still further the apparent distinction between an electronic description of something on the Web and the “actual” something. Like Davis, let’s use the toucan.
In our strawman case, we too use a description of the toucan (on Wikipedia) to represent our “information resource” (the accessible, downloadable electronic document). We contrast to that a URI that we mean to convey the actual physical bird (a “non-information resource” in the jumbled jargon of httpRange-14), which we will designate via the URI of http://example.com/toucan.
Despite the tortured (and newly conjured) distinction between “information resource” and “non-information resource”, the first blush reaction is that, sure, there is a difference between an electronic representation that can be accessed and viewed on the Web and its true, “actual” thing. Of course people cannot actually be rendered and downloaded on the Web, but their bios and descriptions and portrait images may. While in the abstract such distinctions appear true and obvious, in the specifics that get presented to experts, there is surprising disagreement as to what is actually an “information resource” versus a “non-information resource”. Moreover, as we inspect the real toucan further, even that distinction is quite ambiguous.
When we inspect what might be a definitive description of “toucan” on Wikipedia, we see that the term more broadly represents the family of Ramphastidae, which contains five genera and forty different species. The picture we are showing to the right is but of one of those forty species, that of the keel-billed toucan (Ramphastos sulfuratus). Viewing the images of the full list of toucan species shows just how divergent these various “physical birds” are from one another. Across all species, average sizes vary by more than a factor of three with great variation in bill sizes, coloration and range. Further, if I assert that the picture to the right is actually that of my pet keel-billed toucan, Pretty Bird, then we can also understand that this representation is for a specific individual bird, and not the physical keel-billed toucan species as a whole.
The point of this diversion is not a lecture on toucans, but an affirmation that distinctions between “resources” occur at multiple levels and dimensions. Just as there are no self-evident criteria as to what constitutes an “information resource”, there is also not a self-evident and fully defining set of criteria as to what is the physical “toucan” bird. The meaning of what we call a “toucan” bird is not embodied in its label or even its name, but in the context and accompanying referential information that place the given referent into a context that can be communicated and understood. A URI points to (“refers to”) something that causes us to conjure up an understanding of that thing, be it a general description of a toucan, a picture of a toucan, an understanding of a species of toucan, or a specific toucan bird. Our understanding or interpretation results from the context and surrounding information accompanying the reference.
In other words, a “resource” may be anything, which is just the way the W3C has defined it. There is not a single dimension which, magically, like “information” and “non-information,” can cleanly and definitely place a referent into some state of absolute understanding. To assert that such magic distinctions exist is a flaw of Cartesian logic, which can only be reconciled by looking to more defensible bases in logic.
The logic behind these distinctions and nuances leads us to Charles Sanders Peirce (1839 – 1914). Peirce (pronounced “purse”) was an American logician, philosopher and polymath of the first rank. Along with Frege, he is acknowledged as the father of predicate calculus and the notation system that formed the basis of first-order logic. His symbology and approach arguably provide the logical basis for description logics and other aspects underlying the semantic Web building blocks of the RDF data model and, eventually, the OWL language. Peirce is the acknowledged founder of pragmatism, the philosophy of linking practice and theory in a process akin to the scientific method. He was also the first formulator of existential graphs, an essential basis to the whole field now known as model theory. Though often overlooked in the 20th century, Peirce has lately been enjoying a renaissance with his voluminous writings still being deciphered and published.
The core of Peirce’s world view is based in semiotics, the study and logic of signs. In his seminal writing on this, “What is in a Sign?”, he wrote that “every intellectual operation involves a triad of symbols” and “all reasoning is an interpretation of signs of some kind”. Peirce had a predilection for expressing his ideas in “threes” throughout his writings.
Semiotics is often split into three branches: 1) syntactics – relations among signs in formal structures; 2) semantics – relations between signs and the things to which they refer; and 3) pragmatics – relations between signs and the effects they have on the people or agents who use them.
Peirce’s logic of signs in fact is a taxonomy of sign relations, in which signs get reified and expanded via still further signs, ultimately leading to communication, understanding and an approximation of “canonical” truth. Peirce saw the scientific method as itself an example of this process.
A given sign is a representation amongst the triad of the sign itself (which Peirce called a representamen, the actual signifying item that stands in a well-defined kind of relation to the two other things), its object and its interpretant. The object is the actual thing itself. The interpretant is how the agent or the perceiver of the sign understands and interprets the sign. Depending on the context and use, a sign (or representamen) may be either an icon (a likeness), an indicator or index (a pointer or physical linkage to the object) or a symbol (understood convention that represents the object, such as a word or other meaningful signifier).
An interpretant in its barest form is a sign’s meaning, implication, or ramification. For a sign to be effective, it must represent an object in such a way that it is understood and used again. This makes the assignment and use of signs a community process of understanding and acceptance, as well as a truth-verifying exercise of testing and confirming accepted associations.
John Sowa has done much to help make some of Peirce’s obscure language and terminology more accessible to lay readers. He has expressed Peirce’s basic triad of sign relations as follows, based around the Yojo animist cat figure used by the character Queequeg in Herman Melville’s Moby-Dick:
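One way to picture the triad is as a small data structure. The Python sketch below is only an informal modeling aid, not a formalization from Peirce; the class and field names are assumptions, and the example reuses the “Coca-Cola” label discussed earlier.

```python
from dataclasses import dataclass
from enum import Enum

class SignKind(Enum):
    ICON = "icon"      # a likeness of the object
    INDEX = "index"    # a pointer or physical linkage to the object
    SYMBOL = "symbol"  # an understood convention, such as a word

@dataclass(frozen=True)
class Sign:
    representamen: str  # the signifying item itself
    obj: str            # the actual thing the sign stands for
    interpretant: str   # how the perceiver understands the sign
    kind: SignKind

# Hypothetical example: the same representamen can take part in different
# triads, depending on the object and on how the perceiver interprets it.
beverage = Sign("Coca-Cola", "the beverage", "a sweet carbonated drink", SignKind.SYMBOL)
company = Sign("Coca-Cola", "The Coca-Cola Company", "a beverage corporation", SignKind.SYMBOL)

print(beverage.representamen == company.representamen)  # same label
print(beverage == company)                               # different signs
```

The label alone does not settle which triad is meant; the object and interpretant do, and those come from context.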
In this figure, object and symbol are the same as the Peirce triad; concept is the interpretant in this case. The use of the word ‘Yojo’ conjures the concept of cat.
This basic triad representation has been used in many contexts, with various replacements or terms at the nodes. Its basic form is known as the Meaning Triangle, as popularized by Ogden and Richards in 1923.
The key aspect of signs for Peirce, though, is the ongoing process of interpretation and reference to further signs, a process he called semiosis. A sign of an object leads to interpretants, which, as signs, then lead to further interpretants. In the Sowa example below, we show how meaning triangles can be linked to one another, in this case by abstracting that the triangles themselves are concepts of representation; we can abstract the ideas of both concept and symbol:
We can apply this same cascade of interpretation to the idea of the sign (or representamen), which in this case shows that a name can be related to a word symbol, which in itself is a combination of characters in a string called ‘Yojo’:
According to Sowa:
Catherine Legg notes that the semiotic process is really one of community involvement and consensus. Each understanding of a sign and each subsequent interpretation helps the community arrive at a consensus on what a sign means. It is a way of building a shared understanding that aids communication and effective interpretation. In Peirce’s own writings, the process of interpretation can lead to validation and an eventual “canonical” or normative interpretation. The scientific method itself is an extreme form of the semiotic process, leading ultimately to what might be called accepted “truths”.
So, how do Peircean semiotics help inform us about the role and use of URIs? Does this logic help provide guidance on the “identity crisis”?
The Peircean taxonomy of signs has three levels with three possible sign roles at each level, leading to a possible 27 combinations of sign representations. However, because not all sign roles are applicable at all levels, Peirce actually postulated only ten distinct sign representations.
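The reduction from 27 combinations to ten can be checked by simple enumeration. The sketch below assumes the usual reading of Peirce's classification, which this article does not spell out: each of the three levels takes one of three categories (1, 2 or 3), and a later level may never take a higher category than the level before it. The level labels and the function name are choices made for this illustration only.

```python
from itertools import product

# Assumed labels for the three levels and their three roles, following the
# usual presentation of Peirce's trichotomies (not given in this article).
SIGN_ITSELF = {1: "qualisign", 2: "sinsign", 3: "legisign"}
RELATION_TO_OBJECT = {1: "icon", 2: "index", 3: "symbol"}
RELATION_TO_INTERPRETANT = {1: "rheme", 2: "dicent", 3: "argument"}


def sign_classes():
    """Return (all combinations, combinations that satisfy the ordering rule)."""
    all_combos = list(product((1, 2, 3), repeat=3))
    # Assumed rule: the category may not increase from one level to the next.
    valid = [(s, o, i) for (s, o, i) in all_combos if s >= o >= i]
    return all_combos, valid


all_combos, valid = sign_classes()
print(len(all_combos), "possible combinations")
print(len(valid), "valid sign classes")
for s, o, i in valid:
    print(RELATION_TO_INTERPRETANT[i], RELATION_TO_OBJECT[o], SIGN_ITSELF[s])
```

Under that ordering rule, a sign at the lowest level can only be an icon, which is why most of the 27 nominal combinations drop out.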
Common to all roles, the URI “sign” is best seen as an index: the URI is a pointer to a representation of some form, be it electronic or otherwise. This representation bears a relation to the actual thing that this referent represents, as is true for all triadic sign relationships. However, in some contexts, again in keeping with additional signs interpreting signs in other roles, the URI “sign” may also play the role of a symbolic “name” or even of a signal that the resource can be downloaded or accessed in electronic form. In other words, by virtue of the conventions that we choose to assign to our signs, we can supply additional information that augments our understanding of what the URI is, what it means, and how it is accessed.
Of course, in these regards, a URI is no different than any other sign in the Peircean world view: it must reside in a triadic relationship to its actual object and an interpretation of that object, with further understanding only coming about by the addition of further signs and interpretations.
In shortened form, this means that a URI, acting alone, can at most play the role of a pointer between an object and its referent. A URI alone, without further signs (information), can not inform us well about names or even what type of resource may be at hand. For these interpretations to be reliable, more information must be layered on, either by accepted convention of the current signs or the addition of still further signs and their interpretations. Since the attempts to deal with the nature of a URI resource by fiat, as stipulated by httpRange-14, meet neither the standard of consensus nor that of empirical validity, the attempt can not by definition become “canonical”. This does not mean that httpRange-14 and its recommended practices can not help in providing more information and aiding interpretation for what the nature of a resource may be. But it does mean that httpRange-14 acting alone is insufficient to resolve ambiguity.
Moreover, what we see in the general nature of Peirce’s logic of signs is the usefulness of adding more “triads” of representation as the process to increase understanding through further interpretation. Kind of sounds like adding on more RDF triples, does it not?
Names, references, identity and meaning are not absolutes. They are not philosophically, and they are not in human language. To expect machine communications to hold to different standards and laws than human communications is naive. To effect machine communications our challenge is not to devise new rules, but to observe and apply the best rules and practices that human communications instruct.
There has been an unstated hope at the heart of the semantic Web enterprise that simply expressing statements in the right way (syntax) and in the right form (RDF) is sufficient to facilitate machine communications. But this hope, too, is naive and silly. Just as we do not accept all human utterances as truth, neither will we accept all machine transmissions as reliable. Some of the information will be posted in error; some will be wrong or ill-fitting to our world view; some will be malicious or intended to deceive. Spam and occasionally lousy search results on the Web tell us that Web documents are subject to these sources of unsuitability. Why would the same not be true of data?
Thus, global data access via the semantic Web is not, and can never be, either indiscriminate or unambiguous. We need to understand and come to trust sources and provenance; we need interpretation and context to decide appropriateness and validity; and we need testing and validation to ensure messages as received are indeed correct. Humans need to do these things in their normal courses of interaction and communication; our machine systems will need to do the same.
These confirmations and decisions as to whether the information we receive is actionable or not will come about via still more information. Some of this information may come about via shared conventions. But most will come about because we choose to provide more context and interpretation for the core messages we hope to communicate.
Nearly five years ago Hayes and Halpin put forth a proposal to add ex:describedBy to the standard RDF vocabulary as a way for authors to provide context and explanation for what constituted a specific RDF resource. In various ways, many of the other individuals cited in this article have come to similar conclusions. The simple redirect suggestions of both Ian Davis and Ed Summers appear particularly helpful.
Over time, we will likely need further representations about resources regarding such things as source, provenance, context and other interpretations that would help remove ambiguities as to how the information provided by that resource should be consumed or used. These additional interpretations can mechanically be provided via referenced ontologies or embedded RDFa (or similar). These additional interpretations can also be aided by judicious, limited additions of new predicates to basic language specifications for RDF (such as the Hayes and Halpin suggestions).
In the end, of course, any frameworks that achieve consensus and become widely adopted will be simple to use, easy to understand, and straightforward to deploy. The beauty of best practices in predicates and annotations is that failures to provide are easy to test. Parties that wish to have their data consumed have incentive to provide sufficient information so as to enable interpretation.
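The kind of test meant here can be sketched in a few lines of plain Python, with triples held as tuples. Only ex:describedBy comes from the Hayes and Halpin proposal mentioned above; the example URIs, the source and license predicates, and the helper name are hypothetical and stand in for whatever conventions a community might actually adopt.

```python
# Triples as (subject, predicate, object) tuples; every URI is a made-up example.
EX = "http://example.org/"

# Statements a publisher has layered on to help consumers interpret a resource.
triples = [
    (EX + "yojo", EX + "describedBy", EX + "yojo.rdf"),  # proposed predicate
    (EX + "yojo", EX + "source", EX + "moby-dick"),      # hypothetical predicate
]


def missing_context(triples, subject, required):
    """List the required predicates that have no statement for the subject."""
    present = {p for (s, p, _) in triples if s == subject}
    return [p for p in required if p not in present]


# Hypothetical best-practice checklist a consumer might apply.
required = [EX + "describedBy", EX + "source", EX + "license"]
print(missing_context(triples, EX + "yojo", required))
```

A consumer can run such a check before deciding whether a resource carries enough context to be trusted; here the only gap reported is the missing license statement.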
There is absolutely no reason that these additions can not co-exist with the current httpRange-14 approach. By adding a few other options and making clear the optional use of httpRange-14, we would be very Peirce-like in our go-forward approach: we remain pragmatic while adding more means to improve our interpretations of what a Web resource is and is meant to be.
See the 2007 thread on this issue, mostly by Sean Palmer and Noah Mendelsohn, the latter acknowledging that various experts may only agree on 85% of the items.
In my opinion, perhaps the most important event for the structured Web since RDF was released a dozen years ago was today’s joint announcement by the search engine triumvirate of Google, Bing and Yahoo! releasing Schema.org. Schema.org is a vendor specification for nearly 300 mini-schema (or structured record definitions) that can be used to tag information in Web pages. These schema are organized into a clean little hierarchy and cover many of the leading things — from organizations to people to products and creative works — that can be written about and characterized on the Web.
These schema specifications are based on the microdata standard presently under review as part of the pending HTML5 specification. Microdata are set record descriptions of key-value pair attributes that can be embedded into the HTML Web page language. These microdata schema are similar to microformats, but broader in coverage and extensible. Microdata is also simpler than RDFa, another W3C specification that the Schema.org organizers call “. . . extensible and very expressive, but the substantial complexity of the language has contributed to slower adoption.”
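To show what such embedded key-value pairs look like in practice, here is a minimal sketch that reads them back out of a page using only Python's standard library. The attribute names itemscope, itemtype and itemprop belong to microdata; the sample markup, the class name and the restriction to a single, un-nested item are simplifying assumptions for illustration, not a full microdata parser.

```python
from html.parser import HTMLParser

# Hypothetical page fragment tagged with a Schema.org type.
SAMPLE = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Charles Sanders Peirce</span>
  <span itemprop="jobTitle">Logician</span>
</div>
"""


class FlatMicrodataParser(HTMLParser):
    """Collect itemprop key-value pairs from a single, un-nested item."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._current_prop = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.item_type = attrs.get("itemtype")
        # Remember which property the next text node belongs to, if any.
        self._current_prop = attrs.get("itemprop")

    def handle_data(self, data):
        text = data.strip()
        if self._current_prop and text:
            self.properties[self._current_prop] = text
            self._current_prop = None

    def handle_endtag(self, tag):
        self._current_prop = None


parser = FlatMicrodataParser()
parser.feed(SAMPLE)
print(parser.item_type)
print(parser.properties)
```

The record that comes out is a plain dictionary with an item type attached, which is about as simple as structured data gets.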
Various forums have been alive with howls and questions from many RDF and RDFa advocates that this initiative negates years of effort behind those formats. Yet I and my company, Structured Dynamics, which base our entire technology approach on semantics and RDF, do not see this announcement as a threat or rejection. What gives; what is the difference in perspective?
In our view, RDF, with the triple representation at the heart of its data model, is the simplest and most expressive means to represent any data or any data relationship. As such, RDF, and its language extensions such as OWL and ontologies, provide a robust and flexible canonical data model for capturing any extant data or schema. No matter what the native form of the source information, we can boil it down to RDF and inter-relate it to any other information. It is for these reasons (and others) we have frequently termed RDF as the universal data solvent.
But, simple records and simple data need not be encumbered with the complexity of RDF. We have long argued for the importance of naive data structs. Many of these are simple key-value pairs where the subject is implied. The so-called little structured data records in Wikipedia, called infoboxes, are of this form. JSON and many other simple data formats are likewise cleaner and easier to write.
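As a rough illustration of how such a record relates to the canonical model, the sketch below expands a key-value record with an implied subject into subject-predicate-object triples. The function name, the subject URI and the vocabulary prefix are all hypothetical.

```python
def record_to_triples(subject, record, vocab):
    """Expand a key-value record into (subject, predicate, object) triples."""
    return [(subject, vocab + key, value) for key, value in record.items()]


# An infobox-like record: the subject is implied by where the record appears.
infobox = {"name": "Herman Melville", "notableWork": "Moby-Dick"}

triples = record_to_triples(
    subject="http://example.org/Herman_Melville",  # hypothetical URI
    record=infobox,
    vocab="http://example.org/vocab/",             # hypothetical vocabulary
)
for triple in triples:
    print(triple)
```

The author of the record never has to see the triples; the expansion can happen later, at the point where data from different sources needs to be combined.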
The basic fact that RDF provides a universal data model for any kind of native data does not necessarily translate into its use as the actual data exchange format. Rather, winning data exchange formats are those that can be easily understood, easily expressed and therefore widely used. I think there is a real prospect that microdata, ready for ingest and expression by the Web’s leading search engines, may represent a real sea change in the availability and expression of structured data on the Web.
More structure — not less — is the real fuel that will promote greater adoption of RDF when it comes time to interoperate that data. The RDF community should rejoice that more structure will be coming to the Web from Google et al.’s announcement. We should also soon see an explosion of tools and utilities and services that make it easy to automatically add such structure to Web pages via single clicks. Then, once this structure is available, watch out!
So, while the backers of Schema.org also announced their continued support for microformats and RDFa as they presently exist, I rather suspect today’s announcement represents a denouement for these alternative formats. Though these formats may be creatively destroyed, I think the effect on RDF itself will be a profound and significant boost. I foresee clarity coming to the marketplace regarding RDF’s role: as a canonical means for expressing data of any form, and not necessarily as a data exchange format.
This initiative, led by Google, should be no surprise. Google is the registered agent for the Schema.org Web site and has been the key proponent of microdata via its support of Ian Hickson in the WhatWG and HTML5 work groups. As I stated a couple of years back, Google has also not hidden its interests in structured data. Practically daily we see more structured data appear in Google search results and it has maintained a very active program in structured data extraction from text and tables for some years.
Google and its search engine partners recognize that search needs are evolving from keyword retrievals to structure, relationships, and filtered, targeted results. Those advances come from structure, as well as from the semantic relationships between things that something like Schema.org begins to represent.
Many within the W3C and elsewhere questioned why Google was pushing microdata when there were competing options such as microformats or RDFa (or even earlier variants). As with Microsoft a decade earlier, some attributed Google’s microdata advocacy to commercial interests or to clout in advertising alone. Of course Google has an economic interest in the growth and usefulness of the Web. But I do not believe its advocacy to be premised on clout or “my way or the highway.”
Google and the search engine triumvirate understand well — much better than many of the researchers and academics that dominate mailing list discussions — that use and adoption trump elegance and sophistication. When one deconstructs the design of microdata and the nearly 300 schema now released behind it, I think the pragmatic observer can only come to one conclusion: Job well done!
I have been a fervent RDF advocate for nearly a decade and have also been a vocal proponent of the structured Web as a necessary stepping stone to the semantic Web. In fact, here is a repeat of a diagram I have used many times over the past 5 years:
[Diagram: the Web's evolution from the Document Web through the Structured Web toward the semantic Web]
When one looks at the schema of schema that accompany today’s announcement, it is really clear just how encompassing and important these instant standards will become:
Today’s announcement is the best news I have heard in years regarding the structured Web, RDF, and the semantic Web. This announcement is — I believe — the signal event of the structured Web. With regard to my longstanding diagram above, I can go to bed tonight knowing we have now crossed the threshold into the semantic Web.