Posted: May 28, 2012

Mandelbrot animation based on a static number of iterations per pixel; from Wikipedia

Insights from Commonalities and Theory

One of the main reasons I am such a big fan of RDF as a canonical data model is its ability to capture information in structured, semi-structured and unstructured form [1]. These sources are conventionally defined as:

  • Structured data — information presented according to a defined data model, often found in relational databases or other forms of tabular data
  • Semi-structured data — does not conform with the formal structure of data models, but contains tags or other markers to denote fields within the content. Markup languages embedded in text are a common form of such sources
  • Unstructured data — information content, generally oriented to text, that lacks an explicit data model or schema; structured information can be obtained from it via data mining or information extraction.

A major trend I have written about for some time is the emergence of the structured Web: that is, the exposing of structure from these varied sources in order for more information to be interconnected and made interoperable. I have posited — really a view shared by many — that the structured Web is an intermediate point in the evolution of the Web from one of documents to one where meaningful semantics occurs [2].

It is clear in my writings — indeed in the very name of my company, Structured Dynamics — that structure plays a major role in our thinking. The use of and reliance on this term, though, raises the question: just what is structure in an informational sense? We’ll find it helpful to get at the question of What is structure? from first principles. And this, in turn, may also provide insight into how structure and information are in fact inextricably entwined.

A General Definition of Structure

According to Wikipedia, structure is a fundamental notion, of tangible or intangible character, that refers to the recognition, observation, nature, or permanence of patterns and relationships of entities. The concept may refer to an object, such as a built structure, or an attribute, such as the structure of society.

Koch curve; example of snowflake fractal; from Wikipedia

Structure may be abstract, or it may be concrete. Its realm ranges from the physical to ideas and concepts. As a term, “structure” seems to be ubiquitous to every domain. Structure may be found across every conceivable scale, from the most minute and minuscule to the cosmic. Even realms without any physical aspect at all — such as ideas and languages and beliefs — are perceived by many to have structure. We apply the term to any circumstance in which things are arranged or connected to one another, as a means to describe the organization or relationships of things. We seem to know structure when we see it, and to be able to discern structure of very many kinds against unstructured or random backgrounds.

In this way structure quite resembles patterns, perhaps could even be used synonymously with that term. Other closely related concepts include order, organization, design or form. When expressed, structure, particularly that of a recognizably ordered or symmetrical nature, is often called beautiful.

One aspect of structure, I think, that provides the key to its roles and importance is that it can be expressed in shortened form as a mathematical statement. One could even be so bold as to say that mathematics is the language of structure. This observation is one of the threads that will help us tie structure to information.

The Patterned Basis of Nature

The natural world is replete with structure. Patterns in nature are regularities of visual form found in the natural world. Each such pattern can be modeled mathematically. Typical mathematical forms in nature include fractals, spirals, flows, waves, lattices, arrays, Golden ratios, tilings, Fibonacci sequences, and power laws. We see them in such structures as clouds, trees, leaves, river networks, fault lines, mountain ranges, craters, animal spots and stripes, shells, lightning bolts, coastlines, flowers, fruits, skeletons, cracks, growth rings, heartbeats and rates, earthquakes, veining, snow flakes, crystals, blood and pulmonary vessels, ocean waves, turbulence, bee hives, dunes and DNA.

Self-similarity in a Mandelbrot set; from Wikipedia

The mathematical expression of structures in nature is frequently repetitive or recursive, often in a self-organizing manner. The swirls of a snail’s shell reflect a Fibonacci sequence, while natural landscapes and lifeforms often have a fractal aspect (as expressed by some of the figures in this article). Fractals are typically self-similar patterns, generally involving some fractional or ratioed formula that is recursively applied. Another way to define a fractal is as a detailed pattern that repeats itself.
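As a small illustration of how such natural structure reduces to mathematics, the ratios of successive Fibonacci numbers converge on the golden ratio that underlies many spiral forms. This Python sketch is my own illustration of the convergence, not a claim about any particular organism:

```python
def fib(n):
    """Return the first n Fibonacci numbers (illustrative sketch)."""
    seq = [1, 1]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2])
    return seq[:n]

seq = fib(20)
ratios = [b / a for a, b in zip(seq, seq[1:])]
phi = (1 + 5 ** 0.5) / 2  # golden ratio, ~1.6180339887

# After only ~20 terms the ratio of successive terms matches phi to 6+ decimals
print(round(ratios[-1], 6), round(phi, 6))  # → 1.618034 1.618034
```

The recursion is the whole "message": two seed values and one rule regenerate the entire sequence, which is exactly the shortened mathematical statement of structure the text describes.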

Even though these patterns can often be expressed simply and mathematically, and they often repeat themselves, their starting conditions can lead to tremendous variability and a lack of predictability. This makes them chaotic, as studied under chaos theory, though their patterns are often discernible.

While we certainly see randomness in statistics, quantum physics and Brownian motion, it is also striking how much of what gives nature its beauty is structure. Separate and apart from the random, there appears to be something in structure that guides the expression of nature and makes it so pleasing to behold. Self-similar and repeated structures across the widest variety of spatial scales seem to be an abiding aspect of nature.

Structure in Language

Such forms of repeated patterns or structure are also inherent in that most unique of human capabilities, language. As a symbolic species [3], we first used symbols as a way to represent the ideas of things. Simple markings, drawings and ideograms grew into more complicated structures such as alphabets and languages. The languages themselves came to embrace still further structure via sentence structures, document structures, and structures for organizing and categorizing multiple documents. In fact, one of the most popular aspects of this blog site is its Timeline of Information History — worth your look — that shows the progression of structure in information throughout human history.

Grammar is often understood as the rules or structure that governs language. It is composed of syntax, including punctuation, traditionally understood as the sentence structure of languages, and morphology, which is the structural understanding of a language’s linguistic units, such as words, affixes, parts of speech, intonation or context. There is a whole field of linguistic typology that studies and classifies languages according to their structural features. But grammar is hardly the limit to language structure.

Semantics, the meaning of language, used to be held separate from grammar or structure. But via the advent of the thesaurus, then linguistic databases such as WordNet, and more recently concept graphs that relate words and terms into connected understandings, we have come to see that semantics, too, has structure. Indeed, these very structural aspects are now opening up to us techniques and methods — generally classified under the heading of natural language processing (NLP) — for extracting meaningful structure from the very basis of written or spoken language.

It is the marriage of the computer with language that is illuminating these understandings of structure in language. And that opening, in turn, is enabling us to capture and model the basis of human language discourse in ways that can be codified, characterized, shared and analyzed. Machine learning and processing is now enabling us to complete the virtual circle of language. From its roots in symbols, we are now able to extract and understand those very same symbols in order to derive information and knowledge from our daily discourse. We are doing this by gleaning the structure of language, which in turn enables us to relate it to all other forms of structured information.

Common Threads Via Patterns

The continuation of structure from nature to language extends across all aspects of human endeavor. I remember excitedly describing to a colleague more than a decade ago what likely is a pedestrian observation: pattern matching is a common task in many fields. (I had observed that pattern matching in very different forms was a standard practice in most areas of industry and commerce.) My “insight” was that this commonality was not widely understood, which meant that pattern matching techniques in one field were not often exploited or seen as transferable to other domains.

In computer science, pattern matching is the act of checking some sequence of tokens for the presence of the constituents of some pattern. It is closely related to the idea of pattern recognition, which is the characterization of some discernible and repeated sequence. These techniques, as noted, are widely applied, with each field tending to have its own favorite algorithms. Common applications that one sees for such pattern-based calculations include communications [4], encoding and coding theory, file compression, data compression, machine learning, video compression, mathematics (including engineering and signal processing via such techniques as Fourier transforms), cryptography, NLP [5], speech recognition, image recognition, OCR, image analysis, search, sound cleaning (that is, noise reduction, as in Dolby systems) and gene sequence searching and alignment, among many others.
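To make the computer science definition concrete, here is a minimal and deliberately naive token-matching sketch; production systems use faster algorithms such as Knuth–Morris–Pratt or Boyer–Moore, but the task is the same:

```python
def find_pattern(tokens, pattern):
    """Return every index at which `pattern` occurs as a contiguous
    sub-sequence of `tokens` (naive scan, O(n*m))."""
    hits = []
    for i in range(len(tokens) - len(pattern) + 1):
        if tokens[i:i + len(pattern)] == pattern:
            hits.append(i)
    return hits

signal = list("ABABABCABAB")
print(find_pattern(signal, list("ABAB")))  # → [0, 2, 7]
```

Because the tokens can be characters, DNA bases, audio frames or network symbols, the same handful of constituent operations recurs across all the fields listed above.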

To better understand what is happening here and the commonalities, let’s look at the idea of compression. Data compression is valuable for transmitting any form of content in wired or wireless manners because we can transmit the same (or closely similar) message faster and with less bandwidth [6]. There are both lossless (no loss of information) and lossy compression methods. Lossless data compression algorithms usually exploit statistical redundancy — that is, a pattern match — to represent data more concisely without losing information. Redundancy in information theory is the number of bits used to transmit a message minus the number of bits of actual information in the message. Lossless compression is possible because most real-world data has statistical redundancy. In lossy data compression, some loss of information is accepted: detail is dropped from the data to save storage space. These methods are guided by research that indicates, say, how certain frequencies may not be heard or seen by people and can be removed from the source data.
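A quick experiment with Python's standard zlib module (a DEFLATE implementation, used here as a stand-in for lossless codecs generally) makes the point about statistical redundancy: the patterned message collapses, the random one barely shrinks at all, and the original is recovered bit-for-bit:

```python
import os
import zlib

patterned = b"the quick brown fox " * 500   # 10,000 bytes, heavily redundant
random_ish = os.urandom(10_000)             # 10,000 bytes, no exploitable pattern

c_pat = zlib.compress(patterned, 9)
c_rnd = zlib.compress(random_ish, 9)
print(len(c_pat), len(c_rnd))  # patterned stream shrinks dramatically; random does not

# Lossless: decompression reconstructs the original exactly
assert zlib.decompress(c_pat) == patterned
```

The compressed patterned stream is, in effect, the short mathematical statement of its structure; the random stream has no such statement and so must travel at nearly full length.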

On a different level, there is a close connection between machine learning and compression: a system that predicts the posterior probabilities of a sequence given its entire history can be used for optimal data compression (by using arithmetic coding on the output distribution), while an optimal compressor can be used for prediction (by finding the symbol that compresses best, given the previous history). This equivalence has been used as justification for data compression as a benchmark for “general intelligence.” On a still different level, one major part of cryptography is the exact opposite of these objectives: constructing messages that pattern matching fails against or is extremely costly or time-consuming to analyze.

When one stands back from any observable phenomena — be it natural or human communications — we can see that the “information” that is being conveyed often has patterns, recursion or other structure that enables it to be represented more simply and compactly in mathematical form. This brings me back to my two favorite protagonists in my recent writings — Claude Shannon and Charles S. Peirce.

Information is Structure

Claude Shannon’s seminal work in 1948 on information theory dealt with the amount of information that could be theoretically and predictably communicated between a sender and a receiver [7] [8]. No context or semantics were implied in this communication, only the amount of information (for which Shannon introduced the term “bits”) and what might be subject to losses (or uncertainty in the accurate communication of the message). In this regard, what Shannon called “information” is what we would best term “data” in today’s parlance.

A cellular automaton based on hexagonal cells instead of squares (rule 34/2); from Wikipedia

The context of Shannon’s paper and work by others preceding him was to understand information losses in communication systems or networks. Much of the impetus for this came about because of issues in wartime communications and early ciphers and cryptography and the emerging advent of digital computers. But the insights from Shannon’s paper also relate closely to the issues of data patterns and data compression.

A key measure of Shannon’s theory is what he referred to as information entropy, which is usually expressed by the average number of bits needed to store or communicate one symbol in a message. Entropy quantifies the uncertainty involved in predicting the value of a random variable. The Shannon entropy measure is actually a measure of the uncertainty based on the communication (transmittal) between a sender and a receiver; the actual information that gets transmitted and predictably received was formulated by Shannon as R, which can never be zero because all communication systems have losses.

A simple intuition can show how this formulation relates to patterns or data compression. Let’s take a message of completely random digits. In order to accurately communicate that message, all digits (bits) would have to be transmitted in their original state and form. Absolutely no compression of this message is possible. If, however, there are patterns within the message (which, of course, now ceases to make the message random), these can be represented algorithmically in shortened form, so that we only need communicate the algorithm and not the full bits of the original message. If this “compression” algorithm can then be used to reconstruct the bit stream of the original message, the data compression method is deemed to be lossless. The algorithm so derived is also the expression of the pattern that enabled us to compress the message in the first place (such as a*2+1).
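This intuition can be captured directly in Shannon's entropy formula, H = −Σ pᵢ log₂ pᵢ, the average number of bits needed per symbol. A small Python sketch (the example strings are my own illustrations) shows how patterned messages need fewer bits per symbol than varied ones:

```python
import math
from collections import Counter

def shannon_entropy(message):
    """Average bits needed per symbol:
    H = -sum(p_i * log2(p_i)) over the symbol frequencies."""
    counts = Counter(message)
    n = len(message)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(shannon_entropy("abababab"))  # → 1.0 (one bit resolves each symbol)
print(shannon_entropy("abcdefgh"))  # → 3.0 (eight equiprobable symbols)
print(shannon_entropy("aaaaaaab"))  # ~0.54 (mostly predictable, little uncertainty)
```

The more patterned (predictable) the message, the lower its entropy and the fewer bits a sender must actually transmit, which is precisely the compression argument above.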

We can apply this same type of intuition to human language. To improve communication efficiency, the most common words (e.g., “a”, “the”, “I”) should be shorter than less common words (e.g., “disambiguation”, “semantics”, “ontology”), so that sentences will not be too long. And so they are. This is a principle equivalent to that of data compression. In fact, such repeats and patterns apply to the natural world as well.
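The classic construction behind this principle is Huffman coding, which assigns shorter bit codes to more frequent symbols. This compact sketch computes only the code lengths, which is what matters for the argument:

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Return {symbol: code length in bits} from a Huffman tree
    built over symbol frequencies; frequent symbols get short codes."""
    counts = Counter(text)
    # Heap entries: (frequency, tiebreak id, {symbol: depth so far})
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

lengths = huffman_code_lengths("aaaaaaaabbbc")
print(lengths)  # 'a' (most frequent) gets a 1-bit code; 'b' and 'c' get 2 bits
```

Frequent English words behave like the frequent symbol here: the language has, in effect, already Huffman-coded itself.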

Shannon’s idea of information entropy has come to inform the even broader subject of entropy in physics and the 2nd Law of Thermodynamics [10]. According to Koelman, “the entropy of a physical system is the minimum number of bits you need to fully describe the detailed state of the system.” Very random (uncertain) states have high entropy, patterned ones low entropy. As I noted recently, in open systems, structures (patterns) are a means to speed the tendency to equilibrate across energy gradients [8]. This observation helps provide insight into structure in natural systems, and why life and human communications tend toward less randomness. Structure will always continue to emerge because it is adaptive to speed the deltas across these gradients; structure provides the fundamental commonality between biological information (life) and human information.

In the words of Thomas Schneider [11], “Information is always a measure of the decrease of uncertainty at a receiver.” Of course, in Shannon’s context, what is actually being measured here is data (or bits), not information embodying any semantic meaning or context. Thus, the terminology may not be accurate for discussing “information” in a contemporary sense. But it does show that “structure” — that is, the basis for shortening the length of a message while still retaining its accuracy — is information (in the Shannon context). In this information there is order or patterns, often of a hierarchical or fractal or graph nature. Any structure that emerges that is better able to reduce the energy gradient faster will be favored according to the 2nd Law.

Still More Structure Makes “Information” Information

The data that constitutes “information” in the Shannon sense still lacks context and meaning. In communications terms, it is data; it has not yet met the threshold of actionable information. It is in this next step that we can look to Charles Sanders Peirce (1839 – 1914) for guidance [9].

The core of Peirce’s world view is based in semiotics, the study and logic of signs. In his seminal writing on this, “What is in a Sign?” [12], he wrote that “every intellectual operation involves a triad of symbols” and “all reasoning is an interpretation of signs of some kind”. A sign of an object leads to interpretants, which, as signs, then lead to further interpretants. Peirce’s triadic logic of signs in fact is a taxonomy of sign relations, in which signs get reified and expanded via still further signs, ultimately leading to communication, understanding and an approximation of “canonical” truth. Peirce saw the scientific method as itself an ultimate example of this process. The key aspect of signs for Peirce is the ongoing process of interpretation and reference to further signs.

Information is structure, and structure is information.

Ideograms leading to characters, that get combined into sentences and other syntax, and then get embedded within contexts of shared meanings show how these structures compound themselves and lead to clearer understandings (that is, accurate messages) in the context of human languages. While the Shannon understanding of “information” lacked context and meaning, we can see how still higher-order structures may be imposed through these reifications of symbols and signs that improve the accuracy and efficiency of our messages. Though Peirce did not couch his theory of semiosis in terms of structure or information, we can see it as a natural extension of the same structural premises in Shannon’s formulation.

In fact, today, we now see the “structure” in the semantic relationships of language through the graph structures of ontologies and linguistic databases such as WordNet. The understanding and explication of these structures are having a similarly beneficial effect on how still more refined and contextual messages can be composed, transmitted and received. Human-to-machine communications is (merely!) the challenge of codifying and making explicit the implicit structures in our languages.
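The graph structures being referred to can be pictured with a toy "is-a" (hypernym) chain in the spirit of WordNet; the words and links below are illustrative stand-ins of my own, not WordNet's actual data:

```python
# A toy concept graph: each word points to its hypernym (its "is-a" parent).
hypernyms = {
    "dog": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
    "animal": "entity",
}

def ancestors(word):
    """Walk the is-a chain from a word up to the root concept."""
    chain = []
    while word in hypernyms:
        word = hypernyms[word]
        chain.append(word)
    return chain

print(ancestors("dog"))  # → ['canine', 'mammal', 'animal', 'entity']
```

Even this trivial structure lets a machine infer that a dog and a cat share the concept "mammal" — a semantic relationship made explicit and computable.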

The Peircean ideas of interpretation (context) and of compounding and reifying structure are a major intellectual breakthrough for extending the Shannon “information” theory to information in the modern sense. These insights also occur within a testable logic for how things and the names of things can be understood and related to one another, via logical statements or structures. These, in turn, can be symbolized and formalized into logical constructs that can capture the structure of natural language as well as more structured data (or even nature, as some of the earlier Peirce speculation asserts [13]).

According to this interpretation of Peirce, the nature of information is the process of communicating a form from the object to the interpretant through the sign [14]. The clarity of Peirce’s logic of signs is an underlying factor, I believe, for why we are finally seeing our way clear to how to capture, represent and relate information from a diversity of sources and viewpoints that is defensible and interoperable.

Structure is Information

Common to all of these perspectives — from patterns to nature and on to life and then animal and human communications — we see that structure is information. Human artifacts and technology, though not “messages” in a conventional sense, also embody the information of how they are built within their structures [15]. We also see the interplay of patterns and information in many processes of the natural world [16] from complexity theory, to emergence, to autopoiesis, and on to autocatalysis, self-organization, stratification and cellular automata [17]. Structure in its many guises is ubiquitous.

Joanna Krupa, a Polish-American model and actress; from the article on Beauty in Wikipedia

We, as beings who can symbolically record our perceptions, seem to innately recognize patterns. We see beauty in symmetry. Bilateral symmetry seems to be deeply ingrained in the inherent perception by humans of the likely health or fitness of other living creatures. We see beauty in the patterned, repeated variability of nature. We see beauty in the clever turn of phrase, or in music, or art, or the expressiveness of a mathematical formulation.

We also seem to recognize beauty in the simple. Seemingly complex bit streams that can be reduced to a short algorithmic expression are always viewed as more elegant than lengthier, more complex alternatives. The simple laws of motion and Newtonian physics fit this pattern, as does Einstein’s E = mc². This preference for the simple is a preference for the greater adaptiveness of the shorter, more universal pattern to messages, an insight indicated by Shannon’s information theory.

In the more prosaic terms of my vocation in the Web and information technology, these insights point to the importance of finding and deriving structured representations of information — including meaning (semantics) — that can be simply expressed and efficiently conveyed. Building upon the accretions of structure in human and computer languages, the semantic Web and semantic technologies offer just such a prospect. These insights provide a guidepost for how and where to look for the next structural innovations. We find them in the algorithms of nature and language, and in making connections that provide the basis for still more structure and patterned commonalities.

Ideas and algorithms around lossless compression, graph theory and network analysis are, I believe, the next fruitful hunting grounds for finding still higher orders of structure that can be simply expressed. The patterns of nature, which have emerged incrementally and randomly over the eons of cosmological time, look to be an excellent laboratory.

So, as we see across examples from nature and life to language and all manner of communications, information is structure and structure is information. And it is simply beautiful.


[1]  I discuss this advantage, among others, in M. K. Bergman, 2009. “Advantages and Myths of RDF,” AI3:::Adaptive Innovation blog, April 8, 2009. See http://www.mkbergman.com/483/advantages-and-myths-of-rdf/.
[2] The structured Web is object-level data within Internet documents and databases that can be extracted, converted from available forms, represented in standard ways, shared, re-purposed, combined, viewed, analyzed and qualified without respect to originating form or provenance. See further M. K. Bergman, 2007. “What is the Structured Web?,” AI3:::Adaptive Innovation blog, July 18, 2007. See http://www.mkbergman.com/390/what-is-the-structured-web/. Also, for a diagram of the evolution of the Web, see M. K. Bergman, 2007. “More Structure, More Terminology and (hopefully) More Clarity,” AI3:::Adaptive Innovation blog, July 22, 2007. See http://www.mkbergman.com/391/more-structure-more-terminology-and-hopefully-more-clarity/.
[3] Terrence W. Deacon, 1997. The Symbolic Species: The Co-Evolution of Language and the Brain, W. W. Norton & Company, July 1997, 527 pp. (ISBN-10: 0393038386)
[4] Communications is a particularly rich domain with techniques such as the Viterbi algorithm, which has found universal application in decoding the convolutional codes used in both CDMA and GSM digital cellular, dial-up modems, satellite, deep-space communications, and 802.11 wireless LANs.
[5] Notable areas in natural language processing (NLP) that rely on pattern-based algorithms include classification, clustering, summarization, disambiguation, information extraction and machine translation.
[6] To see some of the associated compression algorithms, there is a massive list of “codecs” (compression/decompression) techniques available; fractal compression is one.
[7] Claude E. Shannon, 1948. “A Mathematical Theory of Communication”, Bell System Technical Journal, 27: 379–423, 623–656, July and October 1948. See http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf.
[8] I first raised the relation of Shannon’s paper to data patterns — but did not discuss it further awaiting this current article — in M. K. Bergman, 2012. “The Trouble with Memes,” AI3:::Adaptive Innovation blog, April 4, 2012. See http://www.mkbergman.com/1004/the-trouble-with-memes/.
[9] I first overviewed Peirce’s relation to information messaging in M. K. Bergman, 2012. “Give Me a Sign: What Do Things Mean on the Semantic Web?,” AI3:::Adaptive Innovation blog, January 24, 2012. See http://www.mkbergman.com/994/give-me-a-sign-what-do-things-mean-on-the-semantic-web/. Peirce had a predilection for expressing his ideas in “threes” throughout his writings.
[10] For a very attainable lay description, see Johannes Koelman, 2012. “What Is Entropy?,” in Science 2.0 blog, May 5, 2012. See http://www.science20.com/hammock_physicist/what_entropy-89730.
[11] See Thomas D. Schneider, 2012. “Information Is Not Entropy, Information Is Not Uncertainty!,” Web page retrieved April 4, 2012; see http://www.lecb.ncifcrf.gov/~toms/information.is.not.uncertainty.html.
[12] Charles Sanders Peirce, 1894. “What is in a Sign?”, see http://www.iupui.edu/~peirce/ep/ep2/ep2book/ch02/ep2ch2.htm.
[13] It is somewhat amazing, more than a half century before Shannon, that Peirce himself considered “quasi-minds” such as crystals and bees to be sufficient as the interpreters of signs. See Commens Dictionary of Peirce’s Terms (CDPT), Peirce’s own definitions, and the entry on “quasi-minds”; see http://www.helsinki.fi/science/commens/terms/quasimind.html.
[14] João Queiroz, Claus Emmeche and Charbel Niño El-Hania, 2005. “Information and Semiosis in Living Systems: A Semiotic Approach,” in SEED 2005, Vol. 1. pp 60-90; see http://www.library.utoronto.ca/see/SEED/Vol5-1/Queiroz_Emmeche_El-Hani.pdf.
[15] Kevin Kelly has written most broadly about this in his book, What Technology Wants, Viking/Penguin, October 2010. For a brief introduction, see Kevin Kelly, 2006. “The Seventh Kingdom,” in The Technium blog, February 1, 2006. See http://www.kk.org/thetechnium/archives/2006/02/the_seventh_kin.php.
[16] For a broad overview, see John Cleveland, 1994. “Complexity Theory: Basic Concepts and Application to Systems Thinking,” Innovation Network for Communities, March 27, 1994. May be obtained at http://www.slideshare.net/johncleveland/complexity-theory-basic-concepts.
[17] For more on cellular automata, see Stephen Wolfram, 2002. A New Kind of Science, Wolfram Media, Inc., May 14, 2002, 1197 pp. ISBN 1-57955-008-8.

Posted by AI3's author, Mike Bergman Posted on May 28, 2012 at 10:24 pm in Adaptive Information, Structured Web | Comments (5)
The URI link reference to this post is: http://www.mkbergman.com/1011/what-is-structure/
Posted: May 21, 2012

UMBEL Big Graph

Modularization Also Leads to Big Graph Visualization

We are pleased to announce the release of version 1.05 of UMBEL, which now has linkages to schema.org [6] and GeoNames [1]. UMBEL has also been split into ‘core’ and ‘geo’ modules. The resulting smaller size of UMBEL ‘core’ — now some 26,000 reference concepts — has also enabled us to create a full visualization of UMBEL’s content graph.

The first notable change in UMBEL v. 1.05 is its mapping to schema.org. schema.org is a collection of schema (usable as HTML tags) that webmasters can use to mark up their pages in ways recognized by major search providers. schema.org was first developed and organized by the major search engines Bing, Google and Yahoo!; later, Yandex joined as a sponsor. Now many groups are supporting schema.org and contributing vocabularies and schema.

I was one of the first to hail schema.org hours after its announcement [7]. It seemed only fair that we put our money where our mouth is and map UMBEL to it as well.

The UMBEL–schema.org mapping was done manually. First, the current UMBEL concept base was searched and inspected for appropriate matches. If that step failed to find a fairly direct correspondence between existing UMBEL concepts and the types in schema.org, the OpenCyc source concepts were inspected in a similar manner. Failing a match from either of these two sources, a new concept was added to UMBEL ‘core’ and appropriately placed into the UMBEL reference concept subject structure.

The net result of this process was to add 298 mapped schema.org types to UMBEL. The mapping required a further three concepts from OpenCyc, plus 78 new reference concepts, to be added to UMBEL. Along with the new updates to UMBEL and its mappings, the section of Key Files below provides further explanatory links. We are reserving the addition of schema.org properties for a later time, when we plan to re-organize the Attributes SuperType within UMBEL.
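The three-step fallback just described can be sketched as a simple cascade. The function and the sample sets below are hypothetical illustrations of the workflow, not UMBEL's actual tooling:

```python
def map_schema_org_type(schema_type, umbel_concepts, opencyc_concepts):
    """Try UMBEL first, then OpenCyc, else mint a new 'core' concept
    (hypothetical sketch of the manual mapping cascade)."""
    if schema_type in umbel_concepts:
        return ("umbel", schema_type)
    if schema_type in opencyc_concepts:
        return ("opencyc-import", schema_type)   # brought into UMBEL from OpenCyc
    return ("new-concept", schema_type)          # added and placed in the subject structure

# Illustrative stand-in concept sets, not the real vocabularies
umbel = {"Person", "Place", "Event"}
opencyc = {"Recipe"}
print(map_schema_org_type("Person", umbel, opencyc))  # → ('umbel', 'Person')
print(map_schema_org_type("Recipe", umbel, opencyc))  # → ('opencyc-import', 'Recipe')
print(map_schema_org_type("Offer", umbel, opencyc))   # → ('new-concept', 'Offer')
```

The cascade explains the arithmetic above: most of the 298 types matched existing concepts, a few fell through to OpenCyc, and the remainder became new reference concepts.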

Modularization of the UMBEL Vocabulary

Even in the early development of UMBEL there was tension about the scope and level of geographic information to include in its concept base. The initial decision was to support concepts for countries, for the provinces and states of leading countries, and for some leading cities. This decision was in the spirit of a general reference structure, but still felt arbitrary.

GeoNames is devoted to geographical information and concepts — both natural and human artifacts — and has become the go-to resource for geo-locational information. The decision was thus made to split out the initial geo-locational information in UMBEL and replace it with mappings to GeoNames. This decision also had the advantage of beginning a process of modularization of UMBEL.

UMBEL Vocabulary and Reference Concept Ontology

Two sets of reference concepts were identified as useful for splitting out from the ‘core’ UMBEL in a geo-locational aspect:

  1. Geopolitical places and places of human activities and facilities
  2. Natural geographical places and features.

These removed concepts were then placed into a separate ‘geo’ module of UMBEL, including all existing annotations and relations, resulting in a module of 1,854 concepts. That left 26,046 concepts in UMBEL ‘core’. Because of some shared parent concepts, there is some minor overlap between the two modules. These are now the modular splits in UMBEL version 1.05.

Mapping to GeoNames

GeoNames has a different structure from UMBEL. It has few classes and distinguishes its geographic information on the basis of some 671 feature codes. These codes span from geopolitical divisions — such as countries, states or provinces, cities, or other administrative districts — to splits and aggregations by natural and human features. Types of physical terrain — above ground and underwater — are denoted, as well as regions and landscape features governed by human activities (such as vineyards or lighthouses) [1]. We wanted to retain this richness in our mappings.

We needed a bridge between feature codes and classes: a sort of umbrella property generally akin to owl:sameAs, but allowing for some inexactitude or degree of approximation. The appropriate choice here is umbel:correspondsTo, which was designed specifically for this purpose [2]. This predicate is thus the basis for the mappings.

The 671 GeoNames feature codes were manually mapped to corresponding classes among the UMBEL concepts, in a manner identical to that described for schema.org above. The result was to add three further OpenCyc concepts and 88 new UMBEL reference concepts to accommodate the full set of GeoNames feature codes. We thus now have a complete representation of the full structure and scope of GeoNames in UMBEL.

There are three modes in which one can now work with UMBEL:

  1. UMBEL ‘core’ alone, recommended when your concept space is not concerned with geographical information
  2. UMBEL ‘core’ plus the UMBEL ‘geo’ module, equivalent to prior versions of UMBEL, or
  3. UMBEL ‘core’ plus GeoNames, recommended where geographical information is important to your concept space.

In the latter case, you may use SPARQL queries with the umbel:correspondsTo predicate to achieve the desired retrievals. If more logic is required, you will likely need a rules-based addition such as SWRL [3] or RIF [4] to capture the umbel:correspondsTo semantics.
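The bridging pattern can be sketched in a few lines of Python. Everything here is illustrative: the GeoNames feature codes are real, but the UMBEL concept names and their assignments are hypothetical placeholders rather than the published mappings, and the SPARQL fragment is shown only as a string, not executed.

```python
# A minimal sketch of the umbel:correspondsTo bridge between GeoNames
# feature codes and UMBEL reference concepts. The concept names and
# assignments below are hypothetical placeholders, not the actual
# published mappings.

CORRESPONDS_TO = {
    "gn:A.PCLI": "umbel-rc:IndependentCountry",  # independent political entity
    "gn:P.PPLC": "umbel-rc:NationalCapital",     # capital of a political entity
    "gn:T.MT":   "umbel-rc:Mountain",            # mountain
}

def umbel_concepts_for(feature_code):
    """Resolve a GeoNames feature code to its corresponding UMBEL concept(s)."""
    concept = CORRESPONDS_TO.get(feature_code)
    return [concept] if concept else []

# The same bridge, expressed as the kind of SPARQL pattern the post
# describes (shown as a string only; not executed here):
SPARQL = """
SELECT ?entity WHERE {
  ?featureCode umbel:correspondsTo ?concept .
  ?entity a ?concept .
}
"""

print(umbel_concepts_for("gn:T.MT"))   # -> ['umbel-rc:Mountain']
```

Because umbel:correspondsTo is deliberately weaker than owl:sameAs, a plain lookup like this retrieves candidates without asserting strict identity; any stronger inference is where SWRL or RIF rules would come in.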

New Big Graph Visualization

Because of the UMBEL modularization, it has now become tractable to graph the main ontology in its entirety. The core UMBEL ontology contains about 26,000 reference concepts organized according to 33 super types. There are more than 60,000 relationships amongst these concepts, resulting in a graph structure of very large size.

It is difficult to grasp this graph in the abstract. Thus, using methods earlier described in our use of the Gephi visualization software [5], we present below a dynamic, navigable rendering of this graph of UMBEL core:

Note: at standard resolution, if this graph were rendered at actual size, it would be larger than 34 feet by 34 feet at full zoom. That is about 1,200 square feet, or half the floor area of a typical American house!

Note: If you are viewing this in a feed reader, click here to see the interactive graph.

This UMBEL graph displays:

  • All 26,000 concepts (“nodes”) with labels, and with connections shown (though you must zoom to see them)
  • The color-coded relation of these nodes to the 33 super types in UMBEL, as well as the relative position of these clusters with respect to one another, and
  • When zooming (use the scroll wheel or + icon) or panning (via mouse-down moves), wait a couple of seconds for the clearest image refresh.

You may also want to inspect a static version of this big graph by downloading a PDF.

Key Files and Links

Lastly, we fully updated the UMBEL Web site and re-released the UMBEL wiki.


[1] For more information on GeoNames, see http://www.geonames.org/. The complete mapping to GeoNames is based on its 671 feature codes, which describe natural, geopolitical, and human activity geo-locational information; see further http://www.geonames.org/statistics/total.html

[2] Approximate relationships are discussed in M.K. Bergman, 2010. “The Nature of Connectedness on the Web,” AI3:::Adaptive Information blog, November 22, 2010; see http://www.mkbergman.com/935/the-nature-of-connectedness-on-the-web/. One option, for example, is the x:coref predicate from the UMBC Ebiquity group; see further Jennifer Sleeman and Tim Finin, 2010. “Learning Co-reference Relations for FOAF Instances,” Proceedings of the Poster and Demonstration Session at the 9th International Semantic Web Conference, November 2010; see http://ebiquity.umbc.edu/_file_directory_/papers/522.pdf. In the words of Tim Finin of the Ebiquity group:

The solution we are currently exploring is to define a new property to assert that two RDF instances are co-referential when they are believed to describe the same object in the world. The two RDF descriptions might be incompatible because they are true at different times, or the sources disagree about some of the facts, or any number of reasons, so merging them with owl:sameAs may lead to contradictions. However, virtually merging the descriptions in a co-reference engine is fine — both provide information that is useful in disambiguating future references as well as for many other purposes. Our property (:coref) is a transitive, symmetric property that is a super-property of owl:sameAs and is paired with another, :notCoref that is symmetric and generalizes owl:differentFrom.

When we look at the analogous properties noted above, we see that they tend to share reflexivity, symmetry and transitivity. We specifically designed the umbel:correspondsTo predicate to capture these close, nearly equivalent, but still uncertain relationships.

[3] SWRL (Semantic Web Rule Language) combines sublanguages of the OWL Web Ontology Language (OWL DL and Lite) with those of the Rule Markup Language (Unary/Binary Datalog). SWRL has the full power of OWL DL, but at the price of decidability and practical implementations. See further http://www.w3.org/Submission/SWRL/.
[4] The Rule Interchange Format (RIF) is a W3C Recommendation. RIF is based on the observation that there are many “rules languages” in existence, and what is needed is to exchange rules between them. RIF includes three dialects, a Core dialect which is extended into a Basic Logic Dialect (BLD) and Production Rule Dialect (PRD). See further http://www.w3.org/2005/rules/wiki/RIF_FAQ.
[5] See further, M.K. Bergman, 2011. “A New Best Friend: Gephi for Large-scale Networks,” AI3:::Adaptive Information blog, August 8, 2011.
[6] schema.org lists its various contributing schema and also provides an OWL ontology of the system.
[7] See further, M.K. Bergman, 2011. “Structured Web Gets Massive Boost,” AI3:::Adaptive Information blog, June 2, 2011.

Posted by AI3's author, Mike Bergman Posted on May 21, 2012 at 12:26 am in Ontologies, Structured Web, UMBEL | Comments (0)
The URI link reference to this post is: http://www.mkbergman.com/999/new-umbel-release-gains-schema-org-geonames-capabilities/
The URI to trackback this post is: http://www.mkbergman.com/999/new-umbel-release-gains-schema-org-geonames-capabilities/trackback/
Posted:May 18, 2012

Some Quick Investigations Point to Promise, Disappointments

It has been clear for some time that Google has been assembling a war chest of entities and attributes. It first began to appear as structured results in its standard results listings, a trend I commented upon more than three years ago in Massive Muscle on the ABox at Google. Its purchase of Metaweb and its Freebase service in July 2010 only affirmed that trend.

This week, perhaps a bit rushed due to attention to the Facebook IPO, Google announced its addition of the Knowledge Graph (GKG) to its search results. It has been releasing this capability in a staged manner. Since I was fortunately one of the first to be able to see these structured results (due to luck of the draw and no special “ins”), I have spent a bit of time deconstructing what I have found.

Apparent Coverage

What you get (see below) when you search on particular kinds of entities is in essence an “infobox”, similar in structure to those found on Wikipedia. This infobox is a tabular presentation of key-value pairs, or attributes, for the kind of entity in the search. A ‘people’ search, for example, turns up birth and death dates and locations, some vital statistics, spouse or famous relations, pictures, and links to other relevant structured entities. The types of attributes shown vary by entity type. Here is an example for Stephen King, the writer (all links from here forward provide GKG results), which shows the infobox and its key-value pairs in the righthand column:

Google ‘Stephen King’ structured result

Reportedly these results are drawn from Freebase, Wikipedia, the CIA World Factbook and other unidentified sources. Some of the results may indeed be coming from Freebase, but I saw none as such. Most entities I found were from Wikipedia, though it is important to know that Freebase in its first major growth incarnation springboarded from a Wikipedia baseline. These early results may have been what was carried forward (since other contributed data to Freebase is known to be of highly variable quality).

The entity coverage appears to be spotty and somewhat disappointing in this first release. Entity types that I observed were in these categories:

  • People
  • Chemical Compounds
  • Directors
  • Some Companies
  • Some National Parks
  • Places
  • Musicians/Musical Groups
  • Actors
  • Some Government Agencies
  • Many Local Businesses
  • Animals
  • Movies
  • Albums
  • Notable Landmarks
  • Who Knows What Else


Entity types that I expected to see, but did not find include:

  • Products
  • Most Companies
  • Who Knows What Else
  • Songs
  • Most Government Agencies
  • Concepts
  • Non-government Organizations

This is clearly not rigorous testing, but it would appear that entity types along the lines of those in schema.org are what should be expected over time.

I have no way to gauge the accuracy of Google’s claims that it has offered up structured data on some 500 million entities (and 3.5 billion facts). However, given the lack of coverage in key areas of Wikipedia (which itself has about 3 million entities in the English version), I suspect much of that number comes from the local businesses and restaurants and such that Google has been rapidly adding to its listings in recent years. Coverage of broadly interesting stuff still seems quite lacking.

The much-touted knowledge graph is also kind of disappointing. Related links are provided, but they are close and obvious. So, an actor will have links to films she was in, or a person may have links to famous spouses, but anything approaching a conceptual or knowledge relationship is missing. I think, though, we can see such links and types and entity expansions to steadily creep in over time. Google certainly has the data basis for making these expansions. And, constant, incremental improvement has been Google’s modus operandi.

Deconstructing the URL

For some time, at various meetings I attend, I have been at pains to ask Google representatives whether there is some unique, internal ID for entities within its databases. Sometimes the reps I have questioned just don’t know, and sometimes they are cagey.

But, clearly, anything like the structured data that Google has been working toward has to have some internal identifier. To see if some of this might now have surfaced with the Knowledge Graph, I did a bit of poking of the URLs shown in the searches and the affiliated entities in the infoboxes. Under most standard searches, the infobox appears directly if there is one for that object. But, by inspecting cross-referenced entities from the infoboxes themselves, it is possible to discern the internal key.

The first thing one can do in such inspections is to remove the parameters that reflect local or cookie-based state related to one’s own use preferences or browser settings. Other tests can show other removals. So, using the Stephen King example above, we can eliminate these aspects of the URL:

https://www.google.com/search?num=100&hl=en&safe=off&q=stephen+king&stick=H4sIAAAAAAAAAONgVuLQz9U3MCvIKwQAcT1q1AwAAAA&sa=X&ei=Q-S2T6e1HYOw2wXR2OXOCQ&ved=0CPMGEJsTKAA&biw=1280&bih=827

This actually conformed to my intuition, because the ‘&stick’ parameter was new to me. (Typically, in many of these dynamic URLs, the various parameters are separated from one another by a set designator character. In the case of Google, that is the ampersand &.)

By simply doing repeated searches that result in the same entity references, I was able to confirm that the &stick parameter is what invokes the unique ID and infobox for each entity. We can decompose it further, but the critical aspect seems to be what is not included within the following: &stick=H4sIAAAAAAAAAONg . . [VuLQz9U3] . . AAAA. The stuff in the brackets varies less, and I suspect it might be related to the source, rather than the entity.
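This kind of deconstruction is easy to replicate with Python’s standard library. Note the assumptions: which parameters count as session or browser state is my own observation from these tests, not anything Google documents, and the helper names are mine.

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

# The Stephen King URL from the post.
URL = ("https://www.google.com/search?num=100&hl=en&safe=off&q=stephen+king"
       "&stick=H4sIAAAAAAAAAONgVuLQz9U3MCvIKwQAcT1q1AwAAAA"
       "&sa=X&ei=Q-S2T6e1HYOw2wXR2OXOCQ&ved=0CPMGEJsTKAA&biw=1280&bih=827")

# Parameters that, by observation, reflect session or browser state rather
# than the entity itself (this set is a guess, not documented by Google).
LOCAL_PARAMS = {"num", "hl", "safe", "sa", "ei", "ved", "biw", "bih"}

def essential_params(url):
    """Keep only the parameters that determine the result, dropping local ones."""
    query = parse_qs(urlparse(url).query)
    return {k: v[0] for k, v in query.items() if k not in LOCAL_PARAMS}

def swap_stick(url, new_stick):
    """Rebuild the URL with a different &stick (entity) identifier."""
    parts = urlparse(url)
    query = {k: v[0] for k, v in parse_qs(parts.query).items()}
    query["stick"] = new_stick
    return urlunparse(parts._replace(query=urlencode(query)))

print(essential_params(URL))   # only q and stick survive the pruning
```

Running `essential_params` against several searches is exactly the repeated-search test described above: the query term and the &stick value are all that remain, and `swap_stick` reproduces the infobox-substitution trick.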

I started to do some investigation on types and possible sources, but ran out of time. Listed below are some &stick identifiers for various types of entities (each is a live link):

Type              Entity                 Infobox Identifier
Place             Los Angeles            &stick=H4sIAAAAAAAAAONgVuLSz9U3MDYoTDIuAQD9ON7KDgAAAA
Place             Brentwood              &stick=H4sIAAAAAAAAAONgVuLUz9U3MKwsNEkDAN6nm-sNAAAA
Person            Arthur Miller          &stick=H4sIAAAAAAAAAONgVmLXz9U3qEwrAADRsxaaCwAAAA
Person            Joe DiMaggio           &stick=H4sIAAAAAAAAAONgVuLQz9U3SDFMqwAAAy8zdQwAAAA
Person            Marilyn Monroe         &stick=H4sIAAAAAAAAAONgVuLQz9U3MCkvLAIAW7x_LwwAAAA
Movie             Shawshank Redemption   &stick=H4sIAAAAAAAAAONgVuLQz9U3MM_KKwEAPYgNDQwAAAA
Movie             The Green Mile         &stick=H4sIAAAAAAAAAONgVuLUz9U3MC62zC4AAGg8mEkNAAAA
Movie             Dr. Strangelove        &stick=H4sIAAAAAAAAAONgVuLQz9U3MEopzwIAfsFCUQwAAAA
Animal            Toucan                 &stick=H4sIAAAAAAAAAONgVuLQz9U3MM8wNAcA4g1_3AwAAAA
Animal            Wolverine              &stick=H4sIAAAAAAAAAONgUeLQz9U3sDAryQIAhJ3RUwwAAAA
Musicians/Groups  The Beatles            &stick=H4sIAAAAAAAAAONgVuLQz9U3ME82yAIAC_7r3AwAAAA
Albums            The White Album        &stick=H4sIAAAAAAAAAONgVuLSz9U3MMxIN0nKAADnd5clDgAAAA


You can verify that this ‘&stick’ reference is what pulls in the infobox by looking at this modified query, which substitutes Marilyn Monroe’s &stick in the Stephen King URL string:

Google ‘Stephen King’ + ‘Marilyn Monroe’ structured results

Note that the standard search results in the lefthand results panel are the same as for Stephen King, but we have now fooled the Google engine into displaying Marilyn Monroe’s infobox.

I’m sure over time that others will deconstruct this framework to a very precise degree. What would really be great, of course, as noted on many recent mailing lists, is for Google to expose all of this via an API. The Google listing could become the de facto source for Webby entity identifiers.

Some Concluding Thoughts

Sort of like when schema.org was first announced, there have been complaints from some in the semantic Web community that Google released this stuff without once saying the word “semantic”, that much had been ripped off from the original researchers and community without attribution, that a gorilla commercial entity like Google could only be expected to milk this stuff for profit, etc., etc.

That all sounds like sour grapes to me.

What we have here is what we are seeing across the board: the inexorable integration of semantic technology approaches into many existing products. Siri did it with voice commands; Bing, and now Google, are doing it too with search.

We should welcome these continued adoptions. The fact is, semWeb community, that what we are seeing in all of these endeavors is the right and proper role for these technologies: in the background, enhancing our search and information experience, and not something front and center or rammed down our throats. These are the natural roles of semantic technologies, and they are happening at a breakneck pace.

Welcome to the semantic technology space, Google! I look forward to learning much from you.

Posted by AI3's author, Mike Bergman Posted on May 18, 2012 at 7:43 pm in Adaptive Information, Structured Web | Comments (3)
The URI link reference to this post is: http://www.mkbergman.com/1009/deconstructing-the-google-knowledge-graph/
The URI to trackback this post is: http://www.mkbergman.com/1009/deconstructing-the-google-knowledge-graph/trackback/
Posted:April 4, 2012

Tractricious Sculpture at Fermilab; picture by Mike Kappel

Adaptive Information is a Hammer, but Genes are Not a Nail

Since Richard Dawkins first put forward the idea of the “meme” in his book The Selfish Gene some 35 years ago [1], the premise has stuck in my craw. I, like Dawkins, was trained as an evolutionary biologist. I understand the idea of the gene and its essential role as a vehicle for organic evolution. And, all of us clearly understand that “ideas” themselves have a certain competitive and adaptive nature. Some go viral; some run like wildfire and take prominence; and some go nowhere or fall on deaf ears. Culture and human communications and ideas play complementary — perhaps even dominant — roles in comparison to the biological information contained within DNA (genes).

I think there are two bases for why the “meme” idea sticks in my craw. The first harkens back to Dawkins. In formulating the concept of the “meme”, Dawkins falls into the trap of many professionals, what the French call déformation professionnelle. This is the idea of professionals framing problems from the confines of their own points of view. This is also known as the Law of the Instrument, or (Abraham) Maslow‘s hammer, or what all of us know colloquially as “if all you have is a hammer, everything looks like a nail” [2]. Human or cultural information is not genetics.

The second — and more fundamental — basis for why this idea sticks in my craw is its mis-characterization of what is adaptive information, the title and theme of this blog. Sure, adaptive information can be found in the types of information structures at the basis of organic life and organic evolution. But, adaptive information is much, much more. Adaptive information is any structure that provides arrangements of energy and matter that maximizes entropy production. In inanimate terms, such structures include chemical chirality and proteins. It includes the bases for organic life, inheritance and organic evolution. For some life forms, it might include communications such as pheromones or bird or whale songs or the primitive use of tools or communicated behaviors such as nest building. For humans with their unique abilities to manipulate and communicate symbols, adaptive information embraces such structures as languages, books and technology artifacts. These structures don’t look or act like genes and are not replicators in any fashion of the term. To hammer them as “memes” significantly distorts their fundamental nature as information structures and glosses over what factors might — or might not — make them adaptive.

I have been thinking about these concepts for much of the past few decades. Recently, though, there has been a spate of use of the “meme” term, particularly on the semantic Web mailing lists to which I subscribe. This usage has caused me to outline some basic ideas about what I find so problematic in the “meme” concept.

A Brief Disquisition on Memes

As defined by Dawkins and expanded upon by others, a “meme” is an idea, behavior or style that spreads from person to person within a culture. It is proposed as being transmissible through writing, speech, gestures or rituals. Dawkins specifically cited melodies, catch-phrases, fashion and the technology of building arches as examples of memes. A meme is postulated as a cultural analogue to genes in that memes are assumed to be able to self-replicate, mutate or respond to selective pressures. Thus, as proposed, memes may evolve by natural selection in a manner analogous to biological evolution.

However, unlike a gene, a structure corresponding to a “meme” has never been discovered or observed. There is no evidence for it as a unit of replication, or indeed as any kind of coherent unit at all. In its sloppy use, it is hard to see how “meme” differs in its scope from concepts, ideas or any form of cultural information or transmission, yet it is imbued with properties analogous to animate evolution for which there is not a shred of empirical evidence.

One might say, so what, the idea of a “meme” is merely a metaphor, what is the harm? Well, the harm comes about when it is taken seriously as a means of explaining human behavior and cultural changes, a field of study called memetics. It becomes a pseudo-scientific term that sets a boundary condition for understanding the nature of information and what makes it adaptive or not [3]. Mechanisms and structures appropriate to animate life are not universal information structures, they are simply the structures that have evolved in the organic realm. In the human realm of signs and symbols and digital information and media, information is the universal, not the genetic structure of organic evolution.

The noted evolutionary geneticist, R.C. Lewontin, one of my key influences as a student, has also been harshly critical of the idea of memetics [4]:

 “The selectionist paradigm requires the reduction of society and culture to inheritance systems that consist of randomly varying, individual units, some of which are selected, and some not; and with society and culture thus reduced to inheritance systems, history can be reduced to ‘evolution.’ . . . we conclude that while historical phenomena can always be modeled selectionistically, selectionist explanations do not work, nor do they contribute anything new except a misleading vocabulary that anesthetizes history.”

Consistent with my recent writings about Charles S. Peirce [5], many logicians and semiotic theorists are also critical of the idea of “memes”, but on different grounds. The criticism here is that “memes” distort Peirce’s ideas about signs and the reification of signs and symbols via a triadic nature. Notable in this camp is Terrence Deacon [6].

Information is a First Principle

It is not surprising that the concept of “memes” arose in the first place. It is understandable to seek universal principles consistent with natural laws and observations. The mechanism of natural evolution works on the information embodied in DNA, so why not look to genes as some form of universal model?

The problem here, I think, was to confuse mechanisms with first principles. Genes are a mechanism — a “structure” if you will — that along with other forms of natural selection such as the entire organism and even kin selection [7], have evolved as means of adaptation in the animate world. But the fundamental thing to be looked for here is the idea of information, not the mechanism of genes and how they replicate. The idea of information holds the key for drilling down to universal principles that may find commonality between information for humans in a cultural sense and information conveyed through natural evolution for life forms. It is the search for this commonality that has driven my professional interests for decades, spanning from population genetics and evolution to computers, information theory and semantics [8].

But before we can tackle these connections head on, it is important to address a couple of important misconceptions (as I see them).

Segue #1: Information is (Not!) Entropy

In looking to information as a first principle, Claude Shannon‘s seminal work in 1948 on information theory must be taken as the essential point of departure [9]. The motivation of Shannon’s paper and work by others preceding him was to understand information losses in communication systems or networks. Much of the impetus for this came about because of issues in wartime communications and early ciphers and cryptography. (As a result, the Shannon paper is also intimately related to data patterns and data compression, not further discussed here.)

In a strict sense, Shannon’s paper was really talking about the amount of information that could be theoretically and predictably communicated between a sender and a receiver. No context or semantics were implied in this communication, only the amount of information (for which Shannon introduced the term “bits” [10]) and what might be subject to losses (or uncertainty in the accurate communication of the message). In this regard, what Shannon called “information” is what we would best term “data” in today’s parlance.

The form of the uncertainty (unpredictability) calculation that Shannon derived:

 H(X) = - \sum_{i=1}^{n} p(x_i) \log_b p(x_i)

very much resembled the mathematical form of Boltzmann‘s original definition of entropy (as elaborated upon by Gibbs and denoted S, for Gibbs entropy):

 S = - k_B \sum_i p_i \ln p_i

and thus Shannon also labelled his measure of unpredictability, H, as entropy [10].
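Shannon’s H is simple enough to compute directly. The following is a minimal sketch of the formula above, in bits (base 2); the function name and the example distributions are mine, chosen only for illustration.

```python
from math import log2

def shannon_entropy(probs):
    """H(X) = -sum_i p_i * log2(p_i), in bits.
    Zero-probability outcomes contribute nothing, by convention."""
    return -sum(p * log2(p) for p in probs if p > 0)

# A fair coin: maximum uncertainty for two outcomes.
print(shannon_entropy([0.5, 0.5]))        # -> 1.0
# Four equally likely outcomes: two bits.
print(shannon_entropy([0.25] * 4))        # -> 2.0
# A biased coin is more predictable, so it carries less uncertainty.
print(shannon_entropy([0.9, 0.1]) < 1.0)  # -> True
```

The biased-coin case makes the “unpredictability” reading concrete: the more skewed the distribution, the lower H, exactly as for the thermodynamic S of the same mathematical form.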

After Shannon, and nearly a century after Boltzmann, work by individuals such as Jaynes in the field of statistical mechanics came to show that thermodynamic entropy can indeed be seen as an application of Shannon’s information theory, so there are close parallels [11]. This parallel of mathematical form and terminology has led many to assert that information is entropy.

I believe this assertion is a misconception on two grounds.

First, as noted, what is actually being measured here is data (or bits), not information embodying any semantic meaning or context. Thus, the formula and the terminology are not accurate for discussing “information” in a conventional sense.

Second, the Shannon methods are based on the communication (transmittal) between a sender and a receiver. Thus the Shannon entropy measure is actually a measure of the uncertainty for either one of these states. The actual information that gets transmitted and predictably received was formulated by Shannon as R (which he called rate), which he expressed basically as:

 R = H_{before} - H_{after}

R, then, becomes a proxy for the amount of information accurately communicated. R is always less than Hbefore (because all communication systems have losses). Hbefore and Hafter are both state functions for the message, so this also makes R a function of state. So while there is Shannon entropy (unpredictability) for any given sending or receiving state, the actual amount of information (that is, data) that is transmitted is a change in state, as measured by the change in uncertainty between sender (Hbefore) and receiver (Hafter). In the words of Thomas Schneider, who provides a very clear discussion of this distinction [12]:

Information is always a measure of the decrease of uncertainty at a receiver.

These points do not directly bear on the basis of information discussed below, but they help remove misunderstandings that might undercut those points. Further, these clarifications make the theoretical foundations of information (data) consistent with natural evolution while remaining logically consistent with the 2nd law of thermodynamics (see next).
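A toy calculation makes Schneider’s point concrete: R is the decrease in uncertainty at the receiver. The posterior probabilities below are made up purely for illustration; they stand in for a receiver whose guess sharpens, but does not become certain, after a noisy observation.

```python
from math import log2

def H(probs):
    """Shannon uncertainty in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Before any message arrives: four equally likely messages -> 2 bits of uncertainty.
h_before = H([0.25, 0.25, 0.25, 0.25])

# After a noisy observation, the receiver's posterior sharpens but stays
# uncertain (these posterior probabilities are invented for illustration).
h_after = H([0.7, 0.1, 0.1, 0.1])

# R, Shannon's rate: the decrease in uncertainty, i.e. the data conveyed.
R = h_before - h_after   # roughly 0.64 bits here

print(round(R, 2))
```

Because the channel is lossy, the posterior never collapses to certainty, so R stays strictly below the 2 bits the sender started with, which is exactly the change-of-state reading of R given above.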

Segue #2: Entropy is (Not!) Disorder

The 2nd law of thermodynamics expresses the tendency that, over time, differences in temperature, pressure, or chemical potential equilibrate in an isolated physical system. Entropy is a measure of this equilibration: for a given physical system, the highest entropy state is one at equilibrium. Fluxes or gradients arise when there are differences in state potentials in these systems. (In physical systems, these are known as sources and sinks; in information theory, they are sender and receiver.) Fluxes go from low to high entropy, and are non-reversible — the “arrow of time” — without the addition of external energy. Heat, for example, is a by-product of fluxes in thermal energy. Because these fluxes are directional in isolation, a perpetual motion machine is shown to be impossible.

In a closed system (namely, the entire cosmos), one can see this gradient as spanning from order to disorder, with the equilibrium state being the random distribution of all things. This perspective, and much schooling regarding these concepts, tends to present the idea of entropy as a “disordered” state. Life is seen as the “ordered” state in this mindset. Hewing to this perspective, some prominent philosophers, scientists and others have sometimes tried to present the “force” representing life and “order” as the opposite of entropy. One common term for this opposite “force” is “negentropy” [13].

But, in the real conditions common to our lives, our environment is distinctly open, not closed. We experience massive influxes of energy via sunlight, and have learned as well how to harness stored energy from eons past in further sources of fossil and nuclear energy. Our open world is indeed a high energy one, and one that increases that high-energy state as our knowledge leads us to exploit still further resources of higher and higher quality. As Buckminster Fuller once famously noted, electricity consumption (one of the highest quality energy resources found to date) has become a telling metric about the well-being and wealth of human societies [14].

The high-energy environments fostering life on earth and more recently human evolution establish a local (in a cosmic sense) gradient that promotes fluxes to more ordered states, not lesser unordered ones. These fluxes remain faithful to basic physical laws and are non-deterministic [15]. Indeed, such local gradients can themselves be seen as consistent with the conditions initially leading to life, favoring the random event in the early primordial soup that led to chemical structures such as chirality, auto-catalytic reactions, enzymes, and then proteins, which became the eventual building blocks for animate life [16].

These events did not have preordained outcomes (that is, they were non-deterministic), but were the result of time and variation in the face of external energy inputs to favor the marginal combinatorial improvement. The favoring of the new marginal improvement also arises consistent with entropy principles, by giving a competitive edge to those structures that produce faster movements across the existing energy gradient. According to Annila and Annila [16]:

“According to the thermodynamics of open systems, every entity, simple or sophisticated, is considered as a catalyst to increase entropy, i.e., to diminish free energy. Catalysis calls for structures. Therefore, the spontaneous rise of structural diversity is inevitably biased toward functional complexity to attain and maintain high-entropy states.”

Via this analysis we see that life is not at odds with entropy, but is consistent with it. Further, we see that incremental improvements in structure that are consistent with the maximum entropy production principle will be favored [17]. Of course, absent the external inputs of energy, these gradients would reverse. Under those conditions, the 2nd law would promote a breakdown to a less ordered system, what most of us have been taught in schools.

With these understandings we can now see that the dichotomy of life representing order and entropy representing disorder is a false one. Further, we can see a guiding set of principles that is consistent across the broad span of evolution, from primordial chemicals and enzymes to basic life and on to human knowledge and artifacts. This insight provides the fundamental “unit” we need to be looking toward, and it is neither the gene nor the “meme”.

Information is Structure

Of course, the fundamental “unit” we are talking about here is information, and not limited, as is Shannon’s concept, to data. The quality that changes data to information is structure, and structure of a particular sort. Like all structure, it shows order or patterns, often of a hierarchical or fractal or graph nature. But the real aspect of the structure that is important is the marginal ability of that structure to lead to improvements in entropy production. That is, the processes that are most adaptive (and therefore selected) are those that maximize entropy production. Any structure that emerges that is able to reduce the energy gradient faster will be favored.

However, remember, these are probabilistic, statistical processes. Uncertainties in state may favor one structure at one time versus another at a different time. The types of chemical compounds favored in the primordial soup were likely greatly influenced by thermal and light cycles and drying and wet conditions. In biological ecosystems, there are huge differences in seed or offspring production or in overall species diversity and ecological complexity based on the stability (say, tropics) or instability (say, disturbance) of local environments. As noted, these processes are inherently non-deterministic.

As we climb up the chain from the primordial ooze to life and then to humans and our many information mechanisms and technology artifacts (which are themselves embodiments of information), we see increasing complexity and structure. But we do not see uniformity of mechanisms or vehicles.

Information transfer in living organisms occurs (generally) via DNA in genes, is mediated by sex in higher organisms, is subject to random mutations, and is then kept or lost entirely as the host organisms survive to procreate or not. Those are harsh conditions: the information survives or not (on a population basis), with high concentrations of information in DNA and a priority placed on remixing for new combinations via sex. Information exchange (generally) occurs only at each generational event.

Human cultural information, however, is of an entirely different nature. Information can be made persistent, can be recorded and shared across individuals or generations, can be extended with new innovations like written language or digital computers, or can be combined in ways that defy the limits of sex. Losses do occur, of course, whether through living languages dying out with their cultures or populations, or through horrendous catastrophes like the Spanish burning (nearly all of) the Mayans’ existing books [18]. Here, too, the environment remains uncertain.

So, while we can define the DNA in genes and the ideas behind a “meme” both as information, we can now see how very unlike the dynamics and structures of these two forms really are. We can be awestruck by the elegance and sublimity of organic evolution. We can also be inspired by song or poem, or moved to action through ideals such as truth and justice. But organic evolution does not transpire like reading a book or hearing a sermon, just as human ideas and innovations do not act like genes. The “meme” is a totally false analogy. The only constant is information.

Some Tentative Implications

The closer we come to finding true universals, the better we will be able to create maximum-entropy-producing structures. This, in turn, has some pretty profound implications. The insight that underpins these implications begins with an understanding of the fundamental nature, and importance, of information. According to Karnani et al. [19]:

“. . . the common contemporary consent, the second law of thermodynamics, is perceived to drive disorder. Therefore, it may appear, at first sight, inconceivable that this universal law could possibly account for the existence and orderly characteristics of information, as well as for its meaningful content. However, the second law, or equivalently the principle of increasing entropy, merely states that difference among energy densities tends to vanish. When the surrounding energy density is high, the system will evolve toward a stationary state by increasing its energy content, e.g, by devising orderly machinery for energy transduction to acquire energy. . . . Syntax of information, when described by thermodynamics, is associated with the entropy of the physical representation, and significance of information is associated with the entropy increase in the receiver system when it executes the encoded information.”

All would agree that the evolution of life over the past few billion years is truly wondrous. But, what is equally wondrous is that the human species has come to learn and master symbols. That mastery, in turn, has broken the bounds of organic evolution and has put into our hands the very means and structure of information itself. Via this entirely new — and incredibly accelerated — path to information structures, we are only now beginning to see some of its implications:

  • Unlike all other organisms, we dominate our environment and have experienced increasing wealth and freedom. Wealth and its universal applicability continue to increase at an exponential rate [20]
  • We no longer depend on the random variant to maximize our entropy producing structures. We can now do so purposefully and with symbologies and bit streams of our own devising
  • Potentially all information variants can be recorded and shared across all human individuals and generations, a complete decoupling from organic boundaries
  • Key ideas and abstractions, such as truth, justice and equality, can operate on a species-wide basis and become adopted without massive die-offs of individuals
  • We are actively moving ourselves into higher-level energy states, further increasing the potential for wealth and new structures
  • We are actively impacting our local environment, potentially creating the conditions for our species’ demise
  • We are increasingly engaging all individuals of the human species in these endeavors through literacy, education and access to global information sources. This provides a still further multiplier effect on humanity’s ability to devise and manipulate information structures into more adaptive and highly-ordered states.

The idea of a “meme” actually cheapens our understanding of these potentials.

Ideas matter and terminology matters. These are the symbols by which we define and communicate potentials. If we choose the wrong analogies or symbols, as “meme” is in this case, we are picking the option with the lower entropy potential. Whether I assert it to be so or not, the “meme” concept is an information structure doomed to extinction.


[1] Richard Dawkins, 1976. The Selfish Gene, Oxford University Press, New York City, ISBN 0-19-286092-5.
[2] This phrase was perhaps first made famous by Mark Twain or Bernard Baruch, but in any case is clearly understood now by all.
[3] According to Wikipedia, Benitez-Bribiesca calls memetics “a dangerous idea that poses a threat to the serious study of consciousness and cultural evolution”. He points to the lack of a coding structure analogous to the DNA of genes, and to instability of any mutation mechanisms for “memes” sufficient for standard evolution processes. See Luis Benitez-Bribiesca, 2001. “Memetics: A Dangerous Idea”, Interciencia: Revista de Ciencia y Tecnología de América (Venezuela: Asociación Interciencia) 26 (1): 29–31, January 2001. See http://redalyc.uaemex.mx/redalyc/pdf/339/33905206.pdf.
[4] Joseph Fracchia and R.C. Lewontin, 2005. “The Price of Metaphor”, History and Theory (Wesleyan University) 44 (44): 14–29, February 2005.
[5] See further M. K. Bergman, 2012. “Give Me a Sign: What Do Things Mean on the Semantic Web?,” posting on AI3:::Adaptive Information blog, January 24, 2012. See http://www.mkbergman.com/994/give-me-a-sign-what-do-things-mean-on-the-semantic-web/.
[6] Terrence Deacon, 1999. “The Trouble with Memes (and what to do about it)”. The Semiotic Review of Books 10(3). See http://projects.chass.utoronto.ca/semiotics/srb/10-3edit.html.
[7] Kin selection refers to changes in gene frequency across generations that are driven at least in part by interactions between related individuals. Some mathematical models show how evolution may favor the reproductive success of an organism’s relatives, even at a cost to an individual organism. Under this mode, selection can occur at the level of populations and not the individual or the gene. Kin selection is often posed as the mechanism for the evolution of altruism or social insects. Among others, kin selection and inclusive fitness was popularized by W. D. Hamilton and Robert Trivers.
[8] You may want to see my statement of purpose under the Blogasbörd topic, first written seven years ago when I started this blog.
[9] Claude E. Shannon, 1948. “A Mathematical Theory of Communication”, Bell System Technical Journal, 27: 379–423, 623-656, July, October, 1948. See http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf.
[10] As Shannon acknowledges in his paper, the “bit” term was actually suggested by J. W. Tukey. Shannon can be more accurately said to have popularized the term via his paper.
[12] See Thomas D. Schneider, 2012. “Information Is Not Entropy, Information Is Not Uncertainty!,” Web page retrieved April 4, 2012; see http://www.lecb.ncifcrf.gov/~toms/information.is.not.uncertainty.html.
[13] The “negative entropy” (also called negentropy or syntropy) of a living system is the entropy that it exports to keep its own entropy low, and according to proponents lies at the intersection of entropy and life. The concept and phrase “negative entropy” were introduced by Erwin Schrödinger in his 1944 popular-science book What is Life?. See Erwin Schrödinger, 1944. What is Life – the Physical Aspect of the Living Cell, Cambridge University Press, 1944. A copy may be downloaded at http://old.biovip.com/UpLoadFiles/Aaron/Files/2005051204.pdf.
[14] R. Buckminster Fuller, 1981. Critical Path, St. Martin’s Press, New York City, 471 pp. See especially p. 103 ff.
[15] The seminal paper first presenting this argument is Vivek Sharma and Arto Annila, 2007. “Natural Process – Natural Selection”, Biophysical Chemistry 127: 123-128. See http://www.helsinki.fi/~aannila/arto/natprocess.pdf. This basic theme has been much expanded upon by Annila and his various co-authors. See, for example, [16] and [19], among many others.
[16] Arto Annila and Erkki Annila, 2008. “Why Did Life Emerge?,” International Journal of Astrobiology 7(3 and 4): 293-300. See http://www.helsinki.fi/~aannila/arto/whylife.pdf.
[17] According to Wikipedia, the principle (or “law”) of maximum entropy production is an aspect of non-equilibrium thermodynamics, a branch of thermodynamics that deals with systems that are not in thermodynamic equilibrium. Most systems found in nature are not in thermodynamic equilibrium and are subject to fluxes of matter and energy to and from other systems and to chemical reactions. One fundamental difference between equilibrium thermodynamics and non-equilibrium thermodynamics lies in the behavior of inhomogeneous systems, which require for their study knowledge of rates of reaction which are not considered in equilibrium thermodynamics of homogeneous systems. Another fundamental difference is the difficulty in defining entropy in macroscopic terms for systems not in thermodynamic equilibrium.
The principle of maximum entropy production states that, in comparing two or more alternative paths for crossing an energy gradient, the one that creates the maximum entropy change will be favored. The maximum entropy (sometimes abbreviated MaxEnt or MaxEp) concept is related to this notion. It is also known as the maximum entropy production principle, or MEPP.
[18] The actual number of Mayan books burned by the Spanish conquistadors is unknown, but is somewhere between tens and thousands; see here. Only three or four codexes are known to survive today. Also, Wikipedia contains a listing of notable book burnings throughout history.
[19] Mahesh Karnani, Kimmo Pääkkönen and Arto Annila, 2009. “The Physical Character of Information,” Proceedings of the Royal Society A, April 27, 2009. See http://www.helsinki.fi/~aannila/arto/natinfo.pdf.
[20] I discuss and chart the exponential growth of human wealth based on Angus Maddison data in M. K. Bergman, 2006. “The Biggest Disruption in History: Massively Accelerated Growth Since the Industrial Revolution,” post in AI3:::Adaptive Information blog, July 27, 2006. See http://www.mkbergman.com/250/the-biggest-disruption-in-history-massively-accelerated-growth-since-the-industrial-revolution/.
Posted:December 12, 2011

State of SemWeb Tools - 2011
Number of Semantic Web Tools Passes 1000 for First Time; Many Other Changes

We have been maintaining Sweet Tools, AI3’s listing of semantic Web and -related tools, for a bit over five years now. Though we had switched to a structWSF-based framework that allows us to update it on a more regular, incremental schedule [1], like all databases, the listing needs to be reviewed and cleaned up on a periodic basis. We have just completed the most recent cleaning and update. We are also now committing to do so on an annual basis.

Thus, this is the inaugural ‘State of Tooling for Semantic Technologies’ report, and, boy, is it a humdinger. There have been more changes — and more important changes — in this past year than in all four previous years combined. I think it fair to say that semantic technology tooling is now reaching a mature state, the trends of which likely point to future changes as well.

In this past year more tools have been added, more tools have been dropped (or abandoned), and more tools have taken on a professional, sophisticated nature. Further, for the first time, the number of semantic technology and -related tools has passed 1000. This is remarkable, given that more tools have been abandoned or retired than ever before.

Click here to browse the Sweet Tools listing. There is also a simple listing of URL links and categories only.

We first present our key findings and then overall statistics. We conclude with a discussion of observed trends and implications for the near term.

Key Findings

Some of the key findings from the 2011 State of Tooling for Semantic Technologies are:

  • As of the date of this article, there are 1010 tools in the Sweet Tools listing, the first time it has passed 1000 total tools
  • A total of 158 new tools have been added to the listing in the last six months, an increase of 17%
  • 75 tools have been abandoned or retired, the most removed in any period over the past five years
  • A further 6%, or 55 tools, have been updated since the last listing
  • Though open source has always been an important component of the listing, it now constitutes more than 80% of all listings; with dual licenses, open source availability is about 83%. Online systems contribute another 9%
  • Key application areas for growth have been in SPARQL, ontology-related areas and linked data
  • Java continues to dominate as the most important language.

Many of these points are elaborated below.

The Statistical Picture

The updated Sweet Tools listing now includes nearly 50 different tools categories. The most prevalent categories, each with over 6% of the total, are information extraction, general RDF tools, ontology tools, browser tools (RDF, OWL), and parsers or converters. The relative share by category is shown in this diagram (click to expand):

Since the last listing, the fastest growing categories have been SPARQL, linked data, knowledge bases and all things related to ontologies. The relative changes by tools category are shown in this figure:

Some of this growth is surely the result of discovery, based on our own tool needs and investigations. But we have also been monitoring this space for some time, and serendipity alone is not a compelling explanation. Rather, I think that we are seeing both an increase in practical tools (such as for querying), plus the trends of linked data growth matched with greater sophistication in areas such as ontologies and the OWL language.

The languages these tools are written in have also remained fairly constant over the past couple of years, with Java staying dominant. Java has represented half of all tools in this space, a share that continues with the most recent tools as well (see below). More than a dozen programming or scripting languages have at least some share of the semantic tooling space (click to expand):

Sweet Tools Languages

With only about 160 new tools it is hard to draw firm trends, but it does appear that some languages (Haskell, XSLT) have fallen out of favor, while popularity has grown for Flash/Flex (from a small base), Python and Prolog (the latter with the growth of logic tools):

PHP will likely continue to see some emphasis because of relations to many content management systems (WordPress, Drupal, etc.), though both Python and Ruby seem to be taking some market share in that area.

New Tools

The newest tools added to the listing show somewhat similar trends. Again, Java is the dominant language, but with much increased use of JavaScript, Python and Prolog:

Sweet Tools Languages

The higher incidence of Prolog is likely due to the parallel increase in reasoners and inference engines associated with ontology (OWL) tools.

The increase in comprehensive tool suites and use of Eclipse as a development environment would appear to secure Java’s dominance for some time to come.

Trends and Observations

These dry statistics tend to mask the feel one gets when looking at most of the individual tools across the board. Older academic and government-funded project tools are finally getting cleaned out and abandoned. Those tools that remain have tended to get some version upgrades and improved Web sites to accompany them.

The general feel one gets with regard to semantic technology tooling at the close of 2011 has these noticeable trends:

  • A three-tiered environment – the tools seem to segregate into: 1) a bottom tier of tools (largely) developed by individuals or small groups, now most often found on Google Code or Github; 2) a middle-tier of (largely) government-funded projects, sometimes with multiple developers, often older, but with no apparent driving force for ongoing improvements or commercialization; and 3) a top-tier of more professional and (often) commercially-oriented tools. The latter category is the most noticeable with respect to growth and impact
  • Professionalism – the tools in the apparent top tiers appear more professional, with better (and more attractive) packaging. This professionalism is especially true for the frameworks and composite applications. But it also applies to many of the EU-funded projects from Europe, which has always been a huge source of new tool developments
  • More complete toolsets – similarly, the upper levels of tools are oriented to pragmatic problems and problem-solving, which often means they embody multiple functions and more complete tooling environments. This category actually appears to be the most visible one exhibiting growth
  • Changing nature of academic releases – yet, even the academic releases seem to be increasing in professionalism and completeness. Though in the lowest tier it is still possible to see cursory or experimental tool releases, newer academic releases (often) seem to be more strategically oriented and parts of broader programmatic emphases. Programs like AKSW from the University of Leipzig or the Freie Universität Berlin or Finland’s Semantic Computing Research Group (SeCo), among many others, tend to be exemplars of this trend
  • Rise of commercial interests and enterprise adoption – the growing maturity of semantic technologies is also drawing commercial interest, and the incubation of new start-ups by academic and research institutions acts to reinforce the above trends. Promising projects and tools are now much more likely to be spun off as potential ventures, with accompanying better packaging, documentation and business models
  • Multiple languages and applications – with this growing complexity and sophistication has also come more complicated apps, combining multiple languages and functions. In fact, for some time the Sweet Tools listing has been justifiably criticized by some as overly “simplifying” the space by classifying tools under (largely) single applications or single languages. By the 2012 survey, it will likely be necessary to better classify the tools using multiple assignments
  • Google Code over SourceForge for open source (and an increase in Github, as well) – virtually all projects on SourceForge now feel abandoned or less active. The largest source of open source projects in the semantic technology space is now clearly Google Code. Though of a smaller footprint today, we are also seeing many of the newer open source projects gravitate to Github. Open source hosting environments are clearly in flux.

I have said this before, and been wrong about it before, but it is hard to see the tooling growth curve continue at its current slope into the future. I think we will see many individual tools spring up on open source hosting sites like Google Code and Github, perhaps at relatively the same steady release rate. But older projects, I think, will increasingly be abandoned and will not remain available for as long a time. While a relatively few established open source standards, like Solr and Jena, will be the exception, I think we will see shorter shelf lives for most open source tools moving forward. This will lead to a younger tools base than was the case five or more years ago.

I also think we will continue to see the dominance of open source. Proprietary software has increasingly been challenged in the enterprise space. And, especially in semantic technologies, we tend to see many open source tools that are as capable as proprietary ones, and generally more dynamic as well. The emphasis on open data in this environment also tends to favor open source.

Yet, despite the professionalism, sophistication and complexity trends, I do not yet see massive consolidation in the semantic technology space. While we are seeing a rapid maturation of tooling, I don’t think we have yet seen a similar maturation in revenue and business models. While notable semantic technology start-ups like Powerset and Siri have been acquired and are clear successes, these wins still remain much in the minority.


[1] Please use the comments section of this post for suggesting new or overlooked tools. We will incrementally add them to the Sweet Tools listing. Also, please see the About tab of the Sweet Tools results listing for prior releases and statistics.

Posted by AI3's author, Mike Bergman Posted on December 12, 2011 at 8:29 am in Open Source, Semantic Web Tools, Structured Web | Comments (6)
The URI link reference to this post is: http://www.mkbergman.com/991/the-state-of-tooling-for-semantic-technologies/