Posted: August 12, 2012

Example Ontology Growth (opening diagram)

The Transition from Transactions to Connections

Virtually everywhere one looks we are in the midst of a transition for how we organize and manage information, indeed even relationships. Social networks and online communities are changing how we live and interact. NoSQL and graph databases — married to their near cousin Big Data — are changing how we organize and store information and data. Semantic technologies, backed by their ontologies and RDF data model, are showing the way for how we can connect and interoperate disparate information in ways only dreamed about a decade ago. And all of this, of course, is being built upon the infrastructure of the Internet and the Web, a global, distributed network of devices and information that is undoubtedly one of the most important technological developments in human history.

There is a shared structure across all of these developments — the graph. Graphs are proving to be the new universal paradigm for how we organize and manage information. Graphs have an inherently expandable nature, and one which can also capture any existing structure. So, as we see all of the networks, connections, relationships and links — both physical and informational — grow around us, it is useful to step back a bit and contemplate the universal graph structure at the core of these developments.

Understanding that we now live in the Age of the Graph means we can begin studying and using the concept of the graph itself to better analyze and manage our interconnected world. Whether we are trying to understand the physical networks of supply chains and infrastructure or the information relationships within ontologies or knowledge graphs, the various concepts underlying graphs and graph theory, themselves expressed through a rich vocabulary of terms, provide the keys for unlocking still further treasures hidden in the structure of graphs.

Graphs as a Concept

The use of “graph” as a mathematical concept is not much more than 100 years old. The beginning explication of the various classes of problems that can be addressed by graph theory probably is no older than 300 years. The use of graphs for expressing logic structures probably is not much older than 100 years, with the intellectual roots beginning with Charles Sanders Peirce [1]. Though trade routes, with their affiliated roads and primitive transportation or nomadic infrastructures, were perhaps the first expressions of physical networks, the emergence and prevalence of networks is a fairly recent phenomenon. The Internet and the Web are surely the catalyzing development that has brought graphs and networks to the forefront.

In mathematics, a graph is an abstract representation of a set of objects in which pairs of the objects are connected. The objects are most often known as nodes or vertices; the connections between the objects are called edges. Typically, a graph is depicted in diagrammatic form as a set of dots or bubbles for the nodes, joined by lines or curves for the edges. If the relationship between connected nodes has a direction, the edge is directed, and the graph is known as a directed graph. Various structures or topologies can be expressed through this conceptual graph framework. Graphs are one of the principal focuses of study in discrete mathematics [2]. The word “graph” was first used in this sense, as a mathematical structure, by J.J. Sylvester in 1878 [3].
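A graph of this kind can be sketched in a few lines of code. The following Python fragment (node names are invented for illustration) stores a directed graph as an adjacency list and enumerates its edges:

```python
# A directed graph as an adjacency list: each node maps to the set of
# nodes its outgoing edges reach. Node names are invented for illustration.
graph = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": set(),
}

def edges(g):
    """Enumerate the (node, neighbor) pairs, i.e., the directed edges."""
    return [(node, nbr) for node, nbrs in g.items() for nbr in sorted(nbrs)]

print(edges(graph))  # three directed edges in all
```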

As representations of various data models, particularly in our company’s own interests the Resource Description Framework (RDF) model, the nodes can represent “nouns” (subjects or objects, depending on the direction of the links) or attributes. The edges or connections represent “verbs”: relationships, properties or predicates. Thus, the simple “triple” of the basic statement in RDF (consisting of subject-predicate-object) is one of the constituent barbells that make up what becomes the eventual graph structure.
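That accumulation of triples into a graph can be sketched as follows; the subjects, predicates and objects are invented for illustration, and real RDF tooling would use full URIs rather than these shorthand strings:

```python
# RDF-style triples (subject, predicate, object); the values are invented
# shorthand for this sketch. Each triple is one "barbell": two nodes joined
# by a labeled, directed edge, and together the triples form a graph.
triples = [
    ("ex:Springsteen", "ex:recorded",   "ex:Nebraska"),
    ("ex:Nebraska",    "ex:releasedIn", "1982"),
    ("ex:Springsteen", "ex:bornIn",     "ex:NewJersey"),
]

nodes = {s for s, p, o in triples} | {o for s, p, o in triples}
edges = {(s, o): p for s, p, o in triples}   # edge label = the predicate

print(len(nodes), "nodes,", len(edges), "edges")  # 4 nodes, 3 edges
```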

The manipulation and analysis of graph structures comes under the rubric of graph theory. The first recognized paper in that field is the Seven Bridges of Königsberg, written by Leonhard Euler in 1736. The objective of the paper was to find a walking path through the city that would cross each bridge once and only once. Euler proved that the problem has no solution:

Seven Bridges of Königsberg, and its graph representation; from Wikipedia

Euler’s approach represented the path problem as a graph, by treating the land masses as nodes and the bridges as edges. Euler’s proof postulated that if every bridge has been traversed exactly once, it follows that, for each land mass (except for the ones chosen for the start and finish), the number of bridges touching that land mass must be even (the number of connections to a node we now call “degree”). Since that is not true for this instance, there is no solution. Other researchers, including Leibniz, Cauchy and L’Huillier applied this approach to similar problems, leading to the origin of the field of topology.
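Euler's even-degree criterion is easy to check mechanically. This sketch encodes the four land masses (labeled A through D here) and the seven bridges, counts node degrees, and tests whether a walk crossing every bridge exactly once is possible (it requires zero or two odd-degree nodes):

```python
# Euler's criterion, sketched: a walk crossing every edge exactly once
# exists only if zero or two nodes have odd degree. The labels A-D for
# the Königsberg land masses are invented for this illustration.
from collections import Counter

bridges = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]

degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

odd = [n for n, d in degree.items() if d % 2 == 1]
print(sorted(degree.items()))  # A has degree 5; B, C, D have degree 3
print("Euler walk possible:", len(odd) in (0, 2))  # False: all four are odd
```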

Later, Cayley broadened the approach to study tree structures, which have many implications in theoretical chemistry. By the 20th century, the fusion of ideas coming from mathematics with those coming from chemistry formed the origin of much of the standard terminology of graph theory.

The Theory of Graphs

Graph theory forms the core of network science, the applied study of graph structures and networks. Besides graph theory, the field draws on methods including statistical mechanics from physics, data mining and information visualization from computer science, inferential modeling from statistics, and social structure from sociology. Classical problems embraced by this realm include the four color problem of maps, the traveling salesman problem, and the six degrees of Kevin Bacon.
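The “six degrees” question, for instance, is just a shortest-path problem. A minimal breadth-first search over a small, invented collaboration graph illustrates the idea:

```python
# "Six degrees" as shortest paths: breadth-first search over an undirected
# collaboration graph. The graph and its member names are invented.
from collections import deque

costars = {
    "Bacon":  ["Slater", "Oldman"],
    "Slater": ["Bacon", "Ryder"],
    "Oldman": ["Bacon", "Ryder"],
    "Ryder":  ["Slater", "Oldman", "Reeves"],
    "Reeves": ["Ryder"],
}

def degrees(graph, start, goal):
    """Fewest hops from start to goal, or None if unreachable."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None

print(degrees(costars, "Bacon", "Reeves"))  # 3
```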

Graph theory and network science are disciplines suited to a variety of information structures and many additional classes of problems. This table lists many of these applicable areas, most with links to still further information from Wikipedia:

Graph Structures:
- Data structures
- Tree structures
- List structures
- Matrix structures
- Path structures
- Networks
- Logic structures
- Random graphs
- Weighted graphs
- Sparse/dense graphs

Graph Problems:
- Enumeration
- Subgraphs, induced subgraphs, and minors
- Search and navigation
- Graph coloring
- Subsumption and unification
- Route (path) problems
- Matrix manipulations (many)
- Network flow
- Visibility graph problems
- Covering problems
- Graph structure
- Graph classes

Graphs are among the most ubiquitous models of both natural and human-made structures. They can be used to model many types of relations and process dynamics in physical, biological and social systems. Many problems of practical interest can be represented by graphs. This breadth of applicability makes network science and graph theory two of the most critical analytical areas for study and breakthroughs for the foreseeable future. I touch on this more in the concluding section.

Graphs as Physical Networks

Surely the first examples of graph structures were early trade and nomadic routes. Here, for example, are the trade routes of the Radhanites dating from about 870 AD [4]:

Trade network of the Radhanites, c. 870 CE; from Wikipedia

It is not surprising that routes such as these, or other physical networks as exemplified by the bridges of Königsberg, were the stimulus for early mathematics and analysis related to efficient use of networks. Minimizing the time to complete a trade circuit or visiting multiple markets efficiently has clear benefits. These economic rationales apply to a wide variety of modern, physical networks, including:

Of course, included among these is the Internet itself. It is the largest graph in existence, with an estimated 2.2 billion users and their devices all connected in one way or another in all parts of the globe [5].

Graphs as Natural Systems

Graphs and graph theory also have broad applicability to natural systems. For example, graph theory is used extensively to study molecular structures in chemistry and physics. A graph makes a natural model for a molecule, where vertices represent atoms and edges bonds. Similarly, in biology or ecology, graphs can readily express such systems as species networks, ecological relationships, migration paths, or the spread of diseases. Graphs are also proper structures for modeling biological and chemical pathways.

Some of the exemplar natural systems that lend themselves to graph structures include:

As with physical networks, a graph representation for natural systems provides real benefits in computer processing and analysis. Once expressed as a graph, all graph algorithms and perspectives from graph theory and network science can be brought to bear. Statistical methods are particularly applicable to representing connections between interacting parts of a system, as well as to representing the physical dynamics of natural systems.

Graphs as Social Networks

Parallel with the growth of the Internet and Web has been the growth of social networks. Social network analysis (SNA) has arguably been the single most important driver for advances in graph theory and analysis algorithms in recent years. New and interesting problems and challenges — from influence to communities to conflicts — are now being elucidated through techniques pioneered for SNA.

Second only in size to the Internet has been the graph of interactions arising from Facebook. Facebook had about 900 million users as of May 2012, half of whom accessed the service via mobile devices [6]. Facebook famously embraced the graph with its own Open Graph protocol, which makes it easy for users to access and tie into Facebook’s social network. A representation of the Facebook social graph as of December 2010 is shown in a well-known figure.

The suitability of the graph structure to capture relationships has been a real boon to better understanding of social and community dynamics. Many new concepts have been introduced as the result of SNA, including such things as influence, diversity, centrality, cliques and so forth. (The opening diagram to this article, for example, models centrality, with blue the maximum and red the minimum.)
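Degree centrality, the simplest of these measures, can be sketched in a few lines; the toy network and its member names are invented for illustration:

```python
# Degree centrality: a node's number of connections divided by (n - 1),
# so a node linked to everyone scores 1.0. The network is invented.
network = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob"},
    "dan": {"ann"},
}

def degree_centrality(g):
    """Map each node to its normalized degree."""
    n = len(g)
    return {node: len(nbrs) / (n - 1) for node, nbrs in g.items()}

print(degree_centrality(network))  # "ann" is the most central, at 1.0
```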

Particular areas of social interaction that lend themselves to SNA include:

Entirely new insights have arisen from SNA including finding terrorist leaders, analyzing prestige, or identifying keystone vendors or suppliers in business ecosystems.

Graphs as Information Representations

Given the ubiquity of graphs as representations of real systems and networks, it is certainly not surprising to see their use in computer science as a means for information representation. We already saw in the table above the many data structures that can be represented as graphs, but the paradigm has even broader applicability.

The critical breakthroughs have come through using the graph as a basis for data models and logic models. These, in turn, provide the basis for crafting entire graph-based vocabularies and languages. Once such structures are embraced, it is a natural extension to also extend the mindset to graph databases as well.

Some of the notable information representations that have a graph as their basis include:

Graphs as Knowledge Representations

A key point of graphs noted earlier was their inherent extensibility. Once graphs are understood as a great basis for representing both logic and data structures, it is a logical next step to see their applicability extend to knowledge representations and knowledge bases as well.

Graph-theoretic methods have proven particularly useful in linguistics, since natural language often lends itself well to discrete structure. So, not only can graphs represent syntactic and compositional structure, but they can also capture the interrelationships of terms and concepts within those languages. The usefulness of graph theory to linguistics is shown by the various knowledge bases such as WordNet (in various languages) and VerbNet.

Domain ontologies are similar structures, capturing the relationships amongst concepts within a given knowledge domain. These are also known as knowledge graphs, and Google has famously just released its graph of entities to the world [7]. Semantic networks and neural networks are similar knowledge representations.

The following interactive diagram, of the UMBEL knowledge graph of about 25,000 reference concepts for helping to orient disparate datasets [8], shows that some of these graph structures can get quite large:

Notes: at standard resolution, if this graph were rendered at actual size, it would be larger than 34 feet by 34 feet at full zoom! That is about 1,200 square feet, or half the size of a typical American house!

What all of these examples show is the nearly universal applicability of graphs, from the abstract to the physical, from the small to the large, and every gradation between. We also see how basic graph structures and concepts can be built upon with more structure. This breadth points to the many synergies and innovations that may be transferred from diverse fields to advance the usefulness of graph theories.

Graphs as a Guiding Paradigm

Despite the many advances that have occurred in graph theory and the increased attention from social network analysis, many, many graph problems remain some of the hardest in computation. Optimizations, partitioning, mapping, inferencing, traversing and graph structure comparisons remain challenging. And, some of these challenges are only growing due to the growth in the size of networks and graphs.

Applying the lessons of the Internet in such areas as non-relational databases, distributed processing, and big data and MapReduce-oriented approaches will help some in this regard. We’re learning how to divide and conquer big problems, and we are discovering data and processing architectures more amenable to graph-based problems.

The fact we have now entered the Age of the Graph also bodes that further scrutiny and attention will lead to more analytic breakthroughs and innovation. We may be in an era of Big Data, but the structure underlying all of that is the graph. And that reality, I predict, will result in accelerated advances in graph theory.


[1] For a fairly broad discussion of Peirce in relation to these topics, see M.K. Bergman, 2012. “Give Me a Sign: What Do Things Mean on the Semantic Web?,” in AI3:::Adaptive Innovation blog, January 24, 2012. See http://www.mkbergman.com/994/give-me-a-sign-what-do-things-mean-on-the-semantic-web/.
[2] Topics in discrete mathematics, which are all applicable to graphing techniques and theory, include theoretical computer science, information theory, logic, set theory, combinatorics, probability, number theory, algebra, geometry, topology, discrete calculus or discrete analysis, operations research, game theory, decision theory, utility theory, social choice theory, and all discrete analogues of continuous mathematics.
[3] See reference 1 in the Wikipedia entry on graph theory.
[4] According to Wikipedia, the Radhanites were medieval Jewish merchants involved in trade between the Christian and Islamic worlds during the early Middle Ages (approx. 500–1000 AD). Many trade routes previously established under the Roman Empire continued to function during that period largely through their efforts. Their trade network covered much of Europe, North Africa, the Middle East, Central Asia and parts of India and China.
[5] See the article on the Internet in Wikipedia for various size estimates.
[6] See the article on Facebook in Wikipedia for various size estimates.
[7] For my discussion of the Google Knowledge Graph, see M.K. Bergman, 2012. “Deconstructing the Google Knowledge Graph,” in AI3:::Adaptive Innovation blog, May 18, 2012. See http://www.mkbergman.com/1009/deconstructing-the-google-knowledge-graph/.
[8] UMBEL (the Upper Mapping and Binding Exchange Layer) is designed to help content interoperate on the Web. It provides two functions: a) it is a broad, general reference structure of 25,000 concepts, which provides a scaffolding to link and interoperate other datasets and domain vocabularies, and b) it is a base vocabulary for the construction of other concept-based domain ontologies, also designed for interoperation.
Posted: July 9, 2012
Abrogans; earliest glossary (from Wikipedia)

There are many semantic technology terms relevant to the context of a semantic technology installation [1]. Some of these are general terms related to language standards, as well as to ontologies or the dataset concept.

ABox
An ABox (for assertions, the basis for the A in ABox) is an “assertion component”; that is, a fact associated with a terminological vocabulary within a knowledge base. ABox statements are TBox-compliant statements about instances belonging to the concepts of an ontology.
Adaptive ontology
An adaptive ontology is a conventional knowledge representational ontology that has added to it a number of specific best practices, including modeling the ABox and TBox constructs separately; information that relates specific types to different and appropriate display templates or visualization components; use of preferred labels for user interfaces, as well as alternative labels and hidden labels; defined concepts; and a design that adheres to the open world assumption.
Administrative ontology
Administrative ontologies govern internal application use and user interface interactions.
Annotation
An annotation, specifically as an annotation property, is a way to provide metadata or to describe vocabularies and properties used within an ontology. Annotations do not participate in reasoning or coherency testing for ontologies.
Atom
The name Atom applies to a pair of related standards. The Atom Syndication Format is an XML language used for web feeds, while the Atom Publishing Protocol (APP for short) is a simple HTTP-based protocol for creating and updating Web resources.
Attributes
These are the aspects, properties, features, characteristics, or parameters that objects (and classes) may have. They are the descriptive characteristics of a thing. Key-value pairs match an attribute with a value; the value may be a reference to another object, an actual value or a descriptive label or string. In an RDF statement, an attribute is expressed as a property (or predicate or relation). In intensional logic, all attributes or characteristics of similarly classifiable items define the membership in that set.
Axiom
An axiom is a premise or starting point of reasoning. In an ontology, each statement (assertion) is an axiom.
Binding
Binding is the creation of a simple reference to something that is larger and more complicated and used frequently. The simple reference can be used instead of having to repeat the larger thing.
Class
A class is a collection of sets or instances (or sometimes other mathematical objects) which can be unambiguously defined by a property that all of its members share. In ontologies, classes may also be known as sets, collections, concepts, types of objects, or kinds of things.
Closed World Assumption
CWA is the presumption that what is not currently known to be true, is false. CWA also has a logical formalization. CWA is the most common logic applied to relational database systems, and is particularly useful for transaction-type systems. In knowledge management, the closed world assumption is used in at least two situations: 1) when the knowledge base is known to be complete (e.g., a corporate database containing records for every employee), and 2) when the knowledge base is known to be incomplete but a “best” definite answer must be derived from incomplete information. See contrast to the open world assumption.
Data Space
A data space may be personal, collective or topical, and is a virtual “container” for related information irrespective of storage location, schema or structure.
Dataset
An aggregation of similar kinds of things or items, mostly composed of instance records.
DBpedia
A project that extracts structured content from Wikipedia, and then makes that data available as linked data. There are millions of entities characterized by DBpedia in this way. As such, DBpedia is one of the largest — and most central — hubs for linked data on the Web.
DOAP
DOAP (Description Of A Project) is an RDF schema and XML vocabulary to describe open-source projects.
Description logics
Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.
Domain ontology
Domain (or content) ontologies embody more of the traditional ontology functions such as information interoperability, inferencing, reasoning and conceptual and knowledge capture of the applicable domain.
Entity
An individual object or member of a class; when affixed with a proper name or label, it is also known as a named entity (thus, named entities are a subset of all entities).
Entity–attribute–value model
EAV is a data model to describe entities where the number of attributes (properties, parameters) that can be used to describe them is potentially vast, but the number that will actually apply to a given entity is relatively modest. In the EAV data model, each attribute-value pair is a fact describing an entity. EAV systems trade off simplicity in the physical and logical structure of the data for complexity in their metadata, which, among other things, plays the role that database constraints and referential integrity do in standard database designs.
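A minimal sketch of the EAV pattern, with invented entity and attribute names: each row is a single fact, so different entities can carry entirely different attribute sets:

```python
# Entity-attribute-value: each row is one fact about one entity, so the
# set of attributes is open-ended and sparse. Names invented for the sketch.
eav = [
    ("patient:1", "name",       "Alice"),
    ("patient:1", "blood_type", "O-"),
    ("patient:2", "name",       "Bob"),
    ("patient:2", "allergy",    "penicillin"),  # an attribute patient:1 lacks
]

def entity(rows, eid):
    """Reassemble one entity's attribute-value pairs from the fact table."""
    return {attr: val for e, attr, val in rows if e == eid}

print(entity(eav, "patient:2"))  # {'name': 'Bob', 'allergy': 'penicillin'}
```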
Extensional
The extension of a class, concept, idea, or sign consists of the things to which it applies, in contrast with its intension. For example, the extension of the word “dog” is the set of all (past, present and future) dogs in the world. The extension is most akin to the attributes or characteristics of the instances in a set defining its class membership.
FOAF
FOAF (Friend of a Friend) is an RDF schema for machine-readable modeling of homepage-like profiles and social networks.
Folksonomy
A folksonomy is a user-generated set of open-ended labels called tags organized in some manner and used to categorize and retrieve Web content such as Web pages, photographs, and Web links.
GeoNames
GeoNames integrates geographical data such as names of places in various languages, elevation, population and others from various sources.
GRDDL
GRDDL is a markup format for Gleaning Resource Descriptions from Dialects of Languages; that is, for getting RDF data out of XML and XHTML documents using explicitly associated transformation algorithms, typically represented in XSLT.
High-level Subject
A high-level subject is both a subject proxy and category label used in a hierarchical subject classification scheme (taxonomy). Higher-level subjects are classes for more atomic subjects, with the height of the level representing broader or more aggregate classes.
Individual
See Instance.
Inferencing
Inference is the act or process of deriving logical conclusions from premises known or assumed to be true. The logic within and between statements in an ontology is the basis for inferring new conclusions from it, using software applications known as inference engines or reasoners.
Instance
Instances are the basic, “ground level” components of an ontology. An instance is an individual member of a class, also used synonymously with entity. The instances in an ontology may include concrete objects such as people, animals, tables, automobiles, molecules, and planets, as well as abstract instances such as numbers and words. An instance is also known as an individual, with member and entity also used somewhat interchangeably.
Instance record
An instance with one or more attributes also provided.
irON
irON (instance record and Object Notation) is an abstract notation and associated vocabulary for specifying RDF (Resource Description Framework) triples and schema in non-RDF forms. Its purpose is to allow users and tools in non-RDF formats to stage interoperable datasets using RDF.
Intensional
The intension of a class is what is intended as a definition of what characteristics its members should have; it is akin to a definition of a concept and what is intended for a class to contain. It is therefore like the schema aspects (or TBox) in an ontology.
Key-value pair
Also known as a name–value pair or attribute–value pair, a key-value pair is a fundamental, open-ended data representation. All or part of the data model may be expressed as a collection of tuples <attribute name, value> where each element is a key-value pair. The key is the defined attribute and the value may be a reference to another object or a literal string or value. In RDF triple terms, the subject is implied in a key-value pair by nature of the instance record at hand.
Kind
Used synonymously herein with class.
Knowledge base
A knowledge base (abbreviated KB or kb) is a special kind of database for knowledge management. A knowledge base provides a means for information to be collected, organized, shared, searched and utilized. Formally, the combination of a TBox and ABox is a knowledge base.
Linkage
A specification that relates an object or attribute name to its full URI (as required in the RDF language).
Linked data
Linked data is a set of best practices for publishing and deploying instance and class data using the RDF data model, and uses uniform resource identifiers (URIs) to name the data objects. The approach exposes the data for access via the HTTP protocol, while emphasizing data interconnections, interrelationships and context useful to both humans and machine agents.
Mapping
A considered correlation of objects in two different sources to one another, with the relation between the objects defined via a specific property. Linkage is a subset of possible mappings.
Member
Used synonymously herein with instance.
Metadata
Metadata (metacontent) is supplementary data that provides information about one or more aspects of the content at hand such as means of creation, purpose, when created or modified, author or provenance, where located, topic or subject matter, standards used, or other annotation characteristics. It is “data about data”, or the means by which data objects or aggregations can be described. Contrasted to an attribute, which is an individual characteristic intrinsic to a data object or instance, metadata is a description about that data, such as how or when created or by whom.
Metamodeling
Metamodeling is the analysis, construction and development of the frames, rules, constraints, models and theories applicable and useful for modeling a predefined class of problems.
Microdata
Microdata is a proposed specification used to nest semantics within existing content on web pages. Microdata is an attempt to provide a simpler way of annotating HTML elements with machine-readable tags than the similar approaches of using RDFa or microformats.
Microformats
A microformat (sometimes abbreviated μF or uF) is a piece of markup that allows expression of semantics in an HTML (or XHTML) web page. Programs can extract meaning from a web page that is marked up with one or more microformats.
Natural language processing
NLP is the process of a computer extracting meaningful information from natural language input and/or producing natural language output. NLP is one method for assigning structured data characterizations to text content for use in semantic technologies. (Hand assignment is another method.) Some of the specific NLP techniques and applications relevant to semantic technologies include automatic summarization, coreference resolution, machine translation, named entity recognition (NER), question answering, relationship extraction, topic segmentation and recognition, word segmentation, and word sense disambiguation, among others.
OBIE
Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. Ontology-based information extraction (OBIE) is the use of an ontology to inform a “tagger” or information extraction program when doing natural language processing. Input ontologies thus become the basis for generating metadata tags when tagging text or documents.
Ontology
An ontology is a data model that represents a set of concepts within a domain and the relationships between those concepts. Loosely defined, ontologies on the Web can have a broad range of formalism, or expressiveness or reasoning power.
Ontology-driven application
Ontology-driven applications (or ODapps) are modular, generic software applications designed to operate in accordance with the specifications contained in one or more ontologies. The relationships and structure of the information driving these applications are based on the standard functions and roles of ontologies (namely as domain ontologies), as supplemented by UI and instruction sets and validations and rules.
Open Semantic Framework
The open semantic framework, or OSF, is a combination of a layered architecture and an open-source, modular software stack. The stack combines many leading third-party software packages with open source semantic technology developments from Structured Dynamics.
Open World Assumption
OWA is a formal logic assumption that the truth-value of a statement is independent of whether or not it is known by any single observer or agent to be true. OWA is used in knowledge representation to codify the informal notion that in general no single agent or observer has complete knowledge, and therefore cannot make the closed world assumption. The OWA limits the kinds of inference and deductions an agent can make to those that follow from statements that are known to the agent to be true. OWA is useful when we represent knowledge within a system as we discover it, and where we cannot guarantee that we have discovered or will discover complete information. In the OWA, statements about knowledge that are not included in or inferred from the knowledge explicitly recorded in the system may be considered unknown, rather than wrong or false. Semantic Web languages such as OWL make the open world assumption. See contrast to the closed world assumption.
OPML
OPML (Outline Processor Markup Language) is an XML format for outlines, and is commonly used to exchange lists of web feeds between web feed aggregators.
OWL
The Web Ontology Language (OWL) is designed for defining and instantiating formal Web ontologies. An OWL ontology may include descriptions of classes, along with their related properties and instances. There are also a variety of OWL dialects.
Predicate
See Property.
Property
Properties are the ways in which classes and instances can be related to one another. Properties are thus a relationship, and are also known as predicates. Properties are used to define an attribute relation for an instance.
Punning
In computer science, punning refers to a programming technique that subverts or circumvents the type system of a programming language, by allowing a value of a certain type to be manipulated as a value of a different type. When used for ontologies, it means to treat a thing as both a class and an instance, with the use depending on context.
RDF
Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata model but which has come to be used as a general method of modeling information, through a variety of syntax formats. The RDF metadata model is based upon the idea of making statements about resources in the form of subject-predicate-object expressions, called triples in RDF terminology. The subject denotes the resource, and the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object.
RDFa
RDFa 1.0 is a set of extensions to XHTML that is a W3C Recommendation. RDFa uses attributes from meta and link elements, and generalizes them so that they are usable on all elements, allowing annotation markup with semantics. A W3C working draft is presently underway that expands RDFa into version 1.1, with HTML5 and SVG support, among other changes.
RDF Schema
RDFS or RDF Schema is an extensible knowledge representation language, providing basic elements for the description of ontologies, otherwise called RDF vocabularies, intended to structure RDF resources.
Reasoner
A semantic reasoner, reasoning engine, rules engine, or simply a reasoner, is a piece of software able to infer logical consequences from a set of asserted facts or axioms. The notion of a semantic reasoner generalizes that of an inference engine, by providing a richer set of mechanisms.
Reasoning
Reasoning is the performance of logical tests using inference rules, as commonly specified by means of an ontology language, and often a description logic. Many reasoners use first-order predicate logic to perform reasoning; inference commonly proceeds by forward chaining or backward chaining.
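Forward chaining can be sketched as repeatedly firing rules until no new facts emerge; this is a toy illustration, not any particular reasoner's API:

```python
# Toy forward-chaining sketch over triple-shaped facts.
facts = {("Socrates", "isA", "Human")}

def apply_rules(facts):
    """One rule: anything that isA Human also isA Mortal."""
    derived = set()
    for s, p, o in facts:
        if p == "isA" and o == "Human":
            derived.add((s, "isA", "Mortal"))
    return derived

# Forward chaining: apply rules until a fixed point (no new facts) is reached.
while True:
    new = apply_rules(facts) - facts
    if not new:
        break
    facts |= new

print(("Socrates", "isA", "Mortal") in facts)
```

Backward chaining works in the opposite direction, starting from a goal statement and searching for rules and facts that would establish it.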
Record
As used herein, a shorthand reference to an instance record.
Relation
Used synonymously herein with attribute.
RSS
RSS (an acronym for Really Simple Syndication) is a family of web feed formats used to publish frequently updated digital content, such as blogs, news feeds or podcasts.
schema.org
Schema.org is an initiative launched by the major search engines Bing, Google and Yahoo!, and later joined by Yandex, in order to create and support a common set of schemas for structured data markup on web pages. Schema.org provides a starter set of schemas and extension mechanisms for adding to them. Schema.org supports markup in microdata, microformats and RDFa formats.
Semantic enterprise
An organization that uses semantic technologies and the languages and standards of the semantic Web, including RDF, RDFS, OWL, SPARQL and others to integrate existing information assets, using the best practices of linked data and the open world assumption, and targeting knowledge management applications.
Semantic technology
Semantic technologies are a combination of software and semantic specifications that encodes meanings separately from data and content files and separately from application code. This approach enables machines as well as people to understand, share and reason with data and specifications separately. With semantic technologies, adding, changing and implementing new relationships or interconnecting programs in a different way can be as simple as changing the external model that these programs share. New data can also be brought into the system and visualized or worked upon based on the existing schema. Semantic technologies provide an abstraction layer above existing IT technologies that enables bridging and interconnection of data, content, and processes.
Semantic Web
The Semantic Web is a collaborative movement led by the World Wide Web Consortium (W3C) that promotes common formats for data on the World Wide Web. By encouraging the inclusion of semantic content in web pages, the Semantic Web aims at converting the current web of unstructured documents into a “web of data”. It builds on the W3C’s Resource Description Framework (RDF).
Semset
A semset is the use of a series of alternate labels and terms to describe a concept or entity. These alternatives include true synonyms, but may also be more expansive and include jargon, slang, acronyms or alternative terms that usage suggests refers to the same concept.
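A semset amounts to a mapping from many labels to one concept; here is a minimal sketch with an invented example entry:

```python
# Hypothetical semset: alternate labels (synonyms, jargon, acronyms, slang)
# mapped to a single canonical concept to aid matching and disambiguation.
semset = {
    "canonical": "United States",
    "alternates": ["USA", "U.S.", "America", "the States"],
}

def matches_concept(label, semset):
    """True if a label plausibly refers to the semset's concept (case-insensitive)."""
    candidates = [semset["canonical"]] + semset["alternates"]
    return label.lower() in {c.lower() for c in candidates}

print(matches_concept("usa", semset))
print(matches_concept("Canada", semset))
```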
SIOC
Semantically-Interlinked Online Communities Project (SIOC) is based on RDF and is an ontology defined using RDFS for interconnecting discussion methods such as blogs, forums and mailing lists to each other.
SKOS
SKOS or Simple Knowledge Organisation System is a family of formal languages designed for representation of thesauri, classification schemes, taxonomies, subject-heading systems, or any other type of structured controlled vocabulary; it is built upon RDF and RDFS.
SKSI
Semantic Knowledge Source Integration provides a declarative mapping language and API between external sources of structured knowledge and the Cyc knowledge base.
SPARQL
SPARQL (pronounced “sparkle”) is an RDF query language; its name is a recursive acronym that stands for SPARQL Protocol and RDF Query Language.
Statement
A statement is a “triple” in an ontology, which consists of a subject – predicate – object (S-P-O) assertion. By definition, each statement is a “fact” or axiom within an ontology.
Subject
A subject is always a noun or compound noun and is a reference or definition to a particular object, thing or topic, or groups of such items. Subjects are also often referred to as concepts or topics.
Subject extraction
Subject extraction is an automatic process for retrieving and selecting subject names from existing knowledge bases or data sets. Extraction methods involve parsing and tokenization, and then generally the application of one or more information extraction techniques or algorithms.
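The pipeline described (parsing and tokenization, then an extraction technique) can be sketched crudely as frequency-ranked candidate terms after stopword filtering; real extractors use far richer linguistic and statistical methods:

```python
# Bare-bones subject extraction sketch: tokenize, drop stopwords,
# rank remaining candidate terms by frequency.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for", "also", "between"}

def extract_subjects(text, top_n=3):
    tokens = re.findall(r"[a-z]+", text.lower())            # parsing / tokenization
    candidates = [t for t in tokens if t not in STOPWORDS]  # filter function words
    return [term for term, _ in Counter(candidates).most_common(top_n)]

text = "The ontology maps concepts; the ontology also maps relations between concepts."
print(extract_subjects(text))
```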
Subject proxy
A subject proxy is a canonical name or label for a particular object; other terms or controlled vocabularies may be mapped to this label to assist disambiguation. A subject proxy is always representative of its object but is not the object itself.
Tag
A tag is a keyword or term associated with or assigned to a piece of information (e.g., a picture, article, or video clip), thus describing the item and enabling keyword-based classification of information. Tags are usually chosen informally by either the creator or consumer of the item.
TBox
A TBox (for terminological knowledge, the basis for T in TBox) is a “terminological component”; that is, a conceptualization associated with a set of facts. TBox statements describe a conceptualization, a set of concepts and properties for these concepts. The TBox is sufficient to describe an ontology (best practice often suggests keeping a split between instance records — and ABox — and the TBox schema).
Taxonomy
In the context of knowledge systems, taxonomy is the hierarchical classification of entities of interest of an enterprise, organization or administration, used to classify documents, digital assets and other information. Taxonomies can cover virtually any type of physical or conceptual entities (products, processes, knowledge fields, human groups, etc.) at any level of granularity.
Topic
The topic (or theme) is the part of the proposition that is being talked about (predicated). In topic maps, the topic may represent any concept, from people, countries, and organizations to software modules, individual files, and events. Topics and subjects are closely related.
Topic Map
Topic maps are an ISO standard for the representation and interchange of knowledge. A topic map represents information using topics, associations (similar to a predicate relationship), and occurrences (which represent relationships between topics and information resources relevant to them), quite similar in concept to the RDF triple.
Triple
A basic statement in the RDF language, which consists of a subject – property – object construct, with the subject and property (and optionally the object) referenced by URIs.
Type
Used synonymously herein with class.
UMBEL
UMBEL, short for Upper Mapping and Binding Exchange Layer, is an upper ontology of about 28,000 reference concepts, designed to provide common mapping points for relating different ontologies or schema to one another, and a vocabulary for aiding that ontology mapping, including expressions of likelihood relationships distinct from exact identity or equivalence. This vocabulary is also designed for interoperable domain ontologies.
Upper ontology
An upper ontology (also known as a top-level ontology or foundation ontology) is an ontology that describes very general concepts that are the same across all knowledge domains. An important function of an upper ontology is to support broad semantic interoperability among the large number of ontologies that reside “under” it.
Vocabulary
A vocabulary, in the sense of knowledge systems or ontologies, is a controlled vocabulary. Controlled vocabularies provide a way to organize knowledge for subsequent retrieval, and are used in subject indexing schemes, subject headings, thesauri, taxonomies and other forms of knowledge organization systems.
WordNet
WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. The database and software tools can be downloaded and used freely. Multiple language versions exist, and WordNet is a frequent reference structure for semantic applications.
YAGO
YAGO (“Yet Another Great Ontology”) is a knowledge base that places a WordNet structure on top of Wikipedia.

[1] This glossary is based on the one provided on the OSF TechWiki. For the latest version, please refer to this link.
Posted:July 2, 2012

Conventional IT Systems are Poorly Suited to Knowledge Applications

Frequently customers ask me why semantic technologies should be used instead of conventional information technologies. In the areas of knowledge representation (KR) and knowledge management (KM), there are compelling reasons and benefits for selecting semantic technologies over conventional approaches. This article attempts to summarize these rationales from a layperson perspective.

It is important to recognize that semantic technologies are orthogonal to the buzz around some other current technologies, including cloud computing and big data. Semantic technologies are also not limited to open data: they are equivalently useful to private or proprietary data. It is also important to note that semantic technologies do not imply some grand, shared schema for organizing all information. Semantic technologies are not “one ring to rule them all,” but rather a way to capture the world views of particular domains and groups of stakeholders. Lastly, semantic technologies done properly are not a replacement for existing information technologies, but rather an added layer that can leverage those assets for interoperability and to overcome the semantic barriers between existing information silos.

Nature of the World

The world is a messy place. Not only is it complicated and richly diverse, but our ways of describing and understanding it are made more complex by differences in language and culture.

We also know the world to be interconnected and interdependent. Effects of one change can propagate into subtle and unforeseen effects. And, not only is the world constantly changing, but so is our understanding of what exists in the world and how it affects and is affected by everything else.

This means we are always uncertain to a degree about how the world works and the dynamics of its working. Through education and research we continually strive to learn more about the world, but often in that process find what we thought was true is no longer so and even our own human existence is modifying our world in manifest ways.

Knowledge is very similar to this nature of the world. We find that knowledge is never complete and it can be found anywhere and everywhere. We capture and codify knowledge in structured, semi-structured and unstructured forms, ranging from “soft” to “hard” information. We find that the structure of knowledge evolves with the incorporation of more information.

We often see that knowledge is not absolute, but contextual. That does not mean that there is no such thing as truth, but that knowledge should be coherent, to reflect a logical consistency and structure that comports with our observations about the physical world. Knowledge, like the world, is constantly changing; we thus must constantly adapt to what we observe and learn.

Knowledge Representation, Not Transactions

These observations about the world and knowledge are not platitudes but important guideposts for how we should organize and manage information, the field known as “information technology.” For IT to truly serve the knowledge function, its logical bases should be consistent with the inherent nature of the world and knowledge.

By knowledge functions we mean those areas of various computer applications that come under the rubrics of search, business intelligence, competitive intelligence, planning, forecasting, data federation, data warehousing, knowledge management, enterprise information integration, master data management, knowledge representation, and so forth. These applications are distinctly different from the earliest and traditional concerns of IT systems: accounting and transactions.

A transaction system — such as calculating revenue based on seats on a plane, the plane’s occupancy, and various rate classes — is a closed system. We can count the seats, we know the number of customers on board, and we know their rate classes and payments. Much can be done with this information, including yield and profitability analysis and other conventional ways of accounting for costs or revenues or optimizations.

But, as noted, neither the world nor knowledge is a closed system. Trying to apply legacy IT approaches to knowledge problems is fraught with difficulties. That is the reason that for more than four decades enterprises have seen massive cost overruns and failed projects in applying conventional IT approaches to knowledge problems: traditional IT is fundamentally mismatched to the nature of the problems at hand.

What works efficiently for transactions and accounting is a miserable failure applied to knowledge problems. Traditional relational databases work best with structured data; are inflexible and fragile when the nature (schema) of the world changes; and thus require constant (and expensive) re-architecting in the face of new knowledge or new relationships.

Of course, often knowledge problems do consider fixed entities with fixed attributes to describe them. In these cases, relational data systems can continue to act as valuable contributors and data managers of entities and their attributes. But, in the role of organizing across schema or dealing with semantics and differences of definition and scope – that is, the common types of knowledge questions – a much different integration layer with a much different logic basis is demanded.

The New Open World Paradigm

The first change that is demanded is to shift the logic paradigm of how knowledge and the world are modeled. In contrast to the closed-world approach of transaction systems, IT systems based on the logical premise of the open world assumption (OWA) mean:

  • Lack of a given assertion does not imply whether it is true or false; it simply is not known
  • A lack of knowledge does not imply falsity
  • Everything is permitted until it is prohibited
  • Schema can be incremental without re-architecting prior schema (“extensible”), and
  • Information at various levels of incompleteness can be combined.
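The contrast between the closed-world and open-world readings of the same fact base can be made concrete in a few lines of Python (the flight facts are invented for illustration):

```python
# Closed world vs. open world over the same asserted facts.
# Only what is known is asserted; nothing is implied about unstated triples.
facts = {("flight42", "hasPassenger", "alice")}

def cwa_holds(triple, facts):
    # Closed world: absence from the fact base means False.
    return triple in facts

def owa_holds(triple, facts):
    # Open world: absence means Unknown, not False.
    return True if triple in facts else "unknown"

query = ("flight42", "hasPassenger", "bob")
print(cwa_holds(query, facts))   # False: CWA treats the missing fact as false
print(owa_holds(query, facts))   # 'unknown': OWA leaves the question open
```

The transaction system can safely say Bob is not on the plane; the knowledge system can only say it does not know.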

Much more can be said about OWA, including formal definitions of the logics underlying it [1], but even from the statements above, we can see that the right logic for most knowledge representation (KR) problems is the open world approach.

This logic mismatch is perhaps the most fundamental cause of failures, cost overruns, and disappointing deliverables for KM and KR projects over the years. But, like the fingertip between the eyes that cannot be seen because it is too close at hand, the importance of this logic mismatch strangely continues to be overlooked.

Integrating All Forms of Information

Data exists in many forms and of many natures. As one classification scheme, there are:

  • Structured data — information presented according to a defined data model, often found in relational databases or other forms of tabular data
  • Semi-structured data — does not conform to the formal structure of data models, but contains tags or other markers to denote fields within the content. Markup languages embedded in text are a common form of such sources
  • Unstructured data — information content, generally oriented to text, that lacks an explicit data model or schema; structured information can be obtained from it via data mining or information extraction.

Further, these types of data may be “soft”, such as social information or opinion, or “hard”, more akin to measurable facts or quantities.

These various forms may also be serialized in a variety of data formats or data transfer protocols, ranging from plain text with any of a myriad of syntaxes or markup vocabularies to encoded or binary forms.

Still further, any of these data forms may be organized according to a separate schema that describes the semantics and relationships within the data.

These variations further complicate the inherently diverse nature of the world and knowledge of it. A suitable data model for knowledge representation must therefore be able to capture any existing data, whatever its form, format, serialization or schema, within the diversity of these options.

The Resource Description Framework (RDF) data model has such capabilities [2]. Any extant data form or schema (from the simple to the complex) can be converted to the RDF data model. This capability enables RDF to act as a “universal solvent” for all information.
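The "universal solvent" idea can be sketched as follows: a tabular record and a markup-tagged snippet, two quite different source forms, both reduce to the same triple shape. The field names, tags and identifiers here are invented for illustration:

```python
# Sketch: converting a structured (CSV) and a semi-structured (tagged text)
# source into one common triple form.
import csv, io, re

def table_to_triples(csv_text):
    """Each non-key column of each row becomes one triple."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = row.pop("id")
        for field, value in row.items():
            yield (subject, field, value)

def tagged_to_triples(text, subject="doc1"):
    """Very crude: each <field>value</field> pair becomes one triple."""
    for field, value in re.findall(r"<(\w+)>([^<]+)</\1>", text):
        yield (subject, field, value)

csv_text = "id,name,city\nemp7,Jane,Boston\n"
tagged = "<title>Annual Report</title><year>2012</year>"

triples = list(table_to_triples(csv_text)) + list(tagged_to_triples(tagged))
print(triples)
```

Real RDF converters ("RDFizers") handle many more formats and emit proper URIs, but the convergence on one canonical statement form is the essential move.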

Once converted to this “canonical” form, RDF can then act as a single representation around which to design applications and other converters (for “round-tripping” to legacy systems, for example), as illustrated by this diagram:

Generic tools can then be driven by the RDF data model, which leads to fewer applications required and lower overall development costs.

Lastly, RDF can represent simple assertions (“Jane runs fast”) to complex vocabularies and languages. It is in this latter role that RDF can begin to represent the complexity of an entire domain via what is called an “ontology” or “knowledge graph.”


Connections Create Graphs

When representing knowledge, more things and concepts get drawn into consideration. In turn, the relationships of these things lead to connections between them to capture the inherent interdependence and linkages of the world. As still more things get considered, more connections are made and proliferate.

This process naturally leads to a graph structure, with the things in the graphs represented as nodes and the relationships between them represented as connecting edges. More things and more connections lead to more structure. Insofar as this structure and its connections are coherent, the natural structure of the knowledge graph itself can help lead to more knowledge and understanding.
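The emergence of a graph from accumulating statements can be sketched directly: subjects and objects become nodes, predicates become labeled edges. The names below are illustrative only:

```python
# Sketch: triples naturally form a graph. Subjects and objects are nodes;
# predicates are labeled, directed edges between them.
from collections import defaultdict

triples = [
    ("MarieCurie", "wonPrize", "NobelPhysics1903"),
    ("MarieCurie", "wonPrize", "NobelChemistry1911"),
    ("MarieCurie", "field", "Radioactivity"),
]

adjacency = defaultdict(list)   # node -> [(edge label, neighbor)]
for s, p, o in triples:
    adjacency[s].append((p, o))

nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
print(len(nodes), "nodes,", len(triples), "edges")
print(adjacency["MarieCurie"])
```

Each new statement either adds a node, an edge, or both, which is why the graph grows organically as more of the domain is described.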

How one such graph may emerge is illustrated by this portion of the recently announced Google Knowledge Graph [3], showing female Nobel prize winners:

Unlike traditional data tables, graphs have a number of inherent benefits, particularly for knowledge representations. They provide:

  • A coherent way to navigate the knowledge space
  • Flexible entry points for each user to access that knowledge (since every node is a potential starting point)
  • Inferencing and reasoning structures about the space
  • Connections to related information
  • Ability to connect to any form of information
  • Concept mapping, and thus the ability to integrate external content
  • A framework to disambiguate concepts based on relations and context, and
  • A common vocabulary to drive content “tagging”.

Graphs are the natural structures for knowledge domains.

Network Analysis is the New Algebra

Once built, graphs offer some analytical capabilities not available through traditional means of information structure. Graph analysis is a rapidly emerging field, but already some unique measures of knowledge domains are now possible to gauge:

  • Influence
  • Relatedness
  • Proximity
  • Centrality
  • Inference
  • Clustering
  • Shortest paths
  • Diffusion.
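Two of the measures above, shortest paths and a simple degree-based centrality, can be sketched on a toy undirected graph (node names are arbitrary):

```python
# Shortest path via breadth-first search, and degree as a crude centrality,
# on a small undirected graph.
from collections import deque

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e"), ("e", "d"), ("b", "e")]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

def shortest_path(graph, start, goal):
    """BFS guarantees the first path reaching the goal is a shortest one."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]] - seen:
            seen.add(nxt)
            queue.append(path + [nxt])
    return None

degree = {node: len(nbrs) for node, nbrs in graph.items()}
print(shortest_path(graph, "a", "d"))   # the 2-hop route a -> e -> d
print(max(degree, key=degree.get))      # a most-connected node
```

Measures such as influence, relatedness and clustering are elaborations on these same traversal and counting primitives.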

As science is coming to appreciate, graphs can represent any extant structure or schema. This gives graphs a universal character in terms of analytic tools. Further, many structures can only be represented by graphs.

Information and Interaction is Distributed

The nature of knowledge is such that relevant information is everywhere. Further, because of the interconnectedness of things, we can also appreciate that external information needs to be integrated with internal information. Meanwhile, the nature of the world is such that users and stakeholders may be anywhere.

These observations suggest a knowledge representation architecture that needs to be truly distributed. Both sources and users may be found in multiple locations.

In order to preserve existing information assets as much as possible (see further below) and to codify the earlier observation regarding the broad diversity of data formats, the resulting knowledge architecture should also attempt to put in place a thin layer or protocol that provides uniform access to any source or target node on the physical network. A thin, uniform abstraction layer – with appropriate access rights and security considerations – means knowledge networks may grow and expand at will at acceptable costs with minimal central coordination or overhead.

Properly designed, then, such architectures are not only necessary to represent the distributed nature of users and knowledge, but can also facilitate and contribute to knowledge development and exchange.

The Web is the Perfect Medium

The items above suggest the Web as an appropriate protocol for distributed access and information exchange. When combined with the following considerations, it becomes clear that the Web is the perfect medium for knowledge networks:

  • Potentially, all information may be accessed via the Web
  • All information may be given unique Web identifiers (URIs)
  • All Web tools are available for use and integration
  • All Web information may be integrated
  • Web-oriented architectures (WOA) have proven:
      ◦ Scalability
      ◦ Robustness
      ◦ Substitutability
  • Most Web technologies are open source.

It is not surprising that the largest extant knowledge networks on the globe – such as Google, Wikipedia, Amazon and Facebook – are Web-based. These pioneers have demonstrated the wisdom of WOA for cost-effective scalability and universal access.

Also, the combination of RDF with Web identifiers also means that any and all information from a given knowledge repository may be exposed and made available to others as linked data. This approach makes the Web a global, universal database. And it is in keeping with the general benefits of integrating external information sources.

Leveraging – Not Replacing – Existing IT Assets

Existing IT assets represent massive sunk costs, legacy knowledge and expertise, and (often) stakeholder consensus. Yet, these systems are still largely stovepiped.

Strategies that counsel replacement of existing IT systems risk wasting existing assets and are therefore unlikely to be adopted. Ways must be found to leverage the value already embodied in these systems, while promoting interoperability and integration.

The beauty of semantic technologies – properly designed and deployed in a Web-oriented architecture – is that a thin interoperability layer may be placed over existing IT assets to achieve these aims. The knowledge graph structure may be used to provide the semantic mappings between schema, while the Web service framework that is part of the WOA provides the source conversion to the canonical RDF data model.
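A minimal sketch of such a thin layer, assuming invented source names, field names and predicates throughout: each legacy source keeps its own schema, and a per-source map converts its records into canonical triples without touching the source system itself.

```python
# Thin interoperability layer sketch: per-source schema maps convert
# differently-shaped legacy records into one canonical triple form.
SCHEMA_MAPS = {
    "hr_db":   {"emp_name": "ex:name", "emp_city": "ex:locatedIn"},
    "crm_app": {"fullName": "ex:name", "city":     "ex:locatedIn"},
}

def to_canonical(source, record_id, record):
    """Map one legacy record's fields onto the shared predicates."""
    mapping = SCHEMA_MAPS[source]
    return [(record_id, mapping[f], v) for f, v in record.items() if f in mapping]

# Two records with different shapes converge on the same predicates.
t1 = to_canonical("hr_db",   "ex:emp7",  {"emp_name": "Jane", "emp_city": "Boston"})
t2 = to_canonical("crm_app", "ex:cust9", {"fullName": "Jane", "city": "Boston"})
print(t1)
print(t2)
```

Adding a new source then means adding one more mapping entry, not re-architecting what already exists.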

Via these approaches, prior investments in knowledge, information and IT assets may be preserved while enabling interoperability. The existing systems can continue to provide the functionality for which they were originally designed and deployed. Meanwhile, the KR-related aspects may be exposed and integrated with other knowledge assets on the physical network.

Democratizing the Knowledge Function

These kinds of approaches represent a fundamental shift in power and roles with respect to IT in the enterprise. IT departments and their bottlenecks in writing queries and bespoke application development can now be bypassed; the departments may be relegated to more appropriate support roles. Developers and consultants can now devote more of their time to developing generic applications driven by graph structures [4].

In turn, the consumers of knowledge applications – namely subject matter experts, employees, partners and stakeholders – now become the active contributors to the graphs themselves, focusing on reconciling terminology and ensuring adequate entity and concept coverage. Knowledge graphs are relatively straightforward structures to build and maintain. Those that rely on them can also be those that have the lead role in building and maintaining them.

Thus, graph-driven applications can be made generic by function with broader and more diverse information visualization capabilities. Simple instructions in the graphs can indicate what types of information can be displayed with what kind of widget. Graph-driven applications also mean that those closest to the knowledge problems will also be those directly augmenting the graphs. These changes act to democratize the knowledge function, and lower overall IT costs and risks.

Seven Pillars of the Semantic Enterprise

Elsewhere we have discussed the specific components that go into enabling the development of a semantic enterprise, what we have termed the seven pillars [5]. Most of these points have been covered to one degree or another in the discussion above.

There are off-the-shelf starter kits for enterprises to embrace to begin this process. The major starting requirements are to develop appropriate knowledge graphs (ontologies) for the given domain and to convert existing information assets into appropriate interoperable RDF form.

Beyond that, enterprise staff may be readily trained in the use and growth of the graphs, and in the staging and conversion of data. With an appropriate technology transfer component, these semantic technology systems can be maintained solely by the enterprise itself without further outside assistance.

Summary of Semantic Technology Benefits

Unlike conventional IT systems with their closed-world approach, semantic technologies that adhere to these guidelines can be deployed incrementally at lower cost and with lower risk. Further, we have seen that semantic technologies offer an excellent integration approach, with no need to re-do schema because of changed circumstances. The approach further leverages existing information assets and brings the responsibility for the knowledge function more directly to its users and consumers.

Semantic technologies are thus well-suited for knowledge applications. With their graph structures and the ability to capture semantic differences and meanings, these technologies can also accommodate multiple viewpoints and stakeholders. There are also excellent capabilities to relate all available information – from documents and images and metadata to tables and databases – into a common footing.

These advantages will immediately accrue through better integration and interoperability of diverse information assets. But, for early adopters, perhaps the most immediate benefit will come from visible leadership in embracing these enabling technologies in advance of what will surely become the preferred approach to knowledge problems.

Note: there is a version of this article on Slideshare:

View more presentations from Mike Bergman.

[1] For more on the open world assumption (OWA), see the various entries on this topic on Michael Bergman’s AI3:::Adaptive Information blog. This link is a good search string to discover more.
[2] M.K. Bergman, 2009. Advantages and Myths of RDF, white paper from Structured Dynamics LLC, April 22, 2009, 13 pp. See http://www.mkbergman.com/wp-content/themes/ai3v2/files/2009Posts/Advantages_Myths_RDF_090422.pdf.
[4] For the most comprehensive discussion of graph-driven apps, see M.K. Bergman, 2011. “Ontology-Driven Apps Using Generic Applications,” posted on the AI3:::Adaptive Information blog, March 7, 2011. You may also search on that blog for ‘ODapps’ to see related content.
[5] M.K. Bergman, 2010. “Seven Pillars of the Open Semantic Enterprise,” in AI3:::Adaptive Information blog, January 12, 2010; see http://www.mkbergman.com/859/seven-pillars-of-the-open-semantic-enterprise/.
Posted:June 4, 2012

Popular ‘Timeline of Information History’ Expands by 30%

Since its first release four years ago, the History of Information Timeline has been one of the most consistently popular aspects of this site. It is an interactive timeline of the most significant events and developments in the innovation and management of information and documents throughout human history.

Recent requests for use by others and my own references to it caused me to review its entries and add to it. Over the past few weeks I have expanded its coverage by some 30%. There are now about 115 entries in the period ranging from ca 30,000 BC (cave paintings) to ca 2003 AD (3D printing). Most additions have been to update the past twenty or thirty years:

A Timeline of Information History

The timeline works via a fast scroll at the bottom; every entry when clicked produces a short info-panel, as shown.

All entries are also coded by an icon scheme of about 20 different categories:

  • Book Forms and Bookmaking
  • Calendars
  • Copyrights and Legal
  • Infographics and Statistics
  • Libraries
  • Maps
  • Math and Symbology
  • Mechanization
  • Networks
  • New Formats or Document Forms
  • Organizing Information
  • Pre-writing
  • Paper and Papermaking
  • Printing
  • Science and Technology
  • Scripts and Alphabets
  • Standardization and Typography
  • Theory
  • Timelines
  • Writing

You may learn more about this timeline and the technology behind it by referring to the original announcement.

Posted by AI3's author, Mike Bergman Posted on June 4, 2012 at 9:28 am in Adaptive Information, Site-related | Comments (1)
The URI link reference to this post is: http://www.mkbergman.com/1012/information-timeline-gets-major-update/
Posted:May 28, 2012

Mandelbrot animation based on a static number of iterations per pixel; from Wikipedia

Insights from Commonalities and Theory

One of the main reasons I am such a big fan of RDF as a canonical data model is its ability to capture information in structured, semi-structured and unstructured form [1]. These sources are conventionally defined as:

  • Structured data — information presented according to a defined data model, often found in relational databases or other forms of tabular data
  • Semi-structured data — does not conform to the formal structure of data models, but contains tags or other markers to denote fields within the content. Markup languages embedded in text are a common form of such sources
  • Unstructured data — information content, generally oriented to text, that lacks an explicit data model or schema; structured information can be obtained from it via data mining or information extraction.

A major trend I have written about for some time is the emergence of the structured Web: that is, the exposing of structure from these varied sources in order for more information to be interconnected and made interoperable. I have posited — really a view shared by many — that the structured Web is an intermediate point in the evolution of the Web from one of documents to one where meaningful semantics occurs [2].

It is clear in my writings — indeed in the very name of my company, Structured Dynamics — that structure plays a major role in our thinking. Our use of and reliance on this term, though, raises the question: just what is structure in an informational sense? We’ll find it helpful to approach the question of What is structure? from first principles. And this, in turn, may also provide insight into how structure and information are in fact inextricably entwined.

A General Definition of Structure

According to Wikipedia, structure is a fundamental notion, of tangible or intangible character, that refers to the recognition, observation, nature, or permanence of patterns and relationships of entities. The concept may refer to an object, such as a built structure, or an attribute, such as the structure of society.

Koch curve; example of snowflake fractal; from Wikipedia

Structure may be abstract, or it may be concrete. Its realm ranges from the physical to ideas and concepts. As a term, “structure” seems ubiquitous in every domain. Structure may be found across every conceivable scale, from the most minute and minuscule to the cosmic. Even realms without any physical aspect at all — such as ideas and languages and beliefs — are perceived by many to have structure. We apply the term to any circumstance in which things are arranged or connected to one another, as a means to describe the organization or relationships of things. We seem to know structure when we see it, and to be able to discern structure of many kinds against unstructured or random backgrounds.

In this way structure quite resembles patterns, and might even be used synonymously with that term. Other closely related concepts include order, organization, design and form. When expressed, structure, particularly that of a recognizably ordered or symmetrical nature, is often called beautiful.

One aspect of structure, I think, that provides the key to its roles and importance is that it can be expressed in shortened form as a mathematical statement. One could even be so bold as to say that mathematics is the language of structure. This observation is one of the threads that will help us tie structure to information.

The Patterned Basis of Nature

The natural world is replete with structure. Patterns in nature are regularities of visual form found in the natural world. Each such pattern can be modeled mathematically. Typical mathematical forms in nature include fractals, spirals, flows, waves, lattices, arrays, golden ratios, tilings, Fibonacci sequences, and power laws. We see them in such structures as clouds, trees, leaves, river networks, fault lines, mountain ranges, craters, animal spots and stripes, shells, lightning bolts, coastlines, flowers, fruits, skeletons, cracks, growth rings, heartbeats and rates, earthquakes, veining, snowflakes, crystals, blood and pulmonary vessels, ocean waves, turbulence, bee hives, dunes and DNA.

Self-similarity in a Mandelbrot set; from Wikipedia

The mathematical expression of structures in nature is frequently repeated or recursive in nature, often in a self-organizing manner. The swirls of a snail’s shell reflect a Fibonacci sequence, while natural landscapes or lifeforms often have a fractal aspect (as expressed by some of the figures in this article). Fractals are typically self-similar patterns, generally involving some fractional or ratioed formula that is recursively applied. Another way to define a fractal is as a detailed pattern that repeats itself at every scale.
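The Mandelbrot imagery accompanying this article illustrates how simple a recursive formula can be. As a hedged sketch (the escape bound of 2 and the iteration cap are the conventional choices, not the only possible ones), the escape-time iteration behind such images is just a repeated application of z → z² + c:

```python
def escape_iterations(c: complex, max_iter: int = 50) -> int:
    """Iterate z -> z*z + c; return how many steps pass before |z| exceeds 2,
    or max_iter if the point appears to belong to the Mandelbrot set."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2:
            return n
        z = z * z + c
    return max_iter

# Points inside the set never escape; points far outside escape almost at once.
print(escape_iterations(0j))      # stays bounded: 50
print(escape_iterations(2 + 2j))  # escapes after 1 step
```

A one-line recursion, applied to every pixel, yields the endlessly detailed self-similar boundary pictured above — a vivid case of a short mathematical statement encoding enormous apparent complexity.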

Even though these patterns can often be expressed simply and mathematically, and they often repeat themselves, small differences in their starting conditions can lead to tremendous variability and a lack of predictability. This sensitivity makes them chaotic, as studied under chaos theory, though their patterns often remain discernible.

While we certainly see randomness in statistics, quantum physics and Brownian motion, it is also striking how what gives nature its beauty is structure. As a force separate and apart from the random, there appears to be something in structure that guides the expression of what is nature and what is so pleasing to behold. Self-similar and repeated structures across the widest variety of spatial scales seems to be an abiding aspect of nature.

Structure in Language

Such forms of repeated patterns or structure are also inherent in that most unique of human capabilities, language. As a symbolic species [3], we first used symbols as a way to represent the ideas of things. Simple markings, drawings and ideograms grew into more complicated structures such as alphabets and languages. The languages themselves came to embrace still further structure via sentence structures, document structures, and structures for organizing and categorizing multiple documents. In fact, one of the most popular aspects of this blog site is its Timeline of Information History — worth your look — that shows the progression of structure in information throughout human history.

Grammar is often understood as the rules or structure that governs language. It is composed of syntax, including punctuation, traditionally understood as the sentence structure of languages, and morphology, which is the structural understanding of a language’s linguistic units, such as words, affixes, parts of speech, intonation or context. There is a whole field of linguistic typology that studies and classifies languages according to their structural features. But grammar is hardly the limit to language structure.

Semantics, the meaning of language, used to be held separate from grammar or structure. But via the advent of the thesaurus, and then linguistic databases such as WordNet and more recently concept graphs that relate words and terms into connected understandings, we also have now come to understand that semantics also has structure. Indeed, these very structural aspects are now opening up to us techniques and methods — generally classified under the heading of natural language processing (NLP) — for extracting meaningful structure from the very basis of written or spoken language.

It is the marriage of the computer with language that is illuminating these understandings of structure in language. And that opening, in turn, is enabling us to capture and model the basis of human language discourse in ways that can be codified, characterized, shared and analyzed. Machine learning and processing is now enabling us to complete the virtual circle of language. From its roots in symbols, we are now able to extract and understand those very same symbols in order to derive information and knowledge from our daily discourse. We are doing this by gleaning the structure of language, which in turn enables us to relate it to all other forms of structured information.

Common Threads Via Patterns

The continuation of structure from nature to language extends across all aspects of human endeavor. I remember excitedly describing to a colleague more than a decade ago what likely is a pedestrian observation: pattern matching is a common task in many fields. (I had observed that pattern matching in very different forms was a standard practice in most areas of industry and commerce.) My “insight” was that this commonality was not widely understood, which meant that pattern matching techniques in one field were not often exploited or seen as transferable to other domains.

In computer science, pattern matching is the act of checking some sequence of tokens for the presence of the constituents of some pattern. It is closely related to the idea of pattern recognition, which is the characterization of some discernible and repeated sequence. These techniques, as noted, are widely applied, with each field tending to have its own favorite algorithms. Common applications that one sees for such pattern-based calculations include communications [4], encoding and coding theory, file compression, data compression, machine learning, video compression, mathematics (including engineering and signal processing via such techniques as Fourier transforms), cryptography, NLP [5], speech recognition, image recognition, OCR, image analysis, search, sound cleaning (that is, noise reduction, such as Dolby) and gene sequence searching and alignment, among many others.

To better understand what is happening here and the commonalities, let’s look at the idea of compression. Data compression is valuable for transmitting any form of content over wired or wireless channels because we can transmit the same (or closely similar) message faster and with less bandwidth [6]. There are both lossless (no loss of information) and lossy compression methods. Lossless data compression algorithms usually exploit statistical redundancy — that is, a pattern match — to represent data more concisely without losing information. Redundancy in information theory is the number of bits used to transmit a message minus the number of bits of actual information in the message. Lossless compression is possible because most real-world data has statistical redundancy. In lossy data compression, some loss of information is acceptable by dropping detail from the data to save storage space. These methods are guided by research that indicates, say, how certain frequencies may not be heard or seen by people and can be removed from the source data.
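The exploitation of statistical redundancy can be shown with the simplest lossless scheme of all, run-length encoding. This is a toy sketch, not a production codec, but it makes the pattern-match-as-compression idea tangible:

```python
from itertools import groupby

def rle_encode(s: str) -> list:
    """Run-length encoding: collapse each run of repeated characters
    into a (character, count) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]

def rle_decode(pairs: list) -> str:
    """Invert the encoding by expanding each pair back into its run."""
    return "".join(ch * n for ch, n in pairs)

redundant = "aaaaaabbbbbbbbcc"           # a highly patterned message
encoded = rle_encode(redundant)          # [('a', 6), ('b', 8), ('c', 2)]
assert rle_decode(encoded) == redundant  # lossless: the original is fully recovered
print(encoded)
```

Note that the scheme only wins when runs actually exist; on a message with no repeated characters, the (character, count) pairs are longer than the original, which is exactly the redundancy argument made above.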

On a different level, there is a close connection between machine learning and compression: a system that predicts the posterior probabilities of a sequence given its entire history can be used for optimal data compression (by using arithmetic coding on the output distribution), while an optimal compressor can be used for prediction (by finding the symbol that compresses best, given the previous history). This equivalence has been used as justification for data compression as a benchmark for “general intelligence.” On a still different level, one major part of cryptography is the exact opposite of these objectives: constructing messages that pattern matching fails against or is extremely costly or time-consuming to analyze.

When one stands back from any observable phenomena — be it natural or human communications — we can see that the “information” that is being conveyed often has patterns, recursion or other structure that enables it to be represented more simply and compactly in mathematical form. This brings me back to my two favorite protagonists in my recent writings — Claude Shannon and Charles S. Peirce.

Information is Structure

Claude Shannon‘s seminal work in 1948 on information theory dealt with the amount of information that could be theoretically and predictably communicated between a sender and a receiver [7] [8]. No context or semantics were implied in this communication, only the amount of information (for which Shannon introduced the term “bits”) and what might be subject to losses (or uncertainty in the accurate communication of the message). In this regard, what Shannon called “information” is what we would best term “data” in today’s parlance.

A cellular automaton based on hexagonal cells instead of squares (rule 34/2); from Wikipedia

The context of Shannon’s paper and work by others preceding him was to understand information losses in communication systems or networks. Much of the impetus for this came about because of issues in wartime communications and early ciphers and cryptography and the emerging advent of digital computers. But the insights from Shannon’s paper also relate closely to the issues of data patterns and data compression.

A key measure of Shannon’s theory is what he referred to as information entropy, which is usually expressed by the average number of bits needed to store or communicate one symbol in a message. Entropy quantifies the uncertainty involved in predicting the value of a random variable. The Shannon entropy measure is actually a measure of the uncertainty based on the communication (transmittal) between a sender and a receiver; the actual information that gets transmitted and predictably received was formulated by Shannon as R, which can never be zero because all communication systems have losses.
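Shannon’s entropy measure, H = −Σ p·log₂(p), is easy to estimate from a message’s empirical symbol frequencies. A minimal sketch (estimating probabilities from the message itself, which is an assumption rather than part of Shannon’s sender/receiver formulation):

```python
from collections import Counter
from math import log2

def entropy_bits_per_symbol(message: str) -> float:
    """Estimate Shannon entropy: the average number of bits needed per symbol,
    using the empirical symbol frequencies of the message as probabilities."""
    counts = Counter(message)
    total = len(message)
    return sum(-(n / total) * log2(n / total) for n in counts.values())

print(entropy_bits_per_symbol("aaaaaaaa"))  # 0.0 -- fully predictable, no uncertainty
print(entropy_bits_per_symbol("abcdabcd"))  # 2.0 -- four equiprobable symbols need 2 bits each
```

The two extremes show the measure doing its job: a message of one repeated symbol carries no uncertainty at all, while four equally likely symbols require exactly two bits each.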

A simple intuition can show how this formulation relates to patterns or data compression. Let’s take a message of completely random digits. In order to accurately communicate that message, all digits (bits) would have to be transmitted in their original state and form. Absolutely no compression of this message is possible. If, however, there are patterns within the message (which, of course, now ceases to make the message random), these can be represented algorithmically in shortened form, so that we only need communicate the algorithm and not the full bits of the original message. If this “compression” algorithm can then be used to reconstruct the bit stream of the original message, the data compression method is deemed to be lossless. The algorithm so derived is also the expression of the pattern that enabled us to compress the message in the first place (such as a*2+1).
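This intuition can be checked directly with an off-the-shelf lossless compressor such as zlib (the exact byte counts will vary by compressor and settings; the seeded pseudo-random data stands in for a truly random message):

```python
import random
import zlib

random.seed(42)
patterned = bytes([1, 2, 3, 4] * 250)                      # 1000 bytes with an obvious repeat
noisy = bytes(random.randrange(256) for _ in range(1000))  # 1000 bytes of pseudo-random data

compressed_patterned = zlib.compress(patterned)
compressed_noisy = zlib.compress(noisy)

# The patterned message shrinks dramatically; the random one barely (or not at all).
print(len(compressed_patterned), len(compressed_noisy))
assert zlib.decompress(compressed_patterned) == patterned  # lossless round trip
```

The compressor has, in effect, discovered the short algorithmic description of the patterned message, while the random message offers it no pattern to exploit.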

We can apply this same type of intuition to human language. In order to improve communication efficiency, the most common words (e.g., “a”, “the”, “I”) should be shorter than less common words (e.g., “disambiguation”, “semantics”, “ontology”), so that sentences will not be too long. And indeed they are. This principle is equivalent to data compression. In fact, such repeats and patterns apply to the natural world as well.

Shannon’s idea of information entropy has come to inform the even broader subject of entropy in physics and the 2nd Law of Thermodynamics [10]. According to Koelman, “the entropy of a physical system is the minimum number of bits you need to fully describe the detailed state of the system.” Very random (uncertain) states have high entropy, patterned ones low entropy. As I noted recently, in open systems, structures (patterns) are a means to speed the tendency to equilibrate across energy gradients [8]. This observation helps provide insight into structure in natural systems, and why life and human communications tend toward less randomness. Structure will always continue to emerge because it is adaptive to speed the deltas across these gradients; structure provides the fundamental commonality between biological information (life) and human information.

In the words of Thomas Schneider [11], “Information is always a measure of the decrease of uncertainty at a receiver.” Of course, in Shannon’s context, what is actually being measured here is data (or bits), not information embodying any semantic meaning or context. Thus, the terminology may not be accurate for discussing “information” in a contemporary sense. But it does show that “structure” — that is, the basis for shortening the length of a message while still retaining its accuracy — is information (in the Shannon context). In this information there is order or patterns, often of a hierarchical or fractal or graph nature. Any structure that emerges that is better able to reduce the energy gradient faster will be favored according to the 2nd Law.

Still More Structure Makes “Information” Information

The data that constitutes “information” in the Shannon sense still lacks context and meaning. In communications terms, it is data; it has not yet met the threshold of actionable information. It is in this next step that we can look to Charles Sanders Peirce (1839 – 1914) for guidance [9].

The core of Peirce’s world view is based in semiotics, the study and logic of signs. In his seminal writing on this, “What is in a Sign?” [12], he wrote that “every intellectual operation involves a triad of symbols” and “all reasoning is an interpretation of signs of some kind”. A sign of an object leads to interpretants, which, as signs, then lead to further interpretants. Peirce’s triadic logic of signs in fact is a taxonomy of sign relations, in which signs get reified and expanded via still further signs, ultimately leading to communication, understanding and an approximation of “canonical” truth. Peirce saw the scientific method as itself an ultimate example of this process. The key aspect of signs for Peirce is the ongoing process of interpretation and reference to further signs.

Information is structure, and structure is information.

Ideograms leading to characters, that get combined into sentences and other syntax, and then get embedded within contexts of shared meanings show how these structures compound themselves and lead to clearer understandings (that is, accurate messages) in the context of human languages. While the Shannon understanding of “information” lacked context and meaning, we can see how still higher-order structures may be imposed through these reifications of symbols and signs that improve the accuracy and efficiency of our messages. Though Peirce did not couch his theory of semiosis in terms of either structure or information, we can see it as a natural extension of the same structural premises in Shannon’s formulation.

In fact, today, we now see the “structure” in the semantic relationships of language through the graph structures of ontologies and linguistic databases such as WordNet. The understanding and explication of these structures are having a similarly beneficial effect on how still more refined and contextual messages can be composed, transmitted and received. Human-to-machine communications is (merely!) the challenge of codifying and making explicit the implicit structures in our languages.

The Peircean ideas of interpretation (context) and compounding and reifying structure are a major intellectual breakthrough for extending the Shannon “information” theory to information in the modern sense. These insights also occur within a testable logic for how things and the names of things can be understood and related to one another, via logical statements or structures. These, in turn, can be symbolized and formalized into logical constructs that can capture the structure of natural language as well as more structured data (or even nature, as some of the earlier Peirce speculation asserts [13]).

According to this interpretation of Peirce, the nature of information is the process of communicating a form from the object to the interpretant through the sign [14]. The clarity of Peirce’s logic of signs is an underlying factor, I believe, for why we are finally seeing our way clear to how to capture, represent and relate information from a diversity of sources and viewpoints that is defensible and interoperable.

Structure is Information

Common to all of these perspectives — from patterns to nature and on to life and then animal and human communications — we see that structure is information. Human artifacts and technology, though not “messages” in a conventional sense, also embody the information of how they are built within their structures [15]. We also see the interplay of patterns and information in many processes of the natural world [16] from complexity theory, to emergence, to autopoiesis, and on to autocatalysis, self-organization, stratification and cellular automata [17]. Structure in its many guises is ubiquitous.

From article on Beauty in Wikipedia; Joanna Krupa, a Polish-American model and actress

We, as beings who can symbolically record our perceptions, seem to innately recognize patterns. We see beauty in symmetry. Bilateral symmetry seems to be deeply ingrained in the inherent perception by humans of the likely health or fitness of other living creatures. We see beauty in the patterned, repeated variability of nature. We see beauty in the clever turn of phrase, or in music, or art, or the expressiveness of a mathematical formulation.

We also seem to recognize beauty in the simple. Seemingly complex bit streams that can be reduced to a short algorithmic expression are always viewed as more elegant than lengthier, more complex alternatives. The simple laws of motion and Newtonian physics fit this pattern, as does Einstein’s E=mc2. This preference for the simple is a preference for the greater adaptiveness of the shorter, more universal pattern to messages, an insight indicated by Shannon’s information theory.

In the more prosaic terms of my vocation in the Web and information technology, these insights point to the importance of finding and deriving structured representations of information — including meaning (semantics) — that can be simply expressed and efficiently conveyed. Building upon the accretions of structure in human and computer languages, the semantic Web and semantic technologies offer just such a prospect. These insights provide a guidepost for how and where to look for the next structural innovations. We find them in the algorithms of nature and language, and in making connections that provide the basis for still more structure and patterned commonalities.

Ideas and algorithms around lossless compression, graph theory and network analysis are, I believe, the next fruitful hunting grounds for finding still higher orders of structure, which can be simply expressed. The patterns of nature, which have emerged incrementally and randomly over the eons of cosmological time, look to be an excellent laboratory.

So, as we see across examples from nature and life to language and all manner of communications, information is structure and structure is information. And it is simply beautiful.


[1]  I discuss this advantage, among others, in M. K. Bergman, 2009. “Advantages and Myths of RDF,” AI3:::Adaptive Innovation blog, April 8, 2009. See http://www.mkbergman.com/483/advantages-and-myths-of-rdf/.
[2] The structured Web is object-level data within Internet documents and databases that can be extracted, converted from available forms, represented in standard ways, shared, re-purposed, combined, viewed, analyzed and qualified without respect to originating form or provenance. See further M. K. Bergman, 2007. “What is the Structured Web?,” AI3:::Adaptive Innovation blog, July 18, 2007. See http://www.mkbergman.com/390/what-is-the-structured-web/. Also, for a diagram of the evolution of the Web, see M. K. Bergman, 2007. “More Structure, More Terminology and (hopefully) More Clarity,” AI3:::Adaptive Innovation blog, July 22, 2007. See http://www.mkbergman.com/391/more-structure-more-terminology-and-hopefully-more-clarity/.
[3] Terrence W. Deacon, 1997. The Symbolic Species: The Co-Evolution of Language and the Brain, W. W. Norton & Company, July 1997, 527 pp. (ISBN-10: 0393038386)
[4] Communications is a particularly rich domain with techniques such as the Viterbi algorithm, which has found universal application in decoding the convolutional codes used in both CDMA and GSM digital cellular, dial-up modems, satellite, deep-space communications, and 802.11 wireless LANs.
[5] Notable areas in natural language processing (NLP) that rely on pattern-based algorithms include classification, clustering, summarization, disambiguation, information extraction and machine translation.
[6] To see some of the associated compression algorithms, there is a massive list of “codecs” (compression/decompression) techniques available; fractal compression is one.
[7] Claude E. Shannon, 1948. “A Mathematical Theory of Communication”, Bell System Technical Journal, 27: 379–423, 623-656, July, October, 1948. See http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf.
[8] I first raised the relation of Shannon’s paper to data patterns — but did not discuss it further awaiting this current article — in M. K. Bergman, 2012. “The Trouble with Memes,” AI3:::Adaptive Innovation blog, April 4, 2012. See http://www.mkbergman.com/1004/the-trouble-with-memes/.
[9] I first overviewed Peirce’s relation to information messaging in M. K. Bergman, 2012. “Give Me a Sign: What Do Things Mean on the Semantic Web?,” AI3:::Adaptive Innovation blog, January 24, 2012. See http://www.mkbergman.com/994/give-me-a-sign-what-do-things-mean-on-the-semantic-web/. Peirce had a predilection for expressing his ideas in “threes” throughout his writings.
[10] For a very attainable lay description, see Johannes Koelman, 2012. “What Is Entropy?,” in Science 2.0 blog, May 5, 2012. See http://www.science20.com/hammock_physicist/what_entropy-89730.
[11] See Thomas D. Schneider, 2012. “Information Is Not Entropy, Information Is Not Uncertainty!,” Web page retrieved April 4, 2012; see http://www.lecb.ncifcrf.gov/~toms/information.is.not.uncertainty.html.
[12] Charles Sanders Peirce, 1894. “What is in a Sign?”, see http://www.iupui.edu/~peirce/ep/ep2/ep2book/ch02/ep2ch2.htm.
[13] It is somewhat amazing, more than a half century before Shannon, that Peirce himself considered “quasi-minds” such as crystals and bees to be sufficient as the interpreters of signs. See Commens Dictionary of Peirce’s Terms (CDPT), Peirce’s own definitions, and the entry on “quasi-minds”; see http://www.helsinki.fi/science/commens/terms/quasimind.html.
[14] João Queiroz, Claus Emmeche and Charbel Niño El-Hania, 2005. “Information and Semiosis in Living Systems: A Semiotic Approach,” in SEED 2005, Vol. 1. pp 60-90; see http://www.library.utoronto.ca/see/SEED/Vol5-1/Queiroz_Emmeche_El-Hani.pdf.
[15] Kevin Kelley has written most broadly about this in his book, What Technology Wants, Viking/Penguin, October 2010. For a brief introduction, see Kevin Kelly, 2006. “The Seventh Kingdom,” in The Technium blog, February 1, 2006. See http://www.kk.org/thetechnium/archives/2006/02/the_seventh_kin.php.
[16] For a broad overview, see John Cleveland, 1994. “Complexity Theory: Basic Concepts and Application to Systems Thinking,” Innovation Network for Communities, March 27, 1994. May be obtained at http://www.slideshare.net/johncleveland/complexity-theory-basic-concepts.
[17] For more on cellular automata, see Stephen Wolfram, 2002. A New Kind of Science, Wolfram Media, Inc., May 14, 2002, 1197 pp. ISBN 1-57955-008-8.

Posted by AI3's author, Mike Bergman Posted on May 28, 2012 at 10:24 pm in Adaptive Information, Structured Web | Comments (5)
The URI link reference to this post is: http://www.mkbergman.com/1011/what-is-structure/
The URI to trackback this post is: http://www.mkbergman.com/1011/what-is-structure/trackback/