Posted: August 18, 2014

A Critical Fit with the Semantic Web and AI

In the first parts of this series we introduced the idea of Big Structure, and the fact that it resides at the nexus of the semantic Web, artificial intelligence, natural language processing, knowledge bases, and Big Data. In this article, we look specifically at the work that Big Structure promotes in data interoperability as a way to clarify the roles these various aspects play.

By its nature, data integration (the first step in data interoperability) means that data is being combined across two or more datasets. Such integration surfaces all of the myriad aspects of semantic heterogeneities, exactly the kinds of issues that the semantic Web and semantic technologies were designed to address. But resolving semantic differences cannot be accomplished by semantic technologies alone. While semantics can address the basis of differences in meaning and context, resolution of those differences or deciding between differing interpretations (that is, ambiguity) also requires many of the tools of artificial intelligence or natural language processing (NLP).

By decomposing this space into its various sources of semantic heterogeneities — as well as the work required in order to provide for such functions as search, disambiguation, mapping and transformations — we can begin to understand how all of these components can work together to help achieve data interoperability. That understanding, in turn, is essential for designing the stack and software architecture — and its accompanying information architecture — that can best achieve these interoperability objectives.

So, this current article lays out this conceptual framework of components and roles. Later articles in this series will address the specific questions of software and information architectural design.

Data Interoperability in Relation to Semantics

Semantic technologies give us the basis for understanding differences in meaning across sources, specifically geared to address differences in real world usage and context. These semantic tools are essential for providing common bases for relating structured data across various sources and contexts. These same semantic tools are also the basis by which we can determine what unstructured content “means”, thus providing the structured data tags that also enable us to relate documents to conventional data sources (from databases, spreadsheets, tables and the like). These semantic technologies are thus the key enablers for making information — unstructured, semi-structured and structured — understandable to both humans and machines across sources. Such understandings are then a key basis for powering the artificial intelligence applications that are now emerging to make our lives more productive and less routine.

For nearly a decade I have used an initial schema by Pluempitiwiriyawej and Hammer to elucidate the sources of possible semantic differences between content. Over the years I have added language and encoding differences to this schema. Most recently, I have updated this schema to specifically call out semantic heterogeneities due either to conceptual differences between sources (largely arising from schema differences) or to value and attribute differences amongst the actual data. I have further added examples for what each of these categories of semantic heterogeneities means [1].

This table of more than 40 sources of semantic heterogeneities clearly shows the possible impediments to getting data to interoperate across sources:

Class | Category | Subcategory | Examples | Type [2] [4]
LANGUAGE Encoding Ingest Encoding Mismatch For example, ANSI v UTF-8 [3] Concept
Ingest Encoding Lacking Mis-recognition of tokens because not being parsed with the proper encoding [3] Concept
Query Encoding Mismatch For example, ANSI v UTF-8 in search [3] Concept
Query Encoding Lacking Mis-recognition of search tokens because not being parsed with the proper encoding [3] Concept
Languages Script Mismatch Variations in how parsers handle, say, stemming, white spaces or hyphens Concept
Parsing / Morphological Analysis Errors (many) Arabic languages (right-to-left) v Romance languages (left-to-right) Concept
Syntactical Errors (many) Ambiguous sentence references, such as I’m glad I’m a man, and so is Lola (Lola by Ray Davies and the Kinks) Concept
Semantics Errors (many) River bank v money bank v billiards bank shot Concept
CONCEPTUAL Naming Case Sensitivity Uppercase v lower case v Camel case Concept
Synonyms United States v USA v America v Uncle Sam v Great Satan Concept
Acronyms United States v USA v US Concept
Homonyms Such as when the same name refers to more than one concept, such as Name referring to a person v Name referring to a book Concept
Misspellings As stated Concept
Generalization / Specialization When single items in one schema are related to multiple items in another schema, or vice versa. For example, one schema may refer to “phone” but the other schema has multiple elements such as “home phone,” “work phone” and “cell phone” Concept
Aggregation Intra-aggregation When the same population is divided differently (such as, Census v Federal regions for states, England v Great Britain v United Kingdom, or full person names v first-middle-last) Concept
Inter-aggregation May occur when sums or counts are included as set members Concept
Internal Path Discrepancy Can arise from different source-target retrieval paths in two different schemas (for example, hierarchical structures where the elements are different levels of remove) Concept
Missing Item Content Discrepancy Differences in set enumerations or including items or not (say, US territories) in a listing of US states Concept
Missing Content Differences in scope coverage between two or more datasets for the same concept Concept
Attribute List Discrepancy Differences in attribute completeness between two or more datasets Attribute
Missing Attribute Differences in scope coverage between two or more datasets for the same attribute Attribute
Item Equivalence When two types (classes or sets) are asserted as being the same when the scope and reference are not (for example, Berlin the city v Berlin the official city-state) Concept
When two individuals are asserted as being the same when they are actually distinct (for example, John Kennedy the president v John Kennedy the aircraft carrier) Attribute
Type Mismatch When the same item is characterized by different types, such as a person being typed as an animal v human being v person Attribute
Constraint Mismatch When attributes referring to the same thing have different cardinalities or disjointedness assertions Attribute
DOMAIN Schematic Discrepancy Element-value to Element-label Mapping One of four errors that may occur: when different attribute names (say, Hair v Fur) refer to the same attribute; when the same attribute names (say, Hair v Hair) refer to different attribute scopes (say, Hair v Fur); when values for these attributes are the same but refer to different actual attributes; or when values differ but are for the same attribute and putative value. Many of the other semantic heterogeneities herein also contribute to schema discrepancies Attribute
Attribute-value to Element-label Mapping Attribute
Element-value to Attribute-label Mapping Attribute
Attribute-value to Attribute-label Mapping Attribute
Scale or Units Measurement Type Differences, say, in the metric v English measurement systems, or currencies Attribute
Units Differences, say, in meters v centimeters v millimeters Attribute
Precision For example, a value of 4.1 inches in one dataset v 4.106 in another dataset Attribute
Data Representation Primitive Data Type Confusion often arises in the use of literals v URIs v object types Attribute
Data Format Delimiting decimals by period v commas; various date formats; using exponents or aggregate units (such as thousands or millions) Attribute
DATA Naming Case Sensitivity Uppercase v lower case v Camel case Attribute
Synonyms For example, centimeters v cm Attribute
Acronyms For example, currency symbols v currency names Attribute
Homonyms Such as when the same name refers to more than one attribute, such as Name referring to a person v Name referring to a book Attribute
Misspellings As stated Attribute
ID Mismatch or Missing ID URIs can be a particular problem here, due to actual mismatches but also to inconsistent use of namespaces and truncated URIs Attribute
Missing Data A common problem, more acute with closed world approaches than with open world ones Attribute
Element Ordering Set members can be ordered or unordered, and if ordered, the sequences of individual members or values can differ Attribute
Sources of Semantic Heterogeneities

Ultimately, since we express all of our content and information with human language, we need to start there to understand the first sources of semantic differences. Like the differences in human language, we also have differences in world views and experience. These differences are often conceptual in nature and get at what we might call differences in real world perspectives and experiences. From there, we encounter differences in our specific realms of expertise or concern, or the applicable domain(s) for our information and knowledge. Then, lastly, we assign data and values to our observations and characterizations in order to specify and quantify them. But the attributes of data are subject to the same semantic vagaries as concepts, in addition to their own specific challenges in units and measures and how they are expressed.

From the conceptual to actual data, then, we see differences in perspective, vocabularies, measures and conventions. Only by systematically understanding these sources of heterogeneity — and then explicitly addressing them — can we begin to try to put disparate information on a common footing. Only by reconciling these differences can we begin to get data to interoperate.

Some of these differences and heterogeneities are intrinsic to the nature of the data at hand. Even for the same putative topics, data from French researchers will be expressed in a different language and with different measurements (metric) than will data from English researchers. Some of these heterogeneities also arise from the basis and connections asserted between datasets, as misuse of the sameAs predicate shows in many linked data applications [5].

Fortunately, in many areas we are transitioning by social convention to overcome many of these sources of semantic heterogeneity. A mere twenty years ago, our information technology systems expressed and stored data in a multitude of formats and systems. The Internet and Web protocols have done much to overcome these sources of differences, what I’ve termed elsewhere as climbing the data federation pyramid [6]. Semantic Web approaches, in which data items are assigned unique URIs, are another means of making integration easier. And, whether or not all agree that it is culturally a good thing, we are also seeing English become the lingua franca of research and data.

The point of the table above is not to throw up our hands and say there is just too much complexity in data integration. Rather, by systematically decomposing the sources of semantic heterogeneity, we can anticipate and accommodate those sources not yet being addressed by cultural or technological conventions. While there are many categories of semantic heterogeneity, these categories are patterned and can be anticipated and corrected. These patterned sources inform us about what kind of work must be done to overcome semantic differences where they still reside.
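To make a few of these patterned sources concrete, here is a minimal sketch, in Python, of the kind of normalization work the table implies for value-level differences; the synonym list, formats and function names are illustrative assumptions rather than any particular tool:

```python
# Illustrative normalization helpers for a few value-level heterogeneities
# from the table above: case sensitivity, synonyms/acronyms, decimal
# delimiters and date formats. All mappings and names are toy examples.
from datetime import datetime

SYNONYMS = {"usa": "united states", "us": "united states",
            "america": "united states", "uncle sam": "united states"}

def normalize_name(raw: str) -> str:
    """Fold case and map known synonyms or acronyms to one canonical label."""
    key = raw.strip().lower()
    return SYNONYMS.get(key, key)

def parse_decimal(text: str) -> float:
    """Handle period- v comma-delimited decimals (e.g., '4.1' v '4,1')."""
    return float(text.replace(",", "."))

def parse_date(text: str) -> datetime:
    """Try several date conventions rather than assuming a single format."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")

print(normalize_name("USA"))       # -> 'united states'
print(parse_decimal("4,1"))        # -> 4.1
print(parse_date("2014-08-18"))    # -> 2014-08-18 00:00:00
```

Genuinely ambiguous cases (is 01/02/2014 January 2 or February 1?) cannot be resolved by rules alone, which is where the AI and NLP techniques noted above come into play.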

Work Components in Data Interoperability

The description logics that underlie the semantic Web already do a fair job of architecting this concept-attribute split in semantics. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts [7].
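As a minimal sketch of this split, the snippet below uses Python and the rdflib library (an assumption of the example, not something prescribed here) to declare a tiny TBox of classes and properties and an ABox of one individual with its assertions:

```python
# A toy knowledge base illustrating the TBox / ABox split with rdflib.
from rdflib import Graph, Literal, Namespace, RDF, RDFS
from rdflib.namespace import OWL, XSD

EX = Namespace("http://example.org/")
g = Graph()
g.bind("ex", EX)

# TBox: terminological knowledge (the schema, or concepts and properties)
g.add((EX.Person, RDF.type, OWL.Class))
g.add((EX.President, RDFS.subClassOf, EX.Person))
g.add((EX.birthDate, RDF.type, OWL.DatatypeProperty))

# ABox: assertional knowledge (instances, their types and attribute values)
g.add((EX.JohnKennedy, RDF.type, EX.President))
g.add((EX.JohnKennedy, EX.birthDate, Literal("1917-05-29", datatype=XSD.date)))

# A simple retrieval task: find individuals directly asserted as Presidents
for row in g.query("SELECT ?s WHERE { ?s a ex:President }", initNs={"ex": EX}):
    print(row.s)   # -> http://example.org/JohnKennedy
```

Retrieval here works only on what is directly asserted; reasoning tasks such as subsumption or realization would additionally require an OWL reasoner.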

The semantic Web is a standards-based effort by the W3C (World Wide Web Consortium); many of its accomplishments have arisen around ontology and TBox-related efforts. Data integration has putatively been tackled from the perspective of linked data, but that methodology so far is short on attributes and property-mapping linkages between datasets and schema. There are as yet no reference vocabularies or schema for attributes [8]. Many of the existing linked data linkages are based on erroneous owl:sameAs assertions. It is fair to say that attribute and ABox-level semantics and interoperability have received scarce attention, even though the logic underpinnings exist for progress to be made.

This lack on the attribute or ABox side of things is a major gap in the work requirements for data interoperability, as we see from the table below. TBox development and understanding are quite good, and a number of reference ontologies are available upon which to ground conceptual mappings [9]. But the ABox third is largely missing grounding references. And the specialty work tasks, representing about the last third, are in need of more effective approaches and better tooling.

For both the TBox and the ABox we are able to describe and model concepts (classes) and instances (individuals), and we are pretty good at modeling relationships (predicates) between concepts and individuals. We are also able to ground concepts and their relationships through a number of reference concept ontologies [9]. But our understanding of attributes (the descriptive properties of instances) remains poor and ungrounded. Best practices — let alone general practices — still remain to be discovered.

TBox (concepts) | Specialty Work Tasks | ABox (data)
  • Definitions of the concepts and properties (relationships) of the controlled vocabulary
  • Declarations of concept axioms or roles
  • Inferencing of relationships, be they transitive, symmetric, functional or inverse to another property
  • Equivalence testing as to whether two classes or properties are equivalent to one another
  • Subsumption, which is checking whether one concept is more general than another
  • Satisfiability, which is the problem of checking whether a concept has been defined (is not an empty concept)
  • Classification, which places a new concept in the proper place in a taxonomic hierarchy of concepts
  • Logical implication, which is whether a generic relationship is a logical consequence of the declarations in the TBox
  • Infer property assertions implicit through the transitive property
  • Mappings are the core of interoperability in that concepts and attributes get matched across schema and datasets
  • Transformations are the means to bring disparate data into common grounds, the second leg of interoperability
  • Entailments, which are whether other propositions are implied by the stated condition
  • Instance checking, which verifies whether a given individual is an instance of (belongs to) a specified concept
  • Knowledge base consistency, which is to verify whether all concepts admit at least one individual
  • Realization, which is to find the most specific concept for an individual object
  • Retrieval, which is to find the individuals that are instances of a given concept
  • Identity relations, which is to determine the equivalence or relatedness of instances in different datasets
  • Disambiguation, which is resolving references to the proper instance
  • Membership assertions, either as concepts or as roles
  • Attribute assertions
  • Linkages assertions that capture the above but also assert the external sources for these assignments
  • Consistency checking of instances
  • Satisfiability checks, which verify that the conditions of instance membership are met
Work Tasks for a Data Interoperability Framework

Across the knowledge base (that is, the combination of the TBox and the ABox), the semantic Web has improved its search capabilities by formally integrating with conventional text search engines, such as Solr. Instance and consistency checking are pretty straightforward to do, but are often neglected steps in most non-commercial semantic installations. Critical areas such as mappings, transformations and identity evaluation remain weak work areas (a small sketch of the mapping and transformation steps follows the figure). This figure helps show these major areas and their work splits:

Work Splits in Data Interoperability
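As a concrete illustration of the mapping and transformation tasks listed above, the sketch below maps two differently named source attributes to a single canonical attribute and transforms their values to a common unit. The namespaces, the choice of skos:closeMatch as the mapping predicate, and the conversion factors are all assumptions made for the example:

```python
# Mapping two source attributes to one canonical attribute, then transforming
# their values to a common unit (meters). Names and factors are illustrative.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import SKOS, XSD

A = Namespace("http://example.org/datasetA/")       # records height in cm
B = Namespace("http://example.org/datasetB/")       # records height in inches
CANON = Namespace("http://example.org/reference/")  # canonical attribute

# Mapping: relate each source attribute to the canonical reference attribute
mappings = Graph()
mappings.add((A.heightCm, SKOS.closeMatch, CANON.heightMeters))
mappings.add((B.heightInches, SKOS.closeMatch, CANON.heightMeters))

# Transformation: per-source conversion factors to the canonical unit
TO_METERS = {A.heightCm: 0.01, B.heightInches: 0.0254}

def to_canonical(source_prop, value):
    """Convert a source height value to the canonical meters representation."""
    return CANON.heightMeters, Literal(float(value) * TO_METERS[source_prop],
                                       datatype=XSD.double)

print(to_canonical(B.heightInches, 72))   # -> (heightMeters, 1.8288)
```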

Work Splits Between the Semantic Web and AI

As we discussed earlier regarding the recent and rapid advances in artificial intelligence [10], combining knowledge bases and the semantic Web with machine learning (ML) and NLP techniques should bring rapid improvements in data interoperability. The two stumbling blocks have been the lack of a framework and architecture for interoperability and the lack of attribute groundings. Now that these factors are known and are being purposefully addressed, we should see rapid progress, similar to that in other areas of AI.

This re-embedding of the semantic Web in artificial intelligence, coupled with the conscious attention to provide reference groundings for data interoperability, should do much to address what are current, labor-intensive stumbling blocks in the knowledge management workflow.

Putting Some Grown-up Pants on the Semantic Web

The semantic Web clearly needs to play a central role in data integration and interoperability. Fortunately, as we have seen in other areas [11], semantic technologies lend themselves to generic functional software that can be designed for re-use in most any knowledge domain, chiefly by changing the data and ontologies guiding them. This means that reference libraries of groundings, mappings and transformations can be built over time and reused across enterprises and projects. Functional programming languages also align well with the data and schema used in knowledge management functions, ontologies and DSLs. These prospects parallel the emergence of knowledge-based AI (KBAI), which marries electronic Web knowledge bases with improvements in machine-learning algorithms.

The time for these initiatives is now. The complete lack of distributed data interoperability is no longer tolerable. High costs due to unacceptable manual efforts and too many failed projects have plagued the data interoperability efforts of the past. Data interoperability is no longer a luxury, but a necessity for enterprises needing to compete in a data-intensive environment. At scale, point-to-point integration efforts become ineffective; a form of reusable and transferable master data management (MDM) needs to emerge for the realities of Big Data, one based on the open and standard protocols of the Web.

Much tooling and better workflows and user interfaces will need to emerge. But the critical aspects are the ones we are addressing now: information and software architectures; reference groundings and attributes; and education about these very real prospects near at hand. The challenge of data interoperability in cooperation with its artificial intelligence cousin is where the semantic Web will finally put on its Big Boy pants.


[1] See Charnyote Pluempitiwiriyawej and Joachim Hammer, 2000. A Classification Scheme for Semantic and Schematic Heterogeneities in XML Data Sources, Technical Report TR00-004, University of Florida, Gainesville, FL, 36 pp., September 2000. See https://cise.ufl.edu/tr/DOC/REP-2000-396.pdf. I first cited this report and extended it to cover languages (see [3]) in M.K. Bergman, 2006. Sources and Classification of Semantic Heterogeneities, AI3:::Adaptive Information blog, June 6, 2006. See http://www.mkbergman.com/232/sources-and-classification-of-semantic-heterogeneities/. This most recent version added the examples and expanded the listing a bit further, to where it is no longer faithful to the original 2000 paper.
[2] Concept is the shorthand used for the schema or classes or TBox. Attribute is the shorthand used for instance data or entities and their ABox. I segregate class-relation properties (predicates) from instance-describing properties (attributes). This distinction is not used in standard TBox-ABox splits; its rationale will be described in a further article.
[3] See M.K. Bergman, 2006. Tutorial: Internet Languages, Character Sets and Encodings, BrightPlanet Corporation Technical Documentation, March 2006, 13 pp. See http://www.mkbergman.com/wp-content/themes/ai3v2/files/2006Posts/InternationalizationTutorial060323.pdf.
[4] See [7]. Also the TBox portion, or classes (concepts), is the basis of the ontologies. The ontologies establish the structure used for governing the conceptual relationships for that domain and in reference to external (Web) ontologies. The ABox portion, or instances (named entities), represents the specific, individual things that are the members of those classes. Named entities are the notable objects, persons, places, events, organizations and things of the world. Each named entity is related to one or more classes (concepts) to which it is a member. Named entities do not set the structure of the domain, but populate that structure. The ABox and TBox play different roles in the use and organization of the information and structure.
[5] M.K. Bergman 2009. When Linked Data Rules Fail, AI3:::Adaptive Information blog, November 16, 2009. See http://www.mkbergman.com/846/when-linked-data-rules-fail/.
[6] M.K. Bergman 2006. Climbing the Data Federation Pyramid, AI3:::Adaptive Information blog, May 25, 2006. See http://www.mkbergman.com/229/climbing-the-data-federation-pyramid/.
[7] M.K. Bergman 2008. Thinking ‘Inside the Box’ with Description Logics, AI3:::Adaptive Information blog, November 10, 2008. See http://www.mkbergman.com/466/thinking-inside-the-box-with-description-logics/.
[8] See the thread on the W3C semantic web mailing list beginning at http://lists.w3.org/Archives/Public/semantic-web/2014Jul/0129.html.
[9] Examples of upper-level ontologies include UMBEL, the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc and BFO (Basic Formal Ontology). Most of the content in their upper levels is more akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget’s Thesaurus) than to “generic common knowledge.” Almost all of them have both a hierarchical and networked structure, though their actual subject structure relating to concrete things is generally pretty weak. See further the Wikipedia entry on upper ontologies.
[10] M.K. Bergman 2014. Spring Dawns on Artificial Intelligence, AI3:::Adaptive Information blog, June 2, 2014. See http://www.mkbergman.com/1731/spring-dawns-on-artificial-intelligence/.
[11] M.K. Bergman 2011. Ontology-driven Apps Using Generic Applications, AI3:::Adaptive Information blog, March 7, 2011. See http://www.mkbergman.com/948/ontology-driven-apps-using-generic-applications/.
Posted: August 12, 2014

Defining the Guideposts for Big Data

In our recent two-part series we described a decade of experience working in the semantic Web (Part I) and our view that Big Structure, which resides at the nexus of the semantic Web, knowledge bases and artificial intelligence, was a key component of making sense of Big Data going forward (Part II). We are at a time when multiple advances are conjoining to create new opportunities and excitement.

Data without context and relationships is meaningless. The idea of Big Data is powerful, but it is often presented as either a “good thing” in and of itself, or a mantra for something that is rather undefined. There is no doubt that with the Internet and the Web we are now able to generate and access data at unprecedented scale. There is also no question that tracking mechanisms and cheap storage — and simpler, large-scale databases and Web services — mean that we can also capture data and structure of kinds previously unseen. Everyone knows the remarkable growth in exabytes and more.

The prospect of data everywhere — some useful with important context and some not — has clearly captured the current discussion. Heck, if we claim Big Data, we even make more in wage or consulting charge-out fees. Who can argue with that?

Well, actually, anyone interested in meaningful data or cross-dataset interoperability can argue with that. Big Data is great, except it means little if we cannot combine that data across multiple sources for potentially multiple purposes. (Remember, one of the “V’s” of Big Data is variability.) Once the question of what data means gets brought to the fore, it becomes a matter of context and relationships. Structure in an information context means that which situates or describes data in an interpretable way. Big Data needs a Big Structure complement to make sense of it all.

What is a Big Structure?

Big Structure is data relationships and context that can be combined into a coherent framework to enable dataset interoperability and understanding. By necessity, Big Structure implies that the meaning of data can be understood and its values can be brought to common bases such that analysis, testing and validation can be applied across values. Big Structure is not a monolithic thing, but the combination of multiple things that give data meaning and context. As such, Big Structure is often a re-purposing of existing structural assets, plus other special sauce, organized for the aim of data interoperability.


The components of Big Structure can be identified and characterized. These components can be assessed for usefulness and authoritativeness, and then incorporated into broader structures that ultimately bring the topics of what the data is about and the values of that data into alignment. Thus, Big Structure is also a mindset and approach to selecting and combining structures such that broad dataset interoperability can be achieved.

Big Structure is actually a continuum or family of concept and data relationships, any one of which is also a contributor to helping to map and interoperate data. Ultimately, the components of Big Structure get combined into reference graph structures that place the concepts and actual data values of the Big Data into context. There are certain ways to use and organize existing structures to achieve these Big Structure objectives; some of these ways are described in this article.

Once the components of Big Structure are combined into these reference graphs, we can then use network or graph analysis to understand the relationships amongst the constituent data items. This recursive nature, using graph reference structures to organize the constituent data and then using those same graphs to analyze it, is one of the hallmark characteristics of Big Structure.

Big Structure thus involves the need to identify and then organize constituent forms of structure into coherent reference frameworks. Concepts in contributing datasets are then mapped to these structures, and the attributes and values of the underlying data are also transformed into canonical representations. It is these mappings and transformations that provide the interoperability of Big Structure. Big Structure therefore continues to evolve by adding more and more reference structures, all coherently organized.

Contributors to Big Structure

Big Structure is a family of canonical reference structures that help guide mapping and interoperability. The table below lists some of the possible contributors to Big Structure [1], roughly in descending order as to the degree of structure and its contribution to interoperability. The table provides both definitions and use descriptions for each component, plus optionally some notes regarding coverage and use:

Structure Type | Definition | Use | Note
Reference ontologies Major grounding structures for orienting and interoperating concepts or data The reference concepts for orienting all data and domain information [2]
Reference attributes Major grounding structures for interoperating data and data characterizations The reference relationships amongst data descriptions and characteristics, which also provides the means for transformations between heterogeneous representations [3]
Data model (RDF) A self-consistent means for describing the structure of data and their relationships The “canonical” data model at the heart of the system; provides a single interoperability point; RDF is the canonical model used by Structured Dynamics for its Big Structures [4]
Domain attributes The data descriptions and characteristics for the constituent datasets in the applicable domain(s) The reference attributes specific to the domain(s) at hand (which are generally more specific than general reference attributes)
Domain ontologies The formal conceptualization of a domain, using a shared vocabulary to denote the types, properties and interrelationships of those concepts The reference concepts and their relationships specific to the domain(s) at hand; generally are mapped to the reference ontologies [5]
Concept maps A diagram that depicts suggested relationships between concepts Structurally similar to a domain ontology; a few related terms shown in Note [6]
Schema The structure of a database that defines the objects and relationships in that database Organizing framework for relational databases (and their tables) [7]
Mappings The process of creating data element correspondences between two distinct data models or schema Mapping predicates are used to relate concepts or attributes from two different datasets or knowledge bases to one another. Mappings are often a precursor to various transformations to bring data into a common representation [8]
Taxonomies A particular classification of related concepts, often of a hierarchical nature Hierarchical relationships are expressed in narrower or broader terms (or subClassOf); there may also be “see also” relationships [9]
Facets Clearly defined, mutually exclusive, and collectively exhaustive aspects, properties or characteristics of a class or specific subject Facets can provide alternative ways for classifying objects beyond a single taxonomy
Categories Grouping objects based on similar properties A category may be viewed as equivalent to a concept [10]
Tables A collection of related data held in a structured format, generally a two-dimensional layout of rows (records) and columns (fields) Simplest and most common data presentation format
Synsets A group of data elements or terms that are considered semantically equivalent for the purposes of information retrieval Also known as a “semset” in the parlance of UMBEL
Metadata Data providing information about one or more aspects of the source data, thus “data about data” It is the description of what data is about rather than the values and attributes of the actual data
Thesauri A form of controlled vocabulary that seeks to dictate semantic manifestations of metadata in the indexing of content objects A thesaurus is composed of a list of words (or terms), a vocabulary for relating these words (or terms) to one another, often hierarchically, and a set of rules on how to use these aspects
Gazetteers A listing of similar entity types with associated structural data (such as countries and population or standard codes) Often used in relation to people or place entity types, though any class of entities may have a gazetteer
Controlled vocabularies The use of predefined, authorized terms as preselected by the sponsor to enforce consistency in terminology Applied to specific domains or sub-domains, with single controlled vocabularies per official language used
Reference lists Authoritative listings of similar objects, each uniquely identified by name or code May be as simple as a comprehensive list of countries with associated ISO codes [11]
Dictionaries A repository of information about data such as meaning, relationships to other data, origin, usage, or format In our context, can range from the meaning associated with standard word dictionaries to the more formal data dictionary
Glossaries An alphabetical list of terms in a particular domain with the definitions for those terms Definition is the only structured information provided
Nested lists Related concepts or entities organized by some form of hierarchical relationship (narrower, broader, subClassOf, etc.) Akin to a simple taxonomy
Ordered lists A finite, ordered collection of values for a given type May also be additional information linked to the listing
Clusters A set of objects grouped according to some basis of similarity (type, attributes, or characteristics) Basis for how the objects got clustered is not always obvious
Unordered lists A container of similar items or entities, with no implied order or sequence Also known as a “bag” or “collection” [12]
Values The actual data; a normal form or a type member Basic QUDT ontologies could contribute here

An alternate way to look at these contributor structures is to characterize them with respect to degree of structure and degree of contributing to interoperability:

Structure v Interoperability

In general, as might be expected, the greater the degree of structure, the greater its potential contribution to interoperability. The components in the upper right quadrant represent the most structured and interoperable ones. These also conform most to the use of W3C standards for the RDF data model and the OWL ontology languages. Expressions of structure are codified and standardized. Use of best practices also ensures completeness and suitability as reference groundings for interoperability.

The lower left portion of the chart represents the least structure and interoperability. However, as standard reference means for characterizing and describing data, even structures in this quadrant can contribute to meeting Big Structure requirements. Tagging of documents (unstructured data) occurs in this less-sophisticated lower left quadrant, but it gives equal footing to the 80% of content that generally resides in text form. (The interoperability system is further enhanced when the basis of the tags is derived from the “semsets” of the reference and domain ontologies, another example of a best practice.)
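As a small illustration of that practice, the sketch below tags free text against the “semsets” (label variants) of a few reference concepts; the concept URIs and labels are invented for the example:

```python
# A toy tagger: match document text against the semsets (label variants) of
# reference concepts, so unstructured content can be related to the same
# structure as conventional data. Concepts and labels are invented here.
import re

SEMSETS = {
    "http://example.org/concept/UnitedStates": ["united states", "usa", "america"],
    "http://example.org/concept/SemanticWeb": ["semantic web", "linked data"],
}

def tag_text(text: str) -> set:
    """Return the reference concepts whose semset labels appear in the text."""
    lowered = text.lower()
    tags = set()
    for concept, labels in SEMSETS.items():
        if any(re.search(r"\b" + re.escape(label) + r"\b", lowered)
               for label in labels):
            tags.add(concept)
    return tags

print(tag_text("The USA has been slow to adopt the semantic Web."))
```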

All of the listed components can thus contribute to Big Structure. However, the completeness of that structure and its usefulness for interoperability increases as one progresses along the blue arrow of the Big Structure continuum. Data interoperability arises from the continued efforts to drive Big Structure to the upper right of this quadrant. As noted, Big Structure is a mindset and process rather than some finite state. As more concepts and attributes get grounded in standard references, the degree of Big Structure (and, thus, data interoperability) continues to increase.

The Foundation of Reference Groundings

In both semantics and artificial intelligence — and certainly in the realm of data interoperability — there is always the problem of symbol grounding. In the conceptual realm, symbol grounding means that when we use a term or phrase we are referring to the same thing; that is, the referent is the same. In the data value realm, symbol grounding means that when we refer to an object or a number — say, the number 4.1 — we are also referring to the same metric. 4.1 inches is not the same as 4.1 centimeters or 4.1 on the Richter scale, and object names for set member types also have the same challenges of ambiguous semantics as do all other things referred to by language.
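A minimal sketch of such grounding for data values: a bare 4.1 becomes comparable across datasets only once it carries an explicit unit reference that can be converted to a fixed canonical basis. The unit URIs and factors below are illustrative stand-ins for a real unit vocabulary such as QUDT:

```python
# Grounding a value: pair the number with an explicit unit reference, then
# convert to one canonical unit so different datasets share a fixed basis.
# Unit URIs and conversion factors here are illustrative placeholders.
UNIT = "http://example.org/unit/"
TO_METERS = {UNIT + "Inch": 0.0254, UNIT + "Centimeter": 0.01, UNIT + "Meter": 1.0}

def ground_length(value: float, unit_uri: str) -> float:
    """Re-express a length value against the canonical meter reference."""
    return value * TO_METERS[unit_uri]

# The bare number 4.1 means different things depending on its unit reference
print(ground_length(4.1, UNIT + "Inch"))        # ~0.104 m
print(ground_length(4.1, UNIT + "Centimeter"))  # ~0.041 m
```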

The variability V in Big Data or the 40-some dimensions of potential semantic heterogeneity [13] are explicit recognitions of the symbol grounding challenge. Assuming we can determine context (itself an important consideration not further discussed here), fixity of reference is essential to these groundings. Context and groundings are the ways by which we remove ambiguity in what we measure and record.

Like dictionaries for human languages, or stars and constellations for navigators, or agreed standards in measurement, or the Greenwich meridian for timekeepers, fixed references are needed to orient and “ground” each new dataset over which we attempt to integrate. Without such fixities of reference, everything floats in reference to other things, the cursed “rubber ruler” phenomenon.

Thus, we can express our Big Structure components from a foundational perspective as well. In Structured Dynamics‘ view of the world, the foundation for data interoperability is grounded in reference structures or ontologies that provide the fixity of reference for concepts and data and their attributes. Upon these foundations are then constructed the domain views of concepts and attributes, which become the target for mapping other references and Big Structures:

Foundations to Big Structure

The mappings, transformations and domain and reference ontologies are themselves written in the OWL languages of the W3C and the standards of the RDF data model. At this most expressive end of Big Structure, the representations are in the form of graphs. Network and graph analytics will further expand business intelligence prospects. The use of these standards with common and testable logic is another means to ensure coherency and interoperability of the Big Structure that results.
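As a small sketch of that last point, the snippet below loads a few invented triples with rdflib, converts them to a networkx graph, and runs a simple centrality measure; both libraries and the toy triples are assumptions of the example:

```python
# Convert a (tiny, invented) RDF graph into a networkx directed graph and
# run a basic graph metric over it; heavily mapped-to nodes surface as hubs.
import networkx as nx
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.DatasetA, EX.mappedTo, EX.ReferenceOntology))
g.add((EX.DatasetB, EX.mappedTo, EX.ReferenceOntology))
g.add((EX.ReferenceOntology, EX.groundedIn, EX.UpperOntology))

nxg = nx.DiGraph()
for s, p, o in g:                    # each triple becomes a labeled edge
    nxg.add_edge(s, o, predicate=p)

print(nx.degree_centrality(nxg))     # the reference ontology scores highest
```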

Note a key aspect of the grounding foundation is missing: one or more reference ontologies for attributes. Though many examples exist on the concept side, little has been done to explicitly address the questions of data value interoperability. This major gap is a current emphasis of Structured Dynamics, with much that will be said over the coming weeks. Also expect an open source reference ontology for attributes in the near future.

The thing is that we are learning how to make the various parts of this interoperability stack work. We are leveraging existing structural assets of all kinds to establish the semantics and infrastructure for domain interoperability. We know how to match and map these existing structural assets to the reference frameworks that are the foundation to interoperability.

A Vision of Interoperability

The real world is one of heterogeneous datasets, multiple schema and differing viewpoints. Even within single enterprises — including those which formerly expressed little need or interest in interoperating with the broader world — data integration and interoperability have been a real challenge. Big Data itself is not solving these problems. Quite the opposite. Big Data trends are turning data interoperability molehills into mountain-high competitive threats.

Like any well-built structure, data interoperability requires a solid foundation. That foundation must reside in exemplar reference ontologies upon which to ground the semantics and exchange standards for data. Using the canonical RDF data model makes this task practical. Existing information structures of various types across the enterprise and the Web all can and should play a role in establishing reference structures. The accretion of reference structures will lead to still further interoperability and the ability to incorporate more datasets. Currently expensive practices in, say, master data management (MDM) can begin to transition to a new paradigm. It is easy to envision working from a library of existing reference standards for use across enterprises. This kind of incremental expansion of interoperability leads to still more interoperable data in a virtuous cycle of innovation and lower budgets.

As our computing continues to get more virtual and cloud-like, physical hardware and software architectures must give way to information architectures (in the true sense of interoperability). We have no choice but to treat the architecting of information as a first-order challenge. The totally cool thing about the data integration challenge is that the architecture can be readily varied and tested to achieve a working foundation. Much empirical information exists about how to do it and what to do next. The chief challenge has been to recognize that data interoperability — and its dependence on Big Structure — is a first-order concern (and opportunity). The intersection of Big Structure with Big Data, and with graph and AI algorithms, should create new approaches to chew across the data integration environment. I expect progress to be rapid.


[1] There are at least 40 terms or concepts across these various disciplines, most related to Web and general knowledge content, that have organizational or classificatory aspects that — loosely defined — could be called an “ontology” framework or approach. See M.K. Bergman, 2007. An Intrepid Guide to Ontologies, AI3:::Adaptive Information blog, May 16, 2007.
[2] UMBEL and other upper-level ontologies are examples here. In the case of UMBEL, that Big Structure is used as a scaffolding of reference concepts to link external (unrelated) structures and thereby help interoperate data between two unrelated systems. Such a Big Structure can also be used for other tasks, such as helping machine learning techniques to categorize and disambiguate pieces of data by leveraging such a structure of types.
[3] Unfortunately, no reference structures for attributes yet exist. For a discussion of this status, see the thread on the W3C semantic web mailing list beginning at http://lists.w3.org/Archives/Public/semantic-web/2014Jul/0129.html.
[4] Data models encompass a rather broad span. RDF represents the more formal end of the data model spectrum, wherein there are complete treatments of logic, syntax and serialization, more involved than for most data models.
[5] Domain ontologies represent the most closely-aligned view of the domain and its relationships of all of the component structures listed.
[6] Concept maps are very closely related to ontologies, and may include topic maps, mind maps and other graph-like structures of concepts.
[7] Schema may apply to many realms, but in the IT and software context schema mostly refers to database schema related to relational databases. These are often expressed in UML diagrams or XML schema.
[8] Mappings and transformations are a huge area of diverse structure and different serializations and specifications. Fortunately, the task of mapping external structure to RDF removes the many-to-many issues with most transformation approaches.
[9] Taxonomies mask an entire set of sub-categories, including directories, folksonomies, subject trees, and more. The key aspect is that relevant concepts are expressed in a graph relationship to other concepts, often in a hierarchical fashion.
[10] Categories also include the general classification process.
[11] I would consider a canonical reference listing of country names and codes to be a part of Big Structure, since they act as a controlled vocabulary.
[12] This is a key area for including unstructured documents, since tags are a primary means of adding metadata to a document. When the pool of tags is based on the governing reference and domain ontologies, then interoperability is further promoted.
[13] M.K. Bergman, 2006. Sources and Classification of Semantic Heterogeneities, AI3:::Adaptive Information blog, June 6, 2006.
Posted: July 16, 2014

Are We Losing the War? Was it Even the Right One?

Cinemaphiles will readily recognize Akira Kurosawa‘s Rashomon film of 1951. And, in the 1960s, one of the most popular book series was Lawrence Durrell‘s The Alexandria Quartet. Both, each in its own way, tried to get at the question of what is truth by telling the same story from the perspective of different protagonists. Whether you saw this movie or read these books you know the punchline: the truth was very different depending on the point of view and experience — including self-interest and delusion — of each protagonist. All of us recognize this phenomenon of the blind men’s view of the elephant.

I have been making my living and working full time on the semantic Web and semantic technologies now for a full decade. So has my partner at Structured Dynamics, Fred Giasson. Others have certainly worked longer in this field. The original semantic Web article appeared in Scientific American in 2001 [1], and the foundational Resource Description Framework data model dates from 1999. Fred and I have our own views of what has gone on in the trenches of the semantic Web over this period. We thought a decade was a good point to look back, share what we’ve experienced, and discover where to point our next offensive thrusts.

What Has Gone Well?

The vision of the semantic Web in the Scientific American article painted a picture of globally interconnected data leveraged by agents or bots designed to make our lives easier and more automated. However, by the time that I got directly involved, nearly five years after standards first started to be published, Tim Berners-Lee and many leading proponents of RDF were beginning to shift focus to linked data. The agents, and automation, and ontologies of the initial vision were being downplayed in favor of effective means to publish and consume data based on RDF. In many ways, linked data resembled a re-branding.

This break had been coming for a while, memorably captured by a 2008 ISWC session led by Peter F. Patel-Schneider [2]. This internal division of viewpoint likely caused effort to be split that would have been better spent in proselytizing and improving tools. It also diverted energy into internal squabbles. While many others have pointed to the tactical mistake of using an XML serialization for early versions of RDF as a key factor in slowing initial adoption, a factor I agree was at play, my own suspicion is that the philosophical split taking place in the community was the heavier burden.

Whatever the cause, many of the hopes of the heady days of the initial vision have not been obtained over the past fifteen years, though there have been notable successes.

The biomedical community has been the shining exemplar for data interoperability across an entire discipline, with earth sciences, ecology and other science-based domains also showing interoperability success [3]. Families of ontologies accompanied by tooling and best practices have characterized many of these efforts. Sadly, though, most other domains have not followed suit, and commercial interoperability is nearly non-existent.

Almost all of the remaining success has resided in single-institution data integration and knowledge representation initiatives. IBM’s Watson and Apple’s Siri are two amazing capabilities run and managed by single institutions, as is Google’s Knowledge Graph. Also, some individual commercial and government enterprises, willing to pay support to semantic technology experts, have shown success in data integration, using RDF, SKOS and OWL.

We have seen the close kinship between natural language, text, and Q & A with the semantic Web, also demonstrated by Siri and more recent offshoots. We have seen a trend toward pairing great-performing open source text engines, notably Solr, with RDF and triple stores. Recommendation systems have shown some success. Linked data publishing has also had some notable examples, including the first of the lot, DBpedia, with certain institutional publishers (such as the Library of Congress, Eurostat, The Getty, Europeana, OpenGLAM [galleries, archives, libraries, and museums]) showing leadership and the commitment of significant vocabularies to linked data form.

On the standards front, early experience led to new and better versions of the SPARQL query language (SPARQL 1.1 was greatly improved in the last decade and appears to be one capability that sells triple stores), RDF 1.1 and OWL 2. Certain open source tools have become prominent, including Protégé, Virtuoso (open source) and Jena (among unnamed others, of course). At least in the early part of this history, tool development was rapid and flourishing, though the innovation pace has dropped substantially according to my tracking database Sweet Tools.

What Has Disappointed?

My biggest disappointments have been, first, the complete lack of distributed data interoperability, and, second, the lack or inability of commercial enterprises to embrace and adopt semantic technologies on their own. The near absence of discussion about instance records and their attributes helps frame the current maturity of the semantic Web. Namely, it has yet to crack the real nuts of data integration and interoperability across organizations. Again, with the exception of the biomedical community, neither in the linked data realm nor in the broader semantic Web, can we point to information based on semantic Web principles being widely shared between systems and organizations.

Some in the linked data community have explicitly acknowledged this. The abstract for the upcoming COLD 2014 workshop, for example, states [4]:

. . . applications that consume Linked Data are not yet widespread. Reasons may include a lack of suitable methods for a number of open problems, including the seamless integration of Linked Data from multiple sources, dynamic discovery of available data and data sources, provenance and information quality assessment, application development environments, and appropriate end user interfaces.

We have written about many issues with linked data, ranging from the use of improper mapping predicates; to the difficulty in publishing; and to dereferencing URIs on the Web since they are sparse and not always properly implemented [5]. But ultimately, most linked data is just instance data that can be represented in simpler attribute-value form. By shunning a knowledge representation language (namely, OWL) at the processing end, we have put too much burden on what are really just instance records. Linked data does not get the balance of labor right. It ignores the reality that data consumers want actionable information over being able to click from data item to data item, with overall quality reduced to the lowest common denominator. If a publisher has the interest and capability to publish quality linked data, great! It should become part of the data ingest pool and the data becomes easy to consume. But to insist on linked data across the board creates unnecessary barriers. Linked data growth has not nearly kept pace with broader structured data growth on the Web [6].

At the enterprise level, the semantic technology stack is hard to grasp and understand for newcomers. RDF and OWL awareness and understanding are nearly nil in companies without prior semantic Web experience, or 99.9% of all companies. This is not a failure of the enterprises; it is the failure of us, the advocates and suppliers. While we (Structured Dynamics) have developed and continue to refine the turnkey Open Semantic Framework stack, and have spent more efforts than most in documenting and explicating its use, the systems are still too complicated. We combine complicated content management systems as user front-ends to a complicated semantic technology stack that needs to be driven by a complicated (to develop) ontology. And we think we are doing some of the best technology transfer around!

Moreover, while these systems are good at integrating concepts and schema, they are virtually silent on the question of actual data integration. It is shocking to say, but the semantic Web has no vocabularies or tools sufficient to enable data items for the same entity from two different datasets to be combined or reconciled [7]. These issues can be solved within the individual enterprise, but again the system breaks when distributed interoperability is the desire. General Web-based inconsistencies, such as in HTML coding or mime types, impose hurdles on distributed interoperability. These are some of the reasons why we see the successes in the context (generally) of single institutions, as opposed to anything that is truly yet Web-wide.
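To make the gap concrete, here is a minimal, purely illustrative sketch of the kind of ad hoc reconciliation step that today falls to individual projects: merging attribute-value records for the same entity from two datasets and flagging conflicts that would need a transformation or a human decision. The field names and merge policy are assumptions:

```python
# Merge two attribute-value records asserted to describe the same entity,
# keeping agreed or one-sided values and flagging conflicts for follow-up.
def reconcile(record_a: dict, record_b: dict):
    """Return (merged record, conflicting attribute values)."""
    merged, conflicts = {}, {}
    for key in set(record_a) | set(record_b):
        a_val, b_val = record_a.get(key), record_b.get(key)
        if a_val is None or b_val is None or a_val == b_val:
            merged[key] = a_val if a_val is not None else b_val
        else:
            conflicts[key] = (a_val, b_val)   # needs a transform or human review
    return merged, conflicts

berlin_a = {"label": "Berlin", "population": 3645000}
berlin_b = {"label": "Berlin", "population": 3769000, "country": "Germany"}
print(reconcile(berlin_a, berlin_b))
```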

These points, as is often the case with software-oriented technologies, come down to a disappointing state of tooling. Markets drive developer interest, and market share has been disappointing; thus, fewer tools. Tool interest comes from commercial engagements, and not generally grants, the major source of semantic Web funding, particularly in the European Union. Pragmatic tools that solve real problems in user adoption are rarely a sufficient basis for getting a Ph.D.

The weaknesses in tooling extend from basic installation, to configuration, unit and integrated tests, data conversion and lifting, and, especially, all things ontology. Weaknesses in ontology tooling include (critically) mapping, consistency and coherency checking, authoring, managing, version control, re-factoring, optimization, and workflows. All of these issues are solvable; they are standard software challenges. But it is hard to conquer markets largely with the wrong army pursuing the wrong objectives in response to the wrong incentives.

Yet, despite the weaknesses in tooling, we believe we have been fairly effective in transferring technology to our clients. It takes more documentation and more training and, often, accompanying tool development or improvement in the workflow areas critical to the project. But clients need to be told this as well. In these still early stages, successful clients are going to have to expend more staff effort. With reasonable commitment, it is demonstrable that an enterprise can take over and manage a large-scale semantic engagement on its own. Still, for semantic technologies to have greater market penetration, it will be necessary to lower those commitments.

How Has the Environment Changed?

Of course, over the period of this history, the environment as a whole has changed markedly. The Web today is almost unrecognizable from the Web of 15 years ago. If one assumes that Web technologies tend to have a five year or so period of turnover, we have gone through at least two to three generations of change on the Web since the initial vision for the semantic Web.

The most systemic changes in this period have been cloud computing and the adoption of the smartphone. These, plus the network of workstations approach to data centers, have radically changed what is desirable in a large-scale, distributed architecture. APIs have become RESTful and database infrastructures have become flatter and more distributed. These architectures and their supporting infrastructure — such as virtual servers, MapReduce variants, and many applications — have in turn opened the door to performant management of large volumes of flat (key-value or graph) data, or big data.

On the Web side, JavaScript, just a few years older than the semantic Web, is now dominant in Web pages and taking on server-side roles (such as through Node.js). In turn, JSON has grown in popularity as a form of data representation and transfer and is being adopted by the semantic Web (along with codifying CSV). Mobile, too, affects the Web side because of the need for multiple-platform deployments, touchscreen use, and different user interface paradigms and layout designs. The app ecosystem around smartphones has become a huge source of change and innovation.

Extremely germane to the semantic Web — indeed, overall, for artificial intelligence — has been the emergence of knowledge-based AI (KBAI). The marrying of electronic Web knowledge bases — such as Wikipedia or internal ones like the Google search index or its Knowledge Graph — with improvements in machine-learning algorithms is systematically mowing down what used to be called the Grand Challenges of computing. Sensors are also now entering the picture, from our phones to our homes and our cars, which exposes the higher-order requirement for data integration combined with semantics. NLP kits have improved in accuracy and execution speed; many semantic tasks such as tagging, categorizing or question answering already perform at acceptable levels for most projects.

On the tooling side, nearly all building blocks for what needs to be done next are available in open source, with some platform areas quite functional (including OSF, of course). We have also been successful in finding clients that agree to open source the development work we do for them, since they are benefiting from the open source development that went on before them.

What Did We Set Out to Achieve?

When Structured Dynamics entered the picture, there were already many tools available and the core languages had been released. Our view of the world at that time led us to adopt two priorities for what we thought might be a five-year or so plan. We have achieved the objectives we set for ourselves then, though it has taken us a couple of years longer than planned to realize them.

One priority was to develop a reference structure for concepts to serve as a “grounding” basis for relating datasets, vocabularies, schema, taxonomies, or ontologies. We achieved this with our first commercial release (v 1.00) of UMBEL in February 2011. Subsequent to that we have progressed to v 1.05. In the coming months we will see two further major updates that have been under active effort for about eight months.

The other priority was to create a turnkey foundation for a semantic enterprise. This, too, has been achieved, with many more releases. The Open Semantic Framework (OSF) is now at version 3.00, backed by a training documentation and technical wiki of some 500 articles. Support tooling now includes automated installation, testing, and data transfer and synchronization.

Because our corporate objectives were largely achieved, it was time to look at lessons learned and set new directions. This article, in part, is a result of that process.

How Did Our Priorities Evolve Over the Decade?

I thought it would be helpful to use the content of this AI3 blog to track how concerns and priorities changed for me and Structured Dynamics over this history. Since I started my blog quite soon after my entry into the semantic Web, the record of my perspectives is roughly coterminous with that history and rather complete.

The fifty articles below trace my evolution in knowledge and skills, as well as a progression from structured data to the semantic Web. They represent about 11% of all articles in my chronological archive and were selected as the most germane to the question of the evolution of the semantic Web.

After an early ramp-up, most of the formative discussion below occurred in the early years. Posts have declined most recently as implementation has taken over. Note that most of the links below have PDFs available from their main pages.

(The 50 articles, grouped by year from 2005 through 2014, are listed with links in the original post.)

The early years of this history were concentrated on gathering background information and getting educated. The release of DBpedia in 2007 showed how knowledge bases would become essential to the semantic Web. We also identified that a lack of shared reference concepts was making it difficult to “ground” different semantic Web datasets or schema to one another. Another key theme was the diversity of native data structures on the Web, but also how all of them could be readily represented in RDF.

By 2008 we began to study the logical underpinnings of the semantic Web as we were coming to understand how it should be practiced. We also began studying Web-oriented architectures as key design guidance going forward. These themes continued into 2009, though now informed by clients and applications, which expanded our understanding of requirements (and, sometimes, shortcomings) in the enterprise marketplace. The fit between an open-world approach and the inherently open nature of knowledge management was cementing our sense of the role of semantic solutions in the overall information space. The general community shift to linked data was beginning to surface worries.

2010 marked a shift for us to become more of a popularizer of semantic technologies in the enterprise, useful to attract and inform prospects. The central role of ontologies as the guiding structures for OSF (either as codified knowledge structures or as instruction sets for the platform) led to the realization that generic software could be designed once and re-used in nearly any knowledge domain simply by changing the data and ontologies that guide it. This increased our efforts in ontology tooling and training, now geared more to the knowledge worker. The importance of groundings for aligning schema and data caused us to work hard on UMBEL in 2011 to get it to a commercial release state.

All of these efforts were converging on design thoughts about the nature of information and how it is signified and communicated. The basis of an overall philosophy regarding our work emerged around the teachings of Charles S. Peirce and Claude Shannon. Semantics and groundings were clearly essential to convey accurate messages. Simple forms, so long as they are correct, are always preferred over complex ones because message transmittal is more efficient and less subject to losses (inaccuracies). How these structures could be represented in graphs affirmed the structural correctness of the design approach. The now obvious re-awakening of artificial intelligence helps to put the semantic Web in context: a key subpart, but still a subset, of artificial intelligence. The percentage of formative articles directly related to the semantic Web has dropped markedly over these last couple of years, as the emphasis continues to shift to technology transfer.

What Else Did We Learn?

Not all lessons learned warranted an article on their own. So, we have also reflected on what other lessons we learned over this decade. The overall theme is: Simpler is better.

Distributed data interoperability across the Web is a fundamental weakness. There are no magic tricks to integrate data. Data mapping and integration will always require massaging. Each data integration activity needs its own solution. However, it can be greatly helped with ontologies and with better tooling.

In keeping with the lesson of grounding, a reference ontology for attributes is missing. It is needed as a bridge across disparate datasets describing similar entities or using different attributes for the same entities. It is also a means to reduce the pairwise combinatorial issue of integrating multiple datasets. And, whatever is done in the data integration area, an open world approach will be essential given the open nature of knowledge.

There is good design and best practice for distributed architectures. The larger these installations become, the more important it is to use a lightweight, loosely-coupled design. RESTful Web services and their interfaces are key. Simpler services with fewer functions can be designed to complement one another and increase throughput effectiveness.

Functional programming languages align well with the data and schema in knowledge management functions. Ontologies, as structures, also fit well with functional languages. The ability to create DSLs should continue to improve, bringing the knowledge management function directly into the hands of its users, the knowledge workers.

In a broader sense, alluded to above, the semantic Web is but a set of concepts. There are multiple ways to use it. It can be leveraged without requiring “core” semantic Web tools such as triple stores. Solr can act as a semantic store because semantics, NLP and search are naturally married. But, the semantic Web, in turn, needs to become re-embedded in artificial intelligence, now backed by knowledge bases, which are themselves creatures of the semantic Web.
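
As a minimal sketch of that Solr point (my own illustration, not OSF code): treat a plain Solr core as a simple entity store, where typed documents and their attributes stand in for triples. The local URL, the core name “entities” and the dynamic-field suffixes are assumptions for the example.

```python
import requests

# Sketch: index an entity description into a plain Solr core so that faceting and
# full-text search stand in for simple semantic queries. Assumes a local Solr
# instance with a core named "entities" whose schema (or schemaless mode) accepts
# the dynamic-field suffixes used below; all names here are illustrative.
SOLR = "http://localhost:8983/solr/entities"

doc = {
    "id": "http://example.org/resource/Steam_engine",      # the entity URI as document key
    "type_s": "Device",                                     # rdf:type analogue
    "label_t": "Steam engine",                              # searchable label
    "inventedBy_s": "http://example.org/resource/Thomas_Newcomen",
    "description_t": "A heat engine that performs mechanical work using steam.",
}

resp = requests.post(f"{SOLR}/update?commit=true", json=[doc], timeout=10)
resp.raise_for_status()

# A "semantic-ish" lookup: all entities of type Device whose description mentions steam
params = {"q": "description_t:steam", "fq": "type_s:Device", "wt": "json"}
hits = requests.get(f"{SOLR}/select", params=params, timeout=10).json()
print(hits["response"]["numFound"])
```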

Design needs to move away from linked data or the semantic Web as the goals. The building blocks are there, though perhaps not yet combined or expressed well. The real improvements now to the overall knowledge function will result from knowledge bases, artificial intelligence, and the semantic Web working together. That is the next frontier.

Overall, we perhaps have been in the wrong war for the wrong reasons. Linked data is certainly not an end and mostly appears to represent work, rather than innovation. The semantic Web is no longer the right war, either, because improvements there will not come so much from arguing semantic languages and paradigms. Learning how to master distributed data integration will teach the semantic Web much, and coupling artificial intelligence with knowledge bases will do much to improve the most labor-intensive stumbling blocks in the knowledge management workflow: mappings and transformations. Further, these same bases will extend the reach into analytical and statistical realms.

The semantic Web has always been an infrastructure play to us. On that basis, it will be hard to ever judge market penetration or dominance. So, maybe in terms of a vision from 15 years ago the growth of the semantic Web has been disappointing. But, for Fred and me, we are finally seeing the landscape clearly and in perspective, even if from a viewpoint that may be different from others’. From our vantage point, we are at the exciting cusp of a new, broader synthesis.

NOTE: This is Part I of a two-part series. Part II will appear shortly.

Posted:June 15, 2014

Unpacking the Growth-producing Factors of Production

In my last article on artificial intelligence, I made the statement that “. . . innovation is the source of wealth creation.” I made that unquestioned statement as part of my reflexive world view. But, when I re-read the article after its posting, I asked myself: What are the actual arguments and evidence for this innovation-to-wealth assertion? Surprisingly, there is not nearly the evidential basis for this assertion that I would have assumed.

Since Adam Smith, the signal focus of economics has been its attempt to explain the basis of growth. This is not surprising since the birth of the field of economics also corresponded to an historically unprecedented inflection point in economic growth (see next). Smith ascribed this source to productivity resulting from the division of labor using his famous example of the pin factory. But it is really only within the past fifty years or so that economists have begun unpacking the growth function from the other factors of production.

Growth is a percent increase from a prior state. In economic terms, growth compounded over a period of time has the virtuous reward of resulting in increased wealth. Economic growth is often measured through such means as revenues (for the individual firm) or GDP (for regions or countries). Net worth (for the firm) or GDP per capita or net worth (for individuals) measure the wealth associated with the current stock of economic goods at any given point in time. And, of course, wealth alone also masks the importance of changes in comfort, convenience, freedom, choice, leisure, mobility and other values that may accompany growth and transcend the material. Too, some “externalities” of economic growth may be negative, such as congestion or pollution, but it is also true that wealthier societies tend to regulate against these effects.

Not only have we seen discontinuities in growth (and then wealth) throughout history, but we see them today between individuals, firms, industries, cities, regions and nations. Unpacking the economic factors of production that lead to growth thus has immense importance across the entire economic spectrum — from individuals to nations. Explicating and then managing these factors are intuitively a basis for improving the welfare of any economic actor. Unlocking the nature of growth, or better understanding that nature, should aid in helping to promote still further growth and wealth. Though questions of distribution and fairness may remain, a rising tide lifts all boats.

Thus, understanding the basis of growth, sustained over time, leading to greater wealth for individuals or nations, is the central question facing economics. And, as we see below, that understanding in turn is intimately related to the importance of information and innovation.

The Common Sense Argument

If we toil, year by year, doing the same activity, like growing wheat, and we gain the same harvest for the same labor and land and inputs, that is what we expect. Yet sometimes, the weather or rainfall patterns may differ, or we may have more children helping us in the fields, or a mule to help plow. Money helps us buy more of the important inputs, maybe more land, more mules or the comfort to have more children. These are the traditional factors of production: that is, land, capital and labor.

If we add more of these factors to the mix, we still understand we have merely tweaked the standard basis of our wheat production. Differences in the amount of these factors of production, throughout most of human history, are what accounted for the differences between rich and poor, landlord and serf. If, by virtue of having more land or children, we are now able to feed more people, we are by first definitions more wealthy, and if we can accumulate more of this wealth, we can leverage these standard factors even more. When we can keep more of what we produce we become more wealthy. Control and exploitation have been logical paths to much wealth creation.

These factors are pretty easy to observe and track. We intuitively understand that more inputs of labor, land or capital can themselves result in growth, but a growth that feels and appears rather fixed based on the change in these inputs. This kind of growth has a more-or-less trending return based on changes in these inputs. These types of inputs may also be subject to diminishing returns, wherein adding more of a given factor produces diminishing or negative payoff. For example, adding more fertilizer to the wheat crop produces less per unit output yield after some optimum, and then can actually reduce yields by burning the crop. Or, while a computer increases the productivity of an individual worker, giving her more computers may actually degrade her overall performance.

But there is also clearly a different kind of growth that is not constrained to a fixed or declining return based on inputs. Perhaps we have a neighbor that raises more wheat, possibly on drier, more marginal land, or with less water or fertilizer. His yield exceeds our own. These differences occur because our neighbor is doing something different and is producing more given his inputs.

Innovation is an individual affair in its discovery, but a communal one in its application. (At which point it is known as information.) Better ways of planting or spacing the wheat, perhaps using a plow, or selecting certain wheat strains for next year’s plantings, or irrigating the land, or providing harnesses to the mules, or dividing and specializing the responsibilities amongst the children, can result in real differences in how much gets harvested for a “similar” set of inputs. And, what I initially innovate, becomes information for the next farmer to emulate. Some of these innovations are new devices, such as harnesses or plows. Some of these innovations are new practices, such as tilling or irrigation methods or specializations in tasks or labor. And, of course, not every farmer must innovate on his own. Copying and imitation diffuse these changes across farms and workers.

Truly, for millennia, this is how human progress took place. Some innovations, such as fire, the wheel, iron and bronze, the arch, alphabets, the plow and the yoke, had material benefits to all who encountered them. These innovations were fundamental and diffused at the pace of human movement. But, one could argue, each was understood to be a flash of insight, and not a product of systemic information and process. Further, innovations tended to diffuse slowly, along the pace and concentration of trade routes. The innovative event was quite rare, and most practices had been stable for centuries. It is not at all surprising that early economic ideas tended to focus on the traditional factors of production of land, labor and capital. These had been the steady constants for what had been very slow growth for centuries.

But then a real discontinuity in economic growth compared to all previous recorded history occurred in the early 1800s. Historically flat income averages skyrocketed, as this famous figure showing global changes in per capita (person) GDP from Angus Maddison illustrates [1].

William Nordhaus has captured a similar discontinuity looking at the price of light, normalized according to the labor effort needed to obtain 1000 lumens of light. It, too, shows an exponential decrease in the price of lighting beginning about 1800 [2].

These comparatively abrupt changes in growth rates, and concomitant changes in wealth, that were orders of magnitude higher than what had been experienced before in human history, garnered the attention of economists and economic historians as never before [3].

From the beginning, this difference in growth rates was largely attributed to “technological change”, but the specific causes of this change have been ascribed to many things. The close concurrency to the Enlightenment suggested some fundamental change in thinking. Similarly, the concurrence with the Industrial Revolution suggested the importance of machines, prime movers and the harnessing of energy. Cultural and religious factors have been posited to explain why Britain and then the United States were the initial centers of growth. The invisible hand of the market and division of labor and specialization were advocated by Adam Smith. I have argued the importance of the mechanical printing press and pulp paper in bringing information to a broader swathe of society [3]. Education and support for basic and applied research have their advocates. Financial and banking innovations, and the rule of law and patents and other intellectual property rights, have also been cited as causes.

Common sense tells us that all of these factors, and perhaps more, can all work as force multipliers to the traditional inputs to the economic function.

But, until the mid-1950s, the broad sense of “technological change” and vague causative factors were more often than not argued in an anecdotal, literary way. Empirical datasets were few and far between to test hypotheses, and quantitative means of reasoning over economic problems were only just beginning. Economic growth theory was only just beginning to be an economic discipline in its own right.

The Theoretical Arguments

Joseph Schumpeter, in The Theory of Economic Development, first published in 1911, argued that innovation was central to economic growth and constantly disrupted the general equilibrium of market exchange [4]. Innovation granted the firm a temporary monopoly status in which to charge higher rents, thereby providing an incentive for further innovation. Schumpeter’s emphasis on entrepreneurship and his popularization of “creative destruction” recognized that new innovative market entrants may cause older firms to become obsolete. He tied these ideas into his basic views on business cycles, also driven by technological change. Innovation was central to Schumpeter’s economic world view.

But the theoretical story really begins in earnest after World War II when the hidden X factor of technological change — in what came to be expressed as total factor productivity — came to the fore to complete the economic growth equation [5].

The Exogenous Model

Robert Solow is an American economist particularly known for his work on the theory of economic growth; the exogenous growth model is named after his work. Solow took courses from Schumpeter at Harvard and was influenced by his views on innovation and technological change [6], though Solow was also part of the generation of economists embracing the new discipline of mathematical or quantitative economics, which was foreign to Schumpeter.

As noted, economic growth was known to go beyond the typical factors of production. Solow’s insight in two papers in 1956 and 1957, for which he won a Nobel prize, was that technological change, what he called “technological progress,” must be the “residual” left over from empirical growth once the traditional inputs of labor and capital are removed [7].  Using his model, Solow calculated that about 87.5% of the growth in US output per worker was attributable to technical progress [8]. A substitute term is total-factor productivity (TFP), the “residual” in total output not credited to the traditional inputs of labor and capital. By definition, TFP cannot be measured directly.

We can express this mathematically as showing total output (Y) as a function of total-factor productivity (A), capital input (K), labor input (L), and the two inputs’ respective shares of output (α and β are the capital input share of contribution for K and L respectively):

Y = A × K^α × L^β

These considerations make the exogenous growth model one of the neo-classical growth models, wherein the long-run rate of growth is exogenously supplied, apart from the internal growths of labor and capital. Within this camp, one explanation is based on the savings rate (the Harrod–Domar model); the other, as shown herein, is the rate of technological progress (Solow-Swan model [7]). By definition, in either of these so-called neo-classical models, the savings rate or the rate of technological progress remains unexplained. They are abstract external forces that are just “out there.”
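
As a toy illustration of how the “residual” is backed out in this kind of growth accounting (the numbers below are made up, purely for arithmetic; they are not Solow’s data):

```python
# Back out total-factor productivity (A) as the residual in Y = A * K^alpha * L^beta.
# All figures are invented solely to show the arithmetic.
def tfp_residual(Y, K, L, alpha=0.3, beta=0.7):
    return Y / (K ** alpha * L ** beta)

A_year1 = tfp_residual(Y=100.0, K=300.0, L=50.0)
A_year2 = tfp_residual(Y=106.0, K=306.0, L=50.5)

output_growth = 106.0 / 100.0 - 1     # 6.0% growth in total output
tfp_growth = A_year2 / A_year1 - 1    # the share attributed to "technical progress"
print(f"output growth: {output_growth:.1%}, TFP (residual) growth: {tfp_growth:.1%}")
```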

The TFP approach remains strong as a basis for estimating total non-traditional inputs to the production function. It also provides a specific target within quantitative economics to begin addressing explicitly a placeholder for innovation, technological change, information, or other non-traditional considerations for what constitutes the overall production function. But, frankly, TFP still is a blob that needs to be unpacked and teased apart.

The Endogenous Model

A seminal paper by Kenneth Arrow in 1962 introduced the concept and evidence for what he called “learning by doing”; what is now more formally understood and accepted as the learning curve. Unlike a specific innovation, the idea of the learning curve captured that experience and practice led to efficiencies and productivity. In other words, more of a given output can be produced with fewer inputs as we learn better how to do the task.
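
For a concrete feel, the familiar experience-curve form (a later, simpler expression of the idea than Arrow’s own 1962 model; the cost figures are invented) says that unit cost falls by a fixed percentage each time cumulative output doubles:

```python
import math

# Experience-curve sketch: an 80% progress ratio means a 20% cost reduction
# for every doubling of cumulative output. Figures are purely illustrative.
def unit_cost(first_unit_cost, cumulative_units, progress_ratio=0.8):
    b = math.log(progress_ratio, 2)           # elasticity implied by the progress ratio
    return first_unit_cost * cumulative_units ** b

for n in (1, 2, 4, 8, 100):
    print(n, round(unit_cost(100.0, n), 1))   # 100.0, 80.0, 64.0, 51.2, ~22.7
```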

By the 1960s and 1970s it was becoming clear that developed economies were becoming information economies, increasingly staffed by knowledge workers, and these forces needed to be made explicit within quantitative models. Robert Lucas, now a Nobel laureate from the University of Chicago, probed the questions of rational expectations and internal factors promoting growth. By the mid-1980s, a group of growth theorists had become increasingly dissatisfied with common accounts of exogenous factors determining long-run growth. The focus shifted to the need for quantitative models that made these “technological” or “information” factors explicit. In other words, these “X” factors needed to be moved from a lumped, external consideration to an internal one within the models, with their own multipliers and feedbacks. In short, these new growth factors needed to be made endogenous (internal), not exogenous (external).

A book by David Warsh in 2007, Knowledge and the Wealth of Nations: A Story of Economic Discovery, is a comprehensive explanation of this transition. It focuses on Paul Romer, then of Stanford University but earlier a colleague of Lucas, and pivots on his seminal paper, “Endogenous Technological Change” [9]. By bringing the consideration internal to the model, it could be probed, inspected and broken into parts.

Besides this essential change in focus, this and related Romer papers also brought two further key insights. First, information and its artifacts are also products and outputs of the economic function. And, second, once produced, many information or knowledge assets may be produced or distributed at essentially zero marginal cost. A new dimension in “rival” and “non-rival” goods had been added to the growth theory lexicon. Information and knowledge themselves were becoming both inputs and outputs to the economic function. This understanding required still further unpacking.

Refining Inputs and Parameters

As a non-economist, it seems a bit perplexing to me how long it took the discipline to start explicating and unpacking the factors of economic growth [10]. To be fair, most every domain of human inquiry has suffered from lacking essential test datasets and statistics upon which to probe and test assumptions. There is perhaps no better poster child for this lack of reference datasets than what has been necessary to test and probe the questions related to economic growth. Yet, as our intro suggested, there is also perhaps no more important area of human inquiry than to understand these non-traditional factors of economic growth. Better understanding of these factors will impact all economic actors from individuals to firms to nations.

Our first approximation must be to get to common units and denominators that enable calculation and comparison. Things like GDP, for example, need to be re-expressed as per capita figures to take out general population growth; money terms need to be expressed in real dollars (or whatever currency), perhaps even further adjusted to account for differences in assumed deflators and inflators across metrics. We’re getting smart enough about this stuff that we can now apply best practices for common data comparisons.
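
A trivial sketch of what such normalization looks like in practice; the figures below are rounded approximations used only for illustration:

```python
# Convert a nominal GDP series into real, per-capita terms so that years can be
# compared on a common basis. All numbers are approximate and for illustration only.
nominal_gdp = {1990: 5_980, 2000: 10_250, 2010: 14_990}   # billions of current dollars
deflator    = {1990: 66.8,  2000: 81.9,  2010: 100.0}     # price index, 2010 = 100
population  = {1990: 250e6, 2000: 282e6, 2010: 309e6}

for year in sorted(nominal_gdp):
    real_gdp = nominal_gdp[year] * 1e9 * (100.0 / deflator[year])   # constant 2010 dollars
    print(year, round(real_gdp / population[year]))                 # real GDP per capita
```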

Even the traditional factors of production need further attention. Let’s first take the concept of labor. Labor is ubiquitous in virtually all economic calculations.

Most economic datasets compare items across space and time. A simple labor adjustment to per capita or hours worked can mask these underlying structural changes: life expectancy of the workers; male-female participation in the workforce; hours worked per week; holidays and vacation time; changes in retirement ages; general population and cohort growth; and, then and only then, labor productivity. Of course, the reasons for labor productivity itself come back to innovation and information: the use of better machines, practices and methods by which we do our tasks.

Similarly, the idea of “human capital” has also become predominant in the economic growth literature. Is human capital a subset of general capital? Of labor? Does human capital include education, training, experience, intellectual capabilities, etc.? And, if so, how can these be measured and made consistent for comparison or decision purposes?

We also see that the nature of innovation, information, knowledge, intellectual property (IP), practices, information artifacts, and the like, lacks any consistency as to definitions and boundaries. How can nebulous concepts be compared to still other nebulous concepts in order to draw meaningful conclusions? How can test datasets be created to refine these questions if the basic concepts and definitions remain ambiguous?

We see, for example, that knowledge and its role in economic growth may vary as to whether the knowledge is propositional (the ‘sciences’), prescriptive (‘recipes’), a discovery, or an invention [10]. These may not be the best splits, but clearly we must be able to distinguish at minimum innovative ‘aha!s’ from the tech transfer of best practices. These are fundamentally different notions of information. And, of course, none of this discussion directly addresses the internal controversy within the economics community over information vs. knowledge.

Once we normalize our traditional inputs to the economic function to appropriate per unit bases expressed in constant, real dollars, the residual “total factor productivity” is all due to innovation and information. Innovation is the spark that brings us new methods and devices for doing things, as eventually disseminated throughout the economy via the diffusion of information. Since innovation is itself based on information, we can truly say that information is the fount from which all per capita growth and wealth ultimately derives.

The Empirical Argument

In a recent paper on total factor productivity going back 150 years to the Civil War, researchers from the Congressional Budget Office have calculated that private-sector nonfarm TFP in the United States grew at an average rate of roughly 1.6 percent to 1.8 percent annually, but has experienced several surges occurring in varying parts of the economy [11].

On a different basis, I have used Robert Shiller’s published data on per capita GDP going back to 1900 to show a similar growth trend [12]. The trendline from this data series shows an annual compounded growth rate of about 1.84% per year.

These kinds of growth rates imply a doubling of wealth every 40 to 45 years.
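
The arithmetic behind that doubling claim is just the compound-growth rule of thumb:

```python
import math

# Doubling time for a compound growth rate r is ln(2) / ln(1 + r).
for r in (0.016, 0.018):
    years = math.log(2) / math.log(1 + r)
    print(f"{r:.1%} annual growth doubles wealth in about {years:.0f} years")
```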

When TFP was first being formulated, Solow calculated that 87.5% of the growth in US output per worker was attributable to technical progress [8]. In 1954, Solomon Fabricant estimated that 90% of growth was due to technological factors [13]. But, as we have seen, these were “lumpy” measures, and factors like the changing size and composition of the work force (especially the growing participation of women and the rise of two-earner families) also masked other changes.

A different way to approximate the role of technological progress is to look at the market valuation of US firms. Again using Shiller’s CAPE data [12], but now adjusted to a per capita basis (for the US, [14]), we see the following trends since 1900.

Nominally, labor is removed from this equation because it has been accounted for as an expense on the firm’s books. Similarly, the return due to capital has been accounted for via the payout of dividends. Under these bases, we see that the growth in value of large US firms — despite the severe oscillations due to market cycles — has been a bit more than 1 percent per annum compounded. This would suggest that the combination of innovation and information accounted for about 55 percent of the overall per capita GDP growth rate noted earlier.

But this proxy is itself flawed in many ways. First, the S&P index is for only the 500 largest US firms, which are certainly not representative. Also, comparing GDP and S&P figures hides the fact that much of the growth and productivity of US firms occurs via foreign subsidiaries. Also, of course, labor and capital productivity — themselves the result of innovation and information — are also taken out of the S&P estimates. The discrepancy between TFP estimates as a source of growth and intrinsic S&P valuation growth is in part explained by this different accounting metric. But the real issue in all of these proxies is that we are not yet fully unpacking the various sources of information and innovation as the drivers of underlying growth.

Only within the last few years have we begun to assemble the right datasets and account for the right factors in this unpacking of growth factors. For example, between 2000 and 2005, estimates at the industry level indicate that almost half of aggregate productivity was due to productivity growth originating from information technology [15], though the IT industries themselves only accounted for a little over 3% of nominal aggregate value [16].

These findings are from a more detailed analysis of productivity and growth by Jorgenson, Ho and Samuels [16]. Their analysis attempted to explicitly separate out innovation from the diffusion of prior innovations due to information. In the authors’ words:

“We show that the great preponderance of economic growth in the US since 1947 involves the replication of existing technologies through investment in equipment, structures, and software and expansion of the labor force. Contrary to the well-known views of Robert Solow (1957) and Simon Kuznets (1971), innovation accounts for only about twenty percent of US economic growth. This is the most important empirical finding from the recent research on productivity measurement surveyed by Jorgenson (2009). “

I think some of these differences are due to semantics and terminology. Remember, early discussions of the residual and TFP were centered around the concept of “technological progress”. What Schumpeter referred to as “innovation” is now understood to be too broad; innovation is but a part of the overall growth effect due to information. What is helpful from these more recent studies is the separation of innovation from information dissemination. The next step, for which we have not yet developed useful datasets, would be to unpack the ideas of innovation and information into the categories from Mokyr [10]: namely, discoveries and inventions (innovation), and propositional and prescriptive information, categories that draw on Michael Polanyi’s earlier notion of tacit knowledge.

The aphorism that we can not understand what we can not measure applies here. To take our understanding of these empirical factors to the next level we will need to refine our concepts and gather defensible data for estimating them. A proper accounting for growth should also likely distinguish transformative innovations (such as the printing press, electricity and computing) from other discoveries and inventions.

The Beautiful Synergy of Innovation and Information

By 2009, Romer and Jones were able to claim that the endogenous growth model had been proven, and they put forward six research questions to pursue over the coming 25 years, including the role of human capital, differential growth rates between countries, and accelerated growth [17]. Innovation had finally assumed its central, internal role in understanding growth.

Innovation is the root source of new devices, new technologies, new practices, new methods and new theories. Innovation, in turn, is based upon the foundational substrate of information. As new innovations occur, new information is added to this substrate, all in a virtuous circle.

Markets will rise and fall, and business cycles will gyrate. New businesses and business models will emerge while others are destroyed or wither away. These reflections of animal spirits and uneven (imperfect or wrong) information can never be smoothed. But the trajectory of growth, fueled by the beautiful synergy of innovation and information, points to an optimistic future.

To be sure, I am not positing a near-term upward trend in the stock markets. In fact, my own personal view is that markets are temporarily overbought, with a higher near-term probability of declines than rises. These oscillations are part and parcel of market cycles. My longer-term optimism reflects more fundamental trends.

We are all aware of the explosion of information and content. Today, like the broadening base of information and literacy that I have elsewhere posited as a major factor in the first upward inflection of economic growth in the 1800s [3], we are in the midst of a still newer — and optimistic — inflection point. Digital content and the Internet are bringing information to nearly every human on earth. Assistive technologies are bringing this information to those previously shut out due to disabilities in sight, hearing or mobility. Non-rivalrous goods can be duplicated at essentially zero cost, and open source and broad access mean new ventures can be assembled and tested in the marketplace with unprecedented speed and at unprecedentedly low cost. Innovation is no longer the remit solely of an educated elite, but is available to every thinking person on earth.

These are all harbingers of continued growth and increases in wealth. Sure, ignorance, despotism, fanaticism and prejudice will cause some periods and pockets to be shut off from these trends, but the broad sweep of information and history looks assured.

Innovation, as Schumpeter first posited a century ago, grants the firm a temporary monopolistic advantage. In a time of openness, information growth, and universal access to that information, the winning competitive formula for firms and knowledge workers alike is constant innovation. Though a commitment to innovation leads to a bumpy path, it is an upward one, and most assuredly the path that is on the right side of history.


[1] The historical data were originally developed in three books by Angus Maddison: Monitoring the World Economy 1820-1992, OECD, Paris 1995; The World Economy: A Millennial Perspective, OECD Development Centre, Paris 2001; and The World Economy: Historical Statistics, OECD Development Centre, Paris 2003. All these contain detailed source notes. Figures for 1820 onwards are annual, wherever possible.
For earlier years, benchmark figures are shown for 1 AD, 1000 AD, 1500, 1600 and 1700. These figures have been updated to 2003 and may be downloaded by spreadsheet from the Groningen Growth and Development Centre (GGDC), a research group of economists and economic historians at the Economics Department of the University of Groningen, headed by Maddison before his passing in 2010. See http://www.ggdc.net/.
[2] William D. Nordhaus, 1996. “Do Real-Output and Real-Wage Measures Capture Reality? The History of Lighting Suggests Not,” in Timothy F. Bresnahan and Robert J. Gordon, eds., The Economics of New Goods, University of Chicago Press, ISBN: 0-226-07415-3, January 1996, pp. 27 – 70. See http://www.nber.org/chapters/c6064.
[3] I have addressed these broad topics firstly in, “Information is the Basis for Economic Growth” (Adaptive Information blog, AI3, August 23, 2007), and in some book reviews, notably “Knowledge: Unravelling the X Factor in Growth and Wealth” (Adaptive Information blog, AI3, June 21, 2006) and “Historical Origins of the Knowledge Economy” (Adaptive Information blog, AI3, July 6, 2006).
[4] William Lazonick, 2013. “The Theory of Innovative Enterprise: A Foundation of Economic Analysis,” AIR Working Paper, #13-05/01, 36 pp., May 2013. See http://www.theairnet.org/files/research/WorkingPapers/Lazonick_InnovativeEnterprise_AIR-WP13.0501.pdf.
[5] “The growth of growth theory,” from The Economist, May 18th 2006.
[6] “Robert Solow on Joseph Schumpeter,” in Economist’s View, Thursday, May 17, 2007. Retrieved on June 11, 2014.
[7] Solow’s exogenous model of economic growth is also known as the Solow-Swan neo-classical growth model because the model was independently developed by Trevor W. Swan and published in The Economic Record in 1956. It allows the determinants of economic growth to be separated out into increases in inputs (labor and capital) and technical progress.
[8] Robert M. Solow, 1957. “Technical Change and the Aggregate Production Function”. Review of Economics and Statistics (The MIT Press) 39 (3): 312–320. doi:10.2307/1926047. JSTOR 1926047.
[9] Published in the Journal of Political Economy in 1990.
[10] In 2002 Joel Mokyr, an economic historian from Northwestern University, wrote a book that should be read by anyone interested in knowledge and its role in economic growth. The Gifts of Athena: Historical Origins of the Knowledge Economy is a sweeping and comprehensive account of the period from 1760 (in what Mokyr calls the “Industrial Enlightenment”) through the Industrial Revolution beginning roughly in 1820 and then continuing through the end of the 19th century.
[11] Robert Shackleton, 2013. “Total Factor Productivity Growth in Historical Perspective,” Working Paper Series, Congressional Budget Office, 21 pp., March 2013. See http://www.cbo.gov/sites/default/files/cbofiles/attachments/44002_TFP_Growth_03-18-2013.pdf.
[12] Stock market and cyclically-adjusted price earnings (CAPE) ratio data from Robert J. Shiller, 2000. Irrational Exuberance, Princeton University Press. Data as periodically updated and available from http://www.econ.yale.edu/~shiller/data/ie_data.xls.
[13] Solomon Fabricant, 1954. “Economic Progress and Economic Change,” part of the 34th Annual Report of the National Bureau of Economic Research, New York.
[14] CAPE per capita adjustment from http://www.multpl.com/united-states-population/table?f=m.
[15] Steven Rosenthal, Matthew Russell, Jon D. Samuels, Erich H. Strassner, and Lisa Usher, 2014. “Integrated Industry-Level Production Account for the United States: Intellectual Property Products and the 2007 NAICS,” May 15, 2014 (preliminary), 24 pp. See http://scholar.harvard.edu/files/jorgenson/files/jorgenson_ho_samuels_worldklems_2014_0519.pdf.
[16] Dale W. Jorgenson, Mun S. Ho, and Jon D. Samuels, 2014. “Long-term Estimates of U.S. Productivity and Growth,” prepared for presentation at the Third World KLEMS Conference, Growth and Stagnation in the World Economy, Tokyo, May 19-20, 2014. See http://www.worldklems.net/conferences/worldklems2014/worldklems2014_Ho.pdf.
[17] Charles I. Jones and Paul M. Romer, 2009. “The New Kaldor Facts: Ideas, Institutions, Population, and Human Capital,” Working Paper 15094, National Bureau of Economic Research, 31 pp., June 2009. See http://www.nber.org/papers/w15094.

Posted:June 2, 2014

Eight Massive Trends are Waking AI from Its Dark Winters

When I inaugurated this AI3 blog in 2005 I made this statement in the about section to clarify that the “three AIs” stood for adaptive information, adaptive innovation, and adaptive infrastructure, and not the AI of artificial intelligence:

. . . I personally believe artificial intelligence to be a lot of hooey and hype at best, and a misnomer and misdirection at worst. . . . ‘Artificial intelligence’ is a misdirection of attention and energy.

Gulp. OK. Time to take my medicine.

I am today formally retracting those statements — probably should have done so some time ago — and want to explain why. As much as anything, it has to do with the changing understanding of what artificial intelligence is, recently affirmed by global-scale applications and technologies working effectively right now.

Many Winters within AI

Though the idea of automatons and intelligent agents standing in for humans is about as old as human storytelling, the basic ideas around artificial intelligence became current as part of the World War II effort and were finally given a name at a famous 1956 conference at Dartmouth. The initial namers and advocates of artificial intelligence included such founders as John McCarthy, Herbert Simon, Claude Shannon and Marvin Minsky. Money to support early interest in artificial intelligence came from the part of the US military that eventually became ARPA (now DARPA), with the funding going to individual researchers to use as they wished, as opposed to specific projects. Along with many futuristic visions of the 1950s to 1970s, the promises for artificial intelligence were bold, including being able to capture and automate most notable basic human capabilities.

Popular movies and books promoted the ideas of autonomous robots that we could speak with and command and that would anticipate our needs and wishes so as to act as simulacrum agents lessening our burdens and adding to our leisure and capabilities [1]. Algorithms would be discovered and codified that would mimic the basis of human thought and intelligence. The idea of the Turing machine established a defensible basis for foreseeing that any problem of mathematical logic could be captured and taken on by computers.

The predictable failure of this vision to deliver caused a backlash, sufficient that the US Congress prohibited further open-ended funding via the Mansfield Amendments in 1969 and 1973, such that by 1974 AI funding in the US had largely dried up. Similar restrictions were applied to the British research community. The result of this backlash caused the first of what would prove to be many “winters” of funding and acceptance for AI.

Roughly a decade later, in response to the perceived Japanese threat of “fifth-generation” computing in the mid-1980s, a number of AI programs were again funded. While hardware developments were proceeding apace, efforts around McCarthy’s AI-oriented language Lisp and common sense logic frameworks (what are now called ontologies or knowledge graphs) such as Cyc began to receive sponsorship again. The mid-1990s were the time of “expert systems,” to be populated by knowledge engineers charged with interviewing internal subject matter experts (SMEs) to codify their knowledge for later reuse. These efforts, too, disappointed in terms of the lack of practical benefits delivered. More AI winters ensued.

AI (“artificial intelligence”) came once again to lose its credibility. Some researchers moved into specific algorithmic disciplines — Bayesian statistics and neural networks predominant — while others shifted into such areas as “hyperlinks” and what became the semantic Web. Today, one could argue that the lost mojo of AI has affected those in the semantic Web in almost a dialectic way. First, there are those who embrace the idea of intelligent agents and global knowledge structures, more-or-less in keeping with some sort of vision of artificial intelligence. Second, there are those that have seen the failures of the past, do not want to repeat them, and are more inclined to support “loosely bounded” structure focused on bottom-up assertions. OWL modelers and ontologists tend to occupy the first camp; linked data advocates more the second camp.

The natural community for knowledge representation and management has thus tended to bifurcate a bit: global, “visionary” AI types, with history to overcome and challenged by the sheer scale of what emerged from the Internet; and incrementalists, happy to accept a bit of RDF structured data in the hopes of an ongoing evolution to more structure and interoperability.

Ten years ago, when I made the conscious decision to reject the AI of artificial intelligence as a label for this blog, an algorithmic vision of AI seemed “wrong” and not in keeping with the general trends of the Web. That was the basis and justification for my then-statements on AI. But a funny thing happened on the way to a cogent forecast: a massive disruption called the Internet came about that — while it took a decade to gestate — changed the whole underlying substrate over which AI could take place. Like so much of history, innovation had presented us with an entirely different reality upon which to “understand” and develop artificial intelligence. It is those changes — plus the fruits from them — that are defining AI in a new light.

Eight AI Megatrends

There are, by my reckoning, at least eight major trends that have been improving AI’s prospects, especially over the past decade (numbers #3 to #7 below are most directly related to AI; the other three are general trends). Some of the proven wonders we now see in use, such as speech recognition, speech synthesis, language translation, entity recognition, image and facial recognition, computer vision, question answering, autocompletion and spell correction, recommendation systems, sentiment analysis, information extraction, document categorization, natural language processing, machine learning, reasoning, optical character recognition, word sense disambiguation, search and information retrieval, and text generation and summarization, with their many additional categories and sub-categories, are proof these trends are making a difference. None individually constitutes what may be called “AI”, but, in combination, they show compellingly that much of AI’s initial vision is indeed being fulfilled to some degree and in some specific aspect today.

Nearly all of these applications correspond to the Grand Challenges for symbolic computing identified in the 1980s. Until a decade ago, very few of them save search and initial NLP were producing results with sufficient quality and accuracy. Now, all are.

In the past ten years, most evident in the past five, tremendous breakthroughs have occurred across the entire spectrum of artificial intelligence applications. We can point to at least the eight following megatrends enabling these breakthroughs.

#1 Computer Power

A constant river of innovation has fueled exponential power improvements in computers since the first transistor. Moore’s law has led to massive improvements in hardware cost, numbers of computation cycles, and amounts of bits stored. Networking capabilities are now truly global, and the number of interconnected devices runs into the billions. Computer software innovations lead to faster and better procedures and methods; as a category, software innovation likely exceeds hardware improvements as a source of computing productivity. What today fits in the palm of our hand thirty years ago required entire rooms, and did not do one billionth of what can be done today.

The rich savanna of computing has itself encouraged a bloom of innovations, many of which contribute to artificial intelligence prospects.

#2 The Internet (and Web)

Though clearly a related function to the general improvements in computing and hardware, the advent of the Internet and its more relevant offspring of the Web has had, I believe, the most fundamental impact on the change in prospects for artificial intelligence. The sheer scale of the Web network has made available crowdsourced innovations like Wikipedia and other crowdsourced data and knowledge bases. More broadly, global content across the entire Web, accessible via a common HTTP protocol, multiplied every individual’s access to information — pay close attention — by a factor of a billion or more.

Because the entire Web is interconnected, the sheer raw grist of connected data available to analyze such things as relatedness or similarity is game-changing. Manual constructs and derived relations from years past can now be multiplied and magnified at Web scale. Any relationship test or validation can be accomplished nearly instantaneously and at (essentially) zero cost. Phenomenal!
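
As a toy example of what such a relationship test can look like when the raw grist is term co-occurrence counts (the entities and counts below are invented):

```python
import math
from collections import Counter

# Cosine similarity between two entities, each described by a bag of co-occurring
# terms that might (hypothetically) be harvested from Web text.
def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

steam_engine  = Counter({"boiler": 4, "piston": 3, "coal": 2, "industrial": 2})
diesel_engine = Counter({"piston": 4, "fuel": 3, "crankshaft": 2, "industrial": 1})
print(round(cosine(steam_engine, diesel_engine), 3))   # a crude relatedness score
```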

#3 Expectations

The discrediting of AI and its holdover smell has itself been a factor working in its favor. By being discredited, it has been possible for multiple possible AI components, many listed herein, to be developed and attended to in relative isolation. Each of today’s pieceparts to AI could be pursued on its own, without taint from the broader “AI” brush. Because the constituents were recognizable and justifiable on their own, they did not need to fulfill the past overblown visions and expectations for “AI” writ large. The pieceparts could develop in peace.

This observation, if true, means that grand visions like “artificial intelligence” are perhaps rarely (ever?) the result of a grand top-down plan. Rather, like a good stew, it is individual components that need to mature and become available to create the final meal. Since these ingredients need to stand or contribute on their own for their own purposes, the actual resulting stew may vary as to its ultimate ingredients. If one ingredient is not ripe or available, we vary our recipe according to what is available. There is no one single recipe leading to a tasty stew.

Put another way, AI has been flying under the radar for at least the last ten to fifteen years. Portions of the older AI agenda have benefited from specific attention. Better still, the new emergence of the idea of artificial intelligence is also more toned down and practical. Artificial intelligence is now, I believe, understood to be part of a process and not some autonomous embodiment. Human interaction and communication are themselves imprecise and subject to error. Why should artificial means to boost those same human capabilities not be as well?

From the standpoint of expectations, artificial intelligence has evolved from science fiction to essentially zero awareness, meanwhile delivering, on a broad scale, focused wonder capabilities such as (nearly) instantaneous translations across 60 leading human languages.

#4 Global Knowledge Bases

How can a system promise useful suggestions or alternatives if it is bereft of information?

At the local or personal level we well understand that we need to describe ourselves via attributes, the more the merrier in terms of a more complete description. A pretty good record for me would include such things as a physical description, an image, work and economic descriptions, family and life descriptions, an education description, text narratives from fun to historical, etc. The more complete description of me requires many sources and many attributes and many perspectives. But, of course, I do not live alone in the world. To describe my world, which constantly changes, I need to describe the thousands of other entities I encounter daily. Each of these, too, has many attributes and relationships to other entities. Each of these entities also changes over time (has histories) and place. So, context becomes another critical dimension.

The growth of the Web at scale has resulted in some tremendous knowledge bases of entities and concepts. Freebase and Wikipedia are two of the best known, but virtually every domain has its own sources and richness. These knowledge bases, in turn, are often open for use by others. Text mining and digital data mean these data can be combined and made to interoperate. That process is only just beginning.

Though early efforts in artificial intelligence understood that capturing and modeling common sense was both an essential and surprisingly difficult task (the impetus, for example, behind the thirty-year effort of the Cyc knowledge base), what is new in today’s circumstance is how these massive knowledge bases can inform and guide symbolic computing. The literally thousands of research papers regarding use of Wikipedia data alone [2] show how these massive knowledge bases are providing base knowledge around which AI algorithms can work.

The abiding impression is that the availability of these data sources has fundamentally changed how AI is done. Unlike the early years of mostly algorithms and rules, AI has now evolved to explicitly embrace Web-scale content and data and the statistics that may be derived from global corpora.

#5 Deep Learning

Machine learning is a core AI concept used to determine discriminative characteristics or patterns within source input data. It has been a constant emphasis of AI since the beginning.

Various machine learning algorithms — such as Markov chains, neural networks, conditional random fields, Bayesian statistics, and many other options — can be characterized among many dimensions. Some are supervised, meaning they need to be trained against a standard corpus in order to estimate parameters; others require little or no training, but may be less accurate as a result. Some are statistically based; others are based on pattern matching of various forms.

A more recent trend has been to combine multiple techniques in what is known as deep learning, where the problem set is modeled as a layered hierarchy of distributed representations, with each layer using (often) neural network techniques for unsupervised learning, followed by supervised feedback (often termed “back-propagation”) to fine-tune parameters. While computationally slower than other techniques, this approach has the advantage of automating the supervised learning phase and is proving generally most effective across a range of AI applications.
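
As a small sketch of that unsupervised-then-supervised flow, the following uses scikit-learn (my choice for illustration, not a tool named in this post) to learn one unsupervised feature layer with a restricted Boltzmann machine and then train a supervised classifier on top. Real deep-learning stacks add more layers and back-propagate through all of them; this only shows the layered idea.

```python
# One unsupervised feature-learning layer (an RBM) feeding a supervised classifier.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixel values into [0, 1] for the Bernoulli RBM
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = Pipeline([
    ("rbm", BernoulliRBM(n_components=100, learning_rate=0.06, n_iter=15, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
print("held-out accuracy:", round(model.score(X_test, y_test), 3))
```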

More fundamentally, there is a virtuous circle of feedback occurring between AI machine learning algorithms and reference knowledge and statistical bases (see next). This can extend the accuracy, completeness and efficiency of supervised methods. Some notable academic departments have relied on Web-scale corpora (University of Washington and Carnegie Mellon University are two prominent examples in the US). The most dominant player in this realm, however, has been Google (though all of the major search engine and social networking companies have smaller initiatives of similar character).

#6 Big Statistical Data

Using both statistical techniques and results from machine learning, massive datasets of entities, relationships and facts are being extracted from the Web. Some of these efforts, such as the academic NELL (CMU) or KnowItAll or Open IE (UWash), involve extractions from the open Web. Others, such as the terabyte (TB) n-gram listings from Google, are derived from Web-scale pages or Google Books. These examples are but a sampling of the various datasets and corpora available.
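
To give a sense of what an n-gram listing contains, the toy sketch below counts bigrams (adjacent word pairs) in a short snippet of text; Google's datasets apply the same bookkeeping across Web-scale pages and Google Books.

```python
from collections import Counter

# Toy illustration of an n-gram listing: count adjacent word pairs (bigrams).
text = ("big structure means applying domain and knowledge graphs "
        "to arrange concepts and entities at web scale")

tokens = text.lower().split()
bigrams = Counter(zip(tokens, tokens[1:]))

# Print the five most frequent bigrams with their counts.
for (w1, w2), count in bigrams.most_common(5):
    print(f"{w1} {w2}\t{count}")
```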

These various statistical datasets may be used directly for research in their own right, or they may help bootstrap still more refined AI techniques. Similar datasets are aiding advertising placement, search-term disambiguation and machine (language) translation. In some cases the full datasets may not be available, but open APIs are offered for areas such as entity identification or tabular data.

What is important about these trends is that data, statistics and algorithms are all now being combined in various ways with the aim of achieving acceptable AI-backed results at Web scale. It is really via the combination of these techniques that we are seeing the most impressive AI results.

#7 Big Structure

A more nascent area, really in just its first stages of effectiveness, is the application of "big structure" to all of this information. By "big structure" I mean the use of domain and knowledge graphs to help arrange and place the concepts and entities at hand.

At Web scale, the early Yahoo! directory and the Open Directory were the first examples of structuring domains. Wikipedia next became the most widely used category structure; Freebase, for example, used Wikipedia to bootstrap its own initial structure. A portion of Freebase now underpins Google's own Knowledge Graph. DBpedia likewise created its own ontology out of the infobox structure of Wikipedia. The major search engines have also put forward the schema.org structure as a means of (mostly) organizing entity and attribute information and structured data. Schema.org is putatively an input to the Google Knowledge Graph, but the exact mechanism is opaque and the results are difficult to trace.
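
To show roughly what such structure looks like on the ground, the sketch below emits a schema.org description of an entity as JSON-LD; the entity, its property values and the sameAs link are invented for illustration.

```python
import json

# Minimal sketch of schema.org structured data, serialized as JSON-LD.
# The entity and all of its values are invented for illustration only.
entity = {
    "@context": "http://schema.org",
    "@type": "Person",
    "name": "Jane Example",
    "jobTitle": "Knowledge Engineer",
    "worksFor": {
        "@type": "Organization",
        "name": "Example Organization",
    },
    # Hypothetical link asserting identity with a knowledge base entry.
    "sameAs": "https://en.wikipedia.org/wiki/Jane_Example",
}

print(json.dumps(entity, indent=2))
```

Markup along these lines, embedded in Web pages, is one of the channels by which entity and attribute information reaches the search engines' knowledge graphs.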

The need for big structure is rapidly emerging as one of the key challenges for Web-scale AI. The Web and crowdsourcing appear well suited to generating entity and attribute data. What remains unclear is how this information can be coherently organized at the scale of the Web. The problem is becoming acute, because the success of "big data" on the Web ultimately needs to find an organized, coherent expression in the aggregate. This is one major AI challenge that remains distinctly unsolved, though promising first steps exist.

#8 Open Source and Content

The major theme running through these AI breakthroughs is the leveraging of the global content of the Web. And this enabler, in turn, has been critically dependent on the open-source nature of AI algorithms, software code, and code infrastructure and architecture, as well as on open content and (generally) open APIs. Open code, algorithms, datasets and knowledge have expanded the pool of human intelligence that can be brought to bear on the question of artificial intelligence. The positive feedbacks greased by open channels of information, code and data have been absolutely essential to the amazing AI progress of the past few years.

To be sure, open does not mean a level playing field. (See the discussion of Google, next.) But I think no one could argue that progress would have been anywhere near as rapid without open source and open content and data. The synergy arising from open source and open content has thus been another essential factor in the recent and rapid progress of AI.

The Race to Intelligence

Since innovation is the source of wealth creation, it is also no surprise that the megatrends surrounding AI have drawn significant investment interest. This interest takes the form of a race to acquire the most innovative AI startups and the best human capital (expertise) in AI. Since Google has been my common touchstone in this piece — and because Google is the biggest gorilla in the room — we can use it to illustrate the scope and pace of this race. (Though Amazon, Facebook, Microsoft and IBM are also clearly entrants in this race.)

A number of recent articles, notably ones in the Washington Post and The Economist, have highlighted the total dollars at stake in this AI race. Over the past few years, AI-related company acquisitions have totaled perhaps more than $20 billion, with Nest Technologies (Google, $3.2 B), Kiva Systems (Amazon, $775 M) and DeepMind (Google, $660 M) among the largest.

Within Google alone, there has been a buying spree in search improvements (~ $1.4 B total), robotics ($80 M), machine synthesis and recognition ($250 M), machine learning ($700 M), smart devices ($3.6 B), compression technologies ($200 M), natural language processing ($80 M), and a smattering of others ($50 M), not to mention its internal efforts in self-driving cars. I don't monitor Google on a constant basis and have likely missed some major and relevant acquisitions, but it does appear that Google has spent perhaps over $6 billion on AI-related acquisitions over the past five years or so [3].

As important as start-up acquisitions has been Google's commitment to hire and partner with many of the leading AI researchers in the world. Besides the strong partnerships Google maintains with such institutions as the University of Washington, Carnegie Mellon University, MIT, Stanford, UC Berkeley and others, it has also staffed its research ranks with prominent names from those institutions and elsewhere.

Peter Norvig, one of the early advocates for combining algorithmic and statistical AI, joined Google in 2001 and is now its Director of Research. Most recently and notably, Ray Kurzweil joined Google as a Director of Engineering in 2012. Other notable AI researchers at Google include Alon Halevy (Fusion Tables), Ramanathan Guha (schema.org), Geoffrey Hinton (deep learning), Evgeniy Gabrilovich (search and machine learning), and many others whose research I am less familiar with. There is probably more AI talent combined at Google than has ever been assembled in one institution before.

With IBM's Watson getting its own division, Facebook funding an AI center to the tune of $10 B, and Apple making a similar commitment to robotic manufacturing, it is clear that all of the major players in the computing space are making big bets on AI going forward.

AI is Itself But One Beneficiary of These Trends

Since the early winters in artificial intelligence, a phenomenon has developed called the "AI effect". It has really meant two different things.

First, AI researchers have tended to call their research anything but artificial intelligence. One of the broader and trendier substitutes is known as cognitive computing. Many of the domains and disciplines I noted above got their names and prominence as substitutes for what used to be labeled AI. In any case, we can see that AI is indeed a big tent with many components and thrusts.

Second, the "AI effect" also refers to the fact that once an AI technique becomes embedded in everyday use, it is no longer perceived as AI and is taken as a given. Douglas Hofstadter expressed the AI effect concisely by quoting Tesler's Theorem: "AI is whatever hasn't been done yet."

I was perhaps right to initially reject the algorithm-centric view of AI from the early years. But now, when matched with big data, big statistics and big structure, all embedded in phenomenal advances in computing power, it is also clear that a new age of AI is dawning. One need only look at the wondrous progress over the past five years on many of what had seemed to be impossible Grand Challenges to appreciate the pace and breadth of new developments to come.

These developments will reify and foster similar emphases in semantic technologies, graph structures and analysis, and functional programming and homoiconicity (“data as code, code as data”) that my colleague, Fred Giasson, is now actively exploring. We will find that representational paradigms and the basis of how our tools and algorithms work will increasingly align. There appear to be natural underpinnings to these phenomena, including the pivot of language and meaning, that are closely aligned with the thoughts and writings of that great American pragmatist and logician, Charles S. Peirce. We will increasingly come to see that the wondrous innovations of self-driving cars, talking smartphones, warehouses of fulfillment robots, and computer vision systems can trace their roots back to basic truths of how to see and understand our world.

Understanding these forces will itself help to formulate guidelines and ideas that can foster further innovation. So, in the end, while I still don't like the term "artificial" intelligence, it is merely a sign or a term. Adaptive innovations expressed by machines are simply part of the intelligence and structure embodied in the universe, which we are now gaining the tools and understanding to exploit.


[1] Douglas Adams' Hyperland is a great exposition on this vision, with my 2007 blog post pointing to the online video.
[2] Wikipedia maintains its own page of research that relies on Wikipedia; I have earlier captured about 250 selected sources called SWEETpedia that relate specifically to semantic technologies and AI.
[3] These are merely estimates, and likely quite wrong in many specifics. The estimates were compiled by reviewing a listing of Google acquisitions (since 2009), supplemented by individual company searches when the acquisition amounts were not listed, followed by analysis of Google’s SEC Edgar filings in a manner similar to this analysis (which was also used for the robotics estimate).