The Transition from Transactions to Connections
Virtually everywhere one looks we are in the midst of a transition for how we organize and manage information, indeed even relationships. Social networks and online communities are changing how we live and interact. NoSQL and graph databases, married to their near cousin Big Data, are changing how we organize and store information and data. Semantic technologies, backed by their ontologies and RDF data model, are showing the way for how we can connect and interoperate disparate information in ways only dreamed about a decade ago. And all of this, of course, is being built upon the infrastructure of the Internet and the Web, a global, distributed network of devices and information that is undoubtedly one of the most important technological developments in human history.
There is a shared structure across all of these developments — the graph. Graphs are proving to be the new universal paradigm for how we organize and manage information. Graphs have an inherently expandable nature, and one which can also capture any existing structure. So, as we see all of the networks, connections, relationships and links — both physical and informational — grow around us, it is useful to step back a bit and contemplate the universal graph structure at the core of these developments.
Understanding that we now live in the Age of the Graph means we can begin studying and using the concept of the graph itself to better analyze and manage our interconnected world. Whether we are trying to understand the physical networks of supply chains and infrastructure or the information relationships within ontologies or knowledge graphs, the various concepts underlying graphs and graph theory, themselves expressed through a rich vocabulary of terms, provide the keys for unlocking still further treasures hidden in the structure of graphs.
The use of “graph” as a mathematical concept is not much more than 100 years old. The first explication of the various classes of problems that graph theory can address is probably no older than 300 years. The use of graphs for expressing logic structures is probably not much older than 100 years, with its intellectual roots in the work of Charles Sanders Peirce [1]. Though trade routes, with their affiliated roads and primitive transportation or nomadic infrastructures, were perhaps the first expressions of physical networks, the emergence and prevalence of networks is a fairly recent phenomenon. The Internet and the Web are surely the catalyzing developments that have brought graphs and networks to the forefront.
In mathematics, a graph is an abstract representation of a set of objects where pairs of the objects are connected. The objects are most often known as nodes or vertices; the connections between the objects are called edges. Typically, a graph is depicted in diagrammatic form as a set of dots or bubbles for the nodes, joined by lines or curves for the edges. If the relationship between connected nodes has a direction, the edge is directed, and the graph is known as a directed graph. Various structures or topologies can be expressed through this conceptual graph framework. Graphs are one of the principal focuses of study in discrete mathematics [2]. The word “graph” was first used in the sense of a mathematical structure by J.J. Sylvester in 1878 [3].
As representative of various data models, particularly in our company’s own interests in the Resource Description Framework (RDF) model, the nodes can represent “nouns” or subjects or objects (depending on the direction of the links) or attributes. The edges or connections represent “verbs” or relationships, properties or predicates. Thus, the simple “triple” of the basic statement in RDF (consisting of subject – predicate – object) is one of the constituent barbells that make up what becomes the eventual graph structure.
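The way triples accumulate into a graph can be sketched in a few lines of Python. This is a minimal illustration, not an RDF library: the triples, the predicate names and the helper functions `build_graph` and `nodes` are all hypothetical, chosen only to show subjects and objects becoming nodes while predicates become labeled, directed edges.

```python
from collections import defaultdict

# Hypothetical example triples (subject, predicate, object). The names are
# illustrative only and do not come from any real vocabulary or dataset.
triples = [
    ("Euler", "wrote", "SevenBridgesPaper"),
    ("SevenBridgesPaper", "writtenIn", "1736"),
    ("Euler", "bornIn", "Basel"),
]

def build_graph(triples):
    """Return a directed, edge-labeled graph as {subject: [(predicate, object), ...]}."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)

def nodes(triples):
    """Every subject and every object is a node; predicates label the edges."""
    return {s for s, _, _ in triples} | {o for _, _, o in triples}

graph = build_graph(triples)
print(graph["Euler"])          # outgoing edges from the "Euler" node
print(sorted(nodes(triples)))  # all nodes in the graph
```

Each triple is one "barbell"; because "Euler" and "SevenBridgesPaper" appear in more than one triple, the barbells join at shared nodes and the graph grows simply by adding more statements.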
The manipulation and analysis of graph structures comes under the rubric of graph theory. The first recognized paper in that field is Leonhard Euler’s treatment of the Seven Bridges of Königsberg, written in 1736. The objective was to find a walking path through the city that would cross each bridge once and only once. Euler proved that the problem has no solution.
Euler’s approach represented the path problem as a graph, treating the land masses as nodes and the bridges as edges. His proof observed that if every bridge is traversed exactly once, then every land mass, except for the ones chosen for the start and finish, must be touched by an even number of bridges, since each arrival must be paired with a departure (the number of connections to a node is what we now call its “degree”). In Königsberg all four land masses are touched by an odd number of bridges, so there is no solution. Other researchers, including Leibniz, Cauchy and L’Huilier, applied this approach to similar problems, leading to the origin of the field of topology.
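Euler’s degree argument is easy to check mechanically. The sketch below encodes the seven bridges as edges of a multigraph and counts the odd-degree land masses. The node labels A to D are my own shorthand (A stands for the central island), and `has_euler_walk` assumes the graph is connected, which holds for Königsberg but is not verified by the code.

```python
from collections import Counter

# The Königsberg multigraph: four land masses (nodes), seven bridges (edges).
# Labels are illustrative shorthand; A is the central island.
bridges = [
    ("A", "B"), ("A", "B"),  # two bridges between the island and one bank
    ("A", "C"), ("A", "C"),  # two bridges between the island and the other bank
    ("A", "D"),              # one bridge from the island to the eastern land mass
    ("B", "D"),
    ("C", "D"),
]

def degrees(edges):
    """Count how many edge endpoints touch each node."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def has_euler_walk(edges):
    """For a connected multigraph, a walk using every edge exactly once
    exists only if zero or two nodes have odd degree."""
    odd = [node for node, d in degrees(edges).items() if d % 2 == 1]
    return len(odd) in (0, 2)

print(dict(degrees(bridges)))
print(has_euler_walk(bridges))
```

All four nodes have odd degree (5, 3, 3 and 3), so the check fails, which is Euler’s result. Dropping any single bridge leaves exactly two odd-degree nodes, and the check then succeeds.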
Later, Cayley broadened the approach to study tree structures, which have many implications in theoretical chemistry. By the 20th century, the fusion of ideas coming from mathematics with those coming from chemistry formed the origin of much of the standard terminology of graph theory.
Graph theory forms the core of network science, the applied study of graph structures and networks. Besides graph theory, the field draws on methods including statistical mechanics from physics, data mining and information visualization from computer science, inferential modeling from statistics, and social structure from sociology. Classical problems embraced by this realm include the four color problem of maps, the traveling salesman problem, and the six degrees of Kevin Bacon.
Graph theory and network science are the suitable disciplines for a variety of information structures and many additional classes of problems. This table lists many of these applicable areas, most with links to still further information from Wikipedia:
Graphs are among the most ubiquitous models of both natural and human-made structures. They can be used to model many types of relations and process dynamics in physical, biological and social systems. Many problems of practical interest can be represented by graphs. This breadth of applicability makes network science and graph theory two of the most critical analytical areas for study and breakthroughs for the foreseeable future. I touch on this more in the concluding section.
Surely the first examples of graph structures were early trade and nomadic routes. Here, for example, are the trade routes of the Radhanites dating from about 870 AD [4]:
It is not surprising that routes such as these, or other physical networks as exemplified by the bridges of Königsberg, were the stimulus for early mathematics and analysis related to efficient use of networks. Minimizing the time to complete a trade circuit or visiting multiple markets efficiently has clear benefits. These economic rationales apply to a wide variety of modern, physical networks, including:
Of course, included in the latter category is the Internet itself. It is the largest graph in existence, with an estimated 2.2 billion users and their devices all connected in one way or another in all parts of the globe [5].
Graphs and graph theory also have broad applicability to natural systems. For example, graph theory is used extensively to study molecular structures in chemistry and physics. A graph makes a natural model for a molecule, where vertices represent atoms and edges bonds. Similarly, in biology or ecology, graphs can readily express such systems as species networks, ecological relationships, migration paths, or the spread of diseases. Graphs are also proper structures for modeling biological and chemical pathways.
Some of the exemplar natural systems that lend themselves to graph structures include:
As with physical networks, a graph representation for natural systems provides real benefits in computer processing and analysis. Once expressed as a graph, all graph algorithms and perspectives from graph theory and network science can be brought to bear. Statistical methods are particularly applicable to representing connections between interacting parts of a system, as well as to representing the physical dynamics of natural systems.
Parallel with the growth of the Internet and Web has been the growth of social networks. Social network analysis (SNA) has arguably been the single most important driver for advances in graph theory and analysis algorithms in recent years. New and interesting problems and challenges — from influence to communities to conflicts — are now being elucidated through techniques pioneered for SNA.
Second only in size to the Internet has been the graph of interactions arising from Facebook. Facebook had about 900 million users as of May 2012, half of whom accessed the service via mobile devices [6]. Facebook famously embraced the graph with its own Open Graph protocol, which makes it easy for users to access and tie into Facebook’s social network. A representation of the Facebook social graph as of December 2010 is shown in this well-known figure:
The suitability of the graph structure to capture relationships has been a real boon to better understanding of social and community dynamics. Many new concepts have been introduced as the result of SNA, including such things as influence, diversity, centrality, cliques and so forth. (The opening diagram to this article, for example, models centrality, with blue the maximum and red the minimum.)
Particular areas of social interaction that lend themselves to SNA include:
Entirely new insights have arisen from SNA including finding terrorist leaders, analyzing prestige, or identifying keystone vendors or suppliers in business ecosystems.
Given the ubiquity of graphs as representations of real systems and networks, it is certainly not surprising to see their use in computer science as a means for information representation. We already saw in the table above the many data structures that can be represented as graphs, but the paradigm has even broader applicability.
The critical breakthroughs have come through using the graph as a basis for data models and logic models. These, in turn, provide the basis for crafting entire graph-based vocabularies and languages. Once such structures are embraced, it is a natural extension to also extend the mindset to graph databases as well.
Some of the notable information representations that have a graph as their basis include:
A key point of graphs noted earlier was their inherent extensibility. Once graphs are understood as a great basis for representing both logic and data structures, it is a logical next step to see their applicability extend to knowledge representations and knowledge bases as well.
Graph-theoretic methods have proven particularly useful in linguistics, since natural language often lends itself well to discrete structure. So, not only can graphs represent syntactic and compositional structure, but they can also capture the interrelationships of terms and concepts within those languages. The usefulness of graph theory to linguistics is shown by the various knowledge bases such as WordNet (in various languages) and VerbNet.
Domain ontologies are similar structures, capturing the relationships amongst concepts within a given knowledge domain. These are also known as knowledge graphs, and Google has famously just released its graph of entities to the world [7]. Semantic networks and neural networks are similar knowledge representations.
The following interactive diagram, of the UMBEL knowledge graph of about 25,000 reference concepts for helping to orient disparate datasets [8], shows that some of these graph structures can get quite large:
What all of these examples show is the nearly universal applicability of graphs, from the abstract to the physical, from the small to the large, and every gradation between. We also see how basic graph structures and concepts can be built upon with more structure. This breadth points to the many synergies and innovations that may be transferred from diverse fields to advance the usefulness of graph theories.
Despite the many advances that have occurred in graph theory and the increased attention from social network analysis, many, many graph problems remain some of the hardest in computation. Optimizations, partitioning, mapping, inferencing, traversing and graph structure comparisons remain challenging. And, some of these challenges are only growing due to the growth in the size of networks and graphs.
Applying the lessons of the Internet in such areas as non-relational databases, distributed processing, and big data and map reduce-oriented approaches will help some in this regard. We’re learning how to divide and conquer big problems, and we are discovering data and processing architectures more amenable to graph-based problems.
The fact that we have now entered the Age of the Graph also suggests that further scrutiny and attention will lead to more analytic breakthroughs and innovation. We may be in an era of Big Data, but the structure underlying all of that is the graph. And that reality, I predict, will result in accelerated advances in graph theory.
A Decade of Remarkable Advances in Ten Grand IT Challenges
I’ve been in the information theory and technology game for quite some time, but believe nothing has matched the pace of advances of the past ten years. As one example, it was a mere eight years ago that I was sitting in a room with language translation vendors contemplating automated translation techniques for US intelligence agencies. The prospects finally looked doable, but the success of large-scale translation was not assured.
At about that same time, and the years until just recently, a whole slew of Grand Challenges [1] in computing hung out there: tantalizing yet not proven. These areas ranged from information extraction and natural language understanding to speech recognition and automated reasoning.
But things have been changing fast, and with a subtle steadiness that has caused it to go largely unremarked. Sure, all of us have been aware of the huge changes on the Web and search engine ubiquity and social networking. But some of the fundamentally hard problems in computing have also gone through some remarkable (but largely unremarked) advances.
We now have smart phones that speak instructions to us while we instruct them by voice in turn. Virtually all information conceivable is now indexed and made available through the Web; structure is now rapidly characterizing that information, making it even more useful to discover and organize. We can translate documents online with acceptable accuracy into more than 60 languages [2]. We can get directions to or see satellite views of virtually any place on earth. We have in fact become accustomed to new technology magic on a nearly daily basis, so much so that the pace of these advances seems to be a constant, blunting our perspective of just how rapid these advances have been progressing.
These advances are perhaps not the realization of artificial intelligence as articulated in the 1950s to 1980s, but are contributing to a machine-based ability to do tasks useful to humans heretofore impossible and at scales unimaginable. As Google and IBM’s Watson are showing, statistics (among other techniques) applied to massive knowledge bases or text corpora are breaking down all of the Grand Challenges of symbolic computing. The image that is emerging is less one of intelligent machines working autonomously than it is of computers working interactively or semi-automatically with humans to address previously unsolvable problems.
By using a perspective of the decade past, we also demark the seminal paper on the semantic Web by Berners-Lee, Hendler and Lassila from May 2001 [3]. Yet, while this semantic Web vision has been a contributor to the success of the Grand Challenge advances of the past ten years, I think we can also say that it has not been the key or even a primary driver. That day may still yet come. Rather, I think we have to look to natural language and statistics surrounding large-scale corpora as the more telling drivers.
Over the past ten years there have been significant advances on at least ten Grand Challenges in symbolic computation. As the concluding section notes, these advances can be traced in most part to broader advances in natural language processing, the logical and semiotic bases for interoperability, and standards (nominally in the semantic Web) for embracing them. Here are these ten areas of advance, all achieved over the past ten years:
Information extraction (IE) uses various forms of natural language processing (NLP) to identify structured information within unstructured or semi-structured documents. These documents are presented in machine-readable form (including straight text, various document formats or HTML) with the various types of information “tagged” or prompted for inclusion. Information types that can be extracted with one of the various techniques include entities, relations, topics, categories, and so forth. Once tagged or extracted, the information in the documents can now be included and linked to standard structured information (as might come from conventional databases) or to structure in other documents.
Most recently, a large number of online services and open source systems have become available with strengths in one or more of these extraction types. Some current examples include Yahoo! Term Extraction, OpenCalais, BeliefNetworks, OpenAmplify, Alchemy API, Evri, Extractiv, Illinois Tagger, and about 80 others [4].
Machine translation is the automatic translation of machine-readable text from one human language to another. Accurate and acceptable machine translation requires applying different types of knowledge including grammar, semantics, facts about the real world, etc. Various approaches have been developed and refined over time.
Especially helpful has been the availability of huge corpora in multiple languages to which large-scale statistical analysis may be applied (as is the case of Google’s machine translation) or human editing and refinement (as is the case with the more than 280 language versions of Wikipedia).
While it is true that none of these systems has 100% accuracy (even human translators show much variation), the more advanced ones are truly impressive, with remaining ambiguities flagged for resolution by semi-automatic means.
Though sentiment analysis is strictly speaking a subset of information extraction, it has the more demanding and useful task of extracting subjective information, often across a group of documents or texts. Sentiment analysis can be applied to online reviews to determine the “polarity” about specific objects, and it is especially useful for identifying public opinion trends or evaluating social media for ranking, polling or marketing purposes.
Because of its greater difficulty and potential high value, many of the leading sentiment analysis capabilities remain proprietary. Some capable open source versions are available nonetheless. There is also an interesting online application using Twitter feeds.
Many words have more than one meaning. Word sense disambiguation uses machine learning, dictionaries (gazetteers) of known entities and concepts, ontologies, linguistic databases such as WordNet, or combinations thereof to evaluate ambiguous terms or phrases and resolve them based on context. Some systems need to be “trained”, some work automatically, and others rely on evaluation and prompting (semi-automatic) to complete the disambiguation process.
State-of-the-art systems have greater than 90% precision [5]. Most of the leading open source NLP toolkits have quite capable disambiguation modules, and even better proprietary systems exist.
Speech synthesis is the conversion of text to spoken speech and has been around for quite some time. Speech recognition is a far more difficult task in that a given sound clip or real-time spoken speech of a person must be converted to a textual representation, which itself can then be acted upon such as navigating or making selections. Speech recognition is made difficult because of individual voice differences, the variations of human languages and speech patterns, and the need to segment speech into a sequence of words. (In most spoken languages, the sounds representing successive letters blend into each other, so the conversion of the modulated wave form to discrete characters or tokens can be a very difficult process.)
Crude systems of a decade ago required much training with a specific speaker’s voice to show much effectiveness. Today, the range and ability to use these systems without training has markedly improved.
Until recently, improvements largely were driven by military and intelligence requirements. Today, however, with the ubiquity of smart phones and speech interfaces, the consumer market is greatly accelerating progress.
Image recognition is the ability to determine whether or not an electronic image contains some specific object, feature, or activity, and then to extract the image data associated with it. Today, under specific circumstances and for specific tasks, this can be done by computer. However, for the general case of arbitrary objects in arbitrary situations this challenge has not yet been fully met. The systems of today work best for simple geometric objects (e.g., polyhedra), human faces, printed or hand-written characters, or vehicles, and in specific situations, typically described in terms of well-defined illumination, background, and orientation of the object relative to the camera.
Auto license recognition at intersections, face recognition by security cameras, and greatly expanded and improved character recognition systems (machine vision) represent some of the current state-of-the-art. Again, smart phone apps are helping to drive advances.
Most of the previous advances are related to extracting structured information or mapping or deriving additional structured information. Once obtained, of course, the next challenge is in how to relate that information together; that is, how to make it interoperate.
We have been steadily climbing a data federation pyramid [6] — and at an impressively accelerating rate since the adoption of the Internet and Web. These network innovations gave us a common basis and protocols for connecting distributed devices. That, in turn, has freed us to concentrate on the standards for data representation and interoperability.
XML first provided a means for a common data serialization that encouraged various communities and industries to devise exchange vocabularies. RDF provided a means for a common data model, one that was both simple and extensible at the same time [7]. OWL built upon that basis to enable us to build common domain models (see next).
There are alternatives to the semantic Web standards of RDF and OWL such as common logic and there are many competing data exchange formats to XML. None of these standards is essential on its own and all have their communities and advocates. However, because they are standards and they share common network bases, it has also been relatively easy to convert amongst the various available protocols. We are nearly at a global level where everything is connected, machine-readable, and in structured form.
Semantics in machine-readable form means that we can more confidently link and combine available information. We are seeing a veritable explosion of domain models to represent various domains and viewpoints in consensual, interoperable form. What this means is that we are now gaining the computing vocabularies and grammars — along with shared community models (world views) — to get this stuff to work together.
Five years ago we called this phenomenon mashups, but no one uses that term any longer because these information brewpots are everywhere, including in our very hands when we interact with the apps on our smart phones. This glue of domain models is generally as invisible to us as is the glue in laminates or the resin in plastics. But it is nonetheless the strength and foundation that enables much of the computing magic unfolding around us.
Once the tyranny of physical separation was shattered between data and machine by the network, the rationale for keeping the data with the app or even the user with the app disappeared. Cloud computing may seem mysterious or sound to have some high-octave hum, but it really is nothing more than saying that the Web enables us to treat all of our computing resources as virtual. Data can be anywhere; machines and hard drives can be anywhere; and applications can be anywhere.
And, virtualness brings benefits in and of itself. Whole computing environments can be installed or removed nearly instantaneously. Peak computing demands can be met with virtual headrooms. Backup and rollover and redundancy practices and strategies can change. Web services mean tailored capabilities can be invoked from anywhere and integrated for local needs. Massive computing resources and server farms can be as accessible to the individual as they are to prior computing behemoths. Combined with continued advances in underlying computing hardware and chips, the computing power available to any user is rising exponentially. There is now even more power in the power curve.
One hears stories of Google or the National Security Agency having access to and managing servers measured in the hundreds of thousands. Entirely new operating systems and computing environments, many with roots in open source, such as virtual operating systems and MapReduce approaches like Hadoop, have been developed to deal with the current era of “big data”.
MapReduce is a framework for processing huge datasets using a large number of servers. The “map” step partitions the problem into tractable sub-problems, organized in a tree structure. The “reduce” step then takes the answers to all the sub-problems and combines them to produce the final output.
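The two steps can be illustrated with the customary word-count example. The sketch below runs in a single Python process and is not Hadoop’s API; the function names (`map_step`, `shuffle`, `reduce_step`, `word_count`) and the sample documents are assumptions made for illustration. In a real framework the map and reduce calls would be spread across many servers, with the grouping in between handled by the framework itself.

```python
from collections import defaultdict

def map_step(document):
    """Map: turn one document (one sub-problem) into (word, 1) pairs."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(mapped_pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

def reduce_step(key, values):
    """Reduce: combine all partial answers for one key into a final answer."""
    return key, sum(values)

def word_count(documents):
    # Each document could be mapped on a different server; here it is a loop.
    mapped = [pair for doc in documents for pair in map_step(doc)]
    return dict(reduce_step(key, values) for key, values in shuffle(mapped).items())

docs = ["the graph is the structure", "the age of the graph"]
counts = word_count(docs)
print(counts)
```

Because each `map_step` call depends only on its own document, and each `reduce_step` call only on its own key, both steps can run in parallel without coordination, which is what lets the approach scale to very large datasets.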
Such techniques enable analysis of datasets of a size impossible before. This has enabled the development of statistics and analytical techniques that have been able to make correlations and find patterns for some of the Grand Challenge tasks noted before that simply could not be addressed within previous limits. The “big data” approach is providing a brute force alternative to previously intractable problems.
Declining hardware costs and increasing performance (such as from Moore’s Law), combined with the adoption of the Internet + Web network, set the fertile conditions for these unprecedented advances in computing’s Grand Challenges. But the adaptive radiation in innovations now occurring has its own dynamics. In computing terms, we are seeing the equivalent of the Cambrian explosion in evolutionary history.
The dynamics driving this computing explosion are based largely, I believe, on the statistics of information retrieval and extraction needed to cope with the scale of documents on the Web. That, in turn, has impelled innovations in big data and distributed architectures and designs that have pried open previously closed computing lockboxes. As data from everywhere and from every provenance pours into the system, means for handling and interoperating with it have become imperatives. These forces, in turn, have been channeled and are being met through the open and standards-based approaches that helped lead to the development of the Internet and its infrastructure in the first place.
These powerful evolutionary forces in computing are clearly evident in the ten Grand Challenge advances above. But the challenges above are also silent on another factor, underpinning the interoperability initiatives, that is only now just becoming evident and exerting its own powerful force. That is the workable, intellectual foundations for interoperability itself.
Clearly, as the advances in the Grand Challenges show, we are seeing immense exposures of new structured information and impressive means for accessing and managing it on a global, distributed scale. Yet all of this data and this structure raises the question of how to get the information to work together. Further, the sources and viewpoints and methods by which all of this data has been created also put a huge premium on means to deal with the diversity. Though not evident, and perhaps not even known to many of the innovators and practitioners, there has been a growing intellectual force shaping our foundational views about the nature of things and their representations. This force has been, I believe, one of those root cause drivers helping to show the way to interoperability.
John Sowa, despite his unending criticism of the semantic Web in favor of common logic, has nonetheless been a very positive evangelist for the 19th century American logician and philosopher, Charles Sanders Peirce. Sowa points out that the entire 20th century largely neglected Peirce’s significant contributions in many areas and some philosophers appropriated Peircean insights without proper attribution [8]. Indeed, Peirce has only come to wider attention within the past decade or so. Much of his voluminous lifetime writings have still not yet been committed to publication.
Among many notable contributions, Peirce was passionate about signs and their triadic representations, in a field known as semiotics. The philosophical and logical basis of his triangle of signs deserves your attention, though it cannot be adequately treated here [9]. However, as summarized by Sowa [8], “A semiotic view of language and logic gets to the heart of the philosophical controversies and their practical implications for linguistics, artificial intelligence, and related subjects.”
In essence, Peirce’s triadic logic of semiotics helps clarify philosophical questions about things, how they are perceived and how they are named that have vexed philosophers at least since the time of Aristotle. What Peirce was able to put forward was a testable logic for how things and the names of things can be understood and related to one another, via logical statements or structures. These, in turn, can be symbolized and formalized into logical constructs that can capture the structure of natural language as well as more structured data.
The clarity of Peirce’s logic of signs is an underlying factor, I believe, in why we are finally seeing our way clear to capturing, representing and relating information from a diversity of sources and viewpoints in a way that is defensible and interoperable [10]. As we plumb Peircean logics further, I believe we will continue to gain additional insights and methods for combining and relating information. The next phase of our advances on these Grand Challenges is likely to be fueled more by connections and interoperability than by basic extraction or representation.
We are not seeing the vision of artificial intelligence unfold as posed three decades ago. Nor are we seeing the AI-complete type of problems being solved in their entirety [11]. Rather, we are seeing impressive but incomplete approaches. Full automation and autonomy are not yet at hand, and may be so far in the future as to never be. But we are nevertheless seeing advances across the board in all Grand Challenge areas.
What is emerging is a practical achievement of the Grand Challenges, the scale and scope of which is unprecedented in symbolic computing. As we see Peircean logic continue to take hold and interoperability grow in usefulness and stature, I think it fair to say we can look back in ten years to describe where we stand today as having been in the midst of an evolutionary explosion.
Exposing $4.7 Trillion Annually in Undervalued Information
Something strange began to happen with company valuations beginning twenty to thirty years ago. Book values increasingly began to diverge from, and fall below, stock prices or acquisition prices. Between 1982 and 1992 the ratio of book value to market value decreased from 62% to 38% for public US companies [1]. The why of this mystery has largely been solved, but what to do about it has not. Significantly, semantic technologies and approaches offer both a rationale and an imperative for how to get the enterprises’ books back in order. In the process, semantics may also provide a basis for more productive management and increased valuations for enterprises as well.
The mystery of diverging value resides in the importance of information in an information economy. Unlike the historical and traditional ways of measuring a company’s assets — based on the tangible factors of labor, capital, land and equipment — information is an intangible asset. As such, it is harder to see, understand and evaluate than other assets. Conventionally, and still the more common accounting practice, intangible assets are divided into goodwill, legal (intellectual property and trade secrets) and competitive (know-how) intangibles. But — given that intangibles now equal or exceed the value of tangible assets in advanced economies — we will focus instead on the information component of these assets.
As used herein, information is taken to be any data that is presented in a form useful to recipients (as contrasted to the more technical definition of Shannon and Weaver [2]). While it is true that there is always a question of whether the collection or development of information is a cost or represents an investment, that “information” is of growing importance and value to the enterprise is certain.
The importance of this information focus can be demonstrated by two telling facts, which I elaborate below. First, only five to seven percent of existing information is adequately used by most enterprises. And, second, the global value of this information amounts to perhaps a range of $2.0 trillion to $7.4 trillion annually (yes, trillions with a T)! It is frankly unbelievable that assets of such enormous magnitude are so poorly understood, exploited or managed.
Amongst all corporate resources and assets, information is surely the least understood and certainly the least managed. We value what we measure, and measure what we value. To say that we little measure information — its generation, its use (or lack thereof) or its value — means we are attempting to manage our enterprises with one eye closed and one arm tied behind our backs. Semantic approaches offer us one way, perhaps the best way, to bring understanding to this asset and then to leverage its value.
More than a decade ago Moody and Walsh put forward a seminal paper on the seven “laws” of information [3]. Unlike other assets, information has some unique characteristics that make understanding its importance and valuing it much more difficult than other assets. Since I think it a shame that this excellent paper has received little attention and few citations, let me devote some space to covering these “laws”.
Like traditional factors of production — land, labor, capital — it is critical to understand the nature of this asset of “information”. As the laws below make clear, the nature of “information” is totally unique with respect to other factors of production. Note I have taken some liberty and done some updating on the wording and emphasis of the Moody and Walsh “laws” to accommodate recent learnings and understandings.
Information is not friable and cannot be depleted. Using or consuming it has no direct effect on others using or consuming it, and using only portions of it does not undermine the whole of it. Using it does not cause a degradation or loss of function from its original state. Indeed, information is actually not “consumed” at all (in the conventional sense of the term); rather, it is “shared”. And, absent other barriers, information can be shared infinitely. The access to and use of information on the Web demonstrates this truth daily.
Thus, perhaps the most salient characteristic of information as an asset is that it can be shared between any number of people, business areas and organizations without loss of value to any party (absent the importance of confidentiality or secrecy, which is another information factor altogether). The sharability or maintenance of value irrespective of use makes information quite different to how other assets behave. There is no dilution from use. As Moody and Walsh put it, “from the firm’s perspective, value is therefore cumulative rather than apportioned across different users.”
In practice, however, this very uniqueness is also a cause of other organizational challenges. Both personal and institutional barriers get erected to limit sharing since “knowledge is power.” One perverse effect of information hoarding or lack of institutional support for sharing is to force the development anew of similar information. When not shared, existing information becomes a cost, and one that can get duplicated many times over.
Most resources degrade with use, such as equipment wearing out. In contrast, the per unit value of information increases with use. The major cost of information is in its capture, storage and maintenance. The actual variable cost of using the information (particularly digital information) is, in essence, zero. Thus, with greater use, the per unit cost of information drops.
There is a corollary to this that also goes to the heart of the question of information as an asset. From an accounting point of view, something can only be an asset if it provides future economic value. If information is not used, it cannot possibly result in such benefits and is therefore not an asset. Unused information is really a liability, because no value is extracted from it. In such cases the costs of capture, storage and maintenance are incurred, but with no realized value. Without use, information is solely a cost to the enterprise.
The additional corollary is that awareness of the information’s existence is an essential requirement in order to obtain this value. As Moody and Walsh state, “information is at its highest ‘potential’ when everyone in the organization knows where it is, has access to it and knows how to use it. Information is at its lowest ‘potential’ when people don’t even know it is there.”
A still further corollary is the importance of information literacy. Awareness of information without an understanding of where it fits or how to take advantage of it also means its value is hidden to potential users. Thus, in addition to awareness, training and documentation are important factors to help ensure adequate use. Both of these factors may seem like additional costs to the enterprise beyond capture, storage and maintenance, but, without them, little or no value will be leveraged and the information will remain a sunk cost.
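The arithmetic behind this second “law” can be sketched in a few lines. The figures below are purely hypothetical and do not come from Moody and Walsh; the sketch only assumes, as the text above does, that capture, storage and maintenance form a fixed cost and that the variable cost of each use is essentially zero.

```python
def per_use_cost(fixed_cost, uses):
    """Average cost per use when each additional use costs ~nothing.

    Returns None for unused information: there is no per-use figure,
    only the full fixed cost carried with no realized value.
    """
    if uses == 0:
        return None
    return fixed_cost / uses


# Hypothetical fixed cost of capture, storage and maintenance.
FIXED_COST = 10_000.0

for uses in (0, 1, 10, 100, 1000):
    cost = per_use_cost(FIXED_COST, uses)
    label = "unused: pure cost" if cost is None else f"{cost:,.2f} per use"
    print(f"{uses:>5} uses -> {label}")
```

The per-use figure falls in direct proportion to use, while the zero-use case never leaves the cost column at all.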
Like most other assets, the value of information tends to depreciate over time [4]. Some information has a short shelf life (such as Web visitations); other information has a long shelf life (patents, contracts and many trade secrets). Proper valuation of information should take into account such differences in operational life, analysis or decision life, and statutory life. Operational shelf life tends to be the shortest.
In these regards, information is not too dissimilar from other asset types. The most important point is to be cognizant of use and shelf differences amongst different kinds of information. This consideration is also traded off against the declining costs of digital information storage.
A standard dictum is that the value of information increases with accuracy. The caveat, however, is that some information, because it is not operationally dependent or critical to the strategic interests of the firm, actually can become a cost when capture costs exceed value. Understanding such Pareto principles is an important criterion in evaluating information approaches. Generally, information closest to the transactional or business purpose of the organization will demand higher accuracy.
Such statements may sound like platitudes — and are — in the absence of an understanding of information dependencies within the firm. But, when certain kinds of information are critical to the enterprise, it is just as important to know the accuracy of the information feeding that “engine” as it is for oil changes or maintenance schedules for physical engines. Thus an understanding of accuracy requirements in information should be a deliberate management focus for critical business functions. It is the rare firm that attends to such imperatives today.
A unique contribution from semantic approaches, and perhaps the one resulting in the highest valuation benefit, arises from the increased value of connecting the information. We have come to understand this intimately as the “network effect” from interconnected nodes on a network. The same effect arises when existing information is connected.
Today’s enterprise information environment is often described by many as unconnected “silos”. Scattered databases and spreadsheets and other information repositories litter the information landscape. Not only are these sources unconnected and isolated, but they also may describe similar information in different and inconsistent ways.
As I have described previously in The Law of Linked Data [5], existing information can act as nodes that — once connected to one another — tend to produce a similar network effect to what physical networks exhibit with increasing numbers of users. Of course, the nature of the information that is being connected and its centrality to the mission of the enterprise will greatly affect the value of new connections. But, based on evidence to date, the value of information appears to go up somewhere between a quadratic and exponential function for the number of new connections. This value is especially evident in know-how and competitive areas.
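To give a feel for how wide the band between “quadratic” and “exponential” is, here is a minimal sketch. The two functions are unitless illustrations of the growth shapes named above, not a valuation model from The Law of Linked Data; the function names and the sample connection counts are my own assumptions.

```python
def quadratic(n: int) -> int:
    """Lower growth shape: value proportional to n squared."""
    return n ** 2


def exponential(n: int) -> int:
    """Upper growth shape: value proportional to 2 to the power n."""
    return 2 ** n


# Compare the two shapes for a handful of illustrative connection counts.
for n in (1, 2, 4, 8, 16):
    print(f"{n:>2} connections: quadratic={quadratic(n):>4}  "
          f"exponential={exponential(n):>6}")
```

For small numbers of connections the two shapes are nearly indistinguishable (they coincide at 2 and 4 connections), which is one reason the evidence to date can only bracket the effect rather than pin it down.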
Information overload is a well-known problem. On the other hand, lack of appropriate information is also a compelling problem. The question of information is thus one of relevancy. Too much irrelevant information is a bad thing, as is too little relevant information.
These observations lead to two use considerations. First, means to understand and focus information capture on relevant information is critical. And, second, information management systems should be purposefully designed with user interfaces for easy filtering of irrelevant information.
The latter point sounds straightforward, but, in actuality, requires a semantic underpinning to the enterprise’s information assets. This requirement is because relevancy is in the eye of the beholder, and different users have different terms, perspectives, and world views by which information evaluation occurs. For filtering to be useful, information must be presented in terms and perspectives relevant to those users. Since multiple studies affirm that information decision-makers seek more information beyond their overload points [3], it is thus more useful to provide relevant access and filtering methods that can be tailored by user rather than “top down” information restrictions.
With access and connections, information tends to beget more information. This propagation results from summations, analysis, unique combinations and other ways that basic data get recombined into new data. Thus, while the first law noted that information cannot be consumed (or depleted) by virtue of its use, we can also say that information tends to reproduce and expand itself via use and inspection.
Indeed, knowledge itself is the result of how information in its native state can be combined and re-organized to derive new insights. From a valuation standpoint, it is this very understanding that leads to such things as competitive intelligence or new know-how. In combination with insights from connections, this propagating factor of information is the other leading source of intangible asset valuations.
This law also points to the fact that information per se is not a scarce resource. (Though its availability may be scarce.) Once available, techniques like data mining, analysis, visualization and so forth can be rich sources for generating new information from existing holdings of data.
These “laws” — or perspectives — help to frame the imperatives for how to judge information as an asset and its resulting value. The methodological and conceptual issues of how to explicitly account for information on a company’s books are, of course, matters best left to economists and professional accountants. With the growing share of information in relation to intangible assets, this would appear to be a matter of great importance to national policy. Accounting for R&D efforts as an asset versus a cost, for example, has been estimated to add on the order of 11 percent to US national GDP estimates [9].
The mere generation of information is not necessarily an asset, as the “laws” above indicate. Some of the information has no value and some indeed represents a net sunk cost. What we can say, however, is that valuable information that is created by the enterprise but remains unused or is duplicated means that what was potentially an asset has now been turned into a cost — sometimes a cost repeated many-fold.
Information that is used is an asset, intangible or not. Here, depending on the nature of the information and its use, it can be valued on the basis of cost (historical cost or what it cost to develop it), market value (what others will pay for it), or utility (what is its present value as benefits accrue into the future). Traditionally the historical cost method has been applied to information. Yet, since information can both be sold and still retained by the organization, it may have both market value and utility value, with its total value being the sum.
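The three valuation bases can be compared side by side in a small sketch. Every number here is hypothetical, and the discounting approach (a constant annual benefit discounted at a fixed rate) is simply one common way to compute a present value; the source does not prescribe a formula.

```python
def present_value(annual_benefit, discount_rate, years):
    """Discounted sum of a constant benefit received at the end of each year."""
    return sum(
        annual_benefit / (1 + discount_rate) ** year
        for year in range(1, years + 1)
    )


# All figures are hypothetical, for illustration only.
historical_cost = 500_000.0   # what it cost to develop the information
market_value = 200_000.0      # what others would pay for it
utility_value = present_value(annual_benefit=150_000.0,
                              discount_rate=0.08,
                              years=5)

# Information can be sold and still retained, so the two values add.
total_value = market_value + utility_value

print(f"Historical cost: {historical_cost:>12,.0f}")
print(f"Market value:    {market_value:>12,.0f}")
print(f"Utility value:   {utility_value:>12,.0f}")
print(f"Total value:     {total_value:>12,.0f}")
```

With these assumed inputs the combined market and utility value exceeds the historical cost, which illustrates how the traditional cost basis can understate information that is actually used.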
In looking at these factors, Moody and Walsh propose a number of new guidelines in keeping with the “laws” noted above [3]:
The net result of thinking about information in this more purposeful way is to encourage more accurate valuation methods, and to provide incentives for more use and re-use, particularly in combined ways. Such methods can also help distinguish what information is of more value to the organization, and therefore worthy of more attention and investment.
The emerging discrepancy between market capitalizations and book values began to get concerted academic attention in the 1990s. To be sure, perceptions by the market and of future earnings potential can always color these differences. The simple occurrence of a discrepancy is not itself proof of erroneous or inaccurate valuations. (And, the corollary is that the degree of the discrepancy is not sufficient alone to estimate the intangible asset “gap”, a logical error made by many proponents.) But, the fact that these discrepancies had been growing and appeared to be based (in part) on structural changes linked to intangibles was attracting attention.
Leonard Nakamura, an economist with the Federal Reserve Bank of Philadelphia, published a working paper in 2001 entitled, “What is the U.S. Gross Investment in Intangibles? (At Least) One Trillion Dollars a Year!” [6]. This was one of the first attempts to measure intangible investments, which he defined as private expenditures on assets that are intangible and necessary to the creation and sale of new or improved products and processes, including designs, software, blueprints, ideas, artistic expressions, recipes, and the like. Nakamura acknowledged his work as being preliminary. But he did find direct and indirect empirical evidence to show that US private firms were investing at least $1 trillion annually (as of 2000, the basis year for the data) in intangible assets. Private expenditures, labor and corporate operating margins were the three measurement methods. The study also suggested that the capital stock of intangibles in the US has an equilibrium market value of at least $5 trillion.
Another key group (Carol Corrado, Charles Hulten, and Daniel Sichel, known as “CHS” across their many studies) also began to systematically evaluate the extent and basis for intangible assets and their discrepancy [7]. They estimated that spending on long-lasting knowledge capital (not just intangibles broadly) grew relative to other major components of aggregate demand during the 1990s. CHS was the first to show that, by the turn of the millennium, fixed US investment in intangibles was at least as large as business investment in traditional, tangible capital.
By later in the decade, Nakamura was able to gather and analyze time series data that showed the steady increase in the contributions of intangibles [8]:
One can see the cross-over point late in the decade. Investment in intangibles he now estimates to be on the order of 8% to 10% of GDP annually in the US.
At roughly the same time, the National Academies in the US were commissioned to investigate the policy questions of intangible assets. The resulting major study [9] contains much relevant information. But it, too, contained an update by CHS on their slightly different approach to analyzing the growing role of intangible assets:
This CHS analysis shows similar trends to what Nakamura found, though the degree of intangible contributions is estimated as higher (~14% of annual GDP today), with investments in intangibles exceeding tangible assets somewhat earlier.
Surveys of more than 5,000 companies in 25 countries confirmed these trends from a different perspective, and also showed that most of these assets did not get reflected in financial statements. A large portion of this value was due to “brands” and other market intangibles [10]. The total “undisclosed” portion appeared to equal or exceed total reported assets. Figures for the US indicated there might be a cumulative basis of intangible assets of $9.2 trillion [11].
In parallel, these groups and others began to decompose the intangible asset growth by country, sector, or asset type. The specific component of “information” received a great deal of attention. Uday Apte, Uday Karmarkar and Hiranya Nath, in particular, conducted a couple of important studies during this decade [12,13]. For example, they found nearly two-thirds of recent US GDP was due to information or knowledge industry contributions, a percentage that had been growing over time. They also found that a secondary sector of information internal to firms itself constituted well over 40% of the information economy, or some 28% of the entire economy. So the information activities internal to organizations and institutions represent a very large part of the economy.
The specific components that can constitute the informational portion of intangible assets have also been looked at by many investigators, importantly including key accounting groups. FASB, for example, has specific guidance on treatment of intangible assets in SFAS 141 [14]. Two-thirds of the 90 specific intangible items listed by the American Institute of Certified Public Accountants are directly related to information (as opposed to contracts, brands or goodwill), as shown in [15]. There has also been some good analysis by CHS on breakdowns by intangible assets categories [16]. There are also considerable differences by country on various aspects of these measures (see, for instance, [10]). According to OECD figures from 2002, expenditures for knowledge (R&D, education and software) ranged from nearly 7 percent (Sweden) to below 2 percent (Greece) in OECD countries, with the average at about 4 percent and the US at over 6 percent [17].
The common view is that a typical organization only uses 5 to 7 percent of the information it already has on hand [18], and 20% to 25% of a knowledge worker’s time is spent simply trying to find information [19]. To probe these issues more deeply, I began a series of analyses in 2004 looking at how much money was being spent on preparing documents within US companies, and how much of that investment was being wasted or not re-used [20]. One key finding from that study was that the information within documents in the US represents about a third of total gross domestic product, or an amount equal at the time of the study to about $3.3 trillion annually (in 2010 figures, that would be closer to $4.7 trillion). This level of investment is consistent with the results of Apte et al. and others as noted above.
However, for various reasons (mostly due to lack of awareness and re-use) some 25% of those trillions of dollars spent annually on document creation is wasted. If we could just find the information and re-use it, massive benefits could accrue, as these breakdowns in key areas show:
| U.S. FIRMS | $ Million | % |
| Cost to Create Documents | $3,261,091 | |
| Benefits to Finding Missed or Overlooked Documents | $489,164 | 63% |
| Benefits to Improved Document Access | $81,360 | 10% |
| Benefits of Re-finding Web Documents | $32,967 | 4% |
| Benefits of Proposal Preparation and Wins | $6,798 | 1% |
| Benefits of Paperwork Requirements and Compliance | $119,868 | 15% |
| Benefits of Reducing Unauthorized Disclosures | $51,187 | 7% |
| Total Annual Benefits | $781,314 | 100% |
| PER LARGE FIRM | $ Million | |
| Cost to Create Documents | $955.6 | |
| Benefits to Finding Missed or Overlooked Documents | $143.3 | |
| Benefits to Improving Document Access | $23.8 | |
| Benefits of Re-finding Web Documents | $9.7 | |
| Benefits of Proposal Preparation and Wins | $2.0 | |
| Benefits of Paperwork Requirements and Compliance | $35.1 | |
| Benefits of Reducing Unauthorized Disclosures | $15.0 | |
| Total Annual Benefits | $229.0 | |
Table. Mid-range Estimates for the Annual Value of Documents, U.S. Firms, 2002 [20]
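The percentage column and the “some 25%” figure can be checked directly from the U.S. firms rows of the table. The sketch below copies those figures verbatim (in $ million) and computes each share against the stated total; the variable and function names are my own.

```python
# Figures in $ million, copied from the U.S. firms rows of the table above.
COST_TO_CREATE = 3_261_091
STATED_TOTAL_BENEFITS = 781_314

BENEFITS = {
    "Finding missed or overlooked documents": 489_164,
    "Improved document access": 81_360,
    "Re-finding Web documents": 32_967,
    "Proposal preparation and wins": 6_798,
    "Paperwork requirements and compliance": 119_868,
    "Reducing unauthorized disclosures": 51_187,
}


def share_of_total(amount, total=STATED_TOTAL_BENEFITS):
    """Whole-number percentage of the stated total annual benefits."""
    return round(100 * amount / total)


for name, amount in BENEFITS.items():
    print(f"{name:<42} {share_of_total(amount):>3}%")

# Recoverable benefits relative to what is spent creating documents.
wasted_share = STATED_TOTAL_BENEFITS / COST_TO_CREATE
print(f"Benefits as a share of creation cost: {wasted_share:.1%}")
```

The last ratio comes out just under one quarter, which is the basis for the rounded 25% figure used in the text.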
The total benefit from improved document access and use to the U.S. economy is on the order of 8% of GDP. For the 1,000 largest U.S. firms, benefits from these improvements can approach $250 million annually per firm (2002 basis). About three-quarters of these benefits arise from not re-creating the intellectual capital already invested in prior document creation. About one-quarter of the benefits are due to reduced regulatory non-compliance or paperwork, or better competitiveness in obtaining solicited grants and contracts.
This overall value of document use and creation is quite in line with the analyses of intangible assets noted above, which arose from totally different analytical bases and data. This triangulation brings confidence that true trends in the growing importance of information have been identified.
These various estimates can now be combined to provide an assessment of just how large the “gap” is for the overlooked accounting and use of information assets:
| Region | GDP ($T) | Intangible % (Lo) | Intangible % (Hi) | Info Contrib % (Lo) | Info Contrib % (Hi) | Info Assets ($T, Lo) | Info Assets ($T, Hi) | Unused Info ($T, Lo) | Unused Info ($T, Hi) | Total ($T, Lo) | Total ($T, Hi) |
| US | $14.72 | 9% | 14% | 33% | 67% | $0.44 | $1.38 | $0.30 | $1.21 | $0.74 | $2.60 |
| European Union | $15.25 | 8% | 12% | 33% | 50% | $0.40 | $0.92 | $0.31 | $1.26 | $0.72 | $2.17 |
| Remaining Advanced | $10.17 | 8% | 12% | 33% | 50% | $0.27 | $0.61 | $0.21 | $0.84 | $0.48 | $1.45 |
| Rest of World | $34.32 | 2% | 6% | 10% | 25% | $0.07 | $0.51 | $0.00 | $0.71 | $0.07 | $1.22 |
| Total | $74.46 | | | | | $1.18 | $3.42 | $0.83 | $4.02 | $2.00 | $7.44 |
| Notes (see endnotes) | [21] | [22] | | [23] | | | | | | | |
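The way the columns of this table combine can be sketched as follows. The “Info Assets” figures appear to be the product of GDP, the intangible percentage and the information contribution percentage; that relationship is inferred from the numbers themselves rather than stated explicitly, so treat it as an assumption. The “Unused Info” figures are taken from the table as given, since their derivation rests on the endnotes.

```python
# region: (GDP $T, intangible lo, intangible hi, info lo, info hi,
#          unused info lo $T, unused info hi $T), copied from the table.
ROWS = {
    "US": (14.72, 0.09, 0.14, 0.33, 0.67, 0.30, 1.21),
    "European Union": (15.25, 0.08, 0.12, 0.33, 0.50, 0.31, 1.26),
    "Remaining Advanced": (10.17, 0.08, 0.12, 0.33, 0.50, 0.21, 0.84),
    "Rest of World": (34.32, 0.02, 0.06, 0.10, 0.25, 0.00, 0.71),
}


def info_assets(gdp, intangible_share, info_share):
    """Assumed relationship: GDP x intangible share x information share."""
    return gdp * intangible_share * info_share


def estimate(row):
    """Return (assets lo, assets hi, total lo, total hi) for one region."""
    gdp, int_lo, int_hi, info_lo, info_hi, unused_lo, unused_hi = row
    assets_lo = info_assets(gdp, int_lo, info_lo)
    assets_hi = info_assets(gdp, int_hi, info_hi)
    return assets_lo, assets_hi, assets_lo + unused_lo, assets_hi + unused_hi


total_lo = total_hi = 0.0
for region, row in ROWS.items():
    assets_lo, assets_hi, region_lo, region_hi = estimate(row)
    total_lo += region_lo
    total_hi += region_hi
    print(f"{region:<20} assets ${assets_lo:.2f}T-${assets_hi:.2f}T  "
          f"total ${region_lo:.2f}T-${region_hi:.2f}T")

midpoint = (total_lo + total_hi) / 2
print(f"Global: ${total_lo:.2f}T to ${total_hi:.2f}T, midpoint ${midpoint:.2f}T")
```

Under that assumed relationship the sketch reproduces the global range of roughly $2.0 trillion to $7.4 trillion and a midpoint near $4.7 trillion; individual cells can differ from the table in the last digit because the table sums rounded values.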
Depending on one’s perspective, these estimates can be viewed either as too optimistic about the importance of information assets [25] or as too conservative [26]. The breadth of the ranges of these values is itself an expression of the uncertainty in the numbers and the analysis.
The analysis shows that, globally, the value of unused and unaccounted information assets may be on the order of $2.0 trillion to $7.4 trillion annually, with a mid-range value of $4.7 trillion. Even considering uncertainties, these are huge, huge numbers by any account. For the US alone, this range is $750 billion to $2.6 trillion annually. The analysis from the prior studies [20] would strongly suggest the higher end of this range is more likely than the lower. Similarly large gaps likely occur within the European Union and within other advanced nations. For individual firms, depending on size, the benefits of understanding and closing these gaps can readily be measured in the millions to billions [27].
At the high end, these estimates suggest that perhaps as much as 10% of global expenditures is wasted and unaccounted for due to information-related activities. This is roughly equivalent to adding a half of the US economy to the global picture.
In the concluding section, we touch on why such huge holes may appear in the world’s financial books. Clearly, though, even with uncertain and heroic assumptions, the magnitude of this gap is huge, with compelling needs to understand and close it as soon as possible.
The seven Moody and Walsh information “laws” provide the clues to the reasons why we are not properly accounting for information and why we inadequately use it:
Fundamentally, because information is not understood in our bones as central to the well-being of our enterprises, we continue to view the generation, capture and maintenance of information as a “cost” and not an “asset”.
I have maintained for some time an interactive information timeline [28] that attempts to encompass the entire human history of information innovations. For tens of thousands of years steady — yet slow — progress in the ways to express and manage information can be seen in this timeline. But, then, beginning with electricity and then digitization, the pace of innovation explodes.
The same timeframe that sees the importance of intangible assets appear on national and firm accounts is when we see the full digitization of information and its ability to be communicated and linked over digital networks. A very insightful figure by Rama Hoetzlein for his thesis in 2007, which I have modified and enhanced, captures this evolution with some estimated dates as is shown below [29]:
The first insight this figure provides is that all forms of information are now available in digital form. This includes unstructured (images and documents), semi-structured (mark-up and “tagged” information) and structured (database and spreadsheet) information. This information can now be stored and communicated over digital networks with broadly accepted protocols.
But the most salient insight is that we now have the means through semantic technologies and approaches to interrelate all of this information together. Tagging and extraction methods enable us to generate metadata for unstructured documents and content. Data models based on predicate logic and semantic logics give us the flexible means to express the relationships and connections between information. And all of this can be stored and manipulated through graph-based datastores and languages such that we can draw inferences and gain insights. Plus, since all of this is now accessible via the Web and browsers, virtually any user can access, use and leverage this information.
This figure and its dates show not only where we have come as a species in our use and sophistication with information, but also how we need to bring it all together using semantics to complete our transition to a knowledge economy.
The very same metadata and semantic tagging capabilities that enable us to interrelate the information itself also provide the techniques by which we can monitor and track usage and provenance. It is through these additional semantic methods that we can finally begin to gain insight as to what information is of what value and to whom. Tapping this information will complete the circle for how we can also begin to properly value and then manage and optimize our information assets.
With our transition to an information economy, we now see that intangible assets exceed the value of tangible ones. We see that the information component of these intangibles represents one-third to two-thirds of their value. In other words, information makes up from 17% to more than one-third of an individual firm’s value in modern economies. Further, we see that at least 25% of firm expenditures on information is wasted, keeping it as a cost and negating its value as an asset.
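The 17% to one-third range follows from simple multiplication, sketched below. The one assumption I add is that intangibles account for one half of firm value, which is the floor implied by saying they equal or exceed tangibles; a larger intangible share would push both ends of the range higher.

```python
from fractions import Fraction

# Assumption: intangibles are taken at one half of firm value, the minimum
# consistent with intangibles equaling or exceeding tangible assets.
INTANGIBLE_SHARE = Fraction(1, 2)

# Information's share of intangibles, from the text.
INFO_SHARE_LO = Fraction(1, 3)
INFO_SHARE_HI = Fraction(2, 3)


def info_share_of_firm_value(intangible_share, info_share):
    """Information as a fraction of total firm value."""
    return intangible_share * info_share


lo = info_share_of_firm_value(INTANGIBLE_SHARE, INFO_SHARE_LO)
hi = info_share_of_firm_value(INTANGIBLE_SHARE, INFO_SHARE_HI)

print(f"Information share of firm value: {float(lo):.0%} to {float(hi):.0%}")
```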
The “factories” of the modern information economy no longer produce pins with the fixed inputs of labor and capital as in the time of Adam Smith. They rather produce information and knowledge and know-how. Yet our management and accounting systems seem fixed in the techniques of yesteryear. The quaint idea of total factor productivity as a “residual” merely betrays our ignorance about the causes of economic growth and firm value. These are issues that should rightly occupy the attention of practitioners in the disciplines of accounting and management.
Accounting methods grounded in the early 1800s that are premised on only capital assets as the means to increase the productivity of labor no longer work. Our engines of innovation are not physical devices, but ideas, innovation and knowledge; in short, information. Capable executives recognize these trends, but have yet to change management practices to address them [31].
As managers and executives of firms we need not await wholesale modernization of accounting practices to begin to make a difference. The first step is to understand the role, use and importance of information to our organizations. Looking clearly at the seven information “laws” and what that means about tracking and monitoring is an immediate way to take this step. The second step is to understand and evaluate seriously the prospects for semantic approaches to make a difference today.
We have now sufficiently climbed the data federation pyramid [32] to where all of our information assets are digital; we have network protocols to link them; we have natural language and extraction techniques for making documents first-class citizens alongside structured data; and we have logical data models and sound semantic technologies for tying it all together.
We need to reorganize our “factory” floors around these principles, just as prime movers and unit electric drives altered our factories of the past. We need to reorganize and re-think our work processes and what we measure and value to compete in the 21st century. It is time to treat information with a seriousness that matches how integral it has become to our enterprises. Semantic technologies and approaches provide just the path to do so.
| Blueprints | Designs and drawings | Manuscripts | Proprietary products |
| Book libraries | Development rights | Medical charts and records | Proprietary technology |
| Broadcast licenses | Employment contracts | Musical compositions | Publications |
| Buy-sell agreements | Engineering drawings | Newspaper morgue files | Royalty agreements |
| Certificates of need | Environmental rights | Noncompete covenants | Schematics and diagrams |
| Chemical formulas | Film libraries | Patent applications | Shareholder agreements |
| Computer software | Food flavorings and recipes | Patents (both product and process) | Solicitation rights |
| Computerized databases | Franchise agreements | Patterns | Subscription lists |
| Contracts | Historical documents | Prescription drug files | Supplier contracts |
| Cooperative agreements | Health maintenance organization enrollment lists | Prizes and awards | Technical and specialty libraries |
| Copyrights | Know-how | Procedural manuals | Technical documentation |
| Credit information files | Laboratory notebooks | Product designs | Technology-sharing agreements |
| Customer contracts | Literary works | Proposals outstanding | Trade secrets |
| Customer and client lists | Management contracts | Proprietary computer software | Trained and assembled workforce |
| Customer relationships | Manual databases | Proprietary processes | Training manuals |
Refining UMBEL’s Linking and Mapping Predicates with Wikipedia
We are only days away from releasing the first commercial version 1.00 of UMBEL (Upper Mapping and Binding Exchange Layer) [1]. To recap, UMBEL has two purposes, both aimed to promote the interoperability of Web-accessible content. First, it provides a general vocabulary of classes and predicates for describing domain ontologies and external datasets. Second, UMBEL is a coherent framework of 28,000 broad subjects and topics (the “reference concepts”), which can act as binding nodes for mapping relevant content.
This last iteration of development has focused on the real-world test of mapping UMBEL to Wikipedia [2]. The result, to be more fully described upon release, has led to two major changes. It has acted to expand the size of the core UMBEL reference concepts to about 28,000. And it has led to adding to and refining the mapping predicates necessary for UMBEL to fulfill its purpose as a reference structure for external resources. This latter change is the focus of this post.
There is a huge diversity of organizational structure and world views on the Web; the linking and mapping predicates to fulfill this purpose must also capture that diversity. Relations between things on the Web can range from the exact and identical to the approximate, descriptive and casual [3]. The 16 K direct mappings that have now been made between UMBEL and Wikipedia (resulting in the linkage of more than 2 million Wikipedia pages) provide a real-world test for how to capture this diversity. The need is to find the range of predicates that can reflect and capture accurate, high-quality mappings. Further, because mappings can be aided by a variety of techniques from the manual to the automatic, it is important to characterize the specific mapping methods used whenever a linking predicate is assigned. Such qualifications can help to distinguish mapping trustworthiness, and enable later segregation for the application of improved methods as they may arise.
As a result, the UMBEL Vocabulary now has a well vetted and diverse set of linking and mapping predicates. Guidelines for how these differ, how they are used, and how they are qualified are described next.
Properties for linking and mapping need to differ more than in name or intended use. They must represent differences that affect inferences and reasoners, and can be acted upon by specific utilities via user interfaces and other applications. Furthermore, the diversity of mapping predicates should capture the types of diverse mappings and linkages possible between disparate sources.
Sometimes things are individuals or instances; other times they are classes or groupings of similar things. Sometimes things are of the same kind, but not exactly aligned. Sometimes things are unlike, but related in a common way. (Everything in Britain, for example, is a British “thing” even though they may be as different as trees, dead kings or cathedrals.) Sometimes we want to say something about a thing, such as an animal’s fur color or age, as a way to further characterize it, and so on.
The OWL 2 language and existing semantic Web languages give us some tools and existing vocabulary to capture some of this diversity. How these options, plus new predicates defined for UMBEL’s purposes, compare is shown by this table:
| Property | Relative Strength | Usage | Standard Reasoner? | Inverse Property? | Kind of Thing | Symmetrical? | Transitive? | Reflexive? | |
| It is | It Relates to | ||||||||
owl:equivalentClass |
10 | equivalence | X | N/A | class | class | yes | yes | yes |
owl:sameAs |
9 | identity | X | N/A | individual | individual | yes | yes | yes |
rdfs:subClassOf |
8 | subset | X | class | class | no | yes | yes | |
umbel:correspondsTo |
7 | ~equivalence | + / - | anything | RefConcept | yes | yes | yes | |
skos:narrowerTransitive |
6 | hierarchical | X | skos:Concept | skos:Concept | no | yes | no | |
skos:broaderTransitive |
6 | hierarchical | X | skos:Concept | skos:Concept | no | yes | no | |
rdf:type |
5 | membership | X | anything | class | no | no | no | |
umbel:isAbout |
4 | topical | X | anything | RefConcept | perhaps | not likely | not likely | |
umbel:isLike |
3 | similarity | anything | anything | yes | no | not likely | ||
umbel:relatesToXXX |
2 | relationship | anything | SuperType | no | no | not likely | ||
umbel:isCharacteristicOf |
1 | attribute | X | anything | RefConcept | no | no | no | |
I discuss each of these predicates below. But, first, let’s discuss what is in this table and how to interpret it [4].
The Usage metric is described for each property below.
To further aid the understanding of these properties, we can also group them into equivalence, membership, approximate or descriptive categories.
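As a rough aid to reading the table, the relative strengths and usage labels can be put into a small lookup structure. This is only a sketch: the strength and usage values are transcribed from the table, while the assignment of each predicate to a category is an assumption based on the order in which the predicates are discussed below. The helper names `by_category` and `strongest` are hypothetical.

```python
# Relative strength and usage are transcribed from the table above.
# The category of each predicate is an ASSUMPTION inferred from the
# order of the discussion that follows; it is not stated in the table.
PREDICATES = {
    "owl:equivalentClass":      (10, "equivalence",  "equivalent"),
    "owl:sameAs":               (9,  "identity",     "equivalent"),
    "rdfs:subClassOf":          (8,  "subset",       "membership"),
    "umbel:correspondsTo":      (7,  "~equivalence", "approximate"),
    "skos:narrowerTransitive":  (6,  "hierarchical", "membership"),
    "skos:broaderTransitive":   (6,  "hierarchical", "membership"),
    "rdf:type":                 (5,  "membership",   "membership"),
    "umbel:isAbout":            (4,  "topical",      "approximate"),
    "umbel:isLike":             (3,  "similarity",   "approximate"),
    "umbel:relatesToXXX":       (2,  "relationship", "approximate"),
    "umbel:isCharacteristicOf": (1,  "attribute",    "descriptive"),
}


def by_category(category):
    """Return the predicates in one category, strongest first."""
    names = [name for name, (_, _, cat) in PREDICATES.items() if cat == category]
    return sorted(names, key=lambda name: -PREDICATES[name][0])


def strongest(candidates):
    """Pick the candidate predicate with the highest relative strength."""
    return max(candidates, key=lambda name: PREDICATES[name][0])


if __name__ == "__main__":
    print(by_category("approximate"))
    print(strongest(["umbel:isLike", "rdf:type"]))
```

A lookup like this might be used to pick the strongest predicate that a reviewer is still willing to defend for a given mapping.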
Equivalent properties are the most powerful available since they entail all possible axioms between the resources.
Equivalent class means that two classes have the same members; each is a sub-class of the other. The classes may differ in terms of annotations defined for each of them, but otherwise they are axiomatically equivalent.
An owl:equivalentClass assertion is the most powerful available because of its ability to ‘Explode the Domain‘ [6]. Because of its entailments, owl:equivalentClass should be used with great care.
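The mutual sub-class reading of equivalence, and the reason for caution, can be illustrated with plain Python sets. This is a deliberately simplified, extensional sketch (classes treated as nothing more than their known members), not how an OWL reasoner works internally; the class names and members are hypothetical.

```python
def is_subclass(a_members, b_members):
    """Extensional reading: every member of A is also a member of B."""
    return a_members <= b_members


def is_equivalent(a_members, b_members):
    """Equivalent classes are sub-classes of each other."""
    return is_subclass(a_members, b_members) and is_subclass(b_members, a_members)


def assert_equivalent(members, a, b):
    """Apply an equivalence assertion: both classes receive all members."""
    merged = members[a] | members[b]
    members[a] = merged
    members[b] = merged


if __name__ == "__main__":
    # Hypothetical classes from two sources.
    members = {
        "src1:Cathedral": {"Canterbury", "York Minster"},
        "src2:Church": {"Canterbury", "St Mary's Parish Church"},
    }
    print(is_equivalent(members["src1:Cathedral"], members["src2:Church"]))  # False
    # A careless equivalence assertion makes a parish church a cathedral.
    assert_equivalent(members, "src1:Cathedral", "src2:Church")
    print(sorted(members["src1:Cathedral"]))
```

The second half shows why a mistaken assertion is costly: every member of either class is entailed to be a member of the other.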
The owl:sameAs assertion claims two instances to be an identical individual. This assertion also carries with it strong entailments of symmetry and reflexivity.
owl:sameAs is often misapplied [7]. Because of its entailments, it too should be used with great care. When there are doubts about claiming this strong relationship, UMBEL has the umbel:isLike alternative (see below).
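One way to see why a misapplied owl:sameAs is so damaging is to model identity as merged sets: because the relation is symmetric and transitive, a single wrong assertion fuses everything already linked on either side. The sketch below uses a small union-find structure for this; the class name `IdentitySets` and the example identifiers are hypothetical.

```python
class IdentitySets:
    """Tracks which identifiers are entailed to denote the same individual."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        """Return the representative of x's identity set."""
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps later lookups short.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def same_as(self, a, b):
        """Assert that a and b are the same individual and merge their sets."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_a] = root_b

    def are_same(self, a, b):
        return self.find(a) == self.find(b)


if __name__ == "__main__":
    ids = IdentitySets()
    # Hypothetical identifiers from three sources.
    ids.same_as("src1:JimmyJohnson", "src2:Jimmy_Johnson_(coach)")
    ids.same_as("src3:J_Johnson", "src2:Jimmy_Johnson_(driver)")
    print(ids.are_same("src1:JimmyJohnson", "src3:J_Johnson"))  # False
    # One mistaken assertion fuses the coach and the driver everywhere.
    ids.same_as("src1:JimmyJohnson", "src3:J_Johnson")
    print(ids.are_same("src2:Jimmy_Johnson_(coach)", "src2:Jimmy_Johnson_(driver)"))  # True
```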
Membership properties assert that an instance is a member of a class.
The rdfs:subClassOf asserts that one class is a subset of another class. This assertion is transitive and reflexive. It is a key means for asserting hierarchical or taxonomic structures in an ontology. This assertion also has strong entailments, particularly in the sense of members having consistent general or more specific relationships to one another.
Care must be exercised that full inclusivity of members occurs when asserting this relationship. When correctly asserted, however, this is one of the most powerful means to establish a reasoning structure in an ontology because of its transitivity.
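The reasoning value of transitivity can be sketched in a few lines: once the sub-class links are stated, membership in a specific class entails membership in every class above it. The class and instance names below are hypothetical, and the traversal is a plain graph walk rather than the algorithm of any particular reasoner.

```python
def superclasses(cls, subclass_of):
    """Reflexive, transitive closure of the sub-class links starting at cls."""
    seen = {cls}
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_types(instance, rdf_type, subclass_of):
    """All classes an instance belongs to, given its asserted types."""
    types = set()
    for cls in rdf_type.get(instance, ()):
        types |= superclasses(cls, subclass_of)
    return types


if __name__ == "__main__":
    # Hypothetical hierarchy: each key is a sub-class of the listed classes.
    subclass_of = {
        "Cathedral": {"ReligiousBuilding"},
        "ReligiousBuilding": {"Building"},
        "Building": {"Facility"},
    }
    rdf_type = {"ex:YorkMinster": {"Cathedral"}}
    print(sorted(inferred_types("ex:YorkMinster", rdf_type, subclass_of)))
```

If one link in the chain does not truly include all members of the sub-class, every inference drawn through it is wrong, which is the care the paragraph above calls for.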
Both skos:narrowerTransitive and skos:broaderTransitive work on skos:Concept (recall that umbel:RefConcept is itself a subClassOf a skos:Concept). The predicates state a hierarchical link between two concepts that indicates one is in some way more general (“broader”) than the other (“narrower”), or vice versa. The particular application of skos:broaderTransitive (or its complement) is to infer the transitive closure of the hierarchical links, which can then be used to access direct or indirect hierarchical links between concepts.
The transitive relationship means that there may be intervening concepts between the two stated resources, making the relationship an ancestral one, and not necessarily (though it is possible to be so) a direct parent-child one.
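The difference between a direct parent and an ancestor can be shown by computing the closure over direct "broader" links. In this sketch the direct links play the role of skos:broader and the computed set plays the role of skos:broaderTransitive; the concept names are hypothetical. The concept itself is left out of the result, in line with the table listing these predicates as not reflexive.

```python
def broader_transitive(concept, broader):
    """All ancestors reachable through direct 'broader' links."""
    ancestors = set()
    frontier = list(broader.get(concept, ()))
    while frontier:
        current = frontier.pop()
        if current not in ancestors:
            ancestors.add(current)
            frontier.extend(broader.get(current, ()))
    return ancestors


if __name__ == "__main__":
    # Hypothetical direct links: each key has the listed broader concepts.
    broader = {
        "Oak": {"Tree"},
        "Tree": {"Plant"},
        "Plant": {"Organism"},
    }
    direct = broader["Oak"]
    ancestral = broader_transitive("Oak", broader)
    print(sorted(direct))              # the direct parent only
    print(sorted(ancestral - direct))  # ancestors with intervening concepts
```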
The rdf:type assertion assigns instances (individuals) to a class. While the idea is straightforward, it is important to understand the intensional nature of the target class to ensure that the assignment conforms to the intended class scope. When this determination cannot be made, one of the more approximate UMBEL predicates (see below) should be used.
For one reason or another, the precise assertions of the equivalent or membership properties above may not be appropriate. For example, we might not know an intended class scope well enough, or there might be ambiguity as to the identity of a specific entity (is it Jimmy Johnson the football coach, race car driver, fighter, local plumber or someone else?). Among other options, along a spectrum of relatedness, is the desire to assign a predicate that is meant to represent the same kind of thing, yet without knowing if the relationship is an equivalence (identity, or sameAs), a subset, or merely a member-of relationship. Alternatively, we may recognize that we are dealing with different things, but want to assert a relationship of an uncertain nature.
This section presents the UMBEL alternatives for these different kinds of approximate predicates [4].
The most powerful of these approximate predicates in terms of alignment and entailments is the umbel:correspondsTo property. This predicate is the recommended option if, after looking at the source and target knowledge bases [8], we believe we have found the best equivalent relationship, but do not have the information or assurance to assign one of the relationships above. So, while we are sure we are dealing with the same kind of thing, we may not have full confidence to be able to assign one of these alternatives:
rdfs:subClassOf, owl:equivalentClass, owl:sameAs, or superClassOf
Thus, with respect to existing and commonly used predicates, we want an umbrella property that is more or less equivalent in nature and that, if the relation were known precisely, might actually encompass one of the above relations. We simply lack the certainty to choose one of them or, perhaps, to assert full “sameness”. This is not too dissimilar from the rationale being tested for the x:coref predicate in relation to owl:sameAs from the UMBC Ebiquity group [9,10].
The property umbel:correspondsTo is thus used to assert a close correspondence between an external class, named entity, individual or instance with a Reference Concept class. It asserts this correspondence through the basis of both its subject matter and intended scope.
This property may be reified with the umbel:hasMapping property to describe the “degree” of the assertion.
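To make the idea of qualifying an assertion concrete, here is a minimal sketch that writes out a umbel:correspondsTo link together with a reified statement carrying umbel:hasMapping. It assumes standard RDF reification (rdf:Statement with rdf:subject, rdf:predicate and rdf:object); the post does not show the exact pattern UMBEL uses, so treat this as one plausible reading. The UMBEL namespace strings, the statement identifier, the reference concept name and the qualifier identifier are all assumptions for illustration.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# ASSUMED namespaces; check the released vocabulary before relying on them.
UMBEL = "http://umbel.org/umbel#"
RC = "http://umbel.org/umbel/rc/"


def reified_mapping(stmt_id, subject, predicate, obj, qualifier):
    """Return the mapping triple plus a reified copy qualified by hasMapping."""
    return [
        (subject, predicate, obj),
        (stmt_id, RDF + "type", RDF + "Statement"),
        (stmt_id, RDF + "subject", subject),
        (stmt_id, RDF + "predicate", predicate),
        (stmt_id, RDF + "object", obj),
        (stmt_id, UMBEL + "hasMapping", qualifier),
    ]


if __name__ == "__main__":
    triples = reified_mapping(
        stmt_id="_:mapping1",  # hypothetical blank node label
        subject="http://en.wikipedia.org/wiki/Category:Cathedrals",
        predicate=UMBEL + "correspondsTo",
        obj=RC + "Cathedral",  # hypothetical reference concept
        qualifier=UMBEL + "ManualNearlyEquivalent",  # hypothetical qualifier
    )
    for s, p, o in triples:
        print(s, p, o)
```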
In most uses, the most prevalent linking property is the umbel:isAbout assertion. This predicate is useful when tagging external content with metadata for alignment with an UMBEL-based reference ontology. The reciprocal assertion, umbel:isRelatedTo, is used when an assertion from within an UMBEL vocabulary to an external ontology is desired. It applies where the reference vocabulary itself needs to refer to an external topic or concept.
The umbel:isAbout predicate does not have the same level of confidence or “sameness” as the umbel:correspondsTo property. It may also reflect an assertion that is more like rdf:type, but without the confidence of class membership.
The property umbel:isAbout is thus used to assert the relation between an external named entity, individual or instance with a Reference Concept class. It can be interpreted as providing a topical assertion between an individual and a Reference Concept.
This property may be reified with the umbel:hasMapping property to describe the “degree” of the assertion.
The property umbel:isLike is used to assert an associative link between similar individuals who may or may not be identical, but are believed to be so. This property is not intended as a general expression of similarity, but rather the likely but uncertain same identity of the two resources being related.
This property may be considered as an alternative to sameAs where there is not a certainty of sameness, and/or when it is desirable to assert a degree of overlap of sameness via the umbel:hasMapping reification predicate. This property can and should be changed if the certainty of the sameness of identity is subsequently determined.
It is appropriate to use this property when there is strong belief the two resources refer to the same individual with the same identity, but that association cannot be asserted at the present time with full certitude.
This property may be reified with the umbel:hasMapping property to describe the “degree” of the assertion.
At a different point along this relatedness spectrum we have unlike things that we would like to relate to one another. It might be an attribute, a characteristic or a functional property about something that we care to describe. Further, by nature of the thing we are relating, we may also be able to describe the kind of thing we are relating. The UMBEL SuperTypes (among many other options) give us one such means to characterize the thing being related.
UMBEL presently has 31 predicates for these assertions relating to a SuperType [11]. The various properties designated by umbel:relatesToXXX are used to assert a relationship between an external instance (object) and a particular (XXX) SuperType. The assertion of this property does not entail class membership with the asserted SuperType. Rather, the assertion may be based on particular attributes or characteristics of the object at hand. For example, a British person might have an umbel:relatesToXXX asserted relation to the SuperType of the geopolitical entity of Britain, though the actual thing at hand (person) is a member of the Person class SuperType.
This predicate is used for filtering or clustering, often within user interfaces. Multiple umbel:relatesToXXX assertions may be made for the same instance.
Each of the 32 UMBEL SuperTypes has a matching predicate for external topic assignments (relatesToOtherOrganism shares two SuperTypes, leading to 31 different predicates):
| SuperType | Mapping Predicate | Comments |
| NaturalPhenomena | relatesToPhenomenon | This predicate relates an external entity to the SuperType (ST) shown. It indicates there is a relationship to the ST of a verifiable nature, but which is undetermined as to strength or a full rdf:type relationship |
| NaturalSubstances | relatesToSubstance | same as above |
| Earthscape | relatesToEarth | same as above |
| Extraterrestrial | relatesToHeavens | same as above |
| Prokaryotes | relatesToOtherOrganism | same as above |
| ProtistsFungus | relatesToOtherOrganism | same as above |
| Plants | relatesToPlant | same as above |
| Animals | relatesToAnimal | same as above |
| Diseases | relatesToDisease | same as above |
| PersonTypes | relatesToPersonType | same as above |
| Organizations | relatesToOrganizationType | same as above |
| FinanceEconomy | relatesToFinanceEconomy | same as above |
| Society | relatesToSociety | same as above |
| Activities | relatesToActivity | same as above |
| Events | relatesToEvent | same as above |
| Time | relatesToTime | same as above |
| Products | relatesToProductType | same as above |
| FoodorDrink | relatesToFoodDrink | same as above |
| Drugs | relatesToDrug | same as above |
| Facilities | relatesToFacility | same as above |
| Geopolitical | relatesToGeoEntity | same as above |
| Chemistry | relatesToChemistry | same as above |
| AudioInfo | relatesToAudioMusic | same as above |
| VisualInfo | relatesToVisualInfo | same as above |
| WrittenInfo | relatesToWrittenInfo | same as above |
| StructuredInfo | relatesToStructuredInfo | same as above |
| NotationsReferences | relatesToNotation | same as above |
| Numbers | relatesToNumbers | same as above |
| Attributes | relatesToAttribute | same as above |
| Abstract | relatesToAbstraction | same as above |
| TopicsCategories | relatesToTopic | same as above |
| MarketsIndustries | relatesToMarketIndustry | same as above |
This property may be reified with the umbel:hasMapping property to describe the “degree” of the assertion.
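Since these predicates are meant for filtering and clustering, a short sketch may help show what that looks like in practice. The predicate-to-SuperType pairs below are taken from the table; the instances, the tuple layout and the function name `cluster_by_supertype` are hypothetical. Note that the same instance can appear in more than one cluster, since multiple umbel:relatesToXXX assertions are allowed, and that none of this implies class membership.

```python
from collections import defaultdict

# A few predicate-to-SuperType pairs transcribed from the table above.
SUPERTYPE_OF = {
    "relatesToGeoEntity": "Geopolitical",
    "relatesToPersonType": "PersonTypes",
    "relatesToFacility": "Facilities",
    "relatesToPlant": "Plants",
}


def cluster_by_supertype(assertions):
    """Group instances under each SuperType they are related to."""
    clusters = defaultdict(set)
    for instance, predicate in assertions:
        clusters[SUPERTYPE_OF[predicate]].add(instance)
    return dict(clusters)


if __name__ == "__main__":
    # Hypothetical British "things" that are otherwise unlike one another.
    assertions = [
        ("ex:EnglishOak", "relatesToGeoEntity"),
        ("ex:EnglishOak", "relatesToPlant"),
        ("ex:HenryVIII", "relatesToGeoEntity"),
        ("ex:HenryVIII", "relatesToPersonType"),
        ("ex:YorkMinster", "relatesToGeoEntity"),
        ("ex:YorkMinster", "relatesToFacility"),
    ]
    for supertype, instances in sorted(cluster_by_supertype(assertions).items()):
        print(supertype, sorted(instances))
```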
Descriptive properties are annotation properties.
Two annotation properties are used to describe the attribute characteristics of a RefConcept, namely umbel:hasCharacteristic and its reciprocal, umbel:isCharacteristicOf. These properties are the means by which the external properties to describe things are able to be brought in and used as lookup references (that is, metadata) to external data attributes. As annotation properties, they have weak semantics and are used for accounting as opposed to reasoning purposes.
These properties are designed to be used in external ontologies to characterize, describe, or provide attributes for data records associated with a given RefConcept. It is via this property or its inverse, umbel:hasCharacteristic, that external data characterizations may be incorporated and modeled within a domain ontology based on the UMBEL vocabulary.
The choice of these mapping predicates may be aided with a variety of techniques from the manual to the automatic. It is thus important to characterize the specific mapping methods used whenever a linking predicate is assigned. Following this best practice allows us to distinguish mapping trustworthiness, plus to also enable later segregation for the application of improved methods as they may arise.
UMBEL, for its current mappings and purposes, has adopted the following controlled vocabulary for characterizing the umbel:hasMapping predicate; such listings may be readily modified for other domains and purposes when using the UMBEL vocabulary. This controlled vocabulary is based on instances of the Qualifier class. This class represents a set of descriptions to indicate the method used when applying an approximate mapping predicate (see above):
| Qualifier | Description |
| Manual – Nearly Equivalent | The two mapped concepts are deemed to be nearly an equivalentClass or sameAs relationship, but not 100% so |
| Manual – Similar Sense | The two mapped concepts share much overlap, but are not the exact same sense, such as an action as related to the thing it acts upon |
| Heuristic – ListOf Basis | Type assignment based on Wikipedia ListOf category; not currently used |
| Heuristic – Not Specified | Heuristic mapping method applied; script or technique not otherwise specified |
| External – OpenCyc Mapping | Mapping based on existing OpenCyc assertion |
| External – DBOntology Mapping | Mapping based on existing DBOntology assertion |
| External – GeoNames Mapping | Mapping based on existing GeoNames assertion |
| Automatic – Inspected SV | Mapping based on automatic scoring of concepts using Semantic Vectors, with specific alignment choice based on hand selection |
| Automatic – Inspected S-Match | Mapping based on automatic scoring of concepts using S-Match, with specific alignment choice based on hand selection; not currently used |
| Automatic – Not Specified | Mapping based on automatic scoring of concepts using a script or technique not otherwise specified; not currently used |
Again, as noted, for other domains and other purposes this listing can be modified at will.
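The practical payoff of recording a qualifier is that mappings can later be segregated by method. The sketch below transcribes the qualifier labels from the table as (method, detail) pairs and splits a list of mappings into those to accept and those to revisit. Treating manual mappings as the trusted group is an assumption made only for this example, as are the record layout and the sample mappings.

```python
# (method, detail) pairs transcribed from the Qualifier table above.
QUALIFIERS = {
    ("Manual", "Nearly Equivalent"),
    ("Manual", "Similar Sense"),
    ("Heuristic", "ListOf Basis"),
    ("Heuristic", "Not Specified"),
    ("External", "OpenCyc Mapping"),
    ("External", "DBOntology Mapping"),
    ("External", "GeoNames Mapping"),
    ("Automatic", "Inspected SV"),
    ("Automatic", "Inspected S-Match"),
    ("Automatic", "Not Specified"),
}


def split_by_method(mappings, trusted_methods=("Manual",)):
    """Separate mappings whose method is trusted from those to re-check.

    Which methods count as trusted is an ASSUMPTION for this sketch.
    """
    trusted, review = [], []
    for mapping in mappings:
        qualifier = mapping["qualifier"]
        if qualifier not in QUALIFIERS:
            raise ValueError(f"unknown qualifier: {qualifier}")
        bucket = trusted if qualifier[0] in trusted_methods else review
        bucket.append(mapping)
    return trusted, review


if __name__ == "__main__":
    # Hypothetical mapping records.
    mappings = [
        {"source": "wiki:Cathedral", "target": "rc:Cathedral",
         "qualifier": ("Manual", "Nearly Equivalent")},
        {"source": "wiki:Oak", "target": "rc:OakTree",
         "qualifier": ("Automatic", "Inspected SV")},
    ]
    trusted, review = split_by_method(mappings)
    print(len(trusted), "trusted;", len(review), "to re-check with improved methods")
```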
Final aspects of these mappings are now undergoing a last round of review. A variety of sources and methods have been applied, to be more fully documented at time of release.
Some of the final specifics and counts may be modified slightly by the time of actual release of UMBEL v 1.00, which should occur in the next week or so. Nonetheless, here are some tentative counts for a select portion of these predicates in the internal draft version:
| Item or Predicate | Count |
| Total UMBEL Reference Concepts | 27,917 |
| owl:equivalentClass (external OpenCyc, PROTON, DBpedia) | 28,618 |
| umbel:correspondsTo (direct mappings to Wikipedia) | 16,884 |
| rdf:type | 876,125 |
| umbel:relatesToXXX | 3,059,023 |
| Unique Wikipedia Pages Mapped | 2,130,021 |
All of these assignments have also been hand inspected and vetted.
To date, in various steps and in various phases, the inspection of Wikipedia, its categories, and its match with UMBEL has perhaps incurred more than 5,000 hours (or nearly a three person-year equivalence) of expert domain and semantic technology review [12]. As noted, about 60% (16,884 of 27,917) of UMBEL concepts have now been directly mapped to Wikipedia and inspected for accuracy.
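The coverage figure can be checked directly from the tentative counts. The short calculation below uses only the numbers in the table; the pages-per-mapping figure is a rough average derived here for illustration, not a statistic reported in the post.

```python
# Tentative counts from the internal draft version, as listed above.
reference_concepts = 27_917
direct_mappings = 16_884        # umbel:correspondsTo links to Wikipedia
wikipedia_pages_mapped = 2_130_021

# Share of reference concepts with a direct Wikipedia mapping.
coverage = direct_mappings / reference_concepts

# Rough average number of unique pages linked per direct mapping.
pages_per_mapping = wikipedia_pages_mapped / direct_mappings

if __name__ == "__main__":
    print(f"coverage: {coverage:.1%}")
    print(f"pages per direct mapping: {pages_per_mapping:.0f}")
```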
Wikipedia provides the most demanding and complete mapping target available for testing the coverage of UMBEL’s reference concepts and the adequacy of its vocabulary. As a result, we have added to and refined the mapping and linking predicates used in the UMBEL vocabulary, and added a Qualifier class to record the mapping process, as this post overviews. We have added the SuperType class to better organize and disambiguate large knowledge bases [13]. And, in this mapping process, we have expanded UMBEL’s reference concepts by about 33% to improve coverage, while remaining consistent with its origins as a faithful subset of the venerable Cyc knowledge structure [14].
A side benefit that has emerged from these efforts — with a huge potential upside — is the valuable combination of UMBEL and Wikipedia as a “gold standard” for aligning and mapping knowledge bases. Such a standard is critically needed. For example, in reviewing many of the existing Wikipedia mappings claimed as accurate, we found misplacement errors that averaged 15.8% [15]. Having a baseline of vetted mappings will aid future mappings. Moreover, having a complete conceptual infrastructure over Wikipedia will enable new and valuable reasoning and inference services.
The results from the UMBEL v 1.00 mapping are promising and very much useful today, but by no means complete. Future versions will extend the current mappings and continue to refine its accuracy and completeness [16]. What we can say, however, is that a coherent organization and conceptual schema — namely, UMBEL — overlaid on the richness of the instance data and content of Wikipedia, can produce immediate and useful benefits. These benefits apply to semantic search, semantic annotation and tagging, reasoning, discovery, inferencing, organization and comparisons.
umbel:correspondsTo predicate is used to assert close correspondence between UMBEL Reference Concepts and Wikipedia categories or pages, yet without entailing the actual Wikipedia category structure.
owl:sameAs may lead to contradictions. However, virtually merging the descriptions in a co-reference engine is fine; both provide information that is useful in disambiguating future references as well as for many other purposes.”
Reasons for and Implications from Innovation Moving to Consumers
Today, the headlines and buzz for information technologies center on smartphones, social networks, cloud computing, tablets and everything Internet. Very little is now discussed about IT in the enterprise. This declining trend began about 15 years ago, and has been accelerating over time. Letting the air out of the enterprise IT balloon has some profound reasons and implications. It also has some lessons and guidance related to semantic approaches and technologies and their adoption by enterprises.
One can probably clock the start of enterprise information technology (IT) to the first use of mainframe computers in the early 1950s [1], or sixty years ago. The earliest mainframes were huge and expensive machines that required their own specially air-conditioned rooms because of the heat they generated. The first use of “information technology” as a term occurred in a Harvard Business Review article from 1958 [2].
Until the late 1960s computers were usually supplied under lease, and were not purchased [3]. Service and all software were generally bundled into the lease amount without separate charge and with source code provided. Then, in 1969, IBM led an industry change by starting to charge separately for (mainframe) software and services, and ceasing to supply source code [3]. At about the same time integrated circuits enabled computer sizes to be reduced, with minicomputers such as those from DEC causing a marked expansion in the number of potential customers. Enterprise apps became a huge business, with software licensing and maintenance fees achieving a peak of 70% of IT vendor total revenues by the mid-1990s [4]. However, since that peak, enterprise software as a portion of vendor revenues has been steadily eroding.
One of the earliest enterprise applications was in transaction systems and their underlying database management software. The relational database management system (RDBMS) was initially developed at IBM. Oracle, based on early work for the CIA in the late 1970s and its innovation to write in the C programming language, was able to port the RDBMS to multiple operating systems. These efforts, along with those of other notable vendors (most of which like Informix no longer exist), led to the RDBMS becoming more or less the de facto standard for data management within the enterprise by the 1980s. Today Oracle is the largest supplier of RDBMS software globally, and other earlier database system designs such as network databases or object databases fell out of favor [5].
In 1975, the Altair 8800 was introduced to electronics hobbyists as the first microcomputer, followed then by Apple II and the IBM PC in 1981, among others. Rapidly a slew of new applications became available to the individual, including spreadsheets, small databases, graphics programs and word processors. These apps were a boon to individual productivity and the IBM PC in particular brought credibility and acceptance within the enterprise (along with the growth of Microsoft). Novell and local area networks also pointed the way to a more distributed computing future. By the late 1980s virtually every knowledge worker within enterprises had some degree of computer literacy.
The apogee for enterprise software and apps occurred in the 1990s, with whole classes of new applications (most denoted by three-letter acronyms) such as enterprise resource planning (ERP), business intelligence (BI), customer relationship management (CRM), enterprise information systems (EIS) and the like coming to the fore. These systems also began as proprietary software, which resulted in the “stovepiping” or creating of information silos. In reaction and with great market acceptance, vendors such as SAP arose to provide comprehensive, enterprise-wide solutions, though often at high cost and with significant failure rates.
More significantly, the 1990s also saw the innovation of the World Wide Web with its basis in hypertext links on the Internet. Greatly facilitated by the Mosaic Web browser, the basis of the commercial Netscape browser, and the HTML markup language and HTTP transport protocol, millions began experiencing the benefit of creating Web pages and interconnecting. By the mid-1990s, enterprises were on the Web in force, bringing with them larger content volumes, dynamic databases and enterprise portals. The ability for anyone to become a publisher led to a focus and attention on the new medium that led to still further innovations in e-commerce and online advertising. New languages and uses of Web pages and applications emerged, creating a convergence of design, media, content and interactivity. Venture capital and new startups with valuations independent of revenues led to a frenzy of hype and eventually the dot com crash of 2000.
The growth companies of the past 15 years have not had the traditional focus on enterprises, but on the use and development of the Web. From search (Google) to social interactions (Facebook) to media and video (Flickr, YouTube) and to information (Wikipedia), the engines of growth have shifted away from the enterprise.
Meanwhile, the challenges of data integration and interoperability that were such a keen focus going back to initial enterprise computerization remain. Now, however, these challenges are even greater, as we see images, documents (unstructured data) and Web pages, markup and metadata (semi-structured data) become first-class information citizens. What was a challenge in integrating structured data in the 1980s and 1990s via data warehousing, has now become positively daunting for the enterprise with respect to scale and scope.
The paradox is that as these enterprise needs increased, the attractiveness of the enterprise from an IT perspective has greatly decreased. It is these factors we discuss below, with an eye to how Web architecture, design and opportunities may offer a new path through the maze of enterprise information interoperability.
Since 1995 the Gartner Group has been producing its annual Hype Cycle [6]. The clientele for this research is the enterprise, so Gartner’s presentation of what’s hot and what’s hype and what is being adopted is a good proxy for the IT state of affairs in enterprises. These graphs, from 2006 onward, are reproduced below. Note how many of the items shown are not very specific to the enterprise:
References to architectures and content processing and related topics were somewhat prevalent in 2006, but have disappeared most recently. In comparison to the innovations noted under the History discussion, it appears that the items on Gartner’s radar are more related to consumer applications and uses. We no longer see whole new categories of enterprise-related apps or enterprise architectures.
The kinds of innovations that are being discussed as important to enterprises in the coming year [7,8] tend to mostly leverage existing innovations in other areas or to wrinkle existing approaches. One report from Constellation Research, for example, lists the five core disruptive technologies of social, mobile, cloud, analytics and unified communications [7]. Only analytics could be described as enterprise focused or driven.
And, even in analytics, the kinds of things being promoted are self-service reporting or analysis [8]. In essence, these opportunities represent the application of Web 2.0 techniques to bring reporting or analysis directly to the analyst. Though important and long overdue, such innovations are more derivative than fundamental.
Master data management (MDM) is another touted area. But, to read analysts’ predictions in these areas, it feels like one has stepped into a time warp of technologies and options from a decade ago. When has XML felt like an innovation?
Of course, there is a whole industry of analysts who make their living prognosticating to enterprises about what to expect from information technologies and how to adopt and embrace them. The general observations, across the board, seem to center on items such as smartphones and mobile, moving to the cloud for software or platforms (SaaS, PaaS), and collaboration and social networks. As I note below, there is nothing inherently wrong or unexciting per se about these trends. But, what does appear true is that the locus of innovation has shifted from the enterprise to consumers or the Internet.
The shift in innovation away from the enterprise has been structural, not cyclical. That means that very fundamental forces are at work to cause this change in innovation focus. It does not mean that innovation has permanently shifted away from the enterprise (organizations), but that some form of countervailing structural changes would need to occur to see a return to the IT focus on the enterprise from prior decades.
I think we can point to seven structural reasons for this shift, many of which interact with one another. While all of them are bringing benefits (some yet to be foreseen) to the enterprise, and therefore are to be lauded, they are not strictly geared to address specific enterprise challenges.
As pundits say, “The Internet changes everything” [9]. For the reasons noted under the history above, the most important cause for the shift in innovation away from the enterprise has been the Internet.
One aspect that is quite interesting is the use of Internet-based technologies to provide “outsourced” enterprise applications hosted on Web servers. Such “cloud computing” leverages the technologies and protocols inherent to the Internet. It shifts hosting, maintenance and upgrade responsibilities for conventional apps to remote providers. Initially, of course, this simply shifts locus and responsibility from in-house to a virtual party. But such changes will also promote more subtle shifts in collaboration and interaction possibilities, and they allow quick upgrades of the underlying infrastructure and application software.
The implications for existing enterprise IT staff, traditional providers, and licensing and maintenance approaches are profound. The Internet and cloud computing will perhaps have a greater effect on governance, staffing and management than application functionality per se.
The captivating IT-related innovations at present are mobile (smartphones) and their apps, tablets and e-book readers, Internet TV and video, and social networks of a variety of stripes. Somewhat like the phenomenon of when personal computers first appeared, many of these consumer innovations have applicability to the enterprise, though only as a side effect.
It is perhaps instructive to look back at the adoption of PCs in the enterprise to understand the possible effect of these new consumer innovations. Central IT was never able to control and manage the proliferation of personal computers, and only began to understand years later what benefits and new governance challenges they brought. Enterprise leaders will understand how to embrace and extend today’s new consumer technologies for the enterprise’s benefits; laggards will resist to no avail.
The ubiquity of computing will have an enormous impact on the enterprise. The understanding of what makes sense to do on a mobile basis with a small screen and what belongs on the desk or in the office is merely a glimmer in the current conversation. However, in the end, like most of the other innovations noted in this analysis, the enterprise will largely be a reactive player to these innovations. Yes, the implications will be profound, but their inherent basis is not grounded in unique enterprise challenges. Nonetheless, adapting to them and changing business practice will be critical to asserting enterprise leadership.

Ten years ago open source was largely dismissed in the enterprise. About five years ago VCs and others began funding new commercial open source ventures, even while there were still rear guard arguments from enterprises resisting open source. Meanwhile, as the figure to the right shows, open source projects were growing exponentially [10].
The shift to open source in the enterprise, still ongoing, has been rapid. Within 5 years, more than 50% of enterprise software will be open source [11]. According to an article in Fortune magazine last year [12], a Forrester Research survey found that 48% of enterprise respondents were using open source operating systems, and 57% were using open source code. A similar Accenture survey of 300 large public and private companies found that half are committed to open source software, with 38% saying they would begin using open source software for “mission-critical” applications over the next 12 months.
There are likely many reasons for this shift, including the Internet itself and its basis in open source. Many of the most successful companies of the past 15 years, including Amazon, Google, Facebook and virtually any large Web site, have shown excellent performance and scalability by building their IT infrastructure around open source foundations. Most of the large, existing enterprise IT vendors, notably including IBM, Oracle, Nokia, Intel, Sun (prior to Oracle), Citrix, Novell (just acquired by Attachmate) and SAP, have bought open source providers or have visible support for open source initiatives. Even two of the most vocal proprietary source proponents of the past, HP and Microsoft, have begun to make moves toward open source.
The age of proprietary software based on proprietary standards is dead. The monopoly rents formerly associated with unique, proprietary platforms and large-scale enterprise apps are over. Even where software remains proprietary, it is embracing open standards for data interchange and APIs. Traditional enterprise apps such as content management, business intelligence and ETL, among all others, are being penetrated by commercial open source offerings (as examples, Alfresco, Pentaho and Talend, respectively). The shift to services and new business models appears to be an inexorable force.
Declining profit margins, matched with the relatively high cost of marketing and sales to enterprises, mean attention and focus have been shifting away from the enterprise. And with these shifts in focus has come a reduction in enterprise-focused innovation.
It is not unusual to find deployed systems within enterprises as old as thirty years [13]. So long as they work reasonably well, systems once installed — along with their data — tend to remain in operation until their platforms or functionality become totally obsolete. This leads to rather lengthy turnover cycles, and slow development cycles.
Slow cycles in themselves slow innovation. But slow development cycles are also a disincentive to attract the most capable developers. When development tends to focus on maintenance and scripts and more routines of the same nature, the best developers tend to migrate elsewhere (see next).
Another aspect of slow development cycles is the imperative for new enterprise IT to relate to and accommodate legacy systems — again, including legacy data. This consideration is the source of one of the negative implications of a shift away from innovation in the enterprise: the orphaning of existing information assets.
Arguably the emphasis on consumer and Internet technologies means that is where the best developers gravitate. Developing apps for smartphones or working at one of the cool Internet companies or joining a passionate community of open source developers is now attracting the best developers. Open source and Web-based systems also lead to faster development cycles. The very best developers are often the founders of the next generation startups and Web and software companies [14].
While, of course, huge numbers of computer programmers and IT specialists are hired by enterprises each year, the motivations tend to be higher pay, better benefits and more job security. The nature of the work and the bureaucracy and routine of many IT functions require such compensation. And, because of the other shifts noted elsewhere, even the software startups that are able to attract the most innovative developers no longer tend to develop for enterprise purposes.
The number of computer science students has been declining in industrialized countries for some time, and that is the category of slowest growth in IT [14]. Meanwhile, existing IT personnel often have expertise in older legacy systems or have been focused on bug fixes and more prosaic tasks like report writing. Narrow job descriptions and work activities also keep many existing IT personnel from getting exposed to or learning about new trends or innovations, such as the semantic Web.
Declining numbers of new talent, plus declining interest by that talent, combined with (often) narrow and legacy expertise of existing talent, create a disappointing shortfall in the energy and innovation available to address enterprise IT challenges. Enterprises have it within their power to create more exciting career opportunities to overcome these limitations, but unfortunately IT management often also appears challenged to get on top of these structural forces.
Open source and Internet-based systems have reduced the capital necessary for a new startup by an order of magnitude or so over the past decade. It is now quite possible to get a new startup up and running for tens to hundreds of thousands of dollars, as opposed to the millions of dollars required in years past. This is leading to more startups, more startups per innovator, and quicker startup and abandonment cycles. Ideas can be tried quickly and more easily thrown away [15].
These dynamics are acting to accelerate overall development cycles and to cause a shift in funding structures and funding amounts by VCs and angels. The kind of market and sales development typical for many enterprise sales does not fit well within these dynamics and is a countervailing force for more capital when all trends point the other way.
In short, money goes to where the returns are, and returns in the enterprise sector are no longer what they were in decades past. Again, this means a hollowing out of innovation for enterprises.
As an earlier reference noted [4], software revenues as a percent of IT vendor revenues peaked in about the mid-1990s. As profitability for these entities began to decline, so did the overall attractiveness of the sector.
As the next chart shows, coincident with the peak in profitability was the onset of a consolidation trend in the enterprise IT vendor sector [16]. The chart below shows that three of the largest IT vendors today — Oracle, IBM and HP — began an acquisition spree in the mid-1990s that has continued until just recently, as many of the existing major players have already been acquired:
Notable acquisitions over this period include: Oracle — PeopleSoft, Siebel Systems, MySQL, Hyperion, BEA and Sun; HP — EDS, 3Com, VeriFone, Compaq, Palm and Mercury Interactive; IBM — Lotus, Rational, Informix, Ascential, FileNet, Cognos and SPSS. Published acquisition costs exceeded $130 billion, mostly for the larger deals. But terms for 75% of the 262 transactions were not disclosed [16]. The total value of these consolidations likely approaches $200 billion to $300 billion.
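The estimate above can be checked with some rough arithmetic. The following is only a sketch under stated assumptions: it treats the published $130 billion as a floor, takes the 75% undisclosed share of the 262 transactions at face value, and asks what average price per undisclosed deal the $200 billion to $300 billion range would imply. The function and constant names are illustrative, not from the cited source.

```python
# Figures taken from the text; all dollar amounts are in billions.
PUBLISHED_TOTAL_BN = 130.0   # published acquisition costs (a lower bound)
TRANSACTIONS = 262           # total transactions counted
UNDISCLOSED_SHARE = 0.75     # share of deals with undisclosed terms


def implied_average_undisclosed(total_estimate_bn: float) -> float:
    """Average value per undisclosed deal implied by a total estimate.

    Assumes everything above the published total comes from
    the undisclosed deals.
    """
    undisclosed_deals = TRANSACTIONS * UNDISCLOSED_SHARE
    return (total_estimate_bn - PUBLISHED_TOTAL_BN) / undisclosed_deals


if __name__ == "__main__":
    for total in (200.0, 300.0):
        average = implied_average_undisclosed(total)
        print(f"total ${total:.0f}B -> ${average:.2f}B per undisclosed deal")
```

Under these assumptions, the range works out to roughly $0.36 billion to $0.87 billion per undisclosed deal, which fits the observation that the published figures mostly cover the larger deals.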
Clearly, the market is now favoring large players with large service components. This consolidation trend does belie one early criticism of open source vs. proprietary software: that proprietary software is likely to be better supported. In theory this might be true, but vanishing suppliers do not bode well for support either. Over time, we may well see successful open source projects showing greater longevity than many IT vendors.
This discussion is not a boo-hoo because the heyday of enterprise IT innovation is past. Much of that innovation was expensive, often failed to achieve successful adoption, and promoted walled gardens and silos. As someone who ran companies directly involved in enterprise software sales, I personally do not miss the meetings, the travel, the suits and the 18-month sales cycles.
The enterprise has gained much from outside innovation in the past, from the personal computer to LANs and browsers and the Internet. To be sure, today's mobile phones have more computing power than the original Space Shuttle [17], and continued mashup and social engagement innovations will have unforeseen and manifest benefits for enterprises. I think this is unalloyed goodness.
We can also see innovations based on the Internet such as the semantic Web and its languages and standards to promote interoperability. Breaking these barriers is critically needed by enterprises of the future. Data models such as RDF [18] and open world mindsets that better accommodate uncertainty and breadth of information [19] can only be seen as positive. The leverage that will come from these non-enterprise innovations may in the end prove to be as important as the enterprise-specific innovations of the past.
Yet a shift to Internet and consumer IT innovation carries some implications. These concerns have to do with the unique demands and needs of enterprises. One negative implication is that a diminishing supplier base may not lead to actual deployments that are enterprise-ready or -responsive.
The first concern relates to quality and operational integrity. There is an immense gulf between ISO 9000 or Six Sigma and, for example, the “good enough” of standard search results on the Web. Consumer apps do not impose the same thresholds for quality as demanded by paying bosses or paying customers. This is not a value judgment; simply a reality. I see it reflected in the quality of tools and code for many new innovations today on the Web.
Proofs-of-concept and “cool” demos work well for academic theses or basic intros to new concepts. The 20% of effort that gets you 80% of the way goes a long way toward pointing to new innovation; but the 80% of effort needed to cover the last 20% is where enterprises bet their money. Unfortunately, in too many instances, that gap is not being filled. The last 20% is hard work, often boring, and certainly not as exciting as the next Big Thing. And, as the trends above try to explicate, there are also diminishing rewards for living in that territory.
A similar and second concern pervades data interoperability. Data interoperability has been the central challenge of enterprise IT for at least three decades. As soon as we were able to interconnect systems and bridge differences in operating systems and data schema, the Holy Grail has been breaking information barriers and silos. The initial attempts with proprietary data warehouses or enterprise-wide ERP systems were wrongly trying to apply closed solutions to inherently open problems. But, now, finally when we have the open approaches and standards in hand for bridging these gaps, the attractiveness of doing so for the enterprise seems to have vanished.
For example, we see demos, tools and algorithms being published all over the place that show promising advances or improvements in the semantic Web or linked data (among other areas; see [20]). Some of these automated techniques sound wonderful, but real systems require the hard slog of review and manual approval. Quality matters. If Technique A, say, shows an improvement over Technique B of 5%, that is worth touting. But even at 98% accuracy, we will still find 20,000 errors in a population of 1 million items. Such errors will simply not work in having trains run on time, seats be available on airplanes, or inventory get to its required destinations.
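The arithmetic behind that error count is simple enough to sketch. The snippet below is a minimal illustration, not a measurement of any real system; the function name `expected_errors` and the extra accuracy levels beyond 98% are assumptions added for comparison.

```python
def expected_errors(population: int, accuracy: float) -> int:
    """Return the expected number of wrong items at a given accuracy rate."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    # Round to the nearest whole item to absorb floating-point noise.
    return round(population * (1.0 - accuracy))


if __name__ == "__main__":
    # 98% is the figure from the text; the others are illustrative.
    for accuracy in (0.98, 0.99, 0.999):
        errors = expected_errors(1_000_000, accuracy)
        print(f"{accuracy:.1%} accurate -> {errors:,} errors per million items")
```

Even at 99.9% accuracy, a population of 1 million items still leaves 1,000 errors for someone to find and fix, which is the hard slog in question.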
What can work from the standpoint of linkage or interoperability on the Web according to consumer standards will simply not fly for many enterprises. But, where are the rewards for tackling that hard slog?
Another concern is security and differential access. Open Web systems, bless their hearts, do not impose the same access and need-to-know restrictions as information systems within enterprises. If we are to adopt Web-based approaches to the next-generation enterprise (a position we strongly advocate), then we are also going to need to figure out how to marry these two world views. Again, there appears to be an effort-reward mismatch here.
These observations are not meant to be a polemic, but a statement of more-or-less current circumstances. Since its widescale adoption, the major challenge (and opportunity) of enterprise IT has been how to leverage the value within the enterprise’s existing digital information assets. That challenge is augmented today with the availability of literally a whole world of external digital knowledge. Yet, the energy and emphasis for innovation to address these challenges have seemingly shifted to consumers and away from the enterprise.
Economics abhors a vacuum. I think two responses to this circumstance are likely. The first is that new vendors will emerge to address these gaps, but with different cost structures and business models. I’d like to think my own firm, Structured Dynamics, is one of these entities. We will discuss at a later time how we are addressing this opportunity and how our business model differs. In any case, any such new player will need to take account of some of the structural changes noted above.
Another response can come from enterprises themselves, using and working the same forces of change noted earlier. Via collaboration and open source, enterprises can band together to contribute resources, expertise and people to develop open source infrastructures and standards to address the challenges of interoperability. We already see exemplars of such responses in somewhat related areas via initiatives such as Eclipse, Apache, W3C, OASIS and others. By leveraging the same tools of collaboration and open data and systems and the Internet, enterprises can band together and ensure their own self-interests are being addressed.
One advantage of this open, collaborative approach is that it is consistent with the current innovation trends in IT. But the real advantage is that it works and is needed. Without it, it is unclear how the enterprise IT challenge — especially in data interoperability — will be met.