Posted: July 18, 2011

A Decade of Remarkable Advances in Ten Grand IT Challenges

I’ve been in the information theory and technology game for quite some time, but believe nothing has matched the pace of advances of the past ten years. As one example, it was a mere eight years ago that I was sitting in a room with language translation vendors contemplating automated translation techniques for US intelligence agencies. The prospects finally looked doable, but the success of large-scale translation was not assured.

At about that same time, and the years until just recently, a whole slew of Grand Challenges [1] in computing hung out there: tantalizing yet not proven. These areas ranged from information extraction and natural language understanding to speech recognition and automated reasoning.

But things have been changing fast, and with a subtle steadiness that has caused it to go largely unremarked. Sure, all of us have been aware of the huge changes on the Web and search engine ubiquity and social networking. But some of the fundamentally hard problems in computing have also gone through some remarkable (but largely unremarked) advances.

We now have smart phones that speak instructions to us while we instruct them by voice in turn. Virtually all information conceivable is now indexed and made available through the Web; structure is now rapidly characterizing that information, making it even more useful to discover and organize. We can translate documents online with acceptable accuracy into more than 60 languages [2]. We can get directions to or see satellite views of virtually any place on earth. We have in fact become accustomed to new technology magic on a nearly daily basis, so much so that the pace of these advances seems to be a constant, blunting our perspective of just how rapidly these advances have progressed.

These advances are perhaps not the realization of artificial intelligence as articulated in the 1950s to 1980s, but are contributing to a machine-based ability to do tasks useful to humans heretofore impossible and at scales unimaginable. As Google and IBM’s Watson are showing, statistics (among other techniques) applied to massive knowledge bases or text corpora are breaking down all of the Grand Challenges of symbolic computing. The image that is emerging is less one of intelligent machines working autonomously than it is of computers working interactively or semi-automatically with humans to address previously unsolvable problems.

Looking back over the past decade also takes us to the seminal paper on the semantic Web by Berners-Lee, Hendler and Lassila from May 2001 [3]. Yet, while this semantic Web vision has been a contributor to the success of the Grand Challenge advances of the past ten years, I think we can also say that it has not been the key or even a primary driver. That day may yet come. Rather, I think we have to look to natural language and statistics surrounding large-scale corpora as the more telling drivers.

Ten Grand Challenge Advances

Over the past ten years there have been significant advances on at least ten Grand Challenges in symbolic computation. As the concluding section notes, these advances can be traced in most part to broader advances in natural language processing, the logical and semiotic bases for interoperability, and standards (nominally in the semantic Web) for embracing them. Here are these ten areas of advance, all achieved over the past ten years:

#1 Information Extraction

Information extraction (IE) uses various forms of natural language processing (NLP) to identify structured information within unstructured or semi-structured documents. These documents are presented in machine-readable form (including straight text, various document formats or HTML) with the various types of information “tagged” or prompted for inclusion. Information types that can be extracted with one of the various techniques include entities, relations, topics, categories, and so forth. Once tagged or extracted, the information in the documents can now be included and linked to standard structured information (as might come from conventional databases) or to structure in other documents.
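As a rough illustration of the tagging described above, here is a minimal sketch in Python. It assumes a tiny hand-made gazetteer and a single date pattern (both hypothetical, invented for this example) and is nowhere near what the NLP-based services do; it only shows how unstructured text can yield typed, positioned entities that could then be linked to structured data.

```python
import re

# Hypothetical gazetteer: a tiny lookup of known entity names and their types.
GAZETTEER = {
    "Google": "Organization",
    "IBM": "Organization",
    "Wikipedia": "Organization",
}

# Illustrative pattern for dates written like "July 17, 2011".
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
DATE_PATTERN = re.compile(r"\b(?:" + MONTHS + r") \d{1,2}, \d{4}\b")


def extract_entities(text):
    """Return (surface_text, entity_type, start_offset) tuples, in document order."""
    found = []
    # Gazetteer lookup: match each known name as a whole word.
    for name, entity_type in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.group(0), entity_type, match.start()))
    # Pattern-based extraction for dates.
    for match in DATE_PATTERN.finditer(text):
        found.append((match.group(0), "Date", match.start()))
    return sorted(found, key=lambda item: item[2])


sample = "As of July 17, 2011, Google offered 63 languages for translation."
entities = extract_entities(sample)
for surface, entity_type, offset in entities:
    print(offset, entity_type, surface)
```

Real extractors replace the lookup table and the regular expression with statistical models, but the output shape (a span of text, a type, and a position) is much the same.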

Most recently, a large number of online services and open source systems have also become available with strengths in one or more of these extraction types [4]. Some current examples include Yahoo! Term Extraction, OpenCalais, BeliefNetworks, OpenAmplify, Alchemy API, Evri, Extractiv, Illinois Tagger, and about 80 others [4].

#2 Machine Translation

Machine translation is the automatic translation of machine-readable text from one human language to another. Accurate and acceptable machine translation requires applying different types of knowledge including grammar, semantics, facts about the real world, etc. Various approaches have been developed and refined over time.

Especially helpful has been the availability of huge corpora in multiple languages to which large-scale statistical analysis may be applied (as is the case of Google’s machine translation) or human editing and refinement (as is the case with the more than 280 language versions of Wikipedia).

While it is true none of these systems have 100% accuracy (even human translators show much variation), the more advanced ones are truly impressive with remaining ambiguities flagged for resolution by semi-automatic means.

#3 Sentiment Analysis

Though sentiment analysis is strictly speaking a subset of information extraction, it has the more demanding and useful task of extracting subjective information, often across a group of documents or texts. Sentiment analysis can be applied to online reviews to determine the “polarity” about specific objects, and it is especially useful for identifying public opinion trends or evaluating social media for ranking, polling or marketing purposes.
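To make the idea of “polarity” concrete, the following is a minimal lexicon-based sketch in Python. The word lists and the sample reviews are hypothetical, and counting words is far cruder than what leading systems do (it ignores negation, sarcasm and context), so treat it only as an illustration of scoring texts and then aggregating across a group of them.

```python
import re

# Hypothetical, very small opinion lexicons (assumptions for illustration).
POSITIVE = {"good", "great", "excellent", "impressive", "love"}
NEGATIVE = {"bad", "poor", "terrible", "disappointing", "hate"}


def polarity(text):
    """Classify one text as positive, negative or neutral by counting lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def aggregate(reviews):
    """Tally polarity across a group of texts, as one might for opinion trends."""
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for review in reviews:
        counts[polarity(review)] += 1
    return counts


reviews = [
    "Great phone, excellent screen",
    "Terrible battery and poor support",
    "It arrived on Tuesday",
]
print(aggregate(reviews))
```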

Because of its greater difficulty and potential high value, many of the leading sentiment analysis capabilities remain proprietary. Some capable open source versions are available nonetheless. There is also an interesting online application using Twitter feeds.

#4 Disambiguation

Many words have more than one meaning. Word sense disambiguation uses either machine learning, dictionaries (gazetteers) of known entities and concepts, ontologies or linguistic databases such as WordNet, or combinations thereof to evaluate ambiguous terms or phrases and resolve them based on context. Some systems need to be “trained”, some work automatically, and others rely on evaluation and prompting (semi-automatic) to complete the disambiguation process.

State-of-the-art systems have greater than 90% precision [5]. Most of the leading open source NLP toolkits have quite capable disambiguation modules, and even better proprietary systems exist.
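One classic dictionary-based approach is to pick the sense whose definition shares the most words with the surrounding context (a simplified form of the Lesk algorithm). The sketch below assumes a tiny, hand-written sense inventory for a single word; the glosses and stopword list are hypothetical stand-ins for a real resource such as WordNet, and this is not how the systems with better than 90% precision work.

```python
import re

# Hypothetical sense inventory: word -> {sense label: dictionary-style gloss}.
SENSES = {
    "bank": {
        "financial institution": "an institution that accepts deposits and lends money",
        "river edge": "sloping land beside a river or stream of water",
    }
}

# Small illustrative stopword list so that function words do not count as overlap.
STOPWORDS = {"a", "an", "the", "of", "and", "or", "that", "to", "in", "on", "at", "by", "beside"}


def content_words(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def disambiguate(word, context):
    """Return the sense label whose gloss overlaps most with the context words."""
    context_words = content_words(context) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context_words & content_words(gloss))
        if overlap > best_overlap:  # ties keep the first-listed sense
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("bank", "She opened an account to deposit money at the bank"))
print(disambiguate("bank", "They fished from the bank of the river"))
```

The first call resolves to the financial sense because “money” appears in both the context and that gloss; the second resolves to the river sense through “river”.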

#5 Speech Synthesis and Recognition

Speech synthesis is the conversion of text to spoken speech and has been around for quite some time. Speech recognition is a far more difficult task in that a given sound clip or real-time spoken speech of a person must be converted to a textual representation, which itself can then be acted upon such as navigating or making selections. Speech recognition is made difficult because of individual voice differences, the variations of human languages and speech patterns, and the need to segment speech into a sequence of words. (In most spoken languages, the sounds representing successive letters blend into each other, so the conversion of the modulated wave form to discrete characters or tokens can be a very difficult process.)

Crude systems of a decade ago required much training with a specific speaker’s voice to show much effectiveness. Today, the range and ability to use these systems without training has markedly improved.

Until recently, improvements largely were driven by military and intelligence requirements. Today, however, with the ubiquity of smart phones and speech interfaces, the consumer market is greatly accelerating progress.

#6 Image Recognition

Image recognition is the ability to determine whether or not an electronic image contains some specific object, feature, or activity, and then to extract the image data associated with it. Today, under specific circumstances and for specific tasks, this can be done by computer. However, for the general case of arbitrary objects in arbitrary situations this challenge has not yet been fully met. The systems of today work best for simple geometric objects (e.g., polyhedra), human faces, printed or hand-written characters, or vehicles, and in specific situations, typically described in terms of well-defined illumination, background, and orientation of the object relative to the camera.

Auto license recognition at intersections, face recognition by security cameras, and greatly expanded and improved character recognition systems (machine vision) represent some of the current state-of-the-art. Again, smart phone apps are helping to drive advances.

#7 Interoperability Standards and Methods

Rapid Progress in Climbing the Data Federation Pyramid

Most of the previous advances are related to extracting structured information or mapping or deriving additional structured information. Once obtained, of course, the next challenge is in how to relate that information together; that is, how to make it interoperate.

We have been steadily climbing a data federation pyramid [6] — and at an impressively accelerating rate since the adoption of the Internet and Web. These network innovations gave us a common basis and protocols for connecting distributed devices. That, in turn, has freed us to concentrate on the standards for data representation and interoperability.

XML first provided a means for a common data serialization that encouraged various communities and industries to devise exchange vocabularies. RDF provided a means for a common data model, one that was both simple and extensible at the same time [7]. OWL built upon that basis to enable us to build common domain models (see next).
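The simplicity and extensibility of the RDF data model can be sketched without any RDF library at all: every statement is a subject-predicate-object triple, and combining sources is just a union of their triples. The Python below is only an illustration of that idea; the `ex:` names are hypothetical identifiers, and real RDF uses full URIs, typed literals and standard serializations.

```python
def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Two independent sources describing overlapping things (hypothetical data sets).
source_a = {
    ("ex:Peirce", "ex:occupation", "logician"),
    ("ex:Peirce", "ex:bornIn", "ex:Cambridge"),
}
source_b = {
    ("ex:Peirce", "ex:field", "semiotics"),
    ("ex:Cambridge", "ex:locatedIn", "ex:Massachusetts"),
}

# Federation is a set union: no schema migration is needed to add new predicates.
merged = source_a | source_b

for triple in match(merged, s="ex:Peirce"):
    print(triple)
```

Because both sources use the same identifier for the same thing, the merged set can answer questions neither source could answer alone, which is the interoperability point being made here.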

There are alternatives to the semantic Web standards of RDF and OWL such as common logic and there are many competing data exchange formats to XML. None of these standards is essential on its own and all have their communities and advocates. However, because they are standards and they share common network bases, it has also been relatively easy to convert amongst the various available protocols. We are nearly at a global level where everything is connected, machine-readable, and in structured form.

#8 Common Domain Models

Semantics in machine-readable form means that we can more confidently link and combine available information. We are seeing a veritable explosion of domain models to represent various domains and viewpoints in consensual, interoperable form. What this means is that we are now gaining the computing vocabularies and grammars — along with shared community models (world views) — to get this stuff to work together.

Five years ago we called this phenomenon mashups, but no one uses that term any longer because these information brewpots are everywhere, including in our very hands when we interact with the apps on our smart phones. This glue of domain models is generally as invisible to us as is the glue in laminates or the resin in plastics. But these models are the strength and foundations nonetheless that enable much of the computing magic unfolding around us.

#9 Virtual Apps (Cloud Computing)

Once the tyranny of physical separation was shattered between data and machine by the network, the rationale for keeping the data with the app or even the user with the app disappeared. Cloud computing may seem mysterious or sound to have some high-octave hum, but it really is nothing more than saying that the Web enables us to treat all of our computing resources as virtual. Data can be anywhere; machines and hard drives can be anywhere; and applications can be anywhere.

And, virtualness brings benefits in and of itself. Whole computing environments can be installed or removed nearly instantaneously. Peak computing demands can be met with virtual headrooms. Backup and rollover and redundancy practices and strategies can change. Web services mean tailored capabilities can be invoked from anywhere and integrated for local needs. Massive computing resources and server farms can be as accessible to the individual as they are to prior computing behemoths. Combined with continued advances in underlying computing hardware and chips, the computing power available to any user is rising exponentially. There is now even more power in the power curve.

#10 Big Data

One hears stories of Google or the National Security Agency having access to and managing servers measured in the hundreds of thousands. Entirely new operating systems and computing environments (many with roots in open source), such as virtual operating systems and MapReduce approaches like Hadoop, have been innovated to deal with the current era of “big data”.

MapReduce is a framework for processing huge datasets using a large number of servers. The “map” step partitions the problem into tractable sub-problems, organized in a tree structure. The “reduce” step then takes the answers to all the sub-problems and combines them to produce the final output.
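The two steps can be sketched as a word count, the customary first MapReduce example. This Python version runs in a single process and is not the Hadoop API; the function names are my own, and the `shuffle` step (grouping intermediate values by key) stands in for work that a real framework does across many servers.

```python
from collections import defaultdict


def map_step(document):
    """Map: turn one document into intermediate (word, 1) pairs."""
    return [(word, 1) for word in document.lower().split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_step(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    """Run map over every document, group by key, then reduce each group."""
    mapped = [pair for doc in documents for pair in map_step(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_step(key, values) for key, values in grouped.items())


counts = map_reduce(["big data big", "data everywhere"])
print(counts)
```

Since each call to `map_step` and each call to `reduce_step` depends only on its own input, both steps can be spread across as many machines as are available, which is what makes the approach suitable for very large datasets.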

Such techniques enable analysis of datasets of a size impossible before. This has enabled the development of statistics and analytical techniques that have been able to make correlations and find patterns for some of the Grand Challenge tasks noted before that simply could not be addressed within previous limits. The “big data” approach is providing a brute force alternative to previously intractable problems.

Why Such Progress?

Declining hardware costs and increasing performance (such as from Moore’s Law), combined with the adoption of the Internet + Web network, set the fertile conditions for these unprecedented advances in computing’s Grand Challenges. But the adaptive radiation in innovations now occurring has its own dynamics. In computing terms, we are seeing the equivalent of the Cambrian explosion in evolutionary history.

The dynamics driving this computing explosion are based largely, I believe, on the statistics of information retrieval and extraction needed to cope with the scale of documents on the Web. That, in turn, has impelled innovations in big data and distributed architectures and designs that have pried open previously closed computing lockboxes. As data from everywhere and from every provenance pours into the system, means for handling and interoperating with it have become imperatives. These forces, in turn, have been channeled and are being met through the open and standards-based approaches that helped lead to the development of the Internet and its infrastructure in the first place.

These powerful evolutionary forces in computing are clearly evident in the ten Grand Challenge advances above. But the challenges above are also silent on another factor, underpinning the interoperability initiatives, that is only now just becoming evident and exerting its own powerful force. That is the workable, intellectual foundations for interoperability itself.

Clearly, as the advances in the Grand Challenges show, we are seeing immense exposures of new structured information and impressive means for accessing and managing it on a global, distributed scale. Yet all of this data and this structure raises the question of how to get the information to work together. Further, the sources and viewpoints and methods by which all of this data has been created also put a huge premium on means to deal with the diversity. Though not evident, and perhaps not even known to many of the innovators and practitioners, there has been a growing intellectual force shaping our foundational views about the nature of things and their representations. This force has been, I believe, one of those root cause drivers helping to show the way to interoperability.

John Sowa, despite his unending criticism of the semantic Web in favor of common logic, has nonetheless been a very positive evangelist for the 19th century American logician and philosopher, Charles Sanders Peirce. Sowa points out that the entire 20th century largely neglected Peirce’s significant contributions in many areas and some philosophers appropriated Peircean insights without proper attribution [8]. Indeed, Peirce has only come to wider attention within the past decade or so. Much of his voluminous lifetime writings have still not yet been committed to publication.

Among many notable contributions, Peirce was passionate about signs and their triadic representations, in a field known as semiotics. The philosophical and logical basis of his triangle of signs deserves your attention, but cannot be adequately treated here [9]. However, as summarized by Sowa [8], “A semiotic view of language and logic gets to the heart of the philosophical controversies and their practical implications for linguistics, artificial intelligence, and related subjects.”

In essence, Peirce’s triadic logic of semiotics helps clarify philosophical questions about things, how they are perceived and how they are named that have vexed philosophers at least since the time of Aristotle. What Peirce was able to put forward was a testable logic for how things and the names of things can be understood and related to one another, via logical statements or structures. These, in turn, can be symbolized and formalized into logical constructs that can capture the structure of natural language as well as more structured data.

The clarity of Peirce’s logic of signs is an underlying factor, I believe, for why we are finally seeing our way clear to capturing, representing and relating information from a diversity of sources and viewpoints in ways that are defensible and interoperable [10]. As we plumb Peircean logics further, I believe we will continue to gain additional insights and methods for combining and relating information. The next phase of our advances on these Grand Challenges is likely to be fueled more by connections and interoperability than by basic extraction or representation.

The Widening Explosion

We are not seeing the vision of artificial intelligence unfold as posed three decades ago. Nor are we seeing the AI-complete type of problems being solved in their entirety [11]. Rather, we are seeing impressive but incomplete approaches. Full automation and autonomy are not yet at hand, and may be so far in the future as to never be. But we are nevertheless seeing advances across the board in all Grand Challenge areas.

What is emerging is a practical achievement of the Grand Challenges, the scale and scope of which is unprecedented in symbolic computing. As we see Peircean logic continue to take hold and interoperability grow in usefulness and stature, I think it fair to say we can look back in ten years to describe where we stand today as having been in the midst of an evolutionary explosion.

[1] Grand Challenges were United States policy objectives for high-performance computing and communications research set in the late 1980s. According to “A Research and Development Strategy for High Performance Computing”, Executive Office of the President, Office of Science and Technology Policy, 29 pp., November 20, 1987, “A grand challenge is a fundamental problem in science or engineering, with broad applications, whose solution would be enabled by the application of high performance computing resources that could become available in the near future.”
[2] For example, as of July 17, 2011, Google offered 63 different source or target languages for translation.
[3] Tim Berners-Lee, James Hendler and Ora Lassila, 2001. “The Semantic Web”. Scientific American Magazine; see
[4] Go to Sweet Tools, and enter the search ‘information extraction’ to see a list of about 85 tools.
[5] See, for example, Roberto Navigli, 2009. “Word Sense Disambiguation: A Survey,” ACM Computing Surveys, 41(2), 2009, pp. 1–69. See
[6] M.K. Bergman, 2006. “Climbing the Data Federation Pyramid,” AI3:::Adaptive Information blog, May 25, 2006; see
[7] M. K. Bergman, 2009. “Advantages and Myths of RDF,” AI3:::Adaptive Information blog, April 8, 2009. See
[8] John Sowa, 2006. “Peirce’s Contributions to the 21st Century”, in H. Schärfe, P. Hitzler, & P. Øhrstrøm, eds., Conceptual Structures: Inspiration and Application, LNAI 4068, Springer, Berlin, 2006, pp. 54-69. See
[9] See, as a start, the Wikipedia article on Charles Sanders Peirce (pronounced “purse”), as well as the Arisbe collection of his assembled papers (to date). Also see John Sowa, 2010. “The Role of Logic and Ontology in Language and Reasoning,” from Chapter 11 of Theory and Applications of Ontology: Philosophical Perspectives, edited by R. Poli & J. Seibt, Berlin: Springer, 2010, pp. 231-263. See Sowa also says, “Although formal logic can be studied independently of natural language semantics, no formal ontology that has any practical application can ever be developed and used without acknowledging its intimate connection with NL semantics.”
[10] While Peirce’s logic and clarity of conceptual relationships is compelling, I find reading his writings quite demanding.
[11] In the field of artificial intelligence, the most difficult problems are informally known as AI-complete or AI-hard, meaning that the difficulty of these computational problems is equivalent to solving the central artificial intelligence problem of making computers as intelligent as people. Computer vision, autonomous robots and understanding natural language are amongst challenges recognized by consensus as being AI-complete. However, practical advances on the Grand Challenges were never defined as needing to meet the AI-complete criterion. Indeed, it is even questionable whether such a hurdle is even worthwhile or meaningful on its own.

Posted by AI3's author, Mike Bergman Posted on July 18, 2011 at 10:00 pm in Adaptive Innovation, Semantic Web, Structured Web | Comments (3)
Posted: May 10, 2011

Deciphering Information Assets

Exposing $4.7 Trillion Annually in Undervalued Information

Something strange began to happen with company valuations beginning twenty to thirty years ago. Book values increasingly began to diverge — go lower — from stock prices or acquisition prices. Between 1982 and 1992 the ratio of book value to market value decreased from 62% to 38% for public US companies [1]. The why of this mystery has largely been solved, but what to do about it has not. Significantly, semantic technologies and approaches offer both a rationale and an imperative for how to get the enterprises’ books back in order. In the process, semantics may also provide a basis for more productive management and increased valuations for enterprises as well.

The mystery of diverging value resides in the importance of information in an information economy. Unlike the historical and traditional ways of measuring a company’s assets — based on the tangible factors of labor, capital, land and equipment — information is an intangible asset. As such, it is harder to see, understand and evaluate than other assets. Conventionally, and still the more common accounting practice, intangible assets are divided into goodwill, legal (intellectual property and trade secrets) and competitive (know-how) intangibles. But — given that intangibles now equal or exceed the value of tangible assets in advanced economies — we will focus instead on the information component of these assets.

As used herein, information is taken to be any data that is presented in a form useful to recipients (as contrasted to the more technical definition of Shannon and Weaver [2]). While it is true that there is always a question of whether the collection or development of information is a cost or represents an investment, that “information” is of growing importance and value to the enterprise is certain.

The importance of this information focus can be demonstrated by two telling facts, which I elaborate below. First, only five to seven percent of existing information is adequately used by most enterprises. And, second, the global value of this information amounts to perhaps a range of $2.0 trillion to $7.4 trillion annually (yes, trillions with a T)! It is frankly unbelievable that assets of such enormous magnitude are so poorly understood, exploited or managed.

Amongst all corporate resources and assets, information is surely the least understood and certainly the least managed. We value what we measure, and measure what we value. To say that we little measure information — its generation, its use (or lack thereof) or its value — means we are attempting to manage our enterprises with one eye closed and one arm tied behind our backs. Semantic approaches offer us one way, perhaps the best way, to bring understanding to this asset and then to leverage its value.

The Seven “Laws” of Information

More than a decade ago Moody and Walsh put forward a seminal paper on the seven “laws” of information [3]. Unlike other assets, information has some unique characteristics that make understanding its importance and valuing it much more difficult than other assets. Since I think it a shame that this excellent paper has received little attention and few citations, let me devote some space to covering these “laws”.

Like traditional factors of production — land, labor, capital — it is critical to understand the nature of this asset of “information”. As the laws below make clear, the nature of “information” is totally unique with respect to other factors of production. Note I have taken some liberty and done some updating on the wording and emphasis of the Moody and Walsh “laws” to accommodate recent learnings and understandings.

Law #1: Information Is (Infinitely) Shareable

Information is not friable and cannot be depleted. Using or consuming it has no direct effect on others using or consuming it, and using only portions of it does not undermine the whole. Using it does not cause a degradation or loss of function from its original state. Indeed, information is actually not “consumed” at all (in the conventional sense of the term); rather, it is “shared”. And, absent other barriers, information can be shared infinitely. Access to and use of information on the Web demonstrate this truth daily.

Thus, perhaps the most salient characteristic of information as an asset is that it can be shared between any number of people, business areas and organizations without loss of value to any party (absent the importance of confidentiality or secrecy, which is another information factor altogether). The sharability or maintenance of value irrespective of use makes information quite different to how other assets behave. There is no dilution from use. As Moody and Walsh put it, “from the firm’s perspective, value is therefore cumulative rather than apportioned across different users.”

In practice, however, this very uniqueness is also a cause of other organizational challenges. Both personal and institutional barriers get erected to limit sharing since “knowledge is power.” One perverse effect of information hoarding or lack of institutional support for sharing is to force the development anew of similar information. When not shared, existing information becomes a cost, and one that can get duplicated many times over.

Law #2: The Value of Information Increases With Use

Most resources degrade with use, such as equipment wearing out. In contrast, the per unit value of information increases with use. The major cost of information is in its capture, storage and maintenance. The actual variable cost of using the information (particularly digital information) is, in essence, zero. Thus, with greater use, the per unit cost of information drops.
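The arithmetic behind this is simple enough to sketch. The Python below assumes a fixed cost for capture, storage and maintenance and a marginal cost per use that defaults to zero; the function name and the dollar figure are hypothetical, chosen only to show the shape of the curve.

```python
def cost_per_use(fixed_cost, uses, marginal_cost=0.0):
    """Average cost of each use when a fixed cost is spread over all uses.

    fixed_cost covers capture, storage and maintenance; marginal_cost is the
    variable cost of one more use (assumed to be zero for digital information).
    """
    if uses < 1:
        raise ValueError("uses must be at least 1")
    return fixed_cost / uses + marginal_cost


# Hypothetical figure: $10,000 to capture and maintain a data set.
for uses in (1, 10, 100, 1000):
    print(uses, cost_per_use(10_000, uses))
```

With a zero marginal cost, each tenfold increase in use cuts the per unit cost tenfold, which is the sense in which the value of information rises with use.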

There is a corollary to this that also goes to the heart of the question of information as an asset. From an accounting point of view, something can only be an asset if it provides future economic value. If information is not used, it cannot possibly result in such benefits and is therefore not an asset. Unused information is really a liability, because no value is extracted from it. In such cases the costs of capture, storage and maintenance are incurred, but with no realized value. Without use, information is solely a cost to the enterprise.

The additional corollary is that awareness of the information’s existence is an essential requirement in order to obtain this value. As Moody and Walsh state, “information is at its highest ‘potential’ when everyone in the organization knows where it is, has access to it and knows how to use it. Information is at its lowest ‘potential’ when people don’t even know it is there.”

A still further corollary is the importance of information literacy. Awareness of information without an understanding of where it fits or how to take advantage of it also means its value is hidden to potential users. Thus, in addition to awareness, training and documentation are important factors to help ensure adequate use. Both of these factors may seem like additional costs to the enterprise beyond capture, storage and maintenance, but without them little or no value will be leveraged and the information will remain a sunk cost.

Law #3: Information is Perishable

Like most other assets, the value of information tends to depreciate over time [4]. Some information has a short shelf life (such as Web visitations); other information has a long shelf life (patents, contracts and many trade secrets). Proper valuation of information should take into account such differences in operational life, analysis or decision life, and statutory life. Operational shelf life tends to be the shortest.

In these regards, information is not too dissimilar from other asset types. The most important point is to be cognizant of use and shelf differences amongst different kinds of information. This consideration is also traded off against the declining costs of digital information storage.

Law #4: The Value of Information Increases With Accuracy

A standard dictum is that the value of information increases with accuracy. The caveat, however, is that some information, because it is not operationally dependent or critical to the strategic interests of the firm, actually can become a cost when capture costs exceed value. Understanding such Pareto principles is an important criterion in evaluating information approaches. Generally, information closest to the transactional or business purpose of the organization will demand higher accuracy.

Such statements may sound like platitudes — and are — in the absence of an understanding of information dependencies within the firm. But, when certain kinds of information are critical to the enterprise, it is just as important to know the accuracy of the information feeding that “engine” as it is for oil changes or maintenance schedules for physical engines. Thus an understanding of accuracy requirements in information should be a deliberate management focus for critical business functions. It is the rare firm that attends to such imperatives today.

Law #5: The Value of Information Increases in Combination

A unique contribution from semantic approaches, and perhaps the one resulting in the highest valuation benefit, arises from the increased value of connecting the information. We have come to understand this intimately as the “network effect” from interconnected nodes on a network. The same effect arises when existing information is connected.

Today’s enterprise information environment is often described by many as unconnected “silos”. Scattered databases and spreadsheets and other information repositories litter the information landscape. Not only are these sources unconnected and isolated, but they also may describe similar information in different and inconsistent ways.

As I have described previously in The Law of Linked Data [5], existing information can act as nodes that — once connected to one another — tend to produce a similar network effect to what physical networks exhibit with increasing numbers of users. Of course, the nature of the information that is being connected and its centrality to the mission of the enterprise will greatly affect the value of new connections. But, based on evidence to date, the value of information appears to go up somewhere between a quadratic and exponential function for the number of new connections. This value is especially evident in know-how and competitive areas.

Law #6: More Is Not Necessarily Better

Information overload is a well-known problem. On the other hand, lack of appropriate information is also a compelling problem. The question of information is thus one of relevancy. Too much irrelevant information is a bad thing, as is too little relevant information.

These observations lead to two use considerations. First, means to understand and focus information capture on relevant information are critical. And, second, information management systems should be purposefully designed with user interfaces for easy filtering of irrelevant information.

The latter point sounds straightforward but, in actuality, requires a semantic underpinning to the enterprise’s information assets. This requirement arises because relevancy is in the eye of the beholder, and different users have different terms, perspectives, and world views by which information evaluation occurs. For filtering to be useful, information must be presented in terms and perspectives relevant to those users. Since multiple studies affirm that information decision-makers seek more information beyond their overload points [3], it is thus more useful to provide relevant access and filtering methods that can be tailored by user rather than “top down” information restrictions.

Law #7: Information is Self-propagating

With access and connections, information tends to beget more information. This propagation results from summations, analysis, unique combinations and other ways that basic data get recombined into new data. Thus, while the first law noted that information cannot be consumed (or depleted) by virtue of its use, we can also say that information tends to reproduce and expand itself via use and inspection.

Indeed, knowledge itself is the result of how information in its native state can be combined and re-organized to derive new insights. From a valuation standpoint, it is this very understanding that leads to such things as competitive intelligence or new know-how. In combination with insights from connections, this propagating factor of information is the other leading source of intangible asset valuations.

This law also points to the fact that information per se is not a scarce resource. (Though its availability may be scarce.) Once available, techniques like data mining, analysis, visualization and so forth can be rich sources for generating new information from existing holdings of data.

Information as an Asset and How to Value

These “laws” — or perspectives — help to frame the imperatives for how to judge information as an asset and its resulting value. The methodological and conceptual issues of how to explicitly account for information on a company’s books are, of course, matters best left to economists and professional accountants. With the growing share of information in relation to intangible assets, this would appear to be a matter of great importance to national policy. Accounting for R&D efforts as an asset versus a cost, for example, has been estimated to add on the order of 11 percent to US national GDP estimates [9].

The mere generation of information is not necessarily an asset, as the “laws” above indicate. Some of the information has no value and some indeed represents a net sunk cost. What we can say, however, is that valuable information that is created by the enterprise but remains unused or is duplicated means that what was potentially an asset has now been turned into a cost — sometimes a cost repeated many-fold.

Information that is used is an asset, intangible or not. Here, depending on the nature of the information and its use, it can be valued on the basis of cost (historical cost or what it cost to develop it), market value (what others will pay for it), or utility (what is its present value as benefits accrue into the future). Traditionally the historical cost method has been applied to information. Yet, since information can both be sold and still retained by the organization, it may have both market value and utility value, with its total value being the sum.

In looking at these factors, Moody and Walsh propose a number of new guidelines in keeping with the “laws” noted above [3]:

  • Operational information should be measured as the cost of collection using data entry costs
  • Management information should be valued based on what it cost to extract the data from operational systems
  • Redundant data should be considered to have zero value (Law #1)
  • Unused data should be considered to have zero value (Law #2)
  • The number of users and number of accesses to the data should be used to multiply the value of the information (Law #2). When information is used for the first time, it should be valued at its cost of collection; subsequent uses should add to this value (perhaps on a depreciated basis; see below)
  • The value of information should be depreciated based on its “shelf life” (Law #3)
  • The value of information should be discounted by its accuracy relative to what is considered to be acceptable (Law #4)
  • And, as an added factor, information that is effectively linked or combined should have its value multiplied (Law #5), though the actual multiplier may be uncertain [5].
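As a rough illustration, the guidelines above can be sketched as a small valuation function. This is only a minimal sketch under stated assumptions, not Moody and Walsh's own formula: the `InfoAsset` fields, the `reuse_increment` fraction added per subsequent use, the straight-line depreciation over shelf life, and the capped accuracy ratio are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class InfoAsset:
    """One information holding; every field is a hypothetical input."""
    collection_cost: float        # cost to collect (or extract) the data
    uses: int                     # number of recorded uses (Law #2)
    age_years: float              # time since capture
    shelf_life_years: float       # useful life of this kind of information (Law #3)
    accuracy: float               # measured accuracy, 0.0 to 1.0 (Law #4)
    required_accuracy: float      # accuracy considered acceptable, 0.0 to 1.0
    redundant: bool = False       # duplicate of another holding (Law #1)
    link_multiplier: float = 1.0  # uplift for linked information (Law #5); uncertain


def estimate_value(asset: InfoAsset, reuse_increment: float = 0.5) -> float:
    """Estimate value using the guideline ordering described in the text.

    Assumptions (not from the source): each use after the first adds
    `reuse_increment` times the collection cost, depreciation is
    straight-line, and the accuracy discount is a ratio capped at 1.0.
    """
    if asset.shelf_life_years <= 0 or asset.required_accuracy <= 0:
        raise ValueError("shelf life and required accuracy must be positive")

    # Redundant or unused data is assigned zero value.
    if asset.redundant or asset.uses == 0:
        return 0.0

    # First use is valued at cost of collection; later uses add to it.
    base = asset.collection_cost * (1 + reuse_increment * (asset.uses - 1))

    # Depreciate over the information's shelf life.
    remaining = max(0.0, 1.0 - asset.age_years / asset.shelf_life_years)

    # Discount by accuracy relative to what is considered acceptable.
    accuracy_factor = min(1.0, asset.accuracy / asset.required_accuracy)

    # Multiply for information that is effectively linked or combined.
    return base * remaining * accuracy_factor * asset.link_multiplier


example = InfoAsset(
    collection_cost=1000.0, uses=3, age_years=1.0, shelf_life_years=4.0,
    accuracy=0.9, required_accuracy=0.9,
)
example_value = estimate_value(example)
print(f"Estimated value: {example_value:.2f}")
```

The point of the sketch is the ordering of the adjustments rather than the particular numbers: redundancy and non-use zero out the value first, and the depreciation, accuracy and linkage factors then scale whatever value remains.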

The net result of thinking about information in this more purposeful way is to encourage more accurate valuation methods, and to provide incentives for more use and re-use, particularly in combined ways. Such methods can also help distinguish what information is of more value to the organization, and therefore worthy of more attention and investment.

The Growing Importance of Intangible Information

The emerging discrepancy between market capitalizations and book values began to get concerted academic attention in the 1990s. To be sure, perceptions by the market and of future earnings potential can always color these differences. The simple occurrence of a discrepancy is not itself proof of erroneous or inaccurate valuations. (And, the corollary is that the degree of the discrepancy is not sufficient alone to estimate the intangible asset “gap”, a logical error made by many proponents.) But, the fact that these discrepancies had been growing and appeared to be based (in part) on structural changes linked to intangibles was attracting attention.

Leonard Nakamura, an economist with the Federal Reserve Bank of Philadelphia, published a working paper in 2001 entitled “What is the U.S. Gross Investment in Intangibles? (At Least) One Trillion Dollars a Year!” [6]. This was one of the first attempts to measure intangible investments, which he defined as private expenditures on assets that are intangible and necessary to the creation and sale of new or improved products and processes, including designs, software, blueprints, ideas, artistic expressions, recipes, and the like. Nakamura acknowledged his work as being preliminary. But he did find direct and indirect empirical evidence to show that US private firms were investing at least $1 trillion annually (as of 2000, the basis year for the data) in intangible assets. Private expenditures, labor and corporate operating margins were the three measurement methods. The study also suggested that the capital stock of intangibles in the US has an equilibrium market value of at least $5 trillion.

Another key group (Carol Corrado, Charles Hulten, and Daniel Sichel, known as “CHS” across their many studies) also began to systematically evaluate the extent and basis of intangible assets and their discrepancy [7]. They estimated that spending on long-lasting knowledge capital, not just intangibles broadly, grew relative to other major components of aggregate demand during the 1990s. CHS were the first to show that, by the turn of the millennium, fixed US investment in intangibles was at least as large as business investment in traditional, tangible capital.

By later in the decade, Nakamura was able to gather and analyze time series data that showed the steady increase in the contributions of intangibles [8]:

One can see the cross-over point late in the decade. He now estimates investment in intangibles to be on the order of 8% to 10% of GDP annually in the US.

At roughly the same time, the National Academies in the US were commissioned to investigate the policy questions of intangible assets. The resulting major study [9] contains much relevant information. It also contains an update by CHS on their slightly different approach to analyzing the growing role of intangible assets:

This CHS analysis shows similar trends to what Nakamura found, though the degree of intangible contributions is estimated as higher (~14% of annual GDP today), with investments in intangibles exceeding tangible assets somewhat earlier.

Surveys of more than 5,000 companies in 25 countries confirmed these trends from a different perspective, and also showed that most of these assets did not get reflected in financial statements. A large portion of this value was due to “brands” and other market intangibles [10]. The total “undisclosed” portion appeared to equal or exceed total reported assets. Figures for the US indicated there might be a cumulative basis of intangible assets of $9.2 trillion [11].

In parallel, these groups and others began to decompose the intangible asset growth by country, sector, or asset type. The specific component of “information” received a great deal of attention. Uday Apte, Uday Karmarkar and Hiranya Nath, in particular, conducted a couple of important studies during this decade [12,13]. For example, they found nearly two-thirds of recent US GDP was due to information or knowledge industry contributions, a percentage that had been growing over time. They also found that a secondary sector of information internal to firms itself constituted well over 40% of the information economy, or some 28% of the entire economy. So the information activities internal to organizations and institutions represent a very large part of the economy.

The specific components that can constitute the informational portion of intangible assets have also been looked at by many investigators, importantly including key accounting groups. FASB, for example, has specific guidance on treatment of intangible assets in SFAS 141 [14]. Two-thirds of the 90 specific intangible items listed by the American Institute of Certified Public Accountants are directly related to information (as opposed to contracts, brands or goodwill), as shown in [15]. There has also been some good analysis by CHS on breakdowns by intangible asset categories [16]. There are also considerable differences by country on various aspects of these measures (for example, [10]). According to OECD figures from 2002, expenditures for knowledge (R&D, education and software) ranged from nearly 7 percent (Sweden) to below 2 percent (Greece) in OECD countries, with an average of about 4 percent and the US at over 6 percent [17].

. . . Plus Too Much Information Goes Unused

The common view is that a typical organization only uses 5 to 7 percent of the information it already has on hand [18], and 20% to 25% of a knowledge worker’s time is spent simply trying to find information [19]. To probe these issues more deeply, I began a series of analyses in 2004 looking at how much money was being spent on preparing documents within US companies, and how much of that investment was being wasted or not re-used [20]. One key finding from that study was that the information within documents in the US represents about a third of total gross domestic product, or an amount equal at the time of the study to about $3.3 trillion annually (in 2010 figures, that would be closer to $4.7 trillion). This level of investment is consistent with the results of Apte et al. and others as noted above.

However, for various reasons, mostly lack of awareness and re-use, some 25% of those trillions of dollars spent annually on document creation are wasted. If we could just find the information and re-use it, massive benefits could accrue, as these breakdowns in key areas show:

U.S. FIRMS | $ Million | %
Cost to Create Documents | $3,261,091 |
Benefits to Finding Missed or Overlooked Documents | $489,164 | 63%
Benefits to Improved Document Access | $81,360 | 10%
Benefits of Re-finding Web Documents | $32,967 | 4%
Benefits of Proposal Preparation and Wins | $6,798 | 1%
Benefits of Paperwork Requirements and Compliance | $119,868 | 15%
Benefits of Reducing Unauthorized Disclosures | $51,187 | 7%
Total Annual Benefits | $781,314 | 100%

1,000 LARGEST U.S. FIRMS (per firm) | $ Million
Cost to Create Documents | $955.6
Benefits to Finding Missed or Overlooked Documents | $143.3
Benefits to Improving Document Access | $23.8
Benefits of Re-finding Web Documents | $9.7
Benefits of Proposal Preparation and Wins | $2.0
Benefits of Paperwork Requirements and Compliance | $35.1
Benefits of Reducing Unauthorized Disclosures | $15.0
Total Annual Benefits | $229.0

Table. Mid-range Estimates for the Annual Value of Documents, U.S. Firms, 2002 [20]

The total benefit from improved document access and use to the U.S. economy is on the order of 8% of GDP. For the 1,000 largest U.S. firms, benefits from these improvements can approach nearly $250 million annually per firm (2002 basis). About three-quarters of these benefits arise from not re-creating the intellectual capital already invested in prior document creation. About one-quarter of the benefits are due to reduced regulatory non-compliance or paperwork, or better competitiveness in obtaining solicited grants and contracts.

This overall value of document use and creation is quite in line with the analyses of intangible assets noted above, and which arose from totally different analytical bases and data. This triangulation brings confidence that true trends in the growing importance of information have been identified.

How Big is the Information Asset Gap?

These various estimates can now be combined to provide an assessment of just how large the “gap” is for the overlooked accounting and use of information assets:

Region | GDP ($T) | Intangible % (Lo) | Intangible % (Hi) | Info Contrib % (Lo) | Info Contrib % (Hi) | Info Assets ($T) (Lo) | Info Assets ($T) (Hi) | Unused Info ($T) (Lo) | Unused Info ($T) (Hi) | Total ($T) (Lo) | Total ($T) (Hi)
US | $14.72 | 9% | 14% | 33% | 67% | $0.44 | $1.38 | $0.30 | $1.21 | $0.74 | $2.60
European Union | $15.25 | 8% | 12% | 33% | 50% | $0.40 | $0.92 | $0.31 | $1.26 | $0.72 | $2.17
Remaining Advanced | $10.17 | 8% | 12% | 33% | 50% | $0.27 | $0.61 | $0.21 | $0.84 | $0.48 | $1.45
Rest of World | $34.32 | 2% | 6% | 10% | 25% | $0.07 | $0.51 | $0.00 | $0.71 | $0.07 | $1.22
Total | $74.46 | | | | | $1.18 | $3.42 | $0.83 | $4.02 | $2.00 | $7.44
Notes (see endnotes): [21] [22] [23]

Depending on one’s perspective, these estimates can be viewed either as too optimistic about the importance of information assets [25] or as too conservative [26]. The breadth of the ranges of these values is itself an expression of the uncertainty in the numbers and the analysis.

The analysis shows that, globally, the value of unused and unaccounted information assets may be on the order of $2.0 trillion to $7.4 trillion annually, with a mid-range value of $4.7 trillion. Even considering uncertainties, these are huge, huge numbers by any account. For the US alone, this range is $750 billion to $2.6 trillion annually. The analysis from the prior studies [20] would strongly suggest the higher end of this range is more likely than the lower. Similarly large gaps likely occur within the European Union and within other advanced nations. For individual firms, depending on size, the benefits of understanding and closing these gaps can readily be measured in the millions to billions of dollars [27].
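For readers who want to check the arithmetic, the information-asset columns of the gap table can be approximately reproduced by multiplying GDP by the intangible share and then by the information share. The sketch below is a reconstruction under assumptions rather than the original worksheet: the dictionary layout and function names are hypothetical, and the unused-information calculation follows the wording of endnote [24] for the US only, using 0.33 as the working value for “one-third”.

```python
# Inputs copied from the gap table: GDP in $ trillions, then the low and high
# intangible shares of GDP, then the low and high information shares of intangibles.
REGIONS = {
    "US": (14.72, 0.09, 0.14, 0.33, 0.67),
    "European Union": (15.25, 0.08, 0.12, 0.33, 0.50),
    "Remaining Advanced": (10.17, 0.08, 0.12, 0.33, 0.50),
    "Rest of World": (34.32, 0.02, 0.06, 0.10, 0.25),
}


def info_asset_range(region):
    """Return (low, high) unaccounted information assets in $ trillions."""
    gdp, intangible_lo, intangible_hi, info_lo, info_hi = REGIONS[region]
    return gdp * intangible_lo * info_lo, gdp * intangible_hi * info_hi


def us_unused_range(gdp, doc_share=0.33, waste_share=0.25):
    """Return (low, high) unused information in $ trillions, per endnote [24].

    The high value is the document share of GDP times the wasted share;
    the low value halves each of those figures. Using 0.33 for
    "one-third" is an assumption made to match the table's rounding.
    """
    high = gdp * doc_share * waste_share
    low = gdp * (doc_share / 2) * (waste_share / 2)
    return low, high


for name in REGIONS:
    lo, hi = info_asset_range(name)
    print(f"{name:20s} info assets: ${lo:.2f}T to ${hi:.2f}T")

total_lo = sum(info_asset_range(name)[0] for name in REGIONS)
total_hi = sum(info_asset_range(name)[1] for name in REGIONS)
print(f"{'Total':20s} info assets: ${total_lo:.2f}T to ${total_hi:.2f}T")

unused_lo, unused_hi = us_unused_range(REGIONS["US"][0])
print(f"US unused information: ${unused_lo:.2f}T to ${unused_hi:.2f}T")

# Mid-range of the published global totals ($2.00T low, $7.44T high).
mid_range = (2.00 + 7.44) / 2
print(f"Global mid-range: ${mid_range:.2f}T")
```

Because the table's figures are rounded to two decimals, recomputed values can differ from the published ones in the last digit.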

At the high end, these estimates suggest that perhaps as much as 10% of global expenditures is wasted and unaccounted for due to information-related activities. This is roughly equivalent to adding a half of the US economy to the global picture.

In the concluding section, we touch on why such huge holes may appear in the world’s financial books. Clearly, though, even with uncertain and heroic assumptions, the magnitude of this gap is huge, with compelling needs to understand and close it as soon as possible.

The Relationship to Semantic Technologies

The seven Moody and Walsh information “laws” provide the clues to the reasons why we are not properly accounting for information and why we inadequately use it:

  • We don’t know what information we have and cannot find it
  • What we have we don’t connect
  • We misallocate resources for generating, capturing and storing information, because we don’t understand its value and potential
  • We don’t manage the use of information or its re-use
  • We duplicate efforts
  • We inadequately leverage what information we have and miss valuable (that is, can be “valuated”) insights that could be gained.

Fundamentally, because information is not understood in our bones as central to the well-being of our enterprises, we continue to view the generation, capture and maintenance of information as a “cost” and not an “asset”.

I have maintained for some time an interactive information timeline [28] that attempts to encompass the entire human history of information innovations. For tens of thousands of years steady — yet slow — progress in the ways to express and manage information can be seen in this timeline. But, then, beginning with electricity and then digitization, the pace of innovation explodes.

The same timeframe that sees the importance of intangible assets appear on national and firm accounts is when we see the full digitization of information and its ability to be communicated and linked over digital networks. A very insightful figure by Rama Hoetzlein for his thesis in 2007, which I have modified and enhanced, captures this evolution with some estimated dates, as shown below [29]:

The first insight this figure provides is that all forms of information are now available in digital form. This includes unstructured (images and documents), semi-structured (mark-up and “tagged” information) and structured (database and spreadsheet) information. This information can now be stored and communicated over digital networks with broadly accepted protocols.

But the most salient insight is that we now have the means through semantic technologies and approaches to interrelate all of this information together. Tagging and extraction methods enable us to generate metadata for unstructured documents and content. Data models based on predicate logic and semantic logics give us the flexible means to express the relationships and connections between information. And all of this can be stored and manipulated through graph-based datastores and languages such that we can draw inferences and gain insights. Plus, since all of this is now accessible via the Web and browsers, virtually any user can access, use and leverage this information.
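To make the idea of interrelating information concrete, here is a toy sketch of a graph-style store built from subject-predicate-object statements. It is a minimal illustration under assumptions, not a real semantic stack: the `TripleStore` class, the `same_as` predicate and the `crm:` and `billing:` identifiers are all hypothetical, and a production system would rely on standard data models and query languages instead.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class TripleStore:
    """A toy in-memory graph of subject-predicate-object statements."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, s: str, p: str, o: str) -> None:
        self._triples.add((s, p, o))

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> List[Triple]:
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )

    def equivalents(self, entity: str) -> Set[str]:
        """Follow same_as links in both directions (a very simple inference)."""
        seen = {entity}
        frontier = [entity]
        while frontier:
            current = frontier.pop()
            linked = [o for _, _, o in self.match(s=current, p="same_as")]
            linked += [s for s, _, _ in self.match(p="same_as", o=current)]
            for other in linked:
                if other not in seen:
                    seen.add(other)
                    frontier.append(other)
        return seen

    def describe(self, entity: str) -> List[Tuple[str, str]]:
        """Collect facts about an entity and everything linked to it."""
        facts = set()
        for e in self.equivalents(entity):
            for _, p, o in self.match(s=e):
                if p != "same_as":
                    facts.add((p, o))
        return sorted(facts)


store = TripleStore()
# Two hypothetical silos describe the same customer in different terms.
store.add("crm:cust42", "name", "Acme Corp")
store.add("billing:acct-9", "holder", "ACME Corporation")
store.add("billing:acct-9", "balance_due", "1200")
# One connecting statement lets queries span both silos.
store.add("crm:cust42", "same_as", "billing:acct-9")

combined = store.describe("crm:cust42")
for predicate, value in combined:
    print(predicate, "=", value)
```

The single `same_as` statement is what turns two isolated records into one connected description, which is the kind of connection value that Law #5 refers to.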

This figure and its dates not only show how far we have come as a species in our use and sophistication with information, but also how we need to bring it all together using semantics to complete our transition to a knowledge economy.

The very same metadata and semantic tagging capabilities that enable us to interrelate the information itself also provide the techniques by which we can monitor and track usage and provenance. It is through these additional semantic methods that we can finally begin to gain insight as to what information is of what value and to whom. Tapping this information will complete the circle for how we can also begin to properly valuate and then manage and optimize our information assets.


With our transition to an information economy, we now see that intangible assets exceed the value of tangible ones. We see that the information component represents one-third to two-thirds of these intangibles. In other words, information makes up from 17% to more than one-third of an individual firm’s value in modern economies. Further, we see that at least 25% of firm expenditures on information is wasted, keeping it as a cost and negating its value as an asset.

The “factories” of the modern information economy no longer produce pins with the fixed inputs of labor and capital as in the time of Adam Smith. They rather produce information and knowledge and know-how. Yet our management and accounting systems seem fixed in the techniques of yesteryear. The quaint idea of total factor productivity as a “residual” merely betrays our ignorance about the causes of economic growth and firm value. These are issues that should rightly occupy the attention of practitioners in the disciplines of accounting and management.

Why industrial-era accounting methods have been maintained in the present information age is for students of corporate power politics to debate. It should suffice to remind us that when industrialization induced a shift from the extraction of funds from feudal land possessions to earning profits on invested capital, most of the assumptions about how to measure performance had to change. When the expenses for acquiring information capabilities cease to be an arbitrary budget allocation and become the means for gaining Knowledge Capital, much of what is presently accepted as management of information will have to shift from a largely technological view of efficiency to an asset management perspective [30].

Accounting methods grounded in the early 1800s that are premised on only capital assets as the means to increase the productivity of labor no longer work. Our engines of innovation are not physical devices, but ideas, innovation and knowledge; in short, information. Capable executives recognize these trends, but have yet to change management practices to address them [31].

As managers and executives of firms we need not await wholesale modernization of accounting practices to begin to make a difference. The first step is to understand the role, use and importance of information to our organizations. Looking clearly at the seven information “laws” and what that means about tracking and monitoring is an immediate way to take this step. The second step is to understand and evaluate seriously the prospects for semantic approaches to make a difference today.

We have now sufficiently climbed the data federation pyramid [32] to where all of our information assets are digital; we have network protocols to link them; we have natural language and extraction techniques for making documents first-class citizens alongside structured data; and we have logical data models and sound semantic technologies for tying it all together.

We need to reorganize our “factory” floors around these principles, just as prime movers and unit electric drives altered our factories of the past. We need to reorganize and re-think our work processes and what we measure and value to compete in the 21st century. It is time to treat information as seriously as it has become an integral part of our enterprises. Semantic technologies and approaches provide just the path to do so.

[1] Baruch Lev and Jürgen H. Daum, 2003. “Intangible Assets and the Need for a Holistic and More Future-oriented Approach to Enterprise Management and Corporate Reporting,” prepared for the 2003 PMA Intellectual Capital Symposium, 2nd October 2003, Cranfield Management Development Centre, Cranfield University, UK.
[2] Claude E. Shannon and Warren Weaver, 1949. The Mathematical Theory of Communication. The University of Illinois Press, Urbana, Illinois, 1949. ISBN 0-252-72548-4.
[3] Daniel Moody and Peter Walsh, 1999. “Measuring The Value Of Information: An Asset Valuation Approach,” paper presented at the Seventh European Conference on Information Systems (ECIS’99), Copenhagen Business School, Frederiksberg, Denmark, 23-25 June, 1999. A precursor paper that is also quite helpful and cited much in Moody and Walsh is R. Glazer, 1993. “Measuring the Value of Information: The Information Intensive Organisation”, IBM Systems Journal, Vol 32, No 1, 1993.
[4] Some trade secrets could buck this trend if the value of the underlying enterprise that relies on them increases.
[5] M.K. Bergman, 2009. “The Law of Linked Data,” post in AI3:::Adaptive Information blog, October 11, 2009.
[6] Leonard Nakamura, 2001. What is the U.S. Gross Investment in Intangibles? (At Least) One Trillion Dollars a Year!, Working Paper No. 01-15, Federal Reserve Bank of Philadelphia, October 2001.
[7] Carol A. Corrado, Charles R. Hulten, and Daniel E. Sichel, 2004. Measuring Capital and Technology: An Expanded Framework. Federal Reserve Board, August 2004.
[8] Leonard I. Nakamura, 2009. Intangible Assets and National Income Accounting: Measuring a Scientific Revolution, Working Paper No. 09-11, Federal Reserve Bank of Philadelphia, May 8, 2009.
[9] Christopher Mackie, Rapporteur, 2009. Intangible Assets: Measuring and Enhancing Their Contribution to Corporate Value and Economic Growth: Summary of a Workshop, prepared by the Board on Science, Technology, and Economic Policy (STEP) Committee on National Statistics (CNSTAT), ISBN: 0-309-14415-9, 124 pages (available for PDF download with sign-in).
[10] Brand Finance, 2006. Global Intangible Tracker 2006: An Annual Review of the World’s Intangible Value, paper published by Brand Finance and The Institute of Practitioners in Advertising, London, UK, December 2006.
[11] Kenan Patrick Jarboe and Roland Furrow, 2008. Intangible Asset Monetization: The Promise and the Reality, Working Paper #03 from the Athena Alliance, April 2008.
[12] Uday M. Apte and Hiranya K. Nath, 2004. “Size, Structure and Growth of the US Information Economy,” UCLA Anderson School of Management on Business and Information Technologies, December 2004.
[13] Uday M. Apte, Uday S. Karmarkar and Hiranya K. Nath, 2008. “Information Services in the US Economy: Value, Jobs and Management Implications,” California Management Review, Vol. 50, No.3, 12-30, 2008.
[14] See the Financial Accounting Standards Board, SFAS 141.

[15] See further, AICPA Special Committee on Financial Reporting, 1994. Improving Business Reporting—A Customer Focus: Meeting the Information Needs of Investors and Creditors.

  • Blueprints
  • Book libraries
  • Broadcast licenses
  • Buy-sell agreements
  • Certificates of need
  • Chemical formulas
  • Computer software
  • Computerized databases
  • Cooperative agreements
  • Credit information files
  • Customer contracts
  • Customer and client lists
  • Customer relationships
  • Designs and drawings
  • Development rights
  • Employment contracts
  • Engineering drawings
  • Environmental rights
  • Film libraries
  • Food flavorings and recipes
  • Franchise agreements
  • Historical documents
  • Health maintenance organization enrollment lists
  • Laboratory notebooks
  • Literary works
  • Management contracts
  • Manual databases
  • Manuscripts
  • Medical charts and records
  • Musical compositions
  • Newspaper morgue files
  • Noncompete covenants
  • Patent applications
  • Patents (both product and process)
  • Prescription drug files
  • Prizes and awards
  • Procedural manuals
  • Product designs
  • Proposals outstanding
  • Proprietary computer software
  • Proprietary processes
  • Proprietary products
  • Proprietary technology
  • Royalty agreements
  • Schematics and diagrams
  • Shareholder agreements
  • Solicitation rights
  • Subscription lists
  • Supplier contracts
  • Technical and specialty libraries
  • Technical documentation
  • Technology-sharing agreements
  • Trade secrets
  • Trained and assembled workforce
  • Training manuals

[16] See, for example, Carol Corrado, Charles Hulten and Daniel Sichel, 2009. “Intangible Capital and U.S. Economic Growth,” Review of Income and Wealth Series 55, Number 3, September 2009.
[17] As stated in Kenan Patrick Jarboe, 2007. Measuring Intangibles: A Summary of Recent Activity, Working Paper #02 from the Athena Alliance, April 2007.
[18] The 5% estimate comes from Graham G. Rong, Chair at MIT Sloan CIO Symposium, as reported on May 5, 2011. (Rong also touted the use of semantic technologies to overcome this lack of use.) A similar 7% estimate comes from Pushpak Sarkar, 2002. “Information Quality in the Knowledge-Driven Enterprise,” InfoManagement Direct, November 2002.
[19] M.K. Bergman, 2005. “Search and the ’25% Solution’,” AI3:::Adaptive Information blog, September 14, 2005.
[20] M.K. Bergman, 2005. “Untapped Assets: the $3 Trillion Value of U.S. Documents,” BrightPlanet Corporation White Paper, July 2005, 42 pp. Also available online and in PDF.
[21] From the CIA, 2011. The World Factbook; accessed online on May 9, 2011. The “remaining advanced” countries are Australia, Canada, Iceland, Israel, Japan, Liechtenstein, Monaco, New Zealand, Norway, Puerto Rico, Singapore, South Korea, Switzerland, Taiwan.
[22] The range of estimates is drawn from the Nakamura [8] and CHS [9] studies, with each respectively providing the lower and upper bounds. These values have been slightly decremented for non-US advanced countries, and greatly reduced for non-advanced ones.
[23] The high range is based on the categorical share of intangible asset categories (60 of 90) from the AICPA work [15]; the lower range is from the one-third of GDP estimates from [20]. These values have been slightly decremented for non-US advanced countries, and greatly reduced for non-advanced ones.
[24] For unused information assets, the high range is based on the one-third of GDP and 25% “waste” estimates from [20]; the low range halves each of those figures. These values have been slightly decremented for non-US advanced countries, and greatly reduced for non-advanced ones (and zero for the low range).
[25] Reasons for the estimates to be too optimistic are information as important as goodwill; branding; intellectual basis of cited resources is indeed real; considerable differences by country and sector (see [10] and [16]).
[26] Reasons for the estimates to be too conservative: no network effects; greatly discounted non-advanced countries; share is growing (but older estimates used); considerable differences by country and sector (see [10] and [16]).
[27] For some discussion of individual firm impacts and use cases see [10] and [20], among others.
[28] See the Timeline of Information History, and its supporting documentation at M.K. Bergman, 2008. “Announcing the ‘Innovations in Information’ Timeline,” AI3:::Adaptive Information blog, July 6, 2008.
[29] This figure is a modification of the original published by Rama C. Hoetzlein, 2007. Quanta – The Organization of Human Knowledge: Systems for Interdisciplinary Research, Master’s Thesis, University of California Santa Barbara, June 2007 (p 112). I adapted this figure to add logics, data and metadata to the basic approach, with color coding also added.
[30] From Paul A. Strassmann, 1998. “The Value of Knowledge Capital,” American Programmer, March 1998.
[31] For example, according to [11], in a 2003 Accenture survey of senior managers across industries, 49 percent of respondents said that intangible assets are their primary focus for delivering long-term shareholder value, but only 5 percent stated that they had an organized system to track the performance of these assets. Also, according to sources cited in Gio Wiederhold, Shirley Tessler, Amar Gupta and David Branson Smith, 2009. “The Valuation of Technology-Based Intellectual Property in Offshoring Decisions,” in Communications for the Association of Information Systems (CAIS) 24, May 2009, owners and stockholders acknowledge that IP valuation of technological assets is not routine within many organizations. A 2007 study performed by Micro Focus and INSEAD highlights the current state of affairs: Of the 250 chief information officers (CIOs) and chief finance officers (CFOs) surveyed from companies in the U.S., UK, France, Germany, and Italy, less than 50 percent had attempted to value their IT assets, and more than 60 percent did not assess the value of their software.
[32] M.K. Bergman, 2006. “Climbing the Data Federation Pyramid,” AI3:::Adaptive Information blog, May 25, 2006.
Posted:March 7, 2011

Image from Wikimedia Commons

The Time and Technology is Here to Stand Software Engineering on its Head

As an information society we have become a software society. Software is everywhere, from our phones and our desktops, to our cars, homes and every location in between. The amount of software used worldwide is unknowable; we do not even have agreed measures to quantify its extent or value [1]. We suspect there are at least 1 trillion lines of code that have accumulated over time [1,2]. On the order of $875 billion was spent worldwide on software in 2010, of which about half was for packaged software and licenses and the rest for programmer services, consulting and outsourcing [3]. In the U.S. alone, about 2 million people work as programmers or in related roles [4].

It goes without saying that software is a very big deal.

No matter what the metrics, it is expensive to develop and maintain software. This is also true for open source, which has its own costs of ownership [5]. Designing software faster, with fewer mistakes and more re-use and robustness, has clearly been an emphasis in computer science and the discipline of programming from its inception.

This attention has caused a myriad of schools and practices to develop over time. Some of the earlier efforts included computer-aided software engineering (CASE) or Grady Booch’s (already cited in [1]) object-oriented design (OOD). Fourth-generation languages (4GLs) and rapid application development (RAD) were popular in the 1980s and 1990s. Most recently, agile software development or extreme programming have grabbed mindshare.

Altogether, there are dozens of software development philosophies, each with its passionate advocates. These express themselves through a variety of software development methodologies that might be characterized or clustered into the prototyping or waterfall or spiral camps.

In all instances, of course, the drivers and motivations are the same: faster development, more re-use, greater robustness, easier maintainability, and lower development costs and total costs of ownership.

The Ontology Perspective in this Mix

For at least the past decade, ontologies and semantic Web-related approaches have also been part of this mix. A good summary of these efforts comes from Michael Uschold in an invited address at FOIS 2008 [6]. In this review, he points to these advantages for ontology-based approaches to software engineering:

  • Re-use — abstract/general notions can be used to instantiate more concrete/specific notions, allowing more reuse
  • Reduced development times — producing software artifacts that are closer to how we think, combined with reuse and automation that enables applications to be developed more quickly
  • Increased reliability — formal constructs with automation reduces human error
  • Decreased maintenance costs — increased reliability and the use of automation to convert models to executable code reduces errors. A formal link between the models and the code makes software easier to comprehend and thus maintain.

These first four items are similar to the benefits argued for other software engineering methodologies, though with some unique twists due to the semantic basis. However, Uschold also goes on to suggest benefits for ontology-based approaches not claimed by other methodologies:

  • Reduced conceptual gap — application developers can interact with the tools in a way that is closer to their thinking
  • Facilitate automation — formal structures are amenable to automated reasoning, reducing the load on the human, and
  • Agility/flexibility — ontology-driven information systems are more flexible, because you can much more easily and reliably make changes in the model than in code.

In making these arguments, Uschold picks up on the “ontology-driven information systems” moniker first put forward by Nicola Guarino in 1998 [7]. The ideas around ODIS have had substantial impact on the semantic Web community, especially in the use of formal ontologies and modeling approaches. The FOIS series of conferences, and most recently the ODiSE series, have been spawned from these ideas. There is also, for example, a fairly rich and developed community working on the integration of UML via ontologies as the drivers or specifiers of software [8].

Yet, as Uschold is careful to point out, the idea of ODIS extends beyond software engineering to encompass all of information systems. My own categorization of how ontologies may contribute to information systems is:

  1. Domain modeling — this category includes the domain knowledge representations and reasoning and inference bases that are the traditional understanding of ontologies in the semantic space. The structural aspects are akin to a database schema definition; the unique aspects of ontologies reside in their logic foundations and graph structures, which offer more power in inferencing, reasoning and graph analysis than conventional approaches
  2. Model-driven architectures (MDA) — like UML, these are platform-independent specifications that provide the functional and dataflow definitions of “models” executed by the system. These are the natural progeny of earlier CASE approaches, for example. Such systems also potentially allow graphical or visual means for building or hooking together components as a substitute to direct coding
  3. Program specifications and executables — though fairly experimental at present, these approaches use the languages of RDF, OWL or direct use of logic languages to create the equivalent of executable software programs. A couple of experimental systems, Fhat and Neno for example, point to possible future directions in this area [9]
  4. Runtime or utility components — proper construction of ontologies can be a source for labels and prompts within user interfaces and other runtime uses. Because of the ontology basis, these contributions may also be contextual [10]
  5. Automated agents — based on context, user choices and the governing ontologies, new instruction sets can be generated via what some term automated agents or “robots” to instruct subsequent steps in the software, including potentially analysis or validation. Mission Critical IT [11] is apparently the most advanced in this area; we discuss their ODASE approach more below
  6. Bespoke drivers of generic applications — through using and combining a number of the aspects above, in its totality this approach is a very different paradigm, as we describe below.

When we look at this list from the standpoint of conventional software or software engineering, we see that #1 overlaps with conventional database roles and #2, #3 and #4 overlap with conventional programmer or software engineering responsibilities. The other portions, however, are quite unique to ontology-based approaches.

But Is Software Engineering Even the Right Focus?

For decades, issues related to how to develop apps better and faster have been proposed and argued about. We still have the same litany of challenges and issues from expense to re-use and brittleness. And, unfortunately, despite many methodologies du jour, we still see bottlenecks in the enterprise relating to such matters as:

  • data access
  • queries
  • data transformations
  • data integration or federation
  • reports
  • other data presentations
  • business analysis, and
  • targeted, specialty functionality.

Promises such as self-service reporting touted at the inception of data warehousing two decades ago are still to be realized [12]. Enterprises still require the overhead and layers of IT to write SQL for us and prepare and fix reports. If we stand back a bit, perhaps we can come to see that the real opportunity resides in turning the whole paradigm of software engineering upside down.

Our objective should not be software per se. Software is merely an intermediary artifact to accomplish some given task. Rather than engineering software, the focus should be on how to fulfill those tasks in an optimal manner. How can we keep the idea of producing software from becoming this generation’s new buggy whip example [13]?

For reasons we delve into a bit more below, it perhaps has required a confluence of some new semantic technologies and ontologies to create the opening for a shift in perspective. That shift is one from software as an objective in itself to one of software as merely a generic intermediary in an information task pipeline.

Though this shift may not apply (at least with current technologies) to transactional and process-based software, I submit it may be fundamental to the broad category of knowledge management. KM includes such applications as business intelligence, data warehousing, data integration and federation, enterprise information integration and management, competitive intelligence, knowledge representation, and so forth. These are the real areas where integration and reports and queries and analysis remain frustrating bottlenecks for knowledge workers. And, interestingly, these are also the same areas most amenable to embracing an open world (OWA) mindset [14].

If we stand back and take a systems perspective to the question of fulfilling functional KM tasks, we see that the questions are both broader and narrower than software engineering alone. They are broader because this systems perspective embraces architecture, data, structures and generic designs. The questions are narrower because software, within this broader context, can now be generalized as artifacts providing the fulfillment of classes of functions.

ODapps: The Ontology-Driven Application Approach

Open Semantic Framework (OSF) at openstructs.org

Ontology-driven applications — or ODapps for short — based on adaptive ontologies are a topic we have been nibbling around and discussing for some time. In our oft-cited seven pillars of the semantic enterprise we devote two pillars specifically (#4 and #3, respectively) to these two components [15]. However, in keeping with the systems perspective relevant to a transition from software engineering to generic apps, we should also note that canonical data models (via RDF) and a Web-oriented architecture are two additional pillars in the vision.

ODapps are modular, generic software applications designed to operate in accordance with the specifications contained in one or more ontologies. The relationships and structure of the information driving these applications are based on the standard functions and roles of ontologies (namely as domain ontologies as noted under #1 above), as supplemented by the UI and instruction sets and validations and rules (as noted under #4 and #5 above). The combination of these specifications as provided by both properly constructed domain ontologies and supplementary utility ontologies is what we collectively term adaptive ontologies [16].

ODapps fulfill specific generic tasks, consistent with their bespoke design (#6 above) to respond to adaptive ontologies. Examples of current ontology-driven apps include imports and exports in various formats, dataset creation and management, data record creation and management, reporting, browsing, searching, data visualization and manipulation (through libraries of what we call semantic components), user access rights and permissions, and similar. These applications provide their specific functionality in response to the specifications in the ontologies fed to them.

ODapps are designed more similarly to widgets or API-based frameworks than to the dedicated software of the past, though the dedicated functionality (e.g., graphing, reporting, etc.) is obviously quite similar. The major change in these ontology-driven apps is to accommodate a relatively common abstraction layer that responds to the structure and conventions of the guiding ontologies. The major advantage is that single generic applications can supply shared functionality based on any properly constructed adaptive ontology.
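To make the idea concrete, here is a minimal sketch in Python of a single generic function whose behavior comes entirely from the specification handed to it. This is not the OSF code: the dictionary layout, the `render_records` name and both sample domains are assumptions invented for illustration. A real adaptive ontology would be expressed in RDF or OWL and not as a Python dictionary.

```python
# Hypothetical sketch: one generic "app" whose behavior is set by the
# ontology-like specification it is handed, not by domain-specific code.

def render_records(spec, records):
    """Return display lines for records, using labels and field order from spec."""
    lines = []
    for record in records:
        parts = []
        for attr in spec["display_order"]:
            if attr in record:
                # Fall back to the raw attribute name if no label is defined.
                label = spec["labels"].get(attr, attr)
                parts.append(f"{label}: {record[attr]}")
        lines.append(f"{spec['type_label']} | " + "; ".join(parts))
    return lines

# Two made-up domain specifications; only the data structures differ.
municipal_spec = {
    "type_label": "Neighborhood",
    "labels": {"name": "Name", "pop": "Population"},
    "display_order": ["name", "pop"],
}
camera_spec = {
    "type_label": "Camera",
    "labels": {"model": "Model", "mp": "Megapixels"},
    "display_order": ["model", "mp"],
}

print(render_records(municipal_spec, [{"name": "Northside", "pop": 4200}]))
print(render_records(camera_spec, [{"model": "E520", "mp": 10}]))
```

The same function serves both domains. Moving to a third domain would mean writing a new specification and leaving the code alone, which is the shift in effort the ODapp design aims for.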

In fact, the widget idea from Web 2.0 is a key precursor to the ODapps design. What we see in Web 2.0 are dedicated single-purpose widgets that perform a display operation (such as Google Maps) based on the properly structured data fed to them (structured geolocational information in the case of GMaps).

In Structured Dynamics’ early work with RDF-based applications by our predecessor company, Zitgist, we demonstrated how the basic Web 2.0 widget idea could be extended by “triggering” which kind of mashup widget got invoked by virtue of the data type(s) fed to it. The Query Builder presented contextual choices for how to build a SPARQL query via UI based on what prior dropdown list choices were made. The DataViewer displayed results with different widgets (maps, profiles, etc.) depending on which part of a query’s results set was inspected (by responding to differences in data types). These two apps, in our opinion, remain some of the best developed in the semantic Web space, even though development on both ceased nearly four years ago.

This basic extension of data-driven applications — as informed by a bit more structure — naturally evolved into a full ontology-driven design. We discovered that — with some minor best practice additions to conventional ontologies — we could turn ontologies into powerhouses that informed applications through:

  • An understanding of the kind of things under consideration, including their inference chains
  • The types of data in results sets, and how that informs the nature of the widget(s) (maps, calendars, timelines, charts, tabular reports, images, stories, media, etc.) appropriate to display and manipulate that information, and
  • UI and utility functions such as interface labels, mouseovers, auto-suggests, spelling suggestions, synonym matches, etc.
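The last item in the list above can be sketched in a few lines of Python. The example assumes each concept carries one preferred label and a few alternative labels (synonyms), and uses them to drive auto-suggest. The concept identifiers, the labels and the `suggest` function are all hypothetical and are not drawn from any Structured Dynamics ontology.

```python
# Hypothetical sketch: auto-suggest driven by preferred and alternative labels
# attached to concepts, as an ontology might supply them.

CONCEPTS = {
    "ex:DigitalCamera": {"pref": "Digital Camera", "alt": ["digicam", "digital cam"]},
    "ex:Calendar": {"pref": "Calendar", "alt": ["schedule", "datebook"]},
}

def suggest(prefix, concepts=CONCEPTS):
    """Return (concept id, preferred label) pairs with any label starting with prefix."""
    needle = prefix.strip().lower()
    if not needle:
        return []
    hits = []
    for concept_id, labels in concepts.items():
        candidates = [labels["pref"], *labels["alt"]]
        if any(c.lower().startswith(needle) for c in candidates):
            # Always show the preferred label, even when a synonym matched.
            hits.append((concept_id, labels["pref"]))
    return sorted(hits)

print(suggest("digi"))   # matches the preferred label and a synonym
print(suggest("sched"))  # matches only through a synonym
```

Because the labels live with the concepts, adding a synonym is an edit to the knowledge structure and requires no change to the interface code.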

Like the earlier Zitgist discoveries, basing the applications on only one or two canonical data models and serializations (RDF and a simple data exchange XML, which Fred Giasson calls structXML) provides the input uniformity to make a library of generic applications tractable. And, embedding the entire framework in a Web-oriented architecture means it can be distributed and deployed anywhere accessible by HTTP.

Booch has maintained for years that in software design abstraction is good, but not if too abstract [1]. ODapps are a balanced abstraction within the framework of canonical architectures, data models and data structures. This design thus limits software brittleness and maximizes software re-use. Moreover, it shifts the locus of effort from software development and maintenance to the creation and modification of knowledge structures. The KM emphasis can shift from programming and software to logic and terminology [16].

In the sub-sections below, we peel back some portions of this layered design to unveil how some of these major pieces interact.

Built Upon an Ontology- and Web-based Architecture

Again, to cite Booch, the most fundamental software design decision is architecture [1]. In the case of Structured Dynamics and its support for ODapps, its open semantic framework (OSF) is embedded in a Web-oriented architecture (WOA). The OSF itself is a layered design that starts from a kernel of existing assets (data and structures) and proceeds through conversion to Web service access, and then ontology organization and management via ODapps [17]. The major layers in the OSF stack are:

  • Existing assets — any and all existing information and data assets, ranging from unstructured to structured. Preserving and leveraging those assets is a key premise
  • scones / irON – the conversion layer, in part consisting of information extraction of subject concepts or named entities (scones) or the instance record Object Notation for conveying XML, JSON or spreadsheets (CSV) in RDF-ready form (via irON or RDFizers)
  • structWSF – a platform-independent suite of more than 20 RESTful Web services, organized for managing structured data datasets; it provides the standard, common interface by which existing information assets get represented and presented to the outside world and to other layers in the OSF stack
  • Ontologies — are the layer containing the structured assets “driving” the system; this includes the concepts and relationships of the domain at hand, and administrative ontologies that guide how the user interfaces or widgets in the system should behave
  • conStruct – connecting modules to enable structWSF and sComponents to be hosted/embedded in Drupal, and
  • sComponents – (mostly) Flex semantic components (widgets) for visualizing and manipulating structured data.

Not all of these layers or even their specifics are necessary for an ontology-driven app design [18]. However, the general foundations of generic apps, properly constructed adaptive ontologies, and canonical data models and structures should be preserved in order to operationalize ODapps in other settings.

OSF is the Basis for Domain-specific Instantiations

The power of this design is that by swapping out adaptive ontologies and relevant data, the entire OSF stack as is can be used to deploy multiple instantiations. Potential uses can be as varied as the domain coverage of the domain ontologies that drive this framework.

The OSF semantic framework is a completely open and generic one. The same set of tools and capabilities can be applied to any domain that needs to manage and understand its own information. With the existing ODapps in hand, the information covered ranges from unstructured text or documents to conventional structured databases.

What changes from domain to domain are the data structures (the ontologies, schema and entity references) and their instance data (which can also be converted from existing to canonical forms). Here is an illustration of how this generic framework can be leveraged for different deployments. Note that Citizen Dan is a local government example of the OSF framework with relatively complete online demos:


Structured Dynamics continues to wrinkle this basic design for different clients and different industries. As we round out the starting set of ODapps (see below), the major effort in adapting this generic design to different uses is to tailor the ontologies and “RDFize” existing data assets.

Lower Layers

Conversion of existing assets to RDF and canonical forms is not discussed further here. See the irON and scones documentation or the TechWiki for more information on these topics.

The structWSF Web Services Layer

The first suite of ODapps occurs at the structWSF Web services layer. structWSF provides a set of generic functions and endpoints to:

  • Import or export datasets
  • Create, update, delete (CRUD) or otherwise manage data records
  • Search records with full-text and faceted search
  • Browse or view existing records or record sets, based on simple to possibly complex selection or filtering criteria, or
  • Process results sets through workflows of various natures, involving specialized analysis, information extraction or other functions.

Here is a listing of current ODapp functions within structWSF (with links to details for each):

WSF management Web services
User-oriented Web services

At this level the information access and processing is done largely on the basis of structured results sets. Other visualization and display ODapps are listed in the next subsection.

The Semantic Components Layer

The visualization and data display and manipulation ODapps are provided via the semantic components layer. Structured Dynamics’s sComponents are Flex-based widgets that conform to a standard, generic design. Other developers using the OSF framework are developing JavaScript versions [19]. Here is the current library (with links to details for each):

New Components
Components Extending Flex

These components can be used in combination with any of the structWSF ODapps, meaning the filtering, searching, browsing, import/export, etc., may be combined as an input or output option with the above.

The next animated figure shows how the basic interaction flow works with these components:


Using the ODapp structure it is possible to either “drive” queries and results sets selections via direct HTTP request via endpoints (not shown) or via simple dropdown selections on HTML forms or Flex widgets (shown). This design enables the entire system to be driven via simple selections or interactions without the need for any programming or technical expertise.

As the diagram shows, these various sComponents get embedded in a layout canvas for the Web page. By interacting with the various components, new queries are generated (most often as SPARQL queries) to the various structWSF Web services endpoints. The result of these requests is to generate a structured results set, which includes various types and attributes.

An internal ontology that embodies the desired behavior and display options (SCO, the Semantic Component Ontology) is matched with these types and attributes to generate the formal instructions to the sComponents. When combined with the results set data, and attribute information in the irON ontology, plus the domain understanding in the domain ontology, a synthetic schema is constructed that instructs what the interface may do next. Here is an example schema:


These instructions are then presented to the sControl component, which determines which widgets (individual components, with multiples possible depending on the inputs) need to be invoked and displayed on the layout canvas.
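The selection step can be approximated with a small Python sketch. It assumes the display rules can be reduced to "these attributes imply this widget". The rule table, the attribute names and the `select_widgets` function are hypothetical stand-ins. The actual contents of the SCO ontology and the behavior of sControl are not reproduced here.

```python
# Hypothetical sketch of choosing display widgets from the attributes found in
# a results set. The rule table stands in for display rules an ontology holds.

WIDGET_RULES = [
    # (attributes a record must have, widget to invoke)
    ({"lat", "long"}, "map"),
    ({"start_date"}, "timeline"),
    ({"image_url"}, "image_gallery"),
]
DEFAULT_WIDGET = "tabular_report"

def select_widgets(results):
    """Return widget names applicable to a list of record dictionaries."""
    chosen = []
    for needed, widget in WIDGET_RULES:
        # A widget applies when at least one record has every needed attribute.
        if any(needed.issubset(record) for record in results):
            chosen.append(widget)
    # With no specific match, fall back to a plain tabular display.
    return chosen or [DEFAULT_WIDGET]

results = [
    {"name": "City Hall", "lat": 41.66, "long": -91.53},
    {"name": "Council meeting", "start_date": "2011-03-07"},
]
print(select_widgets(results))
print(select_widgets([{"name": "Budget summary"}]))
```

Several widgets can be returned for one results set, which mirrors the point above that multiples are possible depending on the inputs.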

As new user interactions occur with the resulting displays and components, the iteration cycle is generated anew, again starting a new cycle of queries and results sets. Importantly, as these pathways and associated display components get created, they can be named and made persistent for later re-use or within dashboard invocations.

Self-service Reporting

Since self-service reporting has been such a disappointment [12], it is worth noting another aspect from this ODapp design. Every “thing” that can be presented in the interface can have a specific display template associated with it. Absent another definition, for example, any given “thing” will default to its parental type (which, ultimately, is “Thing”, the generic template display for anything without a definition; this generally defaults to a presentation of all attributes for the object).

However, if more specific templates occur in the inference path, they will be preferentially used. Here is a sample of such a path:

Digital Camera
SLR Digital Camera
Olympus Evolt E520

At the ultimate level of a particular model of Olympus camera, its display template might be exactly tailored to its specifications and attributes.

This design is meant to provide placeholders for any “thing” in any domain, while also providing the latitude to tailor and customize to every “thing” in the domain.
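A minimal Python sketch of this fallback follows. It assumes each type has a single parent, and that “Thing” sits directly above Digital Camera, which simplifies the path shown above. The template names and the `resolve_template` function are hypothetical.

```python
# Hypothetical sketch of display-template lookup along a type's inference path.

PARENT = {
    "Olympus Evolt E520": "SLR Digital Camera",
    "SLR Digital Camera": "Digital Camera",
    "Digital Camera": "Thing",  # assumed; the real path may have more levels
}
TEMPLATES = {
    "Thing": "generic-all-attributes",
    "Digital Camera": "digital-camera",
}

def resolve_template(type_name, parents=PARENT, templates=TEMPLATES):
    """Walk from the most specific type toward Thing; return the first template found."""
    current = type_name
    seen = set()  # guards against accidental cycles in the parent table
    while current is not None and current not in seen:
        if current in templates:
            return templates[current]
        seen.add(current)
        current = parents.get(current)
    # Unknown types get the generic display.
    return templates["Thing"]

print(resolve_template("Olympus Evolt E520"))  # uses the Digital Camera template
print(resolve_template("Bicycle"))             # no definition, generic template
```

Adding an entry for "Olympus Evolt E520" to the template table would make that model pick up its own tailored display, with no change to the lookup code.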

It is critical that generic apps through an ODapp approach also provide the underpinnings for self-service reporting. The ultimate metric is whether consumers of information can create the reports they need without any support or intervention by IT.

Adaptive Analysis

The Mission Critical IT reference provided earlier [11] helps point to the potentials of this paradigm in a different way. Mission Critical also shows user interfaces contextually chosen based on prior selections. But they extend that advantage with context-specific analysis and validation through the SWRL rules-based semantic language. This is an exciting extension of the base paradigm that confirms the applicability of this approach to business intelligence and general enterprise analytics.

Standing Software Engineering on its Head

All of this points to a very exciting era for enterprise and consumer apps moving into the future. We perhaps should no longer talk about “killer apps”; we can shift our focus to the information we have at hand and how we want to structure and analyze it.

Using ontologies to write or specify code or to compete as an alternative to conventional software engineering approaches seems too much like more of the same. The systems basis in which methodologies such as MDA reside has not fixed the enterprise software challenges of decades-long standing. Rather, a shift to generic applications driven by adaptive ontologies — ODapps — looks to shift the locus from software and programming to data and knowledge structures.

This democratization of IT means that everything in the knowledge management realm can become “self service.” We can create our own analyses; develop our own reports; and package and disseminate what we and our colleagues need, when they need it. Through ontology-driven apps and adaptive ontologies, we can turn prior decades of software engineering practices on their head.

What Structured Dynamics and a handful of other vendors are showing is by no means yet complete. Our roster of ODapp widgets and templates still needs much filling out. The toolsets available for creating, maintaining, mapping and extending the ontologies underlying these systems are still woefully inadequate [20]. These are important development needs for the near term.

And, of course, none of this means the end of software development either. Process and transactions systems still likely reside outside of this new, emerging paradigm. Creating great and solid generic ODapps still requires software. Further, ODapps and their potential are completely silent on how we create that software and with what languages or methodologies. The era of software engineering is hardly at an end.

What is exceptionally powerful about the prospects in ontology-driven apps is to speed time to understanding and place information manipulation directly in the hands of the knowledge worker. This is a vision of information access and control that has been frustrated for decades. Perhaps, with ontologies and these semantic technologies, that vision is now near at hand.

[1] This estimate is from Grady Booch, 2005. “The Complexity of Programming Models.” He comments on the weakness of software lines of code as a meaningful measure. At the time in 2005, he estimated perhaps 800 billion lines of code had accumulated, which, given growth and the vagaries of such guesstimates, I have updated to the 1 trillion number noted.
[2] For a wildly different estimate, one that has been criticized somewhat, see Blackduck Software, 2009. “Estimating the Development Cost of Open Source Software.” According to Blackduck’s research there are over 200,000 OSS projects on the Internet representing more than 4.9 billion lines of available code from 4,000 sites that the company monitors. Blackduck estimates that reproducing this OSS would cost $387 billion for “typical” SLOC estimating bases. While Blackduck is likely in the best place of any organization to track open source given their business model, others have criticized the estimates because only a portion (fewer than 10%, consistent with my own research) of open source projects are active, and many active projects also share significant code bases. Nonetheless, there is still a huge disparity between the 1 trillion SLOC estimate in [1] and this estimate of 5 billion for open source alone. This disparity is an indicator of the measurement challenges.
[3] See IMAP, 2010. Computing & Internet Software Global Report — 2010, 40 pp, see The relative splits they show for software packages and licenses, IT consulting or outsourcing are 48%, 29% and 23%, respectively, of the total shown. Note however, that Gartner estimates are as high as 2x these amounts, again showing the uncertainty of measuring software; see, for example,
[4] For this and related measures, see Business Software Alliance, 2009. Software Industry Facts and Figures, see
[5] Simply conduct a Web search on ‘“open source” “cost of ownership”’ to see the many studies in this area. Depending on advocacy, estimates range from as high as proprietary software down to a lower, but still substantial, percentage. In no case is open source understood to be fully “free” once maintenance, upgrades, modifications, and site adaptations are considered.
[6] Michael Uschold, 2008. “Ontology-Driven Information Systems: Past, Present and Future,” in Proceedings of the Fifth International Conference on Formal Ontology in Information Systems (FOIS 2008), Carola Eschenbach and Michael Grüninger, eds., IOS Press, Amsterdam, Netherlands, pp 3-20; see
[7] Nicola Guarino, 1998. “Formal Ontology and Information Systems,” in Proceedings of FOIS’98, Trento, Italy, June 6-8, 1998. Amsterdam, IOS Press, pp. 3-15; see
[8] See Phil Tetlow et al., eds., 2006. Ontology Driven Architectures and Potential Uses of the Semantic Web in Software Engineering, a W3C Editor’s Draft on Best Practices, February 11, 2006; see UML class diagrams have close resemblance to certain ontology structures. This effort was part of a formal collaboration between W3C and the Object Management Group (OMG), which resulted among other things in the production of the Ontology Definition Metamodel (ODM). In the OMG’s model-driven architecture (MDA) initiative, models are used not only for design and maintenance purposes, but as a basis for generating executable artifacts for downstream use. The MDA approach grew out of much of the standards work conducted in the 1990s in the Unified Modeling Language (UML).
[9] Neno is a semantic network programming language and Fhat is a virtual machine that works off of it. These two projects have been largely abandoned. A related project is Ripple, a relational, stack-based dataflow language by Joshua Shinavier, which is episodically updated.
[10] Holger Knublauch of TopQuadrant has made the point that ontologies can also have runtime uses as well: “In contrast to conventional Model-Driven Architecture known from object-oriented systems, semantic applications use their data models not only at design time, but also as runtime components. The rich declarative semantics of ontological data models can be exploited to drive user interfaces and to control an application’s behavior.” See H. Knublauch, 2007. “From Ontology Design to Deployment: Semantic Application Development with TopBraid,” presented at the 2007 Semantic Technology Conference, San Jose, CA; see
[11] Mission Critical IT describes its ODASE platform (Ontology Driven Architecture for Software Engineering) as a set of tools to facilitate the creation of working applications from a semantic business model (an ontology), using the open standards OWL, SWRL and RDF. The ODASE code generators (a.k.a “robots”) generate an API based on the business terminology defined by the OWL+SWRL+RDF business model, which the ODASE platform then uses to execute the rules and reasoning as contextual choices are made by the user. Among other links, the company has an impressive online demo that shows a consumer telecommunications purchase example; there is also a video explaining the rules basis of the ODASE framework.
[12] See Wayne W. Eckerson, 2007. “The Myth of Self-Service Business Intelligence,” in TDWI Online, October 18, 2007; see
[13] The buggy whip industry as a major economic entity ceased to exist with the introduction of the automobile, and is cited in economics and marketing as an example of an industry ceasing to exist because its market niche, and the need for its product, disappears. Not recognizing what industry or business purpose is being served is an oft-cited cause for obsolescence. Thus, software engineering is a practice that serves the creation of software, which itself is only a means to a functional end.
[14] See M. K. Bergman, 2009. “The Open World Assumption: Elephant in the Room,” AI3:::Adaptive Information blog, December 21, 2009. The open world assumption (OWA) generally asserts that the lack of a given assertion or fact being available does not imply whether that possible assertion is true or false: it simply is not known. In other words, lack of knowledge does not imply falsity. Another way to say it is that everything is permitted until it is prohibited. OWA lends itself to incremental and incomplete approaches to various modeling problems.
[15] See M.K. Bergman, 2010. Seven Pillars of the Open Semantic Enterprise, AI3:::Adaptive Information blog, January 12, 2010.
[16] See M.K. Bergman, 2009. Ontologies as the ‘Engine’ for Data-Driven Applications, AI3:::Adaptive Information blog, June 10, 2009, for the first presentation of these topics, but the specific term adaptive ontology was not yet used. That term was first introduced in “Confronting Misconceptions with Adaptive Ontologies” (August 17, 2009). The dedicated treatment of these topics and their interplay was provided in M.K. Bergman, 2009. “Ontology-driven Applications Using Adaptive Ontologies”, AI3:::Adaptive Information blog, November 23, 2009. The relation of these topics to enterprise software was first presented in M.K. Bergman, 2009. “Fresh Perspectives on the Semantic Enterprise”, AI3:::Adaptive Information blog, September 28, 2009.
[17] Some 250 pp of complete technical documentation for these projects is provided on the Structured Dynamics’ open source OpenStructs TechWiki.
[18] For more discussion of semantic components, see F. Giasson, 2010. “Semantic Components,” in his blog, July 5, 2010. For more discussion of the layered OSF design, see M.K. Bergman, 2010. Domain-specific Instantiations based on the Open Semantic Framework, AI3:::Adaptive Information blog, June 17, 2010.
[19] To find these groups and follow the open source OSF developments, see xxx. So long as the basic design comports with the foundations herein, sComponents may be developed in any rich Internet application (RIA) environment.
[20] Ontology development, management and mapping is the emerging imperative in the semantic technology space. For some thoughts on how Structured Dynamics is approaching this question, see a Normative Landscape of Ontology Tools on the TechWiki.
Posted:February 28, 2011

Photo courtesy goldonomic.com

Wikipedia + UMBEL + Friends May Offer One Approach

In the first part of this series we argued for the importance of reference structures to provide the structures and vocabularies to guide interoperability on the semantic Web. The argument was made that these reference structures are akin to human languages, requiring a sufficient richness and terminology to enable nuanced and meaningful communications of data across the Web and within the context of their applicable domains.

While the idea of such reference structures is great, and perhaps even intuitive when likened to human languages, it raises the question of what the basis for such structures should be. Just as in human languages we have dictionaries, thesauri, grammar and style books or encyclopedias, what are the analogous reference sources for the semantic Web?

In this piece, we tackle these questions from the perspective of the entire Web. Similar challenges and approaches occur, of course, for virtually every domain and specific community. But, by focusing on the entirety of the Web, perhaps we can discern the grain of sand at the center of the pearl.

Bootstrapping the Semantic Web

The idea of bootstrapping is common in computers, compilers or programming. Every computer action needs to start from a basic set of instructions from which further instructions or actions are derived. Even starting up a computer (“booting up”) reflects this bootstrapping basis. Bootstrapping is the answer to the classic chicken-or-egg dilemma by embedding a starting set of instructions that provides the premise at start up [1]. The embedded instruction for simple addition, for example, is the basis for building up more complete mathematical operations.

So, what is the grain of sand at the core of the semantic Web that enables it to bootstrap meaning? We start with the basic semantics and “instructions” in the core RDF, RDFS and OWL languages. These are very much akin to the basic BIOS instructions for computer boot up or the instruction sets leveraged by compilers. But, where do we go from there? What is the analog to the compiler or the operating system that gives us more than these simple start up instructions? In a semantics sense, what are the vocabularies or languages that enable us to understand more things, connect more things, relate more things?

To date, the semantic Web has given us perhaps a few dozen commonly used vocabularies, most of which are quite limited and simple pidgin languages such as DC, FOAF, SKOS, SIOC, BIBO, etc. We also have an emerging catalog of “things” and concepts from Wikipedia (via DBpedia) and similar. (Recall, in this piece, we are trying to look Web-wide, so the many fine building blocks for domain purposes such as found in biology, medicine, finance, astronomy, etc., are excluded.) The purposes and scope of these vocabularies widely differ and attack quite different slices of the information space. SKOS, for example, deals with describing simple knowledge structures like taxonomies or thesauri; SIOC is for describing social media.

By virtue of adoption, each of these core languages has proved its usefulness and role. But, as skew lines in space, how do these vocabularies relate to one another? And, how can all of the specific domain vocabularies also relate to those and one another where there are points of intersection or overlap? In short, after we get beyond the starting instructions for the semantic Web, what is our language and vocabulary? How do we complete the bootstrap process?

Clearly, like human languages, we need rich enough vocabularies to describe the things in our world and a structure of the relationships amongst those things to give our communications meaning and coherence. That is precisely the role provided by reference structures.

The Use and Role of ‘Gold Standards’

To prevent reference structures from being rubber rulers, some fixity or grounding needs to establish the common understanding for their referents. Such fixed references are often called ‘gold standards’. In money, of course, this used to be a fixed weight of gold, until that basis was abandoned in the 1970s. In the metric system, there are a variety of fixed weights and measures that are employed. In the English language, the Oxford English Dictionary (OED) is the accepted basis for the lexicon. And so on.

Yet, as these examples show, none of these gold standards is absolute. Money now floats; multiple systems of measurement compete; a variety of dictionaries are used for English; most languages have their own reference sets; etc. The key point in all gold standards, however, is that there is wide acceptance for a defined reference for determining alignments and arbitrating differences.

Gold standards or reference standards play the role of referees or arbiters. What is the meaning of this? What is the definition of that? How can we tell the difference between this and that? What is the common way to refer to some thing?

Let’s provide one example in a semantic Web context. Let’s say we have a dataset and its schema A that we are aligning with another dataset with schema B. If I say two concepts align exactly across these datasets and you say differently, how do we resolve this difference? On one extreme, each of us can say our own interpretation is correct, and to heck with the other. On the other extreme, we can say both interpretations are correct, in which case both assertions are meaningless. Perhaps papering over these extremes is OK when only two competing views are in play, but what happens when real problems with many actors are at stake? Shall we propose majority rule, chaos, or the strongest prevails?

These same types of questions have governed human interaction from time immemorial. One of the reasons to liken the problem of interoperability on the semantic Web to human languages, as argued in Part I, is to seek lessons and guidance from how our languages have evolved. Finding common ground in our syntax and vocabularies (and, critically, in how we accept changes to those) is the basis for communication. Each of these understandings needs to be codified and documented so that they can be referenced, and so that we can have some confidence in what the heck it is we are trying to convey.

For reference structures to play their role in plugging this gap — that is, to be much more than rubber rulers — they need to have such grounding. Naturally, these groundings may themselves change with new information or learning inherent to the process of human understanding, but they still should retain their character as references. Grounded references for these things — ‘gold standards’ — are key to this consensual process of communicating (interoperating).

Some ‘Gold Standards’ for the Semantic Web

The need for gold standards for the semantic Web is particularly acute. First, by definition, the scope of the semantic Web is all things and all concepts and all entities. Second, because it embraces human knowledge, it also embraces all human languages with the nuances and varieties thereof. There is an immense gulf in referenceability from the starting languages of the semantic Web in RDF, RDFS and OWL to this full scope. This gulf is chiefly one of vocabulary (or lack thereof). We know how to construct our grammars, but we have few words with understood relationships between them to put in the slots.

The types of gold standards useful to the semantic Web are similar to those useful to our analogy of human languages. We need guidance on structure (syntax and grammar), plus reference vocabularies that encompass the scope of the semantic Web (that is, everything). Like human languages, the vocabulary references should have analogs to dictionaries, thesauri and encyclopedias. We want our references to deal with the specific demands of the semantic Web in capturing the lexical basis of human languages and the connectedness (or not) of things. We also want bases by which all of this information can be related to different human languages.

To capture these criteria, then, I submit we should consider a basic starting set of gold standards:

  • RDF/RDFS/OWL — the data model and basic building blocks for the languages
  • Wikipedia — the standard reference vocabulary of things, concepts and entities, plus other structural guidances
  • WordNet — lexical language references as an aid to natural language processing, and
  • UMBEL — the structural reference for the connectedness of things for basic coherence and inference, plus a vocabulary for mapping amongst reference structures and things.

Each of these potential gold standards is next discussed in turn. The majority of discussion centers on Wikipedia and UMBEL.

RDF/RDFS/OWL: The Language

Naturally, the first suggested gold standard for the semantic Web is the set of RDF/RDFS/OWL language components. Other writings have covered their uses and roles [2]. In relation to their use as a gold standard, two documents, one on RDF semantics [3] and the other an OWL primer [4], are great starting points. Since these languages are now in place and are accepted bases of the semantic Web, we will concentrate on the remaining members of the standard reference set.
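To make the idea of "starting instructions" a little more concrete, here is a minimal sketch in plain Python (no RDF library) of two well-known RDFS entailment patterns: transitivity of rdfs:subClassOf, and inheritance of rdf:type through rdfs:subClassOf. The ex: terms and the function name are invented for illustration, and the sketch covers only a small corner of what the RDF Semantics document [3] specifies.

```python
# Triples are (subject, predicate, object) tuples; the ex: terms are hypothetical.
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def rdfs_closure(triples):
    """Apply two RDFS entailment patterns until no new triples appear."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                if p2 != SUBCLASS or s2 != o:
                    continue
                if p == SUBCLASS:
                    # A subClassOf B, B subClassOf C  =>  A subClassOf C
                    new.add((s, SUBCLASS, o2))
                elif p == TYPE:
                    # x type A, A subClassOf B  =>  x type B
                    new.add((s, TYPE, o2))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred

graph = {
    ("ex:IowaCity", TYPE, "ex:City"),
    ("ex:City", SUBCLASS, "ex:PopulatedPlace"),
    ("ex:PopulatedPlace", SUBCLASS, "ex:Place"),
}
closure = rdfs_closure(graph)
```

From three asserted statements the closure also contains the statement that ex:IowaCity is an ex:Place. That is the sense in which a small set of core semantics can be built upon, provided there is a vocabulary to apply it to.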

Wikipedia: The Vocabulary (and More)

The second suggested gold standard for the semantic Web is Wikipedia, principally as a sort of canonical vocabulary base or lexicon, but also for some structural aspects. Wikipedia now contains about 3.5 million English articles, by far larger than any other knowledge base, and has more than 250 language versions. Each Wikipedia article acts as more or less a reference for the thing it represents. In addition, the size, scope and structure of Wikipedia make it an unprecedented resource for researchers engaged in natural language processing (NLP), information extraction (IE) and semantic Web-related tasks.

For some time I have been maintaining a listing called SWEETpedia of academic and research articles focused on the use of Wikipedia for these tasks. The latest version tracks some 250 articles [5], which I guess to be about one half or more of all such research extant. This research shows a broad variety of potential roles and contributions from Wikipedia as a gold standard for the semantic Web, some of which are detailed in the tables below.

An excellent report by Olena Medelyan et al. from the University of Waikato in New Zealand, Mining Meaning from Wikipedia, organized this research up through 2008 and provided detailed commentary and analysis of the role of Wikipedia [6]. They noted, for example, that Wikipedia has potential use as an encyclopedia (its intended use), a corpus for testing and modeling NLP tasks, as a thesaurus, a database, an ontology or a network structure. The Intelligent Wikipedia project from the University of Washington has also done much innovative work on “automatically learned systems [that] can render much of Wikipedia into high-quality semantic data, which provides a solid base to bootstrap toward the general Web” [7].

However, as we proceed through the next discussions, we’ll see that the weakest aspect of Wikipedia is its category structure. Thus, while Wikipedia is unparalleled as the gold standard for a reference vocabulary for the Web, and has other structural uses as well, we will need to look elsewhere for how that content is organized.

Major Wikipedia Initiatives

Many groups have recognized these advantages for Wikipedia, and have built knowledge bases around it. Many of these groups have also recognized the category (schema) weaknesses in Wikipedia and have proposed alternatives. Some of these major initiatives, which also collectively represent a large number of the research articles in SWEETpedia, include:

Project | Schema Basis | Comments
DBpedia | Wikipedia Infoboxes | excellent source for URI identifiers; structure extraction basis used by many other projects
Freebase | User Generated | schema are for domains based on types and properties; at one time had a key dependence on Wikipedia; has since grown much from user-generated data and structure; now owned by Google
Intelligent Wikipedia | Wikipedia Infoboxes | a broad program and a general set of extractors for obtaining structure and relationships from Wikipedia; was formerly known as KOG; from Univ of Washington
SIGWP | Wikipedia Ontology | the Special Interest Group of Wikipedia (Research or Mining); a general group doing research on Wikipedia structure and mining; schema basis is mostly from a thesaurus; group has not published in two years
UMBEL | UMBEL Reference Concepts | RefConcepts based on the Cyc knowledge base; provides a tested, coherent concept schema, but one with gaps regarding Wikipedia content; has 28,000 concepts mapped to Wikipedia
WikiNet | Extracted Wikipedia Ontology | part of a long-standing structure extraction effort from Wikipedia leading to an ontology; formerly known as WikiRelate; from the Heidelberg Institute for Theoretical Studies (HITS)
Wikipedia Miner | N/A | generalized structure extractor; part of a wider basis of Wikipedia research at the Univ of Waikato in New Zealand
Wikitology | Wikipedia Ontology | general RDF and ontology-oriented project utilizing Wikipedia; effort now concluded; from the Ebiquity Group at the Univ of Maryland
YAGO | WordNet | maps WordNet to Wikipedia, with structured extraction of relations for characterizing entities


It is interesting to note that none of the efforts above uses the Wikipedia category structure “as is” for its schema.

Structural Sources within Wikipedia

The surface view of Wikipedia is topic articles placed into one or more categories. Some of these pages also include structured data tables (or templates) for the kind of thing the article is; these are called infoboxes. An infobox is a fixed-format table placed at the top right of articles to consistently present a summary of some unifying aspect that the articles share. For example, see the listing for my home town, Iowa City, which has a city infobox.
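As a rough illustration of why infoboxes are attractive for structure extraction, the sketch below pulls key-value pairs out of a simplified infobox in wikitext form. It assumes a flat template with no nested templates, links or pipes inside values, which real infoboxes frequently contain; the sample values and the function name are made up for the example.

```python
def parse_infobox(wikitext):
    """Return (template_name, {field: value}) from a flat, un-nested infobox."""
    body = wikitext.strip()
    if not (body.startswith("{{") and body.endswith("}}")):
        raise ValueError("not a template call")
    parts = body[2:-2].split("|")
    name = parts[0].strip()
    fields = {}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if sep:  # skip positional parameters that have no '='
            fields[key.strip()] = value.strip()
    return name, fields

# Simplified, illustrative sample; not copied from the live article
sample = """{{Infobox settlement
| name = Iowa City
| settlement_type = City
| subdivision_name1 = Iowa
}}"""
template, facts = parse_infobox(sample)
```

The template name hints at the kind of thing the article describes, and the fields give candidate attributes for it. Extraction at Wikipedia scale needs a proper wikitext parser rather than string splitting.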

However, this cursory look at Wikipedia in fact masks much additional and valuable structure. Some early researchers noted this [8]. The recognition of structure has also been a key driver for the interest in Wikipedia as a knowledge base (in addition to its global content scope). The following table is a fairly complete listing of structure possibilities within Wikipedia (see Endnotes for any notes):

Wikipedia Structure Potential Applications Note
Entire Corpus
knowledge base; graph structure; corpus for n-grams, other constructions [9]
category suggestion; semantic relatedness; query expansion; potential parent category
Contained Articles
semantically-related terms (siblings)
hyponymic and meronymic relations between terms
Listing Pages/Categories
semantically-related terms (siblings)
Patterned Categories
functional metadata [9]
Infobox Templates
synonyms; key-value pairs
units of measure; fact extraction [9]
category suggestion; entity suggestion
coordinates; places; geolocational; (may also appear in full article text)
Issue Templates
Multiple Types
exclusion candidates; other structural analysis; examples include Stub, Message Boxes, Multiple Issues [9]
Category Templates [13]
Category Name
disambiguation; relatedness
Category Links
semantic relatedness
First Paragraph
definition; abstract
Full Text
complete discussion; related terms; context; translations; NLP analysis basis; relationships; sentiment
synonymy; spelling variations, misspellings; abbreviations; query expansion
named entities; domain specific terms or senses
category suggestion (phrase marked in bold in first paragraph)
Section Heading(s)
category suggestion; semantic relatedness [9]
See Also
related concepts; query expansion [9]
Further Reading
related concepts [9,10]
External Links
related concepts; external harvest points
Article Links
related terms; co-occurrences
synonyms; spelling variations; related terms; query expansion
link graph; related terms
category suggestion; functional metadata
category suggestion; functional metadata
external harvest points [9,10]
thumbnails; image recognition for disambiguation; controversy (edit/upload frequency) [11]
related concepts; related terms; functional metadata [9]
Disambiguation Pages
Article Links
sense inventory
Discussion Pages
Discussion Content
Redux for Article Structure
see Articles for uses
History Pages
Edit Frequency
topicalness; controversy (diversity of editors, reversions)
Edit Basis
lexical errors [9]
instances; named entity candidates
Alternate Language Versions
Redux for All Structures
see all items above; translation; multilingual alignment; entity disambiguation [12]

The potential for Wikipedia to provide structural understandings is evident from this table. However, it should be noted that, aside from some stray research initiatives, most effort to date has focused on the major initiatives noted earlier or from analyzing linking and infoboxes. There is much additional research that could be powered by the Wikipedia structure as it presently exists.

From the standpoint of the broader semantic Web, the potential of Wikipedia in the areas of metadata enhancement and mapping to multiple human languages [12] is particularly strong. We are only now at the very beginning phases of tapping this potential.

Structural Weaknesses

The three main weaknesses with Wikipedia are its category structure [14], inconsistencies and incompleteness. The first weakness means Wikipedia is not a suitable organizational basis for the semantic Web; the next two weaknesses, due to the nature of Wikipedia’s user-generated content, are constantly improving.

Our recent effort to map between UMBEL and Wikipedia, undertaken as part of the recent UMBEL v 1.00 release, spent considerable time analyzing the Wikipedia category structure [15]. Of the roughly half million categories in Wikipedia, only about 85,000 were found to be suitable candidates to participate in an actual schema structure. Further breakdowns are shown by this table resulting from our analysis:

Wikipedia Category Breakdowns
Category group | Share
Removals | 20.7%
  Administrative | 15.7%
  Misc Cleaning | 5.0%
Functional (not schema) | 61.8%
  Fn Dates | 10.1%
  Fn Nationalities | 9.6%
  Fn Listings, related | 0.8%
  Fn Occupations | 1.0%
  Fn Prepositions | 40.4%
Candidates | 17.4%
  SuperTypes | 1.7%
  General Structure | 15.7%
TOTAL | 100.0%
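As a quick cross-check of these figures, the short sketch below re-adds the sub-rows and relates the 17.4% candidate share to the roughly 85,000 candidate categories mentioned above. The percentages are copied from the table; the variable names are only for illustration.

```python
# Percentages copied from the breakdown table above.
breakdown = {
    "Removals": {"Administrative": 15.7, "Misc Cleaning": 5.0},
    "Functional (not schema)": {
        "Fn Dates": 10.1,
        "Fn Nationalities": 9.6,
        "Fn Listings, related": 0.8,
        "Fn Occupations": 1.0,
        "Fn Prepositions": 40.4,
    },
    "Candidates": {"SuperTypes": 1.7, "General Structure": 15.7},
}

# Re-add each group's sub-rows, rounding to one decimal like the table
subtotals = {
    group: round(sum(parts.values()), 1) for group, parts in breakdown.items()
}
total = round(sum(subtotals.values()), 1)

# Total category count implied by about 85,000 candidates at a 17.4% share
implied_total = round(85_000 / (subtotals["Candidates"] / 100))
```

The functional sub-rows add to 61.9% against the 61.8% shown, which looks like ordinary rounding, and the implied total of a little under 490,000 categories is consistent with the "roughly half million" figure.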

Fully 1/5 of the categories are administrative or internal in nature. The large majority of categories are, in fact, not structural at all, but what we term functional categories, which means the category contains faceting information (such as subclassifying musicians into British musicians) [16]. Functional categories can be a rich source of supplementary metadata for their assigned articles (though no one has yet processed Wikipedia in this manner), but they are not a useful basis for structural conceptual relationships or inferencing.
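To show what "faceting information" could look like in practice, here is a toy sketch that splits a category name into a facet and a base concept, loosely following the date, nationality and preposition patterns in the table. The word lists and rules are invented heuristics for illustration only; they are not the procedure used in our analysis.

```python
import re

# Tiny illustrative lists; a real run would need far larger vocabularies.
NATIONALITIES = {"British", "American", "French"}
PREPOSITIONS = {"in", "of", "by", "from"}

def facet_category(name):
    """Split a category name into (facet_type, facet_value, base_concept)."""
    words = name.split()
    if not words:
        return ("none", None, name)
    # Leading year or decade, e.g. "1960s albums"
    if re.fullmatch(r"\d{4}s?", words[0]) and len(words) > 1:
        return ("date", words[0], " ".join(words[1:]))
    # Leading nationality, e.g. "British musicians"
    if words[0] in NATIONALITIES and len(words) > 1:
        return ("nationality", words[0], " ".join(words[1:]))
    # Prepositional pattern, e.g. "Musicians from Iowa"
    for i, word in enumerate(words[1:-1], start=1):
        if word in PREPOSITIONS:
            return ("preposition", " ".join(words[i:]), " ".join(words[:i]))
    return ("none", None, name)
```

Under these assumptions, "British musicians" yields the structural concept "musicians" plus a nationality facet that could be attached to each assigned article as metadata.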

This weakness in the Wikipedia category system has been known for some time [17], but researchers and others still attempt to do mappings on mostly uncleaned categories. Though most researchers recognize and remove internal or administrative categories in their efforts, using the indiscriminate remainder of categories still leads to poor precision in resulting mappings. In fact, in comparison to one of the more rigorous assessments to date [18], our analysis still showed a 6.8% error rate in hand inspected categories.

Other notable category problems include circular references, skipped intermediate categories, misassigned categories and incomplete assignments.
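Circular references are the easiest of these problems to test for mechanically. The sketch below uses a standard depth-first search to find one cycle in a category-to-parent mapping; the toy data is hypothetical and not drawn from Wikipedia, and a recursive search like this would need to be made iterative for a graph of Wikipedia's size.

```python
def find_cycle(parents):
    """Return one cycle as a list of category names, or None if there is none.

    `parents` maps each category to the categories it is filed under.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for parent in parents.get(node, ()):
            state = color.get(parent, WHITE)
            if state == GREY:  # back edge: parent is already on the current path
                return path[path.index(parent):] + [parent]
            if state == WHITE:
                cycle = visit(parent)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for start in parents:
        if color.get(start, WHITE) == WHITE:
            cycle = visit(start)
            if cycle:
                return cycle
    return None

# Hypothetical category links, not real Wikipedia data
toy = {
    "Jazz musicians": ["Musicians"],
    "Musicians": ["Music"],
    "Music": ["Performing arts"],
    "Performing arts": ["Music"],  # circular reference
}
```

Any category on such a cycle becomes its own ancestor, which is why an uncleaned category graph is a poor basis for subsumption-style inference.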

Nonetheless, Wikipedia categories do have a valuable use in the analysis of local relationships (one degree of relatedness) and for finding missing category candidates. And, as noted, the functional categories are also a rich and untapped source of additional article metadata.

Like any knowledge base, Wikipedia also has inconsistent and incomplete coverage of topics [19]. However, as more communities accept Wikipedia as a central resource deserving completeness, we should see these gaps continue to get filled.

The DBpedia Implementation

One of the first database versions of Wikipedia built for semantic Web purposes is DBpedia. DBpedia has an incipient ontology useful for some classification purposes. Its major structural organization is built around the Wikipedia infoboxes, which are applied to about a third of Wikipedia articles. DBpedia also has multiple language versions.

DBpedia is a core hub of Linked Open Data (LOD), which now has about 300 linked datasets; has canonical URIs used by many other applications; has extracted versions and tools very useful for further processing; and has recently moved to incorporate live updates from the source Wikipedia [20]. For these reasons, the DBpedia version of Wikipedia is the suggested implementation version.

WordNet: Language Relationships

The third suggested gold standard for the Semantic Web is WordNet, a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. There are over 50 languages covered by wordnet approaches, most mapped to this English WordNet [21].

Though it has been used in many ontologies [22], WordNet is most often mapped for its natural language purposes and not used as a structure of conceptual relationships per se. This is because it is designed for words and not concepts. It contains hundreds of basic semantic inconsistencies and also lacks much domain applicability. Entities, of course, are also lacking. In those cases where WordNet has been embraced as a schema basis, much work is generally expended to transform it into an ontology suitable for knowledge representation.

Nonetheless, for word sense disambiguation and other natural language processing tasks, as well as for aiding multi-lingual mappings, WordNet and its various other language variants are a language reference gold standard.

UMBEL: A Coherent Structure

So, with these prior gold standards we gain a basic language and grammar; a base (canonical) vocabulary and some structure guidance; and a reference means for processing and extracting information from input text. Yet two needed standards remain.

One needed standard is a conceptual organizing structure (or schema) by which the canonical vocabulary of concepts and instances can be related. This core structure should be constructed in a coherent [23] manner and expressly designed to support inferencing and (some) reasoning. This core structure should be sufficiently large to embrace the scope of the semantic Web, but not so detailed as to make it computationally inefficient. Thus, the core structure should be a framework that allows more focused and purposeful vocabularies to be “plugged in”, depending on the domain and task at hand. Unfortunately, the candidate category structures from our other gold standards in Wikipedia and WordNet do not meet these criteria.

A second needed standard is a bit of additional vocabulary “glue” specifically designed for the purposes of the semantic Web and ontology and domain incorporation. We have multiple and disparate world views and contexts, as well as the things described by them [24]. To get them to interoperate — and to acknowledge differences in alignment or context — we need a set of relational predicates (vocabulary) that can capture a range of mappings from the exact to the approximate [25]. Unlike other reference vocabularies that attempt to capture canonical definitions within defined domains, this vocabulary is expressly required by the semantic Web and its goal to federate different data and schema.
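A small sketch may help show what a range of mappings "from the exact to the approximate" means operationally. It uses existing OWL and SKOS mapping predicates as stand-ins, not UMBEL's own vocabulary, and the numeric ranking is an assumption made purely for the example, as are the a: and b: terms.

```python
# Assumed ordering from exact to approximate; the ranking is illustrative only.
STRENGTH = {
    "owl:sameAs": 4,
    "skos:exactMatch": 3,
    "skos:closeMatch": 2,
    "skos:relatedMatch": 1,
}

def mappings_at_least(mappings, predicate):
    """Keep mapping triples whose predicate is at least as strong as `predicate`."""
    floor = STRENGTH[predicate]
    return [m for m in mappings if STRENGTH.get(m[1], 0) >= floor]

# Hypothetical terms from two schemas, A and B
mappings = [
    ("a:Musician", "skos:exactMatch", "b:Musician"),
    ("a:Band", "skos:closeMatch", "b:MusicalGroup"),
    ("a:Concert", "skos:relatedMatch", "b:Venue"),
]
strict = mappings_at_least(mappings, "skos:closeMatch")
```

Recording the degree of match as part of the mapping lets a consuming application decide how much approximation it will tolerate, rather than forcing every alignment to be asserted as exact.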

UMBEL has been expressly designed to address both of these needs [26]. UMBEL is a coherent categorization structure for the semantic Web and a mapping vocabulary designed for dataset and conceptual interoperability. UMBEL’s 28,000 reference concepts (RefConcepts) are based on the Cyc knowledge base [27], which itself is expressly designed as a common sense representation of the world with express variations in context supported via its 1000 or so microtheories. Cyc, and UMBEL which is based upon it, are by no means the “correct” or “only” representations of the world, but they are coherent ones and thus internally consistent.

UMBEL’s role to allow datasets to be “plugged in” and related through some fixed referents was expressed by this early diagram [28]:

Lightweight Binding to an Upper Subject Structure Can Bring Order

The idea — which is still central to this kind of reference structure — is that a set of reference concepts can be used by multiple datasets to connect and then inter-relate. These are shown by the nested subjects (concepts) in the umbrella structure.
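The binding idea can be sketched in a few lines: if terms in separate datasets each point to a shared reference concept, the datasets become related to one another without any direct pairwise mapping. The dataset names, local terms and ref: identifiers below are all hypothetical.

```python
from collections import defaultdict

# Hypothetical bindings: (dataset, local term) -> reference concept
bindings = {
    ("musicdb", "artist"): "ref:Musician",
    ("library", "performer"): "ref:Musician",
    ("library", "book"): "ref:Book",
    ("shop", "paperback"): "ref:Book",
}

def related_terms(bindings):
    """Group local terms from different datasets by shared reference concept."""
    by_concept = defaultdict(list)
    for (dataset, term), concept in bindings.items():
        by_concept[concept].append(f"{dataset}:{term}")
    return {concept: sorted(terms) for concept, terms in by_concept.items()}

groups = related_terms(bindings)
```

With N datasets, each one needs only its own bindings to the reference structure rather than a separate mapping to every other dataset, which is a large part of the appeal of a lightweight upper structure.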

UMBEL, of course, is not the only coherent structure for such interoperability purposes. Other major vocabularies (such as LCSH; see below) or upper-level ontologies (such as SUMO, DOLCE, BFO or PROTON, etc.) can fulfill portions of these roles, as well. In fact, the ultimate desire is for multiple reference structures to emerge that are mapped to one another, similar to how human languages can inter-relate. Yet, even in that desired vision, there is still a need for a bootstrapped grounding. UMBEL is the first such structure expressly designed for the two needed standards.

Mappings to the Other Standards

UMBEL is already based on the central semantic Web languages of RDF, RDFS, SKOS, and OWL 2. The recent version 1.00 now maps 60% of UMBEL to Wikipedia, with efforts for the remainder in process. UMBEL provides mappings to WordNet, via its Cyc relationships. More of this is in process and will be exposed. And the mappings between UMBEL and GeoNames [29] for locational purposes are also nearly complete.

The Gold Resides in Combining These Standards

Each of these reference structures — RDF/OWL, Wikipedia, WordNet, UMBEL — is itself coherent and recognized or used by multiple parties for potential reference purposes on the semantic Web. The advocacy of them as standards is hardly radical.

However, the gold lies in the combination of these components. It is in this combination that we can see a grounded knowledge base emerge that is sufficient for bootstrapping the semantic Web.

The challenge in creating this reference knowledge base is in the mapping between the components. Fortunately, all of the components are already available in RDF/OWL. WordNet already has significant mappings to Wikipedia and UMBEL. And 60% of UMBEL is already mapped to Wikipedia. The remaining steps for completing these mappings are very near at hand. Other vocabularies, such as GeoNames [29], would also beneficially contribute to such a reference base.

Yet to truly achieve a role as a gold standard, these mappings should be fully vetted and accurate. Automated techniques that embed errors are unacceptable. Gold standards should not themselves be a source for propagation of errors. Like dictionaries or thesauri, we need reference structures that are of high quality and deserving of reference. We need canonical structures and canonical vocabularies.

But, once done, these gold standards themselves become reference sources that can aid automatic and semi-automatic mappings of other vocabularies and structures. Thus, the real payoff is not that these gold standards themselves get actually embedded in specific domain uses or whatever, but that they can act as reference referees for helping align and ground other structures.

Like the bootstrap condition, more and more reference structures may be brought into this system. A reference structure does not mean reliance; it need not even have more than minimal use. As new structures and vocabularies are brought into the mix, appropriate to specific domains or purposes, reference to other grounding structures will enable the structures and vocabularies to continue to expand. So, not only are reference concepts necessary for grounding the semantic Web, but we also need to pick good mapping predicates for properly linking these structures together.

In this manner, many alternative vocabularies can be bootstrapped and mapped and then used as the dominant vocabularies for specific purposes. For example, at the level of general knowledge categorization, vocabularies such as LCSH, the Dewey Decimal Classification, UDC, etc., can be preferentially chosen. Other specific vocabularies are at the ready, with many already used for domain purposes. Once grounded, these various vocabularies can also interoperate.

Grounding in gold standards enables the freedom to switch vocabularies at will. Establishing fixed reference points via such gold standards will power a virtuous circle of more vocabularies, more mappings, and, ultimately, functional interoperability no matter the need, domain or world view.

This is the last of a two-part series on the importance and choice of reference structures (Part I) and gold standards (Part II) on the semantic Web.

[1] For example, according to the Wikipedia entry on Machine code, “A machine code instruction set may have all instructions of the same length, or it may have variable-length instructions. How the patterns are organized varies strongly with the particular architecture and often also with the type of instruction. Most instructions have one or more opcode fields which specifies the basic instruction type (such as arithmetic, logical, jump, etc) and the actual operation (such as add or compare) and other fields that may give the type of the operand(s), the addressing mode(s), the addressing offset(s) or index, or the actual value itself.”
[2] See, for example, M.K. Bergman, 2009. “Advantages and Myths of RDF,” AI3:::Adaptive Information blog, April 8, 2009; see and M.K. Bergman, 2010. “Ontology Tutorial Series,” AI3:::Adaptive Information blog, September 27, 2010; see
[3] Patrick Hayes, ed., 2004. RDF Semantics, W3C Recommendation 10 February 2004. See
[4] Pascal Hitzler et al., eds., 2009. OWL 2 Web Ontology Language Primer, a W3C Recommendation, 27 October 2009; see
[5] See SWEETpedia from the AI3:::Adaptive Information blog, which currently lists about 250 articles and citations.
[6] Olena Medelyan, Catherine Legg, David Milne and Ian H. Witten, 2008. Mining Meaning from Wikipedia, Working Paper Series ISSN 1177-777X, Department of Computer Science, The University of Waikato (New Zealand), September 2008, 82 pp. See This paper and its findings are discussed more in M.K. Bergman, 2008. “Research Shows Natural Fit between Wikipedia and Semantic Web,” AI3:::Adaptive Information blog, October 15, 2008; see
[7] For a comprehensive treatment, see Fei Wu, 2010. Machine Reading: from Wikipedia to the Web, a doctoral thesis to the Department of Computer Science, University of Washington, 154 pp; see To my knowledge, this paper also was the first to use the “bootstrapping” metaphor.
[8] Quite a few research papers have characterized various aspects of the Wikipedia structure. One of the first and most comprehensive was Torsten Zesch, Iryna Gurevych, Max Mühlhäuser, 2007b. Analyzing and Accessing Wikipedia as a Lexical Semantic Resource, and the longer technical report. See Also, 2008. In Proceedings of the Biannual Conference of the Society for Computational Linguistics and Language Technology, pp. 213221. Also, for another early discussion, see Linyun Fu, Haofen Wang, Haiping Zhu, Huajie Zhang, Yang Wang and Yong Yu, 2007. Making More Wikipedians: Facilitating Semantics Reuse for Wikipedia Authoring. See
[9] This structural basis in Wikipedia is largely untapped.
[10] Citations and references appear to be highly selective (biased) in Wikipedia; nonetheless, those available are useful seeding points for more suitable harvests.
[11] Images have been used as thumbnails and linked references to the articles they are hosted in, but have not been analyzed much for semantics or file names.
[12] There are a variety of efforts underway to use Wikipedia as a multi-language cross-reference based on its 250 language versions; search, for example, on “multiple language” in SWEETpedia. Both named entity and concept matches can be used to correlate in multiple languages. This is greatly aided by inter-language links.
[13] When present, these appear at the bottom of an article and have many related categories; see this one for the semantic Web.
[14] See further and for a discussion of use and guidelines for Wikipedia categories.
[15] For the release notice, see Annex H to the UMBEL Specifications provides a description of the mapping methodologies and results.
[16] Functional categories combine two or more facets in order to split or provide more structured characterization of a category. For example, Category:English cricketers of 1890 to 1918, has as its core concept the idea of a cricketer, a sports person. But, this is also further characterized by nationality and time period. Functional categories tend to have an A x B x C construct, with prepositions denoting the facets. From a proper characterization standpoint, the items in this category should be classified as a Person –> Sports Person –> Cricketer, with additional facets (metadata) of being English and having the period 1890 to 1918 assigned.
[17] See, for example, Massimo Poesio et al., 2008. ELERFED: Final Report, see, wherein they state, “We discovered that in the meantime information about categories in Wikipedia had grown so much and become so unwieldy as to limit its usefulness.” Additional criticisms of the category structure may be found in S. Chernov, T. Iofciu, W. Nejdl and X. Zhou, 2006. “Extracting Semantic Relationships between Wikipedia Categories,” in Proceedings of the 1st International Workshop: SemWiki’06—From Wiki to Semantics., co-located with the 3rd Annual European Semantic Web Conference ESWC’06 in Budva, Montenegro, June 12, 2006; and L Muchnik, R. Itzhack, S. Solomon and Y. Louzon, 2007. “Self-emergence of Knowledge Trees: Extraction of the Wikipedia Hierarchies,” in Physical Review E 76(1). Also, this blog post from Bob Bater at KOnnect, “Wikipedia’s Approach to Categorization,” September 22, 2008, provides useful comments on category issues; see
[18] Olena Medelyan and Cathy Legg, 2008. Integrating Cyc and Wikipedia: Folksonomy Meets Rigorously Defined Common-Sense, in Proceedings of the WIKI-AI: Wikipedia and AI Workshop at the AAAI08 Conference, Chicago, US. See
[19] As two references among many, see A. Halavais and D. Lackaff, 2008. “An Analysis of Topical Coverage of Wikipedia,” in Journal of Computer-Mediated Communication 13 (2): 429–440; and A. Kittur, E. H. Chi and B. Suh, 2009. “What’s in Wikipedia? Mapping Topics and Conflict using Socially Annotated Category Structure,” in Proceedings of the 27th Annual CHI Conference on Human Factors in Computing Systems, pp 4–9.
[20] See, especially DBpedia reference.
[21] See for a listing of known wordnets by language.
[22] For example, see this listing in Wikipedia.
[23] M.K. Bergman, 2008. “When is Content Coherent?,” AI3:::Adaptive Information blog, July 25, 2008; see
[24] For a couple of useful references on this topic, first see this discussion regarding contexts (and the possible relation to Cyc microtheories): Ramanathan V. Guha, Rob McCool, and Richard Fikes, 2004. “Contexts for the Semantic Web,” in Sheila A. McIlraith, Dimitris Plexousakis, and Frank van Harmelen, eds., International Semantic Web Conference, volume 3298 of Lecture Notes in Computer Science, pp. 32-46. Springer, 2004. See For another discussion about local differences and contexts and the difficulty of reliance on “common” understandings, see: Krzysztof Janowicz, 2010. “The Role of Space and Time for Knowledge Organization on the Semantic Web,” in Semantic Web 1: 25–32; see×307213/fulltext.pdf.
[25] OWL already provides the exact predicates; see further M.K. Bergman, 2010. “The Nature of Connectedness on the Web,” AI3:::Adaptive Information blog, November 22, 2010; see and the UMBEL mapping predicates in this vocabulary listing.
[26] UMBEL is a reference of 28,000 concepts (classes and relationships) derived from the Cyc knowledge base. The reference concepts of UMBEL are mapped to Wikipedia, DBpedia ontology classes, GeoNames and PROTON. UMBEL is designed to facilitate the organization, linkage and presentation of heterogeneous datasets and information. It is meant to lower the time, effort and complexity of developing, maintaining and using ontologies, and aligning them to other content. See further the UMBEL Specifications (including Annexes A – H), Vocabulary and RefConcepts.
[27] Cyc is an artificial intelligence project that has assembled a comprehensive ontology and knowledge base of everyday common sense knowledge, with the goal to provide human-like reasoning. The OpenCyc version 3.0 contains nearly 200,000 terms and millions of relationship assertions. Started in 1984, by 2010 an estimated 1000 person years had been invested in its development.
[28] This image and more related to the general question of interoperability in relation to a reference structure is provided in M.K. Bergman, 2007, “Where are the Road Signs for the Structured Web?,” AI3:::Adaptive Information blog, May 29, 2007; see
[29] GeoNames is a geographical database available for free download under a Creative Commons Attribution license. It contains over 10 million geographical names and consists of 7.5 million unique features, of which 2.8 million are populated places. All features are categorized into one out of nine feature classes and further subcategorized into one out of 645 feature codes. Given the importance of locational information, GeoNames is a natural complement to the gold standards mentioned herein. See further its Web site, which also showcases a nifty browser of mappings to Wikipedia.

Posted by AI3's author, Mike Bergman Posted on February 28, 2011 at 12:07 am in Semantic Web, Structured Web, UMBEL | Comments (2)
Posted:February 21, 2011

Hitting the Sweet Spot
Reference Structures Provide a Third Way

Since the first days of the Web there has been an ideal that its content could extend beyond documents and become a global, interoperating storehouse of data. This ideal has become what is known as the “semantic Web”. And within this ideal there has been a tension between two competing world views of how to achieve this vision. At the risk of being simplistic, we can describe these world views as informal v formal, sometimes expressed as “bottom up” v “top down” [1,2].

The informal view emphasizes freeform and diversity, using more open tagging and a bottom-up approach to structuring data [3]. This group is not anarchic, but it does support the idea of open data, open standards and open contributions. This group tends to be oriented to RDF and is (paradoxically) often not very open to non-RDF structured data forms (as, for example, microdata or microformats). Social networks and linked data are quite central to this group. RDFa, tagging, user-generated content and folksonomies are also key emphases and contributions.

The formal view tends to support more strongly the idea of shared vocabularies with more formalized semantics and design. This group uses and contributes to open standards, but is also open to proprietary data and structures. Enterprises and industry groups with standard controlled vocabularies and interchange languages (often XML-based) more typically reside in this group. OWL and rules languages are more typically the basis for this group’s formalisms. The formal view also tends to split further into two camps: one that is more top down and engineering oriented, with typically a more closed world approach to schema and ontology development [4]; and a second that is more adaptive and incremental and relies on an open world approach [5].
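
The difference between the closed world and open world camps can be sketched in a few lines, using the definitions given in note [4]. This is a deliberate simplification and not how any particular reasoner works: the `FACTS` set and both `ask_` functions are hypothetical, and explicit negation (which a real open world system can also express) is left out.

```python
# Hypothetical set of asserted facts as (subject, predicate, object) triples.
FACTS = {("Toronto", "locatedIn", "Canada")}


def ask_closed_world(triple, facts=FACTS):
    """CWA: what is not currently known to be true is treated as false."""
    return triple in facts


def ask_open_world(triple, facts=FACTS):
    """OWA: a missing assertion is simply unknown (None), not false."""
    return True if triple in facts else None


query = ("Toronto", "hasMayor", "SomePerson")
print(ask_closed_world(query))  # False
print(ask_open_world(query))    # None
```

Under the closed world reading the missing fact is denied outright; under the open world reading it stays undecided, which is what makes incremental and incomplete modeling workable.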

Again, at the risk of being simplistic, the informal group tends to view many OWL and structured vocabularies, especially those that are large or complex, as over engineered, constraining or limiting freedom. This group often correctly points to the delays and lack of adoption associated with more formal efforts. The informal group rarely speaks of ontologies, preferring the term vocabularies. In contrast, the formal group tends to view bottom-up efforts as chaotic, poorly structured and too heterogeneous to allow machine reasoning or interoperability. Some in the formal group advocate certification or prescribed training programs for ontologists.

Readers of this blog and customers of Structured Dynamics know that we more often focus on the formal world view, and more specifically on its open world perspective. But, as with human tribes or different cultures, there is no one true or correct way. Peaceful coexistence resides in the understanding of the importance and strength of different world views.

Shared communication is the way in which we, as humans, learn to understand and bridge cultural and tribal differences. These very same bases can be used to bridge the differences of world views for the semantic Web. Shared concepts and a way to communicate them (via a common language) — what I call reference structures [6] — are one potential “sweet spot” for bridging these views of the semantic Web [7].

Referring to Referents as Reference

According to Merriam Webster and Wikipedia, a reference is the intentional use of one thing, a point of reference or reference state, to indicate something else. When reference is intended, what the reference points to is called the referent. References are indicated by sounds (like onomatopoeia), pictures (like roadsigns), text (like bibliographies), indexes (by number) and objects (a wedding ring), but many other methods can be used intentionally as references. In language and libraries, references may include dictionaries, thesauri and encyclopedias. In computer science, references may include pointers, addresses or linked lists. In semantics, reference is generally construed as the relationships between nouns or pronouns and objects that are named by them.

The Building Blocks of Language

Structures, or syntax, enable multiple referents to be combined into more complex and meaningful (interpretable) systems. Vocabularies refer to the set of tokens or words available to act as referents in these structures. Controlled vocabularies attempt to limit and precisely define these tokens as a means of reducing ambiguity and error. Larger vocabularies increase richness and nuance of meaning for the tokens. Combined, syntax, grammar and vocabularies are the building blocks for constructing understandable human languages.

Many researchers believe that language is an inherent capability of humans, especially including children. Language acquisition is expressly understood to be the combined acquisition of syntax, vocabulary and phonetics (for spoken language). Language development occurs via use and repetition, in a social setting where errors are corrected and communication is a constant. Via communication and interaction we learn and discover nuance and differences, and acquire more complex understandings of syntax structures and vocabulary. The contact sport of communication is itself a prime source for acquiring the ability to communicate. Without the structure (syntax) and vocabulary acquired through this process, our language utterances are mere babblings.

Pidgin languages emerge when two parties try to communicate, but do not share the same language. Pidgin languages result in much simplified vocabularies and structure, which lead to frequent miscommunication. Small vocabularies and limited structure share many of these same limitations.

Communicating in an Evanescent Environment

Information theory, going back to Shannon, holds that the “fundamental problem of communication is that of reproducing at one point, either exactly or approximately, a message selected at another point” [8]. This assertion applies to all forms of communication, from the electronic to animal and human language and speech.

Every living language is undergoing constant growth and change. Current events and culture are one driver of new vocabulary and constructs. We all know the apocryphal observation that northern peoples have many more words for snow, for example. Jargon emerges because specific activities, professions, groups, or events (including technical change) often have their own ideas to communicate. Slang is local or cultural usage that provides context and communication, often outside of “formal” or accepted vocabularies. These sources of environmental and other changes cause living languages to be constantly changing in terms of vocabulary and (also, sometimes) structure.

Natural languages become rich in meaning and names for entities to describe and discern things, from plants to people. When richness is embedded in structure, contexts can emerge that greatly aid removing ambiguity (“disambiguating”). Contexts enable us to discern polysemous concepts (such as bank for river, money institution or pool shot) or similarly named entities (such as whether Jimmy Johnson is a race car driver, football coach, or a local plumber). As with vocabulary growth, contexts sometimes change in meaning and interpretation over time. It is likely the Gay ’90s would not be used again to describe a cultural decade (1890s) in American history.
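
A toy sketch can show how context helps discern among the senses of bank. It assumes each sense comes with a small, hand-picked set of cue words and picks the sense whose cues overlap most with the surrounding text. The `SENSES` table and the `disambiguate` function are hypothetical; real disambiguation draws on far richer structure than word overlap.

```python
# Hypothetical cue words for three senses of the polysemous term "bank".
SENSES = {
    "bank (river)": {"river", "water", "shore", "fishing"},
    "bank (financial)": {"money", "loan", "deposit", "account"},
    "bank (pool shot)": {"pool", "cue", "cushion", "shot"},
}


def disambiguate(context_words, senses=SENSES):
    """Pick the sense whose cue words overlap most with the context."""
    context = {word.lower() for word in context_words}
    scores = {sense: len(cues & context) for sense, cues in senses.items()}
    best = max(scores, key=scores.get)
    # With no overlapping cue words there is no basis for choosing a sense.
    return best if scores[best] > 0 else None


sentence = "She opened an account and made a deposit at the bank"
print(disambiguate(sentence.split()))  # bank (financial)
```

With no contextual cues at all, the function returns `None`: without context the term stays ambiguous.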

All this affirms what all of us know about human languages:  they are dynamic and changing. Adaptable (living) languages require an openness to changing vocabulary and changing structure. The most dynamic languages also tend to be the most open to the coining of new terminology; English, for example, is estimated to have 25,000 new words coined each year [9].

The Semantic Web as a Human Language

One could argue that similar constructs must be present within the semantic Web to enable either machine or human understanding. At first blush this may sound a bit surprising:  Isn’t one premise of the semantic Web machine-to-machine communications with “artificial intelligence” acting on our behalf in the background? Well, hmmm, OK, let’s probe that thought.

Recall there are different visions about what constitutes the semantic Web. In the most machine-oriented version, the machines are posited to replace some of what we already do and anticipate what we already want. Like Watson on Jeopardy, machines still need to know that Toronto is not an American city [10]. So, even with its most extreme interpretation — and one that is more extreme than my own view of the near-term semantic Web — machine-based communication still has these imperatives:

  • Humans, too, interact with data and need to understand it
  • Much of the data to be understood and managed is based on human text (unstructured), and needs to be adequately captured and represented
  • There is no basis to think that machine languages can be any simpler in representing the world than human languages.

These points suggest that machine languages, even in the most extreme machine-to-machine sense, still need to have a considerable capability akin to human languages.  Of course, computer programming languages and data exchange languages as artificial languages need not read like a novel. In fact, most artificial languages have more constraints and structure limitations than human languages. They need to be read by machines with fixed instruction sets (that is, they tend to have fewer exceptions and heuristics).

But, even with software or data, people write and interact with these languages, and human readability is a key desirable aspect for modern artificial languages [11]. Further, there are some parts of software or data that also get expressed as labels in user interfaces or for other human factors. The admonition to Web page developers to “view source” is a frequent one. Any communication that is text based — as are all HTTP communications on the Web, including the semantic Web — has this readability component.

Though the form (structure) and vocabulary (tokens) of languages geared to machine use and understanding most certainly differ from those used by humans, that does not mean that the imperatives for reference and structure are excused. It seems evident that small vocabularies, differing vocabularies and small and incompatible structures have the same limiting effect on communications within the semantic Web as they do for human languages.

Yet, that being said, correcting today’s relative absence of reference and structure on the nascent semantic Web should not then mean an overreaction to a solution based on a single global structure. This is a false choice and a false dichotomy, belied by the continued diversity of human languages [12]. In fact, the best analog for an effective semantic Web might be human languages with their vocabularies, references and structures. Here is where we may find the clues for how we might improve the communications (interoperability) of the semantic Web.

A Call for Vehement Moderation

Freeform tagging and informal approaches are quick and adaptive. But, they lack context, coherence and a basis for interoperability. Highly engineered ontologies capture nuance and sophistication. But, they are difficult and expensive to create, lack adoption and can prove brittle. Neither of these polar opposites is “correct” and each has its uses and importance. Strident advocacy of either extreme alone is shortsighted and unsuited to today’s realities. There is not an ineluctable choice between freedom and formalism.

An inherently open and changing world with massive growth of information volumes demands a third way. Reference structures and vocabularies sufficient to guide (but not constrain) coherent communications are needed. Structure and vocabulary in an open and adaptable language can provide the communication medium. Depending on task, this language can be informal (RDF or data struct forms convertible to RDF) or formal (OWL). The connecting glue is provided by the reference vocabularies and structures that bound that adaptable language. This is the missing “sweet spot” for the semantic Web.

Just like human languages, these reference structures must be adaptable ones that can accommodate new learning, new ideas and new terminology. Yet, they must also have sufficient internal consistency and structure to enable their role as referents. And, they need to have a richness of vocabulary (with defined references) sufficient to capture the domain at hand. Otherwise, we end up with pidgin communications.

We can thus see a pattern emerging where informal approaches are used for tagging and simple datasets; more formal approaches are used for bounded domains and the need for precise semantics; and reference structures are used when we want to get multiple, disparate sources to communicate and interoperate. So long as these reference structures are coherent and designed for vocabulary expansion and accommodation for synonyms and other means for terminology mapping, they can adapt to changing knowledge and demands.

For too long there has been a misunderstanding and mischaracterization of anything that smacks of structure and referenceability as an attempt to limit diversity, impose control, or suggest some form of “One Ring to rule them all” organization of the semantic Web. Maybe that was true of other suggestions in the past, but it is far from the enabling role of reference structures advocated herein. This reaction to structure has something of the feeling of school children averse to their writing lessons taking over the classroom and then saying No! to more lessons. Rather than Lord of the Rings we get Lord of the Flies.

To try to overcome this misunderstanding, and to embrace the idea of language and communication for the semantic Web, I and others have tried in the past to find various analogies or imagery to describe the roles of these reference structures. (Again, all of those vagaries of human language and communication!) Analogies for these reference structures have included [13]:

  • backbones, to signal their importance as dependable structures upon which we can put “meat on the bones”
  • scaffoldings, to emphasize their openness and infrastructural role
  • roadmaps, as orienting and navigational frameworks for information
  • docking ports, as connection points for diverse datasets on the Web
  • forest paths, to signal common traversals but with much to discover once we step off the paths
  • infoclines, to represent the information interface between different world views,
  • and others.

What this post has argued is the analogy of reference structures to human language and communication. In this role, reference structures should be seen as facilitating and enabling. This is hardly a vision of constraints and control. The ability to articulate positions and ideas in fact leads to more diversity and freedom, not less.

To be sure, there is extra work in using and applying reference structures. Every child comes to know there is work in learning languages and becoming articulate in them. But, as adults, we also come to learn from experience the frustration that individuals with speech or learning impairments have when trying to communicate. Knowing these things, why do we not see the same imperatives for the semantic Web? We can only get beyond incoherent babblings by making the commitment to learn and master rich languages grounded in appropriate reference structures. We are not compelled to be inchoate; nor are our machines.

Yet, because of this extra work, it is also important that we develop and put in place semi-automatic [14] ways to tag and provide linkages to such reference structures. We have the tools and information extraction techniques available that will allow us to reference and add structure to our content in quick and easy ways. Now is the time to get on with it, and stop babbling about how structure and reference vocabularies may limit our freedoms.
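
The semi-automatic idea described in note [14] can be sketched as a simple triage step: automated scoring settles the clear cases, and only the uncertain middle is passed to a person. The scores, the two thresholds and the `triage` function below are all assumptions made for illustration; they do not describe any particular tagging tool.

```python
def triage(candidates, accept_at=0.9, reject_below=0.3):
    """Split scored tag candidates into auto-accepted and human-review lists.

    `candidates` is an iterable of (tag, score) pairs with scores in [0, 1].
    Both thresholds are hypothetical and would need tuning in practice.
    """
    accepted, needs_review = [], []
    for tag, score in candidates:
        if score >= accept_at:
            accepted.append(tag)      # confident enough to apply automatically
        elif score >= reject_below:
            needs_review.append(tag)  # uncertain: a person makes the final call
        # anything below reject_below is dropped without review
    return accepted, needs_review


scored = [("ref:Cat", 0.95), ("ref:Bank", 0.6), ("ref:Noise", 0.1)]
print(triage(scored))  # (['ref:Cat'], ['ref:Bank'])
```

As algorithms improve, the band between the two thresholds can narrow, so that less is left for human arbitration.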

This is the first of a two-part series on the importance and choice of reference structures (Part I) and gold standards (Part II) on the semantic Web.

[1] This is reflected well in a presentation from the NSF Workshop on DB & IS Research for Semantic Web and Enterprises, April 3, 2002, entitled “The Emergent, Semantic Web: Top Down Design or Bottom Up Consensus?”. This report defines top down as design and committee-driven; bottom up is more decentralized and based on social processes. Also, see Ralf Klischewski, 2003. “Top Down or Bottom Up? How to Establish a Common Ground for Semantic Interoperability within e-Government Communities,” pp. 17-26, in R. Traunmüller and M Palmirani, eds., E-Government: Modelling Norms and Concepts as Key Issues: Proceedings of 1st International Workshop on E-Government at ICAIL 2003, Bologna, Italy. Also, see David Weinberger, 2006. “The Case for Two Semantic Webs,” KM World, May 26, 2006; see
[2] For a discussion about formalisms and the nature of the Web, see this early report by F.M. Shipman III and C.C. Marshall, 1994. “Formality Considered Harmful: Experiences, Emerging Themes, and Directions,” Xerox PARC Technical Report ISTL-CSA-94-08-02, 1994; see
[3] Others have posited contrasting styles, most often as “top down” v. “bottom up.” However, in one interpretation of that distinction, “top down” means a layer on top of the existing Web; see further, A. Iskold, 2007. “Top Down: A New Approach to the Semantic Web,” in ReadWrite Web, Sept. 20, 2007. The problem with this terminology is that it offers a completely different sense of “top down” from traditional uses. In Iskold’s argument, his “top down” is a layering on top of the existing Web. On the other hand, “top down” is more often understood in the sense of a “comprehensive, engineered” view, consistent with [1].
[4] See M. K. Bergman, 2009. The Open World Assumption: Elephant in the Room, December 21, 2009. The open world assumption (OWA) generally asserts that the lack of a given assertion or fact being available does not imply whether that possible assertion is true or false: it simply is not known. In other words, lack of knowledge does not imply falsity. Another way to say it is that everything is permitted until it is prohibited. OWA lends itself to incremental and incomplete approaches to various modeling problems.
The closed world assumption (CWA) is a key underpinning to most standard relational data systems and enterprise schema and logics. CWA is the logic assumption that what is not currently known to be true, is false. For semantics-related projects there is a corollary problem to the use of CWA which is the need for upfront agreement on what all predicates “mean”, which is difficult if not impossible in reality when different perspectives are the explicit purpose for the integration.
[5] See M.K. Bergman, 2010. “Two Contrasting Styles for the Semantic Enterprise,” AI3:::Adaptive Information blog post, February 15, 2010. See
[6] I first used the term in passing in M.K. Bergman, 2007. “An Intrepid Guide to Ontologies,” AI3:::Adaptive Information blog post, May 16, 2007. See, then more fully elaborated the idea in “Where are the Road Signs for the Structured Web,” AI3:::Adaptive Information blog post, May 29, 2007. See
[7] See Catherine C. Marshall and Frank M. Shipman, 2003. “Which Semantic Web?,” in Proceedings of ACM Hypertext 2003, pp. 57-66, August 26-30, 2003, Nottingham, United Kingdom;, for a very different (but still accurate and useful) way to characterize the “visions” for the semantic Web. In this early paper, the authors posit three competing visions: 1) the development of standards, akin to libraries, to bring order to digital documents; this is the vision they ascribe to the W3C and has been largely adopted via use of URIs as identifiers, and languages such as RDF and OWL; 2) a vision of a globally distributed knowledge base (which they characterize as being Tim Berners-Lee’s original vision), with examples being Cyc or Apple’s (now disbanded) Knowledge Navigator; and 3) a vision of an infrastructure for the coordinated sharing of data and knowledge.
[8] See Claude E. Shannon’s classic paper “A Mathematical Theory of Communication” in the Bell System Technical Journal in July and October 1948.
[9] This reference is from the Wikipedia entry on the English language: Kister, Ken. “Dictionaries defined.” Library Journal, 6/15/92, Vol. 117 Issue 11, p 43.
[10] See, or simply do a Web search on “watson toronto jeopardy” (no quotes).
[11] Readability is important because programmers spend the majority of their time reading, trying to understand and modifying existing source code, rather than writing new source code. Unreadable code often leads to bugs, inefficiencies, and duplicated code. It has been known for at least three decades that a few simple readability transformations can make code shorter and drastically reduce the time to understand it. See James L. Elshoff and Michael Marcotty, 1982. “Improving Computer Program Readability to Aid Modification,” Communications of the ACM, v.25 n.8, p. 512-521, Aug 1982; see From the Wikipedia entry on Readability.
[12] According to the Wikipedia entry on Language, there are an estimated 3000 to 6000 active human languages presently in existence.
[13] The forest path analogy comes from Atanas Kiryakov of Ontotext. The remaining analogies come from M.K. Bergman in his AI3:::Adaptive Information blog: “There’s Not Yet Enough Backbone,” May 1, 2007 (backbone); “The Role of UMBEL: Stuck in the Middle with you …,” May 11, 2008 (infocline, scaffolding and docking port); “Structure Paves the Way to the Semantic Web,” May 3, 2007 (roadmap).
[14] Semi-automatic methods attempt to apply as much automated screening and algorithmic- or rules-based scoring as possible, and then allow the final choices to be arbitrated by humans. Fully automated systems, particularly involving natural language processing, are not yet acceptable because of (small, but) unacceptably high error rates in precision. The best semi-automated approaches handle all tasks that are rote or error-free, and then limit the final choices to those areas where unacceptable errors are still prevalent. As time goes on, more of these areas can be automated as algorithms, heuristics and methodologies improve. Eventually, of course, this may lead to fully automated approaches.

Posted by AI3's author, Mike Bergman Posted on February 21, 2011 at 2:27 am in Semantic Web, Structured Web, UMBEL | Comments (3)