Posted: July 18, 2011

A Decade of Remarkable Advances in Ten Grand IT Challenges

I’ve been in the information theory and technology game for quite some time, but believe nothing has matched the pace of advances of the past ten years. As one example, it was a mere eight years ago that I was sitting in a room with language translation vendors contemplating automated translation techniques for US intelligence agencies. The prospects finally looked doable, but the success of large-scale translation was not assured.

At about that same time, and the years until just recently, a whole slew of Grand Challenges [1] in computing hung out there: tantalizing yet not proven. These areas ranged from information extraction and natural language understanding to speech recognition and automated reasoning.

But things have been changing fast, and with a subtle steadiness that has caused the change to go largely unremarked. Sure, all of us have been aware of the huge changes on the Web and search engine ubiquity and social networking. But some of the fundamentally hard problems in computing have also gone through some remarkable (but largely unremarked) advances.

We now have smart phones that speak instructions to us while we instruct them by voice in turn. Virtually all information conceivable is now indexed and made available through the Web; structure is now rapidly characterizing that information, making it even more useful to discover and organize. We can translate documents online with acceptable accuracy into more than 60 languages [2]. We can get directions to or see satellite views of virtually any place on earth. We have in fact become accustomed to new technology magic on a nearly daily basis, so much so that the pace of these advances seems to be a constant, blunting our perspective of just how rapidly they have been progressing.

These advances are perhaps not the realization of artificial intelligence as articulated in the 1950s to 1980s, but are contributing to a machine-based ability to do tasks useful to humans heretofore impossible and at scales unimaginable. As Google and IBM’s Watson are showing, statistics (among other techniques) applied to massive knowledge bases or text corpora are breaking down all of the Grand Challenges of symbolic computing. The image that is emerging is less one of intelligent machines working autonomously than it is of computers working interactively or semi-automatically with humans to address previously unsolvable problems.

A decade-long perspective also takes us back to the seminal paper on the semantic Web by Berners-Lee, Hendler and Lassila from May 2001 [3]. Yet, while this semantic Web vision has been a contributor to the success of the Grand Challenge advances of the past ten years, I think we can also say that it has not been the key or even a primary driver. That day may yet come. Rather, I think we have to look to natural language and statistics applied to large-scale corpora as the more telling drivers.

Ten Grand Challenge Advances

Over the past ten years there have been significant advances on at least ten Grand Challenges in symbolic computation. As the concluding section notes, these advances can be traced for the most part to broader advances in natural language processing, the logical and semiotic bases for interoperability, and standards (nominally in the semantic Web) for embracing them. Here are these ten areas of advance, all achieved over the past ten years:

#1 Information Extraction

Information extraction (IE) uses various forms of natural language processing (NLP) to identify structured information within unstructured or semi-structured documents. These documents are presented in machine-readable form (including straight text, various document formats or HTML) with the various types of information “tagged” or prompted for inclusion. Information types that can be extracted with one of the various techniques include entities, relations, topics, categories, and so forth. Once tagged or extracted, the information in the documents can now be included and linked to standard structured information (as might come from conventional databases) or to structure in other documents.

Most recently, a large number of online services and open source systems have also become available with strengths in one or more of these extraction types [4]. Some current examples include Yahoo! Term Extraction, OpenCalais, BeliefNetworks, OpenAmplify, Alchemy API, Evri, Extractiv, Illinois Tagger, and about 80 others [4].
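To make the idea concrete, the following is a minimal, hypothetical Python sketch of dictionary (gazetteer) plus pattern-based extraction. The gazetteer entries, patterns and sample text are invented for illustration; real services such as those listed above use far richer NLP pipelines and statistical models.

```python
import re

# Hypothetical gazetteer of known entities (type -> names); real systems use
# large dictionaries, trained models and full NLP pipelines.
GAZETTEER = {
    "ORGANIZATION": ["Google", "IBM", "Yahoo!"],
    "PERSON": ["Tim Berners-Lee", "Charles Sanders Peirce"],
}

# Simple surface patterns for other information types (here, four-digit years).
PATTERNS = {"DATE": re.compile(r"\b(?:19|20)\d{2}\b")}

def extract(text):
    """Return a list of (entity_text, entity_type) pairs found in the text."""
    found = []
    for etype, names in GAZETTEER.items():
        for name in names:
            if name in text:
                found.append((name, etype))
    for etype, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.group(0), etype))
    return found

sample = "In 2001 Tim Berners-Lee described the semantic Web; Google and IBM now apply it at scale."
print(extract(sample))
# e.g., [('Google', 'ORGANIZATION'), ('IBM', 'ORGANIZATION'),
#        ('Tim Berners-Lee', 'PERSON'), ('2001', 'DATE')]
```

Once extracted, such tags are exactly the kind of structure that can then be linked to conventional databases or to structure in other documents.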

#2 Machine Translation

Machine translation is the automatic translation of machine-readable text from one human language to another. Accurate and acceptable machine translation requires applying different types of knowledge including grammar, semantics, facts about the real world, etc. Various approaches have been developed and refined over time.

Especially helpful has been the availability of huge corpora in multiple languages to which large-scale statistical analysis may be applied (as is the case of Google’s machine translation) or human editing and refinement (as is the case with the more than 280 language versions of Wikipedia).
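As a highly simplified illustration of the statistical idea, here is a toy Python sketch of phrase-based lookup. The phrase table is invented, standing in for probabilities that a real system would estimate from huge parallel corpora, and the greedy matching below ignores the full decoding and language-model scoring that production systems perform.

```python
# Toy phrase table with translation probabilities, as if estimated from a large
# parallel corpus (all entries are invented for illustration).
PHRASE_TABLE = {
    "the house": [("la casa", 0.9), ("la vivienda", 0.1)],
    "is red":    [("es roja", 0.8), ("es rojo", 0.2)],
    "the car":   [("el coche", 0.7), ("el carro", 0.3)],
}

def translate(sentence):
    """Greedily match the longest known phrase and pick its likeliest translation."""
    words, out, i = sentence.split(), [], 0
    while i < len(words):
        for j in range(len(words), i, -1):        # try the longest phrase first
            phrase = " ".join(words[i:j])
            if phrase in PHRASE_TABLE:
                best = max(PHRASE_TABLE[phrase], key=lambda t: t[1])
                out.append(best[0])
                i = j
                break
        else:                                     # unknown word: pass it through
            out.append(words[i])
            i += 1
    return " ".join(out)

print(translate("the house is red"))   # -> "la casa es roja"
```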

While it is true none of these systems have 100% accuracy (even human translators show much variation), the more advanced ones are truly impressive with remaining ambiguities flagged for resolution by semi-automatic means.

#3 Sentiment Analysis

Though sentiment analysis is strictly speaking a subset of information extraction, it has the more demanding and useful task of extracting subjective information, often across a group of documents or texts. Sentiment analysis can be applied to online reviews to determine the “polarity” about specific objects, and it is especially useful for identifying public opinion trends or evaluating social media for ranking, polling or marketing purposes.

Because of its greater difficulty and potential high value, many of the leading sentiment analysis capabilities remain proprietary. Some capable open source versions are nonetheless available. There is also an interesting online application using Twitter feeds.
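As a bare-bones illustration of lexicon-based polarity scoring, one of the simpler techniques in this space, here is a hypothetical Python sketch. The word lists are invented; real systems use large weighted lexicons or trained classifiers and must handle negation, intensity, sarcasm and domain-specific vocabulary.

```python
# Tiny, invented sentiment lexicons; production systems use far larger,
# weighted lexicons or statistical models.
POSITIVE = {"good", "great", "impressive", "love", "excellent"}
NEGATIVE = {"bad", "poor", "disappointing", "hate", "broken"}

def polarity(text):
    """Return a score in [-1, 1]: positive minus negative words, as a fraction."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

reviews = ["The camera is excellent and the battery life is impressive.",
           "Disappointing screen, poor build quality."]
for review in reviews:
    print(round(polarity(review), 2), review)   # 1.0 ..., then -1.0 ...
```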

#4 Disambiguation

Many words have more than one meaning. Word sense disambiguation uses machine learning, dictionaries (gazetteers) of known entities and concepts, ontologies or linguistic databases such as WordNet, or combinations thereof, to evaluate ambiguous terms or phrases and resolve them based on context. Some systems need to be “trained,” some work automatically, and others rely on evaluation and prompting (semi-automatic) to complete the disambiguation process.

State-of-the-art systems have greater than 90% precision [5]. Most of the leading open source NLP toolkits have quite capable disambiguation modules, and even better proprietary systems exist.
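A classic, if simplistic, context-overlap approach in the spirit of the Lesk algorithm can be sketched in a few lines of Python. The sense glosses below are invented stand-ins for dictionary or WordNet definitions; state-of-the-art systems rely on much richer features and training data.

```python
# Invented glosses for two senses of "bank"; real systems draw on WordNet,
# ontologies or trained models rather than these toy definitions.
SENSES = {
    "bank/financial": "an institution that accepts deposits and lends money",
    "bank/river": "the sloping land alongside a river or stream of water",
}

def disambiguate(context):
    """Pick the sense whose gloss shares the most words with the context."""
    ctx = set(context.lower().split())
    return max(SENSES, key=lambda sense: len(ctx & set(SENSES[sense].split())))

print(disambiguate("she sat on the bank of the river watching the water"))
# -> "bank/river"
print(disambiguate("the bank approved the loan and deposits grew"))
# -> "bank/financial"
```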

#5 Speech Synthesis and Recognition

Speech synthesis is the conversion of text to spoken speech and has been around for quite some time. Speech recognition is a far more difficult task, in that a given sound clip or the real-time speech of a person must be converted to a textual representation, which can then be acted upon, for example to navigate or make selections. Speech recognition is made difficult because of individual voice differences, the variations of human languages and speech patterns, and the need to segment speech into a sequence of words. (In most spoken languages, the sounds representing successive letters blend into each other, so the conversion of the modulated wave form to discrete characters or tokens can be a very difficult process.)

Crude systems of a decade ago required extensive training with a specific speaker’s voice to show much effectiveness. Today, the range of these systems and the ability to use them without training have markedly improved.

Until recently, improvements largely were driven by military and intelligence requirements. Today, however, with the ubiquity of smart phones and speech interfaces, the consumer market is greatly accelerating progress.

#6 Image Recognition

Image recognition is the ability to determine whether or not an electronic image contains some specific object, feature, or activity, and then to extract the image data associated with it. Today, under specific circumstances and for specific tasks, this can be done by computer. However, for the general case of arbitrary objects in arbitrary situations this challenge has not yet been fully met. The systems of today work best for simple geometric objects (e.g., polyhedra), human faces, printed or hand-written characters, or vehicles, and in specific situations, typically described in terms of well-defined illumination, background, and orientation of the object relative to the camera.

Auto license recognition at intersections, face recognition by security cameras, and greatly expanded and improved character recognition systems (machine vision) represent some of the current state-of-the-art. Again, smart phone apps are helping to drive advances.

#7 Interoperability Standards and Methods


Rapid Progress in Climbing the Data Federation Pyramid

Most of the previous advances relate to extracting structured information, or to mapping or deriving additional structure. Once obtained, of course, the next challenge is how to relate that information together; that is, how to make it interoperate.

We have been steadily climbing a data federation pyramid [6] — and at an impressively accelerating rate since the adoption of the Internet and Web. These network innovations gave us a common basis and protocols for connecting distributed devices. That, in turn, has freed us to concentrate on the standards for data representation and interoperability.

XML first provided a means for a common data serialization that encouraged various communities and industries to devise exchange vocabularies. RDF provided a means for a common data model, one that was both simple and extensible at the same time [7]. OWL built upon that basis to enable us to build common domain models (see next).

There are alternatives to the semantic Web standards of RDF and OWL, such as common logic, and there are many data exchange formats that compete with XML. None of these standards is essential on its own, and all have their communities and advocates. However, because they are standards and share common network bases, it has been relatively easy to convert amongst the various available protocols. We are nearly at a global level where everything is connected, machine-readable, and in structured form.
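As a small, hedged illustration of a common data model with interchangeable serializations, the Python sketch below uses the open source rdflib toolkit (my choice for illustration; any RDF library would do). The same two triples, using an invented example namespace, can be emitted as RDF/XML, Turtle or N-Triples without loss.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

# Invented example namespace; FOAF is the well-known "friend of a friend" vocabulary.
EX = Namespace("http://example.org/")
FOAF = Namespace("http://xmlns.com/foaf/0.1/")

g = Graph()                       # RDF's common data model: a set of triples
g.bind("ex", EX)
g.bind("foaf", FOAF)
person = EX["peirce"]
g.add((person, RDF.type, FOAF.Person))
g.add((person, FOAF.name, Literal("Charles Sanders Peirce")))

# One data model, several interchangeable wire formats.
for fmt in ("xml", "turtle", "nt"):
    print(f"--- {fmt} ---")
    print(g.serialize(format=fmt))   # returns a string in rdflib 6+
```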

#8 Common Domain Models

Semantics in machine-readable form means that we can more confidently link and combine available information. We are seeing a veritable explosion of domain models to represent various domains and viewpoints in consensual, interoperable form. What this means is that we are now gaining the computing vocabularies and grammars — along with shared community models (world views) — to get this stuff to work together.

Five years ago we called this phenomenon mashups, but no one uses that term any longer because these information brewpots are everywhere, including in our very hands when we interact with the apps on our smart phones. This glue of domain models is generally as invisible to us as is the glue in laminates or the resin in plastics. But these models are nonetheless the strength and foundation that enable much of the computing magic unfolding around us.

#9 Virtual Apps (Cloud Computing)

Once the network shattered the tyranny of physical separation between data and machine, the rationale for keeping the data with the app, or even the user with the app, disappeared. Cloud computing may seem mysterious or sound to have some high-octave hum, but it really is nothing more than saying that the Web enables us to treat all of our computing resources as virtual. Data can be anywhere; machines and hard drives can be anywhere; and applications can be anywhere.

And, virtualness brings benefits in and of itself. Whole computing environments can be installed or removed nearly instantaneously. Peak computing demands can be met with virtual headroom. Backup, rollover and redundancy practices and strategies can change. Web services mean tailored capabilities can be invoked from anywhere and integrated for local needs. Massive computing resources and server farms are now as accessible to the individual as they once were only to the computing behemoths. Combined with continued advances in underlying computing hardware and chips, the computing power available to any user is rising exponentially. There is now even more power in the power curve.

#10 Big Data

One hears stories of Google or the National Security Agency having access to, and managing, servers numbering in the hundreds of thousands. Entirely new operating systems and computing environments — many with roots in open source — such as virtualized operating systems and MapReduce frameworks like Hadoop, have been developed to deal with the current era of “big data”.

MapReduce is a framework for processing huge datasets using a large number of servers. The “map” step partitions the problem into tractable sub-problems, organized in a tree structure. The “reduce” step then takes the answers to all the sub-problems and combines them to produce the final output.
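As a minimal sketch of those two steps, in plain Python rather than a distributed framework like Hadoop, here is the standard word-count illustration. The grouping (“shuffle”) that a real cluster performs across many machines is simulated in memory, so this shows the programming model only, not the scale.

```python
from collections import defaultdict

def map_step(document):
    """Emit (key, value) pairs; here, one ('word', 1) pair per word."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_step(key, values):
    """Combine all values for one key; here, sum the counts."""
    return key, sum(values)

documents = ["big data needs big infrastructure",
             "map and reduce tame big data"]

# "Shuffle": group mapped pairs by key (done across the network on a real cluster).
groups = defaultdict(list)
for doc in documents:
    for key, value in map_step(doc):
        groups[key].append(value)

results = dict(reduce_step(k, v) for k, v in groups.items())
print(results)   # e.g., {'big': 3, 'data': 2, 'needs': 1, ...}
```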

Such techniques enable analysis of datasets of a size impossible before. This has enabled the development of statistics and analytical techniques that have been able to make correlations and find patterns for some of the Grand Challenge tasks noted before that simply could not be addressed within previous limits. The “big data” approach is providing a brute force alternative to previously intractable problems.

Why Such Progress?

Declining hardware costs and increasing performance (such as from Moore’s Law), combined with the adoption of the Internet + Web network, set the fertile conditions for these unprecedented advances in computing’s Grand Challenges. But the adaptive radiation in innovations now occurring has its own dynamics. In computing terms, we are seeing the equivalent of the Cambrian explosion in evolutionary history.

The dynamics driving this computing explosion are based largely, I believe, on the statistics of information retrieval and extraction needed to cope with the scale of documents on the Web. That, in turn, has impelled innovations in big data and distributed architectures and designs that have pried open previously closed computing lockboxes. As data from everywhere and from every provenance pours into the system, means for handling and interoperating with it have become imperatives. These forces, in turn, have been channeled and are being met through the open and standards-based approaches that helped lead to the development of the Internet and its infrastructure in the first place.

These powerful evolutionary forces in computing are clearly evident in the ten Grand Challenge advances above. But the challenges above are also silent on another factor, one underpinning the interoperability initiatives, that is only now becoming evident and exerting its own powerful force: the workable, intellectual foundations for interoperability itself.

Clearly, as the advances in the Grand Challenges show, we are seeing immense exposures of new structured information and impressive means for accessing and managing it on a global, distributed scale. Yet all of this data and this structure raises the question of how to get the information to work together. Further, the sources and viewpoints and methods by which all of this data has been created also put a huge premium on means to deal with the diversity. Though not evident, and perhaps not even known to many of the innovators and practitioners, there has been a growing intellectual force shaping our foundational views about the nature of things and their representations. This force has been, I believe, one of those root-cause drivers helping to show the way to interoperability.

John Sowa, despite his unending criticism of the semantic Web in favor of common logic, has nonetheless been a very positive evangelist for the 19th-century American logician and philosopher Charles Sanders Peirce. Sowa points out that the entire 20th century largely neglected Peirce’s significant contributions in many areas, and that some philosophers appropriated Peircean insights without proper attribution [8]. Indeed, Peirce has only come to wider attention within the past decade or so. Much of his voluminous lifetime writing has still not been committed to publication.

Among many notable contributions, Peirce was passionate about signs and their triadic representations, in a field known as semiotics. The philosophical and logical basis of his triangle of signs deserves your attention, though it cannot be adequately treated here [9]. However, as summarized by Sowa [8], “A semiotic view of language and logic gets to the heart of the philosophical controversies and their practical implications for linguistics, artificial intelligence, and related subjects.”

In essence, Peirce’s triadic logic of semiotics helps clarify philosophical questions about things, how they are perceived and how they are named, questions that have vexed philosophers at least since the time of Aristotle. What Peirce put forward was a testable logic for how things and the names of things can be understood and related to one another via logical statements or structures. These, in turn, can be symbolized and formalized into logical constructs that can capture the structure of natural language as well as more structured data.

The clarity of Peirce’s logic of signs is an underlying factor, I believe, for why we are finally seeing our way clear to capturing, representing and relating information from a diversity of sources and viewpoints in ways that are defensible and interoperable [10]. As we plumb Peircean logics further, I believe we will continue to gain additional insights and methods for combining and relating information. The next phase of our advances on these Grand Challenges is likely to be fueled more by connections and interoperability than by basic extraction or representation.

The Widening Explosion

We are not seeing the vision of artificial intelligence unfold as posed three decades ago. Nor are we seeing the AI-complete type of problems being solved in their entirety [11]. Rather, we are seeing impressive but incomplete approaches. Full automation and autonomy are not yet at hand, and may be so far in the future as to never be. But we are nevertheless seeing advances across the board in all Grand Challenge areas.

What is emerging is a practical achievement of the Grand Challenges, the scale and scope of which is unprecedented in symbolic computing. As we see Peircean logic continue to take hold and interoperability grow in usefulness and stature, I think it fair to say we can look back in ten years to describe where we stand today as having been in the midst of an evolutionary explosion.


[1] Grand Challenges were United States policy objectives for high-performance computing and communications research set in the late 1980s. According to “A Research and Development Strategy for High Performance Computing”, Executive Office of the President, Office of Science and Technology Policy, 29 pp., November 20, 1987, “A grand challenge is a fundamental problem in science or engineering, with broad applications, whose solution would be enabled by the application of high performance computing resources that could become available in the near future.”
[2] For example, as of July 17, 2011, Google offered 63 different source or target languages for translation.
[3] Tim Berners-Lee, James Hendler and Ora Lassila, 2001. “The Semantic Web”. Scientific American Magazine; see http://www.scientificamerican.com/article.cfm?id=the-semantic-web.
[4] Go to Sweet Tools, and enter the search ‘information extraction’ to see a list of about 85 tools.
[5] See, for example, Roberto Navigli, 2009. “Word Sense Disambiguation: A Survey,” ACM Computing Surveys, 41(2), 2009, pp. 1–69. See http://www.dsi.uniroma1.it/~navigli/pubs/ACM_Survey_2009_Navigli.pdf.
[6] M.K. Bergman, 2006. “Climbing the Data Federation Pyramid,” AI3:::Adaptive Information blog, May 25, 2006; see https://www.mkbergman.com/229/climbing-the-data-federation-pyramid/.
[7] M. K. Bergman, 2009. “Advantages and Myths of RDF,” AI3:::Adaptive Information blog, April 8, 2009. See https://www.mkbergman.com/483/advantages-and-myths-of-rdf/
[8] John Sowa, 2006. “Peirce’s Contributions to the 21st Century”, in H. Schärfe, P. Hitzler, & P. Øhrstrøm, eds., Conceptual Structures: Inspiration and Application, LNAI 4068, Springer, Berlin, 2006, pp. 54-69. See http://www.jfsowa.com/pubs/csp21st.pdf.
[9] See, as a start, the Wikipedia article on Charles Sanders Peirce (pronounced “purse”), as well as the Arisbe collection of his assembled papers (to date). Also see John Sowa, 2010. “The Role of Logic and Ontology in Language and Reasoning,” from Chapter 11 of Theory and Applications of Ontology: Philosophical Perspectives, edited by R. Poli & J. Seibt, Berlin: Springer, 2010, pp. 231-263. See http://www.jfsowa.com/pubs/rolelog.pdf. Sowa also says, “Although formal logic can be studied independently of natural language semantics, no formal ontology that has any practical application can ever be developed and used without acknowledging its intimate connection with NL semantics.”
[10] While Peirce’s logic and clarity of conceptual relationships is compelling, I find reading his writings quite demanding.
[11] In the field of artificial intelligence, the most difficult problems are informally known as AI-complete or AI-hard, meaning that the difficulty of these computational problems is equivalent to solving the central artificial intelligence problem of making computers as intelligent as people. Computer vision, autonomous robots and understanding natural language are amongst challenges recognized by consensus as being AI-complete. However, practical advances on the Grand Challenges were never defined as needing to meet the AI-complete criterion. Indeed, it is even questionable whether such a hurdle is even worthwhile or meaningful on its own.

Posted: June 2, 2011

Contrary to Some Views, Google and Co.’s Microdata Effort will Also Boost RDF

In my opinion, perhaps the most important event for the structured Web since RDF was released a dozen years ago was today’s joint announcement by the search engine triumvirate of Google, Bing and Yahoo! releasing Schema.org. Schema.org is a vendor specification for nearly 300 mini-schema (or structured record definitions) that can be used to tag information in Web pages. These schema are organized into a clean little hierarchy and cover many of the leading things — from organizations to people to products and creative works — that can be written about and characterized on the Web.

These schema specifications are based on the microdata standard presently under review as part of the pending HTML5 specification. Microdata embeds record descriptions, sets of key-value attribute pairs, directly into the HTML of a Web page. These microdata schema are similar to microformats, but broader in coverage and extensible. Microdata is also simpler than RDFa, another W3C specification, which the Schema.org organizers call “. . . extensible and very expressive, but the substantial complexity of the language has contributed to slower adoption.”

Is the Initiative a Slap in RDF’s Face?

Various forums have been alive with howls and questions from many RDF and RDFa advocates claiming that this initiative negates years of effort behind those formats. Yet my company, Structured Dynamics, and I base our entire technology approach on semantics and RDF, and we do not see this announcement as a threat or a rejection. What gives; what is the difference in perspective?

In our view, RDF, with its triple-based data model, is the simplest and most expressive means to represent any data or any data relationship. As such, RDF and its language extensions such as OWL and ontologies provide a robust and flexible canonical data model for capturing any extant data or schema. No matter what the native form of the source information, we can boil it down to RDF and inter-relate it to any other information. It is for these reasons (and others) that we have frequently termed RDF the universal data solvent.

But simple records and simple data need not be encumbered with the complexity of RDF. We have long argued for the importance of naive data structs. Many of these are simple key-value pairs where the subject is implied. The so-called little structured data records in Wikipedia, called infoboxes, are of this form. JSON and many other simple formats also express such data cleanly.
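To make the relationship concrete, here is a minimal Python sketch, with invented example data, of how a naive key-value record such as an infobox or a microdata item can be lifted into explicit RDF-style subject-predicate-object triples when the time comes to interoperate.

```python
# A naive data struct: key-value pairs with the subject implied (invented example).
record = {
    "name": "Acme Widgets",
    "type": "Organization",
    "location": "Des Moines, Iowa",
}

def to_triples(subject_uri, struct, vocab="http://example.org/vocab/"):
    """Make the implicit subject explicit, yielding (subject, predicate, object) triples."""
    return [(subject_uri, vocab + key, value) for key, value in struct.items()]

for s, p, o in to_triples("http://example.org/id/acme-widgets", record):
    print(s, p, repr(o))
# http://example.org/id/acme-widgets http://example.org/vocab/name 'Acme Widgets'
# ... and so on for the other keys
```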

The basic fact that RDF provides a universal data model for any kind of native data does not necessarily translate into its use as the actual data exchange format. Rather, winning data exchange formats are those that can be easily understood, easily expressed and therefore widely used. I think there is a real prospect that microdata, ready for ingest and expression by the Web’s leading search engines, may represent a real sea change in the availability and expression of structured data on the Web.

More structure — not less — is the real fuel that will promote greater adoption of RDF when it comes time to interoperate that data. The RDF community should rejoice that more structure will be coming to the Web from Google et al.’s announcement. We should also soon see an explosion of tools and utilities and services that make it easy to automatically add such structure to Web pages via single clicks. Then, once this structure is available, watch out!

So, while the backers of Schema.org also announced their continued support for microformats and RDFa as they presently exist, I rather suspect today’s announcement represents a denouement for these alternative formats. Though these formats may be creatively destroyed, I think the effect on RDF itself will be a profound and significant boost. I foresee clarity coming to the marketplace regarding RDF’s role:  as a canonical means for expressing data of any form, and not necessarily as a data exchange format.

The Initiative is No Surprise

This initiative, led by Google, should be no surprise. Google is the registered agent for the Schema.org Web site and has been the key proponent of microdata via its support of Ian Hickson in the WHATWG and HTML5 working groups. As I stated a couple of years back, Google has also not hidden its interests in structured data. Practically daily we see more structured data appear in Google search results, and Google has maintained a very active program in structured data extraction from text and tables for some years.

Google and its search engine partners recognize that search needs are evolving from keyword retrievals to structure, relationships, and filtered, targeted results. Those advances come from structure — as well as the semantic relationships between things that something like the Schema.org begins to represent.

Many within the W3C and elsewhere questioned why Google was pushing microdata when there were competing options such as microformats or RDFa (or even earlier variants). As with Microsoft a decade earlier, some ascribed Google’s microdata advocacy to commercial interests or advertising clout alone. Of course Google has an economic interest in the growth and usefulness of the Web. But I do not believe its advocacy to be premised on clout or “my way or the highway.”

Google and the search engine triumvirate understand well — much better than many of the researchers and academics that dominate mailing list discussions — that use and adoption trump elegance and sophistication. When one deconstructs the design of microdata and the nearly 300 schema now released behind it, I think the pragmatic observer can only come to one conclusion: Job well done!

Why This is Exciting

I have been a fervent RDF advocate for nearly a decade and have also been a vocal proponent of the structured Web as a necessary stepping stone to the semantic Web. In fact, here is a repeat of a diagram I have used many times over the past 5 years:

Transition in Web Structure

Document Web (circa 1993)
  • Document-centric
  • Document resources
  • Unstructured data and semi-structured data
  • HTML
  • URL-centric

Structured Web (circa 2003)
  • Data-centric
  • Structured data
  • Semi-structured data and structured data
  • XML, JSON, RDF, etc.
  • URI-centric

Linked Data (circa 2007)
  • Data-centric
  • Linked data
  • Semi-structured data and structured data
  • RDF, RDF-S
  • URI-centric

Semantic Web (circa ???)
  • Data-centric
  • Linked data
  • Semi-structured data and structured data
  • RDF, RDF-S, OWL
  • URI-centric

When one looks at the schema of schema that accompany today’s announcement, it is really clear just how encompassing and important these instant standards will become:

DataType
Thing
   Intangible
   CreativeWork
   Event
   Organization
      LocalBusiness
         AnimalShelter
         AutomotiveBusiness
         ChildCare
         DryCleaningOrLaundry
         EmergencyService
         EmploymentAgency
         EntertainmentBusiness
         FinancialService
         FoodEstablishment
         GovernmentOffice
         HealthAndBeautyBusiness
         HomeAndConstructionBusiness
         InternetCafe
         Library
         LodgingBusiness
         MedicalOrganization
         ProfessionalService
         RadioStation
         RealEstateAgent
         RecyclingCenter
         SelfStorage
         ShoppingCenter
         SportsActivityLocation
         Store
         TelevisionStation
         TouristInformationCenter
         TravelAgency
      NGO
      SportsTeam
   Person
   Place
   Product

Today’s announcement is the best news I have heard in years regarding the structured Web, RDF, and the semantic Web. This announcement is — I believe — the signal event of the structured Web. With regard to my longstanding diagram above, I can go to bed tonight knowing we have now crossed the threshold into the semantic Web.

Posted: May 16, 2011

A Video Introduction to a New Online Ontology Editor and Manager

Structured Dynamics is pleased to unveil structOntology — its ontology manager application within the conStruct open source semantic technology suite. We are doing so via a video, which shows this exciting new app in action.

structOntology has been on our radar for more than two years. But, it was only in embracing the OWLAPI some eight months back that we finally saw our way clear to how to implement the system.

The app, superbly developed by Fred Giasson, has many notable advantages — some of which are covered by the video — but two deserve specific attention:  1) the superior search function (if you have been using Protégé or similar, you will love the fact this search indexes everything, courtesy of Solr); and 2) the availability of its functionality directly within the applications that are driven by the ontologies. Of course, there’s other cool stuff too!:

 

 

(If you have trouble seeing this, here is the direct YouTube link or an alternate local Flash version if you can not access YouTube.)

More information on structOntology will be forthcoming over the coming weeks. We will be posting it as open source as part of the Open Semantic Framework by early summer.

Posted: May 10, 2011

Deciphering Information Assets: Exposing $4.7 Trillion Annually in Undervalued Information

Something strange began to happen with company valuations beginning twenty to thirty years ago. Book values increasingly began to diverge — go lower — from stock prices or acquisition prices. Between 1982 and 1992 the ratio of book value to market value decreased from 62% to 38% for public US companies [1]. The why of this mystery has largely been solved, but what to do about it has not. Significantly, semantic technologies and approaches offer both a rationale and an imperative for how to get the enterprises’ books back in order. In the process, semantics may also provide a basis for more productive management and increased valuations for enterprises as well.

The mystery of diverging value resides in the importance of information in an information economy. Unlike the historical and traditional ways of measuring a company’s assets — based on the tangible factors of labor, capital, land and equipment — information is an intangible asset. As such, it is harder to see, understand and evaluate than other assets. Conventionally, and still the more common accounting practice, intangible assets are divided into goodwill, legal (intellectual property and trade secrets) and competitive (know-how) intangibles. But — given that intangibles now equal or exceed the value of tangible assets in advanced economies — we will focus instead on the information component of these assets.

As used herein, information is taken to be any data that is presented in a form useful to recipients (as contrasted to the more technical definition of Shannon and Weaver [2]). While it is true that there is always a question of whether the collection or development of information is a cost or represents an investment, it is certain that “information” is of growing importance and value to the enterprise.

The importance of this information focus can be demonstrated by two telling facts, which I elaborate below. First, only five to seven percent of existing information is adequately used by most enterprises. And, second, the global value of this information amounts to perhaps a range of $2.0 trillion to $7.4 trillion annually (yes, trillions with a T)! It is frankly unbelievable that assets of such enormous magnitude are so poorly understood, exploited or managed.

Amongst all corporate resources and assets, information is surely the least understood and certainly the least managed. We value what we measure, and measure what we value. To say that we little measure information — its generation, its use (or lack thereof) or its value — means we are attempting to manage our enterprises with one eye closed and one arm tied behind our backs. Semantic approaches offer us one way, perhaps the best way, to bring understanding to this asset and then to leverage its value.

The Seven “Laws” of Information

More than a decade ago Moody and Walsh put forward a seminal paper on the seven “laws” of information [3]. Unlike other assets, information has some unique characteristics that make understanding its importance and valuing it much more difficult than other assets. Since I think it a shame that this excellent paper has received little attention and few citations, let me devote some space to covering these “laws”.

Like traditional factors of production — land, labor, capital — it is critical to understand the nature of this asset of “information”. As the laws below make clear, the nature of “information” is totally unique with respect to other factors of production. Note I have taken some liberty and done some updating on the wording and emphasis of the Moody and Walsh “laws” to accommodate recent learnings and understandings.

Law #1: Information Is (Infinitely) Shareable

Information is not friable and can not be depleted. Using or consuming it has no direct effect on others using or consuming it, and using only portions of it does not undermine the whole of it. Using it does not cause a degradation or loss of function from its original state. Indeed, information is actually not “consumed” at all (in the conventional sense of the term); rather, it is “shared”. And, absent other barriers, information can be shared infinitely. The access to and use of information on the Web demonstrates this truth daily.

Thus, perhaps the most salient characteristic of information as an asset is that it can be shared between any number of people, business areas and organizations without loss of value to any party (absent the importance of confidentiality or secrecy, which is another information factor altogether). This sharability, the maintenance of value irrespective of use, makes information behave quite differently from other assets. There is no dilution from use. As Moody and Walsh put it, “from the firm’s perspective, value is therefore cumulative rather than apportioned across different users.”

In practice, however, this very uniqueness is also a cause of other organizational challenges. Both personal and institutional barriers get erected to limit sharing since “knowledge is power.” One perverse effect of information hoarding or lack of institutional support for sharing is to force the development anew of similar information. When not shared, existing information becomes a cost, and one that can get duplicated many times over.

Law #2: The Value of Information Increases With Use

Most resources degrade with use, such as equipment wearing out. In contrast, the per-unit value of information increases with use. The major cost of information is in its capture, storage and maintenance. The actual variable cost of using the information (particularly digital information) is, in essence, zero. Thus, with greater use, the per-unit cost of information drops.

There is a corollary to this that also goes to the heart of the question of information as an asset. From an accounting point of view, something can only be an asset if it provides future economic value. If information is not used, it cannot possibly result in such benefits and is therefore not an asset. Unused information is really a liability, because no value is extracted from it. In such cases the costs of capture, storage and maintenance are incurred, but with no realized value. Without use, information is solely a cost to the enterprise.

The additional corollary is that awareness of the information’s existence is an essential requirement in order to obtain this value. As Moody and Walsh state, “information is at its highest ‘potential’ when everyone in the organization knows where it is, has access to it and knows how to use it. Information is at its lowest ‘potential’ when people don’t even know it is there.”

A still further corollary is the importance of information literacy. Awareness of information without an understanding of where it fits or how to take advantage of it also means its value is hidden to potential users. Thus, in addition to awareness, training and documentation are important factors to help ensure adequate use. Both of these factors may seem like additional costs to the enterprise beyond capture, storage and maintenance, but — without them — no or little value will be leveraged and the information will remain a sunk cost.

Law #3: Information is Perishable

Like most other assets, the value of information tends to depreciate over time [4]. Some information has a short shelf life (such as Web visitations); other information has a long shelf life (patents, contracts and many trade secrets). Proper valuation of information should take into account such differences in operational life, analysis or decision life, and statutory life. Operational shelf life tends to be the shortest.

In these regards, information is not too dissimilar from other asset types. The most important point is to be cognizant of use and shelf differences amongst different kinds of information. This consideration is also traded off against the declining costs of digital information storage.

Law #4: The Value of Information Increases With Accuracy

A standard dictum is that the value of information increases with accuracy. The caveat, however, is that some information, because it is not operationally dependent or critical to the strategic interests of the firm, actually can become a cost when capture costs exceed value. Understanding such Pareto principles is an important criterion in evaluating information approaches. Generally, information closest to the transactional or business purpose of the organization will demand higher accuracy.

Such statements may sound like platitudes — and are — in the absence of an understanding of information dependencies within the firm. But, when certain kinds of information are critical to the enterprise, it is just as important to know the accuracy of the information feeding that “engine” as it is for oil changes or maintenance schedules for physical engines. Thus an understanding of accuracy requirements in information should be a deliberate management focus for critical business functions. It is the rare firm that attends to such imperatives today.

Law #5: The Value of Information Increases in Combination

A unique contribution from semantic approaches — and perhaps the one resulting in the highest valuation benefit — arises from the increased value of connecting the information. We have come to understand this intimately as the “network effect” from interconnected nodes on a network. The same effect arises when existing information is connected.

Today’s enterprise information environment is often described by many as unconnected “silos”. Scattered databases and spreadsheets and other information repositories litter the information landscape. Not only are these sources unconnected and isolated, but they also may describe similar information in different and inconsistent ways.

As I have described previously in The Law of Linked Data [5], existing information can act as nodes that — once connected to one another — tend to produce a similar network effect to what physical networks exhibit with increasing numbers of users. Of course, the nature of the information that is being connected and its centrality to the mission of the enterprise will greatly affect the value of new connections. But, based on evidence to date, the value of information appears to go up somewhere between a quadratic and exponential function for the number of new connections. This value is especially evident in know-how and competitive areas.
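As a rough numerical feel for what that range implies, the Python sketch below compares a quadratic (Metcalfe-style) curve with an exponential (Reed-style) curve as the number of connected information nodes grows. The curves are illustrative assumptions only, not an empirical model of information value.

```python
# Illustrative only: candidate value-growth curves for n connected information nodes.
def quadratic_value(n):
    return n * (n - 1) // 2        # Metcalfe-style: proportional to possible pairwise links

def exponential_value(n):
    return 2 ** n                  # Reed-style: proportional to possible groupings

for n in (10, 20, 40):
    print(n, quadratic_value(n), exponential_value(n))
# 10 45 1024
# 20 190 1048576
# 40 780 1099511627776
```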

Law #6: More Is Not Necessarily Better

Information overload is a well-known problem. On the other hand, lack of appropriate information is also a compelling problem. The question of information is thus one of relevancy. Too much irrelevant information is a bad thing, as is too little relevant information.

These observations lead to two use considerations. First, means to understand and focus information capture on relevant information is critical. And, second, information management systems should be purposefully designed with user interfaces for easy filtering of irrelevant information.

The latter point sounds straightforward but, in actuality, requires a semantic underpinning to the enterprise’s information assets. This is because relevancy is in the eye of the beholder, and different users have different terms, perspectives and world views by which they evaluate information. For filtering to be useful, information must be presented in terms and perspectives relevant to those users. Since multiple studies affirm that decision-makers seek more information even beyond their overload points [3], it is more useful to provide relevant access and filtering methods that can be tailored by user rather than “top down” information restrictions.

Law #7: Information is Self-propagating

With access and connections, information tends to beget more information. This propagation results from summations, analysis, unique combinations and other ways that existing data get recombined into new data. Thus, while the first law noted that information can not be consumed (or depleted) by virtue of its use, we can also say that information tends to reproduce and expand itself via use and inspection.

Indeed, knowledge itself is the result of how information in its native state can be combined and re-organized to derive new insights. From a valuation standpoint, it is this very understanding that leads to such things as competitive intelligence or new know-how. In combination with insights from connections, this propagating factor of information is the other leading source of intangible asset valuations.

This law also points to the fact that information per se is not a scarce resource. (Though its availability may be scarce.) Once available, techniques like data mining, analysis, visualization and so forth can be rich sources for generating new information from existing holdings of data.

Information as an Asset and How to Value

These “laws” — or perspectives — help to frame the imperatives for how to judge information as an asset and its resulting value. The methodological and conceptual issues of how to explicitly account for information on a company’s books are, of course, matters best left to economists and professional accountants. With the growing share of information in relation to intangible assets, this would appear to be a matter of great importance to national policy. Accounting for R&D efforts as an asset versus a cost, for example, has been estimated to add on the order of 11 percent to US national GDP estimates [9].

The mere generation of information is not necessarily an asset, as the “laws” above indicate. Some of the information has no value and some indeed represents a net sunk cost. What we can say, however, is that valuable information that is created by the enterprise but remains unused or is duplicated means that what was potentially an asset has now been turned into a cost — sometimes a cost repeated many-fold.

Information that is used is an asset, intangible or not. Here, depending on the nature of the information and its use, it can be valued on the basis of cost (historical cost or what it cost to develop it), market value (what others will pay for it), or utility (what is its present value as benefits accrue into the future). Traditionally the historical cost method has been applied to information. Yet, since information can both be sold and still retained by the organization, it may have both market value and utility value, with its total value being the sum.

In looking at these factors, Moody and Walsh propose a number of new guidelines in keeping with the “laws” noted above [3]; a rough numerical sketch of how such guidelines might be combined follows the list:

  • Operational information should be measured as the cost of collection using data entry costs
  • Management information should be valued based on what it cost to extract the data from operational systems
  • Redundant data should be considered to have zero value (Law #1)
  • Unused data should be considered to have zero value (Law #2)
  • The number of users and number of accesses to the data should be used to multiply the value of the information (Law #2). When information is used for the first time, it should be valued at its cost of collection; subsequent uses should add to this value (perhaps on a depreciated basis; see below)
  • The value of information should be depreciated based on its “shelf life” (Law #3)
  • The value of information should be discounted by its accuracy relative to what is considered to be acceptable (Law #4)
  • And, as an added factor, information that is effectively linked or combined should have its value multiplied (Law #5), though the actual multiplier may be uncertain [5].
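As a rough, hypothetical sketch of how these guidelines might be combined into a single valuation, the Python below values one information holding by its collection cost, number of uses, shelf-life depreciation, accuracy discount and a linkage multiplier. All figures and the functional form are my own illustrative assumptions, not something proposed by Moody and Walsh.

```python
def information_value(collection_cost, uses, age_years, shelf_life_years,
                      accuracy, acceptable_accuracy, linked_sources):
    """Toy valuation combining the guideline ideas above; all assumptions are illustrative."""
    if uses == 0:
        return 0.0                                    # unused data has zero value (Law #2)
    base = collection_cost * uses                     # value accumulates with use (Law #2)
    depreciation = max(0.0, 1 - age_years / shelf_life_years)       # shelf life (Law #3)
    accuracy_discount = min(1.0, accuracy / acceptable_accuracy)    # accuracy (Law #4)
    linkage_multiplier = 1 + linked_sources           # combination multiplies value (Law #5)
    return base * depreciation * accuracy_discount * linkage_multiplier

# A hypothetical customer dataset: $50,000 to collect, used 12 times, 1 year into
# a 5-year shelf life, 92% accurate against a 95% target, linked to 2 other sources.
print(round(information_value(50_000, 12, 1, 5, 0.92, 0.95, 2)))   # ~1,394,526
```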

The net result of thinking about information in this more purposeful way is to encourage more accurate valuation methods, and to provide incentives for more use and re-use, particularly in combined ways. Such methods can also help distinguish what information is of more value to the organization, and therefore worthy of more attention and investment.

The Growing Importance of Intangible Information

The emerging discrepancy between market capitalizations and book values began to get concerted academic attention in the 1990s. To be sure, perceptions by the market and of future earnings potential can always  color these differences. The simple occurrence of a discrepancy is not itself proof of erroneous or inaccurate valuations. (And, the corollary is that the degree of the discrepancy is not sufficient alone to estimate the intangible asset “gap”, a logical error made by many proponents.) But, the fact that these discrepancies had been growing and appeared to be based (in part) on structural changes linked to intangibles was creating attention.

Leonard Nakamura, an economist with the Federal Reserve Board in Philadelphia, published a working paper in 2001 entitled, “What is the U.S. Gross investment in Intangibles?  (At Least) One Trillion Dollars a Year!” [6]. This was one of the first attempts to measure intangible investments, which he defined as private expenditures on assets that are intangible and necessary to the creation and sale of new or improved products and processes, including designs, software, blueprints, ideas, artistic expressions, recipes, and the like. Nakamura acknowledged his work as being preliminary. But he did find direct and indirect empirical evidence to show that US private firms were investing at least $1 trillion annually (as of 2000, the basis year for the data) in intangible assets.  Private expenditures, labor and corporate operating margins were the three measurement methods.  The study also suggested that the capital stock of intangibles in the US has an equilibrium market value of at least $5 trillion.

Another key group — Carol Corrado, Charles Hulten, and Daniel Sichel, known as “CHS” across their many studies — also began to systematically evaluate the extent of and basis for intangible assets and the valuation discrepancy [7]. They estimated that spending on long-lasting knowledge capital — not just intangibles broadly — grew relative to other major components of aggregate demand during the 1990s. CHS was the first to show that, by the turn of the millennium, fixed US investment in intangibles was at least as large as business investment in traditional, tangible capital.

By later in the decade, Nakamura was able to gather and analyze time series data that showed the steady increase in the contributions of intangibles [8]:

One can see the cross-over point late in the decade. Investment in intangibles he now estimates to be on the order of 8% to 10% of GDP annually in the US.

Roughly at the same time the National Academies in the US was commissioned to investigate the policy questions of intangible assets. The resulting major study [9] contains much relevant information. But it, too, contained an update by CHS on their slightly different approach to analyzing the growing role of intangible assets:

This CHS analysis shows similar trends to what Nakamura found, though the degree of intangible contributions is estimated as higher (~14% of annual GDP today), with investments in intangibles exceeding tangible assets somewhat earlier.

Surveys of more than 5,000 companies in 25 countries confirmed these trends from a different perspective, and also showed that most of these assets did not get reflected in financial statements. A large portion of this value was due to “brands” and other market intangibles [10]. The total “undisclosed” portion appeared to equal or exceed total reported assets. Figures for the US indicated there might be a cumulative basis of intangible assets of $9.2 trillion [11].

In parallel, these groups and others began to decompose the intangible asset growth by country, sector, or asset type. The specific component of “information” received a great deal of attention. Uday Apte, Uday Karmarkar and Hiranya Nath, in particular, conducted a couple of important studies during this decade [12,13]. For example, they found nearly two-thirds of recent US GDP was due to information or knowledge industry contributions, a percentage that had been growing over time. They also found that a secondary sector of information internal to firms itself constituted well over 40% of the information economy, or some 28% of the entire economy. So the information activities internal to organizations and institutions represent a very large part of the economy.

The specific components that can constitute the informational portion of intangible assets have also been examined by many investigators, importantly including key accounting groups. FASB, for example, has specific guidance on the treatment of intangible assets in SFAS 141 [14]. Two-thirds of the 90 specific intangible items listed by the American Institute of Certified Public Accountants are directly related to information (as opposed to contracts, brands or goodwill), as shown in [15]. There has also been some good analysis by CHS on breakdowns by intangible asset categories [16]. There are also considerable differences by country on various aspects of these measures [10]. For example, according to OECD figures from 2002, expenditures for knowledge (R&D, education and software) ranged from nearly 7 percent of GDP (Sweden) to below 2 percent (Greece) among OECD countries, with an average of about 4 percent and the US at over 6 percent [17].

. . . Plus Too Much Information Goes Unused

The common view is that a typical organization only uses 5 to 7 percent of the information it already has on hand [18], and 20% to 25% of a knowledge worker’s time is spent simply trying to find information [19]. To probe these issues more deeply, I began a series of analyses in 2004 looking at how much money was being spent on preparing documents within US companies, and how much of that investment was being wasted or not re-used [20]. One key finding from that study was that the information within documents in the US represents about a third of total gross domestic product, or an amount equal at the time of the study to about $3.3 trillion annually (in 2010 figures, that would be closer to $4.7 trillion). This level of investment is consistent with the results of Apte et al. and others as noted above.

However, for various reasons — mostly due to lack of awareness and re-use — some 25% of those trillions of dollars spent annually on document creation is wasted. If we could just find the information and re-use it, massive benefits could accrue, as these breakdowns in key areas show:

U.S. FIRMS                                            $ Million        %
Cost to Create Documents                             $3,261,091
Benefits to Finding Missed or Overlooked Documents     $489,164      63%
Benefits to Improved Document Access                    $81,360      10%
Benefits of Re-finding Web Documents                    $32,967       4%
Benefits of Proposal Preparation and Wins                $6,798       1%
Benefits of Paperwork Requirements and Compliance      $119,868      15%
Benefits of Reducing Unauthorized Disclosures           $51,187       7%
Total Annual Benefits                                  $781,314     100%

PER LARGE FIRM                                        $ Million
Cost to Create Documents                                 $955.6
Benefits to Finding Missed or Overlooked Documents       $143.3
Benefits to Improving Document Access                     $23.8
Benefits of Re-finding Web Documents                       $9.7
Benefits of Proposal Preparation and Wins                   $2.0
Benefits of Paperwork Requirements and Compliance          $35.1
Benefits of Reducing Unauthorized Disclosures              $15.0
Total Annual Benefits                                     $229.0

Table. Mid-range Estimates for the Annual Value of Documents, U.S. Firms, 2002 [20]

The total benefit from improved document access and use to the U.S. economy is on the order of 8% of GDP. For the 1,000 largest U.S. firms, benefits from these improvements can approach nearly $250 million annually per firm (2002 basis). About three-quarters of these benefits arise from not re-creating the intellectual capital already invested in prior document creation. About one-quarter of the benefits are due to reduced regulatory non-compliance or paperwork, or better competitiveness in obtaining solicited grants and contracts.

This overall value of document use and creation is quite in line with the analyses of intangible assets noted above, and which arose from totally different analytical bases and data. This triangulation brings confidence that true trends in the growing importance of information have been identified.

How Big is the Information Asset Gap?

These various estimates can now be combined to provide an assessment of just how large the “gap” is for the overlooked accounting and use of information assets:

Region                 GDP ($T)   Intangible %   Info Contrib %   Info Assets ($T)   Unused Info ($T)   Total ($T)
US                      $14.72     9% - 14%       33% - 67%        $0.44 - $1.38      $0.30 - $1.21      $0.74 - $2.60
European Union          $15.25     8% - 12%       33% - 50%        $0.40 - $0.92      $0.31 - $1.26      $0.72 - $2.17
Remaining Advanced      $10.17     8% - 12%       33% - 50%        $0.27 - $0.61      $0.21 - $0.84      $0.48 - $1.45
Rest of World           $34.32     2% - 6%        10% - 25%        $0.07 - $0.51      $0.00 - $0.71      $0.07 - $1.22
Total                   $74.46                                     $1.18 - $3.42      $0.83 - $4.02      $2.00 - $7.44

All ranges show low and high estimates. Notes (see endnotes): GDP [21]; Intangible % [22]; Info Contrib % [23]; Unused Info [24].
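For readers who want to trace the arithmetic, the following is a minimal sketch (in Python, purely for illustration) of how the rows and totals in the table are derived: information assets are GDP × intangible share × information contribution, the gap adds the unused-information estimate, and the global figures simply sum the rows. Small rounding differences from the published table arise because the table carries more precision than the rounded inputs used here.

```python
# Illustrative reconstruction of the table's arithmetic (all figures in $ trillions).
# Each row: (GDP, intangible share lo/hi, info contribution lo/hi, unused info lo/hi).
rows = {
    "US":                 (14.72, 0.09, 0.14, 0.33, 0.67, 0.30, 1.21),
    "European Union":     (15.25, 0.08, 0.12, 0.33, 0.50, 0.31, 1.26),
    "Remaining Advanced": (10.17, 0.08, 0.12, 0.33, 0.50, 0.21, 0.84),
    "Rest of World":      (34.32, 0.02, 0.06, 0.10, 0.25, 0.00, 0.71),
}

total_lo = total_hi = world_gdp = 0.0
for region, (gdp, int_lo, int_hi, info_lo, info_hi, unused_lo, unused_hi) in rows.items():
    assets_lo = gdp * int_lo * info_lo   # e.g., US: 14.72 * 0.09 * 0.33 ~= $0.44T
    assets_hi = gdp * int_hi * info_hi   # e.g., US: 14.72 * 0.14 * 0.67 ~= $1.38T
    lo, hi = assets_lo + unused_lo, assets_hi + unused_hi
    total_lo += lo
    total_hi += hi
    world_gdp += gdp
    print(f"{region}: ${lo:.2f}T to ${hi:.2f}T")

print(f"Global gap: ${total_lo:.2f}T to ${total_hi:.2f}T "
      f"({total_hi / world_gdp:.0%} of world GDP at the high end)")
```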

Depending on one's perspective, these estimates can be viewed either as too optimistic about the importance of information assets [25] or as too conservative [26]. The breadth of these ranges is itself an expression of the uncertainty in the numbers and the analysis.

The analysis shows that, globally, the value of unused and unaccounted-for information assets may be on the order of $2.0 trillion to $7.4 trillion annually, with a mid-range value of $4.7 trillion. Even allowing for the uncertainties, these are huge numbers by any account. For the US alone, the range is $750 billion to $2.6 trillion annually. The prior studies [20] strongly suggest the higher end of this range is more likely than the lower. Similarly large gaps likely occur within the European Union and other advanced economies. For individual firms, depending on size, the benefits of understanding and closing these gaps can readily be measured in the millions to billions of dollars [27].

At the high end, these estimates suggest that perhaps as much as 10% of global expenditure is wasted or unaccounted for because of information-related activities. That is roughly equivalent to adding half of the US economy to the global picture.

In the concluding section, we touch on why such huge holes may appear in the world's financial books. Clearly, though, even with uncertain and heroic assumptions, the magnitude of this gap is enormous, and the need to understand and close it is compelling.

The Relationship to Semantic Technologies

The seven Moody and Walsh information “laws” provide clues to why we do not properly account for information and why we use it inadequately:

  • We don’t know what information we have and cannot find it
  • What information we do have, we don’t connect
  • We misallocate resources for generating, capturing and storing information, because we don’t understand its value and potential
  • We don’t manage the use or re-use of information
  • We duplicate efforts
  • We inadequately leverage the information we have, and so miss valuable insights (ones that could, in fact, be valuated).

Fundamentally, because information is not understood in our bones as central to the well-being of our enterprises, we continue to view its generation, capture and maintenance as a “cost” rather than an “asset”.

I have maintained for some time an interactive information timeline [28] that attempts to encompass the entire human history of information innovations. For tens of thousands of years, the timeline shows steady, yet slow, progress in the ways we express and manage information. Then, beginning with electricity and then digitization, the pace of innovation explodes.

The same timeframe that sees the importance of intangible assets appear on national and firm accounts is also when we see the full digitization of information and its ability to be communicated and linked over digital networks. A very insightful figure by Rama Hoetzlein from his 2007 thesis, which I have modified and enhanced, captures this evolution with some estimated dates, as shown below [29]:

The first insight this figure provides is that all forms of information are now available in digital form. This includes unstructured (images and documents), semi-structured (mark-up and “tagged” information) and structured (database and spreadsheet) information. This information can now be stored and communicated over digital networks with broadly accepted protocols.

But the most salient insight is that we now have the means, through semantic technologies and approaches, to interrelate all of this information. Tagging and extraction methods enable us to generate metadata for unstructured documents and content. Data models based on predicate logic and semantic logics give us the flexible means to express the relationships and connections among these pieces of information. And all of this can be stored and manipulated in graph-based datastores and languages such that we can draw inferences and gain insights. Plus, since all of this is now accessible via the Web and browsers, virtually any user can access, use and leverage this information.

This figure and its dates show not only where we have come as a species in our use of and sophistication with information, but also how we need to bring it all together using semantics to complete our transition to a knowledge economy.
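To make this interrelation concrete, here is a minimal sketch using Python and the rdflib library (my choice for illustration only; the names, URIs and revenue figure are hypothetical) of how a tag extracted from an unstructured document and a record of the kind held in a conventional database can live in the same graph and be queried together:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/vocab/")   # hypothetical project vocabulary

g = Graph()
doc = URIRef("http://example.org/docs/annual-report-2010")
acme = URIRef("http://example.org/entities/AcmeCorp")

# Metadata generated by tagging or extraction over an unstructured document ...
g.add((doc, RDF.type, EX.Document))
g.add((doc, EX.mentions, acme))

# ... linked to structured data of the kind that comes from a conventional database.
g.add((acme, RDF.type, EX.Company))
g.add((acme, RDFS.label, Literal("Acme Corporation")))
g.add((acme, EX.annualRevenue, Literal(1200000)))

# Because everything sits in one graph, a single query spans both "worlds".
results = g.query("""
    SELECT ?doc ?revenue WHERE {
        ?doc <http://example.org/vocab/mentions> ?company .
        ?company <http://example.org/vocab/annualRevenue> ?revenue .
    }
""")
for row in results:
    print(row.doc, row.revenue)
```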

The very same metadata and semantic tagging capabilities that enable us to interrelate the information itself also provide the techniques by which we can monitor and track usage and provenance. It is through these additional semantic methods that we can finally begin to gain insight into what information is of what value, and to whom. Tapping this information completes the circle, giving us the means to properly valuate, and then manage and optimize, our information assets.

Conclusion

With our transition to an information economy, we now see that intangible assets exceed the value of tangible ones. We see that the information component represents one-third to two-thirds of those intangibles. In other words, information makes up from 17% to more than one-third of an individual firm's value in modern economies. Further, we see that at least 25% of firm expenditures on information are wasted, keeping information a cost and negating its value as an asset.

The “factories” of the modern information economy no longer produce pins with the fixed inputs of labor and capital, as in the time of Adam Smith. They produce information, knowledge and know-how. Yet our management and accounting systems remain fixed in the techniques of yesteryear. The quaint idea of total factor productivity as a “residual” merely reflects our ignorance about the causes of economic growth and firm value. These are issues that should rightly occupy the attention of practitioners in the disciplines of accounting and management.

Why industrial-era accounting methods have been maintained in the present information age is for students of corporate power politics to debate. It should suffice to remind us that when industrialization induced a shift from the extraction of funds from feudal land possessions to earning profits on invested capital, most of the assumptions about how to measure performance had to change. When the expenses for acquiring information capabilities cease to be an arbitrary budget allocation and become the means for gaining Knowledge Capital, much of what is presently accepted as management of information will have to shift from a largely technological view of efficiency to an asset management perspective [30].

Accounting methods grounded in the early 1800s, premised on capital assets alone as the means to increase the productivity of labor, no longer work. Our engines of innovation are not physical devices, but ideas, innovation and knowledge; in short, information. Capable executives recognize these trends, but have yet to change management practices to address them [31].

As managers and executives of firms, we need not await a wholesale modernization of accounting practices to begin making a difference. The first step is to understand the role, use and importance of information to our organizations. Looking clearly at the seven information “laws” and what they imply about tracking and monitoring is an immediate way to take this step. The second step is to seriously understand and evaluate the prospects for semantic approaches to make a difference today.

We have now sufficiently climbed the data federation pyramid [32]: all of our information assets are digital; we have network protocols to link them; we have natural language and extraction techniques for making documents first-class citizens alongside structured data; and we have logical data models and sound semantic technologies for tying it all together.

We need to reorganize our “factory” floors around these principles, just as prime movers and unit electric drives altered the factories of the past. We need to re-think our work processes and what we measure and value in order to compete in the 21st century. It is time to treat information as seriously as its now integral role in our enterprises warrants. Semantic technologies and approaches provide just such a path.


[1] Baruch Lev and Jürgen H. Daum, 2003. “Intangible Assets and the Need for a Holistic and More Future-oriented Approach to Enterprise Management and Corporate Reporting,” prepared for the 2003 PMA Intellectual Capital Symposium, 2nd October 2003, Cranfield Management Development Centre, Cranfield University, UK; see http://www.juergendaum.de/articles/pma_ic_symp_jdaum_final.pdf.
[2] Claude E. Shannon and Warren Weaver, 1949. The Mathematical Theory of Communication. The University of Illinois Press, Urbana, Illinois, 1949. ISBN 0-252-72548-4.
[3] Daniel Moody and Peter Walsh, 1999. “Measuring The Value Of Information: An Asset Valuation Approach,” paper presented at the Seventh European Conference on Information Systems (ECIS’99), Copenhagen Business School, Frederiksberg, Denmark, 23-25 June, 1999. See http://wwwinfo.deis.unical.it/zumpano/2004-2005/PSI/lezione2/ValueOfInformation.pdf. A precursor paper that is also quite helpful and much cited in Moody and Walsh is R. Glazer, 1993. “Measuring the Value of Information: The Information Intensive Organisation”, IBM Systems Journal, Vol 32, No 1, 1993.
[4] Some trade secrets could buck this trend if the value of the underlying enterprise that relies on them increases.
[5] M.K. Bergman, 2009. “The Law of Linked Data,” post in AI3:::Adaptive Information blog, October 11, 2009. See https://www.mkbergman.com/837/the-law-of-linked-data/.
[6] Leonard Nakamura, 2001. What is the U.S. Gross Investment in Intangibles? (At Least) One Trillion Dollars a Year!, Working Paper No. 01-15, Federal Reserve Bank of Philadelphia, October 2001; see http://www.phil.frb.org/files/wps/2001/wp01-15.pdf.
[7] Carol A. Corrado, Charles R. Hulten, and Daniel E. Sichel, 2004. Measuring Capital and Technology: An Expanded Framework. Federal Reserve Board, August 2004. http://www.federalreserve.gov/pubs/feds/2004/200465/200465pap.pdf.
[8] Leonard I. Nakamura, 2009. Intangible Assets and National Income Accounting: Measuring a Scientific Revolution, Working Paper No. 09-11, Federal Reserve Bank of Philadelphia, May 8, 2009; see http://www.philadelphiafed.org/research-and-data/publications/working-papers/2009/wp09-11.pdf.
[9] Christopher Mackie, Rapporteur, 2009. Intangible Assets: Measuring and Enhancing Their Contribution to Corporate Value and Economic Growth: Summary of a Workshop, prepared by the Board on Science, Technology, and Economic Policy (STEP) Committee on National Statistics (CNSTAT), ISBN: 0-309-14415-9, 124 pages; see http://www.nap.edu/openbook.php?record_id=1274 (available for PDF download with sign-in).
[10] Brand Finance, 2006. Global Intangible Tracker 2006: An Annual Review of the World’s Intangible Value, paper published by Brand Finance and The Institute of Practitioners in Advertising, London, UK, December 2006. See  http://www.brandfinance.com/images/upload/9.pdf.
[11] Kenan Patrick Jarboe and Roland Furrow, 2008. Intangible Asset Monetization: The Promise and the Reality, Working Paper #03 from the Athena Alliance, April 2008. See http://www.athenaalliance.org/pdf/IntangibleAssetMonetization.pdf.
[12] Uday M. Apte and Hiranya K. Nath, 2004. “Size, Structure and Growth of the US Information Economy,” UCLA Anderson School of Management on Business and Information Technologies, December 2004; see http://www.anderson.ucla.edu/documents/areas/ctr/bit/ApteNath.pdf.pdf.
[13] Uday M. Apte, Uday S. Karmarkar and Hiranya K Nath, 2008. “Information Services in the US Economy: Value, Jobs and Management Implications,” California Management Review, Vol. 50, No.3, 12-30, 2008.
[14] See the Financial Accounting Standards Board—SFAS 141; see http://www.gasb.org/pdf/fas141r.pdf.

[15] See further, AICPA Special Committee on Financial Reporting, 1994. Improving Business Reporting—A Customer Focus: Meeting the Information Needs of Investors and Creditors. See http://www.aicpa.org/InterestAreas/AccountingAndAuditing/Resources/EBR/DownloadableDocuments/Jenkins%20Committee%20Report.pdf. The information-related items among the intangibles listed there are: Blueprints, Book libraries, Broadcast licenses, Buy-sell agreements, Certificates of need, Chemical formulas, Computer software, Computerized databases, Contracts, Cooperative agreements, Copyrights, Credit information files, Customer contracts, Customer and client lists, Customer relationships, Designs and drawings, Development rights, Employment contracts, Engineering drawings, Environmental rights, Film libraries, Food flavorings and recipes, Franchise agreements, Historical documents, Health maintenance organization enrollment lists, Know-how, Laboratory notebooks, Literary works, Management contracts, Manual databases, Manuscripts, Medical charts and records, Musical compositions, Newspaper morgue files, Noncompete covenants, Patent applications, Patents (both product and process), Patterns, Prescription drug files, Prizes and awards, Procedural manuals, Product designs, Proposals outstanding, Proprietary computer software, Proprietary processes, Proprietary products, Proprietary technology, Publications, Royalty agreements, Schematics and diagrams, Shareholder agreements, Solicitation rights, Subscription lists, Supplier contracts, Technical and specialty libraries, Technical documentation, Technology-sharing agreements, Trade secrets, Trained and assembled workforce, Training manuals.

[16] See, for example, Carol Corrado, Charles Hulten and Daniel Sichel, 2009. “Intangible Capital and U.S. Economic Growth,” Review of Income and Wealth Series 55, Number 3, September 2009; see http://www.conference-board.org/pdf_free/IntangibleCapital_USEconomy.pdf.
[17] As stated in Kenan Patrick Jarboe, 2007. Measuring Intangibles: A Summary of Recent Activity, Working Paper #02 from the Athena Alliance, April 2007. See http://www.athenaalliance.org/pdf/MeasuringIntangibles.pdf.
[18] The 5% estimate comes from Graham G. Rong, Chair at MIT Sloan CIO Symposium, as reported in the SemanticWeb.com on May 5, 2011. (Rong also touted the use of semantic technologies to overcome this lack of use.) A similar 7% estimate comes from Pushpak Sarkar, 2002. “Information Quality in the Knowledge-Driven Enterprise,” InfoManagement Direct, November 2002. See http://www.information-management.com/infodirect/20021115/6045-1.html.
[19] M.K. Bergman, 2005. “Search and the ‘25% Solution’,” AI3:::Adaptive Information blog, September 14, 2005. See https://www.mkbergman.com/121/search-and-the-25-solution/.
[20] M.K. Bergman, 2005.  “Untapped Assets: the $3 Trillion Value of U.S. Documents,” BrightPlanet Corporation White Paper, July 2005, 42 pp. Also available  online and in PDF.
[21] From the CIA, 2011. The World Factbook; accessed online at https://www.cia.gov/library/publications/the-world-factbook/index.html on May 9, 2011. The “remaining advanced” countries are Australia, Canada, Iceland, Israel, Japan, Liechtenstein, Monaco, New Zealand, Norway, Puerto Rico, Singapore, South Korea, Switzerland, Taiwan.
[22] The range of estimates is drawn from the Nakamura [8] and CHS [9] studies, with each respectively providing the lower and upper bounds. These values have been slightly decremented for non-US advanced countries, and greatly reduced for non-advanced ones.
[23] The high range is based on the categorical share of intangible asset categories (60 of 90) from the AICPA work [15]; the lower range is from the one-third of GDP estimates from [20]. These values have been slightly decremented for non-US advanced countries, and greatly reduced for non-advanced ones.
[24] For unused information assets, the high range is based on the one-third of GDP and 25% “waste” estimates from [20]; the low range halves each of those figures. These values have been slightly decremented for non-US advanced countries, and greatly reduced for non-advanced ones (and zero for the low range).
[25] Reasons the estimates may be too optimistic: they treat information as being as important as goodwill or branding; they assume the intellectual basis of the cited sources is indeed real; and there are considerable differences by country and sector (see [10] and [16]).
[26] Reasons the estimates may be too conservative: they include no network effects; non-advanced countries are greatly discounted; information's share is growing (but older estimates were used); and there are considerable differences by country and sector (see [10] and [16]).
[27] For some discussion of individual firm impacts and use cases see [10] and [20], among others.
[28] See the Timeline of Information History, and its supporting documentation at M.K. Bergman, 2008. “Announcing the ‘Innovations in Information’ Timeline,” AI3:::Adaptive Information blog, July 6, 2008; see  https://www.mkbergman.com/421/announcing-the-innovations-in-information-timeline/.
[29] This figure is a modification of the original published by Rama C. Hoetzlein, 2007. Quanta – The Organization of Human Knowledge: Systems for Interdisciplinary Research, Master’s Thesis, University of California Santa Barbara, June 2007; see http://www.rchoetzlein.com/quanta/ (p 112). I adapted this figure to add logics, data and metadata to the basic approach, with color coding also added.
[30] From Paul A. Strassmann, 1998. “The Value of Knowledge Capital,” American Programmer, March 1998. See  http://www.strassmann.com/pubs/valuekc/.
[31] For example, according to [11], in a 2003 Accenture survey of senior managers across industries, 49 percent of respondents said that intangible assets are their primary focus for delivering long-term shareholder value, but only 5 percent stated that they had an organized system to track the performance of these assets. Also, according to sources cited in Gio Wiederhold, Shirley Tessler, Amar Gupta and David Branson Smith, 2009. “The Valuation of Technology-Based Intellectual Property in Offshoring Decisions,” Communications of the Association for Information Systems (CAIS) 24, May 2009 (see http://ilpubs.stanford.edu:8090/951/2/Article_07-270.pdf): owners and stockholders acknowledge that IP valuation of technological assets is not routine within many organizations. A 2007 study by Micro Focus and INSEAD highlights the current state of affairs: of the 250 chief information officers (CIOs) and chief financial officers (CFOs) surveyed from companies in the U.S., UK, France, Germany, and Italy, less than 50 percent had attempted to value their IT assets, and more than 60 percent did not assess the value of their software.
[32] M.K. Bergman, 2006. “Climbing the Data Federation Pyramid,” AI3:::Adaptive Information blog, May 25, 2006; see  https://www.mkbergman.com/229/climbing-the-data-federation-pyramid/.
Posted:April 25, 2011

Advances in How to Transfer Semantic Technologies to Enterprise Users

For some time, our mantra at Structured Dynamics has been, “We’re successful when we are not needed” [1]. In support of this vision, we have been key developers of an entire stack of semantic technologies useful to the enterprise, the open semantic framework (OSF); we have formulated and contributed significant open source deployment guidance to the MIKE2.0 methodology for semantic technologies in the enterprise, called Open SEAS; we have developed useful structured data standards and ontologies; and we have made a massive number of free how-to documents and images available for download on our TechWiki. Today, we add further to these contributions with our workflows guidance. All of these pieces contribute to what we call the total open solution.

Prior documentation has described the overall architecture, or layered approach, of the open semantic framework (OSF). Those documents are useful, but they lack a practical understanding of how the pieces fit together or how an OSF instance is developed and maintained. This new summary overviews a series of seven different workflows for the various aspects of developing and maintaining an OSF (based on Drupal) [2]. In addition, each workflow section cross-references other key documentation on the TechWiki, as well as pointing to possible tools for conducting each specific workflow.

Overview

Seven different workflows are described, as shown in the diagram below. Each of the workflows is color-coded and related to the other workflows. The basic interaction with an OSF instance tends to occur from left-to-right in the diagram, though the individual parts are not absolutely sequential. As each of the seven specific workflows is described below, it is keyed by the same color-coded portion of the overall workflow.

OSF Workflow

Each of the component workflows is itself described as a series of inter-relating activities or tasks.

Installation Workflow

Installation is mostly a one-time effort and proceeds on a more-or-less sequential basis. As the various components of the stack are installed, they are configured and tested for proper installation. The installation guide is the governing document for this process, with quite detailed scripts and configuration tests to follow. The blue bubbles in the diagram represent the major open source software components: Virtuoso (RDF triple store), Solr (full-text search) and Drupal (content management system).
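As a simple illustration of the “configured and tested” step, the following sketch (Python; the localhost endpoints and default ports are assumptions for this example, and the installation guide remains the authoritative source) merely checks that the three major components respond after installation:

```python
import urllib.request

# Endpoints assume stock defaults (Virtuoso SPARQL on 8890, Solr on 8983, Drupal on 80);
# adjust these to match the actual install.
checks = {
    "Virtuoso (SPARQL endpoint)": "http://localhost:8890/sparql",
    "Solr (search index)":        "http://localhost:8983/solr/",
    "Drupal (front page)":        "http://localhost/",
}

for name, url in checks.items():
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name}: OK (HTTP {resp.status})")
    except Exception as exc:  # connection refused, HTTP error, timeout, etc.
        print(f"{name}: FAILED ({exc})")
```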

Install Workflow

Another portion of this workflow is to set up the tools for the backoffice access and management, such as PuTTY and WinSCP (among others).

See the TechWiki desktop tools document for the tools associated with this workflow sequence.

Configure & Presentation Workflow

One of the most significant efforts in the overall OSF process is the configuration and theming of the host portal, generally based on Drupal. The three major clusters of effort in this workflow are the design of the portal, including a determination of its intended functionality; the setting of the content structure (stubbing of the site map) for the portal; and determining user groups and access rights. Each of these, in turn, is dependent on one or more plug-in modules to the Drupal system. Some of these modules are part of the conStruct series of OSF modules, and others are evaluated and drawn from the more than 8000 third-party plug-in modules to Drupal.

Configure Workflow

The Design aspect involves picking and then modifying a theme for the portal. This may start with one of the existing open source Drupal themes, including those more specifically recommended for OSF. If so, it will likely be necessary to make some minor layout modifications to the PHP code and some CSS (styling) changes. Theming (skinning) of the various semantic component widgets (see below) also occurs as part of this workflow.

The Content Structure aspect involves defining and then stubbing out placeholders for eventual content. Think of this step as creating a site map structure for the OSF site, including the major Drupal definitions for blocks, Views and menus. Some of the entity types are derived from the named entity dictionaries used by a given project.

More complicated User assignments and groups are best handled through a module such as Drupal’s Organic Groups. In any event, determining user groups (such as anonymous, admins, curators, editors, etc.) is a necessary early step, though these may be changed or modified over time. For site functionality, Modules must be evaluated and chosen to add to the core system. Some of these steps and their configuration settings are provided in the guidelines for setting up Drupal document.

None of the initial decisions “locks in” the eventual design and functionality. These may be modified at any time moving forward.

See the TechWiki desktop tools document for the tools associated with this workflow sequence.

Structured Data Workflow

Of course, a key aspect of any OSF instance is the access and management of structured data. There are basically two paths for getting structured data into the system. The first, involving (generally) smaller datasets, is the manual conversion of the source data to one of the pre-configured OSF import formats of RDF, JSON, XML or CSV. These are based on the irON notation; a good case study for using spreadsheets is also available. The second path (bottom branch) is the conversion of internal structured data, often from a relational data store. Various converters and templates are available for these transformations. One excellent tool is FME from Safe Software (representing the example shown, which utilizes a spatial data infrastructure (SDI) data store), though a very large number of options exist for extract, transform and load (ETL). In the latter case, procedures for polling for updates, triggering notice of updates, and extracting only the deltas for the specific information changed can help reduce network traffic and upload/conversion/indexing times.
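As a generic illustration of the first, manual path, the following Python sketch converts a hypothetical CSV export into RDF ready for import. It is not the irON serialization itself (see the irON documentation for that); the file name, columns and vocabulary are invented for the example:

```python
import csv
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

EX = Namespace("http://example.org/vocab/")     # hypothetical vocabulary
BASE = "http://example.org/records/"            # hypothetical record URI base

g = Graph()
g.bind("ex", EX)

# A small export from a relational store, e.g. with columns: id, name, lat, long
with open("facilities.csv", newline="") as f:
    for row in csv.DictReader(f):
        record = URIRef(BASE + row["id"])
        g.add((record, RDF.type, EX.Facility))
        g.add((record, EX.name, Literal(row["name"])))
        g.add((record, EX.latitude, Literal(float(row["lat"]))))
        g.add((record, EX.longitude, Literal(float(row["long"]))))

# Serialize to one of the accepted import formats (RDF/XML in this case).
g.serialize(destination="facilities.rdf.xml", format="xml")
```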

Structured Data Workflow
See the TechWiki desktop tools document for the tools associated with this workflow sequence.

Content Workflow

The structured data from the prior workflow process is then matched with the remaining content needed for the site. This content may be of any form and media (since all are supported by various Drupal modules) but, in general, the major emphasis is on text content. Existing text content may be imported to the portal, or new content can be added via various WYSIWYG graphical editors within Drupal. (The excellent Wysiwyg Drupal module provides an access point to a variety of off-the-shelf, free WYSIWYG editors; we generally use TinyMCE, but multiple editors can be installed simultaneously.) The intent of this workflow component is to complete content entry for the stubs created earlier during the configuration phase. It is also the component used for ongoing content additions to the site.

Content Workflow

Content tagged by the scones tagger is tagged on the basis of the concepts in the domain ontology (see below) and the named entities (as contained in “dictionaries”) used by a given project. Once tagged, this information can be related to the other structured data in the system. Once all of this content is entered into the system, it is available for access and manipulation by the various conStruct modules (see figure above) and semantic component widgets (see below).
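To give a feel for what tagging against a dictionary amounts to, here is a much-simplified, purely illustrative sketch; the actual scones tagger is considerably more capable, and the entity dictionary and URIs below are invented:

```python
import re

# A toy named entity "dictionary"; a real project would draw these from its
# domain ontology and named entity lists.
dictionary = {
    "Lake Winnebago": "http://example.org/entities/LakeWinnebago",
    "walleye":        "http://example.org/concepts/Walleye",
    "water quality":  "http://example.org/concepts/WaterQuality",
}

def tag(text, dictionary):
    """Return (surface form, URI, character position) for each dictionary hit."""
    hits = []
    for surface, uri in dictionary.items():
        for m in re.finditer(re.escape(surface), text, flags=re.IGNORECASE):
            hits.append((m.group(0), uri, m.start()))
    return sorted(hits, key=lambda hit: hit[2])

sample = "Walleye populations in Lake Winnebago depend heavily on water quality."
for surface, uri, pos in tag(sample, dictionary):
    print(f"{pos:3d}  {surface!r:<20} -> {uri}")
```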

See the TechWiki desktop tools document for the tools associated with this workflow sequence.

Ontologies Workflow

Though the next flowchart below appears rather complicated, there are really only three tasks that most OSF administrators need worry about with respect to ontologies:

  1. Adding a concept to the domain ontology (a class) and setting its relationships to other concepts
  2. Adding a dataset attribute (data characteristic) for various dataset records, or
  3. Adding or changing an annotation for either of the above, such as its labels or descriptions.

In actuality, of course, editing, modifying or deleting existing information is also important, but these are easier subsets of the activities and user interfaces behind the basic add (“create”) functions. The OSF interface provides three clean user interfaces for these three basic activities [3]. These basic activities may be applied to the three major governing ontologies in any OSF installation:

  • The domain ontology, which captures the conceptual description of the instance’s domain space
  • The semantic components ontology (SCO), which sets what widgets may display what kinds of data, and
  • irON for the instance record attributes and metadata (annotations).

All of the OSF ontology tools work off of the OWLAPI as the intermediary access point. The ontologies themselves are indexed as structured data (RDF with Virtuoso) or full text (Solr) for various search, retrieval and reasoning activities.
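At the RDF/OWL level, the three basic tasks listed above amount to adding just a handful of statements. The sketch below uses Python and rdflib purely for illustration (OSF itself performs these operations through the OWLAPI and its own user interfaces; the domain namespace and terms are hypothetical):

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS

DOM = Namespace("http://example.org/ontology/")   # hypothetical domain ontology

g = Graph()

# 1. Add a concept (class) and relate it to an existing concept.
g.add((DOM.InvasiveSpecies, RDF.type, OWL.Class))
g.add((DOM.InvasiveSpecies, RDFS.subClassOf, DOM.Species))

# 2. Add a dataset attribute (a data characteristic for instance records).
g.add((DOM.populationEstimate, RDF.type, OWL.DatatypeProperty))
g.add((DOM.populationEstimate, RDFS.domain, DOM.Species))

# 3. Add or change annotations (labels, descriptions).
g.add((DOM.InvasiveSpecies, RDFS.label, Literal("Invasive species", lang="en")))
g.add((DOM.InvasiveSpecies, RDFS.comment,
       Literal("A species introduced outside its native range.", lang="en")))

print(g.serialize(format="turtle"))
```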

Ontologies Workflow

Because of the central use of the OWLAPI, it is also possible to use the Protégé editor/IDE environment against the ontologies, which also provides reasoners and consistency checking.

See the TechWiki desktop tools document for the tools associated with this workflow sequence.

Filter & Select Workflow

The filter and select activities are driven by user interaction, with no additional admin tools required. This workflow is actually the culmination of all of the previous sequences in that it exposes the structured data to users, enables them to slice-and-dice it, and then to view it with a choice of relevant widgets (semantic components). For example, see this animation:

Animated Filtering and Selection Workflow

Considerably more detail and explanation are available for these semantic components.

See the TechWiki desktop tools document for the tools associated with this workflow sequence.

Maintenance Workflow

The ongoing maintenance of an OSF instance is mostly a standard Drupal activity. Major activities that may occur include moderating comments; rotating or adding new content; managing users; and continued documentation of the site for internal tech transfer and training. If the portal embraces other aspects of community engagement (social media), these need to be handled as part of this workflow as well. All aspects of the site and its constituent data may be changed, or added to at any time.

Maintenance Workflow
See the TechWiki desktop tools document for the tools associated with this workflow sequence.

Moving from Here

Total Open Solution

When first introduced in our three-part series, we noted the interlocking pieces that constitute the total open solution of the open semantic framework (OSF) (see right). We also made the point — unfortunately still true today — that the relative maturity and completeness of these components does not yet allow us to fully achieve “We’re successful when we are not needed.”

As a small firm committed to self-funding via revenues, Structured Dynamics can only add to its stable of open source software, develop methodologies and provide documentation on the basis of our client support. Yet, despite our smallness, that client support has enabled us to add aggressively and rapidly to all four components of this total open solution. This newest series of workflow documents (plus some very significant expansions and refinements of the OSF code base) is merely the latest example of this dynamic. Through judicious picking of clients (and vice versa), and our insistence that new work and documentation be open sourced because it has itself benefitted from prior open source, we and our client partners have been making steady progress toward the vision of enterprises being able to adopt and install semantic solutions on their own. Inch-by-inch we are getting there.

The status of our vision today is that we are still needed in most cases to help formulate the implementation plan and then to guide the initial set-up and configuration of the OSF. This support typically includes ontology development, data conversion and overall component integration. While some parties have embraced the OSF code and documentation and are implementing solutions on their own, doing so still requires considerable commitment, knowledge and skills in semantic technologies. The great news about today’s status is that — after initial set-up and configuration — we are now able to transfer the technology to the client and walk away. Tools, documentation, procedures and workflows are now adequate for the client to extend and maintain their OSF instance on their own. This includes a certification process and program for transferring the technology to client staff and assessing their proficiency in using it.

We have been completely open about our plans and our status. In our commitment to this vision of success, much work is still needed on the initial install and configure steps and on the entire area of ontology creation, extension and mapping [4]. We are working hard to bridge these gaps. We welcome additional partners who share with us the vision of complete, turnkey frameworks, including all aspects of total open solutions. Inch-by-inch we are approaching the realization of a vision that will fundamentally change how every enterprise can leverage its existing information assets to deliver competitive advantage and greater value for all stakeholders. You are welcome aboard!


[1] This has been the thematic message on Structured Dynamics‘ Web site for at least two years. The basic idea is to look at open source semantic technologies from the perspective of the enterprise customer, and then to deliver all necessary pieces to enable that customer to install, deploy and maintain the OSF stack on its own. The sentiment has infused our overall approach to technology development, documentation, technology transfer and attention to methodologies.
[2] The first version of this article appeared as Workflow Perspectives on OSF on the OpenStructs TechWiki on April 19, 2011.
[3] The current release of OSF does not yet have these components included; they will be released to the open source SVNs by early summer.
[4] The best summary of the vision for where ontology development needs to head is provided by the Normative Landscape of Ontology Tools article on the TechWiki; see especially the second figure in that document.