Evolution
AI³
Adaptive Information
Adaptive Innovation
Adaptive Infrastructure
a·dap·tive adj. Showing or having a capacity to make fit for new or special situations; flexible; a successful adjustment.

Date:   February 21, 2007

It’s Taken Too Many Years to Re-visit the ‘Deep Web’ Analysis

It’s been seven years since Thane Paulsen and I first coined the term ‘deep Web’, perhaps representing a couple of full generational cycles for the Internet. What we knew then and what “Web surfers” did then has changed markedly. And, of course, our coining of the term and BrightPlanet’s publishing of the first quantitative study on the deep Web did nothing to create the phenomenon of dynamic content itself; we merely gave it a name and helped promote a bit of understanding within the general public of some powerful subterranean forces driving the nature and tectonics of the emerging Web.

The first public release of The Deep Web: Surfacing Hidden Value (courtesy of the Internet Archive’s Wayback Machine), in July 2000, opened with a bold claim:

BrightPlanet has uncovered the "deep" Web — a vast reservoir of Internet content that is 500 times larger than the known "surface" World Wide Web. What makes the discovery of the deep Web so significant is the quality of content found within. There are literally hundreds of billions of highly valuable documents hidden in searchable databases that cannot be retrieved by conventional search engines.

The day the study was released we needed to increase our servers nine-fold to meet news demand after CNN and then 300 major news outlets eventually picked up the story. By 2001 when the University of Michigan’s Journal of Electronic Publishing and its wonderful editor, Judith A. Turner, decided to give the topic renewed thrust, we were able to clean up the presentation and language quite a bit, but did little to actually update many of the statistics. (That version, in fact, is the one mostly cited today.)

Over the years there have been some books published and other estimates put forward, more often citing lower amounts in the deep Web than my original estimates, but, with one exception (see below), none of these were backed by new analysis. I was asked numerous times to update the study, and indeed had even begun collating new analysis at a couple of points, but the effort to complete the work was substantial and the effort always took a back seat to other duties and so was never completed.

Recent Updates and Criticisms

It was thus with some surprise and pleasure that I first found reference yesterday to Dirk Lewandowski’s and Philipp Mayr’s 2006 paper, “Exploring the Academic Invisible Web” [Library Hi Tech 24(4), 529-539], that takes direct aim at the analysis in my original paper. (Actually, they worked from the 2001 JEP version, but, as noted, the analysis is virtually identical to the original 2000 version.) The authors pretty soundly criticize some of the methodology in my original paper and, for the most part, I agree with them.

My original analysis combined a manual evaluation of the “top 60” then-extant Web databases with an estimate of the total number of searchable databases (estimated at about 200,000, which they incorrectly cite as 100,000) and assessments of the mean size of each database based on a random sampling of those databases. Lewandowski and Mayr note conceptual flaws in the analysis at these levels:

  • First, by use of mean database size rather than median size, the size is overestimated,
  • Second, databases whose content is of questionable relevance to their interest in academic content (such as weather records from NOAA or Earth survey data by satellite) skewed my estimates upward, and
  • Third, my estimates were based on database size estimates (in GBs) and not internal record counts.

On the other hand, the authors also criticized my definition of deep content as too narrow, noting that it overlooked certain content types such as PDFs that are now routinely indexed and retrieved on the surface Web. We also have had uncertain but tangible growth in standard search engine content, with the last cited amounts at about 20 billion documents since Google and Yahoo! ceased their war of index numbers.

Though not really offering an alternative, full-blown analysis, the authors use the Gale Directory of Databases to derive an alternative estimate of perhaps 20 billion to 100 billion documents on the deep Web of interest for academic purposes, which they later seem to imply also needs to be discounted by further percentages to get at “word-oriented” and “full-text or bibliographic” records that they deem appropriate.

My Assessment of the Criticisms

As noted, I generally agree with these criticisms. For example, since the time of original publication, we have seen the power-law distribution of most things on the Internet, including popularity and traffic. Such heavily skewed distributions will result in overestimates when calculations are based on means rather than medians. I also think that meaningful content types were both overused (more database-like records) and underused (PDF content that is now routinely indexed) in my original analysis.
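To make the mean-versus-median point concrete, here is a minimal sketch. The database sizes are entirely hypothetical (a Zipf-like pattern chosen for illustration, not data from either study); it only shows how far the two statistics can diverge when a few very large databases dominate.

```python
from statistics import mean, median

# Hypothetical record counts for 100 databases, following a Zipf-like
# (power-law) pattern: the k-th largest database holds 1,000,000 // k records.
sizes = [1_000_000 // rank for rank in range(1, 101)]

mean_size = mean(sizes)      # pulled upward by the few giant databases
median_size = median(sizes)  # size of the "typical" database

print(f"mean   = {mean_size:,.0f} records")
print(f"median = {median_size:,.0f} records")
print(f"mean / median = {mean_size / median_size:.1f}x")
```

In this toy distribution the mean is more than twice the median, so describing a "typical" database by the mean would overstate its size considerably.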

However, the authors’ third criticism is patently wrong, since three different methods were used to estimate internal database record counts and the average sizes of each record they contained. I would also have preferred a more careful reading by the authors of my actual paper, since there are numerous other citations in error and mis-characterizations.

On an epistemological level, I disagree with the authors’ use of the term “invisible Web”, a label that we tried hard in the paper to overturn and that is fading as a current term of art. Internet Tutorials (initially, SUNY at Albany Library) addresses this topic head-on, preferring “deep Web” on a number of compelling grounds, including that “there is no such thing as recorded information that is invisible. Some information may be more of a challenge to find than others, but this is not the same as invisibility.”

Finally, I am not compelled by the authors’ simplistic, alternate partial estimate based solely on the Gale database, but they readily acknowledge not doing a full-blown analysis and having different objectives in mind. I agree with the authors in calling for a full, alternative analysis. I think we all agree that is a non-trivial undertaking and could itself be subject to newer methodological pitfalls.

So, What is the Quantitative Update?

Within a couple of years after the initial publication of my paper, I suspected the “500 times” claim for the greater size of the deep Web in comparison to what is discoverable by search engines may have been too high. Indeed, in later corporate literature and PowerPoint presentations, I backed off the initial 2000-2001 claims and began speaking in ranges from a “few times” to as high as “100 times” greater for the size of the deep Web.

In the last seven years, the only other quantitative study of its kind of which I am aware is documented in the paper, “Structured Databases on the Web: Observations and Implications,” conducted by Chang et al. in April 2004 and published in ACM SIGMOD Record, that estimated 330,000 deep Web sources with over 1.2 million query forms, reflecting a fast 3-7 times increase in 4 years from the date of my original paper. Unlike the Lewandowski and Mayr partial analysis, this effort and others by that group suggest an even larger deep Web than my initial estimates!

The truth is, we didn’t know then — and we don’t know now — what the actual size of the dynamic Web truly is. (And, aside from a sound bite, does it really matter? It is huge by any measure.) Heroic efforts such as these quantitative analyses or the still-more ambitious efforts of UC Berkeley’s SIM School on How Much Information? still have a role in helping to bound our understanding of information overload. As long as such studies gain news traction, they will be pursued. So, what might today’s story look like?

First, the methodological problems in my original analysis remain and (I believe today) resulted in overestimates. Another factor today leading to a potential overestimate of the deep Web v. the surface Web would be the fact that much “deep” content is being more exposed to standard search engines, be it through Google’s Scholar, Yahoo!’s library relationships, individual site indexing and sharing such as through search appliances, and other “gray” factors we noted in our 2000-2001 studies. These factors, and certainly more, act to narrow the difference between exposed search engine content (“surface Web”) and what we have termed the “deep Web.”

However, countering these facts are two newer trends. First, foreign language content is growing at much higher rates and is often under-sampled. Second, blogs and other democratized sources of content are exploding. What these trends may be doing to content balances is, frankly, anyone’s guess.

So, while awareness of the qualitative nature of Web content has grown tremendously in the past near-decade, our quantitative understanding remains weak. Improvements in technology and harvesting can now overcome earlier limits.

Perhaps there is another Ph.D. candidate or three out there that may want to tackle this question in a better (and more definitive) way. According to Chang and Cho in their paper, “Accessing the Web: From Search to Integration,” presented at the 2006 ACM SIGMOD International Conference on Management of Data in Chicago:

On the other hand, for the deep Web, while the proliferation of structured sources has promised unlimited possibilities for more precise and aggregated access, it has also presented new challenges for realizing large scale and dynamic information integration. These issues are in essence related to data management, in a large scale, and thus present novel problems and interesting opportunities for our research community.

Who knows? For the right researcher with the right methodology, there may be a Science or Nature paper in prospect!

Posted on February 21, 2007 at 1:22 pm in Deep Web, Document Assets | Comments (4)
The URI link reference to this post is: http://www.mkbergman.com/343/the-murky-depths-of-the-deep-web/
Date:   September 8, 2006

John Newton (co-founder formerly of Documentum, now of Alfresco) puts a telling marker on the table in his recent post on the Commoditization of ECM. Though noting the term "enterprise content management" did not even exist prior to 1998, he goes on to observe that expansion of the definition of what was appropriate in ECM and the consolidation of the leading players occurred rapidly. He concludes that this process has commoditized the market, with competitive differentiation now based on market size rather than functionality. The platforms from the leading vendors (IBM, Microsoft and EMC-Documentum) can all manage documents, Web content, images, forms and records via basic library services, metadata management, search and retrieval, workflow, portal integration, and development kits.

If such consolidation and standardization of functionality were Newton’s only point, one could say, “ho, hum,” since the same has been true in all major enterprise software markets.

But, in my reading, he goes on to make two more important and fundamental points, both of which existing enterprise software vendors ignore at their peril.

Poor Foundations and Poor Performance

Newton notes that ECM applications are never bought based on the nature of their repositories, but an inefficient repository can result in the rejection of the system. He also acknowledges that ECM installations are costly to set up and maintain, difficult to use, poorly performing and lack essential automation (such as classification). (Kind of sounds like most enterprise software initiatives, doesn’t it?)

Indeed, I have repeatedly documented these gaps for virtually all large-scale document-centric or federated applications. The root cause — besides rampant poor interface designs — has been in my opinion poorly suited data management foundations. Relational or IR-based systems both perform poorly for different reasons in managing semi-structured data. This problem will not be solved by open source per se (see below), though there are some interesting options emerging from open source that may point the way to new alternatives, as well as incipient designs from BrightPlanet and others.

The Proprietary Killers of Open Standards and Open Source

Service-oriented architectures (SOA), the various Web services standards (WS-*), certain JSRs (170 and 283 in documents, but also 168 and others), plus all of the various XML and semantic derivatives are moving rapidly with the very real prospect of “pluggability” and the substitution of various packages, components and applications across the entire enterprise stack.

To quote Newton’s case at Alfresco: by aggregating these existing open source components, they were able to get their ECM product ready in less than one year:

  • Spring – A framework that provides the wiring of the repository and the tools to extend capabilities without rebuilding the repository (Aspect-Oriented Programming)
  • Hibernate – An object-relational mapping tool that stores content metadata in database and handles all the idiosyncrasies of each SQL dialect
  • Lucene – An internet-scale full-text and general purpose information retrieval engine that supports federated search, taxonomic, XML and full-text search
  • EHCache – Distributed intelligent caching of content and metadata in a loosely coupled environment
  • jBPM – A full featured enterprise production workflow and business process engine that includes BPEL4WS support
  • Chiba – A complete XForms interface that can be used for the configuration and management of the repository
  • OpenOffice – Provides a server-based and Linux-compatible transformation of MS Office based content
  • ImageMagick – Supports transformation and watermarking of images.

Moreover, the combination of these components led to an inherent architecture including pluggable modules, rules and templating engines, workflow and business process management, security, and other enterprise-level capabilities. In prior times, I estimate that a proprietary vendor would have needed ten times the effort or more to accomplish this.
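The "pluggable modules" idea can be illustrated with a small sketch. This is not Alfresco's code or any of the listed libraries' APIs; the `SearchIndex` contract, `InMemoryIndex` and `Repository` names are hypothetical and exist only to show how a repository can depend on a contract rather than on one particular component.

```python
from __future__ import annotations

from typing import Protocol


class SearchIndex(Protocol):
    """Hypothetical contract that any search component must satisfy."""

    def add(self, doc_id: str, text: str) -> None: ...

    def search(self, term: str) -> list[str]: ...


class InMemoryIndex:
    """Toy stand-in for a full-text engine (an assumption, not Lucene's API)."""

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}

    def add(self, doc_id: str, text: str) -> None:
        self._docs[doc_id] = text.lower()

    def search(self, term: str) -> list[str]:
        # Case-insensitive substring match, results in sorted order.
        return sorted(d for d, t in self._docs.items() if term.lower() in t)


class Repository:
    """Content repository that knows only the SearchIndex contract."""

    def __init__(self, index: SearchIndex) -> None:
        self._index = index

    def store(self, doc_id: str, text: str) -> None:
        self._index.add(doc_id, text)

    def find(self, term: str) -> list[str]:
        return self._index.search(term)


repo = Repository(InMemoryIndex())
repo.store("memo-1", "Quarterly ECM budget")
repo.store("memo-2", "Holiday schedule")
hits = repo.find("ecm")
print(hits)
```

Swapping in a different engine would mean writing another class with the same two methods; `Repository` itself would not change, which is the substitution property the standards above aim for.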

Similar Trends and Challenges in the Entire Enterprise Space

Newton is obviously well placed to comment on these trends within ECM. But similar trends can be seen in every major enterprise software space. For virtually every component one can imagine, there is a very capable open source offering. Many of the newer open source ventures are indeed centered around aggregating and integrating various open source components followed by either dual-source licensing or support services as the basis of their business models. At its most extreme, this trend has expanded to the whole process of enterprise application integration (EAI) itself through offerings such as LogicBlaze FUSE with its SOA-oriented standards and open source components. Initiatives such as SCA (service component architecture) will continue to fuel this trend.

So, enterprise software vendors, listen to your wake-up call. It is as if gold doubloons, pearls and jewels are lying all over the floor. If you and your developers don’t take the time to bend over and pick them up, someone else will. As Joel Mokyr has compellingly researched, the innovation of systems or how to integrate pieces can be every bit as important as the ‘Aha!’ discovery. Open source is now giving a whole new breed of bakers new ingredients for baking the cake.

Posted on September 8, 2006 at 10:50 am in Adaptive Information, Document Assets, Open Source, Software and Venture Capital Comments Off
The URI link reference to this post is: http://www.mkbergman.com/277/the-commoditization-of-content-software/
Date:   April 4, 2006

Author's Note: An earlier blog series by me has now been turned into a PDF white paper under the auspices of BrightPlanet Corp. The citation for this effort is:

M.K. Bergman, "Why Are $800 Billion in Document Assets Wasted Annually?" BrightPlanet Corporation White Paper, April 2006, 27 pp.

Click here to obtain a PDF copy of this full report (27 pp, 203 KB)

It is a tragedy of no small import when $800 billion in readily available savings from creating, using and sharing documents is wasted in the United States each year. How can waste of such magnitude occur right before our noses? And how can this waste occur so silently, so insidiously, and so ubiquitously that none of us can see it?

This free white paper attempts to address these questions. This report is the result of a series of posts in response to an earlier white paper I authored under BrightPlanet sponsorship entitled, Untapped Assets: The $3 Trillion Value of U.S. Enterprise Documents. [1]

This full report integrates information from earlier blog postings.

Public and enterprise expenditures to address the wasted document assets problem remain comparatively small, with growth in those expenditures flat in comparison to the rate of document production. This report attempts to bring attention and focus to the various ways that technology, people, and process can bring real document savings to our collective pocketbooks.


[1] Michael K. Bergman, "Untapped Assets: The $3 Trillion Value of U.S. Enterprise Documents," BrightPlanet Corporation White Paper, July 2005, 42 pp. The paper contains 80 references, 150 citations, and many data tables.

Posted on April 4, 2006 at 10:29 am in Adaptive Information, Document Assets, Information Automation Comments Off
The URI link reference to this post is: http://www.mkbergman.com/197/full-report-why-are-800-billion-in-document-assets-wasted-annually/
Date:   March 26, 2006

It is a tragedy of no small import when $800 billion in readily available savings from creating, using and sharing documents is wasted in the United States each year. How can waste of such magnitude  — literally equivalent to almost 8% of gross domestic product or more than 40% of what the nation spends on health care [1] — occur right before our noses? And how can this waste occur so silently, so insidiously, and so ubiquitously that none of us can see it?
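As a rough consistency check on those two comparisons, the arithmetic can be sketched from figures quoted elsewhere in this series: the $1.9 trillion health care figure from footnote [1], and the statement that $3.3 trillion in document value is about a third of GDP. Treating "about a third" as exactly one third is an assumption made here for illustration, so the GDP share is only approximate.

```python
# Figures quoted in this series (U.S. dollars, trillions).
waste = 0.8            # annual document waste
health_care = 1.9      # 2004 national health spending, footnote [1]
document_value = 3.3   # value of documents, stated as about a third of GDP

# Assumption: "about a third" is taken as exactly 1/3 to back out GDP.
gdp_estimate = document_value * 3

share_of_health = waste / health_care
share_of_gdp = waste / gdp_estimate

print(f"{share_of_health:.0%} of health care spending")
print(f"{share_of_gdp:.1%} of estimated GDP")
```

Both ratios land close to the figures cited in the text, within the looseness of the "about a third" assumption.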

Let me repeat. The topic is $800 billion in annual waste in the U.S. alone, perhaps equivalent to as much as $3 trillion globally, that can be readily saved each year with improved document management and use. Achieving these savings does not require Herculean efforts, simply focused awareness and the application of best practices and available technology. As the T.D. Waterhouse commercial says, “You can do this.”

This entry concludes a series of posts resulting from an earlier white paper I authored under BrightPlanet sponsorship. Entitled, Untapped Assets: The $3 Trillion Value of U.S. Enterprise Documents,[2] that paper documented via many references and databases the magnitude of the poor use of document assets within enterprises. The paper was perhaps the most comprehensive look to date at the huge expenditures document creation and use occupy within our modern knowledge economy, and first quantified the potential $800 billion annual savings in overcoming readily identifiable waste.

Simply documenting the magnitude of expenditures and savings was mind-blowing. But what actually became more perplexing was why the scope of something so huge and so amenable to corrective action was virtually invisible to policy or business attention. The vast expenditures and potential savings surfaced in the research quite obviously begged the question: Why is no one seeing this?

I then began this series to look at why document use savings may fit other classes of “big” problems such as high blood pressure as a silent killer, global warming from odorless and colorless greenhouse gases, or the underfunding of cost-effective water systems and sanitation by international aid agencies. There seems to be something more difficult involving ubiquitous problems with broadly shared responsibilities.

The series began in October of last year and concludes with this summary.  Somehow, however, I suspect the issues touched on in this series are still poorly addressed and will remain a topic for some time to come.

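The series looked at four major categories:

  • Part I – Private v. public information
  • Part II – Barriers to collaboration
  • Part III – Solutions as being too expensive
  • Part IV – Issues of attention, perception or psychology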

This summary wraps up the series.

I can truthfully conclude that I really haven’t yet fully put my finger on the compelling reason(s) as to why broad, universal problems such as document use and management remain a low priority and have virtually no visibility despite the very real savings that current techniques and process can bring. But I think some of the relevant factors are covered in these topics.

The arguments in Part I are pretty theoretical. They firstly ask if it is in the public interest to strive for improvements in “information” efficiency, some of which may be applicable to the private sector with possible differentials in gains. They secondly question the rhetoric of “information overload” that can lead to a facile resignation about whether the whole “information” problem can be meaningfully tackled. One dog that won’t hunt is the claim that computers intensify the information problem of private gain v. societal benefit because now more stuff can be processed. Such arguments are diversions that obfuscate deserved and concentrated public policy that can bring real public benefits  — and soon. Why else do we not see tax and economic policies that can enrich our populace by hundreds of billions of dollars annually?

Part II argues that barriers to collaboration, many cultural but others social and technical, help to prevent a broader consensus about the importance of document reuse (read: “information” and “knowledge”). Document reuse is likely the single largest reservoir of potential waste reductions. One real problem is the lack of top leadership within the organization to encourage collaboration and efficiencies in document use and management through appropriate training and rewards, and commitments to install effective document infrastructures.

Part III re-visits prior failings and high costs in document or content initiatives within the enterprise. Perceptions of past difficulties color the adoption of new approaches and technologies. The lack of standards, confusing terminology, some failed projects, immaturity of the space, and the absence as yet of a dominant vendor have prevented more widespread adoption of what are clearly needed solutions to pressing business content needs. There are no accepted benchmarks by which to compare vendor performance and costs. Document use and management software can be considered to be at a point similar to where structured data was 15 years ago, at the nascent emergence of the data warehousing market. Growth in this software market will require substantial improvements in TCO and scalability, along with a general increase in awareness of the magnitude of the problem and available means to solve it.

Part IV looks at what might be called issues of attention, perception or psychology. These factors are limiting the embrace of meaningful approaches to improve document access and use and to achieve meaningful cost savings. Document intelligence and document information automation markets still fall within the category of needing to “educate the market.” Since this category is generally dreaded by most venture capitalists (VCs), that perception is also acting to limit the financing of fresh technologies and entrepreneurship.

The conclusion is that public and enterprise expenditures to address the wasted document assets problem remain comparatively small, with growth in those expenditures flat in comparison to the rate of document production. Hopefully, this series   — plus, also hopefully, ongoing dialog and input from the community  — can continue to bring attention and focus to the various ways that technology, people, and process can bring real document savings to our collective pocketbooks.


[1] According to the U.S. Dept of Health and Human Services, the nation spent $1.9 trillion on health care in 2004; see http://www.cms.hhs.gov/NationalHealthExpendData/02_NationalHealthAccountsHistorical.asp#TopOfPage.

[2] Michael K. Bergman, “Untapped Assets: The $3 Trillion Value of U.S. Enterprise Documents,” BrightPlanet Corporation White Paper, July 2005, 42 pp. The paper contains 80 references, 150 citations, and many data tables.

NOTE: This posting concludes a series looking at why document assets are so poorly utilized within enterprises.  The magnitude of this problem was first documented in a BrightPlanet white paper by the author titled, Untapped Assets:  The $3 Trillion Value of U.S. Enterprise Documents.  An open question in that paper was why more than $800 billion per year in the U.S. alone is wasted and available for improvements, but enterprise expenditures to address this problem remain comparatively small and with flat growth in comparison to the rate of document production.  This series is investigating the various technology, people, and process reasons for the lack of attention to this problem.

Posted on March 26, 2006 at 9:46 pm in Adaptive Information, Document Assets, Information Automation Comments Off
The URI link reference to this post is: http://www.mkbergman.com/138/why-are-800-billion-in-document-assets-wasted-annually-v-summary/
Date:   December 5, 2005

Earlier posts have noted the near-term importance of the federal government to integrated document management and integration of open source software.  A recent article by Darryl K. Taft of eWeek.com titled "GSA Modernizes With Open-Source Stack" indicates the lead role the General Services Administration will play, at least on the civilian side of the government.  According to the article:

George Thomas, a chief architect at the General Services Administration, said the GSA is leading the effort to deliver an OSERA (Open Source eGov Reference Architecture) that will feature foundational technologies such as MDA (Model Driven Architecture), an ESB (enterprise service bus), an SOA (service-oriented architecture) and the Semantic Web, among other things.

OSERA deserves close tracking as the federal government implements these standards. GSA has set up a Web site on OSERA that is still awaiting content.

Posted on December 5, 2005 at 9:55 am in Adaptive Information, Document Assets, Open Source, Semantic Web Comments Off
The URI link reference to this post is: http://www.mkbergman.com/172/gsa-open-source-initiative/
Date:   November 23, 2005

In earlier posts I have put forward a vision for the semantic Web in the enterprise that has an extensible database supporting semi-structured data at its core with XML mediating multiple ingest feeds, interaction with analytic tools, and sending results to visualization and reporting tools.

This is well and good as far as it goes.  However, inevitably, whenever more than one tool or semi-structured dataset is added to a system, it brings with it a different “view” of the world.  Formalized and standardized protocols and languages are needed to both:  1) capture these disparate “views” and 2) provide facilities to map them to resolve data and schema federation heterogeneities.  These are the roles of RDF and OWL.
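The two roles described above (capturing each "view" and mapping between them) can be sketched in a few lines. This is a simplified illustration using plain tuples rather than a real RDF library or OWL reasoner; the `crm:`, `hr:` and `shared:` vocabularies and the sample data are hypothetical.

```python
# Triples are (subject, predicate, object). The "crm:" and "hr:" vocabularies
# are hypothetical, standing in for two tools with different views of the world.
crm_view = [("person:42", "crm:fullName", "Ada Lovelace")]
hr_view = [
    ("person:42", "hr:name", "Ada Lovelace"),
    ("person:42", "hr:dept", "Analytics"),
]

# Mapping statements in the spirit of owl:equivalentProperty: each source
# predicate is declared equivalent to one predicate in a shared vocabulary.
equivalent_property = {
    "crm:fullName": "shared:name",
    "hr:name": "shared:name",
    "hr:dept": "shared:department",
}


def federate(*views):
    """Rewrite predicates to the shared vocabulary and merge duplicate triples."""
    merged = set()
    for view in views:
        for subject, predicate, obj in view:
            shared = equivalent_property.get(predicate, predicate)
            merged.add((subject, shared, obj))
    return sorted(merged)


federated = federate(crm_view, hr_view)
for triple in federated:
    print(triple)
```

Once both name predicates map to the same shared term, the two statements about the person's name collapse into one. That is the kind of schema heterogeneity a mapping layer is meant to resolve, though a production system would express the mappings in OWL itself rather than in a lookup table.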

Fortunately, there is a very active community with tools and insights for working in RDF and OWL.  Stanford and UMBC are perhaps the two leading centers of academic excellence.

If you are not generally familiar with this stuff, I recommend you begin with the recent “Order from Chaos” from Natalya Noy of the Protégé group at Stanford Medical.  This piece describes issues like trust, etc., that are likely not as relevant to application of the semantic Web to enterprise intranets as they are to the cowboy nature of the broader Internet.  However, much else of this article is of general use to the architect considering enterprise applications.

To keep things simple and to promote interoperability, a critical aspect of any enterprise semantic Web implementation will be providing the “data API” (including extensible XML, and RDF and OWL) standards that govern the rules of how to play in the sandbox. Time spent defining these rules of engagement will pay off in spades in relation to any other approach for multiple ingest, multiple analytic tools and multiple audiences, reports and collaboration.

Another advantage of this approach is the existence of many open source tools for managing such schemas (e.g., Protégé) and visualization (literally dozens), among thousands of ontologies and other intellectual property.

Posted on November 23, 2005 at 12:55 pm in Document Assets, Information Automation, Semantic Web Comments Off
The URI link reference to this post is: http://www.mkbergman.com/165/more-semantic-web-in-the-enterprise/
Date:   November 15, 2005

Today, the value of the information contained within documents created each year in the United States represents about a third of total gross domestic product, or an amount of about $3.3 trillion.[1] Moreover, about $800 billion of these expenditures are wasted and are readily recoverable by businesses, but are not. Up to 80% of all corporate information is contained within documents. Perhaps up to 35% of all company employees in the U.S. can be classified as knowledge workers using and relying on documents. So, given these factors, how could such large potential cost savings from better document use be overlooked?

Previous installments in this series have looked at issues of private v. public information, barriers to collaboration, and solutions as being too expensive as possible reasons for why these potential savings are not realized. This fourth installment looks at a fourth reason; namely, what might be called issues of attention, perception or psychology. Interesting observations in this area come from disciplines as diverse as sales, behavioral psychology, economics and operations research.

The SPIN Rationale

One explanation for this lack of attention can be described by the fact that document problems are still in the area of implicit needs as opposed to explicit needs. In other words, the perception of the problem is still situational but has not yet become concrete in terms of bottom-line impacts.

In Neil Rackham’s SPIN sales terminology (Situation, Problem, Implication, Need-payoff),[2] the enterprise document market is still at a “situational” level of understanding. Decisions to buy or implement solutions are largely strategic and limited to early adopters that are the visionaries in their market segments. The inability to express and quantify the implications of not realizing the value of document assets means that ROI analysis cannot justify a deployment and market growth cannot cross the chasm.

The situation begins with the inability to quantify the importance of both internal and external document assets to all aspects of the enterprise’s bottom line. Early adopters of enterprise content software typically capture less than 1% of valuable internal documents available; large enterprises are witnessing the proliferation of internal and external Web sites, sometimes exceeding thousands; use of external content is presently limited to Internet search engines, producing non-persistent results and no capture of the investment in discovery or results; and “deep” content in searchable databases, which is common to large organizations and represents 90% of external Internet content, is completely untapped. Indeed, the issue of poor document use in an organization can be seen in terms of the figure below:

The diagram indicates that these root conditions or situations cause problems in low quality of decisions or low staff productivity. For example, documents or proposals get duplicated without knowledge of prior effort that could be leveraged; opportunities are missed; or outdated or incomplete information is applied to various tasks. These root problems can impact virtually all aspects of the organization’s operations: sales are lost; competitors are overlooked; compliance requirements are missed. These problems can lead to significant bottom-line implications, from revenue and market share to reputation, valuation and even survival.

Thus, in the view of the SPIN model, the lack of attention to the issue of document assets can, in part, be ascribed to the sales or investigatory process. Specific questions have not been posed that move the decision maker from a position of situational awareness to one of explicit bottom-line implications.

There is undoubtedly truth to this observation. Sales of large document solutions to enterprises require a consultative sales approach and significant education of the market. As a first-order circumstance, this implies long sales lead times and the dreaded “educating the market” that most VCs try to avoid.

But there are even larger factors at play than a lack of explicitness regarding document assets.

The Ubiquitous and Obvious Are Often Overlooked

Put your index finger one inch from your nose. That is how close, and how unfocused, document importance is to an organization. Documents are the salient reality of a knowledge economy, but like your finger, documents are often too close, ubiquitous and commonplace to appreciate.

The dismissal of the ubiquitous, common or obvious can be seen in a number of areas. In terms of R&D and science, this issue has been termed “mundane science,” wherein most academic research topics exclude many of the issues that affect the largest number of people or have the most commonality.[3] In organizational and systems research, such issues have also been the focus of better, more rigorous problem identification and analysis techniques such as the “rational model” or the “theory of constraints” (TOC).[4]

Compounding the issue of the overlooked obvious is the lack of a quantified understanding of the problem. There is an old Chinese saying that roughly translated is “what cannot be measured, cannot be improved.” Many corporate executives surely believe this to be the case for document creation and productivity.

More Specifically: Bounded Awareness

Chugh and Bazerman have recently coined a term “bounded awareness” for the phenomenon of missing easily observed and relevant data.[5] As they explain:

“Bounded awareness is a phenomenon that encompasses a variety of psychological processes, all of which lead to the same error: a failure to see, seek, use, or share important and relevant information that is easily seen, sought, used, or shared.”

The authors note the experiments from Simons[6] that extend Neisser’s 1979 video in which a person in a gorilla costume walks through a basketball game, thumping his chest, and is clearly and comically visible for more than five seconds, but is not generally recalled by observers without prompting.

Chugh and Bazerman classify a number of these phenomena, with two most applicable to the document assets problem:

  • Inattentional blindness: the failure to see directly visible information when attention is drawn or focused elsewhere
  • System neglect: this phenomenon is the tendency to undervalue a broader, pivotal factor relative to subsidiary ones, as in, for example, the effect of campaign finance reform on specific political issues. In the document assets case, the general role of document access and management is neglected as a system in favor of more readily understood specific issues such as search or spell checking. In other words, people tend to value issues that are more clearly seen as end states or outcomes.

Note the relation of these studies by behavioral psychologists to the SPIN terminology of the sales executive. Clearly, perceptual studies by scientists will lead to better understandings of market outreach.

Perceptions of Intractability?

An earlier installment in this series noted the high cost of enterprise content solutions, more generally linked to software that performed poorly and did not scale. In computer science, intractable problems are those that take too long to execute, that may not be computable, or that we may not know how to solve (e.g., problems in artificial intelligence). Tractable problems can run in a reasonable amount of time for even very large amounts of input data. Intractable problems require huge amounts of time for even modest input sizes.[7]

At low scales, the efficiency of various computer algorithms is not terribly important because multiple methods can produce acceptable performance times. But at large scales, whether a problem is tractable is not fixed: it depends critically on the efficiency of the algorithm applied to the problem. Let’s take for example the issue of searching text items:

Take n to represent the number of keys in a list, and let O represent the order of the number of comparison operations required to find an entry. For a small number of items, the algorithm used is unimportant, and even a slow sequential search will work well. Sequentially searching the list until the desired match is found is O(n), or linear time. If there are 1000 items in a list, and there is an equal probability of searching for any item in the list, on average it will require about n/2 = 500 comparisons to find the item (assuming the item is already on the list). A binary search, which requires a sorted list, works by halving the remaining portion of the list after each comparison. This is logarithmic time, O(log n), much faster than linear time. For the 1000-item example it works out to about 10 comparisons. An O(1) operation, such as hashing, is applicable when some algorithm computes the item location and then retrieves it. On large lists it will significantly outperform a binary search, because it computes the location rather than searching for it. (It is a little more complicated than that because there may be collisions for the same address computed for different keys.) However, if the location is already known, even the hashing computation is unnecessary. This is what happens with direct addressing (the technique used by BrightPlanet), which will obtain the desired item in a single step.[8]
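The comparison counts above can be checked with a short sketch. This is an illustration only, written in Python with made-up function and variable names; it is not BrightPlanet's implementation. A Python dict stands in for hashing, and a plain list index stands in for direct addressing.

```python
def sequential_search(keys, target):
    """Scan left to right; return (index, comparisons made)."""
    comparisons = 0
    for index, key in enumerate(keys):
        comparisons += 1
        if key == target:
            return index, comparisons
    return -1, comparisons


def binary_search(sorted_keys, target):
    """Halve the remaining range on each probe; return (index, probes made)."""
    low, high = 0, len(sorted_keys) - 1
    probes = 0
    while low <= high:
        middle = (low + high) // 2
        probes += 1
        if sorted_keys[middle] == target:
            return middle, probes
        if sorted_keys[middle] < target:
            low = middle + 1
        else:
            high = middle - 1
    return -1, probes


keys = list(range(1000))  # 1000 sorted, distinct keys (illustrative data)

# Average cost of a sequential search when every key is equally likely.
average_sequential = sum(sequential_search(keys, k)[1] for k in keys) / len(keys)

# Worst-case number of probes for a binary search over the same keys.
worst_binary = max(binary_search(keys, k)[1] for k in keys)

# Hashing: a dict computes the slot from the key instead of searching for it.
slot_of = {key: index for index, key in enumerate(keys)}

# Direct addressing: the position is already known, so retrieval is one step.
known_position = 737
item = keys[known_position]
```

With these 1000 keys the sequential average comes to 500.5 comparisons and the binary search never needs more than 10 probes, in line with the figures in the text.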

Poorly performing algorithms at large scales can require processing times for updates that take longer than the period between updates, and, thus, at least for that algorithm, are intractable at those scales.
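A rough sketch of that threshold follows. The throughput figure and the daily update interval are hypothetical numbers chosen only for illustration, as is the assumption that refreshing an index costs one lookup per document.

```python
import math

# Hypothetical figures, chosen only to illustrate the argument.
COMPARISONS_PER_SECOND = 10_000_000
UPDATE_INTERVAL_SECONDS = 24 * 60 * 60  # assume the index is refreshed daily


def sequential_cost(n_docs):
    """Average comparisons per lookup for a sequential search."""
    return n_docs / 2


def binary_cost(n_docs):
    """Approximate comparisons per lookup for a binary search."""
    return math.log2(n_docs)


def update_seconds(n_docs, cost_per_lookup):
    """Time to refresh the index, assuming one lookup per document."""
    return n_docs * cost_per_lookup(n_docs) / COMPARISONS_PER_SECOND


def fits_in_interval(n_docs, cost_per_lookup):
    """True if the update finishes before the next one is due."""
    return update_seconds(n_docs, cost_per_lookup) <= UPDATE_INTERVAL_SECONDS
```

Under these assumptions a sequential approach still finishes a one-million-document update within the day, but at ten million documents it no longer does, while the binary approach finishes in seconds. The problem itself has not changed; only the algorithm's cost at that scale has.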

This is one of the key perceived problems with most document processing software at large scales: computational inefficiencies do not allow updates to occur for the document volumes important to larger organizations. Whether or not the specific reasons are known by company managers and IT personnel, this is a widespread understanding within the marketplace, and it is correct for most vendors.

Since BrightPlanet’s core index work engine is more efficient than other approaches (due, in part, to better sorting mechanisms as noted above, but also due to other factors), current perceived limits of intractability may not apply. However, these advances are still not generally known. Until there is a broader understanding of more contemporary approaches to document use and management, perceptions of past poor performance will limit market acceptance.

Educating the Market

Thus, factors of awareness, attention and perception are also limiting the embrace of meaningful approaches to improve document access and use and achieve meaningful cost savings. These challenges may mean that the document intelligence and document information automation markets still fall within the category of needing to “educate the market.” Since this category is generally dreaded by most venture capitalists (VCs), that perception is also acting to limit the achievable improvements and cost savings available to this market.

But there is perhaps a very important broader question that remains open here: educating the market through the individual customer (viz. the SPIN sale) vs. educating the market through breaking market-wide bounded awareness. In fact the latter, much as what occurred with data warehousing 15-20 years ago, can create entirely new markets. This latter category should perhaps be of much greater VC interest with its accompanying potential for first-mover advantage.


[1] Michael K. Bergman, “Untapped Assets: The $3 Trillion Value of U.S. Enterprise Documents,” BrightPlanet Corporation White Paper, July 2005, 42 pp. All 80 references, 150 citations and calculations are fully documented in the full paper. See http://www.brightplanet.com/technology/whitepapers.asp.

[2] Neil Rackham, SPIN Selling, McGraw Hill, 197 pp., 1988.

[3] Daniel M. Kammen and Michael R. Dove, “The Virtues of Mundane Science,” Environment, Vol. 39 No. 6, July/August 1997. See http://ist-socrates.berkeley.edu/~rael/Mundane_Science.pdf

[4] Victoria Mabin, “Goldratt’s ‘Theory of Constraints’ Thinking Processes: A Systems Methodology linking Soft with Hard,” The 17th International Conference of The System Dynamics Society and the 5th Australian & New Zealand Systems Conference, July 20 – 23, 1999, Wellington, New Zealand, 12 pp. See http://www.systemdynamics.org/conf1999/PAPERS/PARA104.PDF

[5] Dolly Chugh and Max Bazerman, “Bounded Awareness: What You Fail to See Can Hurt You,” Harvard Business School Working Paper #05-037, 35 pp., August 25, 2005 revision. See http://www.people.hbs.edu/mbazerman/Papers/05-037.pdf

[6] See the various demos available at http://viscog.beckman.uiuc.edu/djs_lab/demos.html.

[7] Professor Constance Royden, College of the Holy Cross, course outline for CSCI 150, Tractable and Intractable Problems, Spring 2003. See http://mathcs.holycross.edu/~croyden/csci150spr03/notes/lec33_tractable.html

[8] R. L. Kruse, Data Structures and Program Design, Prentice Hall Press, Englewood Cliffs, New Jersey, 1987.

NOTE: This posting is part of a series looking at why document assets are so poorly utilized within enterprises.  The magnitude of this problem was first documented in a BrightPlanet white paper by the author titled, Untapped Assets:  The $3 Trillion Value of U.S. Enterprise Documents.  An open question in that paper was why more than $800 billion per year in the U.S. alone is wasted and available for improvements, but enterprise expenditures to address this problem remain comparatively small and with flat growth in comparison to the rate of document production.  This series is investigating the various technology, people, and process reasons for the lack of attention to this problem.

Posted on November 15, 2005 at 11:55 am in Adaptive Information, Document Assets, Information Automation | Comments (2)
The URI link reference to this post is: http://www.mkbergman.com/137/why-are-800-billion-in-document-assets-wasted-annually-iv-the-problem-is-too-close-for-focus/
The URI to trackback this post is: http://www.mkbergman.com/137/why-are-800-billion-in-document-assets-wasted-annually-iv-the-problem-is-too-close-for-focus/trackback/
Date:   October 26, 2005

As noted by the Nobel laureate economist Herbert Simon more than 30 years ago:[1]

What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of sources that might consume it. . . . The only factor becoming scarce in a world of abundance is human attention.

Spiraling document growth combined with the universal migration of digital information to the Internet has come to be known by the terms “infoglut” or “information overload.” The issue, of course, is not simply massive growth, but more importantly the ability to find the right information at the right time to make actionable decisions.

Document assets are poorly utilized at all levels and within all departments within enterprises. The magnitude of this problem was first documented in a BrightPlanet white paper titled, Untapped Assets: The $3 Trillion Value of U.S. Enterprise Documents. An open question in that paper was why nearly $800 billion per year in the U.S. alone is wasted and available for improvements, but enterprise expenditures to address this problem remain comparatively small and with flat growth in comparison to the rate of document production.

Earlier parts in this series addressed whether the root causes of this poor use were due to the nature of private v. public information or due to managerial and other barriers to collaboration. This part investigates whether high software and technology costs matched with poor performance is a root cause.

The Document Situation Within U.S. Enterprises

Document creation represents about $3.3 trillion in annual costs to U.S. enterprises, or about 30% of gross national product, $800 billion of which can be reclaimed through better access, recall and use of these intellectual assets. For the largest U.S. firms, annual benefits from better document use average about $250 million per firm.[2]

Perhaps at least 10% of an enterprise’s information changes on a monthly basis.[3] A 2003 UC Berkeley study on “How Much Information?” estimated that more than 4 billion pages of internal office documents with archival value are generated annually in the U.S. The percentage of unstructured (document) data to the total amount of enterprise data is estimated at 85% and growing.[4] Year-on-year office document growth rates are on the order of 22%.[2]

Based on these averages, a ‘typical’ document may cost on the order of $380 to create.[5] Standard practice suggests it may cost on average $25 to $40 per document simply for filing.[6] Indeed, labor costs can account for up to 30% of total document handling costs.[7] Of course, a “document” can vary widely in size, complexity and time to create, and therefore its individual cost and value will vary widely. An invoice generated from an automated accounting system could be a single page and be produced automatically in the thousands; proposals for very large contracts can take tens of thousands or even millions of dollars to create.

According to a Coopers & Lybrand study in 1993, 90 percent of corporate memory exists on paper.[8] A Xerox Corporation study commissioned in 2003 and conducted by IDC surveyed 1000 of the largest European companies and had similar findings:[9],[10]

  • On average 45% of an executive’s time was spent dealing with documents
  • 82% believe that documents were crucial to the successful operation of their organizations
  • A further 70% claimed that poor document processes could impact the operational agility of their organizations
  • While 83%, 78% and 76% consider faxes, email and electronic files as documents, respectively, only 48% and 46% categorize web pages and multimedia content as such.

Significantly, 90 to 97 percent of the corporate respondents to the Coopers & Lybrand and Xerox studies, respectively, could not estimate how much they spent on producing documents each year. Almost three quarters of them admit that the information is unavailable or unknown to them.

These statistics apply to the perhaps 20 million knowledge workers within US firms (though other estimates have ranged as high as 40 million).[11], [12] Of this number, perhaps nearly one million have job responsibilities solely devoted to content management. In the largest firms, there are likely 300 employees or more whose sole responsibility is content management.

The High Cost of Searching and Organizing

The average knowledge worker spends 2.3 hrs per day — or about 25% of work time — searching for critical job information, with 60% saying search is a difficult process, made all the more difficult without a logical organization to content.[3] A USC study reported that typically only 32% of employees in knowledge organizations have access to good information about technical developments relevant to their work, and 79% claim they have inadequate information about what their competitors are doing.[13]

According to the Gartner Group, the average enterprise spends from 60 to 70% of its application development budget creating ways to access disparate data, importantly including documents.[14] IDC estimates that enterprises employing 1,000 knowledge workers may waste well over $6 million per year each in searching for information that does not exist, failing to find information that does, or recreating information that could have been found but was not.[15] As that report stated, “It is simply impossible to create knowledge from information that cannot be found or retrieved.”

Forrester reported in 2002 that 54% of Global 3500 companies relied at that time on homegrown systems to manage content.[16] One vendor cites national averages as indicating that most organizations spend from 5% to 10% of total company revenue on handling documents;[7] Cap Ventures suggests these ranges may be as high as 6% to 15%, with the further observation that 85% of all archived documents never leave the filing cabinet.[6]

An A.T. Kearney study sponsored by Adobe, EDS, Hewlett-Packard, Mayfield and Nokia, published in 2001, estimated that workforce inefficiencies related to content publishing cost organizations globally about $750 billion. The study further estimated that knowledge workers waste between 15% and 25% of their time in non-productive document activities.[17]

Delphi Group’s research points to the lack of organized information as the number one problem in the opinion of business professionals. More than three-quarters of the surveyed corporations indicated that a taxonomy or classification system for documents is imperative or somewhat important to their business strategy; more than one-third of firms that classify documents still use manual techniques.[6]

So, how does an enterprise proceed to place its relevant documents into a hierarchically organized taxonomy or subject tree? The conventional approach taken by most vendors separates the process into two steps. First, each document is inspected and then “metatagged” with relevant words and concepts specific to the enterprise’s view of the world. The actual labels for the tags are developed from an ontology or the eventual taxonomic structure in which the documents will get placed.[18] Second, these now-tagged documents are evaluated on the basis of the tags against the subject tree to conduct the actual placements. But, as noted below, this approach is extremely costly and does not scale.
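The two steps can be sketched in a few lines. This is a toy illustration and not any vendor's method: the subject tree, the tag vocabulary and the sample document are all hypothetical, and real systems rely on far richer tagging than simple word matching.

```python
# Hypothetical subject tree: node path -> tags that qualify a document for it.
SUBJECT_TREE = {
    "Finance/Invoices": {"invoice", "payment"},
    "Sales/Proposals": {"proposal", "bid"},
}

# Controlled vocabulary derived from the eventual taxonomic structure.
VOCABULARY = set().union(*SUBJECT_TREE.values())


def metatag(text):
    """Step 1: tag a document with the vocabulary terms it contains."""
    words = {word.strip(".,;:").lower() for word in text.split()}
    return words & VOCABULARY


def place(tags):
    """Step 2: match the tags against the subject tree to pick nodes."""
    return sorted(
        node for node, node_tags in SUBJECT_TREE.items() if tags & node_tags
    )


document = "Proposal and bid summary for the new contract."
tags = metatag(document)
nodes = place(tags)
```

Even in this toy form, every document is touched twice and the vocabulary must be maintained alongside the tree, which hints at why the approach becomes costly at scale.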

Web Sprawl: The Proliferation of Corporate Web Sites

Another issue facing enterprises, especially large ones, is the proliferation of Web sites or “Web sprawl.” This proliferation began as soon as the Internet became popular. Here are some anecdotal examples:

  • As early as 1995, DEC (purchased by Compaq and then Hewlett Packard) had 400 internal Web sites and Sun Microsystems had more than 1,000[19]
  • As reported in 2000, Intel had more than 1 million URLs on its intranet with more than 100 new Web sites being introduced each month[20]
  • In 2002, IBM consolidated over 8,000 intranet sites, 680 ‘major’ sites, 11 million Web pages and 5,600 domain names into what it calls the IBM Dynamic Workplaces, or W3 to employees[21]
  • Silicon Graphics’ ‘Silicon Junction’ company-wide portal serves 7,200 employees with 144,000 Web pages consolidated from more than 800 internal Web sites[22]
  • Hewlett-Packard Co. has sliced the number of internal Web sites it runs from 4,700 (1,000 for employee training, 3,000 for HR) to 2,600, and it makes them all accessible from one home, @HP[23],[24]
  • Providence Health Systems recently consolidated more than 200 sites[25]
  • Avaya Corporation is now consolidating more than 800 internal Web sites globally[26]
  • The Wall Street Journal recently reported that AT&T has more than 10 information architects on staff to maintain its 3,600 intranet sites that contain 1.5 million public Web pages[27]
  • The new Department of Homeland Security is faced with the challenge of consolidating more than 3,000 databases inherited from its various constituent agencies.[28]

Corporate IT does not even know the full extent of Web site proliferation, similar to the loss of centralized control when personal computers entered the enterprise. In that circumstance it took changes in managerial mindsets and new technology such as the PC network by Novell before control could be reasserted. Similar changes will be necessary to corral the issue of Web sprawl.

The Tyranny of Expectations

Vendor hype is one cause of misplaced expectations; wrong assumptions regarding benefits and costs are another.

One area where this can occur is in time savings. Vendors and customers often use time savings by knowledge workers as a key rationale for justifying a document initiative. This comes about because many studies over the years have noted that white collar employees spend a consistent 20% to 25% of their time seeking information; the premise is that more effective search will save time and drop these percentages. However, the fact that these percentages have held stable over time suggests this is the “satisficing” allocation of time to information search. Thus, while better discovery tools may lead to finding better information and making better decisions more productively (an intangible and important justification in itself), they may not yield a strict time or labor savings from more efficient search.[29]

Another area is lack of awareness about full project costs. According to Charles Phillips of Morgan Stanley, only 30% of the money spent on major software projects goes to the actual purchase of commercially packaged software. Another third goes to internal software development by companies. The remaining 37% goes to third-party consultants.[30]

The Poor Performance of Existing Software

High expectations matched with poor performance is the match in the gas-filled room. Some of the causes of poor document content software performance include:

  • Poor Scalability – according to a market report published by Plumtree in 2003, the average document portal contains about 37,000 documents.[31] This was an increase from a 2002 Plumtree survey that indicated average document counts of 18,000.[32] However, about 60% of respondents to a Delphi Group survey said they had more than 50,000 internal documents in their internal environment (generally the department level). Poor scalability and low coverage of necessary documents are a constant refrain among early enterprise implementers
  • Long Implementation Times – though the average time to stand up a new content installation is about 6 months, there is also a 22% risk that deployment time exceeds that and an 8% risk it takes longer than one year. Furthermore, internal staff necessary for initial stand-up average nearly 14 people (6 of whom are strictly devoted to content development), with the potential for much larger head counts[33]
  • Very High Ongoing Maintenance and Staffing Costs – a significantly limiting factor to adoption is the trend that suggests that ongoing maintenance and staffing costs exceed the initial deployment effort. Based on analysis from BrightPlanet, the table below summarizes set-up, ongoing maintenance and key metrics for today’s conventional approaches versus what BrightPlanet can do. These staffing estimates are consistent with a survey of 40 installations that found there were on average 14 content development staff managing each enterprise’s content portal.[34] Current practices costing $5 to $11 per document for electronic access are simply unacceptable:

                   DOCUMENT      INITIAL SET-UP               MAINTENANCE
                   BASIS         Staff    Mos      $/Doc      Staff     $/Doc
Current Practice   37,000        6.2      5.4      $4.861     6.4       $11.278
BrightPlanet       250,000       1.0      0.8      $0.017     0.3       $0.078
BP Advantage       6.8 x + up    6.2 x    6.7 x    280.4 x    21.4 x    144.6 x

  • Lousy Integration Capabilities – content cannot be treated in isolation from the total information needs of the organization
  • High TCO – all of these factors combine into an unacceptable total cost of ownership. The cost and risk are simply too great for document management to rise sufficiently within IT priorities, despite the general situational awareness that “infoglut” is costing the firm a ton.
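The “BP Advantage” row in the table above can be roughly recomputed from the other two rows. The sketch below uses the figures as printed; the dictionary keys are labels invented for this illustration. Some recomputed ratios differ slightly from the published ones, presumably because the published ratios were calculated from unrounded inputs (that explanation is an assumption).

```python
# Figures as printed in the table (already rounded by the source).
current_practice = {
    "documents": 37_000,
    "setup_staff": 6.2,
    "setup_months": 5.4,
    "setup_cost_per_doc": 4.861,
    "maintenance_staff": 6.4,
    "maintenance_cost_per_doc": 11.278,
}
brightplanet = {
    "documents": 250_000,
    "setup_staff": 1.0,
    "setup_months": 0.8,
    "setup_cost_per_doc": 0.017,
    "maintenance_staff": 0.3,
    "maintenance_cost_per_doc": 0.078,
}


def advantage(metric):
    """Ratio in BrightPlanet's favor for one column of the table."""
    current, bp = current_practice[metric], brightplanet[metric]
    # More documents is better; for every other column, less is better.
    return bp / current if metric == "documents" else current / bp


ratios = {metric: advantage(metric) for metric in current_practice}
for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.1f} x")
```

From the printed inputs the ratios come out at roughly 6.8, 6.2, 6.75, 286, 21.3 and 144.6, against the published 6.8, 6.2, 6.7, 280.4, 21.4 and 144.6.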

The Result: An Immature Market Space

The lack of standards, confusing terminology, some failed projects, immaturity of the space and the absence as yet of a dominant vendor have prevented more widespread adoption of what are clearly needed solutions to pressing business content needs. Vendors and industry analysts alike confuse the market with competing terminology, each trying to carve out a unique “message” in this ill-formed space. Read multiple white papers or inspect multiple vendor Web sites and these difficulties become evident. There are no accepted benchmarks by which to compare performance and cost implications for content management. This limitation is especially acute because, given the confusion in the market, there are no independent sources to turn to for insight and quantitative comparisons.

These issues — in combination with high costs, risks and uncertainty of performance and implementation success — lead to a very immature market at present.

Conclusions

Clearly, the high costs of document management software, matched with poor performance and unmet expectations, are one of the root causes for the $800 billion annual waste in document use within U.S. enterprises. However, as other parts of this series point out, the overall explanation for this wasteful situation is very complex with other important contributing factors at play.

Document use and management software can be considered to be at a point similar to where structured data was 15 years ago, at the nascent emergence of the data warehousing market. Growth in this software market will require substantial improvements in TCO and scalability, along with a general increase in awareness of the magnitude of the problem and the available means to solve it.


[1] H.A. Simon, “Designing Organizations for an Information Rich World,” in M. Greenberger (ed.), Computers, Communications, and the Public Interest, pp. 38-52, July 1971, The Johns Hopkins University Press, Baltimore, MD. Reprinted in: H.A. Simon, Models of Bounded Rationality and Other Economic Topics, Vol. 2. Collected Papers, The MIT Press, Cambridge, MA, May 1982.

[2] M.K. Bergman, “Untapped Assets: The $3 Trillion Value of U.S. Enterprise Documents,” BrightPlanet Corporation White Paper, December 2004, 37 pp. See http://www.brightplanet.com/technologydocumentvalue.asp.

[3] Delphi Group, “Taxonomy & Content Classification Market Milestone Report,” Delphi Group White Paper, 2002. See http://delphigroup.com.

[4] P. Lyman and H. Varian, “How Much Information, 2003,” retrieved from http://www.sims.berkeley.edu/how-much-info-2003 on December 1, 2003.

[5] M.K. Bergman, “A Cure to IT Indigestion: Deep Content Federation,” BrightPlanet Corporation White Paper, December 2004, 40 pp. See http://www.brightplanet.com/technology/whitepapers.asp

[6] Cap Ventures information, as cited in ZyLAB Technologies B.V., “Know the Cost of Filing Your Paper Documents,” Zylab White Paper, 2001. See http://www.zylab.com/downloads/whitepapers/PDF/21%20-%20Know%20the%20cost%20of%20filing%20your%20paper%20documents.pdf.

[7] Optika Corporation. See http://www.optika.com/ROI/calculator/ROI_roiresults.cfm

[8] As initially published in Inc Magazine in 1993. Reference to this document may be found at: http://www.contingencyplanning.com/PastIssues/marapr2001/6.asp

[9] J. Snowdon, Documents — The Lifeblood of Your Business?, October 2003, 12 pp. The white paper may be found at: http://www.mdy.com/News&Events/Newsletter/IDCDocMgmt.pdf

[10] Xerox Global Services, Documents – An Opportunity for Cost Control and Business Transformation, 28 pp., 2003. The findings may be found at: http://www.sap.com/solutions/srm/pdf/CCS_Xerox.pdf

[11] Nuala Beck, Shifting Gears: Thriving in the New Economy, Harper Collins Publishers, Toronto, 1993.

[12] pers. comm., Guy Cresse, Aberdeen Group, November 19, 2001.

[13] S.A. Mohrman and D.L. Finegold, Strategies for the Knowledge Economy: From Rhetoric to Reality, 2000, University of Southern California study as supported by Korn/Ferry International, January 2000, 43 pp. See http://www.marshall.usc.edu/ceo/Books/pdf/knowledge_economy.pdf.

[14] Gartner Group, as reported by P. Hallett, Schemalogic Corporation, at the 2003 Enterprise Data Forum, Philadelphia, PA, November 2003. See http://www.wilshireconferences.com/EDF2003/tripreport.htm.

[15] C. Sherman and S. Feldman, “The High Cost of Not Finding Information,” International Data Corporation Report #29127, 11 pp., April 2003.

[16] J.P. Dalton, “Enterprise Content Management Delusions,” Forrester Research Report, June 2002. 12 pp. See http://www.forrester.com/ER/Research/Report/Summary/0,1338,14981,00.html.

[17] A.T. Kearney, Network Publishing: Creating Value Through Digital Content, A.T. Kearney White Paper, April 2001, 32 pp. See http://www.adobe.com/aboutadobe/pressroom/pressmaterials/networkpublishing/pdfs/netpubwh.pdf.

[18] Though most widely used, the concept of “taxonomy” began with Linnaeus whose purpose was to name and place organisms within a hierarchical structure with dichotomous keys (yes, no) deciding each branch. The result is to place every species within a unique taxon including such concepts as family, genus and species. Content subject taxonomies allow multiple choices at each branch and therefore do not have a strict dichotomous structure. “Ontologies” better refer more generally to the nature or “being” of a problem space; they generally consist of a controlled vocabulary of related concepts. Ontologies need not, and often do not, have a hierarchical structure, and are therefore also not strictly accurate. “Subject tree” visually conveys the hierarchical, nested character of these structures, but is less “technical” than other terms.

[19] D. Strom, “Creating Private Intranets: Challenges and Prospects for IS,” an Attachmate White Paper prepared by David Strom, Inc., November 16, 1995. See http://www.strom.com/pubwork/intranetp.html.

[20] A. Aneja, C.Rowan and B. Brooksby, “Corporate Portal Framework for Transforming Content Chaos on Intranets,” Intel Technology Journal Q1, 2000. See http://developer.intel.com/technology/itj/q12000/pdf/portal.pdf.

[21] J. Smeaton, “IBM’s Own Intranet: Saving Big Blue Millions,” Intranet Journal, Sept. 25, 2002. See http://www.intranetjournal.com/articles/200209/ij_09_25_02a.html.

[22] See http://www.wookieweb.com/Intranet/.

[23] D. Voth, “Why Enterprise Portals are the Next Big Thing,” LTI Magazine, October 1, 2002. See http://www.ltimagazine.com/ltimagazine/article/articleDetail.jsp?id=36877.

[24] A. Nyberg, “Is Everybody Happy?” CFO Magazine, November 01, 2002. See http://www.cfo.com/article/1%2C5309%2C8062%2C00.html.

[25] See http://www.cubiccompass.com/downloads/Industry/Healthcare/Providence%20Health%20Systems%20Case%20Study.doc.

[26] See http://www.proudfoot-plc.com/pdf_20004-USPR1002Avayaweb.asp.

[27] Wall Street Journal, May 4, 2004, p. B1.

[28] pers. comm., Jonathon Houk, Director of DHS IIAP Program, November 2003.

[29] M.E.D. Koenig, “Time Saved — a Misleading Justification for KM,” KMWorld Magazine, Vol 11, Issue 5, May 2002. See http://www.kmworld.com/publications/magazine/index.cfm.

[30] C. Phillips, “Stemming the Software Spending Spree,” Optimize Magazine, April 2002, Issue 6. See http://www.optimizemag.com/article/showArticle.jhtml?articleId=17700698&pgno=1.

[31] This average was estimated by interpolating figures shown on Figure 8 in Plumtree Corporation, “The Corporate Portal Market in 2003,” Plumtree Corp. White Paper, 30 pp. See http://www.plumtree.com/portalmarket2003/default.asp.

[32] This average was estimated by interpolating figures shown on the p.14 figure in Plumtree Corporation, “The Corporate Portal Market in 2002,” Plumtree Corp. White Paper, 27 pp. See http://www.plumtree.com/pdf/Corporate_Portal_Survey_White_Paper_February2002.pdf.

[33] Analysis based on reference 31, with interpolations from Figure 16.

[34] M. Corcoran, “When Worlds Collide: Who Really Owns the Content,” AIIM Conference, New York, NY, March 10, 2004. See http://show.aiimexpo.com/convdata/aiim2003/brochures/64CorcoranMary.pdf.

NOTE: This posting is part of a series looking at why document assets are so poorly utilized within enterprises.  The magnitude of this problem was first documented in a BrightPlanet white paper by the author titled, Untapped Assets:  The $3 Trillion Value of U.S. Enterprise Documents.  An open question in that paper was why more than $800 billion per year in the U.S. alone is wasted and available for improvements, but enterprise expenditures to address this problem remain comparatively small and with flat growth in comparison to the rate of document production.  This series is investigating the various technology, people, and process reasons for the lack of attention to this problem.

Posted on October 26, 2005 at 9:25 am in Adaptive Information, Document Assets, Information Automation | Comments (2)
The URI link reference to this post is: http://www.mkbergman.com/136/why-are-800-billion-in-document-assets-wasted-annually-iii-enterprise-solutions-are-too-expensive/
Copyright © 2004–2010 Michael K. Bergman.   This work is licensed under a Creative Commons License