
Today, in the advanced knowledge economy of the United States, the information contained within documents represents about a third of total gross domestic product, or an amount of about $3.3 trillion annually.
Yet our understanding of the value of documents and the means to manage them is abysmal. These failures impact enterprises of all sizes from the standpoints of revenues, profitability and reputation. Continued national productivity growth — and thus the wealth of all citizens — depends critically on understanding and managing these document values.
As this white paper describes, the lack of a compelling and demonstrable common understanding of the importance of documents is in itself a major factor limiting available productivity benefits. There is an old Chinese saying that roughly translated is “what cannot be measured, cannot be improved.” Many corporate officers may believe this to be the case for document creation and productivity, but, as this paper shows, in fact many of these document issues can be measured.
This Friday brown bag leftover was first placed into the AI3 refrigerator on July 20, 2005. No changes have been made to the original posting.
I’d like to thank David Siegel for recently highlighting this post from 5 years ago with nice kudos on his PowerOfPull blog. That reference is what caused me to dust off the cobwebs from this older piece.
To wit, some 25% of the trillions of dollars spent annually on document creation lend themselves to actionable improvements:
| U.S. FIRMS | $ Million | % |
|---|---|---|
| Cost to Create Documents | $3,261,091 | |
| Benefits | | |
| Benefits to Finding Missed or Overlooked Documents | $489,164 | 63% |
| Benefits to Improved Document Access | $81,360 | 10% |
| Benefits of Re-finding Web Documents | $32,967 | 4% |
| Benefits of Proposal Preparation and Wins | $6,798 | 1% |
| Benefits of Paperwork Requirements and Compliance | $119,868 | 15% |
| Benefits of Reducing Unauthorized Disclosures | $51,187 | 7% |
| Total Annual Benefits | $781,314 | 100% |

| PER LARGE FIRM | $ Million |
|---|---|
| Cost to Create Documents | $955.6 |
| Benefits to Finding Missed or Overlooked Documents | $143.3 |
| Benefits to Improving Document Access | $23.8 |
| Benefits of Re-finding Web Documents | $9.7 |
| Benefits of Proposal Preparation and Wins | $2.0 |
| Benefits of Paperwork Requirements and Compliance | $35.1 |
| Benefits of Reducing Unauthorized Disclosures | $15.0 |
| Total Annual Benefits | $229.0 |
Table 1. Mid-range Estimates for the Annual Value of Documents, U.S. Firms, 2002[1]
The total benefit from improved document access and use to the U.S. economy is on the order of $800 billion annually, or about 8% of GDP. For the 1,000 largest U.S. firms, benefits from these improvements can approach $250 million annually per firm. About three-quarters of these benefits arise from not re-creating the intellectual capital already invested in prior document creation. About one-quarter of the benefits are due to reduced regulatory non-compliance or paperwork, or better competitiveness in obtaining solicited grants and contracts.
Indeed, even these figures likely severely underestimate the benefits to enterprises from improved leverage of document assets. It has always been the case that the best and most successful companies have been able to take better advantage of their intellectual assets than their competitors. The competitive advantage from better document access and use alone may exceed the huge benefits in the table above.
Documents — that is, unstructured and semi-structured data — are now at the point where structured data was at 15 years ago. At that time, companies realized that consolidating information from multiple numeric databases would be a key source of competitive advantage. That realization led to the development and growth of the data warehousing or business intelligence markets, now representing about $3.9 billion in annual software sales.
Search and enterprise content management software today only represents a fraction of that amount — perhaps on the order of $500 million annually. But given that intellectual content in documents represents three to four times the amount in numeric structured data, it is clear that document software capabilities are not being well utilized, reaching only a small fraction of their market potential.
The estimates provided in this white paper are drawn from numerous sources and are extremely fragmented, perhaps even inconsistent. One hope in preparing this document was to stimulate more research attention and data gathering around the critical issues of document value to the enterprise and the economy at large.
Documents: The Drivers of a Knowledge Economy
Documents: The Linchpin of Corporate Intellectual Assets
Documents: Unknown Value, Huge Implications
Documents: The Next Generation of Data Warehousing?
Connecting the Dots: A Pointillistic Approach
Number of ‘Valuable’ Documents Produced per Firm
Total Annual U.S. ‘Costs’ to Create Documents
‘Cost’ of Creating a ‘Typical’ Document
‘Cost’ of a Missed or Overlooked Document
Other Document Total ‘Cost’ Factors and Summary
Archival Lifetime of ‘Valuable’ Documents
Estimate of Time and Effort Devoted to Document Search
Effect of Non-persistent Search Efforts
‘Cost’ of Creating and Maintaining a Document Category Portal
‘Cost’ of Inaccessible or Hidden Intranet Sites
‘Costs’ and Opportunity Costs of Winning Proposals
‘Costs’ of Regulation and Regulatory Non-compliance
‘Cost’ of an Unauthorized Posted Document
How many documents does your organization create each year? What effort does this represent in terms of total staffing costs? What does it cost to create a ‘typical’ document? Of documents created, how much of the value in them is readily sharable throughout your organization? How long do you need to keep valuable documents and how can you access them? How much existing document content is re-created simply because prior work cannot be found? When prior information is missed, what do these prior investments in documents represent in terms of loss of market share, revenue or reputation? Indeed, what does the term, “document” represent in your organization’s context?
If you have difficulty answering these questions, you are not alone. Depending on the survey, from 90% to 97% of enterprises cannot answer these questions — in whole or in part. The purpose of this white paper is to provide the first comprehensive assessment ever of these document values.
Two points stand out. First, enterprises and the analyst community have historically overlooked the impact of document creation as opposed to document handling. Document creation is about 2-3 times more important, from an embedded cost standpoint, than document handling. Second, all aspects of document creation, and later access and use, assume a much greater role in the overall economics of enterprises than has been realized previously.
Put your index finger one inch from your nose. That is how close — and unfocused — document importance is to an organization. Documents are the salient reality of a knowledge economy, but like your finger, documents are often too close, ubiquitous and commonplace to appreciate.
How do your employees earn their livings? Writing proposals? Marketing or selling? Evaluating competitors or opportunities? Persuading? Analyzing? Communicating? Teaching? Of course, in some sectors, many make their living from growing things or making things. These are essential jobs — indeed, until the last few decades were the predominant drivers of economies — but are now being supplanted in advanced economies by knowledge work. Perhaps up to 35% of all company employees in the U.S. can be classified as knowledge workers.
And knowledge work means documents. The fact is that knowledge is produced and communicated through the written word. When we search, when we write, when we persuade, we may often do so verbally but make it persistent through the written word.
IBM estimates that corporate data doubles every six to eight months, 85% of which are documents.[2] At least 10% of an enterprise’s information changes on a monthly basis.[3] Year-on-year office document growth rates are on the order of 22%.[4] As later analysis indicates, there are perhaps on the order of 10 billion documents created annually in the U.S. with a mid-range “asset” value of $3.3 trillion per year. Documents are a huge contributor to the United States’ gross domestic product of $10.5 trillion (2002).
A Xerox Corporation study commissioned in 2003 and conducted by IDC surveyed 1000 of the largest European companies and had similar findings:[6],[7]
But, if defining what constitutes a document is hard, identifying the costs associated with all the document activities is almost impossible for many organizations. Ninety to 97 percent of the corporate respondents to the Coopers & Lybrand and Xerox studies, respectively, could not estimate how much they spent on producing documents each year. Almost three quarters of them admit that the information is unavailable or unknown to them.
An A.T. Kearney study sponsored by Adobe, EDS, Hewlett-Packard, Mayfield and Nokia, published in 2001, estimated that workforce inefficiencies related to content publishing cost organizations globally about $750 billion. The study further estimated that knowledge workers waste between 15% and 25% of their time in non-productive document activities.[8]

Figure 1. The Situation of Poor Enterprise Document Use Leads to Real Implications
But the situation is much broader and results in part from the inability to quantify the importance of both internal and external document assets to all aspects of the enterprise’s bottom line. Examples drawn from the main body of this white paper include: early adopters of enterprise content software typically capture less than 1% of valuable internal documents available; large enterprises are witnessing the proliferation of internal and external Web sites, sometimes exceeding thousands; use of external content is presently limited to Internet search engines, producing non-persistent results and no capture of the investment in discovery or results; and “deep” content in searchable databases, which is common to large organizations and represents 90% of external Internet content, is completely untapped.
A USC study reported that typically only 32% of employees in knowledge organizations have access to good information about technical developments relevant to their work, and 79% claim they have inadequate information about what their competitors are doing.[9]
The enterprise content integration software market is fragmented and confused, with only a few established companies providing partial solutions. Content integration is still a small market with annual revenues of less than $50 million worldwide.[10] Vendor offerings fail to satisfy customer needs because of a lack of functionality and a lack of scalability to enterprise volumes. Sales in the market remain distinctly lower than those projected by industry analysts, even as the magnitude of “information overload” continues to grow at a dramatic rate.
Documents — that is, unstructured and semi-structured data — are now at the point where structured data was at 15 years ago. At that time, companies realized that consolidating information from multiple numeric databases would be a key source of competitive advantage. That realization led to the development and growth of the data warehousing or business intelligence markets, now representing about $3.9 billion in annual software sales.[11]
Certain categories of businesses have been leaders in content integration, especially those that have recently had mergers and acquisitions activity, those that need to integrate business applications with content, and those for which the reuse of marketing assets across the organization is critical.[10]
Stonebraker and Hellerstein have provided an insightful roadmap for how enterprise data integration or “federation” has trended over time: Data warehousing → Enterprise application integration → Enterprise content integration → Enterprise information integration.[12] There are two threads to this trend. First, there has been a growing recognition of the importance of document (unstructured) content to contribute to actionable information. Second, increasingly unified and integrated means are being applied to all data sources to allow single-access retrievals.
The state of information regarding the value and cost of documents is extremely poor. Lack of defensible and vetted estimates for this information undercuts the ability to properly estimate the intellectual assets tied up in documents or the impacts of overlooked or misused documents.
Only three large document studies — the Coopers & Lybrand, Xerox and A.T. Kearney studies noted above — have been conducted in the past ten years regarding the use and importance of documents within enterprises, and then solely from the standpoint of executive perceptions.
The quantified picture presented in this white paper regarding the costs and benefits of document creation, access and use is a paint-by-the-numbers assemblage of disparate data. The paper draws upon about 80 different data sources, many fragmented. The analysis approach by necessity has needed to conjoin assumptions and data from many diverse sources.
This approach leads to both uncertainty regarding “true” values and likely inaccuracies or mis-estimates in some areas. To make the assessment as consistent as possible, a base year of 2002 was used, the common year reference for most of the available data sources. To bracket uncertainties, most estimates are provided in low, medium and high estimates.
Thus, this study should be viewed as preliminary, but strongly indicative of the value of documents. Further research and data collection will surely refine these estimates. Clearly, though, by any measure, the value of documents to the enterprise is significant and huge, and should not continue to be overlooked.
Though valuable content resides everywhere, the first challenge to enterprises is getting a handle on their own internal document content.
A recent UC Berkeley study on “How Much Information?” estimated that more than 4 billion pages of internal office documents with archival value are generated annually in the U.S. (Note: this is not the amount created, only those documents deemed worthy of retaining for more than one year).
| Firm Size (employees) | 1-9 | 10-19 | 20-99 | 100-499 | 500-999 | 1000-2500 | 2500-9999 | >10,000 |
|---|---|---|---|---|---|---|---|---|
| Firms | 3,716,944 | 616,064 | 518,258 | 85,304 | 8,572 | 5,161 | 2,704 | 930 |
| Employees | 12,328,094 | 8,274,541 | 20,370,447 | 16,410,367 | 5,906,266 | 7,894,226 | 12,519,664 | 31,357,579 |
| Knowledge Workers | 2,217,093 | 1,488,099 | 3,663,435 | 2,951,251 | 1,062,187 | 1,419,703 | 2,251,545 | 5,639,368 |
| Number of Pages – Low | 465,842,666 | 312,670,737 | 769,739,697 | 620,099,840 | 223,180,542 | 298,299,744 | 473,081,537 | 1,184,911,325 |
| Number of Pages – High | 1,164,606,665 | 781,676,843 | 1,924,349,242 | 1,550,249,599 | 557,951,355 | 745,749,360 | 1,182,703,842 | 2,962,278,313 |
| Number of Docs – Low | 46,584,267 | 31,267,074 | 76,973,970 | 62,009,984 | 22,318,054 | 29,829,974 | 47,308,154 | 118,491,133 |
| Number of Docs – High | 116,460,666 | 78,167,684 | 192,434,924 | 155,024,960 | 55,795,135 | 74,574,936 | 118,270,384 | 296,227,831 |
| Docs/Firm – Low | 13 | 51 | 149 | 727 | 2,604 | 5,780 | 17,496 | 127,410 |
| Docs/Firm – High | 31 | 127 | 371 | 1,817 | 6,509 | 14,450 | 43,739 | 318,525 |
| Docs/Firm – 3 yr Low | 38 | 152 | 446 | 2,181 | 7,811 | 17,340 | 52,487 | 382,229 |
| Docs/Firm – 5 yr High | 157 | 634 | 1,857 | 9,087 | 32,545 | 72,249 | 218,695 | 1,592,623 |
| Content Management Workers | 105,709 | 70,951 | 174,670 | 140,713 | 50,644 | 67,690 | 107,352 | 268,881 |
| CMWs/Firm | 0 | 0 | 0 | 2 | 6 | 13 | 40 | 289 |
Table 2. Document Projections for U.S. Firms by Size, 2002 Basis
Sources: UC Berkeley[13], U.S. Commerce Department[14], U.S. Bureau of Labor Statistics[15], U.S. Census Bureau[16]
Table 2 and Table 3 attempt to summarize the scale of this challenge for U.S. firms (for internal enterprise documents only). (See [17] for a description of methodology regarding document scales, note [18] for estimating the numbers of enterprise knowledge workers, and note [19] for estimating content workers. A rough multiplier of 3x to 4x can be applied to extrapolate globally.[20]) Breakouts are provided by size of firm; these include estimates for the number of knowledge and content workers within U.S. firms.
| Category | Value |
|---|---|
| Firms | 4,953,937 |
| Employees | 127,273,960 |
| Knowledge Workers | 20,692,680 |
| Annual Number of Docs – Low | 9,291,013,320 |
| Annual Number of Docs – High | 21,739,130,435 |
| Annual Docs/Firm – Low | 1,875 |
| Annual Docs/Firm – High | 4,388 |
| Total Docs/Firm – 3 yr Low | 1,990 |
| Total Docs/Firm – 5 yr High | 5,601 |
| Content Management Workers | 986,610 |
| CMWs/Firm | 0.2 |
Table 3. Total Annual Document Projections for U.S. Firms, 2002 Basis
Table 4 takes this information and breaks out distribution of document production for a ‘typical’ knowledge worker according to major document types. The data from this table is based on analysis of dozens of BrightPlanet customers averaged across about 10 million documents in various repositories.
|
% Based On |
||||||||||
|
All |
Unique |
MBs |
KB/Page |
Pg/Doc |
Pages |
|
Docs |
MBs |
Pages |
|
| Archival Documents (3 yrs) | ||||||||||
| DOC |
281 |
59 |
20 |
10.5 |
2,938 |
52% |
36% |
50% |
||
| PDF |
46 |
28 |
14 |
43.6 |
2,017 |
9% |
17% |
34% |
|||
| PPT |
32 |
26 |
55 |
14.6 |
474 |
6% |
16% |
8% |
||
| XLS |
178 |
51 |
100 |
2.7 |
484 |
33% |
31% |
8% |
||
| Weighted |
537 |
164 |
28 |
11.0 |
5,912 |
100% |
100% |
100% |
||
| Current Documents (1 yr) | ||||||||||
| DOC |
221 |
71 |
20 |
5.1 |
1,127 |
49% |
35% |
32% |
||
| PDF |
66 |
36 |
14 |
24.7 |
1,634 |
15% |
18% |
46% |
|||
| PPT |
53 |
76 |
55 |
12.9 |
687 |
12% |
38% |
20% |
||
| XLS |
108 |
17 |
100 |
0.6 |
70 |
24% |
8% |
2% |
||
| Weighted |
449 |
199 |
57 |
7.8 |
3,517 |
100% |
100% |
100% |
||
| Total per Employee | ||||||||||
| DOC |
502 |
129 |
20 |
8.1 |
4,065 |
51% |
36% |
43% |
||
| PDF |
112 |
64 |
14 |
32.5 |
3,650 |
11% |
18% |
39% |
|||
| PPT |
86 |
102 |
55 |
13.5 |
1,161 |
9% |
28% |
12% |
||
| XLS |
285 |
68 |
100 |
1.9 |
554 |
29% |
19% |
6% |
||
| Weighted |
986 |
363 |
39 |
9.6 |
9,430 |
100% |
100% |
100% |
||
Table 4. Document Production for a ‘Typical’ Knowledge Worker
Note that word-processed documents account for about 50% of typical production and storage demands. Note as well that documents of the highest archival value, converted to PDFs for sharing and deployment, represent about a third to two-fifths of stored documents.
Based on the information from Table 2 to Table 4 above, all updated to a common year 2002 basis, we can now estimate the total annual costs in the U.S. for creating all internal enterprise documents. The analysis is based on the UC Berkeley information and the Coopers & Lybrand studies. The “bottom up” case is based on the number of annual U.S. documents estimated based on Table 2. These results are shown in the table below:
| Annual U.S. Office Documents | Number (M) | $/Document | Total $ (B) |
|---|---|---|---|
| “Bottom Up” – Low | 1,387 | $738.58 | $1,024 |
| “Bottom Up” – High | 7,242 | $141.43 | $1,024 |
| Coopers & Lybrand | 11,975 | $272.33 | $3,261 |
| C&L – UCB | 27,737 | $272.33 | $7,554 |
| C&L – “Bottom Up” | 4,315 | $272.33 | $1,175 |
| Average | 10,531 | $384.11 | $3,253 |
Table 5. Annual U.S. Office Document Cost Estimates[21]
The average numbers above represent the average of the unique values in each column. The Table 5 analysis suggests there may be on the order of 10 billion documents created annually in the U.S. with a total “asset” value on the order of $3.3 trillion per year.
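The "average of the unique values" rule can be checked with a short sketch. The row values below are copied from Table 5; the names `TABLE_5` and `unique_average` are illustrative only and not part of the original analysis.

```python
from statistics import mean

# Rows copied from Table 5:
# (label, documents in millions, $ per document, total $ in billions)
TABLE_5 = [
    ("Bottom Up - Low", 1387, 738.58, 1024),
    ("Bottom Up - High", 7242, 141.43, 1024),
    ("Coopers & Lybrand", 11975, 272.33, 3261),
    ("C&L - UCB", 27737, 272.33, 7554),
    ("C&L - Bottom Up", 4315, 272.33, 1175),
]


def unique_average(values):
    """Average the distinct values, so a repeated entry counts only once."""
    return mean(set(values))


avg_docs_m = unique_average(row[1] for row in TABLE_5)   # millions of documents
avg_cost = unique_average(row[2] for row in TABLE_5)     # $ per document
avg_total_b = unique_average(row[3] for row in TABLE_5)  # $ billions

print(f"{avg_docs_m:,.1f} M docs, ${avg_cost:,.2f}/doc, ${avg_total_b:,.1f} B")
```

With the rounded inputs shown in the table, this gives roughly 10,531 million documents and $384.11 per document, in line with the Average row. The total works out to about $3,253.5 billion from these rounded figures, which the table reports as $3,253.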
Based on the averages in the table above, a ‘typical’ document may cost on the order of $380 to create.[22] Of course, a “document” can vary widely in size, complexity and time to create, and therefore its individual cost and value will vary widely. An invoice generated from an automated accounting system could be a single page and produced automatically in the thousands; proposals for very large contracts can take tens of thousands to millions of dollars to create. For example, here are some other ‘typical’ costs for a variety of documents:
| | Ave. Cost | |
|---|---|---|
| ‘Typical’ Document | $384.11 | |
| Invoice | $4.43 | [23] |
| Mortgage Application | $210.00 | [24] |
| ‘Typical’ Proposal | $17,500.00 | [25] |
Table 6. ‘Typical’ per Document Creation Costs
Depending on document mix and activities, individual enterprises may want to vary the average document creation costs used in their cost-benefit estimates.
The Coopers & Lybrand study suggests that 7.5 percent of all documents are lost forever, and that it costs $120 in labor ($150 updated to 2002) to find a misfiled document;[26] other studies suggest that 5% to 6% of documents are routinely misplaced or misfiled.
In fact, the extent of this problem is unknown and is affirmed by the Xerox results:[27]
Five independent studies suggest that, on average, organizations spend from 5% to 15% of total company revenue on handling documents.[27],[28],[29],[30],[31] These seemingly innocuous percentages can translate into huge bottom-line impacts for U.S. enterprises. For example, the total GDP of the United States was on the order of $10.5 trillion at the end of 2002.[32] Translating this value into the results of Table 5 and the information in previous sections indicates the importance of document creation and handling for U.S. enterprises:
| | Low | Medium | High |
|---|---|---|---|
| Total U.S. Gross Domestic Product ($B) | $10,487 | $10,487 | $10,487 |
| Total Document Handling ($B) | $524 | $1,049 | $1,573 |
| % of total GDP | 5.0% | 10.0% | 15.0% |
| Total Document Creation ($B) | $1,100 | $3,261 | $7,554 |
| % of total GDP | 10.5% | 31.1% | 72.0% |
| Total Document Misfiled ($B) | $32 | $81 | $160 |
| % of total GDP | 0.3% | 0.8% | 1.5% |
| ALL U.S. Document Burdens ($B) | $1,656 | $4,390 | $9,287 |
| % of total GDP | 15.8% | 41.9% | 88.6% |
Table 7. Range Estimates for Total U.S. Document Burdens in Enterprises, 2002[33]
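As a rough cross-check of Table 7, the three burden components can be summed and expressed as a share of GDP. This is only a sketch using the rounded $ billion figures printed in the table; the names `GDP_B`, `burdens_b` and `summarize` are illustrative assumptions.

```python
GDP_B = 10_487  # 2002 U.S. GDP in $ billions, as given in Table 7

# Component burdens in $ billions, copied from Table 7
burdens_b = {
    "low": {"handling": 524, "creation": 1100, "misfiled": 32},
    "medium": {"handling": 1049, "creation": 3261, "misfiled": 81},
    "high": {"handling": 1573, "creation": 7554, "misfiled": 160},
}


def summarize(components, gdp=GDP_B):
    """Return (total burden in $B, burden as a percent of GDP)."""
    total = sum(components.values())
    return total, round(100 * total / gdp, 1)


summary = {case: summarize(parts) for case, parts in burdens_b.items()}

for case, (total, pct) in summary.items():
    print(f"{case:>6}: ${total:,} B = {pct}% of GDP")
```

The rounded medium components sum to $4,391 billion rather than the $4,390 billion shown, presumably because the table was computed from unrounded inputs; the percentages of GDP agree in all three cases.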
A few observations relate to this table. First, enterprises and the analyst community have greatly overlooked the impact of document creation as opposed to document handling. Document creation is about 2-3 times more important – from an embedded cost standpoint – than document handling. Second, all aspects of document creation assume a much greater role in the overall economics of enterprises than has been realized previously.
It is shocking that documents have received so little management attention, awareness, measurement and direct effort to improve performance.
The ‘low’ and ‘high’ estimates for documents in Table 2 and Table 3 assume that 2% and 5%, respectively, of internal documents have archival value. Were these percentages to be higher, the volume of documents requiring integration and access would likewise increase. The 2% value is derived from the UC Berkeley study,[34] which also refers to an unpublished European study that places archival amounts at 10%. Unfortunately, there is little empirical information to support the degree to which documents deserve to be kept for archival purposes.
Assuming that documents may retain value for three to five years, the largest firms perhaps have as many as 4 million internal documents on average with enterprise-wide value. Firms with fewer employees generally have lower document counts. Archival percentages, however, are a tricky matter, since apparently 85% of all archived documents are accessed.[35]
Various estimates by Cowles/Simba,[36] Veronis, Suhler & Associates,[37] and Outsell[38] place the current market for on line business information in the $30 billion to $140 billion range, with significant projected growth. Outsell also indicates that marketing, sales, and product development professionals rely most heavily on information from the Internet for their daily decision making, based on a comparative study of Fortune 500 business professionals’ use of the open Web and fee-based desktop information content services.[39] Clearly, relevant and targeted content, much of which resides on line, has extreme value to enterprises.
UC Berkeley estimates that about 500 petabytes of new information was published on the Web in 2002,[34] based on original analysis conducted by BrightPlanet.[40] The compound growth rate in Web documents has been on the order of more than 200% annually.[41] Estimates for deep Web content range from about 6-8 times larger[42] to 500 times larger[40] than standard “surface web” content. The size of Internet content is overwhelming, of highly variable quality, growing at a rapid pace, and with much of its content ephemeral.
According to a recent study by iProspect, about 56 percent of users use search engines every day, based on a population of which more than 70 percent use the Internet more than 10 hours per week. Professionals abandon a current search 38% of the time after inspecting only one results page (the listing of document result URLs), and overall 82% of users attempt another search if relevant results are not found within the first three results pages. Just 13 percent of users said that they use different search engines for different types of searches.[43] Only 7.5 percent of Internet users said they refined their search with additional keywords in cases where they were unable to achieve satisfactory results.[44]
The average knowledge worker spends 2.3 hrs per day – or about 25% of work time – searching for critical job information.[45] IDC estimates that enterprises employing 1,000 knowledge workers waste well over $6 million per year each in searching for information that does not exist, failing to find information that does, or recreating information that could have been found but was not.[46] As that report stated, “It is simply impossible to create knowledge from information that cannot be found or retrieved.”
Vendors and customers often use time savings by knowledge workers as a key rationale for justifying a document or content initiative. This comes about because many studies over the years have noted that white collar employees spend a consistent 20% to 25% of their time seeking information; the premise is that more effective search will save time and drop these percentages. As a sample calculation, each 1% reduction in time devoted to search produces:
$50,000 (base salary) * 1.8 (burden rate) * 1.0% = $900/ employee
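The same sample calculation can be written as a small function, which makes it easy to substitute an organization's own salary and burden figures. The function name and parameters are illustrative; only the three input values come from the text.

```python
def search_time_savings(base_salary, burden_rate, time_reduction):
    """Annual value per employee of cutting the share of time spent searching.

    base_salary     annual base salary in dollars
    burden_rate     multiplier for fully loaded cost (benefits, overhead)
    time_reduction  fraction of work time saved, e.g. 0.01 for 1%
    """
    return base_salary * burden_rate * time_reduction


per_employee = search_time_savings(50_000, 1.8, 0.01)
print(f"${per_employee:,.0f} per employee per year")
```

With the inputs from the text, the result is about $900 per employee per year for each percentage point of search time saved.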
The stable percentage effort devoted to search over time suggests it is the “satisficing” allocation. (In other words, knowledge workers are willing to devote a quarter of their time to finding relevant information.) Thus, while better tools to aid better discovery may lead to finding better information and making better decisions more productively – a far more important justification in itself – there may not result a strict time or labor savings from more efficient search.[47]
The percentage of Web page visits that are re-visits is estimated at between 58%[48] and 80%.[49] While many of these re-visitations occur shortly after the first visit (e.g., during the same session using the back button), a significant number occur after a considerable amount of time has elapsed. Thus, it is not surprising that a survey of problems using the Web found “Not being able to find a page I know is out there,” and “Not being able to return to a page I once visited,” accounted for 17% of the problems reported, and that the most common problem using bookmarks was, “Changed content.”[50] Depending on the content type, users use either “direct” or “indirect” approaches to re-find previously discovered information:
| | Direct | Indirect |
|---|---|---|
| Specific Information | 42% | 58% |
| General Information | 58% | 43% |
| Specific Documents | 29% | 71% |
| Web Documents | 77% | 23% |
| Emails | 9% | 91% |
Table 8. General Approaches to Re-finding Previously Discovered Information [51]
Direct approaches require remembering or specifically noting the specific location of the information. Direct approaches include: direct entry; emailing to self; emailing to others; printing out; saving as file; pasting the URL into a document; and posting to a personal Web site.
Indirect approaches include: searching; looking through bookmarks; and recalling from a history file. All of these indirect approaches are supported by modern browsers. Note that re-finding Web pages or documents relies heavily on having a record of a previously visited URL.
As a University of Washington study supported by Microsoft discovered, all of the specific direct and indirect techniques applied to these re-discovery approaches have significant drawbacks in terms of desired functions for the recall process: [52]
| | Portability | No of Access Points | Persistence | Preservation | Currency | Context | Reminding | Ease of Integration | Communication | Ease of Maintenance |
|---|---|---|---|---|---|---|---|---|---|---|
| DIRECT APPROACHES | | | | | | | | | | |
| Direct Entry | Low | High | Low | Med | High | Low | Low | ? | Low | High |
| Email to Self | Low | High | Low | Med | High | High | High | Med | Low | Med |
| Email to Others | Low | High | Low | Med | High | High | Low | Low? | High | High |
| Print-out | High | High | High | Low | Low | Low | High | Med | High | Med |
| Save as File | Med? | Low? | High | High | Low | Low | Low | Med? | Low | Med |
| Paste URL in Doc | Low | Low? | Low | Med | High | High | High? | High? | Low | High |
| Personal Web Site | Low | High | Low | Med | High | High | High? | High | Med | High? |
| INDIRECT APPROACHES | | | | | | | | | | |
| Search | Low | High | Low | Med | High | Low | Low | ? | Low | High |
| Bookmark | Low | Low | Low | Med | High | Low | Low | Low | Low | Low |
| History | Low | Low | Low | Med | High | Low | Low | Low? | Low | ? |
Table 9. Strengths and Weaknesses of Existing Techniques to Re-use Web Information
The general observation is that no present technique is able on its own to keep search persistent and current while maintaining context. These combined inadequacies mean that previously found information is not easily found again, or re-discovered, as the following table shows:
| | Percent |
|---|---|
| Information No Longer Available | 37% |
| Re-tracing Path Fails | 14% |
| Time Length Since Last Find | 9% |
| Other Failure Reasons | 9% |
| Total Information Lost | 68% |
| Success Finding Lost Information | 32% |
Table 10. Success in Finding Important Earlier Found Web Information [53]
This table has a number of important observations. First, some 37% of previously found information disappears from the Web, consistent with other findings that estimate about 40% of all Web content disappears annually, some of which has historical or archival value.[54]
Second, and most importantly, nearly 70% of previously found valuable information cannot be rediscovered. More than half of this problem is because the information is no longer available on the Web, but other reasons relate to the inadequacies of recall techniques for finding previously discovered information.
These observations can translate into some relatively huge costs on a per employee and per enterprise basis, as the table below shows:
| | Per Knowledge Worker: Per Doc | Per Knowledge Worker: All Docs | Per ‘Large’ Enterprise ($000) | All Enterprises ($M) |
|---|---|---|---|---|
| Re-finding Documents | $148.54 | $585 | $3,547 | $12,103 |
| Re-creating Documents | $384.11 | $1,008 | $6,114 | $20,864 |
| TOTAL | | $1,593 | $9,661 | $32,967 |
Table 11. ‘Cost’ of Not Readily Re-finding Valuable Web Information
This analysis assumes that some previously found information of value is again re-found (60%), but some is also not re-found and must be re-created (40%).[55] The ‘large’ enterprise is identical to the definition in Table 2 (which is also nearly equivalent to a Fortune 1000 company).[56]
The analysis indicates that poor methods to recall previously found and valuable Web documents may cost $1,600 per knowledge worker per year. This translates into nearly a $10 million productivity loss for the largest enterprises, or nearly $33 billion across all U.S. industries.
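The Table 11 figures can be cross-checked against the stated 60%/40% split between re-found and re-created documents. The sketch below is one way to do that in Python; the implied document counts and headcounts are derived here from the table's own numbers and are not stated in the source.

```python
# Per-knowledge-worker figures as reported in Table 11.
refind_cost_per_doc = 148.54     # $ per re-found document
refind_cost_all_docs = 585.0     # $ per worker per year
recreate_cost_per_doc = 384.11   # $ per re-created document
recreate_cost_all_docs = 1008.0  # $ per worker per year

# Implied documents per worker per year (derived, not stated in the source).
refound_docs = refind_cost_all_docs / refind_cost_per_doc        # about 3.9
recreated_docs = recreate_cost_all_docs / recreate_cost_per_doc  # about 2.6

# Share of affected documents that are re-found rather than re-created.
refound_share = refound_docs / (refound_docs + recreated_docs)

# Total annual cost per knowledge worker (the TOTAL row).
total_per_worker = refind_cost_all_docs + recreate_cost_all_docs

# Implied headcounts behind the enterprise-level columns (derived).
large_enterprise_total = 9_661_000      # $9,661 thousand
all_enterprises_total = 32_967_000_000  # $32,967 million
implied_workers_large = large_enterprise_total / total_per_worker
implied_workers_all = all_enterprises_total / total_per_worker

print(f"Re-found share of documents: {refound_share:.0%}")
print(f"Cost per knowledge worker: ${total_per_worker:,.0f}")
print(f"Implied knowledge workers per large enterprise: {implied_workers_large:,.0f}")
print(f"Implied knowledge workers, all enterprises: {implied_workers_all:,.0f}")
```

The implied headcounts are simply the enterprise totals divided by the per-worker cost, so they show what worker populations the table's columns assume.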
In relation to the total document costs noted in Table 7 above, these may seem to be comparatively small numbers. However, when viewed in the context of unproductive standard Web search, they indicate important failings in the ability to recall previously found valuable results from searches and their attendant productivity losses.
Users, administrators and industry analysts alike recognize the importance of placing content into logical, intuitive and hierarchically organized categories. About 60% of knowledge workers note that search is a difficult process, made all the more difficult without a logical organization to content.[57] While technical distinctions exist, these logical structures organized into a hierarchical presentation are most often referred to as “taxonomies,” though other terms such as ontology, subject directory, subject tree, directory structure or classification schema may be used.
Delphi Group’s research with corporate Web sites points to the lack of organized information as the number one problem in the opinion of business professionals. More than three-quarters of the surveyed corporations indicated that a taxonomy or classification system for documents is imperative or somewhat important to their business strategy; more than one-third of firms that classify documents still use manual techniques.[57] Hierarchical arrangements of categorized subjects trigger associations and relationships that are not obvious when simply searching keywords. Other advantages cited for the taxonomic presentation of documents are the greater likelihood of discovery, ease-of-use, overcoming the difficulty of formulating effective search queries, being able to search only within related documents, discovery of relationships among similar terminology and concepts, and user satisfaction.[58],[59]
From the user standpoint, knowledge workers want to impose taxonomic order on document chaos, but only if the taxonomy models their domain accurately. They also want software to assist with categorizing, as long as it respects the taxonomy they created. Finally, the results of these category placements should be presented via a portal. Thus, as the common concern across all requirements, the taxonomy takes on tremendous importance for an application’s success.[60]

Figure 2. Typical Large Firm Documents, Thousands
Enterprises that have adopted directory structures for content management are not yet achieving enterprise-wide relevance, presenting on average 1% of all relevant documents in an organized portal view. These limitations appear to be driven by weaknesses in the technology and high costs associated with conventional approaches:
| | Document Basis | Initial Set-up: Staff | Initial Set-up: Mos | Initial Set-up: $/Doc | Maintenance: Staff | Maintenance: $/Doc |
|---|---|---|---|---|---|---|
| Current Practice | 37,000 | 6.2 | 5.4 | $4.861 | 6.4 | $11.278 |
| BrightPlanet | 250,000 | 1.0 | 0.8 | $0.017 | 0.3 | $0.078 |
| BP Advantage | 6.8 x + up | 6.2 x | 6.7 x | 280.4 x | 21.4 x | 144.6 x |
Table 12. Staff, Time and per Document Costs for Categorized Document Portals
Though conventional approaches to content integration seem to lead to high per document set-up and maintenance costs, these should be contrasted with standard practice that suggests it may cost on average $25 to $40 per document simply for filing.[29] Indeed, labor costs can account for up to 30% of total document handling costs.[28] Nonetheless, at $5 to $11 per document for content management alone, this could result in no actual cost savings if electronic access does not displace current filing practices. When multiplied across all enterprise documents, these uncertainties can translate into huge swings in costs or benefits for a content portal initiative.
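The “BP Advantage” row of Table 12 is a set of ratios between the two rows above it. The following Python sketch recomputes those ratios from the rounded figures printed in the table; the column keys and the `advantage_ratios` helper are hypothetical names introduced for illustration. Because the inputs are rounded, some recomputed multiples differ slightly from the printed ones, which were presumably calculated from unrounded values.

```python
# Rows as reported in Table 12, in column order.
COLUMNS = ("documents", "setup_staff", "setup_months",
           "setup_cost_per_doc", "maint_staff", "maint_cost_per_doc")
current_practice = dict(zip(COLUMNS, (37_000, 6.2, 5.4, 4.861, 6.4, 11.278)))
brightplanet = dict(zip(COLUMNS, (250_000, 1.0, 0.8, 0.017, 0.3, 0.078)))


def advantage_ratios(baseline, alternative):
    """Return 'times better' multiples for the alternative over the baseline.

    More documents handled is better, so that ratio is alternative/baseline;
    for staff, months and costs lower is better, so it is baseline/alternative.
    """
    ratios = {}
    for name in COLUMNS:
        if name == "documents":
            ratios[name] = alternative[name] / baseline[name]
        else:
            ratios[name] = baseline[name] / alternative[name]
    return ratios


ratios = advantage_ratios(current_practice, brightplanet)
for name, value in ratios.items():
    print(f"{name:>20}: {value:6.1f} x")
```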
While other vendors claim fast categorization times, what they fail to mention is the lengthy pre-processing times necessary for generating their categorization metatags. According to Forrester Research, some of these metatagging systems can only process five to 15 documents per hour![67]
In 2003, the portal vendor Plumtree noticed a new trend that it called “Web sprawl,” by which it meant the costly proliferation of Web applications, intranets and extranets.[68] BEA has taken up this trend as a major thrust to its Web service offerings through an approach it calls “enterprise portal rationalization” (EPR).[69] According to BEA, its architectural offerings are meant to control the “metastasizing” of corporate Web sites.
How common and to what scale is the proliferation of enterprise Web sites? I have not been able to find any comprehensive studies on this topic, but have been able to find many anecdotal examples. The proliferation, in fact, began as soon as the Internet became popular:
BrightPlanet’s customers confirm these trends, with indicators of hundreds if not thousands of internal Web sites common in the largest companies. Indeed, it is surprising how many instances there are where corporate IT does not even know the full extent of Web site proliferation. The problem is likely much greater than realized:
| | Low | Med | High |
|---|---|---|---|
| Number of Large Firms | 930 | 1,500 | 3,000 |
| Ave. Number of Web Sites per Firm | 100 | 500 | 900 |
| Ave. Number of Documents per Web Site | 100 | 350 | 1,500 |
| Total Large Firm Web Sites | 93,000 | 750,000 | 2,700,000 |
| Percentage of Known Web Sites | 85% | 60% | 40% |
| Percentage of Doc Federation for Known Sites | 50% | 10% | 2% |
| Site Development & Maintenance | | | |
| Development Cost per Web Site | $300 | $1,701 | $9,000 |
| Annual Maintenance Cost per Site | $800 | $3,947 | $21,000 |
| Total Yr 1 Cost per Site | $1,100 | $5,649 | $30,000 |
| Total Yr 1 per Large Firm Costs ($000) | $110 | $2,824 | $27,000 |
| Total Yr 1 Large Firm Costs ($M) | $102 | $4,237 | $81,000 |
| ‘Cost’ of Unfound Documents | | | |
| No. of Unknown Documents per Firm | 5,750 | 80,500 | 820,800 |
| Total Number of Large Firm Unknown Docs | 5,347,500 | 120,750,000 | 2,462,400,000 |
| Total Cost per Web Site | $6,900 | $23,915 | $350,310 |
| Cost of Unknown Docs per Firm ($000) | $690 | $11,958 | $315,279 |
| Total Cost of Large Firm Unknown Docs ($M) | $642 | $17,937 | $945,837 |
| Summary | | | |
| Total Cost per Firm ($000) | $800 | $14,782 | $342,279 |
| Total Cost all Large Firms ($M) | $744 | $22,173 | $1,026,837 |
| Development as % of Total Costs | 14% | 19% | 8% |
| Unfound Documents as % of Total Costs | 86% | 81% | 92% |
Table 13. Development and Unfound Document ‘Costs’ for Large Firms due to Web Sprawl
Table 13 consolidates previous information to estimate what the ‘costs’ of Web sprawl might be to larger firms (analogous to the Fortune 1000). The table presents Low, Medium and High estimates for number of Web sites per firm, known and unknown documents in each, and associated costs for initial site development and first-year maintenance plus the value of unfound information. The Medium category uses the average values from previous tables. The Low and High values bracket these amounts based on distribution of known values and expert judgment.
The table indicates as a mid-range estimate that an individual Web site for a large enterprise may cost about $6,000 to set up and maintain in the first year and represents $24,000 in opportunity costs due to unknown or unfound documents. For the average large enterprise across all Web sites, these costs may be $2.8 million and $12.0 million, respectively. Across all large firms, total costs due to Web sprawl may be on the order of $22 billion.
While site development and maintenance costs are not trivial, exceeding $4 billion for all large firms (which can also be significantly reduced – see previous section), the major cost impact comes from the inability to find or federate the information that is available. Unfound documents represent well in excess of 80% of the costs associated with Web sprawl.
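The Medium column of Table 13 can be approximately rebuilt from its input rows. The Python sketch below does so under two assumptions that are inferred from the table's figures rather than stated in the source: first, that unknown documents per firm equal all documents on unknown sites plus the “federation” percentage of documents on known sites (a rule that reproduces the Low, Medium and High rows); and second, that each unknown document in the Medium case is priced at the $148.54 re-finding cost from Table 11. The function and parameter names are hypothetical.

```python
def web_sprawl_costs(firms, sites_per_firm, docs_per_site, known_site_share,
                     federation_share, dev_cost_per_site, maint_cost_per_site,
                     cost_per_unknown_doc):
    """Rebuild one column of Table 13 from its input rows (costs in dollars)."""
    yr1_cost_per_site = dev_cost_per_site + maint_cost_per_site
    site_cost_per_firm = yr1_cost_per_site * sites_per_firm

    docs_per_firm = sites_per_firm * docs_per_site
    # Inferred rule (an assumption): documents on unknown sites, plus the
    # federation percentage applied to documents on known sites.
    unknown_docs_per_firm = docs_per_firm * (
        (1 - known_site_share) + known_site_share * federation_share
    )
    unknown_cost_per_firm = unknown_docs_per_firm * cost_per_unknown_doc

    total_per_firm = site_cost_per_firm + unknown_cost_per_firm
    return {
        "yr1_cost_per_site": yr1_cost_per_site,
        "site_cost_per_firm": site_cost_per_firm,
        "unknown_docs_per_firm": unknown_docs_per_firm,
        "unknown_cost_per_firm": unknown_cost_per_firm,
        "total_per_firm": total_per_firm,
        "total_all_firms": total_per_firm * firms,
        "unfound_share": unknown_cost_per_firm / total_per_firm,
    }


# Medium-case inputs as reported in Table 13; the per-document cost is the
# re-finding figure from Table 11 (assumed to be the basis for this column).
medium = web_sprawl_costs(
    firms=1_500, sites_per_firm=500, docs_per_site=350,
    known_site_share=0.60, federation_share=0.10,
    dev_cost_per_site=1_701, maint_cost_per_site=3_947,
    cost_per_unknown_doc=148.54,
)
for name, value in medium.items():
    print(f"{name:>24}: {value:,.2f}")
```

Small differences from the printed table, such as a first-year cost per site that is one dollar lower, come from working with the table's rounded inputs.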
The Web sprawl situation is analogous to other major technology shifts. For example, in the early 1980s, IT grappled mightily with the proliferation of personal computers. Centralized control was impossible in that circumstance because individuals and departments recognized the productivity benefits to be gained by PCs. Only when enterprise-capable vendors of networking technology, such as Novell, were able to offer integration solutions was the corporation able to control and fully exploit the PC’s technology potential.
The proliferation of internal enterprise Web sites is responding to similar drivers: innovation, customer service, or superior methods of product or solutions delivery. Ambitious mid-level managers will continue to exploit these advantages by “cowboy” additions of more corporate Web sites, and that is likely all to the good for most enterprises. Gaining control and fully realizing the value of this Web site proliferation – while not stymieing innovation – will likely require enabling technology analogous to the networking of PCs.
The previous analysis has focused on more-or-less direct costs and drivers. These impacts are huge and deserve proper consideration. But there are other implications from the inability to access and manage relevant document information. These implications fall into the categories of lost opportunities, liabilities, or non-compliance. These implications often far outweigh the direct costs in their bottom-line impacts. This section presents only a few of these many opportunities.
Competitive proposals are an important revenue factor to hundreds of thousands of businesses. Indeed, contracts and grants from federal, state and local governments accounted for 12.1% of GDP in 2002; the amount competitively awarded equaled about 5.6% of GDP.[78] Reducing the fully-burdened costs of producing responses to competitive procurements and improving the rate of successfully obtaining them can be a huge competitive advantage to business.
Significant proportions of commercial projects and programs are likewise awarded through competitive proposals and bids. However, literature references to these are limited, and the remainder of this section relies on federal sector statistics as a proxy for the overall category.
Though the federal government is making strides in providing central clearinghouses to opportunities – and is also doing much in moving to uniform application standards and electronic application submissions – these efforts are still in their nascent stages and similar efforts at the state and local level are severely lagging. As a result, the magnitude of the proposal opportunity is perhaps largely unknown to many businesses. This lack of appreciation and attention to the cost- and success-drivers behind winning proposals is a real gap in the competitiveness of many individual businesses.
Table 14 consolidates information from many government sources to quantify the magnitude of this competitively awarded grant and contract opportunity with governments.
Table 14. Federal, State & Local Contract and Grant Opportunities, 2002
This analysis suggests that nearly $600 billion is available each year for competitively awarded grants and procurements from all levels of government within the U.S., about 60% of it from the federal sector. The average competitive award is about $270 K for grants and about $220 K for contract procurements.
Aside from construction firms (which are excluded in this and prior analyses), there are on the order of 92,500 federal contract-seeking firms today.[87] In 2003, the top 200 federal contracting firms accounted for nearly $190 billion in contract outlays.[88] While it is unclear what proportion of these commitments were competitive (81% of total federal commitments) or based on all contract procurements (57% of total federal commitments), it is clear that more than 90,000 firms are competing via a classic power curve for a minor portion of available federal revenues. This power curve is shown in Figure 3 below for the 200 largest federal contractors, which obtain a proportionately high percentage of all contract dollars.

Figure 3. Power Curve Distribution of Top 200 Federal Contractors by Revenue, 2002
The combination of these factors enables an estimate of the bottom-line proposal impacts by firm. This information is shown in the table below:
Table 15. Combined Preparation Costs and Opportunity Costs for Proposals
Across all entities, the annual cost of preparing proposals to competitive solicitations from government agencies at all levels is on the order of $22 billion: $5 billion for winning firms and $17 billion for losing firms. With better access to missing information and better information overall (assuming no change in the underlying ideas or proposal-writing skills), proposal response costs could be reduced by more than $3 billion annually. Another $3 billion annually is available from winning more competitive proposals. Individual benefits to firms that respond to competitive solicitations are on average $1.25 million per competing firm.[95]
The more significant benefit to individual firms from improved access to “missing” information and better information is increasing the likelihood of winning a competitive award. Firms that embrace these practices are estimated to obtain a $1.2 million annual benefit. Given that many firms that have previously been losing awards have relatively low annual revenues, the percent impact on the bottom line can be quite striking due to improved proposal preparation information.
A December 2001 small business poll by the National Federation of Independent Business (NFIB) gauged the impacts of the regulatory workload on firms. When asked “is government regulation a very serious, somewhat serious, not too serious, or not at all serious problem for your business,” nearly half, or 43.6 percent, answered “very serious” or “somewhat serious.” The respondents indicated the most serious regulatory problems were at the federal level (49%), state level (35%) or local level (13%) of government. The biggest single regulatory problem cited was extra paperwork, followed by difficulty understanding how to comply with regulations and dollars spent doing so.[96] A later December 2003 NFIB survey indicates that the average cost per hour of complying with paperwork requirements was $48.72.[97]
According to a 2001 report, “The Impact of Regulatory Costs on Small Firms” by W. Mark Crain and Thomas D. Hopkins, the total costs of Federal regulations were estimated to be $843 billion in 2000, or 8 percent of U.S. Gross Domestic Product. Of these costs, $497 billion fell on business and $346 billion fell on consumers or other governments. Here is how those impacts are estimated on a per employee basis across a range of firm sizes:[98]
| Type of Regulation | All Firms | <20 Employees | 20-499 Employees | 500+ Employees |
|---|---|---|---|---|
| All Federal Regulations | $5,107 | $7,544 | $4,671 | $4,827 |
| Environmental | $1,312 | $3,600 | $1,269 | $776 |
| Economic | $2,234 | $1,748 | $1,782 | $2,688 |
| Workplace | $843 | $897 | $944 | $755 |
| Tax Compliance | $719 | $1,300 | $676 | $608 |

Table 16. Per Employee Costs of Federal Regulation by Firm Size, 2002
As of September 30, 2002, federal agencies estimated there were about 8.2 billion “burden hours” of paperwork government-wide. Almost 95 percent of those hours were for information collected primarily for the purpose of regulatory compliance.[99]
| | Burden Hrs (million) | Labor Costs ($M) |
|---|---|---|
| Total Government | 8,223.17 | $318,237 |
| Total Gov (excl. Treasury) | 1,472.74 | $56,995 |
| Treasury | 6,750.43 | $261,242 |
| Transportation | 244.73 | $9,471 |
| HHS | 224.83 | $8,701 |
| Labor | 189.22 | $7,323 |
| EPA | 140.47 | $5,436 |
| Defense | 92.36 | $3,574 |
| Agriculture | 88.59 | $3,428 |
| Justice | 46.60 | $1,803 |
| Education | 38.44 | $1,488 |
| State | 29.23 | $1,131 |
| HUD | 21.93 | $849 |
| Commerce | 11.65 | $451 |
| Interior | 7.66 | $296 |
| Energy | 3.76 | $146 |
| SEC | 136.58 | $5,286 |
| FTC | 69.66 | $2,696 |
| FCC | 26.80 | $1,037 |
| SSA | 24.89 | $963 |
| FAR (contracts) | 24.49 | $948 |
| FCIC | 9.87 | $382 |
| NRC | 8.34 | $323 |
| FEMA | 7.77 | $301 |
| Veterans Administration | 7.31 | $283 |
| NASA | 5.95 | $230 |
| NSF | 4.46 | $173 |
| FERC | 4.38 | $170 |
| SBA | 2.77 | $107 |
Table 17. Federal Government Paperwork Burdens, 2002[100]
A December 2003 NFIB survey indicates that the average cost per hour of complying with paperwork requirements was $48.72.[101] If this hourly cost is substituted for the labor rate underlying the table above, the total cost burden would be about $400 billion, of which $71 billion falls outside Treasury and the IRS.
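The substitution described here is a straightforward rescaling of the Table 17 burden hours. A short Python sketch, using only figures from the table and the NFIB survey, is given below; the hourly rate implied by the table is derived here and is not stated in the source.

```python
# Figures as reported in Table 17 (hours in millions, costs in $ millions).
total_burden_hours_m = 8_223.17
non_treasury_burden_hours_m = 1_472.74
table_labor_cost_m = 318_237

# Average hourly compliance cost from the December 2003 NFIB survey.
nfib_hourly_cost = 48.72

# Hourly rate implied by the table's own labor-cost column (derived).
implied_table_rate = table_labor_cost_m / total_burden_hours_m

# Re-price the same burden hours at the NFIB rate (results in $ millions).
total_at_nfib_m = total_burden_hours_m * nfib_hourly_cost
non_treasury_at_nfib_m = non_treasury_burden_hours_m * nfib_hourly_cost

print(f"Implied hourly rate in Table 17: ${implied_table_rate:.2f}")
print(f"Total burden at NFIB rate: ${total_at_nfib_m / 1000:,.1f} billion")
print(f"Non-Treasury burden at NFIB rate: ${non_treasury_at_nfib_m / 1000:,.1f} billion")
```

The non-Treasury result lands between $71 billion and $72 billion, in line with the rounded figure cited in the text.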
Despite legislation requiring federal paperwork reduction and embracing of e-government initiatives, paperwork burdens continue to increase. Total burden hours in 2002, for example, increased 600 million hours, or about 4 percent, from the previous year. The Code of Federal Regulations (CFR) continues to expand despite efforts to curtail further growth. The CFR grew from 71,000 pages in 1975 to 135,000 pages in 1998. Annually, there are more than 4,000 regulatory changes introduced by the federal government. The federal government now has over 8,000 separate information collection requests authorized by OMB.[102]
Table 18. Federal Fines and Penalties to Corporations, 2002
Another source of costs to enterprises is civil penalties and fines for non-compliance with existing regulations, as shown in Table 18 for 2002 by agency. U.S. businesses pay a total of $5 billion annually in civil penalties for non-compliance with federal regulation, $1 billion of which is for non-tax purposes.
However, these estimates may undercount actual fines and penalties levied by the federal government due to the accounting basis of the OMB source. For example, the Department of Labor (DOL) collected fines and penalties totaling $175 million from employers in fiscal year 2002 for Fair Labor Standards Act (FLSA) violations.[107] According to a 2002 report, since 1990, 43 of the government’s top contractors paid approximately $3.4 billion in fines/penalties, restitution, and settlements.[108] And, according to another report, the corporations liable in the top 100 False Claims Act cases have paid more than $12 billion since 1986.[109] Since there is no central clearinghouse for this information, with both individual agency general counsels and the Department of Justice responsible for actual collections, the figures in Table 18 should be interpreted as estimates.
Table 19 consolidates the information in Tables 16 through 18 to estimate the overall regulatory and paperwork burdens on U.S. businesses, plus estimates of the benefits to be gained from better document access and use.
Unauthorized information disclosures derive mainly from within an organization. The ease of electronic record duplication and dissemination – particularly through postings on enterprise Web sites – increases a firm’s vulnerability to this problem. Records mutate and propagate in poorly controlled environments. On average, unauthorized disclosure of confidential information costs Fortune 1000 companies about $15 million per company per year.[110]
A few privacy laws demonstrate the potential liabilities associated with disclosure of confidential information due to inadvertent mistakes or disgruntled employees. As one example, the Health Insurance Portability and Accountability Act (HIPAA) of 1996 sets security standards protecting the confidentiality and integrity of “individually identifiable health information,” past, present or future. Failure to comply with any of the electronic data, security, or privacy standards can result in civil monetary penalties up to $25,000 per standard per year. Violation of the privacy regulations for commercial or malicious purposes can result in criminal penalties of $50,000 to $250,000 in fines and one to ten years of imprisonment.[111]
Table 19. Regulatory Burden and Benefits to Firms from Improved Information
As another example, the Gramm-Leach-Bliley Act (GLBA) of 1999 mandates that the financial industry create guidelines for the safeguarding of customer information. GLBA includes severe civil and criminal penalties for non-compliance: civil penalties run up to $100,000 for each violation, and key officers may be fined up to $10,000 per violation. Violation of the GLBA can also carry hefty sanctions, including termination of FDIC insurance and fines of up to $1,000,000 for an individual or one percent of the total assets of the financial institution.[117]
Other major areas of unauthorized disclosure liability occur in national security, identity theft, and commerce, tax and Social Security information. Indeed, virtually every state and federal agency related to a company’s business has policies and fines regarding unauthorized disclosures. Monitoring these requirements is thus an imperative for enterprise management to prevent exposure to fines and loss of reputation.
On a less-quantifiable basis there are also risks to the clarity of the enterprise message to customers, suppliers and partners. Unmanaged Web sprawl leaves a critical hole in an enterprise’s ability to ensure compliance with privacy and confidentiality regulations, and to promote clarity of message and accuracy to stakeholders.
Prior to the analysis in this white paper, the state of understanding about the value of document assets had been abysmal. While still preliminary and subject to much improvement, this study has nonetheless found:
As noted throughout, there is a considerable need for additional research and data on document creation, use, costs and benefits. Additional technical endnotes are provided in the PDF version of the full paper.
[1] All sources and assumptions are fully documented in footnotes in the main body of this white paper; general assumptions used in multiple tables are provided in the Technical Endnotes.
[2] As quoted by Armando Garcia, vice president of content management at IBM; see http://www.contentworld.com/conference/conthur.html
[3] Delphi Group, “Taxonomy & Content Classification Market Milestone Report,” Delphi Group White Paper, 2002. See http://delphigroup.com.
[4] Based on the 1999 to 2001 estimate changes in reference 34, Table 2-6.
[5] As initially published in Inc Magazine in 1993. Reference to this document may be found at: http://www.contingencyplanning.com/PastIssues/marapr2001/6.asp
[6] J. Snowdon, Documents – The Lifeblood of Your Business?, October 2003, 12 pp. The white paper may be found at: http://www.mdy.com/News&Events/Newsletter/IDCDocMgmt.pdf
[7] Xerox Global Services, Documents – An Opportunity for Cost Control and Business Transformation, 28 pp., 2003. The findings may be found at: http://www.sap.com/solutions/srm/pdf/CCS_Xerox.pdf
[8] A.T. Kearney, Network Publishing: Creating Value Through Digital Content, A.T. Kearney White Paper, April 2001, 32 pp. See http://www.adobe.com/aboutadobe/pressroom/pressmaterials/networkpublishing/pdfs/netpubwh.pdf.
[9] S.A. Mohrman and D.L. Finegold, Strategies for the Knowledge Economy: From Rhetoric to Reality, University of Southern California study as supported by Korn/Ferry International, January 2000, 43 pp. See http://www.marshall.usc.edu/ceo/Books/pdf/knowledge_economy.pdf.
[10] C. Moore, The Content Integration Imperative, Forrester Research Trends Report, March 26, 2004, 14 pp.
[11] D. Vesset, Worldwide Business Intelligence Forecast and Analysis, 2003-2007, International Data Corporation, June 2003, 18 pp. See http://www.dwway.com/file/20030708085453_IDC_WW-BIFORECASTANDANALYSIS2003-07_JUN03.pdf.
[12] M. Stonebraker and J. Hellerstein, “Content Integration for E-Business,” in ACM SIGMOD Proceedings, Santa Barbara, CA, pp. 552-560, May 2001.
[13] P. Lyman and H. Varian, “How Much Information, 2003,” retrieved from http://www.sims.berkeley.edu/how-much-info-2003 on December 1, 2003.
[14] U.S. Department of Commerce, Digital Economy 2003, Economic Statistics Administration, U.S. Dept. of Commerce, Washington, D.C., April 2004, 155 pp. See http://www.esa.doc.gov/DigitalEconomy2003.cfm.
[15] U.S. Department of Labor, “Occupation Employment and Wages, 2002,” Bureau of Labor Statistics. See http://www.bls.gov/news.release/archives/ocwage_11192003.pdf.
[16] U.S. Census Bureau, “Statistics of U.S. Businesses 2001.” See http://www.census.gov/epcd/susb/2001/us/US–.htm.
[17] Total office document counts were obtained on a page basis from reference 13, which used a value of 2% for the share of documents that deserve to be archived. This formed the ‘low’ case, with the high case using a 5% estimate (lower still than the ENST 10% estimate cited in reference 13). Total pages were converted to numbers of documents on an average 8 pp per document basis; see Technical Endnotes for further discussion.
[18] See Technical Endnotes for the derivation of knowledge worker estimates.
[19] See Technical Endnotes for the derivation of content worker estimates.
[20] Citation sources and assumptions for this analysis are presented in the BrightPlanet white paper, “A Cure to IT Indigestion: Deep Content Federation,” BrightPlanet Corporation White Paper, June 2004, 31 pp.
[21] The “bottom up” cases are built from the number of assumed knowledge workers in Table 3. The “low” and “high” variants are based on a 5% archival value or 350 annual documents created per worker, respectively, applied to worker staff costs associated with document creation. The “Coopers & Lybrand” case is a strict updating of that study to 2002. The other two “C&L” cases use the updated per document costs from the C&L study; the first variant uses the annual documents created from the UC Berkeley study without archiving; the second variant uses the average of the “low” and “high” document numbers. See further Technical Endnotes for other key assumptions.
[22] The individual values in Table 5 range from about $140 to $740 per document, with the update of the Coopers & Lybrand study being about $270. Separate Delphi analysis by BrightPlanet has shown median values of about $550 per document.
[23] See http://www.eds.com/services_offerings/ibill_openbill_b2b.shtml
[24] See http://www.hsh.com/cfee-sample.html.
[25] See http://www.atp.nist.gov/eao/applicants/section9.htm.
[26] As initially published in Inc Magazine in 1993. Reference to this document may be found at: http://www.contingencyplanning.com/PastIssues/marapr2001/6.asp
[27] Xerox Global Services, Documents – An Opportunity for Cost Control and Business Transformation, 28 pp., 2003. The findings may be found at: http://www.sap.com/solutions/srm/pdf/CCS_Xerox.pdf and J. Snowdon, Documents – The Lifeblood of Your Business?, October 2003, 12 pp. The white paper may be found at: http://www.mdy.com/News&Events/Newsletter/IDCDocMgmt.pdf
[28] Optika Corporation. See http://www.optika.com/ROI/calculator/ROI_roiresults.cfm.
[29] Cap Ventures information, as cited in ZyLAB Technologies B.V., “Know the Cost of Filing Your Paper Documents,” Zylab White Paper, 2001. See http://www.zylab.com/downloads/whitepapers/PDF/21%20-%20Know%20the%20cost%20of%20filing%20your%20paper%20documents.pdf.
[30] ALL Associates Group, Inc., EDAM Sector Summary, April 2003, 2 pp.
[31] ALL Associates Group, 2002 EDAM Metrics for Major U.S. Companies.
[32] By the second Q 2004, this amount was $11.6 trillion. U.S. Federal Reserve Board, Flow of Funds Accounts for the United States, Sept. 16, 2004. See http://www.federalreserve.gov/releases/Z1/current/accessible/f6.htm.
[33] The bases for this table have the following assumptions: 1) the three cases for document handling are based on 5%, 10% and 15% of total enterprise revenues, per the earlier section; 2) the three cases for document creation are based on the ‘C&L Bottom-Up’, ‘Bottom-up – High,’ and ‘Coopers & Lybrand’ items for the Low, Medium, and High columns, respectively, in Table 5; and 3) the document misfiling case draws on the same basis but using the total document estimates and misfiled percentages of 5%, 7.5% and 9% consistent with the previous discussion section. See further the Technical Endnotes.
[34] P. Lyman and H. Varian, “How Much Information, 2003,” retrieved from http://www.sims.berkeley.edu/how-much-info-2003 on December 1, 2003.
[35] Cap Ventures information, as cited in ZyLAB Technologies B.V., “Know the Cost of Filing Your Paper Documents,” Zylab White Paper, 2001. See http://www.zylab.com/downloads/whitepapers/PDF/21%20-%20Know%20the%20cost%20of%20filing%20your%20paper%20documents.pdf.
[36] As reported in http://www.hoovers.com/company/archive/detail/0,2049,7_2322,00.html.
[37] See http://www.veronissuhler.com/businfo/segment.html, August 2, 2000.
[38] See http://www.outsellinc.com/docs/pr_release/pr20000602_01.htm, June 2, 2000.
[39] See http://www.outsellinc.com/docs/pr_release/pr20000629_01.htm.
[40] M.K. Bergman, “The Deep Web: Surfacing Hidden Value,” BrightPlanet Corporation White Paper, June 2000. The most recent version of the study was published by the University of Michigan’s Journal of Electronic Publishing in July 2001. See http://www.press.umich.edu/jep/07-01/bergman.html.
[41] This analysis assumes there were 1 million documents on the Web as of mid-1994.
[42] See, for example, C. Sherman and G. Price, The Invisible Web, Information Today, Inc., Medford, NJ, 2001, 439 pp., and P. Pedley, The Invisible Web: Searching the Hidden Parts of the Internet, Aslib-IMI, London, 2001, 138pp.
[43] iProspect Corporation, iProspect Search Engine User Attitudes, April/May 2004, 28 pp. See http://www.iprospect.com/premiumPDFs/iProspectSurveyComplete.pdf.
[44] As reported at http://www.nua.ie/surveys/index.cgi?f=VS&art_id=905358569&rel=true.
[45] Delphi Group, “Taxonomy & Content Classification Market Milestone Report,” Delphi Group White Paper, 2002. See http://delphigroup.com.
[46] C. Sherman and S. Feldman, “The High Cost of Not Finding Information,” International Data Corporation Report #29127, 11 pp., April 2003.
[47] M.E.D. Koenig, “Time Saved – a Misleading Justification for KM,” KMWorld Magazine, Vol 11, Issue 5, May 2002. See http://www.kmworld.com/publications/magazine/index.cfm.
[48] G. Xu, A. Cockburn and B. McKenzie, Lost on the Web: An Introduction to Web Navigation Research, http://www.cosc.canterbury.ac.nzq/ACMchapterq/NZCSPGq/papers.
[49] A. Cockburn and B. McKenzie, What Do Web Users Do? An Empirical Analysis of Web Use, 2000. See http://citeseer.ist.psu.edu/cockburn00what.html.
[50] Tenth edition of GVU’s (graphics, visualization and usability) WWW User Survey, May 14, 1999. See http://www.gvu.gatech.edu/user_surveys/survey-1998-10/tenthreport.html.
[51] C. Alvarado, J. Teevan, M. S. Ackerman and D. Karger, “Surviving the Information Explosion: How People Find Their Electronic Information,” AI Memo 2003-06, April 2003, 11 pp., Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory. See ftp://publications.ai.mit.edu/ai-publications/2003/AIM-2003-006.pdf.
[52] W. Jones, H. Bruce and S. Dumais, “Keeping Found Things Found on the Web,” See http://washington.edu/KFTF_Web.pdf.
[53] J. Teevan, “How People Re-find Information When the Web Changes,” AI Memo 2004-014, June 2004, 10 pp., Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory. See ftp://publications.ai.mit.edu/ai-publications/2004/AIM-2004-012.pdf.
[54] Library of Congress, “Preserving Our Digital Heritage: Plan for the National Digital Information Infrastructure and Preservation Program”, a Report to Congress by the U.S. Library of Congress, 2002, 66 pp. See http://www.digitalpreservation.gov/ndiipp/.
[55] Consistent with Table 8; this analysis also assumes the 25% search time commitment by employee and previous values from earlier tables.
[56] All subsequent references to ‘Large’ firms are based on the last column in Table 2, namely the 930 U.S. firms with more than 10,000 employees.
[57] Delphi Group, “Taxonomy & Content Classification Market Milestone Report,” Delphi Group White Paper, 2002. See http://delphigroup.com.
[58] S. Stearns, “Realize the Value Locked in Your Content Silos Without Breaking the Bank: Automated Classification Tools to Improve Information Discovery,” Inmagic White Paper, version 1.0, 2004. 10 pp. See http://www.inmagic.com.
[59] P. Sonderegger, “Weave Search into the Browsing Experience,” Forrester Quick Take, Forrester Research, Inc., Feb. 18, 2004. 2 pp.
[60] P. Russom, “An Eye for the Needle,” Intelligent Enterprise, January 14, 2002. See http://www.iemagazine.com/020114/502feat2_1.
[61] This average was estimated by interpolating figures shown on Figure 8 in reference 68.
[62] This average was estimated by interpolating figures shown on the p.14 figure in Plumtree Corporation, “The Corporate Portal Market in 2002,” Plumtree Corp. White Paper, 27 pp. See http://www.plumtree.com/pdf/Corporate_Portal_Survey_White_Paper_February2002.pdf.
[63] The ‘low’ case represents the archival value in the middle bars with the addition that 30% of internal documents generated in the current year have a value to be shared for one year; the ‘high’ case represents the related archival value in the middle bars but with 40% of documents generated in that year having a value to be shared for one year.
[64] Analysis based on reference 68, with interpolations from Figure 16.
[65] M. Corcoran, “When Worlds Collide: Who Really Owns the Content,” AIIM Conference, New York, NY, March 10, 2004. See http://show.aiimexpo.com/convdata/aiim2003/brochures/64CorcoranMary.pdf.
[66] C. Phillips, “Stemming the Software Spending Spree,” Optimize Magazine, April 2002, Issue 6. See http://www.optimizemag.com/article/showArticle.jhtml?articleId=17700698&pgno=1.
[67] C. Moore, “The Content Integration Imperative,” Forrester Research, Inc., March 26, 2004, 14 pp.
[68] Plumtree Corporation, “The Corporate Portal Market in 2003,” Plumtree Corp. White Paper, 30 pp. See http://www.plumtree.com/portalmarket2003/default.asp.
[69] BEA Corporation, “Enterprise Portal Rationalization,” BEA Technical White Paper, 23 pp., 2004. See http://www.bea.com/content/news_events/white_papers/BEA_epr_wp.pdf.
[70] A. Aneja, C. Rowan and B. Brooksby, “Corporate Portal Framework for Transforming Content Chaos on Intranets,” Intel Technology Journal Q1, 2000. See http://developer.intel.com/technology/itj/q12000/pdf/portal.pdf.
[71] J. Smeaton, “IBM’s Own Intranet: Saving Big Blue Millions,” Intranet Journal, Sept. 25, 2002. See http://www.intranetjournal.com/articles/200209/ij_09_25_02a.html.
[72] See http://www.wookieweb.com/Intranet/.
[73] D. Voth, “Why Enterprise Portals are the Next Big Thing,” LTI Magazine, October 1, 2002. See http://www.ltimagazine.com/ltimagazine/article/articleDetail.jsp?id=36877.
[74] A. Nyberg, “Is Everybody Happy?” CFO Magazine, November 01, 2002. See http://www.cfo.com/article/1%2C5309%2C8062%2C00.html.
[75] See http://www.proudfoot-plc.com/pdf_20004-USPR1002Avayaweb.asp.
[76] Wall Street Journal, May 4, 2004, p. B1.
[77] pers. comm., Jonathon Houk, Director of DHS IIAP Program, November 2003.
[78] These figures are based on Table 12 and the GDP figures from reference 32. Note, the analysis in this section also ignores business-to-business opportunities, which are also likely significant.
[79] Total grant and procurement amounts are derived from the U.S. Census Bureau, Consolidated Federal Funds Report (CFFR). See http://harvester.census.gov/cffr/asp/Reports.asp.
[80] The number of awards and an analysis of which line items are competitively awarded was derived from the U.S. Census Bureau, Federal Assistance Award Data System (FAADS). See http://www.census.gov/govs/faads/021sumus.htm.
[81] Specific categories of grants were analyzed based on the U.S. General Services Administration’s Catalog of Federal Domestic Assistance (CFDA) definitions to determine degree of competitiveness; see http://12.46.245.173/cfda/cfda.html. Figures from the U.S. Department of Health and Human Services, Grant.gov Clearinghouse (see http://www.grants.gov/) suggest that $350 billion in federal grants is available, but many of the specific grant opportunities are geared to state governments or individuals. That is why the figures shown indicate only $100 billion in competitive opportunities available directly to enterprises.
[82] U.S. General Services Administration, Federal Procurement Data System – NG (FY 2003 data); see http://www.fpdc.gov/fpdc/FPR2003a.pdf and http://www.fpdc.gov/fpdc/FPR2003c.pdf. These sources are also the reference for the number of actions or successful awards. Due to discrepancies, these amounts were adjusted to conform with the totals in reference 79.
[83] Average competitive opportunities are derived by dividing the total award amount by category by the number of awards for that category.
[84] See http://www.gcswin.com/opportunities/opp2.htm. This is the only summary reference for state and local information found. Splits between grants and contract procurements were adjusted based on the assumption that contract amounts differed at the non-federal level. Thus, while the split for grant-contract procurements is about 58%-42% in the federal sector, it is assumed to be 38%-62% at the state and local level.
[85] There may also be some double counting of state amounts due to transfers from the federal government. For example, in 2002, $360,534 million in direct transfers was made to states and localities from the federal government. U.S. Census Bureau, State and Local Government Finances by Level of Government and by State: 2001 – 02. See http://www.census.gov/govs/estimate/0200ussl_1.html.
[86] This analysis assumes that individual grant and contract awards are 80% of the amount shown at the federal level.
[87] To be listed requires a minimum of $10,000 in federal contracts; see http://clinton2.nara.gov/WH/EOP/OP/html/aa/aa06.html.
[88] See http://www.govexec.com/features/0804-15/0804-15s1s1.htm.
[89] This header information is drawn from Table 12.
[90] Number of competing firms is increased from the federal contractor baseline by a factor of 1.30 to account for new state and local government contractors.
[91] Winning and losing proposal preparation costs are based on the empirical percentages from NIST (see reference 93), namely 0.85% and 0.59%, respectively, as a percent of total award amounts.
[92] The ‘Low’ basis for improvements is based on the finding of missing information discussed in a previous section; the ‘High’ basis reflects the difference between lowest quartile and highest quartile efforts spent on successful proposal preparation (see reference 93). The ‘Med’ basis is an intermediate value between these two.
[93] The increase in winning submissions is calculated based on numbers of winning proposals times the RFP improvement factor. In fact, because all things being equal the pool of contract dollars does not change, this amount merely represents a shift of winning awards from existing winners to new winners. In other words, total contracts amounts are a zero-sum game with proposal improvements by previous losers taken from the pool of previous winners.
[94] The analysis in Figure 2 indicates there is a power curve distribution of awards. The number of new winning proposals was applied to this curve to estimate the actual number of new firms winning awards; see Figure 2 for the power-curve fitting equation.
[95] Of course, better probabilities of winning competitive solicitations are a zero-sum game. New winners displace old winners. The real advantage in this arena is to individual firms that better succeed at securing the existing pool of competitive funds. The benefits to individual companies can mean the difference between profitability and loss, or indeed survival.
[96] NFIB, Coping with Regulation, NFIB National Small Business Poll, Vol. 1, Issue 5. See http://www.nfib.com/object/3105105.html.
[97] NFIB, Paperwork and Record-keeping, NFIB National Small Business Poll, Vol. 3, Issue 5. See http://www.nfib.com/object/4131277.html.
[98] W. M. Crain & T. D. Hopkins, “The Impact of Regulatory Costs on Small Firms”, Report to the Small Business Administration, RFP No. SBAHQ-00-R-0027 (2001). The report’s 2000 year basis was updated to 2002 based on a 4% annual inflation factor.
[99] U.S. General Accounting Office, Paperwork Reduction Act: Record Increase in Agencies’ Burden Estimates, testimony of V. S. Rezendes, before the Subcommittee on Energy, Policy, Natural Resources and Regulatory Affairs, Committee on Government Reform, House of Representatives, April 11, 2003. See http://www.reform.house.gov/UploadedFiles/Testimony_GAO_Revised.pdf.
[100] Office of Management and Budget, Managing Information Collection and Dissemination, Fiscal Year 2003, 198 pp. (Table A1). See http://www.whitehouse.gov/omb/inforeg/2003_info_coll_dism.pdf.
[101] NFIB, Paperwork and Record-keeping, NFIB National Small Business Poll, Vol. 3, Issue 5. See http://www.nfib.com/object/4131277.html.
[102] U.S. Small Business Administration, Final Report of the Small Business Paperwork Relief Task Force, June 27, 2003, 64 pp. See http://www.sbaonline.sba.gov/advo/laws/final_paperwork03.pdf.
[103] IRS, Civil Penalties Assessed and Abated, by Type of Penalty and Type of Tax (Table 26), September 20, 2002. See http://www.irs.gov/pub/irs-soi/02db26cp.xls.
[104] Except as footnoted, the figures below are drawn from the OMB Public Budget Tables. Civil penalties for crime victims have been excluded from these figures. See http://www.whitehouse.gov/omb/budget/fy2005/db.html.
[105] Obtained orders in SEC judicial and administrative proceedings requiring securities law violators to disgorge illegal profits of approximately $1.293 billion. Civil penalties ordered in SEC proceedings totaled approximately $101 million. See SEC http://www.sec.gov/pdf/annrep02/ar02enforce.pdf.
[106] T. L. Sansonetti, U.S. Department of Justice, testimony before the House Committee on the Judiciary, Subcommittee on Commercial and Administrative Law, March 9, 2004. See http://www.house.gov/judiciary/sansonetti030904.htm.
[107] Argy, Wiltse & Robinson, Business Insights, Summer 2003, 4 pp. See http://www.awr.com/news_let/Argy%20Summer%202003.pdf.
[108] Project on Government Oversight, Federal Contractor Misconduct: Failures of the Suspension and Debarment System, revised May 10, 2002. See http://www.pogo.org/p/contracts/co-020505-contractors.html.
[109] Corporate Crime Reporter, Top 100 False Claims Act Settlements, December 30, 2003, 64 pp. See http://www.corporatecrimereporter.com/fraudrep.pdf.
[110] According to Alchemia Corporation testimony citing a Price Waterhouse Coopers study, FDA Hearing, Jan. 17, 2002. See http://www.fda.gov/ohrms/dockets/dockets/00d1538/00d-1538_mm00023_01_vol7.doc.
[111] For example, see http://www.medschool.ucsf.edu/curriculum/clinical/guide/section2/confidentiality.asp.
[113] From Table 16 after adjusting by total number of employees for all firms as shown on Table 2, and removal of total burdens as shown in Table 17.
[115] All ‘State and Local’ items are based on the ratio of state and local budgets in relation to the federal budget, excluding direct federal transfers, and applied to those factors for the federal sector. This ratio is 0.563. See http://www.gpoaccess.gov/usbudget/fy01/guide01.html.
[116] All ‘Large Firm’ estimates are based on the ratio of large firm documents to total firm documents; see Table 2.
[117] For example, see http://www.nfr.com/why/mandates.php#gramm
Much has been happening on the Structured Dynamics front of late. Besides welcoming Steve Ardire as a senior advisor to the company, we also have been issuing a steady stream of new products from our semantic Web pipeline.
This new slide show attempts to capture these products and relate them to the various layers in Structured Dynamics’ enterprise product stack:
The show indicates the role of scones, irON, structWSF, UMBEL, conStruct and others and how they leverage existing information assets to enable the semantic enterprise. And, oh, by the way, all of this is done via Web-accessible linked data and our practical technologies.
Enjoy!
OK, well, I just finished moving and upgrading some dozen Web sites and wikis, including this one — my main blog — over the weekend, from fixed stuff to the “clouds”. Believe you me, there were some pretty massive changes required.
For someone like me who is relatively clueless about such things, the process has been interesting (to say the least).
It seems like our modern era either involves moving digital things or converting digital things. As for moving, we all experience that laptop or hard drive dying, and then the move. (The Death of a Laptop actually happened to my wife this past week.) But it also is changing providers and venues — what caused me to move all of these Web sites.
So, the mainstream digital age has existed for what, now, some 40 years? How many data formats have we transitioned (ASCII, EBCDIC, UTF-8, an immense number)? And, how many systems and environments have we transitioned?
At the risk of dating myself, when I was in college we still used slide rules; truly the end of an era. Just a year or two later everyone transitioned to TI or HP calculators, which some wore on their hips like some PDAs and cell phones today.
I won’t bore everyone with my own transition from my first computer (an HP 9100 with 4K RAM and program listings on cash register tapes) through many others including a DEC Rainbow PC with CP/M (a beauty!). For many years, as we moved into the PC era and IBM legitimized the shift, every computer I bought seemed to cost about $3000. Each one was more capable, etc., but they all cost the same.
And, then, about the late 1990s, that changed. In fact, my last capable desktop machine cost way south of $1000.
But, I digress.
What has been the real constant across these decades has been system and data migration. Granted, many of the docs and many of the systems in my own experience from 30 yrs ago have no relevance today (god, do I miss WordPerfect with its embedded, editable codes!), but actually an important minor portion do.
For these, I need to move both apps and data (with readable formats) for each generational transition.
I know that organizations, like the Library of Congress in its NDIIPP program, need to worry about digital preservation, potentially for millennia. These are worthwhile concerns.
But, from my own more prosaic standpoint, I see this issue with my own lens and own bas relief. I am constantly moving apps and data, each transition much like a snake shedding its skin.
It makes one wonder about the effort and process by which the entire meaningful cultural history of our species continues to adapt and transition forward.
Hmmm. All of us have seen these transitions and the loss of productivity they bring in that shift. (Some might argue that the lack of productivity gains from computers until this decade was due to such transitions; at least now, with the Web, we see a more common migration framework.)
I think we have no choice but to transition to the next latest and greatest as it emerges. Automated means at acceptable cost for doing such transitions will also be attractive.
But the real point, I think, is that such transitions are inevitable. Faster apps: Check! Better apps: Check! Easier data exchange: Check!!
Living with transition thus becomes a clear constant for all of us as we move forward. And, part of that is accepting downtime to screw around moving the keepable old to the potentially useful new.
After this weekend, I’m now ready for a couple of days off before the real work week begins (yeah, right, keep dreaming).

I recently wrote about WOA (Web-oriented architecture), a term coined by Nick Gall, and how it represented a natural marriage between RESTful Web services and RESTful linked data. There was, of course, a method behind that posting to foreshadow some pending announcements from UMBEL and Zitgist.
Well, those announcements are now at hand, and it is time to disclose some of the method behind our madness.
As Fred Giasson notes in his announcement posting, UMBEL has just released some new Web services with fully RESTful endpoints. We have been working on the design and architecture behind this for some time and, all I can say is, it’s UMBELievable!
As Fred notes, there is further background information on the UMBEL project — which is a lightweight reference structure based on about 20,000 subject concepts and their relationships for placing Web content and data in context with other data — and the API philosophy underlying these new Web services. For that background, please check out those references; that is not my main point here.
We discussed much in coming up with the new design for these UMBEL Web services. Most prominent was taking seriously a RESTful design and grounding all of our decisions in the HTTP 1.1 protocol. Given the shared approaches between RESTful services and linked data, this correspondence felt natural.
What was perhaps most surprising, though, was how complete and well suited HTTP was as a design and architectural basis for these services. Sure, we understood the distinctions of GET and POST and persistent URIs and the need to maintain stateless sessions with idempotent design, but what we did not fully appreciate was how content and serialization negotiation and error and status messages also were natural results of paying close attention to HTTP. For example, here is what the UMBEL Web services design now embraces:
There are likely other services out there that embrace this full extent of RESTful design (though we are not aware of them). What we are finding most exciting, though, is the ease with which we can extend our design into new services and to mesh up data with other existing ones. This idea of scalability and distributed interoperability is truly, truly powerful.
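To make the content negotiation and status-message points above a bit more concrete, here is a minimal Python sketch of how a RESTful endpoint might choose a serialization from the HTTP `Accept` header and fall back on standard HTTP 1.1 status codes. This is not UMBEL's actual code: the function names (`parse_accept`, `negotiate`) and the list of supported media types are assumptions made purely for illustration.

```python
# Hypothetical sketch only; not the UMBEL implementation or its API.
# SUPPORTED is an assumed list of serializations a service might offer.
SUPPORTED = ["application/rdf+xml", "text/rdf+n3", "application/json"]


def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        q = 1.0  # HTTP default quality value when none is given
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not acceptable"
        prefs.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def negotiate(accept_header, method="GET"):
    """Return (status_code, content_type) for a request."""
    if method not in ("GET", "POST"):
        return 405, None  # Method Not Allowed
    for media_type, q in parse_accept(accept_header or "*/*"):
        if q <= 0:
            continue
        if media_type == "*/*":
            return 200, SUPPORTED[0]  # client accepts anything: use the default
        if media_type in SUPPORTED:
            return 200, media_type
    return 406, None  # Not Acceptable: no serialization matches


print(negotiate("text/html;q=0.9, application/json;q=0.5"))
print(negotiate("text/html"))
```

The sketch deliberately ignores partial wildcards such as `application/*`; the point is only that the protocol itself already supplies the vocabulary (the `Accept` header, 200, 405, 406) for these decisions.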
It is almost like, sure, we knew the words and the principles behind REST and a Web-oriented architecture, but had really not fully taken them to heart. As our mindset now embraces these ideas, we feel like we have now looked clearly into the crystal ball of data and applications. We very much like what we see. WOA is most cool.
For lack of a better phrase, Zitgist has a component internal plan that it calls its ‘Grand Vision’ for moving forward. Though something of a living document, this reference describes how Zitgist is going about its business and development. It does not describe our markets or products (of course, other internal documents do that), but our internal development approaches and architectural principles.
Just as we have seen a natural marriage between RESTful Web services and RESTful linked data, there are other natural fits and synergies. Some involve component design and architecting for pipeline models. Some involve the natural fit of domain-specific languages (DSLs) to common terminology and design, too. Still others involve use of such constructs in both GUIs and command-line interfaces (CLIs), again all built from common language and terminology that non-programmers and subject matter experts alike can readily embrace. Finally, some reflect a preference for Python to wrap legacy apps and to provide a productive scripting environment for DSLs.
If one can step back a bit and realize there are some common threads to the principles behind RESTful Web services and linked data, that very same mindset can be applied to many other architectural and design issues. For us, at Zitgist, these realizations have been like turning on a very bright light. We can see clearly now, and it is pretty UMBELievable. These are indeed exciting times.
BTW, I would like to thank Eric Hoffer for the very clever play on words with the UMBELievable tag line. Thanks, Eric, you rock!
Zotero has long been one of my favorite Firefox plug-ins, being a productive and trusted sidekick for collecting and reporting my voluminous citation and bibliographic data. I think perhaps my review of Zotero from January 2007 was one of my most glowing write-ups.
If you go to the Zotero home page, you will see at the lower left the steady increase of functionality that has come out in this free and open source tool. For example, Zotero now supports more than 1100 bibliographic sources, can capture Web pages and many standard Web sources, and has MS Office and WordPress support. Zotero has been developed and is distributed by the Center for History and New Media at George Mason University.
According to the Courthouse News Service with a copy of this complaint filed September 5, Thomson Reuters is suing George Mason University and, as a state institution, the Commonwealth of Virginia, for $10 million in damages and an injunction on further distribution of a beta version of Zotero. Thomson is seeking a jury trial.
Thomson claims that a July 8 beta release of Zotero (version 1.5) included a new feature to read and convert Thomson’s 3,500 plus proprietary .ens style files within the EndNote software into free, open source Zotero .csl files. Thomson claims this is in direct violation of GMU’s current license for EndNote. The Zotero beta release introduces a server-side synchronization function; the standard Zotero release without this feature and the EndNote support is version 1.07.
EndNote is a proprietary and popular citation software used by many academics and researchers. EndNote has very similar functionality to Zotero. It allows users to search online bibliographic databases, organize them, and store and re-format citations in various publication styles. Single user licenses are $250 with volume and academic discounts available. Thomson claims “millions” of ultimate users.
Thomson Reuters is also the firm behind the Open Calais named entity extraction service noted much in the semantic Web community (and which this week announced a commercial version).
File format ingest and conversions have long been a mainstay of interoperable software systems. This lawsuit will bear close monitoring.
Hat tip to Rafael Sidi for this link.
I’m pleased to present a timeline of 100 or so of the most significant events and developments in the innovation and management of information and documents from cave paintings (ca. 30,000 BC) to the present. Click on the link to the left or on the screen capture below to go to the actual interactive timeline.
This timeline has fast and slow scroll bands — including bubble popups with more information and pictures for each of the entries offered. (See the bottom of this posting for other usage tips.)
Note the timeline only presents non-electronic innovations and developments from alphabets to writing to printing and information organization and conventions. Because there are so many innovations and they are concentrated in the last 100 years or fewer, digital and electronic communications are somewhat arbitrarily excluded from the listing.
I present below some brief comments on why I created this timeline, some caveats about its contents, and some basic use tips. I conclude with thanks to the kind contributors.
Readers of this AI3 blog or my detailed bio know that information — biological embodied in genes, or cultural embodied in human artefacts — has been my lifelong passion. I enjoy making connections between the biological and cultural with respect to human adaptivity and future prospects and I like to dabble on occasion as an amateur economic or information science historian.
About 18 months ago I came across David Huynh’s nifty Exhibit lightweight data display widget, gave it a glowing review, and then proceeded to convert my growing Sweet Tools listing of semantic Web and related tools to that format. Exhibit still powers the listing (which I just updated yesterday for the twelfth time or so).
At the time of first rolling out Exhibit I also noted that David had earlier created another lightweight timeline display widget that looked similarly cool (and which was also the first API for rendering interactive timelines in Web pages). (In fact, Exhibit and Timeline are but two of the growing roster of excellent lightweight tools from David.) Once I completed adopting Exhibit, I decided to find an appropriate set of chronological or time-series data to play next with Timeline.
I had earlier been ruminating on one of the great intellectual mysteries of human development: Why, roughly beginning in 1820 to 1850 or so, did the historical economic growth patterns of all prior history suddenly take off? I first wrote on this about two years ago in The Biggest Disruption in History: Massively Accelerated Growth Since the Industrial Revolution, with a couple of follow-ups and expansions since then.
I realized that in developing my thesis that wood pulp paper and mechanized printing were the key drivers for this major inflection change in growth (as they affected literacy and the broadscale access to written information) I already had the beginnings of a listing of various information innovations throughout history. So, a bit more than a year ago, I began adding to that list in terms of how humans learned to write, print, share, organize, collate, reproduce and distribute information and when those innovations occurred.
There are now about 100 items in this listing (I’m still looking for and researching others; please send suggestions at any time). Here are some of the current items in chronological order from upper left to lower right:
| cave paintings | codex | footnotes | microforms |
| ideographs | woodblock printing | copyrights | thesaurus |
| calendars | tree diagram | encyclopedia | pencil (mass produced) |
| cuneiform | quill pen | capitalization | rotary perfection press |
| papyrus (paper) | library catalog | magazines | catalogues |
| hieroglyphs | movable type | taxonomy (binomial classification) | typewriter |
| ink | almanacs | statistics | periodic table |
| alphabet | paper (rag) | timeline | chemical pulp (sulfite) |
| Phaistos Disc | word spaces | data graphs | classification (Dewey) |
| logographs | registers | card catalogs | linotype |
| maps | intaglio | lithography | mimeograph machine |
| scrolls | printing press | punch cards | kraft process (pulp) |
| manuscripts | advertising (poster) | steam-powered (mechanized) papermaking | flexography |
| glossaries | bookbinding | book (machine-paper) | classification (LoC) |
| dictionaries | pagination | chemical symbols | classification (UDC) |
| parchment (paper) | punctuation | mechanical pencil | offset press |
| bibliographies | library catalog (printed) | chromolithography | screenprinting |
| concept of categories | public lending library | paper (wood pulp) | ballpoint pen |
| library | dictionaries (alphabetic) | rotary press | xerographic copier |
| classification system (library) | newspapers | mail-order catalog | hyperlink |
| zero | Information graphics | fountain pen | metadata (MARC) |
| paper | scientific journal |
So, off and on, I have been working with and updating the data and display of this timeline in draft. (I may someday also post my notes about how to effectively work with the Timeline widget.)
With the listing above, completion was sufficient to finally post this version. One of the neat things with Timeline is the ability to drive the display from a simple XML listing. I will update the timeline when I next have an opportunity to fill in some of the missing items still remaining on my innovations list such as alphabetization, citations, and table of contents, among many others.
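For readers curious what driving the display from a simple XML listing might look like, here is a small Python sketch that turns a list of innovations into an event file. The `<data>` root with `<event>` children carrying `start` and `title` attributes follows the layout I understand Timeline's examples to use, but treat the exact attribute set and the bare-year date format as assumptions. The helper name `build_timeline_xml` is hypothetical, and the two sample entries are illustrative stand-ins rather than rows copied from my actual data file.

```python
import xml.etree.ElementTree as ET


def build_timeline_xml(entries):
    """Build an XML event listing from (title, year, description) tuples.

    The <data>/<event> layout and the bare-year 'start' value are
    assumptions for illustration, not a verified Timeline schema.
    """
    root = ET.Element("data")
    for title, year, description in entries:
        event = ET.SubElement(
            root, "event", {"start": str(year), "title": title}
        )
        event.text = description  # body text shown in the popup bubble
    return ET.tostring(root, encoding="unicode")


# Illustrative entries only (not taken from the posted timeline data).
sample = [
    ("periodic table", 1869, "Illustrative entry for a classification innovation."),
    ("classification (Dewey)", 1876, "Illustrative entry for a library innovation."),
]

print(build_timeline_xml(sample))
```

Keeping the data in a flat listing like this is what makes it easy to add the still-missing items later without touching the display code.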
Of course, rarely can an innovation be traced to a single individual or a single moment in time. Historians are increasingly documenting the cultural milieu and multiple individuals that affect innovation.
In these regards, then, a timeline such as this one is simplistic and prone to much error and uncertainty. We have no real knowledge, for example, of the precise time certain historical innovations occurred, and others (the ballpoint pen being one case in point) are a matter of interpretation as to what and when constituted the first expression. For instances where the record indicated multiple dates, I chose to use the date when released to the public.
Nonetheless, given the time scales here of more than 30,000 years, I do think broad trends and rough time frames can be discerned. As long as one interprets this timeline as indicative and not meant as definitive in any scholarly sense, I believe this timeline can inform and provide some insight and guidance for how information has evolved over human history.
The operation of Timeline is pretty straightforward and intuitive. Here are a couple of tips to get a bit more out of playing with it:
For the sake of consistency, nearly all entries and pictures on the timeline are drawn from the respective entries within Wikipedia. Subsequent updates may add to this listing by reference to original sources, at which time all sources will be documented.
The timeline icons are from David Vignoni’s Nuvola set, available under the LGPL license. Thanks David!
The fantastic Timeline was developed by David Huynh while he was a graduate student at MIT. Timeline and its sibling widgets were developed under funding from MIT’s Simile program. Thanks to all in the program and best wishes for continued funding and innovation.
Finally, my sincere thanks go to Professor Michael Buckland of the School of Information at the University of California, Berkeley, for his kind suggestions, input and provision of additional references and sources. Of course, any errors or omissions are mine alone. I also thank Professor Buckland for his admonitions about use and interpretation of the timeline dates.
The recent LinkedData Planet conference in NYC marked, I think, a real transition point. The conference signaled the beginning movement of the Linked Data approach from the research lab to the enterprise. As a result, there was something of a schizophrenic aspect at many different levels to the conference: business and research perspectives; realists and idealists; straight RDF and linked data RDF; even the discussions in the exhibit area versus some of the talks presented from the podium.
As with any new concept, my sense was that there was a struggle around terminology and common language, and a need to bridge different perspectives and world views. As in all human matters, communication and dialog were at the core of the attendees’ attempts to bridge gaps and find common ground. Based on what I saw, much great progress occurred.
The reality, of course, is that Linked Data is still very much in its infancy, and its practice within the enterprise is just beginning. Much of what was heard at the conference was theory versus practice and use cases. That should and will change rapidly.
In an attempt to help move the dialog further, I offer a definition and Zitgist’s perspective to some of the questions posed in one way or another during the conference.
Sources such as the four principles of Linked Data in Tim Berners-Lee’s Design Issues: Linked Data and the introductory statements on the Linked Data Wikipedia entry approximate — but do not completely express — an accepted or formal or “official” definition of Linked Data per se. Building from these sources and attempting to be more precise, here is the definition of Linked Data used internally by Zitgist:
All references to Linked Data below embrace this definition.
I’m sure many other questions were raised, but listed below are some of the more prominent ones I heard in the various conference Q&A sessions and hallway discussions.
Yes. Though other approaches can also model the first order predicate logic of subject-predicate-object at the core of the Resource Description Framework data model, RDF is the one based on the open standards of the W3C. RDF and FOL are powerful because of simplicity, ability to express complex schema and relationships, and suitability for modeling all extant data frameworks for unstructured, semi-structured and structured data.
No. Linked Data represents a set of techniques applied to the RDF data model that names all objects as URIs and makes them accessible via the HTTP protocol (as well as other considerations; see the definition above and further discussion below).
Some vendors and data providers claim Linked Data support, but if their data is not accessible via HTTP using URIs for data object identification, it is not Linked Data. Fortunately, it is relatively straightforward to convert non-compliant RDF to Linked Data.
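As a rough first test of the naming half of that requirement, the sketch below flags subjects and predicates that are not HTTP URIs. The function names and sample data are hypothetical, and treating `https` as acceptable is my assumption. Passing this check says nothing about whether the URIs actually resolve; that still has to be tested against the live server.

```python
from urllib.parse import urlparse


def is_http_uri(term: str) -> bool:
    """True when the term is an absolute http(s) URI with a host part."""
    parts = urlparse(term)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def non_http_terms(triples):
    """Return subjects and predicates that are not HTTP URIs.

    Objects are skipped because they may legitimately be literals.
    """
    offenders = set()
    for subject, predicate, _obj in triples:
        for term in (subject, predicate):
            if not is_http_uri(term):
                offenders.add(term)
    return sorted(offenders)


sample = [
    ("http://example.org/book/1", "http://purl.org/dc/terms/title", '"Dune"'),
    ("urn:isbn:0451450523", "http://purl.org/dc/terms/title", '"Untitled"'),
]
```

Here `non_http_terms(sample)` reports the `urn:isbn:` identifier: it is valid RDF, but it cannot be looked up over HTTP, so it falls short of Linked Data as described above.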
There are some excellent references for how to publish Linked Data. Examples include a tutorial, How to Publish Linked Data on the Web, and a white paper, Deploying Linked Data, using the example of OpenLink’s Virtuoso software. There are also recommended approaches and ways to use URI identifiers, such as the W3C’s working draft, Cool URIs for the Semantic Web.
However, there are not yet published guidelines for how to meet the Zitgist definition above, which also places an emphasis on class and context matching. A number of companies and consultants, including Zitgist, presently provide such assistance.
The key principles, however, are to make links aggressively between data items with appropriate semantics (properties or relations; that is, the predicate edges between the subject and object nodes of the triple) using URIs for the object identifiers, all being exposed and accessible via the HTTP Web protocol.
Absolutely not, though this is a source of some confusion at present.
The Semantic Web is probably best understood as a vision or goal where semantically rich annotation of data is used by machine agents to make connections, find information or do things automatically in the background on behalf of humans. We are on a path toward this vision or goal, but under this interpretation the Semantic Web is more of a process than a state. By understanding that the Semantic Web is a vision or goal we can see why a label such as ‘Web 3.0’ is perhaps simplistic and incomplete.
Linked Data is a set of practices somewhere in the early middle of the spectrum from the initial Web of documents to this vision of the Semantic Web. (See my earlier post at bottom for a diagram of this spectrum.)
Linked Data is here today, doable today, and pragmatic today. Meaningful semantic connections can be made and there are many other manifest benefits (see below) with Linked Data, but automatic reasoning in the background or autonomic behavior is not yet one of them.
Strictly speaking, then, Linked Data represents doable best practices today within the context both of Web access and of this yet unrealized longer-term vision of the Semantic Web.
Definitely not, though early practice has been interpreted by some as such.
One of the stimulating, but controversial, keynotes of the conference was from Dr. Anant Jhingran of IBM, who made the strong and absolutely correct observation that Linked Data requires the interplay and intersection of people, instances and schema. From his vantage, early exposed Linked Data has been dominated by instance data from sources such as Wikipedia and has lacked the schema (class) relationships that enterprises are based upon. The people aspect, in terms of connections, collaboration and joint buy-in, is also the means for establishing trust in and authority for the data.
In Zitgist’s terminology, class-level mappings ‘explode the domain’ and produce information benefits similar to Metcalfe’s Law as a function of the degree of class linkages [1]. While this network effect is well known to the community, it has not yet been shown much in current Linked Data sets. As Anant pointed out, schemas define enterprise processes and knowledge structures. Demonstrating schema (class) relationships is the next appropriate task for the Linked Data community.
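For readers who want a number to hang on the network effect: Metcalfe’s Law is usually stated as value growing with the square of the number of connected nodes, which follows from counting the distinct pairs that can be linked. The sketch below only does that counting. Treating linked classes as the “nodes” is my illustrative assumption, not a measured result.

```python
def pairwise_links(n: int) -> int:
    """Number of distinct pairs among n nodes, i.e. n * (n - 1) / 2."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * (n - 1) // 2
```

Doubling the nodes from 10 to 20 raises the possible pairings from 45 to 190, a bit more than fourfold, which is the quadratic growth the law refers to.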
In an RDF context, “ontologies” are the vocabularies and structures that capture the schema structures noted above. Ontologies embody the class and instance definitions and the predicate (property) relations that enable legacy schemas and data to be transformed into Linked Data graphs.
Though many public RDF vocabularies and ontologies presently exist, and should be re-used where possible and where the semantics match the existing legacy information, enterprises will require specific ontologies reflective of their own data and information relationships.
Despite the newness or intimidation perhaps associated with the “ontology” term, ontologies are no more complex — indeed, are simpler and more powerful — than the standard relational schema familiar to enterprises. If you’d like, simply substitute schema for ontology and you will be saying the same thing in an RDF context.
Neither, really, though the rationale and justification for Linked Data is grounded in federating widely disparate sources of data that can also vary widely in existing formalism and structure.
Because Linked Data is a set of techniques and best practices for expressing, exposing and publishing data, it can easily be applied to either centralized or federated circumstances.
However, the real world where any and all potentially relevant data can be interconnected is by definition a varied, distributed, and therefore federated world. Because of its universal RDF data model and Web-based techniques for data expression and access, Linked Data is the perfect vehicle, finally, for data integration and interoperability without boundaries.
The simple case is where two data sources refer to the exact same entity or instance (individual) with the same identity. The standard sameAs predicate is used to assert the equivalence in such cases.
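A minimal sketch of what that equivalence buys you: the code below groups identifiers that are connected, directly or transitively, by `owl:sameAs` statements, using a small union-find. The example URIs are hypothetical and the `same_as_groups` function is my own illustration; only the `owl:sameAs` URI itself is the standard one.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_groups(triples):
    """Group URIs linked, directly or transitively, by owl:sameAs."""
    parent = {}

    def find(node):
        # Locate the representative of node's group, halving paths as we go.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for subject, predicate, obj in triples:
        if predicate == OWL_SAME_AS:
            root_s, root_o = find(subject), find(obj)
            if root_s != root_o:
                parent[root_o] = root_s

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical identifiers for one city in three different datasets.
triples = [
    ("http://example.org/id/berlin", OWL_SAME_AS, "http://example.com/city/Berlin"),
    ("http://example.com/city/Berlin", OWL_SAME_AS, "http://example.net/places/42"),
    ("http://example.org/id/berlin", "http://example.org/population", '"3400000"'),
]
```

Once the three identifiers fall into one group, any fact asserted about one of them can be read as a fact about all of them.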
The more important case is where the data sources are about similar subjects or concepts, in which case a structure of well-defined reference classes is employed. Furthermore, if these classes can themselves be expressed in a graph structure capturing the relationships amongst the concepts, we now have some fixed points in the conceptual information space for relating and tying together disparate data. Still further, such a conceptual structure also provides the means to relate the people, places, things, organizations, events, etc., of the individual instances of the world to one another as well.
Any reference structure that is composed of concept classes that are properly related to each other may provide this referential “glue” or “backbone”.
One such structure provided in open source by Zitgist is the 21,000 subject concept node structure of UMBEL, itself derived from the Cyc knowledge base. In any event, such broad reference structures may often be accompanied by more specific domain conceptual ontologies to provide focused domain-specific context.
No, absolutely not.
While, to date, it is the case that Linked Data has been demonstrated using public Web data and many desire to expose more through the open data movement, there is nothing preventing private, proprietary or subscription data from being Linked Data.
The Linking Open Data (LOD) group, formed about 18 months ago to showcase Linked Data techniques, began with open data. As a parallel concept to sever the idea that Linked Data only applies to open data, François-Paul Servant has specifically identified Linking Enterprise Data (and see also the accompanying slides).
For example, with Linked Data (and not the more restrictive LOD sense), two or more enterprises or private parties can legitimately exchange private Linked Data over a private network using HTTP. As another example, Linked Data may be exchanged on an intranet between different departments, etc.
So long as the principles of URI naming, HTTP access, and linking predicates where possible are maintained, the approach qualifies as Linked Data.
Absolutely yes, without reservation. Indeed, non-transactional legacy data perhaps should be expressed as Linked Data in order to gain its manifest benefits. See #14 below.
Of course. Since Linked Data can be applied to any data formalism, source or schema, it is perfectly suited to integrating data from inside and outside the firewall, open or private.
The basic query language for Linked Data is SPARQL (pronounced “sparkle”), which closely resembles SQL but applies to an RDF data graph. The actual datastores applied to RDF may also add a fourth aspect to the tuple for graph namespaces, which can bring access and scale efficiencies. In these cases, the system is known as a “quad store”. Additional data filtering techniques may be applied prior to the SPARQL query for further efficiencies.
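To show what the fourth element adds, here is a toy in-memory “quad store” with a pattern matcher in which `None` plays the role of a SPARQL variable. The graph names, data and `match` function are hypothetical, and real quad stores index this far more cleverly; the point is only that a graph name lets a query be scoped to one named graph.

```python
NAME = "http://xmlns.com/foaf/0.1/name"
HR_GRAPH = "http://example.org/graph/hr"    # hypothetical graph names
CRM_GRAPH = "http://example.org/graph/crm"

# Each quad is (subject, predicate, object, graph).
quads = [
    ("http://example.org/emp/1", NAME, '"Ann"', HR_GRAPH),
    ("http://example.org/emp/2", NAME, '"Raj"', HR_GRAPH),
    ("http://example.org/cust/9", NAME, '"Ann"', CRM_GRAPH),
]


def match(quads, s=None, p=None, o=None, g=None):
    """Return quads matching the pattern; None matches anything."""
    pattern = (s, p, o, g)
    return [
        quad for quad in quads
        if all(want is None or want == got for want, got in zip(pattern, quad))
    ]


# Roughly what this SPARQL query asks for:
#   SELECT ?s ?o WHERE {
#     GRAPH <http://example.org/graph/hr> {
#       ?s <http://xmlns.com/foaf/0.1/name> ?o
#     }
#   }
hr_names = match(quads, p=NAME, g=HR_GRAPH)
```

Restricting on the graph position keeps the customer named Ann out of the HR results without any filtering on the triple itself.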
Templated SPARQL queries and other techniques can lead to very efficient and rapid deployment of various Web services and reports, two techniques often applied by Zitgist and other vendors. For example, all Zitgist DataViewer views and UMBEL Web services are expressed using such SPARQL templates.
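The general idea of a templated query can be sketched with nothing more than Python’s standard `string.Template`. This is not Zitgist’s actual template code; the concept URI, the `render_query` name and the choice of `skos:prefLabel` are my assumptions for illustration.

```python
from string import Template

# A fixed query shape with two slots: the concept URI and a row limit.
CONCEPT_LABEL_QUERY = Template("""\
SELECT ?label WHERE {
  <$concept_uri> <http://www.w3.org/2004/02/skos/core#prefLabel> ?label .
}
LIMIT $limit
""")

# Characters that may not appear inside a SPARQL <...> IRI reference.
_FORBIDDEN = set('<>"{}|^`\\ ')


def render_query(concept_uri: str, limit: int = 10) -> str:
    """Fill the template, refusing URIs that could break out of the <...>."""
    if any(ch in _FORBIDDEN or ord(ch) < 0x20 for ch in concept_uri):
        raise ValueError("unsafe character in concept URI")
    return CONCEPT_LABEL_QUERY.substitute(concept_uri=concept_uri, limit=int(limit))


query = render_query("http://example.org/concept/Mammal", limit=5)
```

Because the query shape is fixed and only the slots vary, one template can sit behind a Web service or report and be reused for every concept. The character check matters as soon as the URI comes from user input.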
This SPARQL templating approach may also be combined with the use of templating standards such as Fresnel to bind instance data to display templates.
In Zitgist’s view, access control or security occurs at the layer of the HTTP access and protocols, and not at the Linked Data layer. Thus, the same policies and procedures that have been developed for general Web access and security are applicable to Linked Data.
However, standard data level or Web server access and security can be enhanced by the choice of the system hosting the data. Zitgist, for example, uses OpenLink’s Virtuoso universal server that has proven and robust security mechanisms. Additionally, it is possible to express security and access policies using RDF ontologies as well. These potentials are largely independent of Linked Data techniques.
The key point is that there is nothing unique or inherent to Linked Data with respect to access or control or security that is not inherent with standard Web access. If a given link points to a data object from a source that has limited or controlled access, its results will not appear in the final results graph for those users subject to access restrictions.
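The last point can be simulated in a few lines. In the sketch below, `fetch` is a stand-in for an HTTP GET against a data source, returning `None` where a real server would answer with a 401 or 403. The sources, roles and function names are all hypothetical; actual enforcement lives in the Web server, not in code like this.

```python
# Hypothetical sources: one public, one restricted to the "hr" role.
SOURCES = {
    "http://example.org/public/products": {
        "restricted_to": None,
        "triples": [("http://example.org/product/7", "http://example.org/price", '"19.99"')],
    },
    "http://example.org/private/salaries": {
        "restricted_to": "hr",
        "triples": [("http://example.org/emp/1", "http://example.org/salary", '"50000"')],
    },
}


def fetch(source_uri, roles):
    """Stand-in for an HTTP GET; None models a 401/403 response."""
    source = SOURCES[source_uri]
    required = source["restricted_to"]
    if required is not None and required not in roles:
        return None
    return source["triples"]


def results_graph(source_uris, roles):
    """Merge triples from every source the caller is allowed to read."""
    graph = []
    for uri in source_uris:
        triples = fetch(uri, roles)
        if triples is not None:
            graph.extend(triples)
    return graph
```

A caller without the `hr` role simply gets a smaller graph. Nothing in the Linked Data layer needs to know why.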
For more than 30 years — since the widespread adoption of electronic information systems by enterprises — the Holy Grail has been complete, integrated access to all data. With Linked Data, that promise is now at hand. Here are some of the key enterprise benefits to Linked Data, which provide the rationales for adoption:
Linked Data is well suited to traditional knowledge base or knowledge management applications. Its near-term application to transactional or material process applications is less apparent.
Of special use is the value-added from connecting existing internal and external content via the network effect from the linkages [1].
Johnnie Linked Data is starting to grow up. Our little semantic Web toddler is moving beyond ga-ga-goo-goo to saying his first real sentences. Language acquisition will come rapidly, and, like what all of us have seen with our own children, he will grow up faster than we can imagine.
There were so many at this meeting who had impact and meaning for this exciting transition point that I won’t list specific names, at the risk of leaving other names off. Those of you who made so many great observations or stayed up late interacting with passion know who you are. Let me simply say: Thanks!
The LinkedData Planet conference has shown, to me, that enterprises are extremely interested in what our community has developed and now proven. They are asking hard questions and will be difficult task masters, but we need to listen and respond. The attendees were a selective and high-quality group, understanding of their own needs and looking for answers. We did an OK job of providing those answers, but we can do much, much better.
I reflect on these few days now knowing something I did not truly know before: the market is here and it is real. The researchers who have brought us to this point will continue to have much to research. But, those of us desirous of providing real pragmatic value and getting paid for it, can confidently move forward knowing both the markets and the value are real. Linked Data is not magic, but when done with quality and in context, it delivers value worth paying for.
To all of the fellow speakers and exhibitors, to all of the engaged attendees, and to the Jupitermedia organizers and Bob DuCharme and Ken North as conference chairs, let me add my heartfelt thanks for a job well done.
The next LinkedData Planet conference and expo will be October 16-17, 2008, at the Santa Clara Hyatt in Santa Clara, California. The agenda has not been announced, but hopefully we will see a continuing enterprise perspective and some emerging use cases.
Zitgist as a company will continue to release and describe its enterprise products and services, and I will continue to blog on Linked Data matters of specific interest to the enterprise. Pending topics include converting legacy data to Linked Data, converting relational data and schema to Linked Data, placing context to Linked Data, and many others. We think you will like the various announcements as they arise.
Zitgist is also toying with the use of a distinctive icon to indicate the availability of Linked Data conforming to the principles embodied in the questions above. (The color choice is an adoption of the semantic Web logo from the W3C.) The use of a distinctive icon is similar to what RSS feeds or microformats have done to alert users to their specific formats. Drop me a line and let us know what you think of this idea.
UMBEL is today releasing a new sandbox for its first iteration of Web services. The site is being hosted by Zitgist. All are welcomed to visit and play.
UMBEL (Upper-level Mapping and Binding Exchange Layer) is a lightweight reference structure for placing Web content and data in context with other data. It comprises about 21,000 subject concepts and their relationships, both with one another and with external vocabularies and named entities.
Each UMBEL subject concept represents a defined reference point for asserting what a given chunk of content is about. These fixed hubs enable similar content to be aggregated and then placed into context with other content. These subject context hubs also provide the aggregation points for tying in their class members, the named entities which are the people, places, events, and other specific things of the world.
The backbone to UMBEL is the relationships amongst these subject concepts. It is this backbone that provides the contextual graph for inter-relating content. UMBEL’s subject concepts and their relationships are derived from the OpenCyc version of the Cyc knowledge base.
The UMBEL ontology is based on RDF and written in the RDF Schema vocabulary of SKOS (Simple Knowledge Organization System) with some OWL Full constructs to aid interoperability.
UMBEL’s backbone is also a reference structure for more specific domains or ontologies, thereby enabling further context for inter-relating additional content. Much of the sandbox shows these external relationships.
This first set of Web services provides online demo sandboxes, descriptions of what the services are about, and their API documentation. The first 11 services are:
The single service that provides the best insight to what UMBEL is all about is the Subject Concept Detailed Report. (That is probably because this service is itself an amalgam of some of the others.)
Starting from a single concept amongst the 21,000, in this case ‘Mammal’, we can get descriptions or definitions (the proper basis for making semantic relationships, not the ‘Mammal’ label), aliases and semsets, equivalent classes (in OWL terms), named entities (for leaf concepts), more general or specific external classes, and domain and range relationships with other ontologies. Here is the sample report for ‘Mammal’:
The discerning eye likely observes that while there is a rich set of relationships to the internal UMBEL subject concepts, coverage is still light for external classes and named entities. This sandbox is, after all, a first release and we are early in the mapping process.
But, it should also start to become clear that the ability of this structure to map and tie in all forms of external concepts and class structures is phenomenal. Once such class relationships are mapped (to date, most other Linked Data only occurs at the instance level), all external relationships and properties can be inherited as well. And, vice versa.
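The inheritance point can be illustrated with a toy example. Suppose two external ontologies each map one of their classes to a reference concept such as ‘Mammal’; then instances from both become retrievable through that concept and through everything above it. The prefixed names, mappings and functions below are hypothetical shorthand of my own, not actual UMBEL data or services.

```python
# Class-level mappings: each class points at its more general classes.
SUBCLASS_OF = {
    "zoo:Dog": ["umbel:Mammal"],      # mapping from a hypothetical zoo ontology
    "pets:Cat": ["umbel:Mammal"],     # mapping from a hypothetical pets ontology
    "zoo:Parrot": ["umbel:Bird"],
    "umbel:Mammal": ["umbel:Animal"],
    "umbel:Bird": ["umbel:Animal"],
}

# Instance-level typing, as found in each source dataset.
TYPE_OF = {
    "zoo:rex": "zoo:Dog",
    "pets:tom": "pets:Cat",
    "zoo:polly": "zoo:Parrot",
}


def superclasses(cls, subclass_of):
    """All classes reachable upward from cls (transitive closure)."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def instances_of(target, type_of, subclass_of):
    """Instances whose class is target or any subclass of it."""
    return sorted(
        instance for instance, cls in type_of.items()
        if cls == target or target in superclasses(cls, subclass_of)
    )
```

Two class mappings are enough for `instances_of("umbel:Mammal", TYPE_OF, SUBCLASS_OF)` to pull together a dog from one dataset and a cat from another, without a single instance-level link between the two sources.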
So, for aficionados of the network effect, stand back! You ain’t seen nothing yet. If we have seen amazing emergent properties arising from the people and documents on the Web, with data we move to another quantum level, like moving from organisms to cells. The leverage of such concept and class structures to provide coherence to atomic data is literally primed to explode.
To put it mildly, trying to get one’s mind around the idea of 21,000 concepts and all of their relationships and all of their possible tie in points and mappings to still further ontologies and all of their interactions with named entities and all of their various levels of aggregation or abstraction and all of their possible translations into other languages or all of their contextual descriptions or all of their aliases or synonyms or all of their clusterings or all of their spatial relationships or all of the still more detailed relationships and instances in specific domains or, well, whew! You get the idea.
It is all pretty complex and hard to grasp.
One great way to wrap one’s mind around such scope is through interactive visualization. The first UMBEL service to provide this type of view is the Subject Concept Explorer, a screenshot of which is shown here:
But really, to gain the true feel, go to the service and explore for yourself. It feels like snorkeling through those schools of billions of tiny silver fish. Very cool!
These amazing visualizations are being brought to us by Moritz Stefaner, imho one of the best visualization and Flash gurus around. We will be showcasing more about Moritz’s unbelievable work in some forthcoming posts, where some even cooler goodies will be on display. His work is also on display at a couple of other sites that you can spend hours drooling over. Thanks, Moritz!
You should note that developer access to the actual endpoints and external exposure of the subject concepts as Linked Data are not yet available. The endpoints, Linked Data and further technical documentation will be forthcoming shortly.
The currently displayed services and demos provided on this UMBEL Web services site are a sandbox for where the project is going. Upcoming releases will soon provide, as open source under an attribution license:
When we hit full stride, we expect to be releasing still further new Web services on a frequent basis.
BTW, for more technical details on this current release, see Fred Giasson’s accompanying post. Fred is the magician who has brought much of this forward.