Evolution
AI³
Adaptive Information
Adaptive Innovation
Adaptive Infrastructure
a·dap·tive adj. Showing or having a capacity to make fit for new or special situations; flexible; a successful adjustment.

Blogasbörd (cloud version):
Send Email   Get SIOC Profile   Get FOAF Profile   Syndicate full contents for this site using RSS 20
Main Links
Categories
Calendar
March 2010
S M T W T F S
« Feb    
 123456
78910111213
14151617181920
21222324252627
28293031  
Archives
More . . .  
Search
 
Sponsored Links
-->
Affiliations
structWSF
Credits
Blog software courtesy of WordPress Obtain Technorati profile Subscribe with Bloglines
View Mike's profile on LinkedIn
Date:   March 12, 2010

Friday Brown Bag Lunch

Today, in the advanced knowledge economy of the United States, the information contained within documents represents about a third of total gross domestic product, or an amount of about $3.3 trillion annually.

Yet our understanding of the value of documents and the means to manage them is abysmal. These failures impact enterprises of all sizes from the standpoints of revenues, profitability and reputation. Continued national productivity growth — and thus the wealth of all citizens — depends critically on understanding and managing these document values.

As this white paper describes, the lack of a compelling and demonstrable common understanding of the importance of documents is in itself a major factor limiting available productivity benefits. There is an old Chinese saying that roughly translated is “what cannot be measured, cannot be improved.” Many corporate officers may believe this to be the case for document creation and productivity, but, as this paper shows, in fact many of these document issues can be measured.

Friday Brown Bag Lunch This Friday brown bag leftover was first placed into the AI3 refrigerator on July 20, 2005. No changes have been made to the original posting.

I’d like to thank David Siegel for recently highlighting this post from 5 years ago with nice kudos on his PowerOfPull blog. That reference is what caused me to dust off the cobwebs from this older piece.

To wit, some 25% of all of the annual trillions of dollar spent on document creation costs lend themselves to actionable improvements:

U.S. FIRMS

$ Million

%
Cost to Create Documents

$3,261,091

Benefits
Benefits to Finding Missed or Overlooked Documents

$489,164

63%

Benefits to Improved Document Access

$81,360

10%

Benefits of Re-finding Web Documents

$32,967

4%

Benefits of Proposal Preparation and Wins

$6,798

1%

Benefits of Paperwork Requirements and Compliance

$119,868

15%

Benefits of Reducing Unauthorized Disclosures

$51,187

7%

Total Annual Benefits

$781,314

100%

PER LARGE FIRM

$ Million

Cost to Create Documents

$955.6

Benefits to Finding Missed or Overlooked Documents

$143.3

Benefits to Improving Document Access

$23.8

Benefits of Re-finding Web Documents

$9.7

Benefits of Proposal Preparation and Wins

$2.0

Benefits of Paperwork Requirements and Compliance

$35.1

Benefits of Reducing Unauthorized Disclosures

$15.0

Total Annual Benefits

$229.0

Table 1. Mid-range Estimates for the Annual Value of Documents, U.S. Firms, 2002[1]

The total benefit from improved document access and use to the U.S economy is on the order of $800 billion annually, or about 8% of GDP. For the 1,000 largest U.S. firms, benefits from these improvements can approach nearly $250 million annually per firm. About three-quarters of these benefits arise from not re-creating the intellectual capital already invested in prior document creation. About one-quarter of the benefits are due to reduced regulatory non-compliance or paperwork, or better competitiveness in obtaining solicited grants and contracts.

Indeed, even these figures likely severely underestimate the benefits to enterprises from an improved leverage of document assets. It has always been the case that the best and most successful companies have been able to make better advantage of their intellectual assets than their competitors. The competitiveness advantage from better document access and use alone may exceed the huge benefits in the table above.

Documents — that is, unstructured and semi-structured data — are now at the point where structured data was at 15 years ago. At that time, companies realized that consolidating information from multiple numeric databases would be a key source of competitive advantage. That realization led to the development and growth of the data warehousing or business intelligence markets, now representing about $3.9 billion in annual software sales.

Search and enterprise content management software today only represents a fraction of that amount — perhaps on the order of $500 million annually. But given that intellectual content in documents represents three to four times the amount in numeric structured data, it is clear that document software capabilities are not being well utilized, reaching only a small fraction of their market potential.

The estimates provided in this white paper are drawn from numerous sources and are extremely fragmented, perhaps even inconsistent. One hope in preparing this document was to stimulate more research attention and data gathering around the critical issues of document value to the enterprise and the economy at large.

EXECUTIVE SUMMARY

I. INTRODUCTION

Documents: The Drivers of a Knowledge Economy

Documents: The Linchpin of Corporate Intellectual Assets

Documents: Unknown Value, Huge Implications

Documents: The Next Generation of Data Warehousing?

Connecting the Dots: A Pointillistic Approach

II. INTERNAL DOCUMENTS

Number of ‘Valuable’ Documents Produced per Firm

Total Annual U.S. ‘Costs’ to Create Documents

‘Cost’ of Creating a ‘Typical’ Document

‘Cost’ of a Missed or Overlooked Document

Other Document Total ‘Cost’ Factors and Summary

Archival Lifetime of ‘Valuable’ Documents

III. WEB DOCUMENTS AND SEARCH

Estimate of Time and Effort Devoted to Document Search

Effect of Non-persistent Search Efforts

‘Cost’ of Creating and Maintaining a Document Category Portal

‘Cost’ of Inaccessible or Hidden Intranet Sites

IV. OPPORTUNITIES AND THREATS

‘Costs’ and Opportunity Costs of Winning Proposals

‘Costs’ of Regulation and Regulatory Non-compliance

‘Cost’ of an Unauthorized Posted Document

V. CONCLUSIONS

I. INTRODUCTION

How many documents does your organization create each year? What effort does this represent in terms of total staffing costs? What does it cost to create a ‘typical’ document? Of documents created, how much of the value in them is readily sharable throughout your organization? How long do you need to keep valuable documents and how can you access them? How much existing document content is re-created simply because prior work cannot be found? When prior information is missed, what do these prior investments in documents represent in terms of loss of market share, revenue or reputation? Indeed, what does the term, “document” represent in your organization’s context?

If you have difficulty answering these questions, you are not alone. Depending on the survey, from 90% to 97% of enterprises cannot answer these questions — in whole or in part. The purpose of this white paper is to provide the first comprehensive assessment ever of these document values.

Enterprises and the analyst community have historically overlooked the impact of document creation as opposed to document handling. Document creation is about 2-3 times more important — from an embedded cost standpoint — than document handling. Second, all aspects of document creation, and later access and use, assume a much greater role in the overall economics of enterprises than have been realized previously.

Documents: The Drivers of a Knowledge Economy

Put your index finger one inch from your nose. That is how close — and unfocused — document importance is to an organization. Documents are the salient reality of a knowledge economy, but like your finger, documents are often too close, ubiquitous and commonplace to appreciate.

How do your employees earn their livings? Writing proposals? Marketing or selling? Evaluating competitors or opportunities? Persuading? Analyzing? Communicating? Teaching? Of course, in some sectors, many make their living from growing things or making things. These are essential jobs — indeed, until the last few decades were the predominant drivers of economies — but are now being supplanted in advanced economies by knowledge work. Perhaps up to 35% of all company employees in the U.S. can be classified as knowledge workers.

And knowledge work means documents. The fact is that knowledge is produced and communicated through the written word. When we search, when we write, when we persuade, we may often do so verbally but make it persistent through the written word.

Documents: The Linchpin of Corporate Intellectual Assets

IBM estimates that corporate data doubles every six to eight months, 85% of which are documents.[2] At least 10% of an enterprise’s information changes on a monthly basis.[3] Year-on-year office document growth rates are on the order of 22%.[4] As later analysis indicates, there are perhaps on the order of 10 billion documents created annually in the U.S with a mid-range “asset” value of $3.3 trillion per year. Documents are a huge contributor to the United States’ gross domestic product of $10.5 trillion (2002).

  • According to a Coopers & Lybrand study in 1993:[5]
  • Ninety percent of corporate memory exists on paper
  • Ninety percent of the papers handled each day are merely shuffled
  • Professionals spend 5-15 percent of their time reading information, but up to 50 percent looking for it
  • On average, 19 copies are made of each paper document.

A Xerox Corporation study commissioned in 2003 and conducted by IDC surveyed 1000 of the largest European companies and had similar findings:[6],[7]

  • On average 45% of an executive’s time was spent dealing with documents
  • 82% believe that documents were crucial to the successful operation of their organizations
  • A further 70% claimed that poor document processes could impact the operational agility of their organizations
  • While 83%, 78% and 76% consider faxes, email and electronic files as documents, respectively, only 48% and 46% categorize web pages and multimedia content as such.

Documents: Unknown Value, Huge Implications

But, if defining what constitutes a document is hard, identifying the costs associated with all the document activities is almost impossible for many organizations. Ninety to 97 percent of the corporate respondents to the Coopers & Lybrand and Xerox studies, respectively, could not estimate how much they spent on producing documents each year. Almost three quarters of them admit that the information is unavailable or unknown to them.

An A.T. Kearney study sponsored by Adobe, EDS, Hewlett-Packard, Mayfield and Nokia, published in 2001, estimated that workforce inefficiencies related to content publishing cost organizations globally about $750 billion. The study further estimated that knowledge workers waste between 15% to 25% of their time in non-productive document activities.[8]

Enterprise document use (SPIN)

Figure 1. The Situation of Poor Enterprise Document Use Leads to Real Implications

But the situation is much broader and results in part from the inability to quantify the importance of both internal and external document assets to all aspects of the enterprise’s bottom line. For examples drawn from the main body of this white paper, early adopters of enterprise content software typically capture less than 1% of valuable internal documents available; large enterprises are witnessing the proliferation of internal and external Web sites, sometimes exceeding thousands; use of external content is presently limited to Internet search engines, producing non-persistent results and no capture of the investment in discovery or results; and “deep” content in searchable databases, which is common to large organizations and represents 90% of external Internet content, is completely untapped.

A USC study reported that typically only 32% of employees in knowledge organizations have access to good information about technical developments relevant to their work, and 79% claim they have inadequate information about what their competitors are doing.[9]

The enterprise content integration software market is fragmented and confused, with only a few established companies providing partial solutions. Content integration is still a small market with annual revenues of less than $50 million worldwide.[10] Vendor offerings fail to satisfy customer needs because of a lack of functionality and a lack of scalability to enterprise volumes. Sales in the market remain distinctly lower than those projected by industry analysts, even as the magnitude of “information overload” continues to grow at a dramatic rate.

Documents: The Next Generation of Data Warehousing?

Documents — that is, unstructured and semi-structured data — are now at the point where structured data was at 15 years ago. At that time, companies realized that consolidating information from multiple numeric databases would be a key source of competitive advantage. That realization led to the development and growth of the data warehousing or business intelligence markets, now representing about $3.9 billion in annual software sales.[11]

Certain categories of businesses have been leaders in content integration, especially those that have recently had mergers and acquisitions activity, those that need to integrate business applications with content, and those for which the reuse of marketing assets across the organization is critical.10

Stonebraker and Hellerstein have provided an insightful roadmap for how enterprise data integration or “federation” has trended over time: Data warehousing → Enterprise application integration → Enterprise content integration → Enterprise information integration.[12] There are two threads to this trend. First, there has been a growing recognition of the importance of document (unstructured) content to contribute to actionable information. Second, increasingly unified and integrated means are being applied to all data sources to allow single-access retrievals.

Connecting the Dots: A Pointillistic Approach

The state of information regarding the value and cost of documents is extremely poor. Lack of defensible and vetted estimates for this information undercuts the ability to properly estimate the intellectual assets tied up in documents or the impacts of overlooked or misused documents.

Only three large document studies — the Coopers & Lybrand, Xerox and A.T. Kearney studies noted above — have been conducted in the past ten years regarding the use and importance of documents within enterprises, and then solely from the standpoint of executive perceptions.

The quantified picture presented in this white paper regarding the costs and benefits of document creation, access and use is a paint-by-the-numbers assemblage of disparate data. The paper draws upon about 80 different data sources, many fragmented. The analysis approach by necessity has needed to conjoin assumptions and data from many diverse sources.

This approach leads to both uncertainty regarding “true” values and likely inaccuracies or mis-estimates in some areas. To make the assessment as consistent as possible, a base year of 2002 was used, the common year reference for most of the available data sources. To bracket uncertainties, most estimates are provided in low, medium and high estimates.

Thus, this study should be viewed as preliminary, but strongly indicative of the value of documents. Further research and data collection will surely refine these estimates. Clearly, though, by any measure, the value of documents to the enterprise is significant and huge, and should not continue to be overlooked.

II. INTERNAL DOCUMENTS

Though valuable content resides everywhere, the first challenge to enterprises is getting a handle on their own internal document content.

Number of ‘Valuable’ Documents Produced per Firm

A recent UC Berkeley study on “How Much Information?” estimated that more than 4 billion pages of internal office documents with archival value are generated annually in the U.S. (Note: this is not the amount created, only those documents deemed worthy of retaining for more than one year).

Firm Size (employees)

1-9

10-19

20-99

100-499

500-999

1000-2500

2500-9999

>10,000

Firms

3,716,944

616,064

518,258

85,304

8,572

5,161

2,704

930

Employees

12,328,094

8,274,541

20,370,447

16,410,367

5,906,266

7,894,226

12,519,664

31,357,579

Knowledge Workers

2,217,093

1,488,099

3,663,435

2,951,251

1,062,187

1,419,703

2,251,545

5,639,368

Number of Pages  – Low

465,842,666

312,670,737

769,739,697

620,099,840

223,180,542

298,299,744

473,081,537

1,184,911,325

Number of Pages  – High

1,164,606,665

781,676,843

1,924,349,242

1,550,249,599

557,951,355

745,749,360

1,182,703,842

2,962,278,313

Number of Docs  – Low

46,584,267

31,267,074

76,973,970

62,009,984

22,318,054

29,829,974

47,308,154

118,491,133

Number of Docs- High

116,460,666

78,167,684

192,434,924

155,024,960

55,795,135

74,574,936

118,270,384

296,227,831

Docs/Firm  – Low

13

51

149

727

2,604

5,780

17,496

127,410

Docs/Firm  – High

31

127

371

1,817

6,509

14,450

43,739

318,525

Docs/Firm – 3 yr Low

38

152

446

2,181

7,811

17,340

52,487

382,229

Docs/Firm – 5 yr High

157

634

1,857

9,087

32,545

72,249

218,695

1,592,623

Content Management Workers

105,709

70,951

174,670

140,713

50,644

67,690

107,352

268,881

CMWs/Firm

0

0

0

2

6

13

40

289

Table 2. Document Projections for U.S. Firms by Size, 2002 Basis

Sources: UC Berkeley[13], U.S. Commerce Department[14], U.S. Bureau of Labor Statistics[15], U.S. Census Bureau[16]

Table 2 and Table 3 attempt to summarize the scale of this challenge for U.S. firms (for internal enterprise documents only). (See[17] for a description of methodology regarding document scales, note[18] for estimating the numbers of enterprise knowledge workers, and note[19] for estimating content workers. A rough multiplier of 3x to 4x can be applied to extrapolate globally.[20]) Breakouts are provided by size of firm; these include estimates for the number of knowledge and content workers within U.S. firms.

Category

Value

Firms

4,953,937

Employees

127,273,960

Knowledge Workers

20,692,680

Annual Number of Docs – Low

9,291,013,320

Annual Number of Docs- High

21,739,130,435

Annual Docs/Firm – Low

1,875

Annual Docs/Firm – High

4,388

Total Docs/Firm – 3 yr Low

1,990

Total Docs/Firm – 5 yr High

5,601

Content Management Workers

986,610

CMWs/Firm

0.2

Table 3. Total Annual Document Projections for U.S. Firms, 2002 Basis

Table 4 takes this information and breaks out distribution of document production for a ‘typical’ knowledge worker according to major document types. The data from this table is based on analysis of dozens of BrightPlanet customers averaged across about 10 million documents in various repositories.

% Based On

All

Unique

MBs

KB/Page

Pg/Doc

Pages

Docs

MBs

Pages

Archival Documents (3 yrs)
DOC

281

59

20

10.5

2,938

52%

36%

50%

PDF

46

28

14

43.6

2,017

9%

17%

34%

PPT

32

26

55

14.6

474

6%

16%

8%

XLS

178

51

100

2.7

484

33%

31%

8%

Weighted

537

164

28

11.0

5,912

100%

100%

100%

Current Documents (I yr)
DOC

221

71

20

5.1

1,127

49%

35%

32%

PDF

66

36

14

24.7

1,634

15%

18%

46%

PPT

53

76

55

12.9

687

12%

38%

20%

XLS

108

17

100

0.6

70

24%

8%

2%

Weighted

449

199

57

7.8

3,517

100%

100%

100%

Total per Employee
DOC

502

129

20

8.1

4,065

51%

36%

43%

PDF

112

64

14

32.5

3,650

11%

18%

39%

PPT

86

102

55

13.5

1,161

9%

28%

12%

XLS

285

68

100

1.9

554

29%

19%

6%

Weighted

986

363

39

9.6

9,430

100%

100%

100%

Table 4. Document Production for a ‘Typical’ Knowledge Worker

Note that word processed documents account for about 50% of typical production and storage demands. However, also note that documents of the highest archival value, as converted to PDFs for sharing and deployment, also represent about a third to two-fifths of stored documents.

Total Annual U.S. ‘Costs’ to Create Documents

Based on the information from Table 2 to Table 4 above, all updated to a common year 2002 basis, we can now estimate the total annual costs in the U.S. for creating all internal enterprise documents. The analysis is based on the UC Berkeley information and the Coopers & Lybrand studies. The “bottom up” case is based on the number of annual U.S. documents estimated based on Table 2. These results are shown in the table below:

Annual U.S. Office Documents

Number (M)

$/Document

Total $ (B)

“Bottom Up” – Low

1,387

$738.58

$1,024

“Bottom Up” – High

7,242

$141.43

$1,024

Coopers & Lybrand

11,975

$272.33

$3,261

C&L – UCB

27,737

$272.33

$7,554

C&L – “Bottom Up”

4,315

$272.33

$1,175

Average

10,531

$384.11

$3,253

Table 5. Annual U.S. Office Document Cost Estimates[21]

The average numbers above represent the average of the unique values in each column. The Table 5 analysis suggests there may be on the order of 10 billion documents created annually in the U.S with a total “asset” value on the order of $3.3 trillion per year.

‘Cost’ of Creating a ‘Typical’ Document

Based on the averages in the table above, a ‘typical’ document may cost on the order of $380 each to create.[22] Of course, a “document” can vary widely in size, complexity and time to create, and therefore its individual cost and value will vary widely. An invoice generated from an automated accounting system could be a single page and produced automatically in the thousands; proposals for very large contracts can take tens of thousands to millions of dollars to create. For examples, here are some other ‘typical’ costs for a variety of documents:

Ave. Cost

‘Typical’ Document

$384.11

Invoice

$4.43

[23]
Mortgage Application

$210.00

[24]
‘Typical’ Proposal

$17,500.00

[25]

Table 6. ‘Typical’ per Document Creation Costs

Depending on document mix and activities, individual enterprises may want to vary the average document creation costs used in their cost-benefit estimates.

‘Cost’ of a Missed or Overlooked Document

The Coopers & Lybrand study suggests that 7.5 percent of all documents are lost forever, and that it costs $120 in labor ($150 updated to 2002) to find a misfiled document;[26] other studies suggest that 5% to 6% of documents are routinely misplaced or misfiled.

In fact, the extent of this problem is unknown and is affirmed by the Xerox results:[27]

  • Almost three quarters of corporate respondents admit that the information is unavailable or unknown to them
  • 95% of the companies are not able to estimate the cost of wasted or unused documents
  • On average 19% of printed documents were wasted.

Other Document Total ‘Cost’ Factors and Summary

Five independent studies suggest that, on average, organizations spend from 5% to 15% of total company revenue on handling documents.27,[28],[29],[30],[31] These seemingly innocuous percentages can translate into huge bottom-line impacts for U.S. enterprises. For example, the total GDP of the United States was on the order of $10.5 trillion at the end of 2002.[32] Translating this value into the results of Table 5 and the information in previous sections indicates the importance of document creation and handling for U.S enterprises:

Low

Medium

High

Total U.S. Gross Domestic Product ($B)

$10,487

$10,487

$10,487

Total Document Handling ($B)

$524

$1,049

$1,573

% of total GDP:

5.0%

10.0%

15.0%

Total Document Creation ($B)

$1,100

$3,261

$7,554

% of total GDP:

10.5%

31.1%

72.0%

Total Document Misfiled ($B)

$32

$81

$160

% of total GDP:

0.3%

0.8%

1.5%

ALL U.S. Document Burdens ($B)

$1,656

$4,390

$9,287

% of total GDP:

15.8%

41.9%

88.6%

Table 7. Range Estimates for Total U.S. Document Burdens in Enterprises, 2002[33]

A few observations relate to this table. First, enterprises and the analyst community have greatly overlooked the impact of document creation as opposed to document handling. Document creation is about 2-3 times more important  – from an embedded cost standpoint  – than document handling. Second, all aspects of document creation assume a much greater role in the overall economics of enterprises than has been realized previously.

The fact that documents have received so little management attention, awareness, measurement and direct attention to improve performance is shocking.

Archival Lifetime of ‘Valuable’ Documents

The ‘low’ and ‘high’ estimates for documents in Table 2 and Table 3 assume that 2% and 5%, respectively, of internal documents have archival value. Were these percentages to be higher, the volume of documents requiring integration and access would likewise increase. The 2% value is derived from the UC Berkeley study,[34] which also refers to an unpublished European study that places archival amounts at 10%. Unfortunately, there is little empirical information to support the degree to which documents deserve to be kept for archival purposes.

Assuming that documents may retain value for three to five years, the largest firms perhaps have as many as 4 million internal documents on average with enterprise-wide value. Firms with fewer employees generally have lower document counts. Archival percentages, however, are a tricky matter, since apparently 85% of all archived documents are accessed.[35]

III. WEB DOCUMENTS AND SEARCH

Various estimates by Cowles/Simba,[36] Veronis, Suhler & Associates,[37] and Outsell[38] place the current market for on line business information in the $30 billion to $140 billion range, with significant projected growth. Outsell also indicates that marketing, sales, and product development professionals rely most heavily on information from the Internet for their daily decision making, based on a comparative study of Fortune 500 business professionals’ use of the open Web and fee-based desktop information content services.[39] Clearly, relevant and targeted content, much of which resides on line, has extreme value to enterprises.

UC Berkeley estimates that about 500 petabytes of new information was published on the Web in 2002,34 based on original analysis conducted by BrightPlanet.[40] The compound growth rate in Web documents has been on the order of more than 200% annually.[41] Estimates for deep Web content range from about 6-8 times larger [42] to 500 times larger40 than standard “surface web” content. The size of Internet content is overwhelming, of highly variable quality, growing at a rapid pace, and with much of its content ephemeral.

Estimate of Time and Effort Devoted to Document Search

According to a recent study by iProspect, about 56 percent of users use search engines every day, based on a population of which more than 70 percent use the Internet more than 10 hours per week. Professionals abandon a current search 38% of the time after inspecting only one results page (the listing of document result URLs), and overall 82% of users attempt another search if relevant results are not found within the first three results pages. Just 13 percent of users said that they use different search engines for different types of searches.[43] Only 7.5 percent of Internet users said they refined their search with additional keywords in cases where they were unable to achieve satisfactory results.[44]

The average knowledge worker spends 2.3 hrs per day  – or about 25% of work time  – searching for critical job information.[45] IDC estimates that enterprises employing 1,000 knowledge workers waste well over $6 million per year each in searching for information that does not exist, failing to find information that does, or recreating information that could have been found but was not.[46] As that report stated, “It is simply impossible to create knowledge from information that cannot be found or retrieved.”

Vendors and customers often use time savings by knowledge workers as a key rationale for justifying a document or content initiative. This comes about because many studies over the years have noted that white collar employees spend a consistent 20% to 25% of their time seeking information; the premise is that more effective search will save time and drop these percentages. As a sample calculation, each 1% reduction in time devoted to search produces:

$50,000 (base salary) * 1.8 (burden rate) * 1.0% = $900/ employee

The stable percentage effort devoted to search over time suggests it is the “satisficing” allocation. (In other words, knowledge workers are willing to devote a quarter of their time to finding relevant information.) Thus, while better tools to aid better discovery may lead to finding better information and making better decisions more productively  – a far more important justification in itself  – there may not result a strict time or labor savings from more efficient search.[47]

Effect of Non-persistent Search Efforts

The percentage of Web page visits that are re-visits is estimated at between 58%[48] and 80%.[49] While many of these re-visitations occur shortly after the first visit (e.g., during the same session using the back button), a significant number occur after a considerable amount of time has elapsed. Thus, it is not surprising that a survey of problems using the Web found “Not being able to find a page I know is out there,” and “Not being able to return to a page I once visited,” accounted for 17% of the problems reported, and that the most common problem using bookmarks was, “Changed content.”[50] Depending on the content type, users use either “direct” or “indirect” approaches to re-find previously discovered information:

Direct

Indirect

Specific Information

42%

58%

General Information

58%

43%

Specific Documents

29%

71%

Web Documents

77%

23%

Emails

9%

91%

Table 8. General Approaches to Re-finding Previously Discovered Information [51]

Direct approaches require remembering or specifically noting the specific location of the information. Direct approaches include: direct entry; emailing to self; emailing to others; printing out; saving as file; pasting the URL into a document; and posting to a personal Web site.

Indirect approaches include: searching; looking through bookmarks; and recalling from a history file. All of these indirect approaches are supported by modern browsers. Note that re-finding Web pages or documents relies heavily on having a record of a previously visited URL.

As a University of Washington study supported by Microsoft discovered, all of the specific direct and indirect techniques applied to these re-discovery approaches have significant drawbacks in terms of desired functions for the recall process: [52]

Portability No of Access Points Persistence Preservation Currency Context Reminding Ease of Integration Communication Ease of Maintenance

DIRECT APPROACHES

Direct Entry

Low

High

Low

Med

High

Low

Low

?

Low

High

Email to Self

Low

High

Low

Med

High

High

High

Med

Low

Med

Email to Others

Low

High

Low

Med

High

High

Low

Low?

High

High

Print-out

High

High

High

Low

Low

Low

High

Med

High

Med

Save as File

Med?

Low?

High

High

Low

Low

Low

Med?

Low

Med

Paste URL in Doc

Low

Low?

Low

Med

High

High

High?

High?

Low

High

Personal Web Site

Low

High

Low

Med

High

High

High?

High

Med

High?

INDIRECT APPROACHES

Search

Low

High

Low

Med

High

Low

Low

?

Low

High

Bookmark

Low

Low

Low

Med

High

Low

Low

Low

Low

Low

History

Low

Low

Low

Med

High

Low

Low

Low?

Low

?

Table 9. Strengths and Weakness of Existing Techniques to Re-use Web Information

The general observation is that no present technique is able alone to keep search persistent, current or maintain context. These combined inadequacies mean that previously found information is not easily found again, or re-discovered, as the following table shows:

Percent

Information No Longer Available

37%

Re-tracing Path Fails

14%

Time Length Since Last Find

9%

Other Failure Reasons

9%

Total Information Lost

68%

Success Finding Lost Information

32%

Table 10. Success in Finding Important Earlier Found Web Information [53]

This table has a number of important observations. First, some 37% of previously found information disappears from the Web, consistent with other findings that estimate about 40% of all Web content disappears annually, some of which has historical or archival value.[54]

Second, and most importantly, nearly 70% of previously found valuable information cannot be rediscovered again. More than half of this problem is because the information is no longer available on the Web, but other reasons relate to the inadequacies of recall techniques for finding previously discovered information.

These observations can translate into some relatively huge costs on a per employee and per enterprise basis, as the table below shows:

Per Knowledge Worker

Per ‘Large’

All

Per Doc

All Docs

Enterprise ($000)

Enterprises ($M)

Re-finding Documents

$148.54

$585

$3,547

$12,103

Re-creating Documents

$384.11

$1,008

$6,114

$20,864

TOTAL

$1,593

$9,661

$32,967

Table 11. ‘Cost’ of Not Readily Re-finding Valuable Web Information

This analysis assumes that some previously found information of value is again re-found (60%), but some is also not re-found and must be re-created (40%).[55] The ‘large’ enterprise is identical to the definition in Table 2 (which is also nearly equivalent to a Fortune 1000 company).[56]

The analysis indicates that poor methods to recall previously found and valuable Web documents may cost $1,600 per knowledge worker per year. This translates into nearly a $10 million productivity loss for the largest enterprises, or nearly $33 billion across all U.S. industries.

In relation to the total document costs noted in Table 7 above, these may seem to be comparatively small numbers. However, when viewed in the context of unproductive standard Web search, they indicate important failings in the ability to recall previously found valuable results from searches and their attendant productivity losses.

‘Cost’ of Creating and Maintaining a Document Category Portal

Users, administrators and industry analysts alike recognize the importance of placing content into logical, intuitive and hierarchically organized categories. About 60% of knowledge workers note that search is a difficult process, made all the more difficult without a logical organization to content.[57] While technical distinctions exist, these logical structures organized into a hierarchical presentation are most often referred to as “taxonomies,” though other terms such as ontology, subject directory, subject tree, directory structure or classification schema may be used.

Delphi Group’s research with corporate Web sites points to the lack of organized information as the number one problem in the opinion of business professionals. More than three-quarters of the surveyed corporations indicated that a taxonomy or classification system for documents is imperative or somewhat important to their business strategy; more than one-third of firms that classify documents still use manual techniques.57 Hierarchical arrangements of categorized subjects trigger associations and relationships that are not obvious when simply searching keywords. Other advantages cited for the taxonomic presentation of documents are the greater likelihood of discovery, ease-of-use, overcoming the difficulty of formulating effective search queries, being able to search only within related documents, discovery of relationships among similar terminology and concepts, and user satisfaction.[58],[59]

From the user standpoint, knowledge workers want to impose taxonomic order on document chaos, but only if the taxonomy models their domain accurately. They also want software to assist with categorizing, as long as it respects the taxonomy they created. Finally, the results of these category placements should be presented via a portal. Thus, as the common concern across all requirements, the taxonomy takes on tremendous importance for an application’s success.[60]

Large firm documents

Figure 2. Typical Large Firm Documents, Thousands

Enterprises that have adopted directory structures for content management are not yet achieving enterprise-wide relevance, presenting on average 1% of all relevant documents in an organized portal view. These limitations appear to be driven by weaknesses in the technology and high costs associated with conventional approaches:

  • Comprehensiveness and Scale – according to a market report published by Plumtree in 2003, the average document portal contains about 37,000 documents.[61] This was an increase from a 2002 Plumtree survey that indicated average document counts of 18,000.[62] However, about 60% of respondents to a Delphi Group survey said they had more than 50,000 internal documents in their portal environment (generally the department level), 3 and as Table 2 indicates above, most of the largest firms likely have millions or more internal documents deserving of common access and archiving.
  • The left-hand bar in Figure 2 indicates current averages for documents in existing content portals. The right-hand (yellow and orange) bar indicates potential based on high and low estimates. The ‘Archive’ case (middle bar) show the same values as provided in Table 2, and represent a conservative view of “archival-likely” documents. The right bar is a more representative view of actual current internal content that enterprises may want to make available to their employees.[63] Two observations have merit: 1) under current practice, enterprises are at most making 10% of their useful documents available, and more likely slightly over 1%; 2) the documents that are being made available are solely internal, and neglect potentially important external sources that would increase document counts considerably.
  • Implementation Times – though average time to stand-up a new content installation is about 6 months, there is also a 22% risk that deployment times exceeds that and an 8% risk it takes longer than one year. Furthermore, internal staff necessary for initial stand-up average nearly 14 people (6 of whom are strictly devoted to content development), with the potential for much larger head counts[64]
  • Ongoing Maintenance and Staffing Costs – ongoing maintenance and staffing costs typically exceed the initial deployment effort. This trend is perhaps not surprising in that once a valuable content portal has been created there will be demands to expand its scope and coverage. Based on these various factors, Table 12 summarizes set-up, ongoing maintenance and key metrics for today’s conventional approaches versus what BrightPlanet can do (the BrightPlanet document count is based on a ‘typical’ installation; there are no practical scale limits)

DOCUMENT

INITIAL SET-UP

MAINTENANCE

BASIS

Staff

Mos

$/Doc

Staff

$/Doc

Current Practice

37,000

6.2

5.4

$4.861

6.4

$11.278

BrightPlanet

250,000

1.0

0.8

$0.017

0.3

$0.078

BP Advantage

6.8 x + up

6.2 x

6.7 x

280.4 x

21.4 x

144.6 x

Table 12. Staff, Time and per Document Costs for Categorized Document Portals

  • The content staff level estimates in the table are consistent with anecdotal information and with a survey of 40 installations that found there were on average 14 content development staff managing each enterprise’s content portal.[65]

Though conventional approaches to content integration seem to lead to high per document set-up and maintenance costs, these should be contrasted with standard practice that suggests it may cost on average $25 to $40 per document simply for filing.29 Indeed, labor costs can account for up to 30% of total document handling costs.28 Nonetheless, at $5 to $11 per document for content management alone, this could result in no actual cost savings if electronic access does not displace current filing practices. When multiplied across all enterprise documents, these uncertainties can translate into huge swings in costs or benefits for a content portal initiative.

  • Software License v. Full Project Costs – according to Charles Phillips of Morgan Stanley, only 30% of the money spent on major software projects goes to the actual purchase of commercially packaged software. Another third goes to internal software development by companies. The remaining 37% goes to third-party consultants.[66] In evaluating a commitment, internal staff and consulting time should be carefully scrutinized. Efficiencies in initial deployment and ongoing support are the biggest cost drivers
  • Internal PLUS External Sources – weaknesses in scalability and high implementation costs often lead to a dismissal of the importance of integrating internal plus external content. Few installations address relevant content external to the enterprise essential to achieving its missions. Granted, the increase in scales associated with external content are large, but for some businesses integration with external content may be essential.

While other vendors claim fast categorization times, what they fail to mention is the lengthy pre-processing times necessary for generating their categorization metatags. According to Forrester Research, some of these metatagging systems can only process five to 15 documents per hour![67]

‘Cost’ of Inaccessible or Hidden Intranet Sites

In 2003, the portal vendor Plumtree noticed a new trend that it called “Web sprawl,” by which it meant the costly proliferation of Web applications, intranets and extranets.[68] BEA has taken up this trend as a major thrust to its Web service offerings through an approach it calls “enterprise portal rationalization” (EPR).[69] According to BEA, its architectural offerings are meant to control the “metastasizing” of corporate Web sites.

How common and to what scale is the proliferation of enterprise Web sites? I have not been able to find any comprehensive studies on this topic, but has been able to find many anecdotal examples. The proliferation, in fact, began as soon as the Internet became popular:

  • As reported in 2000, Intel had more than 1 million URLs on its intranet with more than 100 new Web sites being introduced each month[70]
  • In 2002, IBM consolidated over 8,000 intranet sites, 680 ‘major’ sites, 11 million Web pages and 5,600 domain names into what it calls the IBM Dynamic Workplaces, or W3 to employees[71]
  • Silicon Graphics’ ‘Silicon Junction’ company-wide portal serves 7,200 employees with 144,000 Web pages consolidated from more than 800 internal Web sites[72]
  • Hewlett-Packard Co., for example, has sliced the number of internal Web sites it runs from 4,700 (1,000 for employee training, 3,000 for HR) to 2,600, and it makes them all accessible from one home, @HP [73],[74]
  • Avaya Corporation is now consolidating more than 800 internal Web sites globally[75]
  • The Wall Street Journal recently reported that AT&T has 10 information architects on staff to maintain its 3,600 intranet sets that contain 1.5 million public Web pages[76]
  • The new Department of Homeland Security is faced with the challenge of consolidating more than 3,000 databases inherited from its various constituent agencies.[77]

BrightPlanet’s customers confirm these trends, with indicators of hundreds if not thousands of internal Web sites common in the largest companies. Indeed, it is surprising how many instances there are where corporate IT does not even know the full extent of Web site proliferation. The problem is likely much greater than realized:

Low

Med

High

Number of Large Firms

930

1,500

3,000

Ave Number of Web Sites per Firm

100

500

900

Ave. Number of Documents per Web Site

100

350

1,500

Total Large Firm Web Sites

93,000

750,000

2,700,000

Percentage of Known Web Sites

85%

60%

40%

Percentage of Doc Federation for Known Sites

50%

10%

2%

Site Development & Maintenance
Development Cost per Web Site

$300

$1,701

$9,000

Annual Maintenance Cost per Site

$800

$3,947

$21,000

Total Yr 1 Cost per Site

$1,100

$5,649

$30,000

Total Yr 1 per Large Firm Costs ($000)

$110

$2,824

$27,000

Total Yr 1 Large Firm Costs ($M)

$102

$4,237

$81,000

‘Cost’ of Unfound Documents
No. of Unknown Documents per Firm

5,750

80,500

820,800

Total Number of Large Firm Unknown Docs

5,347,500

120,750,000

2,462,400,000

Total Cost per Web Site

$6,900

$23,915

$350,310

Cost of Unknown Docs per Firm ($000)

$690

$11,958

$315,279

Total Cost of Large Firm Unknown Docs ($M)

$642

$17,937

$945,837

Summary
Total Cost per Firm ($000)

$800

$14,782

$342,279

Total Cost all Large Firms ($M)

$744

$22,173

$1,026,837

Development as % of Total Costs

14%

19%

8%

Unfound Documents as % of Total Costs

86%

81%

92%

Table 13. Development and Unfound Document ‘Costs’ for Large Firms due to Web Sprawl

Table 13 consolidates previous information to estimate what the ‘costs’ of Web sprawl might be to larger firms (analogous to the Fortune 1000). The table presents Low, Medium and High estimates for number of Web sites per firm, known and unknown documents in each, and associated costs for initial site development and first-year maintenance plus the value of unfound information. The Medium category uses the average values from previous tables. The Low and High values bracket these amounts based on distribution of known values and expert judgment.

The table indicates as a mid-range estimate that an individual Web site for a large enterprise may cost about $6,000 to set-up and maintain in the first year and represents $24,000 in opportunity costs due to unknown or unfound documents. For the average large enterprise across all Web sites, these costs may be $4.2 million and $12.0 million, respectively. Across all large firms, total costs due to Web sprawl may be on the order of $22 billion.

While site development and maintenance costs are not trivial, exceeding $4 billion for all large firms (which can also be significantly reduced  – see previous section), the major cost impact comes from the inability to find or federate the information that is available. Unfound documents represent well in excess of 80% of the costs associated with Web sprawl.

The Web sprawl situation is analogous to other major technology shifts. For example, in the early 1980s, IT grappled mightily with the proliferation of personal computers. Centralized control was impossible in that circumstance because individuals and departments recognized the productivity benefits to be gained by PCs. Only when enterprise-capable vendors of networking technology, such as Novell, were able to offer integration solutions was the corporation able to control and fully exploit the PC’s technology potential.

The proliferation of internal enterprise Web sites is responding to similar drivers: innovation, customer service, or superior methods of product or solutions delivery. Ambitious mid-level managers will continue to exploit these advantages by “cowboy” additions of more corporate Web sites, and that is likely to the good for most enterprises. Gaining control and fully realizing the value of this Web site proliferation  – while not stymieing innovation  – will likely require enabling technology analogous to the networking of PCs.

IV. OPPORTUNITIES AND THREATS

The previous analysis has focused on more-or-less direct costs and drivers. These impacts are huge and deserve proper consideration. But there are other implications from the inability to access and manage relevant document information. These implications fall into the categories of lost opportunities, liabilities, or non-compliance. These implications often far outweigh the direct costs in their bottom-line impacts. This section presents only a few of these many opportunities.

‘Costs’ and Opportunity Costs of Winning Proposals

Competitive proposals are an important revenue factor to hundreds of thousands of businesses. Indeed, contracts and grants from federal, state and local governments accounted for 12.1% of GDP in 2002; the amount competitively awarded equaled about 5.6% of GDP.[78] Reducing the fully-burdened costs of producing responses to competitive procurements and improving the rate of successfully obtaining them can be a huge competitive advantage to business.

Significant proportions of commercial projects and programs are likewise awarded through competitive proposals and bids. However, literature references to these are limited, and the remainder of this section relies on federal sector statistics as a proxy for the overall category.

Though the federal government is making strides in providing central clearinghouses to opportunities  – and is also doing much in moving to uniform application standards and electronic application submissions  – these efforts are still in their nascent stages and similar efforts at the state and local level are severely lagging. As a result, the magnitude of the proposal opportunity is perhaps largely unknown to many businesses. This lack of appreciation and attention to the cost- and success-drivers behind winning proposals is a real gap in the competitiveness of many individual businesses.

Table 14 on the following page consolidates information from many government sources to quantify the magnitude of this competitively-awarded grant and contract opportunity with governments.

Number of Awards

Amount ($000)

Federal Government
Total Grants

1,335,813

$441,037,633

[79] [80]
Total Contract Procurements

1,155,096

$327,413,076

Competitively-awarded Grants

336,091

$99,234,657

[81]
Competitively-awarded Procurements

909,087

$231,878,136

[82]
Total Competitive Opportunities

1,245,179

$331,112,793

Ave Competitive Opportunity

$266

[83]
State & Local Government [84] [85]
Total Grants

757,199

$190,000,000

Total Contract Procurements

1,439,031

$310,000,000

Competitively-awarded Grants

190,512

$42,750,512

[86]
Competitively-awarded Procurements

1,132,551

$219,545,972

Total Competitive Opportunities

1,323,063

$262,296,485

Ave Competitive Opportunity

$198

Total (no B-to-B)
Competitively-awarded Grants

526,603

$141,985,169

Competitively-awarded Procurements

2,041,638

$451,424,108

Total Competitive Opportunities

2,568,241

$593,409,277

Ave Competitive Opportunity

$231

Table 14. Federal, State & Local Contract and Grant Opportunities, 2002

This analysis suggests there are nearly $600 billion available each year for competitively awarded grants and procurements from all levels of government within the U.S.; about 60% from the federal sector. The average competitive award is about $270 K for grants; about $220 K for contract procurements.

Aside from construction firms (which are excluded in this and prior analyses), there are on the order of 92,500 federal contract-seeking firms today.[87] In 2003, the top 200 federal contracting firms accounted for nearly $190 billion in contract outlays.[88] While it is unclear what proportion of these commitments were competitive (81% of total federal commitments) or based on all contract procurements (57% of total federal commitments), it is clear that more than 90,000 firms are competing via a classic power curve for a minor portion of available federal revenues. This power curve is shown in Figure 3 below for the 200 largest federal contractors, which obtain a proportionately high percentage of all contract dollars.

Power curve distribution of Fedeeral contractors

Figure 3. Power Curve Distribution of Top 200 Federal Contractors by Revenue, 2002

The combination of these factors enables an estimate of the bottom-line proposal impacts by firm. This information is shown in the table below:

Number

Amount ($000)

Total Competitive Awards
Federal

1,245,179

$331,112,793

[89]
State & Local

1,323,063

$262,296,485

Number of Competing Firms

120,250

[90]
Number of Winning Firms

90,805

Number of Winning Proposals

2,326,485

Number of Submitted Proposals

11,211,974

Direct Proposal Preparation Costs
Winning Proposal Preparation

$5,021,357

[91]
Losing Proposals Preparation

$16,939,516

TOTAL Proposal Preparation

$21,960,873

Low

Med

High

Improvement in RFP Development

7.5%

15.0%

35.0%

[92]
Proposal Preparation
Benefits – Individual Submitters ($000)

$14

$27

$64

Benefits – All Submitters ($000)

$1,647,065

$3,294,131

$7,686,305

Proposal Success Benefits
Increase in Number of Winning Submissions

6,810

13,621

31,782

[93]
Increase in Number of Winning Firms

1,406

2,812

6,562

[94]
Benefits – Individual Submitters ($000)

$1,235

$1,235

$1,235

Benefits – All Submitters ($000)

$1,737,101

$3,474,203

$8,106,473

Benefits – All Submitters/All Aspects

$3,384,167

$6,768,334

$15,792,778

Table 15. Combined Preparation Costs and Opportunity Costs for Proposals

Across all entities, the annual cost of preparing proposals to competitive solicitations from government agencies at all levels is on the order of $22 billion, $5 billion for winning firms and $17 billion for losing firms. Better access to missing information and better information  – assuming no change in the underlying ideas or proposal-writing skills  – suggests that proposal response costs could be reduced by more than $3 billion annually. Another $3 billion annually is available for better winning of competitive proposals. Individual benefits to firms that respond to competitive solicitations is on average $1.25 million per competing firm.[95]

The more significant benefit to individual firms from improved access to “missing” information and better information is increasing the likelihood of winning a competitive award. Firms that embrace these practices are estimated to obtain a $1.2 million annual benefit. Given that many firms that have previously been losing awards have relatively low annual revenues, the percent impact on the bottom line can be quite striking due to improved proposal preparation information.

‘Costs’ of Regulation and Regulatory Non-compliance

A December 2001 small business poll by the National Federation of Independent Business (NFIB) gauged the impacts of the regulatory workload on firms. When asked “is government regulation a very serious, somewhat serious, not too serious, or not at all serious problem for your business,” nearly half, or 43.6 percent, answered “very serious” or “somewhat serious.” The respondents indicated the most serious regulatory problems were at the federal level (49 %), state level (35 %) or local level (13%) of government. The biggest single regulatory problem cited was extra paperwork, followed by difficulty understanding how to comply with regulations and dollars spent doing so.[96] A later December 2003 NFIB survey indicates that the average cost per hour of complying with paperwork requirements was $48.72.[97]

Type of Regulation

All Firms

<20 Employees

20-499 Employees

500+ Employees

All Federal Regulations

$5,107

$7,544

$4,671

$4,827

Environmental

$1,312

$3,600

$1,269

$776

Economic

$2,234

$1,748

$1,782

$2,688

Workplace

$843

$897

$944

$755

Tax Compliance

$719

$1,300

$676

$608

Table 16. Per Employee Costs of Federal Regulation by Firm Size, 2002

According to a 2001 report, “The Impact of Regulatory Costs on Small Firms” by W. Mark Crain and Thomas D. Hopkins, the total costs of Federal regulations were estimated to be $843 billion in 2000, or 8 percent of the U. S. Gross Domestic Product. Of these costs, $497 billion fell on business and $346 billion fell on consumers or other governments. Here are how those impacts are estimated on a per employee basis across a range of firm sizes:[98]

As of September 30, 2002, federal agencies estimated there were about 8.2 billion “burden hours” of paperwork government-wide. Almost 95 percent of those 8.2 billion hours were being collected primarily for the purpose of regulatory compliance. [99]

Burden Hrs (million)

Labor Costs ($M)

Total Government

8,223.17

$318,237

Total Gov (excl. Treasury)

1,472.74

$56,995

Treasury

6,750.43

$261,242

Transportation

244.73

$9,471

HHS

224.83

$8,701

Labor

189.22

$7,323

EPA

140.47

$5,436

Defense

92.36

$3,574

Agriculture

88.59

$3,428

Justice

46.60

$1,803

Education

38.44

$1,488

State

29.23

$1,131

HUD

21.93

$849

Commerce

11.65

$451

Interior

7.66

$296

Energy

3.76

$146

SEC

136.58

$5,286

FTC

69.66

$2,696

FCC

26.80

$1,037

SSA

24.89

$963

FAR (contracts)

24.49

$948

FCIC

9.87

$382

NRC

8.34

$323

FEMA

7.77

$301

Veterans Administration

7.31

$283

NASA

5.95

$230

NSF

4.46

$173

FERC

4.38

$170

SBA

2.77

$107

Table 17. Federal Government Paperwork Burdens, 2002[100]

A December 2003 NFIB survey indicates that the average cost per hour of complying with paperwork requirements was $48.72.[101] If these costs are substituted, the total cost burden in the table above would be about $400 billion, $71 billion of which excludes Treasury and the IRS.

Despite legislation requiring federal paperwork reduction and embracing of e-government initiatives, paperwork burdens continue to increase. Total burden hours in 2002, for example, increased 600 million hours, or about 4 percent, from the previous year. The Code of Federal Regulations (CFR) continues to expand despite efforts to curtail further growth. The CFR grew from 71,000 pages in 1975 to 135,000 pages in 1998. Annually, there are more than 4,000 regulatory changes introduced by the federal government. The federal government now has over 8,000 separate information collection requests authorized by OMB.[102]

Federal Source

Fines ($ 000)

Internal Revenue Service

$4,119,622

[103]

Corporate Income

$1,120,531

Employment Taxes

$2,691,021

Excise Taxes

$200,585

Other Taxes

$107,486

[104]
Agriculture

$2,000

Economic Stabilization

$9,000

Labor & Immigration

$72,000

Commerce & Customs (excl SEC)

$22,000

SEC

$101,000

[105]

Narcotics & Alcohol

$2,000

Mine Safety

$18,000

Environmental Protection

$212,000

[106]

Miscellaneous

$1,000

Other

$448,000

TOTAL

$5,006,622

Table 18. Federal Fines and Penalties to Corporations, 2002

Another source of costs to enterprises are civil penalties and fines for non-compliance with existing regulations, as shown in the table above for 2002 by agency. A total of $5 billion annually is expended by U.S. businesses for civil penalties due to non-compliance with federal regulation, $1 billion of which is due to non-tax purposes.

However, these estimates may undercount actual fines and penalties levied by the federal government due to the accounting basis of the OMB source. For example, the Department of Labor (DOL) collected fines and penalties totaling $175 million from employers in fiscal year 2002 for Fair Labor Standards Act (FLSA) violations.[107] According to a 2002 report, since 1990, 43 of the government’s top contractors paid approximately $3.4 billion in fines/penalties, restitution, and settlements.[108] And, according to another report, the corporations liable to the top 100 False Claims Act paid more than $12 billion since 1986.[109] Since there is no central clearinghouse for this information, with both individual agency general counsels and the Department of Justice responsible for actual collections, the figures in Table 18 should be interpreted as estimates.

Table 19 on the next page consolidates the information in Table 16 to Table 18 to estimate the overall regulatory and paperwork burdens on U.S. businesses, plus estimates of the benefits to be gained from better document access and use.

‘Cost’ of an Unauthorized Posted Document

Unauthorized information disclosures derive mainly from within an organization. The ease of electronic record duplication and dissemination  – particularly through postings on enterprise Web sites  – increases a firm’s vulnerability to this problem. Records mutate and propagate in poorly controlled environments. On average, unauthorized disclosure of confidential information costs Fortune 1000 companies about $15 million per company per year.[110]

A few privacy laws demonstrate the potential liabilities associated with disclosure of confidential information due to inadvertent mistakes or disgruntled employees. As one example, the Health Insurance Portability and Accountability Act (HIPAA) of 1996 sets security standards protecting the confidentiality and integrity of “individually identifiable health information,” past, present or future. Failure to comply with any of the electronic data, security, or privacy standards can result in civil monetary penalties up to $25,000 per standard per year. Violation of the privacy regulations for commercial or malicious purposes can result in criminal penalties of $50,000 to $250,000 in fines and one to ten years of imprisonment.[111]

Amount ($000)

Total Federal Paperwork Burden (non-tax)

$56,995,038

[112]
Total Federal Other Regulatory Burden

$331,791,551

[113]
Total Federal Fines and Penalties

$5,006,622

[114]
Total State and Local Paperwork Burden (non-tax)

$32,059,709

[115]
Total State and Local Other Regulatory Burden

$186,632,748

Total State and Local Fines and Penalties

$2,816,225

Low

Med

High

Improvements Due to Better Information

7.5%

15.0%

35.0%

Paperwork Burdens (non-tax)
Benefits per Large Firm

$1,957

$3,915

$9,134

[116]
Benefits – All Firms

$6,679,106

$13,358,212

$31,169,161

Other Regulatory Burdens
Benefits per Large Firm

$11,394

$22,788

$53,172

Benefits – All Firms

$38,881,822

$77,763,645

$181,448,505

Reductions in Fines and Penalties
Benefits per Large Firm

$4,212

$8,424

$19,655

Benefits – All Firms

$14,372,953

$28,745,905

$67,073,779

TOTAL – All Regulatory Burdens
Benefits per Large Firm

$17,563

$35,126

$81,962

Benefits – All Firms

$59,933,881

$119,867,762

$279,691,445

Table 19. Regulatory Burden and Benefits to Firms from Improved Information

As another example, the Gramm-Leach-Bliley Act (GLBA) of 1999 mandates the financial industry to create guidelines for the safeguarding of customer information. GLBA includes severe civil and criminal penalties for non-compliance, with civil penalties up to $100,000 for each violation and key officers may be fined up to $10,000 per violation. Violation of the GLBA can also carry hefty sanctions, including termination of FDIC insurance and fines of up to $1,000,000 for an individual or one percent of the total assets of the financial institution.[117]

Other major areas of unauthorized disclosure liability occur in national security, identity theft, and commerce, tax and Social Security information. Indeed, virtually every state and federal agency related to a company’s business has policies and fines regarding unauthorized disclosures. Monitoring these requirements is thus an imperative for enterprise management to prevent exposure to fines and loss of reputation.

On a less-quantifiable basis there are also risks about the clarity of the enterprise message to customers, suppliers and partners. Unmanaged Web sprawl is a critical hole for enterprises to ensure compliance with privacy and confidentiality regulations, and to promote clarity of message and accuracy to stakeholders.

V. CONCLUSIONS

Prior to the analysis in this white paper, the state of understanding about the value of document assets had been abysmal. While still preliminary and subject to much improvement, this study has nonetheless found:

  • The value of documents  – in their creation, access and use  – can indeed be measured
  • The information contained within U.S. enterprise documents represents about a third of gross domestic product, or an amount of about $3.3 trillion annually
  • Some 25% of all of these expenditures lend themselves to actionable improvements
  • There are perhaps on the order of 10 billion documents created annually in the U.S.
  • Corporate data doubles every six to eight months; 85% of this data is contained in documents
  • Ninety to 97 percent of enterprises cannot estimate how much they spend on producing documents each year
  • Document creation is about 2-3 times more important  – from an embedded cost standpoint  – than document handling
  • It costs, on average, $350 to create a ‘typical’ document
  • The total potential benefit from practical improvements in document access and use to the U.S economy is on the order of $800 billion annually, or about 8% of GDP
  • For the 1,000 largest U.S. firms, benefits from these improvements can approach nearly $250 million annually per firm
  • About three-quarters of these benefits arise from not re-creating the intellectual capital already invested in prior document creation
  • Another 25% of the benefits are due to reduced regulatory non-compliance or paperwork, or better competitiveness in obtaining solicited contracts and grants
  • $33 billion is wasted each year in re-finding previously found Web documents
  • Paperwork and regulatory improvements due to documents can save U.S. enterprises $120 billion each year
  • Lack of document access due to Web sprawl costs U.S. enterprises $22 billion each year
  • $8 billion in annual benefits is available due to document improvements for competitive governmental grant and contract solicitations
  • These figures likely severely underestimate the benefits to enterprises from improved competitiveness, a factor not analyzed in this study
  • Documents are now at the point where structured data was at 15 years ago at the nascent emergence of the data warehousing market.

As noted throughout, there is a considerable need for additional research and data on document creation, use, costs and benefits. Additional technical endnotes are provided in the PDF version of the full paper.


[1] All sources and assumptions are fully documented in footnotes in the main body of this white paper; general assumptions used in multiple tables are provided in the Technical Endnotes.

[2] As quoted by Armando Garcia, vice president of content management at IBM; see http://www.contentworld.com/conference/conthur.html

[3] Delphi Group, “Taxonomy & Content Classification Market Milestone Report,” Delphi Group White Paper, 2002. See http://delphigroup.com.

[4] Based on the 1999 to 2001 estimate changes in reference 34, Table 2-6.

[5] As initially published in Inc Magazine in 1993. Reference to this document may be found at: http://www.contingencyplanning.com/PastIssues/marapr2001/6.asp

[6] J. Snowdon, Documents The Lifeblood of Your Business?, October 2003, 12 pp. The white paper may be found at: http://www.mdy.com/News&Events/Newsletter/IDCDocMgmt.pdf

[7] Xerox Global Services, Documents – An Opportunity for Cost Control and Business Transformation, 28 pp., 2003. The findings may be found at: http://www.sap.com/solutions/srm/pdf/CCS_Xerox.pdf

[8] A.T. Kearney, Network Publishing: Creating Value Through Digital Content, A.T. Kearney White Paper, April 2001, 32 pp. See http://www.adobe.com/aboutadobe/pressroom/pressmaterials/networkpublishing/pdfs/netpubwh.pdf.

[9] S.A. Mohrman and D.L. Finegold, Strategies for the Knowledge Economy: From Rhetoric to Reality, 2000,http://www.marshall.usc.edu/ceo/Books/pdf/knowledge_economy.pdf. University of Southern California study as supported by Korn/Ferry International, January 2000, 43 pp. See

[10] C. Moore, TheContent Integration Imperative, Forrester Research Trends Report, March 26, 2004, 14 pp.

[11] D. Vesset, Worldwide Business Intelligence Forecast and Anal ysis, 2003-2007, International Data Corporation, June 2003, 18 pp. See http://www.dwway.com/file/20030708085453_IDC_WW-BIFORECASTANDANALYSIS2003-07_JUN03.pdf.

[12] M. Stonebraker and J. Hellerstein, “Content Integration for E-Business,” in ACM SIGMOD Proceedings, Santa Barbara, CA, pp. 552-560, May 2001.

[13] P. Lyman and H. Varian, “How Much Information, 2003,” retrieved from http://www.sims.berkeley.edu/how-much-info-2003 on December 1, 2003.

[14] U.S. Department of Commerce, Digital Economy 2003, Economic Statistics Administration, U.S. Dept. of Commerce, Washington, D.C., April 2004, 155 pp. See http://www.esa.doc.gov/DigitalEconomy2003.cfm.

[15] U.S. Department of Labor, “Occupation Employment and Wages, 2002,” Bureau of Labor Statistics. See http://www.bls.gov/news.release/archives/ocwage_11192003.pdf.

[16] U.S. Census Bureau, “Statistics of U.S. Businesses 2001.” See http://www.census.gov/epcd/susb/2001/us/US–.htm.

[17] Total office documents counts were obtained on a page basis from reference 13, which used a value of 2% for what documents deserve to be archived. This formed the ‘lo’ case, with the high case using a 5% estimate (lower still than the ENST 10% estimated cited in reference 13). Total pages were converted to numbers of documents on an average 8 pp per document basis; see Technical Endnotes for further discussion.

[18] See Technical Endnotes for the derivation of knowledge worker estimates.

[19] See Technical Endnotes for the derivation of content worker estimates.

[20] Citation sources and assumptions for this analysis are presented in the BrightPlanet white paper, “A Cure to IT Indigestion: Deep Content Federation,” BrightPlanet Corporation White Paper, June 2004, 31 pp.

[21] The “bottom up” cases are built from the number of assumed knowledge workers in Table 3. The “low” and “high” variants are based on a 5% archival value or 350 annual documents created per worker, respectively, applied to worker staff costs associated with document creation. The “Coopers & Lybrand” case is a strict updating of that study to 2002. The other two “C&L” cases use the updated per document costs from the C&L study; the first variant uses the annual documents created from the UC Berkeley study without archiving; the second variant uses the average of the “low” and “high” document numbers. See further Technical Endnotes for other key assumptions.

[22] The individual values in Table 5 range from about $140 to $740 per document, with the update of the Coopers & Lybrand study being about $270. Separate Delphi analysis by BrightPlanet has shown median values of about $550 per document.

[23] See http:// www.eds.com/services_offerings/ibill_openbill_b2b.shtml

[24] See http://www.hsh.com/cfee-sample.html.

[25] See http://www.atp.nist.gov/eao/applicants/section9.htm.

[26] As initially published in Inc Magazine in 1993. Reference to this document may be found at: http://www.contingencyplanning.com/PastIssues/marapr2001/6.asp

[27] Xerox Global Services, Documents – An Opportunity for Cost Control and Business Transformation, 28 pp., 2003. The findings may be found at: http://www.sap.com/solutions/srm/pdf/CCS_Xerox.pdf and J. Snowdon, Documents  – The Lifeblood of Your Business?, October 2003, 12 pp. The white paper may be found at: http://www.mdy.com/News&Events/Newsletter/IDCDocMgmt.pdf

[28] Optika Corporation. See http://www.optika.com/ROI/calculator/ROI_roiresults.cfm.

[29] Cap Ventures information, as cited in ZyLAB Technologies B.V., “Know the Cost of Filing Your Paper Documents,” Zylab White Paper, 2001. See http://www.zylab.com/downloads/whitepapers/PDF/21%20-%20Know%20the%20cost%20of%20filing%20your%20paper%20documents.pdf.

[30] ALL Associates Group, Inc., EDAM Sector Summary, April 2003, 2 pp.

[31] ALL Associates Group, 2002 EDAM Metrics for Major U.S. Companies.

[32] By the second Q 2004, this amount was $11.6 trillion. U.S. Federal Reserve Board, Flow of Funds Accounts for the United States, Sept. 16, 2004. See http://www.federalreserve.gov/releases/Z1/current/accessible/f6.htm.

[33] The bases for this table have the following assumptions: 1) the three cases for document handling are based on 5%, 10% and 15% of total enterprise revenues, per the earlier section; 2) the three cases for document creation are based on the ‘C&L Bottom-Up’, ‘Bottom-up  – High,’ and ‘Coopers & Lybrand’ items for the Low, Medium, and High columns, respectively, in Table 5; and 3) the document misfiling case draws on the same basis but using the total document estimates and misfiled percentages of 5%, 7.5% and 9% consistent with the previous discussion section. See further the Technical Endnotes.

[34] P. Lyman and H. Varian, “How Much Information, 2003,” retrieved from http://www.sims.berkeley.edu/how-much-info-2003 on December 1, 2003.

[35] Cap Ventures information, as cited in ZyLAB Technologies B.V., “Know the Cost of Filing Your Paper Documents,” Zylab White Paper, 2001. See http://www.zylab.com/downloads/whitepapers/PDF/21%20-%20Know%20the%20cost%20of%20filing%20your%20paper%20documents.pdf.

[36] As reported in http://www.hoovers.com/company/archive/detail/0,2049,7_2322,00.html.

[37] See http://www.veronissuhler.com/businfo/segment.html, August 2, 2000.

[38] See http://www.outsellinc.com/docs/pr_release/pr20000602_01.htm, June 2, 2000.

[39] See http://www.outsellinc.com/docs/pr_release/pr20000629_01.htm.

[40] M.K. Bergman, “The Deep Web: Surfacing Hidden Value,” BrightPlanet Corporation White Paper, June 2000. The most recent version of the study was published by the University of Michigan’s Journal of Electronic Publishing in July 2001. See http://www.press.umich.edu/jep/07-01/bergman.html.

[41] This analysis assumes there were 1 million documents on the Web as of mid-1994.

[42] See, for example, C. Sherman and G. Price, The Invisible Web, Information Today, Inc., Medford, NJ, 2001, 439 pp., and P. Pedley, The Invisible Web: Searching the Hidden Parts of the Internet, Aslib-IMI, London, 2001, 138pp.

[43] iProspect Corporation, iProspect Search Engine User Attitudes, April/May 2004, 28 pp. See http://www.iprospect.com/premiumPDFs/iProspectSurveyComplete.pdf.

[44] As reported at http://www.nua.ie/surveys/index.cgi?f=VS&art_id=905358569&rel=true.

[45] Delphi Group, “Taxonomy & Content Classification Market Milestone Report,” Delphi Group White Paper, 2002. See http://delphigroup.com.

[46] C. Sherman and S. Feldman, “The High Cost of Not Finding Information,” International Data Corporation Report #29127, 11 pp., April 2003.

[47] M.E.D. Koenig, “Time Saved  – a Misleading Justification for KM,” KMWorld Magazine, Vol 11, Issue 5, May 2002. See http://www.kmworld.com/publications/magazine/index.cfm.

[48] G. Xu, A. Cockburn and B. McKenzie, Lost on the Web: An Introduction to Web Navigation Research, http://www.cosc.canterbury.ac.nzq/ACMchapterq/NZCSPGq/papers.

[49] A. Cockburn and B. McKenzie, What Do Web Users Do? An Empirical Analysis of Web Use, 2000. See http://citeseer.ist.psu.edu/cockburn00what.html.

[50] Tenth edition of GVU’s (graphics, visualization and usability} WWW User Survey, May 14, 1999. See http://www.gvu.gatech.edu/user_surveys/survey-1998-10/tenthreport.html.

[51] C. Alvarado, J. Teevan, M. S. Ackerman and D.Karger, “Surviving the Information Explosion: How People Find Their Electronic Information,” AI Memo 2003-06, April 2003, 11 pp.., Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory. See ftp://publications.ai.mit.edu/ai-publications/2003/AIM-2003-006.pdf.

[52] W. Jones, H. Bruce and S. Dumais, “Keeping Found Things Found on the Web,” See http://washington.edu/KFTF_Web.pdf.

[53] J. Teevan, “How People Re-find Information When the Web Changes,” AI Memo 2004-014, June 2004, 10 pp., Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory. See ftp://publications.ai.mit.edu/ai-publications/2004/AIM-2004-012.pdf.

[54] Library of Congress, “Preserving Our Digital Heritage: Plan for the National Digital Information Infrastructure and Preservation Program”, a Report to Congress by the U.S. Library of Congress, 2002, 66 pp. See http://www.digitalpreservation.gov/ndiipp/.

[55] Consistent with Table 8; this analysis also assumes the 25% search time commitment by employee and previous values from earlier tables.

[56] All subsequent references to ‘Large’ firms is based on the last column in Table 2, namely the 930 U.S. firms with more than 10,000 employees.

[57] Delphi Group, “Taxonomy & Content Classification Market Milestone Report,” Delphi Group White Paper, 2002. See http://delphigroup.com.

[58] S. Stearns, “Realize the Value Locked in Your Content Silos Without Breaking the Bank: Automated Classification Tools to Improve Information Discovery,” Inmagic White Paper, version 1.0, 2004. 10 pp. See http://www.inmagic.com.

[59] P. Sonderegger, “Weave Search into the Browsing Experience,” ForresterQuick Take, Forrester Research, Inc., Feb. 18, 2004. 2 pp.

[60] P. Russom, “An Eye for the Needle,” Intelligent Enterprise, January 14, 2002. See http://www.iemagazine.com/020114/502feat2_1.

[61] This average was estimated by interpolating figures shown on Figure 8 in reference 68.

[62] This average was estimated by interpolating figures shown on the p.14 figure in Plumtree Corporation, “The Corporate Portal Market in 2002,” Plumtree Corp. White Paper, 27 pp. See http://www.plumtree.com/pdf/Corporate_Portal_Survey_White_Paper_February2002.pdf.

[63] The ‘low’ case represents the archival value in the middle bars with the addition that 30% of internal documents generated in the current year have a value to be shared for one year; the ‘high’ case represents the related archival value in the middle bars but with 40% of documents generated in that year having a value to be shared for one year.

[64] Analysis based on reference 68, with interpolations from Figure 16.

[65] M. Corcoran, “When Worlds Collide: Who Really Owns the Content,” AIIM Conference, New York, NY, March 10, 2004. See http://show.aiimexpo.com/convdata/aiim2003/brochures/64CorcoranMary.pdf.

[66] C. Phillips, “Stemming the Software Spending Spree,” Optimize Magazine, April 2002, Issue 6. See http://www.optimizemag.com/article/showArticle.jhtml?articleId=17700698&pgno=1.

[67] C. Moore, “The Content Integration Imperative,” Forrester Research, Inc., March 26, 2004, 14 pp.

[68] Plumtree Corporation, “The Corporate Portal Market in 2003,” Plumtree Corp. White Paper, 30 pp. See http://www.plumtree.com/portalmarket2003/default.asp.

[69] BEA Corporation, “Enterprise Portal Rationalization,” BEA Technical White Paper, 23 pp., 2004. See http://www.bea.com/content/news_events/white_papers/BEA_epr_wp.pdf.

[70] A. Aneja, C.Rowan and B. Brooksby, “Corporate Portal Framework for Transforming Content Chaos on Intranets,” Intel Technology Journal Q1, 2000. See http://developer.intel.com/technology/itj/q12000/pdf/portal.pdf.

[71] J. Smeaton, “IBM’s Own Intranet: Saving Big Blue Millions,” Intranet Journal, Sept. 25, 2002. See http://www.intranetjournal.com/articles/200209/ij_09_25_02a.html.

[72] See http://www.wookieweb.com/Intranet/.

[73] D. Voth, “Why Enterprise Portals are the Next Big Thing,” LTI Magazine, October 1, 2002. See http://www.ltimagazine.com/ltimagazine/article/articleDetail.jsp?id=36877.

[74] A. Nyberg, “Is Everybody Happy?” CFO Magazine, November 01, 2002. See http://www.cfo.com/article/1%2C5309%2C8062%2C00.html.

[75] See http://www.proudfoot-plc.com/pdf_20004-USPR1002Avayaweb.asp.

[76] Wall Street Journal, May 4, 2004, p. B1.

[77] pers. comm.., Jonathon Houk, Director of DHS IIAP Program, November 2003.

[78] These figures are based on Table 12 and the GDP figures from reference 32. Note, the analysis in this section also ignores business-to-business opportunities, which are also likely significant.

[79] Total grant and procurement amounts are derived from the U.S. Census Bureau, Consolidated Federal Funds Report (CFFR). See http://harvester.census.gov/cffr/asp/Reports.asp.

[80] The number of awards and an analysis of which line items are competitively awarded was derived from the U.S. Census Bureau, Federal Assistance Award Data System (FAADS). See http://www.census.gov/govs/faads/021sumus.htm.

[81] Specific categories of grants were analyzed based on the U.S. General Services Administration’s Catalog of Federal Domestic Assistance (CFDA) definitions to determine degree of competitiveness; see http://12.46.245.173/cfda/cfda.html. Figures from the U.S. Department of Health and Human Services, Grant.gov Clearinghouse (see http://www.grants.gov/) suggest that $350 billion in federal grants is available, but many of the specific grant opportunities are geared to state governments or individuals. That is why the figures shown indicate only $100 billion in competitive opportunities available directly to enterprises.

[82] U.S. General Services Administration, Federal Procurement Data System  – NG (FY 2003 data); see http://www.fpdc.gov/fpdc/FPR2003a.pdf and http://www.fpdc.gov/fpdc/FPR2003c.pdf. These sources are also the reference for the number of actions or successful awards. Due to discrepancies, these amounts were adjusted to conform with the totals in reference 79.

[83] Average competitive opportunities are derived by dividing the total award amount by category by the number of awards for that category.

[84] See http://www.gcswin.com/opportunities/opp2.htm. This is the only summary reference for state and local information found. Splits between grants and contract procurements were adjusted based on the assumption that contract amounts differed at the non-federal level. Thus, while the split for grant-contract procurements in the federal sector is about 58%-42% in the federal sector, it is assumed to be 38%-62% at the state and local level.

[85] There may also be some double counting of state amounts due to transfers from the federal government. For example, in 2002, $360,534 million in direct transfers was made to states and localities from the federal government. U.S. Census Bureau, State and Local Government Finances by Level of Government and by State: 2001  – 02. See http://www.census.gov/govs/estimate/0200ussl_1.html.

[86] This analysis assumes that individual grant and contract awards are 80% of the amount shown at the federal level.

[87] To be listed requires a minimum of $10,000 in federal contracts; see http://clinton2.nara.gov/WH/EOP/OP/html/aa/aa06.html.

[88] See http://www.govexec.com/features/0804-15/0804-15s1s1.htm.

[89] This header information is drawn from Table 12.

[90] Number of competing firms is increased from the federal contractor baseline by a factor of 1.30 to account for new state and local government contractors.

[91] Winning and losing proposal preparation costs are based on the empirical percentages from NIST (see reference 93), namely 0.85% and 0.59%, respectively, as a percent of total award amounts.

[92] The ‘Low’ basis for improvements is based on the finding of missing information discussed in a previous section; the ‘High” basis reflects the difference between lowest quartile and highest quartile efforts spent on successful proposal preparation (see reference 93). The ‘Med’ basis is an intermediate value between these two.

[93] The increase in winning submissions is calculated based on numbers of winning proposals times the RFP improvement factor. In fact, because all things being equal the pool of contract dollars does not change, this amount merely represents a shift of winning awards from existing winners to new winners. In other words, total contracts amounts are a zero-sum game with proposal improvements by previous losers taken from the pool of previous winners.

[94] The analysis in Figure 2 indicates there is a power curve distribution of awards. The number of new winning proposals was applied to this curve to estimate the actual number of new firms winning awards; see Figure 2 for the power-curve fitting equation.

[95] Of course, better probabilities of winning competitive solicitations are a zero-sum game. New winners displace old winners. The real advantage in this arena is to individual firms that better succeed at securing the existing pool of competitive funds. The benefits to individual companies can be the difference between profitability, indeed survival.

[96] NFIB, Coping with Regulation, NFIB National Small Business Poll, Vol. 1, Issue 5. See http://www.nfib.com/object/3105105.html.

[97] NFIB, Paperwork and Record-keeping, NFIB National Small Business Poll, Vol. 3, Issue 5. See http://www.nfib.com/object/4131277.html.

[98] W. M. Crain & T. D. Hopkins, “The Impact of Regulatory Costs on Small Firms”, Report to the Small Business Administration, RFP No. SBAHQ-00-R-0027 (2001). The report’s 2000 year basis was updated to 2002 based on a 4% annual inflation factor.

[99] U.S. General Accounting Office, Paperwork Reduction Act: Record Increase in Agencies’ Burden Estimates, testimony of V. S. Rezendes, before the Subcommittee on Energy, Policy, Natural Resources and Regulatory Affairs, Committee on Government Reform, House of Representatives, April 11, 2003. See http://www.reform.house.gov/UploadedFiles/Testimony_GAO_Revised.pdf.

[100] Office of Management and Budget, Managing Information Collection and Dissemination, Fiscal Year 2003, 198 pp. (Table A1). See http://www.whitehouse.gov/omb/inforeg/2003_info_coll_dism.pdf.

[101] NFIB, Paperwork and Record-keeping, NFIB National Small Business Poll, Vol. 3, Issue 5. See http://www.nfib.com/object/4131277.html.

[102]U.S. Small Business Administration, Final Report of the Small Business Paperwork Relief Task Force, June 27, 2003, 64 pp. See http://www.sbaonline.sba.gov/advo/laws/final_paperwork03.pdf.

[103] IRS, Civil Penalties Assessed and Abated, by Type of Penalty and Type of Tax (Table 26), September 20, 2002. See http://www.irs.gov/pub/irs-soi/02db26cp.xls.

[104] Except as footnoted, the figures below are drawn from the OMB Public Budget Tables. Civil penalties for crime victims have been excluded from these figures. See http://www.whitehouse.gov/omb/budget/fy2005/db.html.

[105] Obtained orders in SEC judicial and administrative proceedings requiring securities law violators to disgorge illegal profits of approximately $1.293 billion. Civil penalties ordered in SEC proceedings totaled approximately $101 million. See SEC http://www.sec.gov/pdf/annrep02/ar02enforce.pdf.

[106] T. L. Sansonetti, U.S. Department of Justice, testimony before the House Committee on the Judiciary, Subcommittee on Commercial and Administrative Law, March 9, 2004. See http://www.house.gov/judiciary/sansonetti030904.htm.

[107]Argy, Wiltse & Robinson, Business Insights, Summer 2003, 4 pp. See http://www.awr.com/news_let/Argy%20Summer%202003.pdf

[108] Project on Government Oversight, Federal Contractor Misconduct: Failures of the Suspension and Debarment System, revised May 10, 2002. See http://www.pogo.org/p/contracts/co-020505-contractors.html.

[109]Corporate Crime Reporter, Top 100 False Claims Act Settlements, December 30, 2003, 64 pp. See http://www.corporatecrimereporter.com/fraudrep.pdf.

[110] According to Alchemia Corporation testimony citing a Price Waterhouse Coopers study, FDA Hearing, Jan. 17, 2002. See http://www.fda.gov/ohrms/dockets/dockets/ 00d1538/00d-1538_mm00023_01_vol7.doc.

[111] For example, see http://www.medschool.ucsf.edu/curriculum/clinical/guide/section2/confidentiality.asp.

[112] From Table 17.

[113] From Table 16 after adjusting by total number of employees for all firms as shown on Table 2, and removal of total burdens as shown in Table 17.

[114] From Table 18.

[115] All ‘State and Local’ items are based on the ratio of state and local budgets in relation to the federal budget, excluding direct federal transfers, and applied to those factors for the federal sector. This ratio is 0.563. See http://www.gpoaccess.gov/usbudget/fy01/guide01.html.

[116] All ‘Large Firm’ estimates are based on the ratio of large firm documents to total firm documents; see Table 2.

[117] For example, see http://www.nfr.com/why/mandates.php#gramm

Posted on March 12, 2010 at 1:43 pm in Adaptive Information, Brown Bag Lunch, Document Assets, Information Automation | Comments (0)
The URI link reference to this post is: http://www.mkbergman.com/871/brown-bag-lunch-untapped-assets-the-3-trillion-value-of-us-documents/
The URI to trackback this post is: http://www.mkbergman.com/871/brown-bag-lunch-untapped-assets-the-3-trillion-value-of-us-documents/trackback/
Date:   March 9, 2010

Citizen DAN LogoHuzzah! for Local Government Open Data, Transparency, Community Indicators and Citizen Journalism

While the Knight News Challenge is still working its way through the screening details, Structured Dynamics Citizen DAN proposal remains in the hunt. Listen to this:

To date, we have been the most viewed proposal by far (2x more than the second most viewed!!! Hooray!) and are in the top five of highest rated (have also been at #1 or #2, depending. Hooray!). Thanks to all of you for your interest and support.

There is much to recommend this KNC approach, not the least of which being able to attract some 2,500 proposals seeking a piece of the 2010 $5 million potential grant awards. Our proposal extends SD’s basic structWSF and conStruct Drupal frameworks to provide a data appliance and network (DAN) to support citizen journalists with data and analysis at the local, community level.

None of our rankings, of course, guarantees anything. But, we also feel good about how the market is looking at these frameworks. We have recently been awarded some pretty exciting and related contracts. Any and all of these initiatives will continue to contribute to the open source Citizen DAN vision.

And, what might that vision be? Well, after some weeks away from it, I read again our online submission to the Knight News Challenge. I have to say: It ain’t too bad! (Plus many supporting goodies and details.)

So, I repeat in its entirety below, the KNC questions and our formal responses. This information from our original submittal is unchanged, except to add some live links where they could not be submitted as such before. (BTW, the bold headers are the KNC questions.) Eventual winners are slated to be announced around mid-June. We’re keeping our fingers crossed, but we are pursuing this initiative in any case.


Describe your project:

Citizen DAN is an open source framework to leverage relevant local data for citizen journalists. It is a:

  • Appliance for filtering and analyzing data specific to local community indicators
  • Means to visualize local data over time or by neighborhood
  • Meeting place for the public to upload and share local data and information
  • Web data portal that can be individually tailored by any local community
  • Node in a global network of communities across which to compare indicators of community well-being.

Good decisions and good journalism require good information. Starting with pre-loaded government data, Citizen DAN provides any citizen the framework to learn and compare local statistics and data with other similar communities. This helps to promote the grist for citizen journalism; it is also a vehicle for discovery and learning across the community.

Citizen DAN comes pre-packaged with all necessary deployment components and documentation, including local data from government sources. It includes facilities for direct upload of additional local data in formats from spreadsheets to standard databases. Many standard converters are included with the basic package.

Citizen DAN may be implemented by local governments or by community advocacy groups. When deployed, using its clear documentation, sponsors may choose whether or what portions of local data are exposed to the broader Citizen DAN network. Data exposed on the network is automatically available to any other network community for comparison and analysis purposes.

This data appliance and network (DAN) is multi-lingual. It will be tested in three cities in Canada and the US, showing its multi-lingual capabilities in English, Spanish and French.

How will your project improve the way news and information are delivered to geographic communities?

With Citizen DAN, anyone with Web access can now get, slice, and dice information about how their community is doing and how it compares to other communities. We have learned from Web 2.0 and user-generated content that once exposed, useful information can be taken and analyzed in valuable and unanticipated ways.

The trick is to get information that already exists. Citizen journalists of the past may not have either known:

  1. Where to find relevant information, or
  2. How to ‘slice-and-dice’ that information to extract meaningful nuggets.

By removing these hurdles, Citizen DAN improves the ways information is delivered to communities and provides the framework for sifting through it to extract meaning.

How is your idea innovative? (new or different from what already exists)

Government public data in electronic tabular form or as published listings or tables in local newspapers has been available for some time. While meeting strict ‘disclosure’ requirements, this information has neither been readily analyzable nor actionable.

The meaning of information lies in its interpretation and analysis.

Citizen DAN is innovative because it:

  1. Is a platform for accessing and exposing available community data
  2. Provides powerful Web-based tools for drilling down and mining data
  3. Changes the game via public-provided data, and
  4. Packages Citizen DAN in a Web framework that is available to any local citizen and requires no expertise other than clicking links.

What experience do you or your organization have to successfully develop this project?

Structured Dynamics has already developed and released as open-source code structWSF and conStruct , the basic foundations to this proposal. structWSF provides the network and dataset “backbone” to this proposal; conStruct provides the Drupal portal and Web site framework.

To this foundation we add proven experience and knowledge of datasets and how to access them, as well as tools and converters for how to stage them for standard public use. A key expertise of Structured Dynamics is the conversion of virtually any legacy data format into interoperable canonical forms.

These are important challenges, which require experience in the semantics of data and mapping from varied forms into useful and common frameworks. Structured Dynamics has codified its expertise in these areas into the software underlying Citizen DAN.

Structured Dynamics’ principals are also multi-lingual, with language-neutral architectures and code. The company’s principals are also some of the most prominent bloggers and writers in the semantic Web. We are acknowledged as attentive to documentation and communication.

Finally, Structured Dynamics’ principals have more than a decade of track record in successful data access and mining, and software and venture development.

To this strong basis, we have preliminary city commitments for deploying this project in the United States (English and Spanish) and Canada (French and English).

What unmet need does your proposal answer?

ThisWeKnow offers local Census data, but no community or publishing aspects. Data sharing is in DataSF and DataMine (NYC), but they lack collaboration, community networks and comparisons, or powerful data visualization or mapping.

Citizen DAN is a turnkey platform for any size community to create, publish, search, browse, slice-and-dice, visualize or compare indicators of community well-being. Its use makes the Web more locally focused. With it, researchers, watchdog groups, reporters, local officials and interested citizens can now discover hard data for ‘new news’ or fact-check mainstream media.

What tasks/benchmarks need to be accomplished to develop your project and by when will you complete them?

There are two releases with feedback. Each task summary, listing of task hours (hr) and duration in months (mo), in rough sequence order with overlaps, is:

  1. Dataset Prep/Staging: identify, load and stage baseline datasets; provide means for aggregating data at different levels; 420 hr; 2.5 mo
  2. Refine Data Input Facility: feature to upload other external data, incl direct from local sources; XML, spreadsheet, JSON forms; dataset metadata; 280 hr; 3 mo
  3. Add Data Visualization Component: Flex mapping/data visualization (charts, graphs) using any slice-and-dice; 390 hr; 3 mo
  4. Make Multi-linguality Changes: English, French, Spanish versions; 220 hr; 2 mo
  5. Refine User Interface: update existing interface in faceted browse; filter; search; record create, manage and update; imports; exports; and user access rights; 380 hr; 3 mo
  6. Standard Citizen DAN Ontologies: the coherent schema for the data; 140 hr; 3 mo
  7. Create Central Portal: distribution and promotion site for project; 120 hr; 2 mo
  8. Deploy/Test First Release: release by end of Mo 5 @ 3 test sites; 300 hr; 4 mo
  9. Revise Based on Feedback: bug fixing and 4 mo testing/feedback, then revision #2; 420 hr
  10. Package/Document: component packaging for easier installs; increased documentation; 310 hr; 2 mo
  11. Marketing/Awareness: see next question; 240 hr; 12 mo
  12. Project Management: standard PM/interact with test communities, partners; 220 hr; 12 mo.

See attached task details.

What will you have changed by the end of your project?

"Information is the currency of democracy." Thomas Jefferson (n.b.)

We intuitively understand that an informed citizenry is a healthy polity. At the global level and in 250 languages, we see how Wikipedia, matched with the Internet and inexpensive laptops, is bringing unforeseen information and enrichment to all. Across the board, we are seeing the democratization of information.

But very little of this revolution has percolated to the local level.

Only in the past decade or so have we seen free, electronic access to national Census data. We still see local data only published in print or not available at all, limiting both awareness but more importantly understanding and analysis. Data locked up in municipal computers or available but not expressed via crowdsourcing is as good as non-existent.

Though many citizens at the local level are not numeric, intuition has to tell us that the absense of empirical, local data hurts our ability to understand, reason and debate our local circumstances. Are we doing better or worse than yesterday? Than in comparison with our peers? Under what measures does this have meaning about community well being?

The purpose of the Citizen DAN project is to create an appliance — in the same sense of refrigerators keeping our food from spoiling — by which any citizen can crack open and expose relevant data at the local level. Citizen DAN is about enrichening our local information and keeping our communities healthy.

How will you measure progress and ultimately success?

We will measure the progress of the project by the number of communities and local organizations that use the Citizen DAN platform to create and publish community data. Subsidiary measures include the number of:

  • Individual users across all installations
  • Users contributing uploaded datasets
  • Contributed datasets
  • Contributed applications based on the platform
  • Interconnected sites in the network
  • Different Citizen DAN networks
  • Substantive articles and blog posts on Citizen DAN
  • Mentions of ‘Citizen DAN’ (and local naming or variants, which will be tracked) in news articles
  • Contributed blog posts on the central Citizen DAN portal
  • Software package downloads, and
  • Google citations and hits on ‘Citizen DAN’ (and prominent variants).

These measures, plus active sites with profiles of each, will be monitored and tracked on the central Citizen DAN portal.

‘Ultimate success’ is related to the general growth in transparent government at the local level. Growth in Citizen DAN-related measures on a year-over-year basis or in relation to Gov2.0 would indicate success.

Do you see any risk in the development of your project?

There is no technical risk to this proposal, but there are risks in scope, awareness and acceptance. Our system has been operational for one year for relevant use cases; all components have been integrated, debugged, and put into production.

Scope risks relate to how much data the Citizen DAN platform is loaded with, and how much functionality is included. We balance the data question by using common public datasets for baseline data, then add features for localities to “crowdsource” their own supplementary data. We balance the functionality question by limiting new development to data visualization/mapping and to upload functions (per above), and then to refine what already exists.

Awareness risks arise from a crowded attention space. We can overcome this in two ways. The first is to satisfy users at our test sites. That will result in good recommendations to help seed a snowball effect. The second way is to use social media and our existing Web outlets aggressively. We have been building awareness for our own properties in steady, inch-by-inch measures. While a notable few Web efforts may go viral, the process is not predictable. Steady, constant focus is our preferred recipe.

Acceptance risk is intimately linked with awareness and use. If we can satisfy each Citizen DAN community, then new datasets, new functionality and new awareness will naturally arise. More users and more contributions through the network effect are the best way to broad acceptance.

What is your marketing plan? How will people learn about what you are doing?

Marketing and awareness efforts will include our use of social media, dedicated Web sites, support from test communities, and outreach to relevant community Web sites.

Our own blogs are popular in the semantic Web and structured data space (~3K uniques daily); we have published two posts on Citizen DAN and will continue to do so with more frequency once the effort gets underway.

We will create a central portal (http://citizen-dan.org) based on the project software (akin to our other project sites). The model for this apps and deployments clearinghouse is CrimeReports.com. Using social aspects and crowdsourcing, the site will encourage sharing and best practices amongst the growing number of Citizen DAN communities.

We will blog and post announcements for key releases and milestones on relevant external Web sites including various Gov 2.0 sites, Community Indicators Consortium, GovLoop, Knight News Challenge, the Sunlight Foundation, and so forth. In addition, we will collate and track individual community efforts (maintained on the central Citizen DAN site) and make specific outreach to community data sites (such as DataSF or DataMine at NYC.gov). We will use Twitter (#CitizenDAN, etc) and the social networks of LinkedIn, Facebook, and Meetup to promote Citizen DAN activity.

We will interact with advocates of citizen journalism, and engage civic organizations, media, and government officials (esp in our three test communities) to refine our marketing plan.

Is this a one-time experiment or do you think it will continue after the grant?

Citizen DAN is not an experiment. It is a working framework that gives any locality and its citizenry the means to assemble, share and compare measures of its community well-being with other communities. These indicators, in turn, provide substance and grist for greater advocacy and writing and blogging (”journalism”) at the local level.

Granted, there are unknowns: How many localities will adopt the Citizen DAN appliance? How essential will its data be to local advocacy and news? How active will each Citizen DAN installation be in attracting contributions and local data?

We submit the better way to frame the question is the degree of adoption, as opposed to will it work.

Web-based changes in our society and social interaction are leading to the democratization of information, access to it, and channels for expression. Whether ultimately successful in the specific form proposed herein, Citizen DAN and its open source software and frameworks will surely be adopted in one form or another — to one degree or another — in the unassailable trend toward local government transparency and citizen involvement.

In short, Yes: We believe Citizen DAN will continue long after the grant.

If it is to be self-sustainable, what is the plan for making that happen?

Our plan begins with the nature of Citizen DAN as software and framework. Sustainability is a question of whether the appliance itself is useful, and how users choose to leverage it.

Mediawiki, the software behind Wikipedia, is an analog. Mediawiki is an enabling infrastructure. Some sites using it are not successful; others wildly so. Success has required the combination of a good appliance with topicality and good management. The same is true for Citizen DAN.

Our plan thus begins with Citizen DAN as a useful appliance, as free open source with great documentation and prominent initial use cases. Our plan continues with our commitment to the local citizen marketplace.

We are developing Citizen DAN because of current trends. We foresee many hundreds of communities adopting the system. Most will be able to do so on their own. Some others may require modifications or assistance. Our self-interest is to ensure a high level of adoption.

An era of citizen engagement is unfolding at the local level, fueled by Web technologies and growing comfort with crowdsourcing and social networks. Meanwhile, local government constraints and pressures for transparency are unleashing locked-up data. These forces will create new opportunities for data literacy by the public, that will itself bring new understanding and improvements in governance and budgeting. We plan on Citizen DAN and its offspring to be one of the catalysts for those changes.

Posted on March 9, 2010 at 11:36 pm in Adaptive Information, Open Source, Semantic Web Tools, Software Development, Structured Dynamics | Comments (0)
The URI link reference to this post is: http://www.mkbergman.com/869/citizen-dan-prise-deux/
The URI to trackback this post is: http://www.mkbergman.com/869/citizen-dan-prise-deux/trackback/
Date:   February 15, 2010

Two Faces in Circle, from http://energeticrelations.com/Our Own Approach is Adaptive and Incremental

It is gratifying to see the emergence of the term semantic enterprise, with much increased attention and commentary. But, similar to different styles and patterns in software programming, there is not a single (nor best, depending on circumstance) way to approach becoming a semantic enterprise.

In this piece I contrast two styles. The more traditional and familiar one is comprehensive, complete and “engineered” in its approach. The second, and emerging style, is more adaptive and incremental. While Structured Dynamics is a proponent and thought leader for the adaptive style, the use and applicability of either approach is really a function of objectives and circumstances. The choice of approach depends on use case, and should not be a dogmatic one.

Any time a contrast is posed, one should be on guard about setting up a rhetorical strawman. There may perhaps be a bit of this flavor in this article; if so, it is unintended. It is probably best to realize that there is a gradient — or spectrum — of possible approaches between these contrasting styles. The real message is to understand these differences such that you can comfortably place your own organization at the right points along this spectrum.

A Spectrum of Advantages and Differences

The general idea of semantics in the enterprise preceeds the use of the term, having been somewhat captured before by the ideas of enterprise application integration, enterprise information integration and other concepts even related to data federation and data warehousing stretching back to the 1980s. However, as a specific label, we can look back to the first mentions in the late 1990s and more concerted attention beginning from about 2002 or so onward [1]. As another indicator, since 2005 the Semantic Technology Conference has given specific prominence to the enterprise [2].

Throughout this period, the sense from academic papers, many vendors, and most pundits [3] has been on things like automated reasoning, machine-aided decision making, aspects of artificial intelligence, and so forth. The general tone is often framed as “revolution” or “massive changes” or something “entirely new.” If you are a consultant or software/implementation vendor — especially where VC money is backing the venture with hopes for big returns and home runs — it may make cynical sense to sell such large and costly change.

I believe there are circumstances where the Semantic Enterprise writ this large may make sense and be financially justified. But, this kind of “big change” view has also seen relatively few visible (or successful) deployments. It has colored what it means to be a semantic enterprise. And, I believe, it has weakened market credibility by perhaps overpromising and underdelivering. The conventional view of what it is be a semantic enterprise deserves to be balanced.

So, as we balance this understanding of the semantic enterprise to one that is more nuanced, we can contrast the characteristics of the two apposite styles as follows:

Characteristics of the
Comprehensive, ‘Engineered’ Style
Characteristics of the
Adaptive, Incremental Style
  • A focus on a more complete, comprehensive coverage of the semantics in the domain
  • More enterprise-wide, less partial or departmental
  • Greater emphasis on “closed world” approaches [4]; more akin to relational database architecting and schema
  • Expansion is possible, but effort may be somewhat complex
  • A general implication is to replace or supplant existing information structures with semantic ones
  • Not necessarily based on semantic Web standards and languages [5] (e.g., may include Common Logic, frame logics, etc.)
  • Richer set of predicates (relations)
  • Though a distinction is maintained between schema and instances, their separation may not be consistently (physically) enforced
  • Often more complicated inferencing and logic tests
  • More complete enumeration and characterization of items
  • Much process around semantics agreement across groups
  • Fairly well-developed implementation tools, including for ontology engineering
  • Implementation times in months to years
  • Implementation costs akin to traditional large-scale IT projects
  • An emphasis on a simpler, incremental, “learn as you go” approach
  • Start with single departments or limited vertical apps
  • Embedded in the “open world” approach [4], with incorporation of external information
  • Design and approach inherently allows incremental expansion and adaptation
  • A key premise is to build from and leverage existing information structures, vocabularies and assets
  • Fully based on semantic Web standards and languages [5], often including linked data [6]
  • Tends to start simply with hierarchical or related concepts (e.g., SKOS)
  • Conscious distinction in the structure for handling schema separate from instances [7]
  • Inferencing logic based more on concept matching, or parent-child or part-of relationships
  • Degree of item characterization based on current scope
  • Initial semantic matching can be driven from existing assets
  • Fairly well-developed implementation tools, except for how to engage publics in the development process
  • Implementation times in weeks to months
  • Implementation costs driven by available budgets (and thus scope)

Note we have labeled the conventional approach as the “comprehensive, engineering” style; its contrast, and the one we position more closely to, is the “adaptive, incremental” style.

[Others have posited contrasting styles, most often as "top down" v. "bottom up." However, in one interpretation of that distinction, "top down" means a layer on top of the existing Web [8]. On the other hand, “top down” is more often understood in the sense of a “comprehensive, engineered” view, consistent with my own understanding [9]. Yet no matter which characterization, neither captures what I feel to be the more important considerations of mindset, logic and premise.]

Though the table above contrasts many points, I think there are two main distinctions to the adaptive approach. First, it firmly embraces the open world assumption. OWA is key to an incremental, “learn as you go” deployment that is also well suited to incorporation of external information. The second main distinction is to leverage and build from existing assets.

A Spectrum of Applications

Yet as noted in the opening, which of these approaches makes better sense depends on circumstance. One aspect of circumstance is available budget and deployment times for pilots or proofs-of-concept. Another aspect, of course, is the planned use or application for the deployment.

These are by no means hard distinctions, but in general we can see these contrasting approaches applying to the following uses:

Applications and Uses for the
Comprehensive, ‘Engineered’ Style
(i.e., more CWA driven)
Applications and Uses for the
Adaptive, Incremental Style
(i.e., more OWA driven)
  • Bounded, “inward” applications (high degree of control and completeness)
  • Engineering enterprises
  • Technical domains and organizations
  • Aeronautics
  • Pharmaceuticals
  • Chemicals
  • Petroleum
  • Energy
  • A/E firms (construction)
  • External facing applications, organizations (customers, incorporation of external data)
  • Faceted Search
  • Taxonomy updates
  • Multi-domain master data management (MDM)
  • Simple (initially) inferencing
  • Consumer products
  • Finance
  • Health care
  • Knowledge enterprises

A critical distinction is the nature of the enterprise itself. “External-facing” enterprises or functions that want or need to incorporate much external information (say, marketing or competitive intelligence) are advised to look closely at the adaptive approach. Organizations that have more complete control over their circumstances should perhaps focus on the conventional approach.

Adoption Thresholds and Risks

In previous writings I have pointed to the manifest benefits that can accrue to the semantic enterprise [see, esp. 10]. But we also have witnessed nearly a decade of promotion for semantics in the enterprise, with perhaps a lack of progress in some areas or unmet promises in others. These raise questions and skepticism of the real eventual costs and benefits.

I believe some of this skepticism is inherent with anything new — the general IT fatigue from what the current “next great thing” might be. But I also believe that some of this skepticism results from an approach to semantics in the enterprise that is both lengthy to deploy and high cost.

The key advantage of the adaptive, incremental approach is that the whole IT game in the enterprise can change. An open world approach enables adoption as it proves itself and as budgets allow. Commitments made under this approach have, in essence, permanent value. Past fears and concerns about making “wrong” bets no longer apply. With learning, targets can be re-adjusted, structure re-defined and applications re-focused, all as new discoveries and broadening scope dictate.

This does not make the adaptive approach better than the conventional one. But, it does make it less risky and, well, more adaptive.


[1] For example, the earliest Google mentions on “semantic enterprise” date to about 1998 or 1999. In 2002, the University of Georgia and Amit Sheth offered the first known academic course on the Semantic Enterprise; see http://lsdis.cs.uga.edu/SemanticEnterprise/.
[2] See the conference guide for the Semantic Technology Conference 2005. The sixth one, the 2010 Semantic Technology Conference, is upcoming on June 21-25 in San Francisco.
[3] See, for example, Mitchell Ummell, ed., 2009. “The Rise of the Semantic Enterprise,” special dedicated edition of the Cutter IT Journal, Vol. 22(9), 40 pp., September 2009. See http://www.cutter.com/offers/semanticenterprise.html (after filling out contact form). Partially in response to this conventional view, I wrote [10]. In that article I offered as a working definition that “a semantic enterprise is one that adopts the languages and standards of the semantic Web . . . and applies them to the issues of information interoperability, preferably using the best practices of linked data.” That happens to be Structured Dynamics’ preferred definition, though as this posting indicates, there is a spectrum of definitions of the term.
[4] See, M.K. Bergman, 2009. “The Open World Assumption: Elephant in the Room“, AI3:::Adaptive Information blog, December 21, 2009.
[5] See for example RDF, RDFS, OWL , SKOS and SPARQL and others.
[6] Linked data is a set of best practices for publishing and deploying instance and class data using the RDF data model. Two of the best practices are to name the data objects using uniform resource identifiers (URIs), and to expose the data for access via the HTTP protocol. Both of these practices enable the Web to become a distributed database, which also means that Web architectures can also be readily employed.
[7] We use a basis in description logics for defining the roles and splits in schema and instances. As we define it:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[8] One article that got quite a bit of play a few years back was A. Iskold, 2007. “Top Down: A New Approach to the Semantic Web,” in ReadWrite Web, Sept. 20, 2007. The problem with this terminology is that it offers a completely different sense of “top down” to traditional uses. In Iskold’s argument, his “top down” is a layering on top of the existing Web.
[9] The more traditional view of “top down” with respect to the semantic Web is in relation to how the system is constructed. This is reflected well in a presentation from the NSF Workshop on DB & IS Research for Semantic Web and Enterprises, April 3, 2002, entitled “The ‘Emergent, Semantic Web: Top Down Design or Bottom Up Consensus?“. Under this view, top down is design and committee-driven; bottom up is more decentralized and based on social processes, which is more akin to Iskold’s “top down.”
[10] M.K. Bergman, 2009. “Fresh Perspectives on the Semantic Enterprise,” AI3:::Adaptive Information blog, Sept. 28, 2009.

Posted on February 15, 2010 at 10:36 am in Adaptive Information, Semantic Web, Structured Dynamics | Comments (2)
The URI link reference to this post is: http://www.mkbergman.com/866/two-contrasting-styles-for-the-semantic-enterprise/
The URI to trackback this post is: http://www.mkbergman.com/866/two-contrasting-styles-for-the-semantic-enterprise/trackback/
Date:   February 2, 2010

Inkscape Logo

The Inkscape Process Can Also Aid Image Interchanges with Powerpoint

As we see more collaboration forums emerge, one question that naturally arises is the joint authoring or editing of images. This is particularly important as “official” slide decks or presentations come to the fore.

There are perhaps many different ways to skin this cat. In this article, I describe how to do so using the free, open source SVG editing program, Inkscape.

Why Inkscape?

Like many of you, I have been creating and editing images for years. I am by no means a graphics artist, but images and diagrams have been essential for communicating my work.

Until a few years back, I was totally a bitmap man. I used Paint Shop Pro (bought by Corel in 2004 and getting long in the tooth) and did a lot of copying and pasting.

I switched to Inkscape about two years ago for the following reasons:

  • I wanted re-use of image components via re-sizing and re-coloring, etc., and vector graphics are far superior to raster images for this purpose
  • I wanted a stable, free, usable editor and Inkscape was beginning to mature nicely (the current version 0.47 is even nicer and more stable)
  • Its SVG (scalable vector graphics) format was a standard adopted by the W3C after initial development by Adobe
  • SVG is an easily read and editable XML format
  • There was a growing source of online documentation
  • There was a growing repository of SVG graphics examples, including the broadscale use within Wikipedia (a good way to find stuff from this site is with the search “keywords site:http://commons.wikimedia.org filetype:svg” on your favorite search engine, after substituting your specific keywords).

How to Collaborate with Inkscape

Once you have a working image in Inkscape, make sure all collaborators have a copy of the software. Then:

  1. Isolate the picture (sometimes there are multiple images in a single file) by deleting all extraneous image stuff in the file
  2. From the toolbar, click on the Zoom to fit drawing in window icon [Zoom to fit drawing in window]; this will resize and put your target image in the full display window
  3. Under File -> Document Properties … check Show page border and Show border shadow, then Fit page to selection. This helps size the image properly in the exported file for sharing or collaboration
  4. Save the file as an *.svg option, and name the file with a date/time stamp and author extension (useful for tracking multiple author edits over time)
  5. If in multiple author mode, make sure who has current “ownership” of the image is clear.

How to Share with Powerpoint

Of course, it is more often the case that not all collaborators may have a copy of Inkscape or that the image began in the SVG format.

The image below began as a Windows Powerpoint clip art file, which has then gone through some modifications. Note the bearded guy’s hand holding the paper is out of registry (because I screwed up in earlier editing, but I also can easily fix because it is a vector image!  ;)   ). Also note we have the border from Inkscape as suggested above.  This file, BTW, is people.png, and was created as a PNG after a screen capture from Inkscape:

PNG representation of an SVG

When beginning in Powerpoint or as clip art, files in the format of Windows metafile (*.wmf) or extended WMF (*.emf) work well. (For example, you can download and play with the native Inkscape format of people.svg, or the people.wmf or people.emf versions of the image above.) If you already have images in a Powerpoint presentation, save in one of these two formats, with (*.emf) preferred. (EMF is generally better for text.)

You can open or load these files directly into Inkscape. Generally, they will come in as a group of vectors; to edit the pieces, you should “ungroup.”

After editing per the instructions in the previous section, if you need to re-insert back into Powerpoint, please use the *.emf format (and make sure you do not save text as paths).

For example, see the following PNG graphic taken from a Inkscape file (figure_text.svg):

PNG representation of an SVG

We can save it as an EMF (figure_textpath.emf) to a Powerpoint, with the option of converting text to paths:

Text-to-path EMF

Or, we can save it as an EMF (figure_text.emf) to a Powerpoint, only this time not converting text to paths and then “ungrouping” once in Powerpoint:

EMF with no text to path

Note the latter option, text not as path, is the far superior one. However, also note that borders are added to the figures and vertical text is rotated 90o back to horizontal. Nonetheless, the figure is fully editable, including text. Also, if the original Inkscape figures are constructed with lines of the same color as fills, the border conversion also works well.

Frankly, especially with text, because there can be orientation and other changes going from Inkscape to Powerpoint, I recommend using Inkscape and its native SVG for all early modifications and to keep a canonical copy of your images. Then, prior to completion of the deck, save as EMF for import into Powerpoint and then clean up. If changes later need to be made to the graphic, I recommend doing so in Inkscape and then re-importing.

Other Alternatives

I should note there is an option, as well, in Inkscape to convert raster images to vector ones (use Path -> Trace bitmap … and invoke the multiple scans with colors). This is doable, but involves quite a bit of image copying, manipulation and color separation to achieve workable results. You may want to see further Inkscape’s documentation on tracing, or more fully this reference dealing with color.

Of course, there are likely many other ways to approach these issues of collaboration and sharing. I will leave it to others to suggest and explain those options.

Posted on February 2, 2010 at 11:26 am in Adaptive Information, Blogs and Blogging, Open Source, Site-related | Comments (0)
The URI link reference to this post is: http://www.mkbergman.com/863/collaborating-on-images/
The URI to trackback this post is: http://www.mkbergman.com/863/collaborating-on-images/trackback/
Date:   January 26, 2010

AI3's Ontologies category

140 Tools: 20 Must Haves, 70 Possible Usefuls, and 50 Has Beens and Marginals

Well, for another client and another purpose, I was goaded into screening my Sweet Tools listing of semantic Web and -related tools and to assemble stuff from every other nook and cranny I could find. The net result is this enclosed listing of some 140 or so tools — most open source — related to semantic Web ontology building in one way or another.

Ever since I wrote my Intrepid Guide to Ontologies nearly three years ago (and one of the more popular articles of this site, though it is now perhaps a bit long in the tooth), I have been intrigued with how these semantic structures are built and maintained. That interest, in no small measure, is why I continue to maintain the Sweet Tools listing.

As far as I know, the following is the largest and most comprehensive listing of ontology building tools available. I broadly interpret the classification of ‘ontology building’; I include, for example, vocabulary extraction and prompting tools, as well as ontology visualization and mapping.

There are some 140 tools, perhaps 90 or so are still in active use. (Given the scope, not every tool could be inspected in detail. Some listed as being perhaps inactive may not be so, and others not in that category perhaps should be.) Of the entire roster of tools, somewhere on the order of 12 to 20 are quite impressive and deserving of local installation, test runs, and close inspection.

There are relatively few tools useful to non-specialists (or useful to engaging knowledgeable publics in the ontology-building exercise). There appear to be key gaps in the entire workflow from domain scoping and initial ontology definition and vocabulary candidates, to longer-term maintenance and revision. For example, spreadsheets would appear to be a possible useful first step in any workflow process (which is why irON is listed), but the spreadsheet tool per se is not listed herein (nor are text editors).

I surely have missed some tools and likely improperly assigned others. Please drop me an email or comment on this post with any revisions or suggestions.

Some Worth A Closer Look

In my own view, there are some tools that definitely deserve a closer look. My favorite candidates — for very different reasons and for very different places in the workflow — are (in no particular order): Apelon DTS, irON, FlexViz, Knoodl, Protégé, diagramic.com, BooWa, COE, ontopia, Anzo, PoolParty, Vine (and voc2rdf), Erca, Graphl, and GrOWL. Each one of these links is more fully described below. Also, all tools in the Vocabulary Prompting Tools category (which also includes extraction) are worth reviewing since all or nearly all have online demos.

Other tools may also be deserving, depending on use case. Some of the more specific analysis and conversion tools, for example, are in the Miscellaneous category.

Also, some purists may quibble with why some tools are listed here (such as inclusion of some stuff related to Topic Maps). Well, my answer to that is there are no real complete solutions, and whatever we can pragmatically do today requires glueing together many disparate parts.

Comprehensive Ontology Tools

  • Altova SemanticWorks is a visual RDF and OWL editor that auto-generates RDF/XML or nTriples based on visual ontology design. No open source version available
  • Amine is a rather comprehensive, open source platform for the development of intelligent and multi-agent systems written in Java. As one of its components, it has an ontology GUI with text- and tree-based editing modes, with some graph visualization
  • The Apelon DTS (Distributed Terminology System) is an integrated set of open source components that provides comprehensive terminology services in distributed application environments. DTS supports national and international data standards, which are a necessary foundation for comparable and interoperable health information, as well as local vocabularies. Typical applications for DTS include clinical data entry, administrative review, problem-list and code-set management, guideline creation, decision support and information retrieval.. Though not strictly an ontology management system, Apelon DTS has plug-ins that provide visualization of concept graphs and related functionality that make it close to a complete solution
  • DOME is a programmable XML editor which is being used in a knowledge extraction role to transform Web pages into RDF, and available as Eclipse plug-ins. DOME stands for DERI Ontology Management Environment
  • FlexViz is a Flex-based, Protégé-like client-side ontology creation, management and viewing tool; very impressive. The code is distributed from Sourceforge; there is a nice online demo available; there is a nice explanatory paper on the system, and the developer, Chris Callendar, has a useful blog with Flex development tips
  • Knoodl facilitates community-oriented development of OWL based ontologies and RDF knowledge bases. It also serves as a semantic technology platform, offering a Java service-based interface or a SPARQL-based interface so that communities can build their own semantic applications using their ontologies and knowledgebases. It is hosted in the Amazon EC2 cloud and is available for free; private versions may also be obtained. See especially the screencast for a quick introduction
  • The NeOn toolkit is a state-of-the-art, open source multi-platform ontology engineering environment, which provides comprehensive support for the ontology engineering life-cycle. The v2.3.0 toolkit is based on the Eclipse platform, a leading development environment, and provides an extensive set of plug-ins covering a variety of ontology engineering activities. You can add these plug-ins or get a current listing from the built-in updating mechanism
  • ontopia is a relative complete suite of tools for building, maintaining, and deploying Topic Maps-based applications; open source, and written in Java. Could not find online demos, but there are screenshots and there is visualization of topic relationships
  • Protégé is a free, open source visual ontology editor and knowledge-base framework. The Protégé platform supports two main ways of modeling ontologies via the Protégé-Frames and Protégé-OWL editors. Protégé ontologies can be exported into a variety of formats including RDF(S), OWL, and XML Schema. There are a large number of third-party plugins that extends the platform’s functionality
    • Protégé Plugin Library – frequently consult this page to review new additions to the Protégé editor; presently there are dozens of specific plugins, most related to the semantic Web and most open source
    • Collaborative Protégé is a plug-in extension of the existing Protégé system that supports collaborative ontology editing as well as annotation of both ontology components and ontology changes. In addition to the common ontology editing operations, it enables annotation of both ontology components and ontology changes. It supports the searching and filtering of user annotations, also known as notes, based on different criteria. There is also an online demo
  • TopBraid Composer is an enterprise-class modeling environment for developing Semantic Web ontologies and building semantic applications. Fully compliant with W3C standards, Composer offers comprehensive support for developing, managing and testing configurations of knowledge models and their instance knowledge bases. It is based on the Eclipse IDE. There is a free version (after registration) for small ontologies.

Not Apparently in Active Use

  • Adaptiva is a user-centred ontology building environment, based on using multiple strategies to construct an ontology, minimising user input by using adaptive information extraction
  • Exteca is an ontology-based technology written in Java for high-quality knowledge management and document categorisation, including entity extraction. Though code is still available, no updates have been provided since 2006. It can be used in conjunction with search engines
  • IODT is IBM’s toolkit for ontology-driven development. The toolkit includes EMF Ontolgy Definition Metamodel (EODM), EODM workbench, and an OWL Ontology Repository (named Minerva)
  • KAON is an open-source ontology management infrastructure targeted for business applications. It includes a comprehensive tool suite allowing easy ontology creation and management and provides a framework for building ontology-based applications. An important focus of KAON is scalable and efficient reasoning with ontologies
  • Ontolingua provides a distributed collaborative environment to browse, create, edit, modify, and use ontologies. The server supports over 150 active users, some of whom have provided us with descriptions of their projects. Provided as an online service; software availability not known.

Vocabulary Prompting Tools

  • AlchemyAPI from Orchestr8 provides an API based application that uses statistical and natural language processing methods. Applicable to webpages, text files and any input text in several languages
  • BooWa is a set expander for any language (formerly known as SEALS); developed by RC Wang of Carnegie Mellon
  • Google Keywords allows you to enter a few descriptive words or phrases or a site URL to generate keyword ideas
  • Google Sets for automatically creating sets of items from a few examples
  • Open Calais is free limited API web service to automatically attach semantic metadata to content, based on either entities (people, places, organizations, etc.), facts (person ‘x’ works for company ‘y’), or events (person ‘z’ was appointed chairman of company ‘y’ on date ‘x’). The metadata results are stored centrally and returned to you as industry-standard RDF constructs accompanied by a Globally Unique Identifier (GUID)
  • Query-by-document from BlogScope has a nice phrase extraction service, with a choice of ranking methods. Can also be used in a Firefox plug-in (not texted with 3.5+)
  • SemanticHacker (from Textwise) is an API that does a number of different things, including categorization, search, etc. By using ‘concept tags’, the API can be leveraged to generate metadata or tags for content
  • TagFinder is a Web service that automatically extracts tags from a piece of text. The tags are chosen based on both statistical and linguistic analysis of the original text
  • Tagthe.net has a demo and an API for automatic tagging of web documents and texts. Tags can be single words only. The tool also recognizes named entities such as people names and locations
  • TermExtractor extracts terminology consensually referred in a specific application domain. The software takes as input a corpus of domain documents, parses the documents, and extracts a list of “syntactically plausible” terms (e.g. compounds, adjective-nouns, etc.)
  • TermFinder uses Poisson statistics, the Maximum Likelihood Estimation and Inverse Document Frequency between the frequency of words in a given document and a generic corpus of 100 million words per language; available for English, French and Italian
  • TerMine is an online and batch term extractor that emphasizes part of speech (POS) and n-gram (phrase extraction). TerMine is the terminological management system with the C-Value term extraction and AcroMine acronym recognition integrated
  • Topia term extractor is a part-of-speech and frequency based term extraction tool implemented in python. Here is a term extraction demo based on this tool
  • Topicalizer is a service which automatically analyses a document specified by a URL or a plain text regarding its word, phrase and text structure. It provides a variety of useful information on a given text including the following: Word, sentence and paragraph count, collocations, syllable structure, lexical density, keywords, readability and a short abstract on what the given text is about
  • TrMExtractor does glossary extraction on pure text files for either English or Hungarian
  • Wikify! is a system to automatically “wikify” a text by adding Wikipedia-like tags throughout the document. The system extracts keywords and then disambiguates and matches them to their corresponding Wikipedia definition
  • Yahoo! Placemaker is a freely available geoparsing Web service. It helps developers make their applications location-aware by identifying places in unstructured and atomic content – feeds, web pages, news, status updates – and returning geographic metadata for geographic indexing and markup
  • Yahoo! Term Extraction Service is an API to Yahoo’s term extraction service, as well as many other APIs and services in a variety of languages and for a variety of tasks; good general resource. The service has been reported to be shut down numerous times, but apparently is kept alive due to popular demand.

Initial Ontology Development

  • COE COE (CmapTools Ontology Editor) is a specialized version of the CmapTools from IMHC. COE — and its CmapTools parent — is based on the idea of concept maps. A concept map is a graph diagram that shows the relationships among concepts. Concepts are connected with labeled arrows, with the relations manifesting in a downward-branching hierarchical structure. COE is an integrated suite of software tools for constructing, sharing and viewing OWL encoded ontologies based on these constructs
  • Conzilla2 is a second generation concept browser and knowledge management tool with many purposes. It can be used as a visual designer and manager of RDF classes and ontologies, since its native storage is in RDF. It also has an online collaboration server
  • http://diagramic.com/ has an online Flex network graph demo, which also has a neat facility for quick entry and visualization of relationships; mostly small scale; pretty cool. Does not appear to be code available anywhere
  • DogmaModeler is a free and open source, ontology modeling tool based on ORM. The philosophy of DogmaModeler is to enable non-IT experts to model ontologies with a little or no involvement of an ontology engineer; project is quite old, but the software is still available and it may provide some insight into naive ontology development
  • Erca is a framework that eases the use of Formal and Relational Concept Analysis, a neat clustering technique. Though not strictly an ontology tool, Erca could be implemented in a work flow that allows easy import of formal contexts from CSV files, then algorithms that computes the concept lattice of the formal contexts that can be exported as dot graphs (or in JPG, PNG, EPS and SVG formats). Erca is provided as an Eclipse plug-in
  • GraphMind is a mindmap editor for Drupal. It has the basic mindmap features and some Drupal specific enhancements. There is a quick screencast about how GraphMind looks like and what is does. The Flex source is also available from Github
  • GrOWL is the software framework to provide graphical, intuitive browsing and editing of knowledge maps. GrOWL is open source and is used in several projects worldwide. None of the online demos apparently work, but the screenshots look interesting and the code is still available
  • irON using spreadsheets, via its notation and specification. Spreadsheets can be used for initial authoring, esp if the irON guidelines are followed. See further this case study of Sweet Tools in a spreadsheet using irON (commON)
  • ITM T3 stands for Terminology, Thesaurus, Taxonomy, Metadata dictionary. ITM T3 includes a range of functions for managing enterprise shareable multilingual domain-specific taxonomies, thesaurus, terminologies in a unified way. It uses XML, SKOS and RDF standards. Commercial; from Mondeca
  • MindRaider is Semantic Web outliner. It aims to connect the tradition of outline editors with emerging technologies. MindRaider mission is to organize not only the content of your hard drive but also your cognitive base and social relationships in a way that enables quick navigation, concise representation and inferencing
  • Topincs is a Topic Map authoring software that allows groups to share their knowledge over the web. It makes use of a variety of modern technologies. The most important are Topic Maps, REST and Ajax. It consists of three components: the Wiki, the Editor, and the Server. The servier requires AMP; the Editor and Wiki are based on browser plug-ins.

Ontology Editing

  • First, see all of the Comprehensive Tools listing above
  • Anzo for Excel includes an (RDFS and OWL-based) ontology editor that can be used directly within Excel. In addition to that, Anzo for Excel includes the capability to automatically generate an ontology from existing spreadsheet data, which is very useful for quick bootstrapping of an ontology.
  • Hozo is an ontology visualization and development tool that brings version control constructs to group ontology development; limited to a prototype, with no online demo
  • Lexaurus Editor is for off-line creation and editing of vocabularies, taxonomies and thesauri. It supports import and export in Zthes and SKOS XML formats, and allows hierarchical / poly-hierarchical structures to be loaded for editing, or even multiple vocabularies to be loaded simultaneously, so that terms from one taxonomy can be re-used in another, using drag and drop. Not available in open source
  • Model Futures OWL Editor combines simple OWL tools, featuring UML (XMI), ErWin, thesaurus and imports. The editor is tree-based and has a “navigator” tool for traversing property and class-instance relationships. It can import XMI (the interchange format for UML) and Thesaurus Descriptor (BT-NT XML), and EXPRESS XML files. It can export to MS Word.
  • OntoTrack is a browsing and editing ontology authoring tool for OWL Lite. It combines a sophisticated graphical layout with mouse enabled editing features optimized for efficient navigation and manipulation of large ontologies
  • OWLViz is an attractive visual editor for OWL and is available as a Protégé plug-in
  • PoolParty is a triple store-based thesaurus management environment which uses SKOS and text extraction for tag recommendations. See further this manual, which describes more fully the system’s functionality. Also, there is a PoolParty Web service that enables a Zthes thesaurus in XML format to be uploaded and converted to SKOS (via skos:Concepts)
  • SKOSEd is a plugin for Protege 4 that allows you to create and edit thesauri (or similar artefacts) represented in the Simple Knowledge Organisation System (SKOS).
  • TemaTres is a Web application to manage controlled vocabularies, taxonomies and thesaurus. The vocabularies may be exported in Zthes, Skos, TopicMap, etc.
  • ThManager is a tool for creating and visualizing SKOS RDF vocabularies. ThManager facilitates the management of thesauri and other types of controlled vocabularies, such as taxonomies or classification schemes
  • Vitro is a general-purpose web-based ontology and instance editor with customizable public browsing. Vitro is a Java web application that runs in a Tomcat servlet container. With Vitro, you can: 1) create or load ontologies in OWL format; 2) edit instances and relationships; 3) build a public web site to display your data; and 4) search your data with Lucene. Still in somewhat early phases, with no online demos and with minimal interfaces.

Not Apparently in Active Use

  • Omnigator The Omnigator is a form-based manipulaton tool centered on Topic Maps, though it enables the loading and navigation of any conforming topic map in XTM, HyTM, LTM or RDF formats. There is a free evaluation version.
  • OntoGen is a semi-automatic and data-driven ontology editor focusing on editing of topic ontologies (a set of topics connected with different types of relations). The system combines text-mining techniques with an efficient user interface. It requires .Net.
  • OWL-S-editor is an editor for the development of services in OWL-S, with graphical, WSDL and import/export support
  • ReTAX+ is an aide to help a taxonomist create a consistent taxonomy and in particular provides suggestions as to where a new entity could be placed in the taxonomy whilst retaining the integrity of the revised taxonomy (c.f., problems in ontology modelling)
  • SWOOP is a lightweight ontology editor. (Swoop is no longer under active development at mindswap. Continuing development can be found on SWOOP’s Google Code homepage at http://code.google.com/p/swoop/)
  • WebOnto supports the browsing, creation and editing of ontologies through coarse grained and fine grained visualizations and direct manipulation.

Ontology Mapping

  • COMA++ is a schema and ontology matching tool with a comprehensive infrastructure. Its graphical interface supports a variety of interaction
  • ConcepTool is a system to model, analyse, verify, validate, share, combine, and reuse domain knowledge bases and ontologies, reasoning about their implication
  • MatchIT automates and facilitates schema matching and semantic mapping between different Web vocabularies. MatchIT runs as a stand-alone or plug-in Eclipse application and can be integrated with popular third party applications. MatchIT’s uses Adaptive Lexicon™ as an ontology-driven dictionary and thesaurus of English language terminology to quantify and ank the semantic similarity of concepts. It apparently is not available in open source
  • myOntology is used to produce the theoretical foundations, and deployable technology for the Wiki-based, collaborative and community-driven development and maintenance of ontologies instance data and mappings
  • OLA/OLA2 (OWL-Lite Alignment) matches ontologies written in OWL. It relies on a similarity combining all the knowledge used in entity descriptions. It also deal with one-to-many relationships and circularity in entity descriptions through a fixpoint algorithm
  • Potluck is a Web-based user interface that lets casual users—those without programming skills and data modeling expertise—mash up data themselves. Potluck is novel in its use of drag and drop for merging fields, its integration and extension of the faceted browsing paradigm for focusing on subsets of data to align, and its application of simultaneous editing for cleaning up data syntactically. Potluck also lets the user construct rich visualizations of data in-place as the user aligns and cleans up the data.
  • PRIOR+ is a generic and automatic ontology mapping tool, based on propagation theory, information retrieval technique and artificial intelligence model. The approach utilizes both linguistic and structural information of ontologies, and measures the profile similarity and structure similarity of different elements of ontologies in a vector space model (VSM).
  • Vine is a tool that allows users to perform fast mappings of terms across ontologies. It performs smart searches, can search using regular expressions, requires a minimum number of clicks to perform mappings, can be plugged into arbitrary mapping framework, is non-intrusive with mappings stored in an external file, has export to text files, and adds metadata to any mapping. See also http://sourceforge.net/projects/vine/.

Not Apparently in Active Use

  • ASMOV (Automated Semantic Mapping of Ontologies with Validation) is an automatic ontology matching tool which has been designed in order to facilitate the integration of heterogeneous systems, using their data source ontologies
  • Chimaera is a software system that supports users in creating and maintaining distributed ontologies on the web. Two major functions it supports are merging multiple ontologies together and diagnosing individual or multiple ontologies
  • CMS (CROSI Mapping System) is a structure matching system that capitalizes on the rich semantics of the OWL constructs found in source ontologies and on its modular architecture that allows the system to consult external linguistic resources
  • ConRef is a service discovery system which uses ontology mapping techniques to support different user vocabularies
  • DRAGO reasons across multiple distributed ontologies interrelated by pairwise semantic mappings, with a vision of peer-to-peer mapping of many distributed ontologies on the Web. It is implemented as an extension to an open source Pellet OWL Reasoner
  • Falcon-AO (Finding, aligning and learning ontologies) is an automatic ontology matching tool that includes the three elementary matchers of String, V-Doc and GMO. In addition, it integrates a partitioner PBM to cope with large-scale ontologies
  • FOAM is the Framework for ontology alignment and mapping. It is based on heuristics (similarity) of the individual entities (concepts, relations, and instances)
  • hMAFRA (Harmonize Mapping Framework) is a set of tools supporting semantic mapping definition and data reconciliation between ontologies. The targeted formats are XSD, RDFS and KAON
  • IF-Map is an Information Flow based ontology mapping method. It is based on the theoretical grounds of logic of distributed systems and provides an automated streamlined process for generating mappings between ontologies of the same domain
  • LILY is a system matching heterogeneous ontologies. LILY extracts a semantic subgraph for each entity, then it uses both linguistic and structural information in semantic subgraphs to generate initial alignments. The system is presently in a demo version only
  • MAFRA Toolkit – the Ontology MApping FRAmework Toolkit allows users to create semantic relations between two (source and target) ontologies, and apply such relations in translating source ontology instances into target ontology instances
  • OntoEngine is a step toward allowing agents to communicate even though they use different formal languages (i.e., different ontologies). It translates data from a “source” ontology to a “target”
  • OWLS-MX is a hybrid semantic Web service matchmaker. OWLS-MX 1.0 utilizes both description logic reasoning, and token based IR similarity measures. It applies different filters to retrieve OWL-S services that are most relevant to a given query
  • RiMOM (Risk Minimization based Ontology Mapping) integrates different alignment strategies: edit-distance based strategy, vector-similarity based strategy, path-similarity based strategy, background-knowledge based strategy, and three similarity-propagation based strategies
  • semMF is a flexible framework for calculating semantic similarity between objects that are represented as arbitrary RDF graphs. The framework allows taxonomic and non-taxonomic concept matching techniques to be applied to selected object properties
  • Snoggle is a graphical, SWRL-based ontology mapper. Snoggle attempts to solve the ontology mapping problem by providing a graphical user interface (similar to which of the Microsoft Visio) to guide the process of ontology vocabulary alignment. In Snoggle, user-defined mappings can be serialized into rules, which is expressed using SWRL
  • Terminator is a tool for creating term to ontology resource mappings (documentation in Finnish).

Ontology Visualization/Analysis

Though all are not relevant, see my post from a couple of years back on large-scale RDF graph software.

  • Social network graphing tools (many covered elsewhere)
  • Cytoscape is a bioinformatics software platform for visualizing molecular interaction networks and integrating these interactions with gene expression profiles and other state data; I have also written specifically about Cytoscape’s use in UMBEL
    • RDFScape is a project that brings Semantic Web “features” to the popular Systems Biology software Cytoscape
    • NetworkAnalyzer performs analysis of biological networks and calculates network topology parameters including the diameter of a network, the average number of neighbors, and the number of connected pairs of nodes. It also computes the distributions of more complex network parameters such as node degrees, average clustering coefficients, topological coefficients, and shortest path lengths. It displays the results in diagrams, which can be saved as images or text files; used by SD
  • Graphl is a tool for collaborative editing and visualisation of graphs, representing relationships between resources or concepts of the real world. Graphl may be thought of as a visual wiki, a place where everybody can contribute to a shared repository of knowledge
  • igraph is a free software package for creating and manipulating undirected and directed graphs
  • Network Workbench is a very complex, comprehensive; Swiss Army Knife
  • NetworkX – Python; very clean
  • Stanford Network Analysis Package (SNAP) is a general purpose network analysis and graph mining library. It is written in C++ and easily scales to massive networks with hundreds of millions of nodes
  • Social Networks Visualizer (SocNetV) is a flexible and user-friendly tool for the analysis and visualization of Social Networks. It lets you construct networks (mathematical graphs) with a few clicks on a virtual canvas or load networks of various formats (GraphViz, GraphML, Adjacency, Pajek, UCINET, etc) and modify them to suit your needs. SocNetV also offers a built-in web crawler, allowing you to automatically create networks from all links found in a given initial URL
  • Tulip may be incredibly strong
  • Springgraph component for Flex
  • VizierFX is a Flex library for drawing network graphs. The graphs are laid out using GraphViz on the server side, then passed to VizierFX to perform the rendering. The library also provides the ability to run ActionScript code in response to events on the graph, such as mousing over a node or clicking on it.

Miscellaneous Ontology Tools

  • Apolda (Automated Processing of Ontologies with Lexical Denotations for Annotation) is a plugin (processing resource) for GATE (http://gate.ac.uk/). The Apolda processing resource (PR) annotates a document like a gazetteer, but takes the terms from an (OWL) ontology rather than from a list
  • DL-Learner is a tool for learning complex classes from examples and background knowledge. It extends Inductive Logic Programming to Description Logics and the Semantic Web. DL-Learner now has a flexible component based design, which allows to extend it easily with new learning algorithms, learning problems, reasoners, and supported background knowledge sources. A new type of supported knowledge sources are SPARQL endpoints, where DL-Learner can extract knowledge fragments, which enables learning classes even on large knowledge sources like DBpedia, and includes an OWL API reasoner interface and Web service interface.
  • LexiLink is a tool for building, curating and managing multiple lexicons and ontologies in one enterprise-wide Web-based application. The core of the technology is based on RDF and OWL
  • mopy is the Music Ontology Python library, designed to provide easy to use python bindings for ontology terms for the creation and manipulation of music ontology data. mopy can handle information from several ontologies, including the Music Ontology, full FOAF vocab, and the timeline and chord ontologies.
  • OBDA (Ontology Based Data Access) is a plugin for Protégé aimed to be a full-fledged OBDA ontology and component editor. It provides data source and mapping editors, as well as querying facilities that, in sum, allow you to design and test every aspect of an OBDA system. It supports relational data sources (RDBMS) and GLAV-like mappings. In its current beta form, it requires Protege 3.3.1, a reasoner implementing the OBDA extensions to DIG 1.1 (e.g., the DIG server for QuOnto) and Jena 2.5.5
  • OntoComP is a Protégé 4 plugin for completing OWL ontologies. It enables the user to check whether an OWL ontology contains “all relevant information” about the application domain, and extend the ontology appropriately if this is not the case
  • Ontology Browser is a browser created as part of the CO-ODE (http://www.co-ode.org/) project; rather simple interface and use
  • Ontology Metrics is a web-based tool that displays statistics about a given ontology, including the expressivity of the language it is written in
  • OntoSpec is a SWI-Prolog module, aiming at automatically generating XHTML specification from RDF-Schema or OWL ontologies
  • OWL API is a Java interface and implementation for the W3C Web Ontology Language (OWL), used to represent Semantic Web ontologies. The API is focused towards OWL Lite and OWL DL and offers an interface to inference engines and validation functionality
  • OWL Module Extractor is a Web service that extracts a module for a given set of terms from an ontology. It is based on an implementation of locality-based modules that is part of the OWL API.
  • OWL Syntax Converter is an online tool for converting ontologies between different formats, including several OWL syntaxes, RDF/XML, KRSS
  • OWL Verbalizer is an on-line tool that verbalizes OWL ontologies in (controlled) English
  • OwlSight is an OWL ontology browser that runs in any modern web browser; it’s developed with Google Web Toolkit and uses Gwt-Ext, as well as OWL-API. OwlSight is the client component and uses Pellet as its OWL reasoner
  • Pellint is an open source lint tool for Pellet which flags and (optionally) repairs modeling constructs that are known to cause performance problems. Pellint recognizes several patterns at both the axiom and ontology level.
  • PROMPT is a tab plug-in for Protégé is for managing multiple ontologies by comparing versions of the same ontology, moving frames between included and including project, merging two ontologies into one, or extracting a part of an ontology.
  • SegmentationApp is a Java application that segments a given ontology according to the approach described in “Web Ontology Segmentation: Analysis, Classification and Use” (http://www.co-ode.org/resources/papers/seidenberg-www2006.pdf)
  • SETH is a software effort to deeply integrate Python with Web Ontology Language (OWL-DL dialect). The idea is to import ontologies directly into the programming context so that its classes are usable alongside standard Python classes
  • SKOS2GenTax is an online tool that converts hierarchical classifications available in the W3C SKOS (Simple Knowledge Organization Systems) format into RDF-S or OWL ontologies
  • SpecGen (v5) is an ontology specification generator tool. It’s written in Python using Redland RDF library and licensed under the MIT license
  • Text2Onto is a framework for ontology learning from textual resources that extends and re-engineers an earlier framework developed by the same group (TextToOnto). Text2Onto offers three main features: it represents the learned knowledge at a metalevel by instantiating the modelling primitives of a Probabilistic Ontology Model (POM), thus remaining independent from a specific target language while allowing the translation of the instantiated primitives
  • Thea is a Prolog library for generating and manipulating OWL (Web Ontology Language) content. Thea OWL parser uses SWI-Prolog’s Semantic Web library for parsing RDF/XML serialisations of OWL documents into RDF triples and then it builds a representation of the OWL ontology
  • TONES Ontology Repository is primarily designed to be a central location for ontologies that might be of use to tools developers for testing purposes; it is part of the TONES project
  • Visual Ontology Manager (VOM) is a family of tools enables UML-based visual construction of component-based ontologies for use in collaborative applications and interoperability solutions.
  • Web Ontology Manager is a lightweight, Web-based tool using J2EE for managing ontologies expressed in Web Ontology Language (OWL). It enables developers to browse or search the ontologies registered with the system by class or property names. In addition, they can submit a new ontology file
  • RDF evoc (external vocabulary importer) is an RDF external vocabulary importer module (evoc) for Drupal caches any external RDF vocabulary and provides properties to be mapped to CCK fields, node title and body. This module requires the RDF and the SPARQL modules.

Not Apparently in Active Use

  • Almo is an ontology-based workflow engine in Java supporting the ARTEMIS project; part of the OntoWare initiative
  • ClassAKT is a text classification web service for classifying documents according to the ACM Computing Classification System
  • Elmo provides a simple API to access ontology oriented data inside a Sesame RDF repository. The domain model is simplified into independent concerns that are composed together for multi-dimensional, inter-operating, or integrated applications
  • ExtrAKT is a tool for extracting ontologies from Prolog knowledge bases.
  • F-Life is a tool for analysing and maintaining life-cycle patterns in ontology development.
  • Foxtrot is a recommender system which represents user profiles in ontological terms, allowing inference, bootstrapping and profile visualization.
  • HyperDAML creates an HTML representation of OWL content to enable hyperlinking to specific objects, properties, etc.
  • LinKFactory is an ontology management tool, it provides an effective and user-friendly way to create, maintain and extend extensive multilingual terminology systems and ontologies (English, Spanish, French, etc.). It is designed to build, manage and maintain large, complex, language independent ontologies.
  • LSW – the Lisp semantic Web toolkit enables OWL ontologies to be visualized. It was written by Alan Ruttenberg
  • Ontodella is a Prolog HTTP server for category projection and semantic linking
  • OntoWeaver is an ontology-based approach to Web sites, which provides high level support for web site design and development
  • OWLLib is a PHP library for accessing OWL files. OWL is w3.org standard for storing semantic information
  • pOWL is a Semantic Web development platform for ontologies in PHP. pOWL consists of a number of components, including RAP
  • ROWL is the Rule Extension of OWL; it is from the Mobile Commerce Lab in the School of Computer Science at Carnegie Mellon University
  • Semantic Net Generator is a utlity for generating Topic Maps automatically from different data sources by using rules definitions specified with Jelly XML syntax. This Java library provides Jelly tags to access and modify data sources (also RDF) to create a semantic network
  • SMORE is OWL markup for HTML pages. SMORE integrates the SWOOP ontology browser, providing a clear and consistent way to find and view Classes and Properties, complete with search functionality
  • SOBOLEO is a system for Web-based collaboration to create SKOS taxonomies and ontologies and to annotate various Web resources using them
  • SOFA is a Java API for modeling ontologies and Knowledge Bases in ontology and Semantic Web applications. It provides a simple, abstract and language neutral ontology object model, inferencing mechanism and representation of the model with OWL, DAML+OIL and RDFS languages; from java.dev
  • WebScripter is a tool that enables ordinary users to easily and quickly assemble reports extracting and fusing information from multiple, heterogeneous DAMLized Web sources.

Posted on January 26, 2010 at 9:54 am in Adaptive Information, Ontologies, Open Source, Semantic Web Tools | Comments (6)
The URI link reference to this post is: http://www.mkbergman.com/862/the-sweet-compendium-of-ontology-building-tools/
The URI to trackback this post is: http://www.mkbergman.com/862/the-sweet-compendium-of-ontology-building-tools/trackback/
Date:   November 11, 2009

irON - instance record and Object Notation

A Case Study of Turning Spreadsheets into Structured Data Powerhouses

In a former life, I had the nickname of ‘Spreadsheet King’ (perhaps among others that I did not care to hear). I had gotten the nick because of my aggressive use of spreadsheets for financial models, competitors tracking, time series analyses, and the like. However, in all honesty, I have encountered many others in my career much more knowledgeable and capable with spreadsheets than I’ll ever be. So, maybe I was really more like a minor duke or a court jester than true nobility.

Yet, pro or amateur, there are perhaps 1 billion spreadsheet users worldwide [1], making spreadsheets undoubtedly the most prevalent data authoring environment in existence. And, despite moans and wails about how spreadsheets can lead to chaos, spaghetti code, or violations of internal standards, they are here to stay.

Spreadsheets often begin as simple notetaking environments. With the addition of new findings and more analysis, some of these worksheets may evolve to become full-blown datasets. Alternatively, some spreadsheets start from Day One as intended datasets or modeling environments. Whatever the case, clearly there is much accumulated information and data value “locked up” in existing spreadsheets.

How to “unlock” this value for sharing and collaboration was a major stimulus to development of the commON serialization of irON (instance record and Object Notation) [2]. I recently published a case study [3] that describes the reasons and benefits of dataset authoring in a spreadsheet, and provides working examples and code based on Sweet Tools [4] to aid users in understanding and using the commON notation. I summarize portions of that study herein.

This is the second article of a two-part series related to the recent Sweet Tools update.

Background on Sweet Tools and irON

The dataset that is the focus of this use case, Sweet Tools, began as an informal tracking spreadsheet about four years ago. I began it as a way to learn about available tools in the semantic Web and -related spaces. I began publishing it and others found it of value so I continued to develop it.

As it grew over time, however, it gained in structure and size. Eventually, it became a reference dataset, with which many other people desired to use and interact. The current version has well over 800 tools listed, characterized by many structured data attributes such as type, programming language, description and so forth. As it has grown, a formal controlled vocabulary has also evolved to bring consistency to the characterization of many of these attributes.

It was natural for me to maintain this listing as a spreadsheet, which was also reinforced when I was one of the first to adopt an Exhibit presentation of the data based on a Google spreadsheet about three years back. Here is a partial view of this spreadsheet as I maintain it locally:

Sweet Tools Main Spreadsheet Screen
(click to expand)

When we began to develop irON in earnest as a simple (”naïve”) dataset authoring framework, it was clear that a comma-separated value, or CSV [5], option should join the other two serializations under consideration, XML and JSON. CSV, though less expressive and capable as a data format than the other serializations, still has an attribute-value pair (also known as key-value pairs and many other variants [6]) orientation. And, via spreadsheets, datasets can be easily authored and inspected, while also providing a rich functional environment including sorting, formatting, data validation, calculations, macros, etc.

As a dataset very familiar to us as irON’s editors, and directly relevant to the semantic Web, Sweet Tools provided a perfect prototype case study for helping to guide the development of irON, and specifically what came to be known as the commON serialization for irON. The Sweet Tools dataset is relatively large for a speciality source, has many different types and attributes, and is characterized by text, images, URLs and similar.

The premise was that if Sweet Tools could be specified and represented in commON sufficiently to be parsed and converted to interoperable RDF, then many similar instance-oriented datasets could likely be so as well. Thus, as we tried and refined notation and vocabulary, we tested applicability against the CSV representation of Sweet Tools in addition to other CSV, JSON and XML datasets.

Dataset Authoring in a Spreadsheet

A large portion of the case study describes the many advantages of authoring small datasets within spreadsheets. The useful thing about the CSV format is that these full functional capabilities of the spreadsheet are available during authoring or later updates and modifications, but, when exported, the CSV provides a relatively clean format for processing and parsing.

So, some of the reasons for small dataset authoring in a spreadsheet include:

  • Formatting and on-sheet management -  the first usefulness of a spreadsheet comes from being able to format and organize the records. Records can be given background colors to highlight distinctions (new entries, for example); live URL links can be embedded; contents can be wrapped and styled within cells; and the column and row heads can be “frozen”, useful when scrolling large workspaces
  • Named blocks and sorting – named blocks are a powerful feature of modern spreadsheets, useful for data manipulation, printing and internal referencing by formulas and the like.  Sorting with named blocks is especially important as an aid to check consistency of terminology, records completeness, duplicates checks, missing value checks, and the like. Named blocks can also be used as references in calculations. All of these features are real time savers, especially when datasets grow large and consistency of treatment and terminology is important
  • Multiple sheets and consolidated accesscommON modules can be specified on a single worksheet or multiple worksheets and saved as individual CSV files; because of its size and relative complexity, the Sweet Tools dataset is maintained on multiple sheets. Multi-worksheet environments help keep related data and notes consolidated and more easily managed on local hard drives
  • Completeness and counts - the spreadsheet counta function is useful to sum counts for cell entries by both column and row, a useful aid to indicate if an attribute or type value is missing or if a record is incomplete.  Of course, similar helps and uses can be found for many of the hundreds of embedded functions within a spreadsheet
  • Controlled vocabularies and data entry validation – quality datasets often hinge on consistency and uniform values and terminology; the data validation utilities within spreadsheets can be applied to Boolean, ranges and mins and maxes, and to controlled vocabulary lists. Here is an example for Sweet Tools, enforcing proper tool category assignments from a 50-item pick list:
Controlled Vocabularies and Data Entry Validation
  • Specialized functions and macrosall functionality of spreadsheets may be employed in the development of commON datasets. Then, once employed, only the values embedded within the sheets are then exported as CSV.

Staging Sweet Tools for commON

The next major section of the case study deals with the minor conventions that must be followed in order to stage spreadsheets for commON. Not much of the specific commON vocabulary or notation is discussed below; for details, see [7].

Because you can create multiple worksheets within a spreadsheet, it is not necessary to modifiy existing worksheets or tabs. Rather, if you are reluctant or can not change existing information, merely create parallel duplicate sheets of the source information. These duplicate sheets have as their sole purpose export to commON CSV. You can maintain your spreadsheet as is while staging for commON.

To do so, use the simple = formula to create cross-references between the existing source spreadsheet tab and the target commON CSV export tab. (You can also do this for complete, highlighted blocks from source to target sheet.) Then, by adding the few minor conventions of commON, you have now created a staged export tab without modifying your source information in the slightest.

In standard form and for Excel and Open Office, single quotes, double quotes and commas when entered into a spreadsheet cell are automatically ‘escaped‘ when issued as CSV. commON allows you to specify your own delimiter for lists (the standard is the pipe ‘|’ character) and what the parser recognizes as the ‘escape’ character (’\’ is the standard). However, you probably should not change for most conditions.

The standard commON parsers and converters are UTF-8 compatible. If your source content has unusual encodings, try to target UTF-8 as your canonical spreadsheet output.

In the irON specification there are a small number of defined modules or processing sections. In commON, these modules are denoted by the double-ampersand character sequence (’&&‘), and apply to lists of instance records (&&recordList), dataset specifications and associated metadata describing the dataset (&&dataset), and mappings of attributes and types to existing schema (&&linkage). Similarly, attributes and types are denoted by a single ampersand prefix (&attributeName).

In commON, any or all of the modules can occur within a single CSV file or in multiple files. In any case, the start of one of these processing modules is signaled by the module keyword and &&keyword convention.

The RecordList Module

The first spreadsheet figure above shows a Sweet Tools example for the &&recordList module. The module begins with that keyword, indicating one of more instance records will follow. Note that the first line after the &&recordList keyword is devoted to the listing of attributes and types for the instance records (designated by the &attributeName convention in the columns for the first row after the &&recordList keyword is encountered).

The &&recordList format can also include the stacked style (see similar Dataset example below) in addition to the single row style shown above.

At any rate, once a worksheet is ready with its instance records following the straightforward irON and commON conventions, it can then be saved as a CSV file and appropriately named. Here is an example of what this “vanilla” CSV file now looks like when shown again in a spreadsheet:

Spreadsheet View of the CSV File
(click to expand)

Alternatively, you could open this same file in a text editor. Here is how this exact same instance record view looks in an editor:

Editor View of the CSV Record File
(click to expand)

Note that the CSV format separates each column by the comma separator, with escapes shown for the &description attribute when it includes a comma-separated clause. Without word wrap, each record in this format occupies a single row (though, again, for the stacked style, multiple entries are allowed on individual rows so long as a new instance record &id is not encountered in the first column).

The Dataset Module

The &&dataset module defines the dataset parameters and provides very flexible metadata attributes to describe the dataset [8]. Note the dataset specification is exactly equivalent in form to the instance record (&&recordList) format, and also allows the single row or stacked styles (see these instance record examples), with this one being the stacked style:

The Dataset Module
(click to expand)

The Linkage Module

The &&linkage module is used to map the structure of the instance records to some structural schema, which can also include external ontologies. The module has a simple, but specific structure.

Either attributes (presented as the &attributeList) or types (presented as the &typeList) are listed sequentially by row until the listing is exhausted [8]. By convention, the second column in the listing is the targeted &mapTo value. Absent a prior &prefixList value, the &mapTo value needs to be a full URL to the corresponding attribute or type in some external schema:

The Linkage Module

Notice in the case of Sweet Tools that most values are from the actual COSMO mini-ontology underlying the listing. These need to be listed as well, since absent the specifications in commON the system has NO knowledge of linkages and mappings.

The Schema (structure) Module

In its current state of development, commON does not support a spreadsheet-based means for specifying the schema structure (lightweight ontology) governing the datasets [2]. Another irON serialization, irJSON, does. Either via this irJSON specification or via an offline ontology, a link reference is presently used by commON (and, therefore, Sweet Tools for this case study) to establish the governing structure of the input instance record datasets.

A spreadsheet-based schema structure for commON has been designed and tested in prototype form. commON should be enhanced with this capability in the near future [8].

Saving and Importing

If the modules are spread across more than one worksheet, then each worksheet must be saved as its own CSV file. In the case of Sweet Tools, as exhibited by its reference current spreadsheet, sweet_tools_20091110.xls, three individual CSV files get saved. These files can be named whatever you would like. However, it is essential that the names be remembered for later referencing.

My own naming convention is to use a format of appname_date_modulename.csv because it sorts well in a file manager accommodating multiple versions (dates) and keeps related files clustered. The appname in the case of Sweet Tools is generally swt. The modulename is generally the dataset, records, or linkage convention. I tend to use the date specification in the YYYYMMDD format. Thus, in the case of the records listings for Sweet Tools, its filename could be something like:  swt_20091110_records.csv.

Once saved, these files are now ready to be imported into a structWSF [9] instance, which is where the CSV parsing and conversion to interoperable RDF occurs [8]. In this case study, we used the Drupal-based conStruct SCS system [10]. conStruct exposes the structWSF Web services via a user interface and a user permission and access system. The actual case study write-up offers more details about the import process.

Using the Dataset

We are now ready to interact with the Sweet Tools structured dataset using conStruct (assuming you have a Drupal installation with the conStruct modules) [10].

Introduction to the App

The screen capture below shows a couple of aspects of the system:

  • First, the left hand panel (according to how this specific Drupal install was themed) shows the various tools available to conStruct.  These include (with links to their documentation) Search, Browse, View Record, Import, Export, Datasets, Create Record, Update Record, Delete Record and Settings [11];
  • The Browse tree in the main part of the screen shows the full mini-ontology that classifies Sweet Tools. Via simple inferencing, clicking on any parent link displays all children projects for that category as well (click to expand):
conStruct (Drupal) Browse Screen for Sweet Tools(click to expand)

One of the absolutely cool things about this framework is that all tools, inferencing, user interfaces and data structure are a direct result of the ontology(ies) underlying the system (plus the irON instance ontology, as well). This means that switching datasets or adding datasets causes the entire system structure to now reflect those changes — without lifting a finger!!

Some Sample Uses

Here are a few sample things you can do with these generic tools driven by the Sweet Tools dataset:

Note, if you access this conStruct instance you will do so as a demo user. Unfortunately, as such, you may not be able to see all of the write and update tools, which in this case are reserved for curators or admins. Recall that structWSF has a comprehensive user access and permissions layer.

Exporting in Alternative Formats

Of course, one of the real advantages of the irON and structWSF designs is to enable different formats to be interchanged and to interoperate. Upon submission, the commON format and its datasets can then be exported in these alternate formats and serializations [8]:

  • commON
  • irJSON
  • irXML
  • N-Triples/CSV
  • N-Triples/TSV
  • RDF+N3
  • RDF+XML

As should be obvious, one of the real benefits of the irON notation — in addition to easy dataset authoring — is the ability to more-or-less treat RDF, CSV, XML and JSON as interoperable data formats.

The Formal Case Study

The formal Sweet Tools case study based on commON, with sample download files and PDF, is available from Annex: A commON Case Study using Sweet Tools, Supplementary Documentation [3].


[1] In 2003, Microsoft estimated its worldwide users of the Excel spreadsheet, which then had about a 90% market share globally, at 400 million. Others at that time estimated unauthorized use to perhaps double that amount. There has been significant growth since then, and online spreadsheets such as Google Docs and Zoho have also grown wildly. This surely puts spreadsheet users globally into the 1 billion range.
[2] See Frédérick Giasson and Michael Bergman, eds., Instance Record and Object Notation (irON) Specification, Specification Document, version 0.82, 20 October 2009.  See http://openstructs.org/iron/iron-specification. Also see the irON Web site, Google discussion group, and code distribution site.
[3] Michael Bergman, 2009. Annex: A commON Case Study using Sweet Tools, Supplementary Documentation, prepared by Structured Dynamics LLC, November 10, 2009. See http://openstructs.org/iron/common-swt-annex. It may also be downloaded in PDF .
[4] See Michael K. Bergman’s AI3:::Adaptive Information blog, Sweet Tools (Sem Web). In addition, the commON version of Sweet Tools is available at the conStruct site.
[5] The CSV mime type is defined in Common Format and MIME Type for Comma-Separated Values (CSV) Files [RFC 4180]. A useful overview of the CSV format is provided by The Comma Separated Value (CSV) File Format. Also, see that author’s related CTX reference for a discussion of how schema and structure can be added to the basic CSV framework; see http://www.creativyst.com/Doc/Std/ctx/ctx.htm, especially the section on the comma-delimited version (http://www.creativyst.com/Doc/Std/ctx/ctx.htm#CTC).
[6] An attribute-value system is a basic knowledge representation framework comprising a table with columns designating “attributes” (also known as properties, predicates, features, parameters, dimensions, characteristics or independent variables) and rows designating “objects” (also known as entities, instances, exemplars, elements or dependent variables). Each table cell therefore designates the value (also known as state) of a particular attribute of a particular object. This is the basic table presentation of a spreadsheet or relational data table.

Attribute-values can also be presented as pairs in a form of an associative array, where the first item listed is the attribute, often followed by a separator such as the colon, and then the value. JSON and many simple data struct notations follow this format. This format may also be called attribute-value pairs, key-value pairs, name-value pairs, alists or others. In these cases the “object” is implied, or is introduced as the name of the array..

[7] See especially SUB-PART 3: commON PROFILE in, Frédérick Giasson and Michael Bergman, eds., Instance Record and Object Notation (irON) Specification, Specification Document, version 0.82, 20 October 2009.
[8] As of the date of this case study, some of the processing steps in the commON pipeline are manual. For example, the parser creates an intermediate N3 file that is actually submitted to the structWSF. Within a week or two of publication, these capabilities should be available as a direct import to a structWSF instance. However, there is one exception to this:  the specification for the schema structure. That module has been prototyped, but will not be released with the first commON upgrade. That enhancement is likely a few weeks off from the date of this posting. Please check the irON or structWSF discussion groups for announcements.
[9] structWSF is a platform-independent Web services framework for accessing and exposing structured RDF data, with generic tools driven by underlying data structures. Its central perspective is that of the dataset. Access and user rights are granted around these datasets, making the framework enterprise-ready and designed for collaboration. Since a structWSF layer may be placed over virtually any existing datastore with Web access — including large instance record stores in existing relational databases — it is also a framework for Web-wide deployments and interoperability.
[10] conStruct SCS is a structured content system built on the Drupal content management framework. conStruct enables structured data and its controlling vocabularies (ontologies) to drive applications and user interfaces. It is based on RDF and SD’s structWSF platform-independent Web services framework [6]. In addition to user access control and management and a general user interface, conStruct provides Drupal-level CRUD, data display templating, faceted browsing, full-text search, and import and export over structured data stores based on RDF.
[11] More Web services are being added to structWSF on a fairly constant basis, and the existng ones have been through a number of upgrades.

Posted on November 11, 2009 at 9:19 pm in Adaptive Information, Ontologies, Semantic Web, Semantic Web Tools, Structured Dynamics, Structured Web, irON | Comments (4)
The URI link reference to this post is: http://www.mkbergman.com/845/a-most-un-common-way-to-author-datasets/
The URI to trackback this post is: http://www.mkbergman.com/845/a-most-un-common-way-to-author-datasets/trackback/
Date:   October 18, 2009

instance record and Object Notation

New Cross-Scripting Frameworks for XML, JSON and Spreadsheets

On behalf of Structured Dynamics, I am pleased to announce our release into the open source community of irON — the instance record and Object Notation — and its family of frameworks and tools [1]. With irON, you can now author and conduct business solely in the formats and tools most familiar and comfortable to you, all the while enabling your data to interact with the semantic Web.

irON is an abstract notation and associated vocabulary for specifying RDF triples and schema in non-RDF forms. Its purpose is to allow users and tools in non-RDF formats to stage interoperable datasets using RDF. The notation supports writing RDF and schema in JSON (irJSON), XML (irXML) and comma-delimited (CSV) formats (commON).

The surprising thing about irON is that — by following its simple conventions and vocabulary — you will be authoring and creating interoperable RDF datasets without doing much different than your normal practice.

This first specification for the irON notation includes guidance for creating instance records (including in bulk), linkages to existing ontologies and schema, and schema definitions. In this newly published irON specificatiion, profiles and examples are also provided for each of the irXML, irJSON and commON serializations. The irON release also includes a number of parsers and converters of the specification into RDF [2]. Data ingested in the irON frameworks can also be exported as RDF and staged as linked data.

UPDATE: Fred Giasson announced on his blog today (10/20) the release of the irJSON and commON parsers.

Background and Rationale

The objective of irON is to make it easy for data owners to author, read and publish data. This means the starting format should be a human readable, easily writable means for authoring and conveying instance records (that is, instances and their attributes and assigned values) and the datasets that contain them. Among other things, this means that irON’s notation does not use RDF “triples“, but rather the native notations of the host serializations.

irON is premised on these considerations and observations:

  • RDF (Resource Description Framework) is a powerful canonical data model for data interoperability [3]
  • However, most existing data is not written in RDF and many authors and publishers prefer other formats for various reasons
  • Many formats that are easier to author and read than RDF are variants of the attribute-value pair construct [4], which can readily be expressed as RDF, and
  • A common abstract notation for converting to RDF would also enable non-RDF formats to become somewhat interchangeable, thus allowing the strengths of each to be combined.

The irON notation and vocabulary is designed to allow the conceptual structure (”schema”) of datasets to be described, to facilitate easy description of the instance records that populate those datasets, and to link different structures for different schema to one another. In these manners, more-or-less complete RDF data structures and instances can be described in alternate formats and be made interoperable. irON provides a simple and naïve information exchange notation expressive enough to describe most any data entity.

The notation also provides a framework for extending existing schema. This means that irON and its three serializations can represent many existing, common data formats and standards, while also providing a vehicle for extending them. Another intent of the specification is to be sparse in terms of requirements. For instance, this reserved vocabulary is fairly minimal and optional in most all cases. The irON specification supports skeletal submissions.

irON Concepts and Vocabulary

The aim of irON is to describe instance records. An instance record is simply a means to represent and convey the information (”attributes”) describing a given instance. An instance is the thing at hand, and need not represent an individual; it could, for example, represent the entire holdings or collection of books in a given library. Such instance records are also known as the ABox [5]. The simple design of irON is in keeping with the limited roles and work associated with this ABox role.

Attributes provide descriptive characteristics for each instance. Every attribute is matched with a value, which can range from descriptive text strings to lists or numeric values. This design is in keeping with simple attribute-value pairs where, in using the terminology of RDF triples, the subject is the instance itself, the predicate is the attribute, and the object is the value. irON has a vocabulary of about 40 reserved attribute terms, though only two are ever required, with a few others strongly recommended for interoperability and interface rendering purposes.

A dataset is an aggregation of instance records used to keep a reference between the instance records and their source (provenance). It is also the container for transmitting those records and providing any metadata descriptions desired. A dataset can be split into multiple dataset slices. Each slice is written to a file serialized in some way. Each slice of a dataset shares the same <id> of the dataset.

Instances can also be assigned to types, which provide the set or classificatory structure for how to relate certain kinds of things (instances) to other kinds of things. The organizational relationships of these types and attributes is described in a schema. irON also has conventions and notations for describing the linkage of attributes and types in a given dataset to existing schema. These linkages are often mapped to established ontologies.

Each of these irON concepts of records, attributes, types, datasets, schema and linkages share similar notations with keywords signaling to the irON parsers and converters how to interpret incoming files and data. There are also provisions for metadata, name spaces, and local and global references.

In these manners, irON and its three serializations can capture virtually the entire scope and power of RDF as a data model, but with simpler and familiar terminology and constructs expected for each serialization.

The Three Serializations

For different reasons and for different audiences, the formats of XML, JSON and CSV (spreadsheets) were chosen as the representative formats across which to formulate the abstract irON notation.

XML, or eXtensible Markup Language, has become the leading data exchange format and syntax for modern applications. It is frequently adopted by industry groups for standards and standard exchange formats. There is a rich diversity of tools that support the language, importantly including capable parsers and query languages. There is also a serialization of RDF in XML. As implemented in the irON notation, we call this serialization irXML.

JSON, the JavaScript Object Notation, has become very popular as a Web 2.0 data exchange format and is often the format of choice to drive JavaScript applications. There is a growing richness of tools that support JSON, including support from leading Web and general scripting languages such as JavaScript, Python, Perl, Ruby and PHP. JSON is relatively easy to read, and is also now growing in popularity with lightweight databases, such as CouchDB. As implemented in the irON notation, we call this serialization irJSON.

CSV, or comma-separated values, is a format that has been in existence for decades. It was made famous by Microsoft as a spreadsheet exchange format, which makes CSV very useful since spreadsheets are the most prevalent data authoring environment in existence. CSV is less expressive and capable as a data format than the other irON serializations, yet still has a attribute-value pair orientation. And, via spreadsheets, datasets can be easily authored and inspected, while also providing a rich functional environment including sorting, formatting, data validation, calculations, macros, etc. As implemented in the irON notation, we call this serialization commON.

The following diagram shows how these three formats relate to irON and then the canonical RDF target data model:

Data transformations path

We have used the unique differences amongst XML, JSON and CSV to guide the embracing abstract notations within irON. Note the round-tripping implications of the framework.

One exciting prospect for the design is how, merely by following the simple conventions within irON, each of these three data formats — and RDF !! — can be used more-or-less interchangeably, and can be used to extend existing schema within their domains.

Links, References and More

This first release of irON is in version 0.8. Updates and revisions are likely with use. Here are some key links for irON:

Mid-week, the parsers and converters for structWSF [6] will be released and announced on Fred Giasson’s blog.

In addition, within the next week we will be publishing a case study of converting the Sweet Tools semantic Web and -related tools dataset to commON.

The irON specification and notation by Structured Dynamics LLC is licensed under a Creative Commons Attribution-Share Alike 3.0. irON’s parsers or converters are available under the Apache License, Version 2.0.

Editors’ Notes

irON is an important piece in the semantic enterprise puzzle that we are building at Structured Dynamics. It reflects our belief that knowledge workers should be able to author and create interoperable datasets without having to learn the arcana of RDF. At the same time we also believe that RDF is the appropriate data model for interoperability. irOn is an expression of our belief that many data formats have appropriate places and uses; there is no need to insist on a single format.

We would like to thank Dr. Jim Pitman for his advocacy of the importance of human-readable and easily authored datasets and formats. Via his leadership of the Bibliographic Knowledge Network (BKN) project and our contractual relationship with it [7], we have learned much regarding the BKN’s own format, BibJSON. Experience with this format has been a catalytic influence in our own work on irON.

Mike Bergman and Fred Giasson, editors


[1] Please see here for how irON fits within Structured Dynamics’ vision and family of products.
[2] Presently parsers and converters are available for the irJSON and commON serializations, and will be released this week. We have tentatively spec’ed the irXML converter, and would welcome working with another party to finalize a converter. Absent an immediate contribution from a third party, contractual work will likely result in our completing the irXML converter within the reasonable future.
[3] A pivotal premise of irON is the desirability of using the RDF data model as the canonical basis for interoperable data. RDF provides a data model capable of representing any extant data structure and any extant data format. This flexibility makes RDF a perfect data model for federating across disparate data sources. For a detailed discussion of RDF, see Michael K. Bergman, 2009. “Advantages and Myths of RDF,” in AI3 blog, April 8, 2009. See http://www.mkbergman.com/483/advantages-and-myths-of-rdf/.
[4] An attribute-value system is a basic knowledge representation framework comprising a table with columns designating “attributes” (also known as properties, predicates, features, parameters, dimensions, characteristics or independent variables) and rows designating “objects” (also known as entities, instances, exemplars, elements or dependent variables). Each table cell therefore designates the value (also known as state) of a particular attribute of a particular object. This is the basic table presentation of a spreadsheet or relational data table.

Attribute-values can also be presented as pairs in the form of an associative array, where the first item listed is the attribute, often followed by a separator such as the colon, and then the value. JSON and many simple data struct notations follow this format. This format may also be called attribute-value pairs, key-value pairs, name-value pairs, alists or others. In these cases the “object” is implied, or is introduced as the name of the array.

[5]We use the reference to the “ABox” and “TBox” in accordance with this working definition for description logics:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[6] structWSF is a platform-independent Web services framework for accessing and exposing structured RDF data, with generic tools driven by underlying data structures. Its central perspective is that of the dataset. Access and user rights are granted around these datasets, making the framework enterprise-ready and designed for collaboration. Since a structWSF layer may be placed over virtually any existing datastore with Web access — including large instance record stores in existing relational databases — it is also a framework for Web-wide deployments and interoperability.
[7] BKN is a project to develop a suite of tools and services to encourage formation of virtual organizations in scientific communities of various types. BKN is a project started in September 2008 with funding by the NSF Cyber-enabled Discovery and Innovation (CDI) Program (Award # 0835851). The major participating organizations are the American Institute of Mathematics (AIM), Harvard University, Stanford University and the University of California, Berkeley.

Posted on October 18, 2009 at 7:12 pm in Adaptive Information, Bibliographic Knowledge Network, Open Source, Semantic Web Tools, Structured Dynamics, irON | Comments (3)
The URI link reference to this post is: http://www.mkbergman.com/838/iron-semantic-web-for-mere-mortals/
The URI to trackback this post is: http://www.mkbergman.com/838/iron-semantic-web-for-mere-mortals/trackback/
Date:   September 2, 2009

Segmented UMBEL (Upper Mapping and Binding Exchange Layer)

The Significant Advantages to a Logically Segmented TBox

The Message Understanding Conferences (MUC) were initiated in 1987 and financed by DARPA to encourage the development of new and better methods of information extraction (IE). It was a seminal series that resulted in basic measures of retrieval and semantic efficacy, recall (R) and precision (P) and the combined F-measure, and other core terminology and constructs used by IE today.

By the sixth version in the series (MUC-6), in 1995, the task of recognition of named entities and coreference was added. That initial slate of named entities included the basic building blocks of person (PER), location (LOC), and organization (ORG); to these were added the numeric building blocks of time, percentage or quantity. The very terminology of named entity was coined for this seminal meeting, as was the idea of inline markup [1].

What is a ‘Nameable Thing’?

The intuition surrounding “named entity” and nameable “things” was that they were discrete and disjoint. A rock is not a person and is not a chemical or an event. As initially used, all “named entities” were distinct individuals. But, there also emerged the understanding that some classes of things could also be treated as more-or-less distinct nameable “things”: beetles are not the same as frogs and are not the same as rocks. While some of these “things” might be a true individual with a discrete name, such as Kermit the Frog, or The Rock at Northwestern University, most instances of such things are unnamed.

The “nameability” (or logical categorization) of things is perhaps best kept separate from other epistemological issues of distinguishing sets, collections, or classes from individuals, members or instances.

In a closed-world system it is easier to enforce clean distinctions. The Cyc knowledge base, for example, the basis for UMBEL (Upper Mapping and Binding Exchange Layer),  makes clear the distinction between individuals and collections. In the semantic Web and RDF, this can become smeared a bit with the favored terminology shifting to instances and classes, and in pragmatic, real-world terms we (as humans) readily distinguish John Smith as distinct from Jane Doe but don’t generally (unless we’re entomologists!) make such distinctions for individual beetles, let alone entire genera or species of beetles.

Under precise conditions, these distinctions are important. The fact that Cyc, for example, is assiduous in its application of these distinctions is a major reason for the overall coherence of its knowledge base. But, for most circumstances, we think it is OK to accept a distinction between “nameable” things such as frogs and beetles, but also to accept that there may be nameable individuals at times in those groupings such as Kermit that are truly an individual in that more refined sense.

This digression sets the background for a natural progression from that first MUC-6 conference. If we could cluster persons or organizations, why not other categories of distinct and disjoint things such as frogs or beetles or rocks?

From the first six entity categories of MUC-6 we begin to see an expansion to broader coverage. Readers of this blog will recall that I have been a fan for quite some time of the expanded coverage of 64 classes of entities proposed by BBN or the 200 proposed by Sekine [2] (as discussed, for example in the April 2008 Subject Concepts and Named Entities article). Again, the intuition was that real things in the real world could be logically categorized into discrete and disjoint categories.

Thus, “named entities” inexorably moved to become a categorization system, where the degree of familiarity and distinction dictated whether it was the individual (with a unique name, such as Abraham Lincoln or Mt. Rushmore) or groupings such as animal or plant species and their common names (such as beetle or oak) that was the standard “handle” for assigning a name to the “nameable thing”.

While many can argue these individual <–> grouping distinctions and whether we are talking about true, unique, named individuals or names of convenience, I think that (at least for this blog post and discussion), that misses the real, fundamental point.

The real, fundamental point is that some “things” (whether individuals, instances or classes) are distinct from other “things”. Such disjoint distinctions are a powerful concept that should not be lost sight of by “angels dancing on the head of a pin” epistemological arguments. A frog is not a rock, despite neither are “individuals”, and how can we take advantage of that realilty?

What Works for Entities, Works for Concepts

Nearly from the outset of our work with UMBEL as a ‘TBox’ [3] — that is, as a set of 20,000 or so common “subject concepts” — the natural question was what the relation or correspondence was of these concepts to the underlying “things” (entities) that they organized. As we probed the disjoint categories within the Sekine 200 entity types, for example, we began to see significant parallels and overlap. Also gnawing at our sense of order was the rather artificial and arbitrary class of concepts in UMBEL that we termed “Abstract Concepts”.

We introduced Abstract Concepts in the first release of UMBEL. When introduced, we defined “Abstract concepts [as] representing abstract or ephemeral notions such as truth, beauty, evil or justice, or [as] thought constructs useful to organizing or categorizing things but are not readily seen in the experiential world.” In pragmatic terms, Abstract Concepts in UMBEL were often pivotal nodes in the UMBEL subject graph necessary to maintain a high degree of concept interconnectivity.

In any world view that attempts to be more-or-less comprehensive, there is a gradation of concepts from the concrete and observable to the abstract and ephemeral. The recognition that some of these concepts may be more abstract, then, was not the issue. The issue was that there was no definable basis for segregating a concrete Subject Concept from the more Abstract Concept. Where was the bright line? What was the actionable distinction?

Off and on we have probed this question for more than a year, and have looked at what might constitute a more natural and logical ordering and segmentation within UMBEL. After many tests and detailed analysis, we are now releasing the first results of our investigations.

For, like nameable entities or things, we can see a logical segmentation of (mostly) disjoint concepts within the UMBEL TBox. Here are the summary percentages of these high-level splits:

Disjoint Concepts 90%
Attributes 1%
Classifications 9%
TOTAL 100%

(Because the analysis is still being refined, exact counts and percentages for the 20,000 concepts in UMBEL are not provided.)

Why a Logical Segmentation?

As we dove deeper into these ideas, not only could we see the basis for a logical segmentation within UMBEL’s concepts, but manifest benefits from doing so as well. Remember that UMBEL’s concept structure performs two main roles. It:  1) provides a coherent framework for relating and “mapping” other external ontologies; and 2) provides conceptual binding points for organizing entities and instances [4]. Via logical segmentation, we get benefits for both roles.

Here are some of the broad areas of benefit from a logical UMBEL segmentation that we have identified:

  • Template-driven — as we discuss elsewhere, Structured Dynamics also uses its ontologies to “drive applications” and the user interfaces (UI) that support them. By proper segmentation of UMBEL concepts, we are able to determine to what “cluster” of things (which we call either dimensions or superTypes; see below) a given thing belongs. This identification means we can also determine how best to display information about that “thing”. This determination can include either the attributes or the display templates appropriate for that thing. For example, location-based things or time-based things might invoke map or calendar or timeline type displays. Moreover, because of the logical segmentation of concepts, we can also use the power of the concept graph to infer more generic display templates when specific matches are absent
  • Computational Efficiency — as the percentages above indicate, once we identify what superType concept to which a given instance belongs, we can eliminate nearly all remaining UMBEL concepts from consideration. This logical winnowing leads to computational efficiencies at all levels in the system. The fastest computational work is not to do it, and when large chunks of data are removed from consideration, many performance advantages accrue
  • Disambiguation — via this approach we now can assess concept matches in addition to entity matches. This means we can triangulate between the two assessments to aid disambiguation. Because of these logical segmentations, we also have multiple “clusters” (that is, either the concept, type, superType or dimension) upon which to do our disambiguation evaluations, either between concepts and entities or within the various concept clusters. We can do so via either multiple semantic vectors (for statistical-based methods) or multiple features (for machine learning methods). In other words, because of logical segmentation, we have increased the informational power of our concept graph
  • Structure and Integrity Testing — the very mindset of looking for logical segmentation has led to much learning about the UMBEL structure and OpenCyc upon which it is based. In the process, missing nodes (concepts), erroneous assignments, and superfluous nodes are all being discovered. Further, many of these tests can be automated using basic logical and inference approaches. The net result is a constant improvement to the scope and completeness of the structure. Lastly, these same approaches can be applied when mapping external ontologies to UMBEL, providing similar consistency benefits.

With these benefits in mind, we have undertaken concerted analysis of UMBEL to discern what this “logical segmentation” might be. This investigation has occurred over three concentrated periods over the past year. (Intervening priorities or other work prevented concentrating solely on this task.)

We are now complete with our first full iteraton of investigation. In this post, and then the subsequent release of UMBEL version 0.80 in the coming weeks, the fruits of this effort should be evident. However, it should also be noted that we are still learning much from this new mindset and approach. UMBEL structure refinement may be likely for some time to come.

UMBEL Analysis

Most things and concepts about them are based on real, observable, physical things in the real world. Because most of these things can not occupy both the same moment in time and the same location in physical space, a useful criterion for looking at these things and concepts is disjointedness.

In a broad sense, then, we can split our concepts of the world between those ideas that are disjoint because they pertain to separable objects or ideas and those that are cross-cutting or organizational or classificatory. Attributes, such as color (pink, for example), are often cross-cutting in that they can be used to describe quite disparate things. Inherent classification schemes such as academic fields of study or library catalog systems — while useful ways to organize the world — are not themselves in-and-of the world or discrete from other ideas. Thus, classificatory or organizational concepts are inherently not disjoint.

With the criterion of disjointedness in hand, then, we began an evaluation process of the UMBEL subject concepts. We looked to organizational schema such as the entity types of Sekine or BBN for some starting guidance. We also kept in mind that we also wanted our categories to inform logical clusterings of possible data presentation, such as media types or locations or time.

For terminology, we adopted the term superType to denote the largest cluster designation upon which this disjointedness may occur. As a way to test the basic coherence of these superTypes, we also collected them into larger groups which we termed dimensions.

Our analysis process began with branch-by-branch testing of the UMBEL concept graph using automated scripts, attempting to find pivotal nodes where child instance members were disjoint from other superTypes. This we term the “top-down” method.

This automated analysis was then supplemented with a complete manual inspection of all unassigned and assigned concepts, with a “bottom up” assignment of concepts or corrections to the automated approach. This inspection then led to new insights and identification of missing concepts that needed to be added into UMBEL.

We are still converging between these two methods. Optimally, we should be able to tease out all UMBEL superTypes with a relatively few number of union, intersection, or complement set operations. In its current form, we are close, but there are still some rough spots.

Nonetheless, this analysis method has led us to identify some 33 superTypes [5], clustered into 9 dimensions. Of these, 29 superTypes and 8 dimensions are mostly disjoint. The one dimension of Classificatory includes the four cross-cutting superTypes of attributes and organizational schema that can apply to any of the 29 disjoint superTypes.

UMBEL superTypes

Here is the schema, with the descriptions of each:

Dimension superType Description/Sub-types
Natural World Natural Phenomena This superType includes natural phenomena and natural processes such as weather, weathering, erosion, fires, lightning, earthquakes, tectonics, etc. Clouds and weather processes are specifically included. Also includes climate cycles, general natural events (such as hurricanes) that are not specifically named, and biochemical processes and pathways.
Natural Substances Notable inclusions are minerals, compounds, chemicals, or physical objects that are not the outcome of purposeful human effort, but are found naturally occurring. Other natural objects (such as rock, fossil, etc.) are also found under this superType.
Earthscape The Earthscape superType consists mostly of the collection of cartographic features that occur on the surface of the Earth. Positive examples include Mountain, Ocean, and Mesa. Artificial features such as canals are excluded. Most instances of these features have a fixed location in space.

Underground and underwater are also explicitly contained.

This superType is explicitly disjoint with Extraterrestrial (see below).

Extraterrestrial This superType includes all natural things not specifically terrestrial, including celestial bodies (planets, asteroids, stars, galaxies, etc., that can be located within a sky map)
Living Things Prokaryotes The Prokaryotes include all prokaryotic organisms, including the Monera, Archaebacteria, Bacteria, and Blue-green algas. Also included in this superType are viruses and prions.
Protists or Fungus This is the remaining cluster of eukaryotic organisms, specifically including the fungus and the protista (protozoans and slime molds).
Plants This superType includes all plant types and flora, including flowering plants, algae, non-flowering plants, gymnosperms, cycads, and plant parts and body types. Note that all Plant Parts are also included.
Animals This large superType includes all animal types, including specific animal types and vertebrates, invertebrates, insects, crustaceans, fish, reptiles, amphibia, birds, mammals, and animal body parts. Animal parts are specifically included. Also, groupings of such animals are included. Humans, as an animal, are included (versus as an individual Person). Diseases are specifically excluded.
Diseases Diseases are atypical or unusual or unhealthy conditions for (mostly human) living things, generally known as conditions, disorders, infections, diseases or syndromes. Diseases only affect living things and sometimes are caused by living things. This superType also includes impairments, disease vectors, wounds and injuries, and poisoning
Person Types The appropriate superType for all named, individual human beings. This superType also includes the assignment of formal, honorific or cultural titles given to specific human individuals. It further includes names given to humans who conduct specific jobs or activities (the latter case is known as an avocation). Examples include steelworker, waitress, lawyer, plumber, artisan. Ethnic groups are specifically included.
Human Activities Organizations Organization is a broad superType and includes formal collections of humans, sometimes by legal means, charter, agreement or some mode of formal understanding. Examples include geopolitical entities such as nations, municipalities or countries; or companies, institutes, governments, universities, militaries, political parties, game groups, international organizations, trade associations, etc. All institutions, for example, are organizations.

Also included are informal collections of humans. Informal or less defined groupings of humans may result from ethnicity or tribes or nationality or from shared interests (such as social networks or mailing lists) or expertise (”communities of practice”). This dimension also includes the notion of identifiable human groups with set members at any given point in time. Examples include music groups, cast members of a play, directors on a corporate Board, TV show members, gangs, mobs, juries, generations, minorities, etc.

Finally, Organizations contain the concepts of Industries and Programs and Communities.

Finance & Economy This superType pertains to all things financial and with respect to the economy, including chartable company performance, stock index entities, money, local currencies, taxes, incomes, accounts and accounting, mortgages and property.
Culture, Issues, Beliefs This category includes concepts related to political systems, laws, rules or cultural mores governing societal or community behavior, or doctrinal, faith or religious bases or entities (such as gods, angels, totems) governing spiritual human matters. Culture, Issues, beliefs and various activisms (most -isms) are included
Activities These are ongoing activities that result (mostly) from human effort, often conducted by organizations to assist other organizations or individuals (in which case they are known as services, such as medicine, law, printing, consulting or teaching) or individual or group efforts for leisure, fun, sports, games or personal interests (activities)
Human Works Products This is the largest superType and includes any instance offered for sale or performed as a commercial service. Often physical object made by humans that is not a conceptual work or a facility, such as vehicles, cars, trains, aircraft, spaceships, ships, foods, beverages, clothes, drugs, weapons. Products also include the concept of ’state’ (e/g/., on/off)
Food or Drink This superType is any edible substance grown, made or harvested by humans. The category also specifically includes the concept of cuisines
Drugs This superType is an drug, medication or addictive substance
Facilities Facilities are physical places or buildings constructed by humans, such as schools, public institutions, markets, museums, amusement parks, worship places, stations, airports, ports, carstops, lines, railroads, roads, waterways, tunnels, bridges, parks, sport facilities, monuments. All can be geospatially located.

Facilities also include animal pens and enclosures and general human “activity” areas (golf course, archeology sites, etc.). Importantly, Facilities include infrastructure systems such as roadways and physical networks.

Facilities also include the component parts that go into making them (such as foundations, doors, windows, roofs, etc.)

Information Chemistry (n.o.c) This superType is a residual category (n.o.c., not otherwise categorized) for chemical bonds, chemical composition groupings, and the like. It is formed by what is not a natural substance or living thing (organic) substance.
Audio Info This superType is for any audio-only human work. Examples include live music performances, record albums, or radio shows or individual radio broadcasts
Visual Info This superType includes any still image or picture or streaming video human work, with or without audio. Examples include graphics, pictures, movies, TV shows, individual shows from a TV show, etc.
Written Info This superType includes any general material written by humans including books, blogs, articles, manuscripts, but any written information conveyed via text.
Structured Info This information superType is for all kinds of structured information and datasets, including computer programs, databases, files, Web pages and structured data that can be presented in tabular form
Notations & References Akin to conceptual works, these are codified means of human expression. Examples range from human languages themselves, to more domain-specific cases such as chemical symbols, genetic code (A-G-C-T), protocols, and computer languages, mathematical and set notations, etc.

Identifiers (numeric or alphanumeric identifiers for objects, often in a highly patterned way, such as phone numbers, URLs, zip and postal codes, SKUs, product codes, etc.), Units (any of the various ways in which measurement, space, volume, weight, speed, intensity, temperature, calories, siesmic intensity or other quantitative descriptions of phenomena can be made) and key reference types are also included in this superType

Numbers This unique superType is for any abstract representation of numbers and numerics
Human Places Geopolitical Named places that have some informal or formal political (authorized) component. Important subcollections include Country, IndependentCountry, State_Geopolitical, City, and Province.
Workplaces, etc. These are various workplaces and areas of human activities, ranging from single person workstations to large aggregations of people (but which are not formal political entities)
Time-related Events These are nameable occasions, games, sports events, conferences, natural phenomena, natural disasters, wars, incidents, anniversaries, holidays, or notable moments or periods in time
Time This superType is for specific time or date or period (such as eras, or days, weeks, months type intervals) references in various formats
Descriptive Attributes This general superType category is for descriptive attributes of all kinds. Think of the specific attributes in Wikipedia “infoboxes” to understand the purpose and coverage of this superType. It includes colors, shapes, sizes, or other descriptive characteristics about an object
Classificatory Abstract-level This general superType category is largely composed of former AbstractConcepts, and represent some of the more abstract upper-level nodes for connecting the UMBEL structure together. This superType also includes theories or processes or methods for humans to do stuff or any human technology
Topics/Categories This largely subject-oriented superType is a means for using controlled vocabularies and classification schemes for characterizing what content “is about”. The key constituents of this category are Types, Classifications, Concepts, Topics, and controlled vocabularies
Markets & Industries This superType is a specialized classificatory system for markets and industries. It could be combined with the superType above, but is kept separate in order to provide a separate, economy-oriented system.

These may undergo some further refinement prior to release of UMBEL v 0.80, and some of the definitions will be tightened up.

(Note: It should also be mentioned that some of these superTypes further lend themselves to further splits and analysis. The Product superType, for example, is ripe for such treatment.)

Distribution of superTypes

The following diagram shows the distribution of these 20,000 UMBEL concepts across major area. By far the largest superType is Products, even with further splits into Food and Drinks and Pharmaceuticals. The next largest categories are Person and Places and Events superTypes, with Organizations and Animals not far behind:

# of superTypes by Category

Even in its generic state, UMBEL provides a very rich vocabulary for describing things or for tying in more detailed external ontologies. There are nearly 5,000 concepts across products of all types, for example.

Possible Overlaps (non-disjoint) between superTypes

You may recall that our analysis showed 29 of the superTypes to be “mostly disjoint.”  This is because there are some concepts — say, MusicPerformingAgent — that can apply to either a person or a group (band or orchestra, for example). Thus, for this concept alone, we have a bit of overlap between the normally disjoint Person and Organization superTypes.

The following shows the resulting interaction matrix where there may be some overlap between superTypes:

Instance superTypes Overlap

This kind of interaction diagram is also useful for further analyzing the concept graph structure, as well.

Even Where Overlaps Occur, They are Minor

Of the 29 “mostly” disjoint superTypes, only a relatively few show potential interactions, and then only in minor ways. We can illustrate this (drawn to scale) for the interaction between the Product, Food & Drink and Drug (Pharmaceuticals) superTypes, with the fully disjoint Organization superType thrown in for comparison:

Example superTypes Overlap

Across all 20,000 concepts, then, fully 85% are disjoint from one another (5% is lost due to overlaps between “mostly” disjoint superTypes). This is a surprising high percentage, with even better likelihood to deliver the benefits previously noted.

Interim Conclusions and Observations

These are exciting findings that bode well for UMBEL’s ongoing role and usefulness. Also, the very detailed analysis that has led to these interim findings very much reaffirms the wisdom of basing UMBEL on Cyc.  Cyc showed itself to be admirably coherent and remarkably complete. (It also appears that the first versions of UMBEL were also extracted well in terms of good coverage.)

This approach now gives us an understandable and defensible basis for logical segementation of UMBEL. It also provides a much-desired alternative to the earlier Abstract Concepts, which will now be dropped entirely as a schema concept.

One area deserving further attention is in the Attribute superType. We are in the process, for example, of analyzing attributes across Wikipedia and need to look through a slightly different lens at this superType [6]. This area is further important in its strong interaction with the Instance Record Vocabulary that is accompanying this effort on the entity side.

Another lesson for us has been to back away from the terminology of named entity, introduced at MUC-6. The expansions of that idea into other “nameable” things has caused us to embrace the “instance” nomenclature, as evidenced by our emerging IRV.

It is rewarding to prepare this next iteration release of UMBEL with its new mindset of logical segmentation and disjointedness. But — what is also clear — there are many treasures left to mine still hidden in the inherent structure of UMBEL and its Cyc parent.


[1] The original labels were ENAMEX for entity named expression and NUMEX for numeric expression. The markup format specified was also SGML. For an interesting history of this MUC-6 watershed, see Ralph Grishman and Beth Sundheim, 1996. Message Understanding Conference – 6: A Brief History, in Proceedings of the 16th International Conference on Computational Linguistics (COLING), I, Kopenhagen, 1996, 466–471.
[2] In a named entity, the word named applies to entities that have a “rigid designators” as defined by Kripke for the referent. For instance, the automotive company created by Henry Ford in 1903 is referred to as Ford or Ford Motor Company. Rigid designators include proper names as well as certain natural kind of terms like biological species and substances.

Sekine’s extended hierarchy proposed in 2002 is made up of 200 subtypes, with 32 larger clusters within that. Here is the top level of the Sekine type system:

Name-Other Title Timex Frequency
Person Unit Periodx Rank
Organization Vocation Numex-Other Age
Location Disease Money School Age
Facility God Stock Index Latitude Longitude
Product ID Number Point Measurement
Event Color Percent Countx
Natural Object Time-Other Multiplication Ordinal Number

Though developed separately and for different purposes, BBN categories also proposed in 2002 consists of 29 types and 64 subtypes. Here are the BBN types (Note: BBN claims 29 types because there are double entries or considerations for the first five entries):

Person Time Animal
NORP (adjectival GPEs) Percent Substance
Facility Money Disease
Organization Quantity Work of Art
GPE (geopolitical places) Ordinal Law
Location Cardinal Language
Product Events Contact Info
Date Plant Game

Of course, other entity extraction systems have similar clusterings and approaches. Though less formal in the sense of a hierarchy or purported complete entity coverage, here for example is the listing of entity types within Calais:

Anniversary FaxNumber NaturalFeature RadioProgram
City Holiday OperatingSystem RadioStation
Company IndustryTerm Organization Region
Continent MarketIndex Person SportsEvent
Country MedicalCondition PhoneNumber SportsGame
Currency Movie Position SportsLeague
EmailAddress MusicAlbum Product Technology
EntertainmentAwardEvent MusicGroup ProgrammingLanguage TVShow
Facility NaturalDisaster ProvinceOrState TVStation
PublishedMedium URL

See further the Wikipedia entry on named entity recognition.

[3] We use the reference to “TBox” in accordance with our working definition for description logics:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[4] UMBEL also provides a SKOS-based vocabulary extension for describing other domains and mappings between classes and instances. This purpose, however, is outside of the scope of this current article.
[5] As a reference roadmap, UMBEL was specifically designed not to include meronymous (part of) relationships (see further this reference). Thus, all “part of” type concepts were assigned to the whole superType category for which they are a part. Thus, “animal parts” are assigned to the superType Animal; “car parts” to the superType Product.
[6] For a general discussion of attributes and their relation to entities, see Satoshi Sekine, 2008. Extended Named Entity Ontology with Attribute Information, in Proceedings of the 6th edition of the Language Resources and Evaluation Conference (LREC 2008). Marrakech, Morocco. See http://www.lrec-conf.org/proceedings/lrec2008/pdf/21_paper.pdf.

Posted on September 2, 2009 at 4:23 pm in Adaptive Information, Ontologies, Semantic Web, Structured Dynamics, Structured Web, UMBEL Comments Off
The URI link reference to this post is: http://www.mkbergman.com/759/supertypes-and-logical-segmentation-of-instances/
The URI to trackback this post is: http://www.mkbergman.com/759/supertypes-and-logical-segmentation-of-instances/trackback/
Page 1 of 1812345»...Last »
Copyright © 2004–2010 Michael K. Bergman.   This work is licensed under a Creative Commons License