Posted:April 4, 2006

Full Report: Why Are $800 Billion in Document Assets Wasted Annually?

Author's Note: An earlier blog series by me has now been turned into a PDF white paper under the auspices of BrightPlanet Corp The citation for this effort is:

M.K. Bergman, "Why Are $800 Billion in Document Assets Wasted Annually?” BrightPlanet Corporation White Paper, April 2006, 27 pp.

Click here to obtain a PDF copy of this full report (27 pp, 203 KB)

It is a tragedy of no small import when $800 billion in readily available savings from creating, using and sharing documents is wasted in the United States each year. How can waste of such magnitude occur right before our noses? And how can this waste occur so silently, so insidiously, and so ubiquitously that none of us can see it?

This free white paper attempts to address these questions. This report is the result of a series of posts in response to an earlier white paper I authored under BrightPlanet sponsorship entitled, Untapped Assets: The $3 Trillion Value of U.S. Enterprise Documents. [1]

This full report intetgrates information from earlier blog postings:

Public and enterprise expenditures to address the wasted document assets problem remain comparatively small, with growth in those expenditures flat in comparison to the rate of document production. This report attempts to bring attention and focus to the various ways that technology, people, and process can bring real document savings to our collective pocketbooks.

[1] Michael K. Bergman, "Untapped Assets: The $3 Trillion Value of U.S. Enterprise Documents," BrightPlanet Corporation White Paper, July 2005, 42 pp. The paper contains 80 references, 150 citations, and many data tables.

Posted:April 3, 2006

Social Bookmarking Sites Poised for Shakeouts, Maturation

One way to look at 40 sites trying to achieve Web 2.0 is that each site only contributes Web 0.05.

There’s alot of stuff going on with Web 2.0 in “social” computing, some with implications about my own primary interests in the semantic Web. Indeed, though all of us can link to Wikipedia for definitions, I doubt other than first checking out that source that most of us would agree with what Wikipedia defines as Web 2.0 . That’s OK.

Nonetheless, we can see there IS something going on in the nexus of new interoperable Web standards with collaboration and application frameworks specifically geared to shared experiences and information. I think we can all agree that Web 2.0 is meant to achieve that, and that “social bookmarking” is one of the foundational facets of the phenomenon.

Like other things that take you in tangents while pursuing research stuff over a weekend, I’m actually not sure what got me trying to track down and understand “social bookmarking.” But tracking it down I did, and this post is the result of my cruising through the byways of Web 2.0 driving a “social bookmarks” roadster.

Quick Intro to Social Bookmarks

According to Wikipedia, a “social bookmark” is a:

. . . web based services where shared lists of user-created Internet bookmards are displayed. Social bookmarking sites generally organize their content using tags and are an increasingly popular way to locate, classify, rank, and share Internet resources . . . The concepts of social bookmarking and tagging took root with the launch of a web site called del.icio.us in approximately 2003.

Often, [social bookmark] lists are publicly accessible, although some social bookmarking systems allow for privacy on a bookmark by bookmark basis. They [may] also categorize their resources by the use of informally assigned, user-defined keywords or tags from a folksonomy). Most social bookmarking services allow users to search for bookmarks which are associated with given “tags”, and rank the resources by the number of users which have bookmarked them.. . . . as people bookmark resources that they find useful, resources that are of more use are bookmarked by more users. Thus, such a system [can] “rank” a resource based on its perceived utility.

Since the classification and ranking of resources is a continuously evolving process, many social bookmarking services [may also] allow users to subscribe to syndication feeds or collections of tag terms. This allows subscribers to become aware of new resources for a given topic, as they are noted, tagged, and classified by other users. There are drawbacks to such tag-based systems as well: no standard set of keywords, no standard for the structure of such tags, mistagging, etc. . . . . The separate (but related) tagging and social bookmarking services are, however, evolving rapidly, and these shortcomings will likely either be addressed in the near future or shown not to be relevant to these services.

The idea of experts and interested individuals sharing their discoveries and passions is clearly compelling. What has been most interesting in the development of “social bookmarking” software and services on the Web has been the assumptions underlying how those obejctives can best be achieved.

Of course, the most powerful concept underlying all of this stuff has been the ideal of “community.” We now face (have the opportunity) for electronic tribes and all that means in breaking former bounds of space and circumstance. Truly, the prospect of finding effective means for the identification, assembly, consensus-building, and sharing within meaningful communities is breathtaking.

Listing of Social Bookmarking Services

To get a handle on the state of the art, I began assembling a list of social bookmark and closely related services from various sources. I’ve found about 40 of them, which may mean there are on order of 50 or so extant. The icons and links below show these 40 or so sites, with a bit of explanation on each:

— 43Things — this site is geared for individuals to share activity lists, ambitions or “thngs to do” with one another.

— Backflip — this is a bookmark recollection and personal search space and directory. It has received a top 100 site from PC Magazine.

— blinkbits — this is a social bookmarking site that has about 16,000 “blinks” or topic folders.

— BlinkList –this site also allows bookmarks to be filtered by friends and collaborators.

— Bloglines — beyond a simple social bookmark service, this site more importantly provides an RSS feeder and aggregator; owned by Ask Jeeves.

— Blogmarks — there is not much background info on this site; it is a somewhat better designed but offers typical social bookmarks services.

CiteULike — this site is geared toward academics and the sharing of paper references and links. Many references are to subscription papers. Generally, all submissions have an edited abstract and pretty accruate tags provided.

— Connotea — while the functionality of this site is fairly standard for social bookmarking and activity is lower than some other sites, Connotea has a specific emphasis on technical, research, and academic topics that may make it more attractive to that audience.

— del.icio.us — this site is the granddaddy of social bookmark services, plus tagging support, plus is the first to use a very innovative URL. Amongst all the sites herein, this one probably has the greatest activity and number of listings.

— de.lirio.us — this site is now being combined with simpy.com.

Digg — Digg — the Digg service is similar to others on this listing by providing social bookmarking, voting and popularity, and user control of listings, etc. It has received some buzz in the blog community.

Fark — Fark — while this site has aspects of social bookmarking, it is definiitely more inclined to be edgy and current.

Findory — Findory — geared toward news and blogs aggregation.

— Flickr — the largest and best known of the photo sharing and bookmarking sites; owned by Yahoo.

Furl — Furl — this site, part of LookSmart, has what you would expect from a bucks-backed site, but seems pretty vanilla with respect to social bookmarking capabilities.

Hyperlinkomatic — a beta service from the UK that has ceased accepting new users.

Jots — a small, and not notably distinguised, social bookmark site.

Kinja — Kinja — this is a blog bookmarking and aggregation service.

Linkroll — this is a relatively low-key service, modeled to a great extent on del.icio.us

Lookmarks –this is a social bookmarking service with tags, sharing, search and popular lists, with images and music/video sharing as well.

— Ma.gnolia –this service is a fairly standard social bookmarking site.

Maple –this is a fairly standard social bookmarking service, small with about 5,500 users, that uses Ruby on Rails.

— Netvouz — this service is a fairly standard social bookmarking service that also provides tags.

Oyax — this is another fairly standard online bookmarks manager.

RawSugar –this site has most of the standard social bookmarking features, but differentiates by adding various user-defined directory structures.

Reddit — Reddit — the site has recently gotten some buzz due to a voting feature that moves topic rankings up or down based on user feedback; other aspects of the site are fairly vanilla.

Rojo — this is a very broad RSS feed reader with hundreds of sources, to which you may add your own. It allows you to organize feeds by tags, share your feeds via an address book, and tracks and ranks what you view most often. This site has been getting quite a bit of buzz.

Scuttle –Scuttle — this is a fairly standard social bookmarking site with low traffic.

Shadows — Shadows — this social bookmark site is attractively designed and adds a different wrinkle by letting any given topic or document to have its own community discussion page.

Shoutwire — this site adds community feedback and collaboration to a “standard” RSS news feeder and aggregator.

— Smarking — this site is a fairly standard social bookmarking site.

Spurl — Spurl — this site is a fairly standard social bookmarking site.

Squidoo — Squidoo — this site is different from other social bookmarking services in that it lets you create a page on your topic of choice (called a lens) where you add links, text, pictures and other pieces of content. Each lens is tagged.

Start — an experimental Microsoft personalized home page service, powered by Ajax; capabilities and direction are still unclear.

— TailRank — this site allows about 50,000 blogs to be monitored in a fairly standard social bookmarking manner.

— Unalog — this is a fairly standard social bookmarking site.

Wink — this service is both a social bookmarker and a search engine to other online resources such as del.icio.us and digg.

Wists — Wists — this is a social bookmarking site geared to sharing shopping links and sites.

YahooMyWeb — Yahoo’s MyWeb — this is the personalized entry portal for Yahoo! including bookmarking and many specialty feeds and customization.

Zurpy — this social bookmark service is in pre-launch phase.

General Observations

I personally participate in a couple of these services, notably Bloglines and Rojo. Some of what I have discovered will compel me to try some others.

In testing out and assembling this list, however, I do have some general observations:

Most sites are repeats or knock-offs of the original del.icio.us. While some offer prettier presentation and images, functionality is pretty identical. These are what I refer to as the “fairly vanilla” or standard sites above
Systems that combine bookmarking with tagging and directory presentations seem most useful (at least to me) for the long haul. Also of interest are those sites that focus on narrower and more technical communities (e.g., Connotea, CiteULike).
Virtually all sites had poor search capabilities, particularly in advanced search or operator support, and were not taking full advantage of the tagging structure in their listings
Development of directory and hierarchical structures is generally poor, with little useful depth or specificity. This may improve as use grows, as it has in Wikipedia, but limits real expert use at present, and
Thus, paradoxically, while the sites and services themselves in their current implementation are very helpful for initial discovery, they are of little or no use for expert discovery or knowledge discovery.

I suspect most of these limitations will be overcome over time, and perhaps very shortly at that. Technology certainly does not appear to be the limiting factor, but rather the needs for scales of use and the network effect.

Can We Get to Web 2.0 by Adding Multiple 0.05s?

Another paradox is that while these sites help promote the concept of community, they seem to work to actually fragment communities. There’s much competition at present for many of the same people trying to do the same social things and collaboration. One way to look at 40 sites trying to achieve Web 2.0 is that each site only contributes Web 0.05.

Specific iinnovative communities on the Web such as biologists, physicists, librarians and the like will be some of the most successful for leveraging these technologies for community growth and sharing. In other communities, certainly competition will winnow out only a few survivors.

The older, centrally imposed means for communities to determine “authoritativeness” — be it peer review, library purchasing decisions, societal recognition or reputation, publisher selection decisions, citation indexes, etc. — do not easily apply to the distributed, chaotic Internet. What others in your community find of value, and thus choose to bookmark and share, is one promising mechanism to bring some semblance of authoritativeness to the medium. Of course, for this truly to work, there must be trust and respect within the communities themselves.

I think we should see within the foreseeable future a standard set of functionalities — submitting, ranking, organizing, searching, commentating, collaborating, annotating, exporting, importing, and self-policing — that will allow these community sites to become go-to “infohubs” for their users. These early social bookmarking services look to be the nucleates that will condense stronger and diverse communities of interest on the Web. Let the maturation begin.

Posted:April 1, 2006

SWISHer — A ‘Swicki’ Test Drive

I just came across a pretty neat site and service for creating vertical search engines of your choosing. Called a ‘swicki’ the service and capabilitiy is provided by Eurekster, a company founded about two years ago around the idea of personalized and social search. The ‘swicki’ implementation was first released in November 2005.

SWISHer

by Michael K Bergman from https://www.mkbergman.com

HOT SEARCHES

classification information science knowledge management
metadata ontologies OWL RDF semantic web web2.0 XML [ ?]

This is a swicki – a search engine that learns from the search behavior of your community.
Get your own swicki from Eurekster for free!

NOTE: As you conduct searches using the form above, you will be taken from my blog to http://swisher-swicki.eurekster.com. To return, simply use your browser back button.

What in Bloody Hell is a Swicki?

According to the company:

Swickis are a new kind of search engine or search results aggregator. Swickis allow you to build specific searches tailored to your interests and that of your community and get constantly updated results from your web or blog page. Swickis scan all the data indexed in Yahoo Search, plus all additional sources you specify, and present the results in a dynamically updated, easy to use format that you can publish on your site – or use at swicki.com. We also collect and organize information about all public swickis in our Directory. Whether you have built a swicki or not, you can come to the swicki directory and find swicki search engines that interest you.

Swickis are like wikis in that they are collaborative. Not only does your swicki use Eurekster technology to weight searches based on the behavior of those who come to your site, in the future, your community – if you allow them – can actively collaborate to modify and focus the results of the search engine. . . . Every click refines the swicki’s search strings, creating a responsive, dynamic result that’s both customized and highly relevant.

A 10 Minute Set-up

I first studied the set-up procedure and then gathered some information before I began my own swicki. Overall the process was pretty straightforward and took me about 10 minutes. You begin the process on the Eurekster swicki home page.

Step 1: You begin by customizing how you want the swicki to look — wide or narrow, long or short, and font sizes and a choice of about twenty background and font color combinations. I thought these customization options were generally the most useful ones and the implementation pretty slick
Step 2: You "train" your search (actually, just specify useful domains and URLs and excluded ones). Importantly, you give the site some keywords or phrases to qualify final results accepted for the site. One nice feature is to add or not blog content or the content of your existing web site
Step 3: You then provide a short description for the site and assign it to existing subject categories. Code is generated at this last step that is simple to insert into your Web site or blog, with some further explanations for different blog environments.

You are then ready to post the site and make it available to collaborative feedback and refinement. You can also choose to include ads on the site or look to other means to monetize it should it become popular.

If a public site, your swicki is then listed on the Eurekster directory; as of this posting, there were about 2,100 listed swickis (more in a next post on that).

For business or larger site complexes, there are also paid versions building from this core functionality.

SWISHER: Giving it My Own Test Drive

I have been working in the background for some time on an organized subject portal and directory for this blog called SWISHer — for Semantic Web, Interoperability, Standards and HTML. (Much more is to be provided on this project at a later time.) Since it is intended to be an expert’s repository of all relevant Web documents, the SWISHer acronym is apparent.

One of the things that you can do with the Eurekster swicki is run a direct head-to-head comparison of results with Google. That caused me to think that it would also be interesting when I release my own SWISHer site to compare it with the swicki and with Google. Thus, the subject of my test swicki was clear.

Since I know the semantic Web reference space pretty well, I chose about 75 key starting URLs to use as the starting "training" set for the swicki.

This first version of SWISHer as a swicki site, with its now-embedded generated code, is thus what appears above. In use it indicates links to about 400,000 results, though the search function is pretty weak and it is difficult to use some of my standard tricks to ascertain the actual number of documents in the available index.

To see the swicki site in action, either go to http://swisher-swicki.eurekster.com, click on the SWISHer title, or enter your search in the form above and click search.

Now installed, I’m taking these capabilities for a longer road trip. The test drive was fun; let’s see how it handles over rough terrain and covering real distances. I’ll post impressions in a day or so.

Posted:March 26, 2006

Why Are $800 Billion in Document Assets Wasted Annually? V. Summary

It is a tragedy of no small import when $800 billion in readily available savings from creating, using and sharing documents is wasted in the United States each year. How can waste of such magnitude — literally equivalent to almost 8% of gross domestic product or more than 40% of what the nation spends on health care [1] — occur right before our noses? And how can this waste occur so silently, so insidiously, and so ubiquitously that none of us can see it?

Let me repeat. The topic is $800 billion in annual waste in the U.S. alone, perhaps equivalent to as much as $3 trillion globally, that can be readily saved each year with improved document management and use. Achieving these savings does not require Herculean efforts, simply focused awareness and the application of best practices and available technology. As the T.D. Waterhouse commercial says, “You can do this.”

This entry concludes a series of posts resulting from an earlier white paper I authored under BrightPlanet sponsorship. Entitled, Untapped Assets: The $3 Trillion Value of U.S. Enterprise Documents,[2] that paper documented via many references and databases the magnitude of the poor use of document assets within enterprises. The paper was perhaps the most comprehensive look to date at the huge expenditures document creation and use occupy within our modern knowledge economy, and first quantified the potential $800 billion annual savings in overcoming readily identifiable waste.

Simply documenting the magnitude of expenditures and savings was mindblowing. But what actually became more perplexing was why the scope of something so huge and so amenable to corrective action was virtually invisible to policy or business attention. The vast expenditures and potential savings surfaced in the research quite obviously begged the question: Why is no one seeing this?

I then began this series to look at why document use savings may fit other classes of “big” problems such as high blood pressure as a silent killer, global warming from odorless and colorless greenhouse gasses, or the underfunding of cost-effective water systems and sanitation by international aid agencies. There seems to be something more difficult involving ubiquitous problems with broadly shared responsibilities.

The series began in October of last year and concludes with this summary. Somehow, however, I suspect the issues touched on in this series are still poorly addressed and will remain a topic for some time to come.

The series looked at four major categories:

This summary wraps up the series.

I can truthfully conclude that I really haven’t yet fully put my finger on the compelling reason(s) as to why broad, universal problems such as document use and management remain a low priority and have virtual no visibility despite the very real savings that current techniques and process can bring. But I think some of the relevant factors are covered in these topics.

The arguments in Part I are pretty theoretical. They firstly ask if it is in the public interest to strive for improvements in “information” efficiency, some of which may be applicable to the private sector with possible differentials in gains. They secondly question the rhetoric of “information overload” that can lead to a facile resignation about whether the whole “information” problem can be meaningfully tackled. One dog that won’t hunt is the claim that computers intensify the information problem of private gain v. societal benefit because now more stuff can be processed. Such arguments are diversions that obfuscate deserved and concentrated public policy that can bring real public benefits — and soon. Why else do we not see tax and economic policies that can enrich our populace by hundreds of billions of dollars annually?

Part II argues that barriers to collaboration, many cultural but others social and technical, help to prevent a broader consensus about the importance of documents reuse (read: “information” and “knowledge”). Document reuse is likely the single largest reservoir of potential waste reductions. One real problem is the lack of top leadership within the organziation to encourage collaboration and efficiencies in document use and management through appropriate training and rewards, and commitments to install effective document infrastructures.

Part III re-visits prior failings and high costs in document or content initiatives within the enterprise. Perceptions of past difficulties color the adoption of new approaches and technologies. The lack of standards, confusing terminology, some failed projects, immaturity of the space, and the as-yet emergence of a dominant vendor have prevented more widespread adoption of what are clearly needed solutions to pressing business content needs. There are no accepted benchmarks by which to compare vendor performance and costs. Document use and management software can be considered to be at a similar point to where structured data was at 15 years ago at the nascent emergence of the data warehousing market. Growth in this software market will require substantial improvements in TCO and scalability, among a general increase in awareness of the magnitude of the problem and available means to solve it.

Part IV looks at what might be called issues of attention, perception or psychology. These factors are limiting the embrace of meaningful approaches to improve document access and use and to achieve meaningful cost savings. Document intelligence and document information automation markets still fall within the category of needing to “educate the market.” Since this category is generally dreaded by most venture capitalists (VCs), that perception is also acting to limit the financing of fresh technologies and entrepreneurialiship.

The conclusion is that public and enterprise expenditures to address the wasted document assets problem remain comparatively small, with growth in those expenditures flat in comparison to the rate of document production. Hopefully, this series — plus, also hopefully, ongoing dialog and input from the community — can continue to bring attention and focus to the various ways that technology, people, and process can bring real document savings to our collective pocketbooks.

[1] According to the U.S. Dept of Health and Human Services, the nation spent $1.9 trillion on health care in 2004; see http://www.cms.hhs.gov/NationalHealthExpendData/02_NationalHealthAccountsHistorical.asp#TopOfPage.

[2] Michael K. Bergman, “Untapped Assets: The $3 Trillion Value of U.S. Enterprise Documents,” BrightPlanet Corporation White Paper, July 2005, 42 pp. The paper contains 80 references, 150 citations, and many data tables.

NOTE: This posting concludes a series looking at why document assets are so poorly utilized within enterprises. The magnitude of this problem was first documented in a BrightPlanet white paper by the author titled, Untapped Assets: The $3 Trillion Value of U.S. Enterprise Documents. An open question in that paper was why more than $800 billion per year in the U.S. alone is wasted and available for improvements, but enterprise expenditures to address this problem remain comparatively small and with flat growth in comparison to the rate of document production. This series is investigating the various technology, people, and process reasons for the lack of attention to this problem.

Posted:March 25, 2006

30 Minute Video on the Semantic Web

Henry Story, one of my favorite semantic Web bloggers and a Sun development guru, has produced a very useful video and PDF series on the semantic Web. Here is the excerpt from his site with details about where to get the 30 min presentation (62 MB for the QuickTime version, see below), highly useful to existing development staff:

. . . how could the SemWeb affect software development in an Open Source world, where there are not only many more developers, but also these are distributed around the world with no central coordinating organisation? Having presented the problem, I then introduce RDF and Ontologies, how this meshes with the Sparql query language, and then show how one could use these technologies to make distributed software development a lot more efficient.

Having given the presentation in November last year, I spent some time over Xmas putting together a video of it (in h.264 format). . . . Then last week I thought it would be fun to put it online, and so I placed it on Google video, where you can still find it. But you will notice that Google video reduces the quality quite dramatically, so that you will really need to have the pdf side by side, if you wish to follow.

Your time spent with this presentation will be time well spent. I’d certainly like to hear more about OWL, or representing and resolving semantic heterogeneities, or efficient RDF storage databases at scale, or a host of other issues of personal interest. But, hey, perhaps there are more presentations to come!

Main Links

Search

Author: Mike Bergman