Posted: October 21, 2005

This post introduces a new category area in my blog related to what I and BrightPlanet are terming the eXtensible Semi-structured Data Model (XSDM). Topics in this category cover all information related to extensible data models and engines applicable to documents, metadata, attributes, semi-structured data, or the processing, storing and indexing of XML, RDF, OWL, or SKOS data formats.

The importance of this category is nicely introduced by Fengdong Du in his master’s thesis, Moving from XML Documents to XML Databases, submitted to the University of British Columbia in March 2004. As succinctly stated in the introduction to that thesis:

Depending on the characteristics of XML applications, the current XML storage techniques can be classified into two major categories. Most text-centric applications (e.g., newspapers) choose an existing file system for data storage. Data is usually divided into logical units, and each logical unit is physically stored as a separate file. As an example, a newspaper application may divide the entire year's newspapers into 12 collections by months, and store each collection as a document file. This type of application usually provides a keyword-based search tool and manipulates the data in application-specific processes. While this approach simplifies the storage problem, it has some major drawbacks. First, storing XML data as plain text makes it difficult to develop a generic data manipulation interface.

Second, mapping logical units of data to individual files makes it difficult to view the data from a different perspective. For this reason, this type of application only provides services with limited functionalities and therefore restricts the usage of data.

On the other hand, in data-centric applications such as e-commerce applications, data is typically highly-structured, e.g., extracted from a relational database management system (RDBMS). XML is primarily used as a tool to publish data to the Web or deliver information in a self-descriptive way in place of the conventional relative files. This type of application relies on the RDBMS for data storage. Data received in XML format is eventually put into an RDBMS when persistence is desired. Over the years, an RDBMS has been well developed to efficiently store and retrieve well-structured data. Structured Query Language (SQL) and many useful extended RDBMS utilities (e.g., Programming Language SQL, stored procedures) act as an application-independent data manipulation interface. Applications can communicate with databases through this generic interface and, on top of it, provide services with very rich functionalities.

While storing XML data into an RDBMS can take advantage of the well-developed relational database techniques and open interfaces, this approach requires an extra schema-mapping process applied to XML data, which involves schema transformation and usually decomposition. The schemas of XML data have to be mapped to strictly-defined relational schemas before data is actually stored. This process is strongly application-dependent or domain-dependent because there must be enough information available to determine many relational database design issues such as which table in the target RDBMS is a good place to store the information delivered, what new tables need to be created, which elements/attributes should be indexed, etc. No matter how this kind of information is obtained, whether delivered with XML data as schemas and processing instructions, or the application context makes it obvious, it is hard to develop an automatic and generic schema-mapping mechanism. Instead, application-specific work needs to take care of the schema-mapping problem. This involves non-trivial work of database server-side programming and database administration.

Another drawback of storing XML data in an RDBMS is that it is hard to efficiently support many types of queries that people want to ask on XML data. In RDBMS, each table has a pre-defined primary key field, and possibly a few other indexed fields. Queries not on the key field and not on the indexed fields will result in table scans (i.e., possibly a very large number of I/O's, which can be very time consuming) such as for the following path and predicate expression:

//department[@street="main mall"]/student[@nationality="Chinese"]

It is very likely that "department" is not indexed on "street" and that "student" is not indexed on "nationality". Therefore, resolving this path expression will cause table scans. Moreover, storing XML data in an RDBMS often results in schema decomposition and produces many small tables. Hence, evaluating a query often needs many expensive join operations.

For unstructured or semi-structured data, an RDBMS has greater difficulty, and query performance is usually unacceptable for relatively large amounts of data. For these reasons, a native database management system is expected in the XML world. Like a traditional RDBMS, native XML databases would provide a comprehensive and generic data management interface, and therefore isolate lower level details from the database applications. Unlike an RDBMS, an ideal native XML database would make no distinction between unstructured data and strictly structured data. It treats all valid XML data in the same way and manages them equally efficiently. Its performance is only affected by the type of data manipulation. In other words, an ideal XML native database is not only access transparent but also performance transparent upon the structural difference of data.
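The thesis's example can be made concrete with a short sketch. The XML document, element names and table layouts below are hypothetical, invented only to mirror the quoted path expression; the point is that a tree-aware engine resolves the path directly, while the relational route first requires schema decomposition into separate tables and then a join, which without indexes on "street" and "nationality" devolves into table scans:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical document mirroring the thesis's path expression.
XML = """
<university>
  <department street="main mall" name="CS">
    <student nationality="Chinese" name="Wei"/>
    <student nationality="Canadian" name="Sam"/>
  </department>
  <department street="agronomy rd" name="Math">
    <student nationality="Chinese" name="Li"/>
  </department>
</university>
"""

# Native XML view: the path expression maps directly onto the tree.
root = ET.fromstring(XML)
matches = root.findall(
    ".//department[@street='main mall']/student[@nationality='Chinese']")
print([s.get("name") for s in matches])  # ['Wei']

# Relational view: the same data must first be decomposed into tables
# (the schema-mapping step described above), and the query becomes a join.
db = sqlite3.connect(":memory:")
db.executescript("""
  CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT, street TEXT);
  CREATE TABLE student (id INTEGER PRIMARY KEY, dept_id INTEGER,
                        name TEXT, nationality TEXT);
""")
for d_id, dept in enumerate(root.iter("department")):
    db.execute("INSERT INTO department VALUES (?, ?, ?)",
               (d_id, dept.get("name"), dept.get("street")))
    for stu in dept.iter("student"):
        db.execute(
            "INSERT INTO student (dept_id, name, nationality) VALUES (?, ?, ?)",
            (d_id, stu.get("name"), stu.get("nationality")))

# Neither street nor nationality is indexed, so both tables are scanned.
rows = db.execute("""
  SELECT s.name FROM student s JOIN department d ON s.dept_id = d.id
  WHERE d.street = 'main mall' AND s.nationality = 'Chinese'
""").fetchall()
print([r[0] for r in rows])  # ['Wei']
```

Both routes return the same answer here, but only because the decomposition was hand-written for this one schema; that application-specific mapping work is exactly the burden the thesis argues a native XML store should remove.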

Future topics in this XSDM area will expand on these challenges and describe new standards-based solutions being developed by BrightPlanet to address them.

Posted by AI3's author, Mike Bergman Posted on October 21, 2005 at 3:06 pm in Adaptive Information, Information Automation, Semantic Web | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/146/the-semantic-web-demands-different-database-models/
The URI to trackback this post is: https://www.mkbergman.com/146/the-semantic-web-demands-different-database-models/trackback/
Posted: October 13, 2005

I just came across a VC blog pondering the value to a start-up of operating in "Stealth Mode" or not.  I’ve amusingly come to the conclusion that all of this — particularly the "stealth" giveaway — is so much marketing hype.  When a start-up claims they’re coming out of stealth mode, grab your wallet.

The most interesting and telling example I have of this is Rearden Commerce, which was announced in a breathy cover story in InfoWorld in February 2005 about the company and its founder/CEO, Patrick Grady.  The company has an obvious "in" with the magazine; in 2001 InfoWorld also carried a similar piece on the predecessor company to Rearden, Talaris Corporation.

According to a recent Business Week article, Rearden Commerce and its predecessors, reaching back to an earlier company called Gazoo founded in 1999, have raised $67 million in venture capital.  While it is laudable that the founder has reportedly put his own money into the venture, this venture, with its massive funding and high-water mark of 80 or so employees, hardly qualifies as "stealth."

As early as 2001 with the same technology and business model, this same firm was pushing the "stealth" moniker.  According to an October 2001 press release:

 "The company, under its stealth name Gazoo, was selected by Red Herring magazine as one of its ‘Ten to Watch’ in 2001."  [emphasis added]

Even today, though no longer the active name, Talaris Corporation has close to 115,000 citations on Yahoo!  Notable VCs such as Charter Ventures, Foundation Capital, JAFCo and Empire Capital have backed it through its multiple incubations.

The Holmes Report, a marketing company, provides some insight into how the earlier Talaris was spun in 2001:

"The goal of the Talaris launch was to gain mindshare among key business and IT trade press and position Talaris as a ‘different kind of start-up’ with a multi-tiered business model, seasoned executive team and tested product offering."

The Holmes Report documents the analyst firms and leading journals and newspapers to which it made outreach.  Actually, this outreach is pretty impressive.  Good companies do the same all of the time and that is to be lauded.  What is to be questioned, however, is how many "stealths" a cat can have.  Methinks this one is one too many.

"Stealth" thus appears to be code for an existing company of some duration that has had disappointing traction and now has new financing, a new name, new positioning, or all of the above.  So, interested in a start-up that just came out of stealth mode?  Let me humbly suggest standard due diligence.

Posted by AI3's author, Mike Bergman Posted on October 13, 2005 at 9:19 am in Software and Venture Capital | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/143/stealth-mode-grab-your-wallet/
The URI to trackback this post is: https://www.mkbergman.com/143/stealth-mode-grab-your-wallet/trackback/
Posted: October 11, 2005

BrightPlanet has announced a major upgrade to its Deep Query Manager knowledge worker document platform.  According to its press release, the new version achieves extreme scalability and broad internationalization and file format support, among other enhancements.  The DQM has added the ability to harvest and process up to 140 different foreign languages in more than 370 file formats plus new content export and system administration features.  The company also claims the new distributed architecture allows scalability into hundreds or thousands of users across multiple machines with the ability to handle incremental growth and expansions.

According to the company:

The Deep Query Manager is a content discovery, harvesting, management and analysis platform used by knowledge workers to collaborate across the enterprise. It can access any document content — inside or outside the enterprise — with strengths in deep content harvesting from more than 70,000 unique searchable databases and automated techniques for the analyst to add new ones at will. The DQM’s differencing engine supports monitoring and tracking, among the product’s other powerful project management, data mining, reporting and analysis capabilities.

According to Paul de la Garza of the St. Petersburg Times, the Special Operations Command (SOCom) based out of MacDill Air Force Base in Tampa Bay will be opening a new Joint Intelligence Operations Center (JIOC) in St. Petersburg to process open source intelligence (OSINT) in support of the global war on terrorism.

The Center was announced by Rep. C.W. Bill Young, R-Indian Shores (FL), on October 7.  Rep. Young said that Blackbird Technologies of Virginia was awarded the $27-million contract to operate the Center, which will employ 60 people conducting OSINT.  Young, chairman of the Defense Appropriations Subcommittee, said the center will open soon but declined to offer more details because of the classified nature of the facility.

According to de la Garza, SOCom has played a pivotal role in the war on terror since 9/11, with an increase in budget from $3.8-billion to $6.6-billion and an increase in staff from 6,000 to 51,441. In March, President Bush signed a directive that puts SOCom in charge of "synchronizing" the war on terror.

Posted by AI3's author, Mike Bergman Posted on October 11, 2005 at 9:51 am in OSINT (open source intel) | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/141/socom-awards-new-osint-center/
The URI to trackback this post is: https://www.mkbergman.com/141/socom-awards-new-osint-center/trackback/
Posted: October 6, 2005

Collaboration is important.  BrightPlanet‘s earlier research paper on the waste associated with enterprise document use (or lack thereof) indicated that $690 billion a year alone could be reclaimed by U.S. enterprises from better sharing of information. That represents 88% of the total $780 billion wasted annually.

The issue of poor document use within the organization is certainly not solely a technological issue, and is likely due more to cultural and people issues, not to mention process. At BrightPlanet, we have been attempting a concerted “document as you go” commitment by our developers and support people, and have worked hard to put in place Wiki and other collaboration tools to minimize friction.

But friction remains, often stubbornly so. At heart, the waste and misuse of document assets within organizations arises from a complex set of these people, process and technology issues.

Dave Pollard, the inveterate blogger on KM and other issues, provided a listing of 16 reasons for ‘Why We Don’t Share Stuff’ on September 19.[1] That thoughtful posting received a hailstorm of responses, which caused Dave to update the listing to 23 reasons on September 29 under a broader post called ‘Knowledge Sharing & Collaboration 2015’ (a later post upped that count to 24 reasons). (BTW, my own additions below have upped this number to 40, though high listing counts are beside the point.) This is great stuff, and nearly complete grist for laying out the reasons — some major and some minor — why collaboration is often difficult.

I have taken these reasons, plus some others I’ve added of my own or from other sources, and have attempted to cluster them into the various categories below.[2] Granted, these assignments are arbitrary, but they are also telling as the concluding sections discuss.

People, Behavior and Psychology

These are possible reasons why collaboration fails due to people, behavior or psychological factors. They represent the majority (56%) of the reasons proffered by Pollard:

  • People find it easier and more satisfying to reinvent the wheel than re-use other people’s ‘stuff’ (*)
  • People only accept and internalize information that fits with their mental models and frames (Lakoff’s rule) (*)
  • Some modest people underestimate the value of what they know so they don’t share (*)
  • We all learn differently (some by reading, some by listening, some by writing down, some by hands-on), and people won’t internalize information that isn’t in a format attuned to how they learn (one size training doesn’t fit all) (*)
  • People grasp graphic information more easily than text, and understand information conveyed through stories better than information presented analytically (we learn by analogy, and images and stories are better analogies to our real-life experiences than analyses are) (*)
  • People cannot readily differentiate useful information from useless information (* split)
  • Most people want friends and even strangers to succeed, and enemies to fail; this has a bearing on their information-sharing behaviour (office politics bites back) (*)
  • People are averse to sharing information orally, and even more averse to sharing it in written form, if they perceive any risk of it being misused or misinterpreted (the better safe than sorry principle) (*)
  • People don’t take care of shared information resources (Tragedy of the Commons again) (*)
  • People seek out like minds who entrench their own thinking (leads to groupthink) (**)
  • Introverts are more comfortable wasting time looking for information rather than just asking (sometimes it’s just more fun spending 5 hours on secondary research, or doing the graphics for your powerpoint deck by trial and error, than getting your assistant to do it for you in 5 minutes) (**)
  • People won’t (or can’t) internalize information until they need it or recognize its value (most notably, information in e-newsletters is rarely absorbed because it rarely arrives just at the moment it’s needed) (**)
  • People don’t know what others who they meet know, that they could benefit from knowing (a variant on the old “don’t know what we don’t know” — “we don’t know what we don’t know that they do”) (**)
  • If important news is withheld or sugar-coated, people will ‘fill in the blanks’ with an ‘anti-story’ worse than the truth (**)
  • Experts often speak in jargon or “expert speak.” They don’t know they aren’t communicating, and non-experts are afraid to ask (***).

Management and Organization

These are possible reasons why collaboration fails due to managerial or organizational limits. They represent about one-fifth (20%) of the reasons proffered by Pollard:

  • Bad news rarely travels upwards in organizations (shoot the messenger, and if you do tell the boss bad news, better have a plan to fix it already in motion) (*)
  • People share information generously peer-to-peer, but begrudgingly upwards (“more paperwork for the boss”), and sparingly downwards (“need to know”) in organizational hierarchy — it’s all about trust (*)
  • Managers are generally reluctant to admit they don’t know, or don’t understand, something (leads to oversimplifying, and rash decision-making) (*)
  • Internal competition can mitigate against information sharing (if you reward individuals for outperforming peers, they won’t share what they know with peers) (*)
  • The people with the most valuable knowledge have the least time to share it (**)
  • Management does not generally appreciate its role in overcoming psychology and personal behaviors that limit collaboration (***)
  • Management does not appreciate the tremendous expense, revenue, profitability and competitiveness implications of the lack of collaboration (***)
  • Management does not know training, incentive, process, technology or other techniques to overcome limits to collaboration (***)
  • Earlier organization attempts with CIOs, CKOs, etc., have not been sustained or were the wrong model for internalizing these needs within the organization (***)
  • Organizational job titles still favor managerial roles over expertise in status and reward (***)
  • Hiring often inadequately stresses communication and collaboration skills, and does not provide in-house training if still lacking (***).

Technology, Process and Training

These are possible reasons why collaboration fails due to technology, process or training. They represent about one-eighth (12%) of the reasons proffered by Pollard; but recall that his original premise focused on human or psychological reasons, so it is not surprising this category is less represented:

  • People know more than they can tell (some experience you just have to show) & tell more than they can write down (composing takes a lot of time) (Snowden’s rule) (*)
  • People feel overwhelmed with content volume and complex tools (info overload, and poverty of imagination) (* split)
  • People will find ways to work around imposed tools, processes and other resources that they don’t like or want to use (and then deny it if they’re called to account for it) (**)
  • Employees lack the appreciation for the importance of collaboration to the success of their employer and their job (***)
  • Most means for “recording” the raw data and information for collaboration have too much “friction” (***)
  • There needs to be clear divisions between “capturing” knowledge and information and “packaging” it for internal or external consumption (***)
  • Single-source publication techniques suck (***)
  • Testing, screening, vetting and adopting new technology or process advances is generally lacking (***).

Cost, Rewards and Incentives

These are possible reasons why collaboration fails due to cost and reward structures, again about one-eighth (12%) of the reasons proffered by Pollard. Again, since his original premise focused on human or psychological reasons, it is not surprising this category is less represented:

  • The true cost of acquiring information (time wasted looking for it) and the cost of not knowing (Katrina, 9/11, Poultry Flu etc.) are both greatly underestimated in most organizations (*)
  • Rewards for sharing knowledge don’t work for long (*)
  • People value information they paid for more highly than that they get free from their own people (thus the existence of the consulting industry) (from James Governor) (**)
  • Find reduced cost document solutions (***)
  • Link performance pay to collaboration goals (***).

Insights and Quibbles

There are some 25 reasons provided by Dave and his blog respondents, actually closer to 40 when my own are added, that represent a pretty complete compendium of “why collaboration fails.” Though I could pick out individual ones to praise or criticize, that would miss the point.

The objective is neither to collect the largest number of such factors nor to worry terribly about how they are organized. But there are some interesting insights.

Clearly, human behavior and psychology provide the baseline for looking at these questions. Management’s role is to provide organizational structure, incentives, training, pay and recognition to reward the collaborative behavior it desires and needs. Actually, management’s challenge is even greater than that, since in most cases upper-level managers don’t yet have a clue as to the importance of the underlying information or of collaboration around it.

As in years past, leadership on these questions needs to come from the top. The disappointments of earlier CIO and CKO positions need to be looked at closely and given renewed attention. The idea behind those positions was not wrong; what was wrong was the execution and leadership commitment.

Organizations of all types and natures have figured out how to train and incentivize their employees for difficult duties ranging from war to first response to discretion. Putting in place reward and training programs to encourage collaboration, despite today's piss-poor performance, should not be so difficult in this light.

I think Dave brings many valuable insights to such areas as people preferring to reinvent the wheel out of a liking for creative design, or a collaboration repository being at risk without some sense of ownership, or people being afraid to look stupid, or some people communicating better orally vs. in written form, etc. These are, in fact, truisms of human diversity and skill differences. I firmly believe that if organizations purposefully seek to understand these factors, they can still design reward, training and recognition regimens to shape the behavior they desire.

The real problem in the question of collaboration within the enterprise begins at the top. If the organization is not aware and geared to address human nature with appropriate training and rewards, it will continue to see the poor performance around collaboration that has characterized this issue for decades.

NOTE: This posting is part of a series looking at why document assets are so poorly utilized within enterprises.  The magnitude of this problem was first documented in a BrightPlanet white paper by the author titled, Untapped Assets:  The $3 Trillion Value of U.S. Enterprise Documents.  An open question in that paper was why more than $800 billion per year in the U.S. alone is wasted and available for improvements, but enterprise expenditures to address this problem remain comparatively small and with flat growth in comparison to the rate of document production.  This series is investigating the various technology, people, and process reasons for the lack of attention to this problem.

[1] There have been some other interesting treatments of barriers to collaboration, including Carol Kinsey Goman’s Five Reasons People Don’t Tell What They Know and Jack Vinson’s Barriers to Knowledge Sharing.

[2] Pollard’s initial 16 reasons are shown with a single symbol (*); the next 8 additions with a double symbol (**). All remaining reasons added by me have three symbols (***).

Posted by AI3's author, Mike Bergman Posted on October 6, 2005 at 1:41 pm in Adaptive Information, Document Assets, Information Automation | Comments (5)
The URI link reference to this post is: https://www.mkbergman.com/135/why-are-800-billion-in-document-assets-wasted-annually-ii-barriers-to-collaboration/
The URI to trackback this post is: https://www.mkbergman.com/135/why-are-800-billion-in-document-assets-wasted-annually-ii-barriers-to-collaboration/trackback/