Posted: November 4, 2005

For some years now the enterprise search market has been sick and in search of relief. In a frankly shocking development, Autonomy announced today that it is acquiring the veritable Verity search company for $500 million. See this Bloomberg story ….

This acquisition amount is itself a signal of how poorly this market has been doing. Enterprise search has averaged about $500 million annually in revenues over the past few years with little or slow growth. Autonomy has been the small cousin from England, but has now gobbled up the old dinosaur in Verity.

BTW, if someone mentions synergies or complementarity to you about this acquisition, don’t believe it. This is totally an indication of how poor the entire enterprise search market has become. With unstructured data representing 60-80% of all data available to an enterprise, these valuations are indicative of how piss-poor enterprise document technology is at present.

Bon voyage!  I expect all of these dinosaurs to find their final resting place in the sun.  RIP

Posted by AI3's author, Mike Bergman Posted on November 4, 2005 at 6:47 pm in Searching, Software and Venture Capital | Comments (0)
The URI link reference to this post is: http://www.mkbergman.com/158/the-sick-search-market-autonomy-buys-verity/
Posted: November 1, 2005

The first recorded mentions of “semi-structured data” occurred in two academic papers from Quass et al.[1] and Tresch et al.[2] in 1995. However, the real popularization of the term “semi-structured data” occurred through the seminal 1997 papers from Abiteboul, “Querying semi-structured data,” [3] and Buneman, “Semistructured data.” [4] Of course, semi-structured data had existed well before this time, only it had not been named as such.

What is Semi-structured Data?

Peter Wood, a professor of computer science at Birkbeck College at the University of London, provides succinct definitions of the “structure” of various types of data:[5]

  • Structured Data — in this form, data are organized in semantic entities, and similar entities are grouped together in relations or classes. Entities in the same group have the same descriptions (or attributes), while descriptions for all entities in a group (or schema): a) have the same defined format; b) have a predefined length; c) are all present; and d) follow the same order. Structured data are what is normally associated with conventional databases such as relational transactional ones where information is organized into rows and columns within tables. Spreadsheets are another example. Nearly all understood database management systems (DBMS) are designed for structured data
  • Unstructured Data — in this form, data can be of any type and do not necessarily follow any format or sequence, do not follow any rules, are not predictable, and can generally be described as “free form.” Examples of unstructured data include text, images, video or sound (the latter two also known as “streaming media”). Generally, “search engines” are used for retrieval of unstructured data via querying on keywords or tokens that are indexed at time of the data ingest, and
  • Semi-structured Data — the idea of semi-structured data predates XML but not HTML (with the actual genesis better associated with SGML, see below). Semi-structured data are intermediate between the two forms above wherein “tags” or “structure” are associated or embedded within unstructured data. Semi-structured data are organized in semantic entities, similar entities are grouped together, entities in the same group may not have the same attributes, the order of attributes is not necessarily important, not all attributes may be required, and the size or type of the same attributes in a group may differ. To be organized and searched, semi-structured data should be provided electronically from database systems, file systems (e.g., bibliographic data, Web data) or via data exchange formats (e.g., EDI, scientific data, XML).
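
To make these three definitions a little more concrete, here is a minimal Python sketch. The records, field names and helper functions are all invented for illustration; they are not from Wood's course notes.

```python
# Hypothetical records, invented purely for illustration.

# Structured: every row has the same attributes, all present, same order.
structured = [
    ("Lee", "Ann", 67),
    ("Jones", "Mary", 54),
]

# Semi-structured: entities are grouped together and carry their own labels,
# but attributes may be missing, extra, reordered, or differ in type.
semi_structured = [
    {"name": "Ann Lee", "age": 67, "city": "Salem"},
    {"city": "Boston", "name": "Mary Jones"},  # no age, different order
    {"name": "Pat Roe", "age": "fifty-four", "children": ["Al", "Bo"]},
]

# Unstructured: free-form text with no labels at all.
unstructured = "Ann Lee of Salem is 67 and has lived there all her life."


def attribute_sets(records):
    """Return the distinct sets of attribute names used across records."""
    return {frozenset(record) for record in records}


def has_fixed_schema(records):
    """True when every record uses exactly the same attribute names."""
    return len(attribute_sets(records)) <= 1


def value_types(records, attribute):
    """Return the Python type names seen for one attribute."""
    return {type(r[attribute]).__name__ for r in records if attribute in r}
```

Here `has_fixed_schema(semi_structured)` is false and `value_types(semi_structured, "age")` contains both `int` and `str`, which is exactly the looseness the third definition allows.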

Unlike structured or unstructured data, there is no accepted database engine specific to semi-structured data. Some systems attempt to use relational DBMS approaches from the structured end of the spectrum; other systems attempt to add some structure to standard unstructured search engines. (This topic is discussed in a later section.)

Semi-structured data models are sometimes called “self-describing” (or schema-less). These data models are often represented as labelled graphs, or sometimes labelled trees with the data stored at the leaves. The schema information is contained in the edge labels of the graph. Semi-structured representations also lend themselves well to data exchange or the integration of heterogeneous data sources.
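
As a rough illustration of the “labelled tree with the data stored at the leaves” idea, the sketch below uses nested Python dictionaries, where each key plays the role of an edge label. The sample record and the `leaf_paths` helper are my own assumptions, not part of any standard.

```python
def leaf_paths(node, prefix=()):
    """Yield (edge-label path, leaf value) pairs from a labelled tree.

    Dict keys act as edge labels; a list means several children
    share the same incoming label; anything else is a leaf value.
    """
    if isinstance(node, dict):
        for label, child in node.items():
            yield from leaf_paths(child, prefix + (label,))
    elif isinstance(node, list):
        for child in node:
            yield from leaf_paths(child, prefix)
    else:
        yield prefix, node


# Hypothetical self-describing record: no separate schema is declared.
person = {
    "name": {"first": "John", "last": "Smith"},
    "age": 67,
    "child": [{"name": {"first": "Lily"}}, {"name": {"first": "John"}}],
}

for path, value in leaf_paths(person):
    print("/".join(path), "=", value)
```

The “schema” here is nothing more than the set of label paths that happen to occur, which is why such models are called self-describing.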

A nice description by David Loshin[6] on Simple Semi-structured Data notes that structured data can be easily modeled, organized, formed and formatted in ways that are easy for us to manipulate and manage. In contrast, though we are all familiar with the unstructured text in documents, such as articles, slide presentations or the message components of emails, its lack of structure prevents the advantages of structured data management. Loshin goes on to describe the intermediate nature of semi-structured data:

There [are] sets of data in which there is some implicit structure that is generally followed, but not enough of a regular structure to “qualify” for the kinds of management and automation usually applied to structured data. We are bombarded by semi-structured data on a daily basis, both in technical and non-technical environments. For example, web pages follow certain typical forms, and content embedded within HTML often have some degree of metadata within the tags. This automatically implies certain details about the data being presented. A non-technical example would be traffic signs posted along highways. While different areas use their own local protocols, you will probably figure out which exit is yours after reviewing a few highway signs.

This is what makes semi-structured data interesting–while there is no strict formatting rule, there is enough regularity that some interesting information can be extracted. Often, the interesting knowledge involves entity identification and entity relationships. For example, consider this piece of semi-structured text (adapted from a real example):

John A. Smith of Salem, MA died Friday at Deaconess Medical Center in Boston after a bout with cancer. He was 67.

Born in Revere, he was raised and educated in Salem, MA. He was a member of St. Mary’s Church in Salem, MA, and is survived by his wife, Jane N., and two children, John A., Jr., and Lily C., both of Winchester, MA.

A memorial service will be held at 10:00 AM at St. Mary’s Church in Salem.

This death notice contains a great deal of information–names of people, names of places, relationships between people, affiliations between people and places, affiliations between people and organizations and timing of events related to those people. Realize that not only is this death notice much like others from the same newspaper, but that it is reasonably similar to death notices in any newspaper in the US.

Note in Loshin’s example that the “structure” added to the unstructured text (shown in yellow; my emphasis) to make this “semi-structured” data arises from adding informational attributes that further elaborate or describe the document. These attributes can be automatically found using “entity extraction” tools or similar information extraction (IE) techniques, or manually identified. [7] These attributes can be assigned to pre-defined record types for manipulation separate from a full-text search of the document text. Generally, when such attributes are added to the core unstructured data it is done through “metatags” that a parser can structurally recognize, such as by using the common open and close angle brackets. For example:

<author=John Smith>

In semi-structured HTML, the tags that provide the semi-structure serve a different purpose in terms of either formatting instructions to a browser or providing reference links to internal anchors or external documents or pages. Note that HTML also uses the open and close angle brackets as the convention to convey the structural information in the document.
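
A parser for metatags of the `<author=John Smith>` kind could be sketched as follows. The regular expression and the `split_metatags` helper are assumptions for illustration only; real metatag conventions vary, and this pattern would not handle HTML or XML.

```python
import re

# Assumed convention: <name=value>, with no nested angle brackets.
METATAG = re.compile(r"<(\w+)=([^<>]*)>")


def split_metatags(text):
    """Separate <name=value> metatags from the surrounding free text.

    Returns (tags, body): tags maps each attribute name to the list
    of values found; body is the text with the metatags removed.
    """
    tags = {}
    for name, value in METATAG.findall(text):
        tags.setdefault(name, []).append(value.strip())
    body = METATAG.sub("", text).strip()
    return tags, body


tags, body = split_metatags("<author=John Smith><place=Salem, MA> He was 67.")
print(tags)  # attribute names mapped to their values
print(body)  # the remaining unstructured text
```

The extracted attributes can then be stored in a record and queried separately from the full text.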

The Birth of the Semi-structured Data Construct

One could argue that the emergence of the “semi-structured data” construct arose from the confluence of a number of factors:

  • The emergence of the Web
  • The desire for extremely flexible formats for data exchange between disparate databases (and therefore useful for data federation)
  • The usefulness of expressing structured data in a semi-structured way for the purposes of browsing
  • The growth of certain scientific databases, especially in biology (esp., ACeDB), where annotations, attribute extensibility resulting from new discoveries, or a broader mix of structural and text data was desired.[8]

These issues first arose and received serious computer science study in the late 1970s and early 1980s. In the early years of trying to find standards and conventions for representing semi-structured data (though not yet called that), the major emphasis was on data transfer protocols.

In the financial realm, one proposed standard was electronic data interchange (EDI). In science, there were literally tens of exchange forms proposed with varying degrees of acceptance, notably abstract syntax notation (ASN.1), TeX (a typesetting system created by Donald Knuth, with variants such as LaTeX), hierarchical data format (HDF) and common data format (CDF), as well as commercial formats such as PostScript, PDF (portable document format) and RTF (rich text format).

One of these proposed standards was the “standard generalized markup language” (SGML), first published in 1986. SGML was flexible enough to represent either formatting or data exchange. However, with its flexibility came complexity. Only when two simpler forms arose, namely HTML (HyperText Markup Language) for describing Web pages and XML (eXtensible Markup Language) for data exchange, did variants of the SGML form emerge as widely used common standards.[9]

The XML standard was first published by the W3C in February 1998, rather late in this history and after the semi-structured data term had achieved some impact.[10] Dan Suciu was the first to publish on the linkage of XML to semi-structured data in 1998,[11] a reference that remains worth reading to this day.

In addition, the OEM (Object Exchange Model) has become the de facto model for semi-structured data. OEM is a graph-based, self-describing object instance model. It was originally introduced for the Tsimmis data integration project,[12] and provides the intellectual basis for object representation in a graph structure with objects either being atomic or complex.

How the attribute “metadata” is described and associated has itself been the focus of much standards work. Literally hundreds of description standards have been proposed, from discipline-specific ones such as MeSH in medical terminology, through law, physics and engineering, to cross-discipline proposed standards such as the Dublin Core. (Google these for a myriad of references.)

Challenges in Semi-structured Data

Semi-structured data, like all other data, need to be represented, transferred, stored, manipulated and analyzed, all possibly at scale and with efficiency. It is easy to confuse data representation with data use and manipulation. XML provides an excellent starting basis for representing semi-structured data. But XML says little or nothing about these other challenges in semi-structured data use:

  • Data heterogeneity — the subject of data heterogeneity in federated systems is extremely complex, and involves such areas as unit or semantic mismatches, grouping mismatches, non-uniform overlap of sets, etc. “Glad” may mean the same as “happy,” and the same quantity may be expressed in metric vs. English units. This area is complex enough to be a topic in its own right
  • Type inference — related to the above is the data type requiring resolution, for example, numeric data being written as text
  • Query language — actually, besides transfer standards, probably more attention has been given to query languages supporting semi-structured data, such as XQuery, than other topics. Remember, however, that a query language is the outgrowth of a data storage framework, not a determinant, and this distinction seems to be frequently lost in the semi-structured literature
  • Extensibility — inherent with the link to XML is the concept of extensibility with semi-structured data. However, it is important to realize that extensibility as used to date is in reference to data representation and not data processing. Further, data processing should occur without the need for database updates. Indeed, it is these latter points that provide a key rationale for BrightPlanet’s XSDM system
  • Storage — XML and other transfer formats are universally in text or Unicode, excellent for transferability but shitty for data storage. How these representations actually get stored (and searched, see next) is fundamental to scalable systems that support these standards
  • Retrieval — many have and are proposing native XML retrieval systems, and others have attempted to clone RDBMSs or search (text) engines for these purposes. Retrieval is closely linked to query language, but, more fundamentally, also needs to be speedy and scalable. As long as semi-structured retrieval mechanisms are poor-cousin add-ons to systems optimized for either structured or unstructured data, they will be poor performers
  • Distributed evaluation (scalability) — most semi-structured or XML engines work OK at the scale of small and limited numbers of files. However, once these systems attempt to scale to an enterprise level (of perhaps tens of thousands to millions of documents) or, god forbid, Internet scales of billions of documents, they choke and die. Again, data exchange does not equal efficient data processing. The latter deserves specific attention in its own right, which has been lacking to date
  • Order — consider a semi-structured data file transferred in the standard way (which is necessary) as text. Each transferred file will contain a number of fields and specifications. What is the efficient order of processing this file? Can efficiencies be gained through a “structural” view of its semi-structure? Realize that any transition from text to binary (necessary for engine purposes, see above), also requires “smart” transformation and load (TL) approaches. There is virtually NO discussion of this problem in the semi-structured data literature
  • Standards — while XML and its variants provide standard transfer protocols, the use of back-end engines for efficient semi-structured data processing also requires prescribed transfer standards in order to gain those efficiencies. Because the engines are still totally lacking, this next level of prescribed formats is lacking as well.
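
The type inference challenge listed above (numeric data written as text) can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `infer_value` helper is hypothetical, and a production system would need rules for dates, units, locales and much else.

```python
def infer_value(text):
    """Guess a typed value for a field that arrived as text.

    Tries int, then float, and otherwise returns the stripped string.
    """
    stripped = text.strip()
    for cast in (int, float):
        try:
            return cast(stripped)
        except ValueError:
            continue
    return stripped


# Hypothetical record as it might arrive in a text transfer format.
record = {"name": "John Smith", "age": "67", "weight_kg": " 81.5 "}
typed = {key: infer_value(value) for key, value in record.items()}
print(typed)
```

Even this tiny example shows why inference is hard to get right in general: Python's `float` also accepts strings such as "nan" and "inf", so a text field holding one of those words would silently become a number.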

Generally, most academic, open source, or other attention to these problems has been at the superficial level of resolving schema or definitions or units. Totally lacking in the entire thrust for a semi-structured data paradigm has been the creation of adequate processing engines for efficient and scalable storage and retrieval of semi-structured data. [13]

You know, it is very strange. Tremendous effort goes into data representations like XML, but when it comes to positing or designing engines for manipulating that data the approach is to clone kludgy workarounds on to existing relational DBMSs or text search engines. Neither meets the test.

Thus, as the semantic Web and its association to semi-structured data look forward, two impediments stand like gatekeepers blocking forward progress: 1) efficient processing engines and 2) scalable systems and architectures.


[1] D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman and J. Widom, “Querying Semistructured Heterogeneous Information,” presented at Deductive and Object-Oriented Databases (DOOD ’95), LNCS, No. 1013, pp. 319-344, Springer, 1995.

[2] M. Tresch, N. Palmer, and A. Luniewski, “Type Classification of Semi-structured Data,” in Proceedings of the International Conference on Very Large Data Bases (VLDB), 1995.

[3] Serge Abiteboul, “Querying Semi-structured data,” in International Conference on Data Base Theory (ICDT), pp. 1 – 18, Delphi, Greece, 1997. See http://dbpubs.stanford.edu:8090/pub/1996-19.

[4] Peter Buneman, “Semistructured Data,” in ACM Symposium on Principles of Database Systems (PODS), pp. 117 – 121, Tucson, Arizona, May 1997. See http://db.cis.upenn.edu/DL/97/Tutorial-Peter/tutorial-semi-pods.ps.gz

[5] Peter Wood, School of Computer Science and Information Systems, Birkbeck College, the University of London. See http://www.dcs.bbk.ac.uk/~ptw/teaching/ssd/toc.html.

[6] David Loshin, “Simple Semi-structured Data,” Business Intelligence Network, October 17, 2005. See http://www.b-eye-network.com/view/1761.

[7] This example is actually quite complex and demonstrates the challenges facing “entity extraction” software. Extracted entities most often relate to the nouns or “things” within a document. Note also, for example, how many of the entities involve internal “co-referencing,” or the relation of subjects such as “he” to times such as “10 a.m” to specific dates. A good entity extraction engine helps resolve these so-called “within document co-references.”

[8] Peter Buneman, “Semistructured Data,” in ACM Symposium on Principles of Database Systems (PODS), pp. 117 – 121, Tucson, Arizona, May 1997. See http://db.cis.upenn.edu/DL/97/Tutorial-Peter/tutorial-semi-pods.ps.gz

[9] A common distinction is to call HTML “human readable” while XML is “machine readable” data.

[10] W3C, XML Development History. See http://www.w3.org/XML/hist2002.

[11] Dan Suciu, “Semistructured Data and XML,” in International Conference on Foundations of Data Organization (FODO), Kobe, Japan, November 1998. See PDF option from http://citeseer.ist.psu.edu/suciu98semistructured.html.

[12] Y. Papakonstantinou, H. Garcia-Molina and J. Widom, “Object Exchange Across Heterogeneous Information Sources,” in IEEE International Conference on Data Engineering, pp. 251-260, March 1995.

[13] Matteo Magnani and Danilo Montesi, “A Unified Approach to Structured, Semistructured and Unstructured Data,” Technical Report UBLCS-2004-9, Department of Computer Science, University of Bologna, 29 pp., May 29, 2004. See http://www.cs.unibo.it/pub/TR/UBLCS/2004/2004-09.pdf.

NOTE: This posting is part of an occasional series looking at a new category that I and BrightPlanet are terming the eXtensible Semi-structured Data Model (XSDM). Topics in this series cover all information related to extensible data models and engines applicable to documents, metadata, attributes, semi-structured data, or the processing, storing and indexing of XML, RDF, OWL, or SKOS data. A major white paper will be produced at the conclusion of the series. Stay tuned!
Posted: October 31, 2005

As an entrepreneur who has now dealt with VCs for close to ten years, one phrase repeated more times than I care to recount has been, "The time for paying tuition is over; it’s time to show revenue multiples."

The first few times I heard this mantra, I accepted it without question. I know, as does everyone involved in a start-up, that revenue is goodness and messing around ("paying tuition") is badness. I think, in general, that shareholder and investor impatience for a quick return on capital is a proper and laudable expectation. If you’re in the big leagues, you need to either hit, field or pitch, or better still, multiples of these.

But neither technology nor markets are predictable.  Another statement frequently heard is "if you need to educate the market, your business model is wrong."  Another is "show me the way to $20 million annual revenues within the next XX months."

I ain’t a kid anymore, and I appreciate the demands for performance and results.  Starting up a business and spending other people’s money (not to mention my own and my family’s) to achieve returns is not for the fainthearted.  Fair enough.  And understood.

But the real disconnect is how to balance multiple factors. I think I appreciate the pressures on VCs for returns. I also understand their win some/lose some mentality. (Actually, what I don’t understand is the acceptance of the high failure rates of individual investments; something here is unsystematic and wrong; but I digress.)

But what I truly don’t understand is the application of mantras vs. a careful balance of positive and negative factors for a venture.  Excellent and innovative technology is often in search of proper applications and markets.  Excellent and innovative technology is often not initially mature for market acceptance.  Excellent and innovative technology is often misdirected by its founders until engagement with the market and customers helps refine features and product expressions.  Excellent and innovative technology is sometimes tasty cookie dough that  needs more time in the oven.

Presumably, as has been the case for my own ventures, the basis for investment has been excellent and innovative technology. We all know the standard recipe of market-technology-management that sprinkles every high-tech VC Web site. But, of course, and honestly and realistically, not all of these factors are in play when venture financing is sought. And, let’s face it, if they were in play, there would not be an interest by the entrepreneurs to dilute their ownership.

I suppose, then, that all players in a venture-financed start-up are subject to various forms of willful or self-deception. Entrepreneurs and VCs alike believe they have all the answers. And, of course, neither does.

What I have come to learn is that it is the market that has the answers, and sometimes that takes time to figure out.  Good diligence at the front end is warranted — after all, there needs to be the basis of some excellent foundations — as are mechanisms for "feeding out the line" of venture dollars and claw back and other egregious ways to lay off risk because bad choices are often made.  But what should not be acceptable, should not be perpetuated, is the expectation as to WHEN these returns will be achieved. 

There is simply no avoiding that new, innovative and sexy technology cannot always be precisely timed. Rather than railing about not paying more tuition, every VC that has done diligence and made a venture commitment should be cheering for more learning and more refinement. Begin with good partial foundations (be they technology-management-market) and applaud the tuition of learning and refinement. In the end, we never graduate; we hopefully progress to life-long learning.

"Longing gazes and worn out phrases won’t get you where you want to go.  No!"   -  Mamas and Papas   

Next up:  "The Myth of Superman"  

Posted by AI3's author, Mike Bergman Posted on October 31, 2005 at 9:46 pm in Software and Venture Capital | Comments (0)
The URI link reference to this post is: http://www.mkbergman.com/155/no-more-tuition/

Naveen Balani has recently published a good intro primer on the Semantic Web through IBM’s developerWorks entitled, "The Future of the Web is Semantic". Highly recommended.

Also, a recent paper on information retrieval and the semantic web was selected as the best paper in the 2005 HICSS mini-track on The Semantic Web: The Goal of Web Intelligence. The paper, by Tim Finin, James Mayfield, Clay Fink, Anupam Joshi, and R. Scott Cost, Information Retrieval and the Semantic Web, Proceedings of the 38th Hawaii International Conference on System Sciences, January 2005, is also available as a PDF download. I recommend this to anyone interested in the application of Semantic Web concepts to traditional Internet search engines.

Posted by AI3's author, Mike Bergman Posted on October 31, 2005 at 11:34 am in Adaptive Information, Searching, Semantic Web | Comments (0)
The URI link reference to this post is: http://www.mkbergman.com/152/two-recent-semantic-web-papers/
Posted: October 30, 2005

There was an interesting exchange between Martin Nisenholtz and Tim O’Reilly at a recent Union Square Session on the topic of peer production and open data architectures.  Martin was questioning how prominent “winners” like Wikipedia may prejudice our view of the likelihood of Web winners in general.  Here’s the exchange:

NISENHOLTZ:  I sort of call it the lottery syndrome.  There was a Powerball lottery yesterday.  Tons of people entered it.  We know that someone won in Oregon . . . we also know that the chances of winning  were one in 164 million . . . .I guess what I’m struggling with is how we measure the number of peer production efforts that get started versus Wikipedia, which has become the poster child, the lottery, the one in 164 million actually works.  Now it may not be one in 164 million.  It may be one in 10.  It may be one in 50, but I think that groups of people like [prominent Web thinkers] tend to create the lottery winner and hold the lottery winner up as the norm.

O’REILLY:    Look at Source Forge, there’s something like 104,000 projects on Source Forge.  You can actually do a long tail distribution and figure out how many of them — but … I would guess that one in like … 154 million are probably out of those 100,000 projects, there are probably, you know, at least 5,000 who have made significant reputation gains as a result of their work.  Maybe more. But, again, somebody should go out and measure that.

It just so happens that I did that SourceForge project analysis in June. It is mostly still relevant since it is only a few months old. That info is reproduced below.

Strong Growth for Open Source Projects

In open source there are some big visibility winners and lots of activity. (For an excellent overview of the leading and successful open source projects, see Uhlman.[1]) The number of these projects has grown rapidly, increasing by about 30% to 100,000 projects in the past year alone. However, like virtually everything else, the relative importance or use of open source projects tends to follow standard power curve distributions.

The truly influential projects only number in the hundreds, as figures from SourceForge, a clearinghouse solely devoted to open source projects, indicate. There is a high degree of fluctuation, but as of May 2005 there were on the order of perhaps 13 million total software code downloads per week from SourceForge (A). Though SourceForge statistics indicate it has some 100,000 open source projects within its database, in fact fewer than half of those have any software downloads, only 1.7% of the listed projects are deemed mature, and only about 15,000 projects are classified as production or stable.[2]

But power curve distributions indicate that an even smaller number of projects accounts for most activity. For example, the top 100 SourceForge projects account for 60% of total downloads, with the top two, Azureus and eMule, alone accounting for about one-quarter of all downloads. Indeed, to even achieve 1,000 downloads per day, a SourceForge open source project must be within the top 150 projects, or just 0.2% of those active or 0.1% of total projects listed.[3]

Similar trends are shown for cumulative downloads. Since its formation in 2000, software code downloads from SourceForge have totalled nearly one billion (actually, an estimated 892 million as of May 2005) (B, logarithmic scale). Again, however, a relatively small number of projects has dominated.

For example, 60% of all downloads throughout the history of SourceForge have occurred for the 100 most active projects. It can be reasonably defended that the number of open source projects with sufficient reach and use to warrant commercial attention probably total fewer than 1,000.
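
Concentration figures like those above (for example, the top 100 projects accounting for 60% of downloads) are simple to compute once per-project download counts are in hand. The sketch below uses made-up counts, not SourceForge data, and the `top_share` helper is my own naming.

```python
def top_share(downloads, n):
    """Fraction of all downloads captured by the n most-downloaded projects."""
    ranked = sorted(downloads, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    return sum(ranked[:n]) / total


# Made-up weekly download counts for five hypothetical projects.
sample = [50, 30, 10, 5, 5]
print(f"Top 2 of {len(sample)} projects: {top_share(sample, 2):.0%} of downloads")

# A synthetic long tail: the project at rank r gets about 1/r of the leader.
synthetic = [100_000 // rank for rank in range(1, 10_001)]
print(f"Top 100 of {len(synthetic)}: {top_share(synthetic, 100):.0%} of downloads")
```

The 1/rank distribution in the second example is only an assumption chosen to mimic a power curve; the real SourceForge distribution would have to be measured.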

Open Source is Not the Same as Linux

Some observers, such as the Open reSource site[4], tend to equate open source with the Linux operating system and all aspects around it. While it is true that Linux was one of the first groundbreakers in open source and is the operating system with the largest open source market share, that is still only about one-half of all projects according to SourceForge statistics.

Windows projects have been growing in importance, along with Apple. In terms of programming languages, various flavors of C, followed by the ‘P’ languages (PHP, Python, Perl) and Java are the most popular. Note, however, that many projects combine languages, such as C for core engines and PHP for interfaces. Also note that many projects have multiple implementations, such as support for both Linux and Windows installations and perhaps PHP and Perl versions. Finally, the popularity of the combination of Linux, Apache, MySQL and the P languages has earned many open source projects the LAMP moniker. When Linux is replaced by Windows this is sometimes known as WAMP, or with Java it’s known as LAMJ.

Because of the diversity of users, larger and more successful projects tend to have multiple versions.

Few Active Developers Support Most Projects

Despite source code being open and developers invited for participation, most mature open source projects in fact receive little actual development attention and effort from outsiders. Entities that touch and get involved in an open source project tend to form a pyramid of types. This pyramid, and the types of entities that become involved from the foundation upward, can be characterized as:

  • Users — by far the largest category, users simply want use of no cost software or some comfort the code base is available (as below)
  • Serious downloaders — there is an active class of Internet users that spend considerable time downloading application, game or other software, installing it, and then removing it and moving on. The motivations for this large software grazing class vary. Some are interested in seeing new software ideas, installation methods, user interfaces and the like; some are consultants or pundits that want to be current with new systems and trends; others simply are the Internet equivalent of serial mall shoppers. Whatever the motivation, this class of users acts to inflate download statistics, and sometimes may be key influence makers or spreaders of word-of-mouth, but are unlikely to establish a lasting relationship with a project
  • Linkers and embedders — these users are at the serious end of the actual user group and have clear ideas about needed functionality and will expend considerable effort to link or embed a promising new open source project into their current working environment. This level of engagement requires a considerable amount of effort and acts to increase the switching costs of later moving away from the project
  • Extenders — these individuals create the wrappers and other APIs for establishing interoperability and use between existing components in currently disparate environments (Apache, IIS or Tomcat; Windows, Linux; PHP, Perl or Java, etc.) or critically bring the project to other languages, human or programmatic. They are perhaps the most attractive group of users from a project influence standpoint. This category is the major source of external innovation
  • Active developers — this is the standard assumed class of developers who actually sign up and do major work on the initial project. But surprisingly few developers participate in this category, and this category, like the next one, is close to non-existent for open source projects that follow the license choice model as proposed for BrightPlanet below
  • Code forkers — some mature and larger visibility open source projects (not including the license choice model) may witness a major breakaway in development. This can occur because of some differences in philosophy (some of the Linux variants), loss of interest by the original sponsor (HTMLarea WYSIWYG editor, for example), or branching to different programming languages (many of the CMS variants). Code forking can be a source of innovation and use expansion, but also can serve to kill the original branch and leave existing users at a dead end.

Most effort around successful open source projects is geared to extending the environments or interoperability of those projects with others — both laudable objectives — rather than fundamental base code progression.

Mature Projects are Stable, Scalable, Reliable and Functional

David Wheeler has maintained the major summary site for open source performance statistics and studies for many years.[5] In compiling literally hundreds of independent studies, Wheeler observes that “OSS/FS [open source software/free software] . . . is often the most reliable software, and in many cases has the best performance. OSS/FS scales, both in problem size and project size. OSS/FS software often has far better security, perhaps due to the possibility of worldwide review. Total cost of ownership for OSS/FS is often far less than proprietary software, especially as the number of platforms increases.” However, while obviously an advocate, Wheeler is also careful to not claim these advantages across the board or for all open source projects.

Indeed, most of the studies cited by Wheeler obviously deal with that small subset of mature open source projects, and often surrounding Linux and not necessarily some of the new open source projects moving towards applications.

Probably the key point is that even though there may be ideological differences between advocates for or against open source, there is nothing inherent in open-source software that would make it inferior or superior to proprietary software. Like all other measures, the quality of the team behind an initiative is the driving force for quality as opposed to open or closed code.


[1] D. Uhlman, Open Source Business Applications, see http://www.socallinuxexpo.org/presentations/david_uhlman_scale3x.pdf

[2] I’d like to thank Matt Asay for pointing the way to digging into SourceForge statistics. It is further worth recommending his “Open Source and the Commodity Urge: Disruptive Models for a Disruptive Development Process,” November 8, 2004, 17 pp., which may be found at: http://www.open-bar.org/docs/matt_asay_open_source_chapter_11-2004.pdf

[3] Of course, downloads may occur at other sites than SourceForge and there are other proxies for project importance or activity, such as pageviews, the measure that SourceForge itself uses. However, as the largest compilation point on the Web for open source projects, the SourceForge data are nonetheless indicative of these power curve distributions.

[4] See Open reSource http://sterneco.editme.com/

[5] D.A. Wheeler, Why Open Source Software / Free Software (OSS/FS, FLOSS, or FOSS)? Look at the Numbers!, version updated May 5, 2005. See http://www.dwheeler.com/oss_fs_why.html. The paper also has useful summaries of market information and other open source statistics.

Posted by AI3's author, Mike Bergman Posted on October 30, 2005 at 12:49 pm in Open Source, Software and Venture Capital | Comments (0)
The URI link reference to this post is: http://www.mkbergman.com/148/the-lottery-syndrome-and-recent-open-source-statistics/