There was an interesting exchange between Martin Nisenholtz and Tim O’Reilly at a recent Union Square Session on the topic of peer production and open data architectures. Martin was questioning how prominent “winners” like Wikipedia may prejudice our view of the likelihood of Web winners in general. Here’s the exchange:
NISENHOLTZ: I sort of call it the lottery syndrome. There was a Powerball lottery yesterday. Tons of people entered it. We know that someone won in Oregon . . . we also know that the chances of winning were one in 164 million . . . .I guess what I’m struggling with is how we measure the number of peer production efforts that get started versus Wikipedia, which has become the poster child, the lottery, the one in 164 million actually works. Now it may not be one in 164 million. It may be one in 10. It may be one in 50, but I think that groups of people like [prominent Web thinkers] tend to create the lottery winner and hold the lottery winner up as the norm.
O’REILLY: Look at Source Forge, there’s something like 104,000 projects on Source Forge. You can actually do a long tail distribution and figure out how many of them — but … I would guess that one in like … 154 million are probably out of those 100,000 projects, there are probably, you know, at least 5,000 who have made significant reputation gains as a result of their work. Maybe more. But, again, somebody should go out and measure that.
It just so happens that I had recently done that SourceForge project analysis in June, which is mostly still relevant since only a few months old. That info is reproduced below.
Strong Growth for Open Source Projects
In open source there are some big visibility winners and lots of activity. (For an excellent overview of the leading and successful open source projects, see Uhlman.) The numbers of these projects have grown rapidly, increasing by about 30% to 100,000 projects in the past year alone. However, like virtually everything else, the relative importance or use of open source projects tend to follow standard power curve distributions.
The truly influential projects only number in the hundreds, as figures from SourceForge, a clearinghouse solely devoted to open source projects, indicate. There is a high degree of fluctuation, but as of May 2005 there were on the order of perhaps 13 million total software code downloads per week from SourceForge (A). Though SourceForge statistics indicate it has some 100,000 open source projects within its database, in fact fewer than half of those have any software downloads, only 1.7% of the listed projects are deemed mature, and only about 15,000 projects are classified as production or stable.
But power curve distributions indicate even a much smaller number of projects account for most activity. For example, the top 100 SourceForge projects account for 60% of total downloads, with the top two, Azureus and eMule, alone accounting for about one-quarter of all downloads. Indeed to even achieve 1000 downloads per day, a SourceForge open source project must be within the top 150 projects, or just 0.2% of those active or 0.1% of total projects listed.
Similar trends are shown for cumulative downloads. Since its formation in 2000, software code downloads from SourceForge have totalled nearly one billion (actually, an estimated 892 million as of May 2005) (B, logarithmic scale). Again, however, a relatively small number of projects has dominated.
For example, 60% of all downloads throughout the history of SourceForge have occurred for the 100 most active projects. It can be reasonably defended that the number of open source projects with sufficient reach and use to warrant commercial attention probably total fewer than 1,000.
Open Source is Not the Same as Linux
Some observers, such as for example the Open reSource site, tends to equate open source with the Linux operating system and all aspects around it. While it is true that Linux was one of the first groundbreakers in open source and is the operating system with the largest open source market share, that is still only about one-half of all projects according to SourceForge statistics:
Windows projects have been growing in importance, along with Apple. In terms of programming languages, various flavors of C, followed by the ‘P’ languages (PHP, Python, Perl) and Java are the most popular. Note, however, that many projects combine languages, such as C for core engines and PHP for interfaces. Also note that many projects have multiple implementations, such as support for both Linux and Windows installations and perhaps PHP and Perl versions. Finally, the popularity of the Linux – Apache – MySQL and P languages have earned many open source projects the LAMP moniker. When replaced by Windows this is sometimes known as WAMP or with Java its known as LAMJ:
Because of the diversity of users, larger and more successful projects tend to have multiple versions.
Few Active Developers Support Most Projects
Despite source code being open and developers invited for participation, most mature open source projects in fact receive little actual development attention and effort from outsiders. Entities that touch and get involved in an open source project tend to form a pyramid of types. This pyramid, and the types of entities that become involved from the foundation upward, can be characterized as:
Most effort around successful open source projects is geared to extending the environments or interoperability of those projects with others — both laudable objectives — rather than fundamental base code progression.
Mature Projects are Stable, Scalable, Reliable and Functional
David Wheeler has maintained the major summary site for open source performance statistics and studies for many years. In compiling literally hundreds of independent studies, Wheeler observes that “OSS/FS [open source software/free software] . . . is often the most reliable software, and in many cases has the best performance. OSS/FS scales, both in problem size and project size. OSS/FS software often has far better security, perhaps due to the possibility of worldwide review. Total cost of ownership for OSS/FS is often far less than proprietary software, especially as the number of platforms increases.” However, while obviously an advocate, Wheeler is also careful to not claim these advantages across the board or for all open source projects.
Indeed, most of the studies cited by Wheeler obviously deal with that small subset of mature open source projects, and often surrounding Linux and not necessarily some of the new open source projects moving towards applications.
Probably the key point is that even though there may be ideological differences between advocates for or against open source, there is nothing inherent in open-source software that would make it inferior or superior to proprietary software. Like all other measures, the quality of the team behind an initiative is the driving force for quality as opposed to open or closed code.
 I’d like to thank Matt Asay for pointing the way to digging into SourceForge statistics. It is further worth recommending his “Open Source and the Commodity Urge: Distruptive Models for a Distruptive Development Process,” November 8, 2004, 17 pp., which may be found at: http://www.open-bar.org/docs/matt_asay_open_source_chapter_11-2004.pdf
 Of course, downloads may occur at other sites than SourceForge and there are other proxies for project importance or activity, such as pageviews, the measure that SourceForge itself uses. However, as the largest compilation point on the Web for open source projects, the SourceForge data are nonethless indicative of these power curve distributions.
 D.A. Wheeler, Why Open Source Software / Free Software (OSS/FS, FLOSS, or FOSS)? Look at the Numbers!, versuion updated May 5, 2005. See http://www.dwheeler.com/oss_fs_why.html. The paper also has useful summaries of market informatin and other open source statistics.