Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary

or, Why is It Always so Easy to Eat Our Young?

In the last couple of weeks I have observed two of the more exciting open source initiatives around — the bibliographic and research citation plug-in Zotero from George Mason University's Center for History and New Media (CHnM) and the semantic Web efforts from MIT's Simile program — juggle some really interesting problems that result from academic institutions committing to and promoting open source projects. Both cases have caused me to raise my appreciation for the efforts involved. Both cases have caused me to temper my expectations from the programs. And both cases perhaps offer some lessons and cautions for others considering entering the bazaar.

Early Growing Pains

Zotero, though still quite young as a project, is tackling a pragmatic issue of citation and research management of interest to virtually every academic and researcher. While capable commercial bibliographic and citation software has existed for many years, there remains much pent-up interest in the community regarding features, standards and ease-of-use. This common need, plus the laudable initial efforts of the project itself, has led to quick adoption and community scrutiny.

Such promise generates excitement and early evangelists. One of those evangelists, however, frustrated with the program’s pace and responsiveness, went public with a call to better embrace the “bazaar” aspects of open source described by Raymond (see this reference and its comments). That post and its comments, in turn, spurred a response from one of the Zotero team leaders, speaking unofficially, titled Cathedrals and Bazaars, also in reference to Raymond’s treatise.

Now, granted, this is not my community and I am a total outsider, but there is, I feel, a dynamic here at work that I believe deserves broader attention. Let me get to that in a moment, after citing another example . . . .

An “Army of Programmers”

As any casual reader of my blog has observed, I have been touting MIT's Simile program for the past couple of weeks (they’ve actually deserved the praise for years, but I just had not been aware of them). As part of my interest and involvement, I took the unusual steps for me of joining the project’s mailing list, downloading code, writing to systems, blogging about it, participating on the mailing list, and other atypical efforts. Because of some recent innovations from the program I have discussed elsewhere (in fact more than once), there has been a real spike of interest in the program and general scrutiny and activity from many others besides me.

Like George Mason, MIT is of course an educational research institution committed to training new intellectual leaders and moving their fields forward. Both institutions are doing so with pragmatic and (at times) cutting-edge tools, all offered in the spirit of open source and community involvement. This is not unusual within the academic community, but, on the other hand, is not common, and most often is not done well with respect to real, usable software. Both of these programs do it well.

Committing to this involvement is not trivial. There are wikis providing documentation; there are forums at which ideas get spun and new users get answers; there are blogs to inform the (growing, burgeoning) community of future plans and directions, and (oh, by the way), there are real jobs like teaching, committee involvement, thesis writing, and the real aspects of being either a professor or student.

Meanwhile, if your project is successful within the broader community, those very same forums and blogs and wikis become the venue for any Internet user to request information, assistance, criticisms or new features. To that point, within the MIT program, the shorthand over the past days in dealing with requests for new features has been a call to a mythical “army of programmers.” Right on . . . . This software has, in fact, been developed largely on graduate assistant time and salaries (if such) from a very few individuals.

So, yes, with openness and a successful offering comes the old Chinese curse, and not enough time in the day.

People such as me — not a part of the community but looking globally for answers — jump onto these projects with glee and focus: We’re seeing some really cutting edge stuff being developed by the brightest in their fields! Non-participants from within the community see these efforts and may like (perhaps, even envy) what they see, and may also want to help in the spirit of their communities.

If the project is compelling, the result, of course, is attention, scrutiny and demands. Excitement and interest drive the engine. Promise is now evident, and all who have seen it feel a part of the eventual success of the dream, even though not all (especially those on the outside, whether academics or not) are in a position to make that dream come real. And most do not truly appreciate what it has taken to get the project to its current point.

So, What’s My Take?

There is a real excitement to goosing the tiger. There is also so much effort to even get close to doing so that few (except those on the inside) can appreciate. The people on the Zotero and Simile projects (and the many like them) know the long hours; are posting answers on forums at 3 am; are patient with the newbies; are struggling with project demands that have nothing at all to do with tenure or getting their degrees; and are juggling a myriad of other demands like families and spouses and other interests.

Yet, that being said, since these fortunate few hold the batons of visible, successful projects at the top of the power curve, I actually don’t truly feel sorry for them. My real point for those of us that constitute the vast majority of outsiders is to be realistic about what we can expect from open source projects run by essentially graduate student labor or, if via paid staff, ones that are likely understaffed and underpaid (though clearly committed) and overworked.

Watching these two exemplary programs has been educational — in the true sense of the word — but also instructive in what it takes to conduct an open source and “open kimono” initiative.

What It Takes to be Successful

I have managed many large-scale, multi-year software projects, but only one in academia and none open source. Any professional software development manager would be amazed at what it now takes to be successful in this arena.

You need, first, of course, to develop the software. That has its standard challenges, but is not so easy within an academic setting since the components themselves need to be best-of-breed, open source, and cutting edge. The project itself may need to be thesis-worthy. Though not imperative, the project choice should be of broad community interest, rightfully anticipating potential job prospects for participating students. Then, the code needs to be magically produced. (I find it truly amazing how many non-coders so cavalierly look to major software projects and apparently dismiss the person-years of efforts already embedded in the code.) Finally, and most difficultly, the resulting software must be posted as open source, with all of the attendant community demands that imposes. In fact, it is this last requirement that is often the hidden, frozen banana.

Obviously, any successful project needs to address a latent need with a sufficient user base. This could be one of thousands of end users or simply dozens of prominent influencers. And, of course, much of open source is tool- or function-based, so that formulation of a successful initiative by no means requires developing an application in the conventional sense. Indeed, successful projects may span from a relatively complete “app” such as Zotero to code libraries or parsers or even standards development. Yet independent of the code base itself, some essential moving parts do seem to need to be in place for a project to be successful:

  • The Software — it begins here. The software itself must address a need and be written in a current language and architecture, all of which may require sophisticated familiarity with contemporary practices, languages and code bases
  • Documentation — arguably the lack of documentation is the biggest cause of failure for open source projects. While this sounds self-evident, in fact its importance is generally overlooked. The bane of a successful open source project is the demand of the community for new features and functionality, all potentially with inadequate project resources. The only way this conundrum can be broken is through active engagement of the community, which directly requires documentation in order for outsiders to understand the code and then be able to make contributions
  • Wikis and Blogs — effective means for engaging the user (and influencer) community is essential. Besides the time needed to set up the infrastructure, these media require daily care and feeding
  • Forums and Mailing Lists — same as for wikis and blogs, though additionally a nearly full-time moderator is required (could multi-task in other areas if demand is low). New users are constantly discovering the project and asking many of the same familiarization questions over and over. An active forum means that many existing participants can feed the discussion, but again, constant care and feeding of documentation and FAQs is important to reduce duplicated responses
  • Code Management and Tracking — assuming a minority of users desires and is qualified to test or modify code, the general management of this process requires policies, and authorization and version control with enterprise-grade code level management tools (such as Jira, Trac and Subversion, among many)
  • Time — of course, all of this takes time, often robbed from other priorities.

And finally, and most importantly, the project participants need:

  • Grace — it is likely not comfortable for many to open themselves up to the involvement and scrutiny of an open source initiative. Outsiders bring many passions, motivations, levels of knowledge and sophistication; and some see the promise or excitement and want to truly become part of the internal community. Juggling these disparate forces takes an awful lot of grace and patience. Anyone who tackles this job has my admiration . . . .

Why I Like These Two Projects

I am an omnivore for finding new, exciting projects. I have specific aims in mind, and they are (generally) different than what is motivating the specific project developers. Of the many hundreds of projects I have investigated, I think these two are among the best, but for different reasons and with different strengths.

The Zotero project, though early in the process, has all of the traits to be an exemplar in terms of documentation, professionalism, openness and active outreach to its communities. I take the criticisms from some in the community (motivated, I believe, by good will) to be a result of possible frustrations regarding pent-up needs and expectations, rather than the project’s poor execution or arrogance. I think posing the discussion as the dialectic of the cathedral v. the bazaar is silly and does the project and its sponsors a disservice. What looks to be going on is simply the real world of the open-source sausage factory in the face of constraints.

As for Simile, we are seeing true innovation being conducted in real time in the open. And while some of these innovations have practical applications today, many are still cutting edge and not fully vetted. With the program’s stated aims of addressing emerging computer science challenges, this tension will remain and is healthy. Criticisms of the Simile efforts as “research programming”, I think, miss the boat. If you want to know what is going to be in your browser or influencing your Internet experience a few years hence, look to Simile (among others).

A Final Thought

Now that my spree of global searching for software ideas is somewhat winding down, I am truly taken with the sea change that open source and its spur to innovation is bringing to us. Costs are declining, barriers to entry are lowering, time to completion is shortening, ideas and innovations are blossoming, interoperability is improving, and our world is changing — and for the better. Zotero and Simile are amongst the young that deserve to be nurtured, not eaten.

Zotero is Perhaps the Most Polished Firefox Extension Yet

Zotero, first released in October, is perhaps the best Firefox extension that most users have never heard of, unless you are an academic historian or social scientist, in which case Zotero is becoming quite the rage. It is also percolating into other academic fields, including law, math and science.

Zotero is a complete research citation platform, or what its developer, George Mason University’s Center for History and New Media (CHnM), calls, “The next-generation research tool.” Zotero allows academics and researchers to extract, manage, annotate, organize, export and publish citations in a variety of formats and from a variety of sources — all within the Firefox browser, and all while obviously the user is interacting directly with the Web.

What it Is

Like all Firefox extensions, Zotero is easy to install. From the Firefox add-on site or the Zotero site itself, a single click downloads the app to your browser. Upon a prompted re-start the app is now active. (Later alerts for any version upgrades are similarly automatic — as for any Firefox extension.)

Upon installation, Zotero puts an icon in your status bar and places new options on menus. When you encounter a site that Zotero supports (currently, mostly university libraries, but also Amazon and major publication outlets as well, totaling more than 150; here is a listing of Zotero’s so-called translators), you will see a folder symbol in your address bar telling you Zotero is active. A single click downloads the citations from that site automatically to your local system.

Citations have traditionally been one of the more “semantically challenging” data sets, with variations in style, order, format, presentation, coverage and you name it rampant. The fact that Zotero supports a given source site means that it understands these nuances and is ready to store the information in a single, canonical representation. Once downloaded, this citation representation can now be easily managed and organized. More importantly, you can now export this internal, standard representation into a multitude of export formats (including, most recently, MS Word). In short, like for-fee citation software in the past, Zotero now provides a free and universal mechanism for managing this chaos.
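To make the "single, canonical representation" idea concrete, here is a minimal sketch in Python. It is not Zotero's code or data model: the `Citation` class, its fields, and the two exporter functions are all hypothetical names chosen for illustration. The point is only that once every source is normalized into one record shape, each export format becomes a small function over that shape.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    """One canonical record, whatever site it came from (hypothetical fields)."""
    authors: list[str]
    title: str
    year: int
    publisher: str = ""


def to_bibtex(c: Citation, key: str) -> str:
    """Render the canonical record as a minimal BibTeX @book entry."""
    fields = {
        "author": " and ".join(c.authors),
        "title": c.title,
        "year": str(c.year),
    }
    if c.publisher:
        fields["publisher"] = c.publisher
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@book{{{key},\n{body}\n}}"


def to_plain(c: Citation) -> str:
    """Render the same record as a simple author-date text citation."""
    return f"{', '.join(c.authors)} ({c.year}). {c.title}."


record = Citation(["Eric S. Raymond"], "The Cathedral and the Bazaar", 1999, "O'Reilly")
print(to_bibtex(record, "raymond1999"))
print(to_plain(record))
```

Adding another output style under this design means writing one more function like `to_plain`; the scraping side never has to change.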

While the address icon acts to download one or more citations (yes, they also work in groups if there are multiple listings on the page!), choosing the Zotero icon itself invokes the full Zotero as an app within the browser, as this screen shot shows:

The left panes provide organization and management and tag support; the middle pane shows the active resources; and the right pane shows the structure associated with the active citation. This is all supported with attractive icons and logical tooltips and organization.

Zotero also offers utilities for creating your own scrapers (“translators”) for new sites not yet in the standard, supported roster. This capability is itself an extension to Zotero, called Scaffold, that also points to the building block nature of the core app. (Other utilities such as Solvent from MIT or others surely to come could either enhance or replace the current Scaffold framework.)

What is Impressive

Though supposedly in “beta,” Zotero already shows a completeness, sophistication and attention to detail not evident in most Firefox extensions. Indeed, this system approaches a complete application in its scope and professionalism. The fact it can be so easily installed and embedded in the browser itself is worth noting.

Firefox extensions have continuously evolved from single-function wonders to crude apps and now, as Zotero and a handful of other extensions show, complete functional applications. And, like OSes of the past, these extensions also adhere to standards and practices that make them pretty easy to use across applications. Firefox is indeed becoming a full-fledged platform.

This system is also using the new SQLite local database function (“mozStorage”) in Firefox 2.x to manage the local data (perhaps one of the first Firefox extensions to do so). This provides a clean and small install footprint for the extension, as well as opens it up to other standard data utilities.
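For readers who have not worked with an embedded SQLite store, the following is a rough sketch of the pattern, written with Python's standard `sqlite3` module rather than Firefox's mozStorage interface. The table names, columns and the `titles_tagged` helper are my own assumptions for illustration and are not Zotero's actual schema.

```python
import sqlite3

# An in-memory database stands in for the extension's local database file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT NOT NULL, year INTEGER)"
)
conn.execute(
    "CREATE TABLE tags (item_id INTEGER REFERENCES items(id), tag TEXT NOT NULL)"
)

# Store one citation and attach two tags to it (hypothetical sample data).
cur = conn.execute(
    "INSERT INTO items (title, year) VALUES (?, ?)",
    ("The Cathedral and the Bazaar", 1999),
)
item_id = cur.lastrowid
conn.executemany(
    "INSERT INTO tags (item_id, tag) VALUES (?, ?)",
    [(item_id, "open source"), (item_id, "software")],
)
conn.commit()


def titles_tagged(tag: str) -> list[str]:
    """Return the titles of all items carrying the given tag, sorted by title."""
    rows = conn.execute(
        "SELECT items.title FROM items JOIN tags ON tags.item_id = items.id "
        "WHERE tags.tag = ? ORDER BY items.title",
        (tag,),
    )
    return [title for (title,) in rows]


print(titles_tagged("open source"))
```

Because the data sits in an ordinary SQLite database, any standard SQLite utility can read it, which is the openness to "other standard data utilities" noted above.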

What it Implies

So, beyond its own considerable capabilities, Zotero exemplifies some important implications. First, full-bodied apps, building on many piece-parts, can now be expected around the Firefox platform. (Indeed, I earlier noted the emergence of such “Web OS” prospects as Parakey, whose developers also come from earlier Firefox legacies. One of those developers, Joe Hewitt, is also the author of the impressive Firebug extension.)

Second, the openness of Firefox for web-centric computing will, as I’ve stated before, continue to put competitive pressure on Microsoft’s Internet Explorer. This is good for all users at large and will continue to spur innovation.

Third, the pending version 2.0 of Zotero is slated to have a server-side component. What we are potentially seeing, then, are local client-side instantiations in the browser that can then communicate with remote data servers. This opens up a wealth of possibilities in social networking and collaboration.

And, last, and more specific to Zotero itself (but also enabled with Firefox’s native RDF support), we are now seeing a complete app framework for dealing with structured information and tagging on the Web. While clearly Zotero has a direct audience for citation management and research, the same infrastructure and techniques used by the system could become a general semantic Web or data framework for any other structured application.

Hmmm. Now that sounds like an opportunity . . . .

Posted: January 24, 2007

Structure for the Masses


Instant Mashups between WordPress, Google Spreadsheets and Exhibit

The past couple of days have seen a flurry of activity and much excitement revolving around a new “database-free” mashup and publication system called Exhibit. Another in a string of sneaky-cool software from MIT’s Simile program (and written by David Huynh, a pragmatic semantic Web developer of the first order), Exhibit (and its sure-to-follow rapid innovations) will truly revolutionize Web publishing and the visualization and presentation of structured data. Exhibit is quite simply “structure for the masses.”

What is It?

With just a few simple steps, even the most novice blog author can now embed structured data — such as sortable and filtered table displays, thumbnails, maps, timelines and histograms — in her blog posts and blog pages. Using Exhibit, you can now create rich data visualizations of web pages using only HTML (and optional CSS and Javascript code).

Exhibit requires no traditional database technology, no server-side code, and no need for a web server. Here is a sampling of Exhibit’s current capabilities:

  • No external databases or hassles
  • Data filtering and sorting
  • Simple HTML embeds and calls
  • Automatic and dynamic page layouts and results rendering
  • Completely tailorable (with CSS and Javascript)
  • Direct updates and presentations (mashups) from Google spreadsheets
  • Pre-prepared timelines, map mashups, tabular display options, and Web page formatting
  • Easily embedded in WordPress blogs (see the first tutorial here).

Exhibit is as simple as defining a spreadsheet; after that you have a complete database! And, if you want to get wild and crazy with presentation and display, then that is easy as well!

What Are Some Examples?

Though Exhibit has been released barely one month, already there are some pretty impressive examples:

What Are People Saying?

Granted, we’re only talking about the last 24 hours or so, but interesting people are noticing and commenting on this phenomenon:

  • Ajaxian: “Exhibit is a new project that lets you build rich sorting and filtering data applications in a simple way.”
  • Danny Ayers: “Although the person using these tools doesn’t need to think about gobbledegook like RDF, when they use the tools, they are putting first class data on the web in a Semantic Web-friendly fashion.”
  • David Huynh: “Now, we’re not just talking about the usual suspects of domains like photos and bookmarks. We’re talking about any arbitrary domain of data — Yes, the real world data. The data that people care about. I am hoping that we can create tools that let the average blogger easily publish structured data without ever having to coin URIs or design or pick ontologies — But there it is: this is, in my humble opinion, a beginning of something great.”
  • Kyler: “If you don’t see the value of this, you are a fool.”
  • Derek Kinsman: “Exhibit is an amazing web app … I am beginning to work alongside my WordPress mates in the hopes that we can create some sort of Administration area inside WordPress that connects to the Google accounts. Right inside WP. Also, we’re attempting to create some sort of plugin or downloadable template to which Exhibit can run mostly out of the box.”

What is Coming?

Johan Sundström has created an Instant Google Spreadsheets Exhibit, which lets you turn any Google spreadsheet (with certain formatting requirements) into an “exhibit” just by pasting in its public feed URL with immediate faceted browsing; maps and timelines are forthcoming.

Well, a WordPress plug-in is in the works (to be announced, with Derek helping to take the lead on it). Though incorporation into a blog is easy, it does require the author to have system administration rights and access to the WordPress server. A plug-in could remove those hurdles and make usage still easier.

Exhibit’s very helpful online tutorials are being expanded, particularly with more examples and more templates. For those seriously interested in the technology, definitely monitor the Simile project site.

There continues to be activity and expansion of the Babel translation formats. You can now convert BibTeX, Excel, Notation 3 (N3), RDF/XML or tab-separated values (TSV) to a choice of Exhibit JSON, N3 or RDF/XML. And, since Exhibit itself internally stores its data representation as triples, it is tantalizing to think that another Simile project, RDFizers, with its impressive storehouse of RDF converters, may also be more closely tied with Babel. Is it possible that Exhibit JSON may become the lingua franca of small-scale data representation formats?
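As a rough illustration of what a TSV-to-JSON conversion and a triple-based view of the same data might look like, here is a small Python sketch. This is not Babel's code. I am assuming an output layout of a top-level `items` list whose entries each carry a `label`, and the function names `tsv_to_items` and `items_to_triples` are hypothetical.

```python
import csv
import io
import json


def tsv_to_items(tsv_text: str) -> dict:
    """Turn tab-separated rows (first row = column names) into {"items": [...]}."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {"items": [dict(row) for row in reader]}


def items_to_triples(doc: dict) -> list[tuple[str, str, str]]:
    """Flatten each item into (subject, property, value) triples keyed by its label."""
    triples = []
    for item in doc["items"]:
        subject = item["label"]
        for prop, value in item.items():
            if prop != "label":
                triples.append((subject, prop, value))
    return triples


# Sample rows as they might be pasted from a spreadsheet.
sample = (
    "label\ttype\tterm\n"
    "George Washington\tPresident\t1789\n"
    "John Adams\tPresident\t1797\n"
)
doc = tsv_to_items(sample)
print(json.dumps(doc, indent=2))
print(items_to_triples(doc))
```

Note that every value stays a string in this sketch; a real converter would presumably also need to deal with types, multiple values per cell, and identifiers.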

And, within the project team of Huynh and his Ph.D. thesis advisor, David Karger, there are also efforts underway to extend the syntax and functionality of Exhibit. We’ve just seen the expansion to direct Google spreadsheet support, and support for more spreadsheet functionality is desired, including possible string concatenation and numeric operations.

Exhibit itself has been designed with extensibility in mind. Its linkage to Timeline is one such example. What will be critical in the weeks and months ahead is the growth of a developer and user community surrounding Exhibit. There is presently a fairly active mailing list and I’m sure the MIT folks would welcome serious contributions.

Finally, other aspects of the Simile project itself and related initiatives at MIT have direct and growing ties to Exhibit both in terms of team members and developers and in terms of philosophy. You may want to check out these additional MIT projects including Longwell, Piggy Bank, Solvent, Semantic Bank, Welkin, DSpace, Haystack, Dwell, Ajax, Sifter, Relo plugin, Re:Search, Chickenfoot, and LAPIS. This is a program on the move, to which the closest attention is warranted.

Expected Growing Pains

There are some known display issues in the Safari and Opera browsers; these are being worked on and should be resolved shortly. There are also some style issues and conflicts when embedding in blogs (easily fixed with CSS modifications). There are likely performance problems when data sets get into the hundreds or thousands, but that exceeds Exhibit’s lightweight objectives anyway. There may be other problems that emerge as use broadens.

These issues are to be expected and should not deter you from playing with the system immediately. You’ll be amazed at what you can do, and how rapidly, with so little code.

It has been a fun few days. It’s exciting to be able to be a camp follower during one of those seminal moments in Web development. And, so I say to David and colleagues at MIT and the band of merry collaborators on their mailing list: Thanks! This is truly cool.

MIT’s Exhibit Continues the Simile Project’s Long String of Innovative Tools

I have just come across a new innovative Web development, and its simplicity and elegance have literally taken my breath away! It is Exhibit, from the Simile project at MIT and its lead author David Huynh, whose contributions include the stellar Piggy Bank (semantic Web Firefox extension), Sifter (little known, but excellent automatic Web data extractor), Babel (data format translator), Timeline (Javascript timeline creator), Ajax (toolset), Solvent (Web data extractor used by Piggy Bank) and Longwell (web-based RDF-powered faceted browser). David is the lead author on the first five tools listed. As a Ph.D. student at MIT, David is truly becoming one of the leading lights in practical semantic Web tool development. Exhibit only reinforces that reputation.

According to its Web site:

Exhibit is a lightweight structured data publishing framework that lets you create web pages with support for sorting, filtering, and rich visualizations by writing only HTML and optionally some CSS and Javascript code.

It’s like Google Maps and Timeline, but for structured data normally published through database-backed web sites. Exhibit essentially removes the need for a database or a server side web application. Its Javascript-based engine makes it easy for everyone who has a little bit of knowledge of HTML and small data sets to share them with the world and let people easily interact with them. . . .

“No Database, No Web Application” means that you can create your own exhibits using just a text editor. . . It’s quite easy to make exhibits. We even let you copy data straight out of a boring spreadsheet and convert it into an exhibit automatically. . . .

Exhibit consists of a bunch of Javascript files that you include in your web page. At load time, this Javascript code reads in one or more JSON data files that you link from within your web page and constructs a database implemented in Javascript right inside the browser of whoever visits your web page. It then dynamically re-constructs the web page as the visitor sorts and filters through the data. . . .

The advantages of Exhibit are as follows:

  • No traditional database technology involved even though Exhibit-embedding web pages appear as if they are backed by databases. So you don’t have to design any database, configure it, and maintain it. After all, if you only have a few dozens of things to publish rather than thousands, why would you spend so much effort in dealing with database technologies?
  • No server-side code required even though Exhibit-embedding web pages are heavily templated. So, there is no need to learn ASP, PHP, JSP, CGI, Velocity, etc. There is no need to worry which server-side scripting technology your hosting provider supports.
  • No need for web server if you only want to create exhibits and keep them on your own computer for your own use. They work straight from the file system.

We also provide a complementary service called Babel that lets you convert data from various sources, including tab-separated values (copied straight from spreadsheets) and Bibtex files, into formats that Exhibit understands.
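The load-then-filter behavior described in the quoted passage can be sketched in a few lines. The following Python is only an analogy for what the Javascript engine is described as doing in the browser, not Exhibit's implementation. The data layout and the helper names `facet_counts` and `select` are assumptions of mine.

```python
import json
from collections import Counter

# Hypothetical data file contents; a real exhibit would link to a JSON file.
DATA = """
{"items": [
  {"label": "George Washington", "party": "None", "inaugurated": 1789},
  {"label": "John Adams", "party": "Federalist", "inaugurated": 1797},
  {"label": "Thomas Jefferson", "party": "Democratic-Republican", "inaugurated": 1801},
  {"label": "James Madison", "party": "Democratic-Republican", "inaugurated": 1809}
]}
"""

# The whole "database" is just a list of records held in memory.
items = json.loads(DATA)["items"]


def facet_counts(rows, prop):
    """Count how many rows share each value of a property (one browsing facet)."""
    return Counter(row[prop] for row in rows)


def select(rows, prop, value, sort_by):
    """Keep the rows matching one facet value, ordered by another property."""
    matches = [row for row in rows if row[prop] == value]
    return sorted(matches, key=lambda row: row[sort_by])


print(facet_counts(items, "party"))
print([r["label"] for r in select(items, "party", "Democratic-Republican", "inaugurated")])
```

With only a few dozen records, rescanning the whole list on every click is cheap, which is presumably why no real database is needed at this scale.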

The Exhibit Web site offers a growing list of helpful tutorials and some live examples of database-related “exhibits,” one of which is this U.S. Presidents’ example that shows maps, timelines, thumbnails and other nifty displays (see the actual site for the interactive displays):

You can get Exhibit today and embed it in your own Web site (more on this to come!).

To learn more about the background to this project, please see the paper, Exhibit: Lightweight Structured Data Publishing, submitted to WWW 2007 by David Huynh, Robert Miller, and David Karger.

Gentlemen, on behalf of the community, let me say, “Thanks! Most excellent work!” It’s discoveries like these that make the Internet so worthwhile.

Posted: January 9, 2007

Early Progress in the Use of Firefox as a Semantic Web Platform

This AI3 blog maintains Sweet Tools, the largest listing of about 800 semantic Web and -related tools available. Most are open source. Click here to see the current listing!

The other day I posted a general status and statistical report on the growth and implications of Firefox extensions. This post presents more than 30 of those nearly 3,000 extensions that may have usefulness in areas related to the semantic Web. I welcome any additions.

These same extensions have also been added to an update of the Sweet Tools listing, which has now grown to more than 350 tools.

Please note that because the spreadsheet is hosted by Google, you must copy the URL to your address bar rather than clicking directly (direct clicking is anticipated in future versions of the Google spreadsheet; now works):

I should mention that I have seen some commentary within the semantic Web community of the desirability of compiling “best of” or “Top X” tools listings for the semantic Web. While such lists have their place, they are no substitute for comprehensive listings. First, semantic tools are still in their infancy and it is premature to bestow “best of” in most categories. Second, many practitioners, such as me, are working to extend and improve existing tools. This requires more comprehensive listings, not narrower ones. And, last, what may ultimately contribute to semantic meaning on the Internet may well extend beyond semantic Web tools, strictly defined. An ivory tower focus on purity is not the means to encourage experimentation and innovation. Many Web 2.0 initiatives, including tagging and social collaboration, may very well point to more effective nucleation points for expanding semantic Web efforts than W3C-compliant efforts.

These are some of the reasons that I have been happy to include simple Firefox extensions or relatively narrow format converters for my listings. Who knows? You never know when and where you might find a gem! (And I’m not speaking solely of Ruby!)