AI3's SemWeb Tools Survey is Now Largely Completed
It has taken nearly six months, but I believe my survey of existing semantic Web and related tools is now largely complete. While new tools will certainly continue to be discovered, and new ones are constantly being developed (at what I believe is an accelerating pace), I think the existing backlog has largely been found. Though important gaps remain, today's picture is one of a surprisingly robust tools environment for promoting semantic Web objectives.
Growth and Tools Characterization
My most recent update of Sweet Tools, also published today, now lists 500 semantic Web and related tools and is in its 8th version. Starting with the W3C’s listing of 70 tools, first referenced in August 2006, I have steadily found and added to the listing. Until now, the predominant source of growth in these listings has come through discovery of extant tools.
In its earliest versions, my search strategy very much focused on all topics directly related to the “semantic Web.” However, as time went on, I came to understand the importance of many ancillary tool sets to the entire semantic Web pipeline (such as language processing and information extraction) and came to find whole new categories of pragmatic tools that embodied semantic Web and data mediation processes but which did not label themselves as such. This latter category has been an especially rich vein to mine, with notable contributions from the humanities, biology and the physical sciences.
But the pace of discovery is now approaching its asymptote. Though I by no means believe I have comprehensively found all extant tools, I do believe that most new tools in future listings will come more from organic growth and new development than discovery of hidden gems. So, enjoy!
My view of what is required for the semantic Web vision to reach some degree of fruition begins with uncharacterized content, which then proceeds through a processing pipeline that ultimately results in the storage of RDF triples that can be managed at scale. By necessity, such a soup-to-nuts vision embraces tools and requirements that, individually, might not constitute semantic technology strictly defined, but that are nonetheless integral parts of the overall pipeline. By (somewhat arbitrary) category, here is the breakdown of the current listing of 500 tools:
| Count | Category |
|---|---|
| 30 | Parser or Converter |
| 23 | Browser (RDF, OWL or semantic) |
| 22 | Wiki- or blog-related |
| 22 | Wrapper (Web data extractor) |
| 13 | Query Language or Service |
| 6 | NOT ACTIVE (???) |
| 3 | Description or Formal Logics |
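To make the tail end of that pipeline concrete, here is a minimal sketch of turning an extracted fact into stored RDF triples using Python's rdflib. This is purely illustrative and not tied to any particular tool in the listing; the namespace and resource names are invented.

```python
# A minimal sketch of the end of the pipeline described above: taking a fact
# extracted from unstructured content and recording it as RDF triples.
# Assumes the rdflib package is installed; the namespace and resource names
# below are invented purely for illustration.
from rdflib import Graph, Literal, Namespace, RDF, URIRef

EX = Namespace("http://example.org/tools/")             # hypothetical vocabulary

g = Graph()
tool = URIRef("http://example.org/tools/sweet-tool-1")  # hypothetical resource

g.add((tool, RDF.type, EX.SemanticWebTool))
g.add((tool, EX.category, Literal("Parser or Converter")))
g.add((tool, EX.primaryLanguage, Literal("Java")))

# Serialize the small graph as Turtle, the kind of output a triple store ingests
print(g.serialize(format="turtle"))
```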
I find it amusing that the diversity and sources of such tool listings, including what properly belongs in the domain and what does not, are themselves an interesting example of the difficulties facing semantic mediation and resolution. Alas, such is the real world.
Java is the Preferred Language
There are also interesting historical trends in the primary development languages used for these tools. Older ones rely on C or C++ or, if they are logic- or inference-oriented, on the expected languages of Prolog or LISP.
So, here is the listing of Sweet Tools apps by primary development language:
An Alternative Simple Listing
The last 24 hours have seen a flurry of postings on the newly released Yahoo! Pipes service, an online IDE for wiring together and managing data feeds on the Web. Tim O'Reilly has called the Pipes service “a milestone in the history of the internet.” Rather than repeat it all here, go to Jeremy Zawodny's posting, Yahoo! Pipes: Unlocking the Data Web, where he has assembled a pretty comprehensive listing of what others are saying about this new development:
Using the Pipes editor, you can fetch any data source via its RSS, Atom or other XML feed, extract the data you want, combine it with data from another source, apply various built-in filters (sort, unique (with the “ue” this time:-), count, truncate, union, join, as well as user-defined filters), and apply simple programming tools like for loops. In short, it’s a good start on the Unix shell for mashups. It can extract dates and locations and what it considers to be “text entities.” You can solicit user input and build URL lines to submit to sites. The drag and drop editor lets you view and construct your pipeline, inspecting the data at each step in the process. And of course, you can view and copy any existing pipes, just like you could with shell scripts and later, web pages.
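To make the shell-for-mashups analogy concrete, here is a rough sketch in plain Python of the same kind of pipeline, using the feedparser library. The feed URLs are placeholders, and Pipes of course does all of this graphically without any code.

```python
# A rough sketch of the kind of pipeline Pipes lets you wire together
# graphically: fetch two feeds, union them, drop duplicates, sort by date
# and truncate. Assumes the feedparser package is installed; the feed URLs
# are placeholders.
import feedparser

FEEDS = [
    "http://example.org/feed-a.xml",   # placeholder feed URLs
    "http://example.org/feed-b.xml",
]

entries = []
for url in FEEDS:
    entries.extend(feedparser.parse(url).entries)       # fetch + union

seen, unique_entries = set(), []
for entry in entries:                                   # "unique" filter
    link = entry.get("link")
    if link and link not in seen:
        seen.add(link)
        unique_entries.append(entry)

# "sort" by publication date (newest first) and "truncate" to ten items
unique_entries.sort(
    key=lambda e: tuple(e.get("published_parsed") or ()), reverse=True
)
for entry in unique_entries[:10]:
    print(entry.get("title", "(untitled)"))
```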
An update to the Sweet Tools semantic Web and -related tools listing has been posted. Forty-two tools were added, bringing the running total to 420. As always, please provide any corrections or additions here.
Collex is the Next Example in a Line of Innovative Tools from the Humanities
I seem to be on a string of discoveries of new tools from unusual sources — that is, at least, unusual for me. For some months now I have been attempting to discover the “universe” of semantic Web tools, beginning obviously with efforts that self-label in that category. (See my ongoing Sweet Tools comprehensive listing of semantic Web and related tools.) Then, it became clear that many “Web 2.0” tools potentially contribute to this category via tagging, folksonomies, mashups and the like. I've also been focused on language processing tools that relate to this category in other ways (a topic for another day). Most recently, however, I have discovered a rich vein of tools in areas that take pragmatic approaches to managing structure and metadata, but often with little mention of the semantic Web or Web 2.0. And in that vein, I continue to encounter impressive technology developed within the humanities and library science (see, for example, the recent post on Zotero).
To many of you, the contributions from these disciplines have likely been obvious for years. I admit I’m a slow learner. But I also suspect there is much that goes on in fields outside our normal ken. My own mini-epiphany is that I also need to be looking at the pragmatists within many different communities — some of whom eschew the current Sem Web and Web 2.0 hype — yet are actually doing relevant and directly transferable things within their own orbits. I have written elsewhere about the leadership of physicists and biologists in prior Internet innovations. I guess the thing that has really surprised me most recently is the emerging prominence of the humanities (I feel like the Geico caveman saying that).
Collex is the Next in a Fine Legacy
The latest discovery is Collex, a set of tools for COLLecting and EXhibiting information in the humanities. According to Bethany Nowviskie, a lecturer in media studies at the University of Virginia and a lead designer of the effort, in her introduction, COLLEX: Semantic Collections & Exhibits for the Remixable Web:
Collex is a set of tools designed to aid students and scholars working in networked archives and federated repositories of humanities materials: a sophisticated collections and exhibits mechanism for the semantic web. It allows users to collect, annotate, and tag online objects and to repurpose them in illustrated, interlinked essays or exhibits. Collex functions within any modern web browser without recourse to plugins or downloads and is fully networked as a server-side application. By saving information about user activity (the construction of annotated collections and exhibits) as “remixable” metadata, the Collex system writes current practice into the scholarly record and permits knowledge discovery based not only on the characteristics or “facets” of digital objects, but also on the contexts in which they are placed by a community of scholars. Collex builds on the same semantic web technologies that drive MIT’s SIMILE project and it brings folksonomy tagging to trusted, peer-reviewed scholarly archives. Its exhibits-builder is analogous to high-end digital curation tools currently affordable only to large institutions like the Smithsonian. Collex is free, generalizable, and open source and is presently being implemented in a large-scale pilot project under the auspices of NINES.
(BTW, NINES stands for the Networked Infrastructure for Nineteenth-century Electronic Scholarship, a trans-Atlantic federation of scholars.)
The initial efforts that became Collex were aimed at establishing frameworks and process within this community, not tools. But the group apparently recognized the importance of leverage and enablers (i.e., tools) and hired Erik Hatcher, a key contributor to the Apache open-source Lucene text-indexing engine and co-author of Lucene in Action, to spearhead development of an actually usable tool. Erik proceeded to grab best-of-breed pieces such as Ruby on Rails and Solr (a faceted enhancement to Lucene that has just graduated from the Apache incubator), and then to work hard on follow-on efforts such as Flare (a display framework) to create the basics of Collex. A sample screenshot of the application is shown below:
The Collex app is still flying under the radar, but it has sufficient online functionality today to support annotation, faceting, filtering, display, etc. Another interesting aspect of the NINES project (though not apparently a programmatic capability of the Collex software itself) is that it only allows “authoritative” community listings, an absolute essential for scaling the semantic Web.
You can play with the impressive online demo of the Collex faceted browser at the NINES Web site today, though clearly the software is still undergoing intense development. I particularly like its clean design and clear functionality. The other aspect of this software that deserves attention is that it is a server-side option with cross-browser Ajax, without requiring any plugins. It works equally well in Safari, Firefox and Windows IE. And, like the Zotero research citation tool, this basic framework could easily lend itself to managing structured information in virtually any other domain.
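That faceted browsing is presumably the kind of query Solr answers behind the scenes. As a purely illustrative sketch, and not Collex's actual schema, the request below assumes a local Solr instance with a hypothetical core named "nines" and a facet field named "genre":

```python
# A minimal sketch of a faceted Solr query of the sort a Collex-style
# interface might issue. The core name ("nines") and facet field ("genre")
# are hypothetical; substitute whatever your Solr schema actually defines.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

params = urlencode({
    "q": "whitman",          # free-text query
    "facet": "true",         # turn faceting on
    "facet.field": "genre",  # count matching documents per genre
    "rows": 5,
    "wt": "json",
})
with urlopen(f"http://localhost:8983/solr/nines/select?{params}") as resp:
    data = json.load(resp)

print("hits:", data["response"]["numFound"])
print("genre facets:", data["facet_counts"]["facet_fields"]["genre"])
```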
Collex is one of the projects of Applied Research in Patacriticism, a software development research team located at the University of Virginia and funded through an award to professor Jerome McGann from the Andrew W. Mellon Foundation. (“Shuuu, sheeee. Impressive. Most impressive.” — Darth Vader)
(BTW, forgive my sexist use of “guys” in this post’s title; I just couldn’t get a sex-neutral title to work as well for me!)
An AI3 Jewels & Doubloon Winner
Zotero, first released in October, is perhaps the best Firefox extension that most users have never heard of, unless you are an academic historian or social scientist, in which case Zotero is becoming quite the rage. It is also percolating into other academic fields, including law, math and science.
Zotero is a complete research citation platform, or what its developer, George Mason University's Center for History and New Media (CHNM), calls “the next-generation research tool.” Zotero allows academics and researchers to extract, manage, annotate, organize, export and publish citations in a variety of formats and from a variety of sources — all within the Firefox browser, and all while the user is interacting directly with the Web.
What it Is
Like all Firefox extensions, Zotero is easy to install. From the Firefox add-on site or the Zotero site itself, a single click downloads the app to your browser. After a prompted restart, the app is active. (Later alerts for any version upgrades are similarly automatic, as for any Firefox extension.)
Upon installation, Zotero puts an icon in your status bar and places new options on menus. When you encounter a site that Zotero supports (currently mostly university libraries, but also Amazon and major publication outlets, totaling more than 150; here is a listing of Zotero's so-called translators), you will see a folder symbol in your address bar telling you Zotero is active. A single click downloads the citations from that site automatically to your local system.
Citations have traditionally been one of the more “semantically challenging” data sets, with rampant variations in style, order, format, presentation, coverage and more. The fact that Zotero supports a given source site means that it understands these nuances and is ready to store the information in a single, canonical representation. Once downloaded, this citation representation can be easily managed and organized. More importantly, you can now export this internal, standard representation into a multitude of export formats (including, most recently, MS Word). In short, like the for-fee citation software of the past, Zotero now provides a free and universal mechanism for managing this chaos.
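To give a flavor of what a canonical representation buys you, here is an illustrative sketch (not Zotero's actual internal format): store a citation once as a simple structure, then render it into an export format such as BibTeX. The field names and template are invented for the example.

```python
# A sketch of the general idea: keep one canonical citation structure and
# render it into whatever export format is needed. The field names and the
# BibTeX template are illustrative, not Zotero's internal representation.
citation = {
    "key": "bush1945",
    "type": "article",
    "author": "Vannevar Bush",
    "title": "As We May Think",
    "journal": "The Atlantic Monthly",
    "year": "1945",
}

def to_bibtex(c: dict) -> str:
    """Render the canonical citation dict as a BibTeX entry."""
    fields = ["author", "title", "journal", "year"]
    body = ",\n".join(f"  {f} = {{{c[f]}}}" for f in fields if f in c)
    return f"@{c['type']}{{{c['key']},\n{body}\n}}"

print(to_bibtex(citation))
```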
While the address icon acts to download one or more citations (yes, they also work in groups if there are multiple listings on the page!), choosing the Zotero icon itself invokes the full Zotero as an app within the browser, as this screen shot shows:
The left panes provide organization, management and tag support; the middle pane shows the active resources; and the right pane shows the structure associated with the active citation. This is all supported with attractive icons and logical tooltips and organization.
Zotero also offers utilities for creating your own scrapers (“translators”) for new sites not yet in the standard, supported roster. This capability is itself an extension to Zotero, called Scaffold, which also points to the building-block nature of the core app. (Other utilities, such as Solvent from MIT or others surely to come, could either enhance or replace the current Scaffold framework.)
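Zotero's translators are written in JavaScript against Zotero's own APIs, but the underlying idea of a scraper is simple enough to sketch generically. The toy example below, in Python with only the standard library, pulls the citation-style `<meta>` tags that some scholarly pages expose; the HTML is a stub.

```python
# A generic illustration of what a citation scraper does: collect
# <meta name="citation_*" content="..."> tags from a page into a dict.
# Standard library only; the HTML below is a stub for demonstration.
from html.parser import HTMLParser

class CitationMetaParser(HTMLParser):
    """Collect citation_* <meta> tags into a simple field dictionary."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name", "")
        if name.startswith("citation_") and "content" in attrs:
            self.fields[name] = attrs["content"]

html = """<html><head>
  <meta name="citation_title" content="As We May Think">
  <meta name="citation_author" content="Vannevar Bush">
  <meta name="citation_publication_date" content="1945">
</head><body></body></html>"""

parser = CitationMetaParser()
parser.feed(html)
print(parser.fields)
```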
What is Impressive
Though supposedly in “beta,” Zotero already shows a completeness, sophistication and attention to detail not evident in most Firefox extensions. Indeed, this system approaches a complete application in its scope and professionalism. The fact that it can be so easily installed and embedded in the browser itself is worth noting.
Firefox extensions have continuously evolved from single-function wonders to crude apps and now, as Zotero and a handful of other extensions show, complete functional applications. And, like OSes of the past, these extensions also adhere to standards and practices that make them pretty easy to use across applications. Firefox is indeed becoming a full-fledged platform.
This system also uses the new SQLite local database facility (“mozStorage”) in Firefox 2.x to manage its local data (perhaps one of the first Firefox extensions to do so). This provides a clean and small install footprint for the extension, and opens it up to other standard data utilities.
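Zotero's actual schema is its own affair, but the kind of small local store that mozStorage (SQLite running inside Firefox) gives an extension is easy to picture with Python's built-in sqlite3 module. The table and fields below are invented for illustration.

```python
# A sketch of the kind of small local SQLite store that mozStorage makes
# available to an extension. The table and fields are invented; Zotero's
# real schema is not shown here.
import sqlite3

conn = sqlite3.connect("zotero_sketch.sqlite")   # a single local file
conn.execute("""
    CREATE TABLE IF NOT EXISTS items (
        id     INTEGER PRIMARY KEY,
        title  TEXT NOT NULL,
        author TEXT,
        year   INTEGER
    )
""")
conn.execute(
    "INSERT INTO items (title, author, year) VALUES (?, ?, ?)",
    ("As We May Think", "Vannevar Bush", 1945),
)
conn.commit()

for row in conn.execute("SELECT title, author, year FROM items ORDER BY year"):
    print(row)
conn.close()
```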
What it Implies
So, what Zotero exemplifies — beyond its own considerable capabilities — are some important implications. First, full-bodied apps, building on many piece-parts, can now be expected around the Firefox platform. (Indeed, I earlier noted the emergence of such “Web OS” prospects as Parakey, whose developers also come from earlier Firefox legacies. One of those developers, Joe Hewitt, is also the author of the impressive Firebug extension.)
Second, the openness of Firefox for web-centric computing will, as I’ve stated before, continue to put competitive pressure on Microsoft’s Internet Explorer. This is good for all users at large and will continue to spur innovation.
Third, the pending version 2.0 of Zotero is slated to have a server-side component. What we are potentially seeing, then, are local client-side instantiations in the browser that communicate with remote data servers. This opens up a wealth of possibilities in social networking and collaboration.
And, last, more specific to Zotero itself (but also enabled by Firefox's native RDF support), we are now seeing a complete app framework for dealing with structured information and tagging on the Web. While Zotero clearly has a direct audience in citation management and research, the same infrastructure and techniques used by the system could become a general semantic Web or data framework for any other structured application.
Hmmm. Now that sounds like an opportunity . . . .
An AI3 Jewels & Doubloon Winner