Posted:April 2, 2007

Meat and Potatoes

DBpedia Serves Up Real Meat and Potatoes (but Bring Your Own Knife and Fork!)

NOTE: Due to demand, I am pleased to provide this PDF version of this posting (1.4 MB) for downloading or printing.

DBpedia is the first and largest source of structured data on the Internet covering topics of general knowledge. You may have not yet heard of DBpedia, but you will. Its name derives from its springboard in Wikipedia. And it is free, growing rapidly and publicly available today.

With DBpedia, you can manipulate and derive facts on more than 1.6 million “things” (people, places, objects, entities). For example, you can easily retrieve a listing of prominent scientists born in the 1870s; or, with virtually no additional effort, a further filtering to all German theoretical physicists born in 1879 who have won a Nobel prize [1]. DBpedia is the first project chosen for showcasing by the Linking Open Data community of the Semantic Web Education and Outreach (SWEO) interest group within the W3C. That community has committed to make portions of other massive data sets — such as the US Census, Geonames, MusicBrainz, WordNet, the DBLP bibliography and many others — interoperable as well.

DBpedia has been unfortunately overlooked in the buzz of the past couple weeks surrounding Freebase. Luminaries such as Esther Dyson, Tim O’Reilly and others have been effusive about the prospects of the pending Freebase offering. And, while, according to O’Reilly Freebase may be “addictive” or from Dyson it may be that “Freebase is a milestone in the journey towards representing meaning in computers,” those have been hard assertions to judge. Only a few have been invited (I’m not one) to test drive Freebase, now in alpha behind a sign-in screen, and reportedly also based heavily on Wikipedia. On the other hand, DBpedia, released in late January, is open for testing and demos and available today — to all [2].

Please don’t misunderstand. I’m not trying to pit one service against the other. Both services herald a new era in the structured Web, the next leg on the road to the semantic Web. The data from both Freebase and DBpedia are being made freely available under either Creative Commons or the GNU Free Documentation License, respectively. Free and open data access is fortunately not a zero sum game — quite the opposite. Like other truisms regarding the network effects of the Internet, the more data that can be meaningfully intermeshed, the greater the value. Let’s wish both services and all of their cousins and progeny much success!

Freebase may prove as important and revolutionary as some of these pundits predict — one never knows. Wikipedia, first released in Jan. 2001 with 270 articles and with only 10 editors, only had a mere 1000 mentions a month by July 2003. Yet today it has more than 1.7 million articles (English version) and is ranked about #10 in overall Web traffic (more here). So, while today Freebase has greater visibility, marketing savvy and buzz than DBpedia, so did virtually every other entity in Jan. 2001 compared to Wikipedia in its infancy. Early buzz is no guarantee of staying power.

What I do know is that DBpedia and the catalytic role it is playing in the open data movement is the kind of stuff from which success on the Internet naturally springs. What I also know is that in open source content a community is required to power a promise to its potential. Because of its promise, its open and collaborative approach, and the sheer quality of its information now, DBpedia deserves your and the Web’s attention and awareness. But, only time will tell whether DBpedia is able to nurture a community or not and overcome current semantic Web teething problems not of its doing.

First, Some Basics

DBpedia represents data using the Resource Description Framework (RDF) model, as is true for other data sources now available or being contemplated for the semantic Web. Any data representation that uses a “triple” of subject-predicate-object can be expressed through the W3C’s standard RDF model. In such triples, subject denotes the resource, and the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object. (You can think of subjects and objects as nouns, predicates as verbs.) Resources are given a URI (as may also be given to predicates or objects that are not specified with a literal) so that there is a single, unique reference for each item. These lookups can themselves be an individual assertion, an entire specification (as is the case, for example, when referencing the RDF or XML standards), or a complete or partial ontology for some domain or world-view. While the RDF data is often stored and displayed using XML syntax, that is not a requirement. Other RDF forms may include N3 or Turtle syntax, and variants of RDF also exist including RDFa and eRDF (both for embedding RDF in HTML) or more structured representations such as RDF-S (for schema; also known as RDFS, RDFs, RDFSchema, etc.).

The absolutely great thing about RDF is how well it lends itself to mapping and mediating concepts from different sources into an unambiguous semantic representation (my ‘glad== your ‘happy‘ OR my ‘gladis your ‘glad‘), leading to what Kingsley Idehen calls “data meshups. Further, with additional structure (such as through RDF-S or the various dialects of OWL), drawing inferences and machine reasoning based on the data through more formal ontologies and descriptive logics is also within reach [3].

While these nuances and distinctions are important to the developers and practitioners in the field, they are of little to no interest to actual users [4]. But, fortunately for users, and behind the scenes, practitioners have dozens of converters to get data in whatever form it may exist in the wild such as XML or JSON or a myriad of other formats into RDF (or vice versa) [5]. Using such converters for structured data is becoming pretty straightforward. What is now getting more exciting are improved ways to extract structure from semi-structured data and Web pages or to use various information extraction techniques to obtain metadata or named entity structure from within unstructured data. This is what DBpedia did: it converted all of the inherent structure within Wikipedia into RDF, which then makes it manipulable similar to a conventional database.

And, like SQL for conventional databases, SPARQL is now emerging as a leading query framework for RDF-based “triplestores” (that is, the unique form of databases — most often indexed in various ways to improve performance — geared to RDF triples). Moreover, in keeping with the distributed nature of the Internet, distributed SPARQL “endpoints” are emerging, which represent specific data query points at IP nodes, the results of which can then be federated and combined. With the emerging toolset of “RDFizers“, in combination with extraction techniques, such endpoints should soon proliferate. Thus, Web-based data integration models can either follow the data federation approach or the consolidated data warehouse approach or any combination thereof.

The net effect is that the tools and standards now exist such that all data on the Internet can now be structured and combined and analyzed. This is huge. Let me repeat: this is HUGE. And all of us users will only benefit as practitioners continue their labors in the background. The era of the structured Web is now upon us.

A Short Intro to DBpedia [6]DBpedia Logo

The blossoming of Wikipedia has provided a serendipitous starting point to nucleate this emerging “Web of data“. Like Vonnegut’s ice-9, DBpedia‘s RDF representation — freely available for download and extension — offers the prospect to catalyze many semantic Web data sources and tools. (It is also worth study why Freebase, DBpedia, and a related effort called YAGO from the Max Planck Institute for Computer Science [7] all are using Wikipedia as a springboard.) The evidence that the ice crystals are forming comes from the literally hundreds of millions of new “facts” that were committed to be added as part of the open data initiative within days of DBpedia‘s release (see below).

Wikipedia articles consist mostly of free text, but also contain different types of structured information, such as infobox templates (an especially rich source of structure — see below), categorization information, images, geo-coordinates and links to external Web pages. The extraction methods are described in a paper from DBpedia developers Sören Auer and Jens Lehmann. This effort created the initial DBpedia datasets, including extractions in English, German, French, Spanish, Italian, Portuguese, Polish, Swedish, Dutch, Japanese and Chinese.

The current extraction set is about 1.8 GB of zipped files (about 7.5 GB unzipped), consisting of an estimated 1.6 milllion entities, 8,000 relations/properties/predicates, and 91 million “facts” (generally equated to RDF triples) over about 15 different data files. DBpedia can be downloaded as multiple datasets, with the splits and sizes in (zipped form and number of triples) shown below. Note that each download — available directly from the DBpedia site — may also include multiple language variants:

  • Articles (217 MB zipped, 5.4 M triples) — Descriptions of all 1.6 million concepts within the English version of Wikipedia including English titles and English short abstracts (max. 500 chars long), thumbnails, links to the corresponding articles in the English Wikipedia. This is the baseline DBpedia file
  • Extended Abstracts (447 MB, 1.6 M triples) — Additional, extended English abstracts (max. 3000 chars long)
  • External Links (75 MB, 2.8 M triples) — Links to external web pages about a concept
  • Article Categories (68 MB, 5.5 M triples) — Links from concepts to categories using the SKOS vocabulary
  • Additional Languages (147 MB, 4 M triples) — Additional titles, short abstracts and Wikipedia article links in the 10 languages listed above
  • Languages Extended Abstracts (342 MB, 1.3 M triples) — Extended abstracts in the 10 languages
  • Infoboxes (116 MB, 9.1 M triples) — Information that has been extracted from Wikipedia infoboxes, an example of which is shown below. There are about three-quarter million English templates with more than 8,000 properties (predicates), with music, animal and plant species, films, cities and books the most prominent types. This category is a major source of the DBpedia structure (but not total content volume):
Example Wikipedia Infobox Template
[Click on image for full-size pop-up]
  • Categories (10 MB, 900 K triples) — Information regarding which concepts are a category and how categories are related
  • Persons (4 MB, 437 K triples) — Information on about 58,880 persons (date and place of birth etc.) represented using the FOAF vocabulary
  • Links to Geonames (2 MB, 213 K triples) — Links between geographic places in DBpedia and data about them in the Geonames database
  • Location Types (500 kb, 65 K triples) — rdf:type Statements for geographic locations
  • Links to DBLP (10 kB, 200 K triples) — Links between computer scientists in DBpedia and their publications in the DBLP database
  • Links to the RDF Book Mashup (100 kB, 9 K triples) — Links between books in DBpedia and data about them provided by the RDF Book Mashup
  • Page Links (335 MB, around 60 M triples) — Dataset containing internal links between DBpedia instances. The dataset was created from the internal pagelinks between Wikipedia articles, potentially useful for structural analysis or data mining.

Besides direct download, DBpedia is also available via a public SPARQL endpoint at and via various query utilities (see next).

Querying and Interacting with DBpedia

Generic RDF (semantic Web) browsers like Disco, Tabulator or the OpenLink Data Web Browser [8] can be used to navigate, filter and query data across data sources. A couple of these browsers, plus endpoints and online query services, are used below to illustrate how you can interact with the DBpedia datasets. However, that being said, it is likely that first-time users will have difficulty using portions of this infrastructure without further guidance. While data preparation and exposure via RDF standards is progressing nicely, the user interfaces, documentation, intuitive presentation, and general knowledge for standard users to use this infrastructure are lagging. Let’s show this with some examples, progressing from what might be called practitioner tools and interfaces to those geared more to standard users [9].

The first option we might try is to query the data directly through DBpedia‘s SPARQL endpoint [10]. But, since that is generally used directly by remote agents, we will instead access the endpoint via a SPARQL viewer, as shown by this example looking for “luxury cars”:

SPARQL Explorer
[Click on image for full-size pop-up]

We observe a couple of things using this viewer. First, the query itself requires using the SPARQL syntax. And, the results are presented in a linked listing of RDF triples.

For practitioners, this is natural and straightforward. Indeed, other standard viewers, now using Disco as the example, also present results in the same tabular triple format (in this case for “Paul McCartney”):

Disco1 Image
[Click on image for full-size pop-up]

While OK for practitioners, these tools pose some challenges for standard users. First, one must know the SPARQL syntax with its SELECT statement and triples specifications. Second, where the resources exist (the URIs) and the specific names of the relations (predicates) must generally be known or looked up in a separate step. And, third, of course, the tabular triples result display is not terribly attractive and may not be sufficiently informative (since results are often the links to the resource, rather than the actual content of that resource).

These limits are well-known in the community; indeed, they are a reflection of RDF’s roots in machine-readable data. However, it is also becoming apparent that humans need to inspect and interact with this stuff as well, leading to consequent attempts to make interfaces more attractive and intuitive. One of the first to do so for DBpedia is the Universität Leipzig’s Query Wikipedia, an example of an online query service. In this first screen, we see that the user is shielded from much of the SPARQL syntax via the three-part (subject-predicate-object) query entry form:

Query Wikipedia Space Missions Query
[Click on image for full-size pop-up]

We also see that the results can be presented in a more attractive form including image thumbnails:

Query Wikipedia Space Missions Results
[Click on image for full-size pop-up]

Use of dropdown lists could help this design still further. Better tools and interface design is an active area of research and innovation (see below).

However, remember the results of such data queries are themselves machine readable, which of course means that the results can be embedded into existing pages and frameworks or combined (“meshed up“) with still other data in still other presentation and visualization frameworks [11]. For example, here is one DBpedia results set on German state capitals presented in the familiar Wikipedia (actually, MediaWiki) format:

German Capitals in Wikipedia
[Click on image for full-size pop-up]

Or, here is a representative example of what DBpedia data might look like when the addition of the planned Geonames dataset is completed:

Geonames Example
[Click on image for full-size pop-up]

Some of these examples, indeed because of the value of the large-scale RDF data that DBpedia now brings, begin to glaringly expose prior weaknesses in tools and user interfaces that were hidden when only applied to toy datasets. With the massive expansion to still additional datasets, plus the interface innovations of Freebase and Yahoo! Pipes and elsewhere, we should see rapid improvements in presentation and usability.

Extensibility and Relation to Open Data

In addition to the Wikipedia, DBLP bibliography and RDF book mashup data already in DBpedia, significant commitments to new datasets and tools have been quick to come, including the addition of full-text search [12], with many more candidate datasets also identified. A key trigger for this interest was the acceptance of the Linking Open Data project, itself an outgrowth of DBpedia proposed by Chris Bizer and Richard Cyganiak, as one of the kick-off community projects of the W3C‘s Semantic Web Education and Outreach (SWEO) interest group. Some of the new datasets presently being integrated into the system — with many notable advocates standing behind them — include:

  • Geonames data, including its ontology and 6 million geographical places and features, including implementation of RDF data services
  • 700 million triples of U.S. Census data from Josh Tauberer
  •, the RDF-based reviewing and rating site, including its links to FOAF, the Review Vocab and Richard Newman’s Tag Ontology
  • The “RDFizers” from MIT Simile Project (not to mention other tools), plus 50 million triples from the MIT Libraries catalog covering about a million books
  • GEMET, the GEneral Multilingual Environmental Thesaurus of the European Environment Agency
  • And, WordNet through the YAGO project, and its potential for an improved hierarchical ontology to the Wikipedia data.

Additional candidate datasets of interest have also been identified by the SWEO interest group and can be found on this page:

Note the richness within music and genetics, two domains that have been early (among others not yet listed) to embrace semantic Web approaches. Clearly, this listing, itself only a few weeks old, will grow rapidly. And, as has been noted, the availability of a wide variety of RDFizers and other tools should cause such open data to continue to grow explosively.

There are also tremendous possibilities for different presentation formats of the results data, as the MediaWiki and Geonames examples above showed. Presentation options include calendars, timelines, lightweight publishing systems (such as Exhibit, which already has a DBpedia example), maps, data visualizations, PDAs and mobile devices, notebooks and annotation systems.

Current Back-end Infrastructure

The online version of DBpedia and its SPARQL endpoint is managed by the open source Virtuoso middleware on a server provided by OpenLink Software. This software also hosts the various third-party browsers and query utilities noted above. These OpenLink programs themselves are some of the more remarkable tools available to the semantic Web community, are open source, and are also very much “under the radar.”

Just as Wikipedia helps provide the content grist for springboarding the structured Web, such middleware and conversion tools are another part of the new magic essential to what is now being achieved via DBpedia and Freebase. I will be discussing further the impressive OpenLink Software tools and back-end infrastructure in detail in an upcoming posting.

Structured Data: Foundation to the Road of the Semantic Web

We thus have an exciting story of a new era — the structured Web — arising from the RDF exposure of large-scale structured data and its accompanying infrastructure of converters, middleware and datastores. This structured Web provides the very foundational roadbed underlying the semantic Web.

Yet, at the same time as we begin traveling portions of this road, we can also see more clearly some of the rough patches in the tools and interfaces needed by non-practitioners. I suspect much of the excitement deriving from both Yahoo! Pipes and Freebase comes from the fact that their interfaces are “fun” and “addictive.” (Excitement also comes from allowing users with community rights to mold ontologies or to add structured content of their own.) Similar innovations will be needed to smooth over other rough spots in query services and use of the SPARQL language, as well as in lightweight forms of results and dataset presentation. And still even greater challenges in mediating semantic heterogeneities lie much further down this road, which is a topic for another day.

I suspect this new era of the structured Web also signals other transitions and changes. Practitioners and the researchers who have long labored in the laboratories of the semantic Web need to get used to business types, marketers and promoters, and (even) the general public crowding into the room and jostling the elbows. Explication, documentation and popularization will become more important; artists and designers need to join the party. While we’ve only seen it in limited measure to date, venture interest and money will also begin flooding into the room, changing the dynamics and the future in unforeseeable ways. I suspect many who have worked the hardest to bring about this exciting era may look back ruefully and wonder why they ever despaired that the broader world didn’t “get it” and why it was taking so long for the self-evident truth of the semantic Web to become real. It now is.

But, like all Thermidorean reactions, this, too, shall pass. Whether we have to call it “Web 3.0″ or “Web 4.0″ or even “Web 98.6“, or any other such silliness for a period of time in order to broaden its appeal, we should accept this as the price of progress and do so. We really have the standards and tools at hand; we now need to get out front as quickly as possible to get the RDFized data available. We are at the tipping point for accelerating the network effects deriving from structured data. With just a few more blinks of the eye, the semantic Web will have occurred before we know it.

DBpedia Is Another Deserving Winner

The DBpedia team of Chris Bizer, Richard Cyganiak and Georgi Kobilarov (Freie Universität Berlin), of Sören Auer, Jens Lehmann and Jörg Schüppel (Universität Leipzig), and of Orri Erling and Kingsley Idehen (OpenLink Software) all deserve our thanks for showing how “RDFizing” available data and making it accessible as a SPARQL endpoint can move from toy semantic Web demos to actually useful data. Chris and Richard also deserve kudos for proposing the Linking Open Data initiative with DBpedia as its starting nucleus. I thus gladly announce DBpedia as a deserving winner of the (highly coveted, to the small handful who even know of it! :) ) AI3 Jewels & Doubloons award.

Jewels & Doubloons An AI3 Jewels & Doubloons Winner

[1] BTW, the answer is Albert Einstein and Max von Laue; another half dozen German physicists received Nobels within the surrounding decade.

[2] Moreover, just for historical accuracy, DBpedia was also the first released, being announced on January 23, 2007.

[3] It is this potential ultimate vision of autonomous “intelligent” agents or bots working on our behalf getting and collating relevant information that many confusedly equate to what is meant to constitute the “Semantic Web” (both capitalized). While the vision is useful and reachable, most semantic Web researchers understand the semantic Web to be a basic infrastructure, followed by an ongoing process to publish and harness more and more data in machine-processable ways. In this sense, the “semantic Web” can be seen more as a journey than a destination.

[4] Like plumbing, the semantic Web is best hidden and only important as a means to an end. The originators of Freebase and other entities beginning to tap into this infrastructure intuitively understand this. Some of the challenges of education and outreach around the semantic Web are to emphasize benefits and delivered results rather than describing the washers, fittings and sweated joints of this plumbing.

[5] A great listing of such “RDFizers” can be found on MIT’s Simile Web site.

[6] This introduction to DBpedia borrows liberally from its own Web site; please consult it for more up-to-date, complete, and accurate details.

[7] YAGO (“yet another great ontology”) was developed by Fabian M. Suchanek, Gjergji Kasneci and Gerhard Weikum at the Max-Plack-Institute for Computer Science at Saarbrücken University. A WWW 2007 paper describing the project is available, plus there is an online query point for the YAGO service and the data can be downloaded. YAGO combines Wikipedia and WordNet to derive a rich hierarchical ontology with high precision and recall. The system contains only 14 relations (predicates), but they are more specific and not directly comparable to the DBpedia properties. YAGO also uses a slightly different data model, convertible to RDF, that has some other interesting aspects.

[8] See also this AI3 blog’s Sweet Tools and sort by browser to see additional candidates.

[9] A nice variety of other query examples in relation to various tools is available from the DBpedia site.

[10] According to the Ontoworld wiki, “A SPARQL endpoint is a conformant SPARQL protocol service as defined in the SPROT specification. A SPARQL endpoint enables users (human or other) to query a knowledge base via the SPARQL language. Results are typically returned in one or more machine-processable formats. Therefore, a SPARQL endpoint is mostly conceived as a machine-friendly interface towards a knowledge base.”

[11] Please refer to the Developers Guide to Semantic Web Toolkits or Sweet Tools to find a development toolkit in your preferred programming language to process DBpedia data.

[12] Actually, the pace of this data increase is amazing. For example, in the short couple of days I was working on this write-up, I received an email from Kingsley Idehen noting that OpenLink Software had created another access portal to the data that now included DBpedia, 2005 US Census Data, all of the data crawled by PingTheSemanticWeb, and Musicbrainz (see above), plus full-text indexing of same! BTW, this broader release can be found at Whew! It’s hard to keep up.

Posted:March 26, 2007

Kevin Kelly“Technology is accelerating evolution; it’s accelerating the way in which we search for ideas” - Kevin Kelly

TED talks are one of the highest quality sources of video podcasts available on the Web; generally all are guaranteed to stimulate thought and expand your horizons. For example, my August 2006 posting on Data Visualization at Warp Speed by Hans Rosling of Gapminder, just acquired by Google, came from an earlier TED (Technology, Entertainment, Design) conference. The other thing that is cool about TED talks is that they are a perfect expression of how the Internet is democratizing the access to ideas and innovation: you now don’t have to be one of the chosen 1000 invited to TED to hear these great thoughts. Voices of one are being brought to conversations with millions.

One of the recent TED talk releases is from Wired‘s founding executive editor and technology pundit, Kevin Kelly (see especially his Technium blog). In this Feb. 2005 talk released in late 2006, Kevin argues for the notable similarities between the evolution of biology and technology, ultimately declaring technology the “7th kingdom of life.” (An addition to the standard six recognized kingdoms of life of plants, animals, etc.; see figure below.) You can see this 20-minute video yourself by clicking on Kelly’s photo above or following this link.

Longstanding readers or those who have read my Blogasbörd backgrounder in which I define the title of this blog, Adaptive Information, know of my interest in all forms of information and its role in human adaptability:

First through verbal allegory and then through written symbols and communications, humans have broken the boundaries of genetic information to confer adaptive advantage. The knowledge of how to create and sustain fire altered our days (and nights), consciousness and habitat range. Information about how to make tools enabled productivity, the creation of wealth, and fundamentally altered culture. Unlike for other organisms, the information which humans have to adapt to future conditions is not limited to the biological package and actually is increasing (and rapidly!) over time. No organism but humans has more potentially adaptive information available to future generations than what was present in the past. Passing information to the future is no longer limited to sex. Indeed, it can be argued that the fundamental driver of human economic activity is the generation, survival and sustenance of adaptive information.

Thus, information, biological or cultural, is nothing more than an awareness about circumstances in the past that might be used to succeed or adapt to the future. Adaptive information is that which best survives into the future and acts to sustain itself or overcome environmental instability.

Adaptive information, like the beneficial mutation in genetics, is that which works to sustain and perpetuate itself into the future.

Kelly takes a similar perspective, but expressed more as the artifact of human information embodied in things and processes — that is, technology. As a way of understanding the importance and role of technology, Kelly asks, “What does technology want?” By asking this question, Kelly uses the trick Richard Dawkins took in The Selfish Gene of analyzing evolutionary imperatives.

Kelly also likens technology to life. He looks at five of the defining characteristics of life — ubiquity, diversity, complexity, specialization and socialization — and finds similar parallels in technology. However, unlike life, Kelly observes that technologies don’t die, even ones that might normally be considered “obsolete.” These contrasts and themes are not dissimilar from my own perspective about the cumulative and permanent nature of human, as opposed to biological, information.

In one sense, this makes Kelly appear to be a technology optimist, or whatever the opposite of a Luddite might be called. But, I think more broadly, that Kelly’s argument is that the uniqueness of humans to overcome biological limits of information makes its expression — technology — the uniquely defining characteristic of our humanness and culture. So, rather than optimism or pessimism, Kelly’s argument is ultimately a more important one of essence.

In this regard, Kelly is able to persuasively argue that technology brings us that unique prospect of a potentially purposeful future that we capture through such words as choice, freedom, opportunity, possibility. Technology is adaptive. What makes us human is technology, and more technology perhaps even makes us more “human.” As Kelly passionately concludes:

“Somewhere today there are millions of young children being born whose technology of self-expression has not yet been invented. We have a moral obligation to invent technology so that every person on the globe has the potential to realize their true difference.”


[BTW, Today, a new 1-hr podcast interview with Kevin Kelly was released by EconTalk; it is a more current complement to the more thought-provoking video above, including relevant topics to this blog of Web 3.0 and the semantic Web. I recommend you choose the download version, rather than putting up with the spotty real-time server.]

Posted by AI3's author, Mike Bergman Posted on March 26, 2007 at 10:04 am in Adaptive Information, Semantic Web | Comments (0)
The URI link reference to this post is:
The URI to trackback this post is:
Posted:March 19, 2007

Jewels & Doubloons

If Everyone Could Find These Tools, We’d All be Better Off

About a month ago I announced my Jewels & Doubloons Jewels & Doubloons awards for innovative software tools and developments, most often ones that may be somewhat obscure. In the announcement, I noted that our modern open-source software environment is:

“… literally strewn with jewels, pearls and doubloons — tremendous riches based on work that has come before — and all we have to do is take the time to look, bend over, investigate and pocket those riches.”

That entry begged the question of why this value is often overlooked in the first place. If we know it exists, why do we continue to miss it?

The answers to this simple question are surprisingly complex. The question is one I have given much thought to, since the benefits from building off of existing foundations are manifest. I think the reasons for why we so often miss these valuable, and often obscure, tools of our profession range from ones of habit and culture to weaknesses in today’s search. I will take this blog’s Sweet Tools listing of 500 semantic Web and -related tools as an illustrative endpoint of what a complete listing of such tools in a given domain might look like (not all of which are jewels, of course!), including the difficulties of finding and assembling such listings. Here are some reasons:

Search Sucks — A Clarion Call for Semantic Search

I recall late in 1998 when I abandoned my then-favorite search engine, AltaVista, for the new Google kid on the block and its powerful Page Ranking innovation. But that was tens of billions of documents ago, and I now find all the major search engines to again be suffering from poor results and information overload.

Using the context of Sweet Tools, let’s pose some keyword searches in an attempt to find one of the specific annotation tools in that listing, Annozilla. We’ll also assume we don’t know the name of the product (otherwise, why search?). We’ll also use multiple search terms, and since we know there are multiple tool types in our category, we will also search by sub-categories.

In a first attempt using annotation software mozilla, we do not find Annozilla in the first 100 results. We try adding more terms, such as annotation software mozilla “semantic web”, and again it is not in the first 100 results.

Of course, this is a common problem with keyword searches when specific terms or terms of art may not be known or when there are many variants. However, even if we happened to stumble upon one specific phrase used to describe Annozilla, “web annotation tool”, while we do now get a Google result at about position #70, it is also not for the specific home site of interest:

Google Results - 1

Now, we could have entered annozilla as our search term, assuming somehow we now knew it as a product name, which does result in getting the target home page as result #1. But, because of automatic summarization and choices made by the home site, even that description is also a bit unclear as to whether this is a tool or not:

Google Results - 2

Alternatively, had we known more, we could have searched on Annotea Mozilla and gotten pretty good results, since that is what Annozilla is, but that presumes a knowledge of the product we lack.

Standard search engines actually now work pretty well in helping to find stuff for which you already know a lot, such as the product or company name. It is when you don’t know these things that the weaknesses of conventional search are so evident.

Frankly, were our content to be specified by very minor amounts of structure (often referred to as “facets”) such as product and category, we could cut through this clutter quickly and get to the results we wanted. Better still, if we could also specify only listings added since some prior date, we could also limit our inspections to new tools since our last investigation. It is this type of structure that characterizes the lightweight Exhibit database and publication framework underlying Sweet Tools itself, as its listing for Annozilla shows:

Structured Results

The limitations of current unstructured search grow daily as Internet content volumes grows.

We Don’t Know Where to Look

The lack of semantic search also relates to the problem of not knowing where to look, and derives from the losing trade-offs of keywords v. semantics and popularity v. authoritativeness. If, for example, you look for Sweet Tools on Google using “semantic web” tools, you will find that the Sweet Tools listing only appears at position #11 with a dated listing, even though arguably it has the most authoritative listing available. This is because there are more popular sites than the AI3 site, Google tends to cluster multiple site results using the most popular — and generally, therefore, older (250 v. 500 tools in this case!) — page for that given site, and the blog title is used in preference to the posting title:

Google Sweet Tools

Semantics are another issue. It is important, in part, because you might enter the search term product or products or software or applications, rather than ‘tools‘, which is the standard description for the Sweet Tools site. The current state of keyword search is to sometimes allow plural and single variants, but not synonyms or semantic variants. The searcher must thus frame multiple queries to cover all reasonable prospects. (If this general problem is framed as one of the semantics for all possible keywords and all possible content, it appears quite large. But remember, with facets and structure it is really those dimensions that best need semantic relationships — a more tractable problem than the entire content.)

We Don’t Have Time

Faced with these standard search limits, it is easy to claim that repeated searches and the time involved are not worth the effort. And, even if somehow we could find those obscure candidate tools that may help us better do our jobs, we still need to evaluate them and modify them for our specific purposes. So, as many claim, these efforts are not worth our time. Just give me a clean piece of paper and let me design what we need from scratch. But this argument is total bullpucky.

Yes, search is not as efficient as it should be, but our jobs involve information, and finding it is one of our essential work skills. Learn how to search effectively.

The time spent in evaluating leading candidates is also time well spent. Studying code is one way to discern a programming answer. Absent such evaluation, how does one even craft a coded solution? No matter how you define it, anything but the most routine coding tasks requires study and evaluation. Why not use existing projects as the learning basis, in addition to books and Internet postings? If, in the process, an existing capability is found upon which to base needed efforts, so much the better.

The excuse of not enough time to look for alternatives is, in my experience, one of laziness and attitude, not a measured evaluation of the most effective use of time.

Concern Over the Viral Effects of Certain Open Source Licenses

Enterprises, in particular, have legitimate concerns in the potential “viral” effects of mixing certain open-source licenses such as GPL with licensed proprietary software or internally developed code. Enterprise developers have a professional responsibility to understand such issues.

That being said, my own impression is that many open-source projects understand these concerns and are moving to more enlightened mix-and-match licenses such as Eclipse, Mozilla or Apache. Also, in virtually any given application area, there is also a choice of open-source tools with a diversity of licensing terms. And, finally, even for licenses with commercial restrictions, many tools can still be valuable for internal, non-embedded applications or as sources for code inspection and learning.

Though the license issue is real when it comes to formal deployment and requires understanding of the issues, the fact that some open source projects may have some use limitations is no excuse to not become familiar with the current tools environment.

We Don’t Live in the Right Part of the World

Actually, I used to pooh-pooh the idea that one needed to be in one of the centers of software action — say, Silicon Valley, Boston, Austin, Seattle, Chicago, etc. — in order to be effective and on the cutting edge. But I have come to embrace a more nuanced take on this. There is more action and more innovation taking place in certain places on the globe. It is highly useful for developers to be a part of this culture. General exposure, at work and the watering hole, is a great way to keep abreast of trends and tools.

However, even if you do not work in one of these hotbeds, there are still means to keep current; you just have to work at it a bit harder. First, you can attend relevant meetings. If you live outside of the action, that likely means travel on occasion. Second, you should become involved in relevant open source projects or other dynamic forums. You will find that any time you need to research a new application or coding area, that the greater familiarity you have with the general domain the easier it will be for you to get current quickly.

We Have Not Been Empowered to Look

Dilbert, cubes and big bureaucracies aside, while it may be true that some supervisors are clueless and may not do anything active to support tools research, that is no excuse. Workers may wait until they are “empowered” to take initiative; professionals, in the true sense of the word, take initiative naturally.

Granted, it is easier when an employer provides the time, tools, incentives and rewards for its developers to stay current. Such enlightened management is a hallmark of adaptive and innovative organizations. And it is also the case that if your organization is not supporting research aims, it may be time to get that resume up to date and circulated.

But knowledge workers today should also recognize that responsibility for professional development and advancement rests with them. It is likely all of us will work for many employers, perhaps even ourselves, during our careers. It is really not that difficult to find occasional time in the evenings or the weekend to do research and keep current.

If It’s Important, Become an Expert

One of the attractions of software development is the constantly advancing nature of its technology, which is truer than ever today. Technology generations are measured in the range of five to ten years, meaning that throughout an expected professional lifetime of say, about 50 years, you will likely need to remake yourself many times.

The “experts” of each generation generally start from a clean slate and also re-make themselves. How do they do so and become such? Well, they embrace the concept of lifelong learning and understand that expertise is solely a function of commitment and time.

Each transition in a professional career — not to mention needs that arise in-between — requires getting familiar with the tools and techniques of the field. Even if search tools were perfect and some “expert” out there had assembled the best listing of tools available, they can all be better characterized and understood.

It’s Mindset, Melinda!

Actually, look at all of the reasons above. They all are based on the premise that we have completely within our own lights the ability and responsibility to take control of our careers.

In my professional life, which I don’t think is all that unusual, I have been involved in a wide diversity of scientific and technical fields and pursuits, most often at some point being considered an “expert” in a part of that field. The actual excitement comes from the learning and the challenges. If you are committed to what is new and exciting, there is much room for open-field running.

The real malaise to avoid in any given career is to fall into the trap of “not enough time” or “not part of my job.” The real limiter to your profession is not time, it is mindset. And, fortunately, that is totally within your control.

Gathering in the Riches

Since each new generation builds on prior ones, your time spent learning and becoming familiar with the current tools in your field will establish the basis for that next change. If more of us had this attitude, the ability for each of us to leverage whatever already exists would be greatly improved. The riches and rewards are waiting to be gathered.

Posted:February 19, 2007

Jewels & Doubloons

We’re So Focused on Plowing Ahead We Often Don’t See The Value Around Us

For some time now I have been wanting to dedicate a specific category on this blog to showcasing tools or notable developments. It is clear that tools compilations for the semantic Web — such as the comprehensive Sweet Tools listing — or new developments have become one focus of this site. But as I thought about this focus, I was not really pleased with the idea of a simple tag of “review” or “showcase” or anything of that sort. The reason such terms did not turn my crank was my own sense that the items that were (and are) capturing my interest were also items of broader value.

Please, don’t get me wrong. One (among many) observations within the past few months has been the amazing diversity, breadth and number of communities, and most importantly, the brilliance and innovation that I was seeing. My general sense in this process of discovery is that I have kind of stumbled blindly into many existing — and sometimes mature — communities that have existed for some time, but for which I was not part of nor privy to their insights and advances. These communities are seemingly endless and extend to topics such as semantic Web and its constituent components, Web 2.0, agile development, Ruby, domain-specific languages, behavior driven development, Ajax, JavaScript frameworks and toolkits, Rails, extractors/wrappers/data mining, REST, why the lucky stiff, you name it.

Announcing ‘Jewels & Doubloons’

As I have told development teams in the past, as you cross the room to your workstation each morning look down and around you. The floor is literally strewn with jewels, pearls and doubloons –tremendous riches based on work that has come before — and all we have to do is take the time to look, bend over, investigate and pocket those riches. It is that metaphor, plus in honor of Fat Tuesday tomorrow, that I name my site awards ‘Jewels & Doubloons.’

Jewels & Doubloons (or J & D for short) may get awarded to individual tools, techniques, programming frameworks, screencasts, seminal papers and even blog entries — in short, anything that deserves bending over, inspecting and taking some time with, and perhaps even adopting. In general, the items so picked will be more obscure (at least to me, though they may be very well known to their specific communities), but what I feel to be of broader cross-community interest. Selection is not based on anything formal.

Why So Many Hidden Riches?

I’ll also talk on occasion as to why these riches of such potential advantage and productivity to the craft of software development may be so poorly known or overlooked by the general community. In fact, while many can easily pick up the mantra of adhering to DRY, perhaps as great of a problem is NIH — reinventing a software wheel due to pride, ignorance, discontent, or simply the desire to create for creation’s sake. Each of these reasons can cause the lack of awareness and thus lack of use of existing high value.

There are better ways and techniques than others to find and evaluate hidden gems. One of the first things any Mardi Gras partygoer realizes is not to reach down with one’s hand to pick up the doubloons and plastic necklaces flung from the krewes’ floats. Ouch! and count the loss of fingers! Real swag aficionados at Mardi Gras learn how to air snatch and foot stomp the manna dropping from heaven. Indeed, with proper technique, one can end up with enough necklaces to look like a neon clown and enough doubloons to trade for free drinks and bar top dances. Proper technique in evaluating Jewels & Doubloons is one way to keep all ten fingers while getting rich the lazy man’s way.
Jewels & Doubloons are designated with either a medium-sized Jewels & Doubloons or small-sized (see below) icon and also tagged as such.

Past Winners

I’ve gone back over the posts on AI3 and have postdated J & D awards for these items:

Posted:February 5, 2007

Collex is the Next Example in a Line of Innovative Tools from the Humanities

I seem to be on a string of discovery of new tools from unusual sources — that is, at least, unusual for me. For some months now I have been attempting to discover the “universe” of semantic Web tools, beginning obviously with efforts that self-label in that category. (See my ongoing Sweet Tools comprehensive listing of semantic Web and related tools.) Then, it was clear that many “Web 2.0″ tools potentially contribute to this category via tagging, folksonomies, mashups and the like. I’ve also been focused on language processing tools that relate to this category in other ways (a topic for another day.) Most recently, however, I have discovered a rich vein of tools in areas that take pragmatic approaches to managing structure and metadata, but often with little mention of the semantic Web or Web 2.0. And in that vein, I continue to encounter impressive technology developed within the humanities and library science (see, for example, the recent post on Zotero).

To many of you, the contributions from these disciplines have likely been obvious for years. I admit I’m a slow learner. But I also suspect there is much that goes on in fields outside our normal ken. My own mini-epiphany is that I also need to be looking at the pragmatists within many different communities — some of whom eschew the current Sem Web and Web 2.0 hype — yet are actually doing relevant and directly transferable things within their own orbits. I have written elsewhere about the leadership of physicists and biologists in prior Internet innovations. I guess the thing that has really surprised me most recently is the emerging prominence of the humanities (I feel like the Geico caveman saying that).

Collex is the Next in A Fine Legacy

The latest discovery is Collex, a set of tools for COLLecting and EXhibiting information in the humanities. According to Bethany Nowviskie, a lecturer in media studies at the University of Virginia, and a lead designer of the effort in her introduction, COLLEX: Semantic Collections & Exhibits for the Remixable Web:

Collex is a set of tools designed to aid students and scholars working in networked archives and federated repositories of humanities materials: a sophisticated collections and exhibits mechanism for the semantic web. It allows users to collect, annotate, and tag online objects and to repurpose them in illustrated, interlinked essays or exhibits. Collex functions within any modern web browser without recourse to plugins or downloads and is fully networked as a server-side application. By saving information about user activity (the construction of annotated collections and exhibits) as “remixable” metadata, the Collex system writes current practice into the scholarly record and permits knowledge discovery based not only on the characteristics or “facets” of digital objects, but also on the contexts in which they are placed by a community of scholars. Collex builds on the same semantic web technologies that drive MIT’s SIMILE project and it brings folksonomy tagging to trusted, peer-reviewed scholarly archives. Its exhibits-builder is analogous to high-end digital curation tools currently affordable only to large institutions like the Smithsonian. Collex is free, generalizable, and open source and is presently being implemented in a large-scale pilot project under the auspices of NINES.

(BTW, NINES stands for the Networked Infrastructure for Nineteenth-century Electronic Scholarship, a trans-Atlantic federation of scholars.)

The initial efforts that became Collex were to establish frameworks and process within this community, not tools. But the group apparently recognized the importance of leverage and enablers (i.e, tools) and hired Erik Hatcher, a key contributor to the Apache open-source Lucene text-indexing engine and co-author of Lucene in Action, to spearhead development of an actually usable tool. Erik proceeded to grab best-of-breed stuff in area such as Ruby and Rails and Solr (a faceted enhancement to Lucene that has just graduated from the Apache incubator), and then to work hard on follow-on efforts such as Flare (a display framework) to create the basics of Collex. A sample screenshot of the application is shown below:

The Collex app is still flying under the radar, but it has sufficient online functionality today to support annotation, faceting, filtering, display, etc. Another interesting aspect of the NINES project (but not apparently a programmatic capability of the Collex software itself) is it only allows “authoritative” community listings, an absolute essential for scaling the semantic Web.

You can play with the impressive online demo of the Collex faceted browser at the NINES Web site today, though clearly the software is still undergoing intense development. I particularly like its clean design and clear functionality. The other aspect of this software that deserves attention is that it is a server-side option with cross-browser Ajax, without requiring any plugins. It works equally within Safari, Firefox and Windows IE. And, like the Zotero research citation tool, this basic framework could easily lend itself to managing structured information in virtually any other domain.

Collex is one of the projects of Applied Research in Patacriticism, a software development research team located at the University of Virginia and funded through an award to professor Jerome McGann from the Andrew Mellon Foundation. (“Shuuu, sheeee. Impressive. Most impressive.” — Darth Vader)

(BTW, forgive my sexist use of “guys” in this post’s title; I just couldn’t get a sex-neutral title to work as well for me!)

Jewels & Doubloons An AI3 Jewels & Doubloon Winner