Posted: March 19, 2007

Jewels & Doubloons

If Everyone Could Find These Tools, We’d All be Better Off

About a month ago I announced my Jewels & Doubloons awards for innovative software tools and developments, often somewhat obscure ones. In the announcement, I noted that our modern open-source software environment is:

“… literally strewn with jewels, pearls and doubloons — tremendous riches based on work that has come before — and all we have to do is take the time to look, bend over, investigate and pocket those riches.”

That entry raised the question of why this value is overlooked in the first place. If we know it exists, why do we continue to miss it?

The answers to this simple question are surprisingly complex. It is a question I have given much thought to, since the benefits of building off existing foundations are manifest. I think the reasons we so often miss these valuable, and often obscure, tools of our profession range from habit and culture to weaknesses in today’s search. I will take this blog’s Sweet Tools listing of 500 semantic Web and related tools as an illustrative endpoint of what a complete listing of such tools in a given domain might look like (not all of which are jewels, of course!), including the difficulties of finding and assembling such listings. Here are some reasons:

Search Sucks — A Clarion Call for Semantic Search

I recall late in 1998 when I abandoned my then-favorite search engine, AltaVista, for the new Google kid on the block and its powerful PageRank innovation. But that was tens of billions of documents ago, and I now find all the major search engines to again be suffering from poor results and information overload.

Using the context of Sweet Tools, let’s pose some keyword searches in an attempt to find one of the specific annotation tools in that listing, Annozilla. We’ll assume we don’t know the name of the product (otherwise, why search?), so we’ll use multiple search terms and, since we know there are multiple tool types in our category, also search by sub-categories.

In a first attempt using annotation software mozilla, we do not find Annozilla in the first 100 results. We try adding more terms, such as annotation software mozilla “semantic web”, and again it is not in the first 100 results.

Of course, this is a common problem with keyword searches when specific terms or terms of art may not be known or when there are many variants. And even if we happen to stumble upon one specific phrase used to describe Annozilla, "web annotation tool", we only get a Google result at about position #70, and it is not for the specific home site of interest:

Google Results - 1

Now, we could have entered annozilla as our search term, assuming somehow we knew it as a product name, which does return the target home page as result #1. But, because of automatic summarization and choices made by the home site, even that description is a bit unclear as to whether this is a tool at all:

Google Results - 2

Alternatively, had we known more, we could have searched on Annotea Mozilla and gotten pretty good results, since that is what Annozilla is, but that presumes a knowledge of the product we lack.

Standard search engines actually now work pretty well in helping to find stuff for which you already know a lot, such as the product or company name. It is when you don’t know these things that the weaknesses of conventional search are so evident.

Frankly, were our content specified by very minor amounts of structure (often referred to as "facets") such as product and category, we could cut through this clutter quickly and get to the results we wanted. Better still, if we could specify only listings added since some prior date, we could limit our inspections to tools new since our last investigation. It is this type of structure that characterizes the lightweight Exhibit database and publication framework underlying Sweet Tools itself, as its listing for Annozilla shows:

Structured Results
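To make the idea concrete, here is a minimal JavaScript sketch of faceted filtering over a few hypothetical Sweet Tools-style records. The field names, categories and dates are illustrative only and are not Exhibit's actual data model; the point is simply that two or three structured dimensions do the work that a pile of keyword guesses cannot:

    // A handful of hypothetical, structured tool records (illustrative values only).
    var sweetTools = [
      { name: "Annozilla",  category: "Annotator",                      dateAdded: "2006-08-14" },
      { name: "Jena",       category: "Programming Environment",        dateAdded: "2006-09-02" },
      { name: "Piggy Bank", category: "Browser (RDF, OWL or semantic)", dateAdded: "2007-01-20" }
    ];

    // Keep only records whose facet values match; ISO-style dates compare correctly as strings.
    function filterByFacets(records, facets) {
      var matches = [];
      for (var i = 0; i < records.length; i++) {
        var r = records[i];
        if ((!facets.category || r.category === facets.category) &&
            (!facets.addedSince || r.dateAdded >= facets.addedSince)) {
          matches.push(r);
        }
      }
      return matches;
    }

    // "Show me annotation tools added since my last check" -- one query, no keyword guessing.
    var hits = filterByFacets(sweetTools, { category: "Annotator", addedSince: "2006-06-01" });
    // hits -> [ { name: "Annozilla", ... } ]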

The limitations of current unstructured search grow daily as Internet content volume grows.

We Don’t Know Where to Look

The problem of not knowing where to look is related to the lack of semantic search, and derives from the losing trade-offs of keywords v. semantics and popularity v. authoritativeness. If, for example, you look for Sweet Tools on Google using "semantic web" tools, you will find that the Sweet Tools listing only appears at position #11, and with a dated listing, even though it is arguably the most authoritative listing available. This is because there are more popular sites than the AI3 site; Google tends to cluster multiple results from a site under its most popular — and generally, therefore, older (250 v. 500 tools in this case!) — page; and the blog title is used in preference to the posting title:

Google Sweet Tools

Semantics are another issue, important in part because you might enter the search term product or products or software or applications, rather than 'tools', which is the standard description for the Sweet Tools site. The current state of keyword search is to sometimes allow plural and singular variants, but not synonyms or semantic variants. The searcher must thus frame multiple queries to cover all reasonable prospects. (If this general problem is framed as one of the semantics for all possible keywords and all possible content, it appears quite large. But remember, with facets and structure it is really those dimensions that most need semantic relationships — a more tractable problem than the entire content.)
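As a rough sketch of that last point (hypothetical names throughout, not anything the Sweet Tools site actually runs), a small synonym map over facet values is enough to let product, software or applications all resolve to the site's canonical "tools" facet. Reconciling a handful of facet values is a far smaller job than reconciling all possible content:

    // Map a searcher's term onto the site's canonical facet value (illustrative only).
    var facetSynonyms = {
      "product": "tools", "products": "tools",
      "software": "tools",
      "application": "tools", "applications": "tools"
    };

    function canonicalFacet(term) {
      var t = term.toLowerCase();
      return facetSynonyms[t] || t;   // fall back to the term itself if no synonym is known
    }

    canonicalFacet("Applications");   // -> "tools"
    canonicalFacet("tools");          // -> "tools"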

We Don’t Have Time

Faced with these standard search limits, it is easy to claim that repeated searches and the time involved are not worth the effort. And, even if somehow we could find those obscure candidate tools that may help us better do our jobs, we still need to evaluate them and modify them for our specific purposes. So, as many claim, these efforts are not worth our time. Just give me a clean piece of paper and let me design what we need from scratch. But this argument is total bullpucky.

Yes, search is not as efficient as it should be, but our jobs involve information, and finding it is one of our essential work skills. Learn how to search effectively.

The time spent in evaluating leading candidates is also time well spent. Studying code is one way to discern a programming answer. Absent such evaluation, how does one even craft a coded solution? No matter how you define it, anything but the most routine coding task requires study and evaluation. Why not use existing projects as the learning basis, in addition to books and Internet postings? If, in the process, an existing capability is found upon which to base needed efforts, so much the better.

The excuse of not enough time to look for alternatives is, in my experience, one of laziness and attitude, not a measured evaluation of the most effective use of time.

Concern Over the Viral Effects of Certain Open Source Licenses

Enterprises, in particular, have legitimate concerns about the potential "viral" effects of mixing certain open-source licenses such as the GPL with licensed proprietary software or internally developed code. Enterprise developers have a professional responsibility to understand such issues.

That being said, my own impression is that many open-source projects understand these concerns and are moving to more enlightened mix-and-match licenses such as Eclipse, Mozilla or Apache. In virtually any given application area, there is also a choice of open-source tools with a diversity of licensing terms. And, finally, even for licenses with commercial restrictions, many tools can still be valuable for internal, non-embedded applications or as sources for code inspection and learning.

Though the license issue is real when it comes to formal deployment and requires understanding, the fact that some open source projects carry use limitations is no excuse not to become familiar with the current tools environment.

We Don’t Live in the Right Part of the World

Actually, I used to pooh-pooh the idea that one needed to be in one of the centers of software action — say, Silicon Valley, Boston, Austin, Seattle, Chicago, etc. — in order to be effective and on the cutting edge. But I have come to embrace a more nuanced take on this. There is more action and more innovation taking place in certain places on the globe. It is highly useful for developers to be a part of this culture. General exposure, at work and the watering hole, is a great way to keep abreast of trends and tools.

However, even if you do not work in one of these hotbeds, there are still means to keep current; you just have to work at it a bit harder. First, you can attend relevant meetings. If you live outside of the action, that likely means travel on occasion. Second, you should become involved in relevant open source projects or other dynamic forums. You will find that, any time you need to research a new application or coding area, the greater your familiarity with the general domain, the easier it is to get current quickly.

We Have Not Been Empowered to Look

Dilbert, cubes and big bureaucracies aside, while it may be true that some supervisors are clueless and may not do anything active to support tools research, that is no excuse. Workers may wait until they are “empowered” to take initiative; professionals, in the true sense of the word, take initiative naturally.

Granted, it is easier when an employer provides the time, tools, incentives and rewards for its developers to stay current. Such enlightened management is a hallmark of adaptive and innovative organizations. And it is also the case that if your organization is not supporting research aims, it may be time to get that resume up to date and circulated.

But knowledge workers today should also recognize that responsibility for professional development and advancement rests with them. It is likely all of us will work for many employers, perhaps even ourselves, during our careers. It is really not that difficult to find occasional time in the evenings or the weekend to do research and keep current.

If It’s Important, Become an Expert

One of the attractions of software development is the constantly advancing nature of its technology, which is truer than ever today. Technology generations are measured in the range of five to ten years, meaning that throughout an expected professional lifetime of, say, 50 years, you will likely need to remake yourself many times.

The "experts" of each generation generally start from a clean slate and re-make themselves. How do they do so? Well, they embrace the concept of lifelong learning and understand that expertise is solely a function of commitment and time.

Each transition in a professional career — not to mention needs that arise in between — requires getting familiar with the tools and techniques of the field. Even if search tools were perfect and some "expert" out there had assembled the best listing of tools available, those tools could always be better characterized and understood.

It’s Mindset, Melinda!

Actually, look at all of the reasons above. They are all based on the premise that the ability and responsibility to take control of our careers lie completely within our own power.

In my professional life, which I don’t think is all that unusual, I have been involved in a wide diversity of scientific and technical fields and pursuits, most often at some point being considered an “expert” in a part of that field. The actual excitement comes from the learning and the challenges. If you are committed to what is new and exciting, there is much room for open-field running.

The real malaise to avoid in any given career is to fall into the trap of “not enough time” or “not part of my job.” The real limiter to your profession is not time, it is mindset. And, fortunately, that is totally within your control.

Gathering in the Riches

Since each new generation builds on prior ones, your time spent learning and becoming familiar with the current tools in your field will establish the basis for that next change. If more of us had this attitude, the ability for each of us to leverage whatever already exists would be greatly improved. The riches and rewards are waiting to be gathered.

Posted: March 15, 2007

Bill Aron's The Scribe, from http://www.puckergallery.com

OK, you lurkers. You know who you are. You hang out on open source forums, learning, gleaning, scheming . . . . You want to dive in, make contributions, but the truth is that the key developers on the project are really quite remarkable and have programming skills you can’t touch. And, so, you lurk, and maybe occasionally comment.

On the other hand, you are likely a user and implementer. And you probably share with me the observation, which I make frequently here and elsewhere, that one of the biggest weaknesses of most open source projects is their (relative) lack of documentation.

Maybe you’re also no great shakes as a programmer. But you can write! (Actually, overcoming the fear of writing is a lower threshold than overcoming the fear of contributing code to an open source project.) I don’t know if this picture is true for you, but it is true for me.

But there’s hope and there’s a role waiting for you. If you begin contributing documentation to a project, it won’t be long before you start discovering some valuable secrets:

Secret #1. Those developers you admire so much hate to document and will thank you for it. It is actually the rare case where a great developer is also a great writer and communicator.

Secret #2. While you’re afraid of actually touching the code, getting into the app to write about how to use it or its nuances will bring you up the learning chain tremendously! Just as it is a truism that to learn a subject one needs to teach it, to learn about software code one should document it.

Secret #3. If, like me, you are not a natural programmer, then influencing those who are able to tackle some of the development issues of key concern to you is perhaps a more effective use of time than a direct frontal assault on the code itself. Writing documentation is one way to perhaps play to your greater strengths while still making a valuable contribution.

Secret #4. Committed developers behind every worthwhile open source project are typically creative and innovative. It is always rewarding to interact with smart, creative people. Starting to become a documentation cog within a broader open source wheel is personally and socially rewarding.

Finally, the total scope of documentation required by a project includes user support and responses on user forums. If you can pick up some of the slack answering questions from newbies, you will also be doing the overall project a favor by freeing up valuable developer time.

Every open source project worth its promise has way, way more that needs to be done, and that its community desires, than hands to do it. You don’t have to be a world-class code jockey to make a meaningful contribution. So, lurkers and writers unite! Roll up your sleeves and get that quill wet.

Posted by Mike Bergman on March 15, 2007 at 3:50 pm in Open Source, Software Development | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/348/lurkers-and-writers-unite/
The URI to trackback this post is: https://www.mkbergman.com/348/lurkers-and-writers-unite/trackback/
Posted: March 11, 2007

Sweet Tools Listing

AI3’s SemWeb Tools Survey is Now Largely Completed

This AI3 blog maintains Sweet Tools, the largest listing of about 800 semantic Web and -related tools available. Most are open source. Click here to see the current listing!

It has taken nearly six months, but I believe my survey of existing semantic Web and related tools is now largely complete. While new tools will certainly be discovered and new ones are constantly being developed (which, I also believe, is at an accelerating pace), I think the existing backlog has largely been found. Though important gaps remain, today’s picture is one of a surprisingly robust tools environment for promoting semantic Web objectives.

Growth and Tools Characterization

My most recent update of Sweet Tools, also published today, now lists 500 semantic Web and related tools and is in its 8th version. Starting with the W3C’s listing of 70 tools, first referenced in August 2006, I have steadily found and added to the listing. Until now, the predominant source of growth in these listings has come through discovery of extant tools.

In its earliest versions, my search strategy very much focused on all topics directly related to the “semantic Web.” However, as time went on, I came to understand the importance of many ancillary tool sets to the entire semantic Web pipeline (such as language processing and information extraction) and came to find whole new categories of pragmatic tools that embodied semantic Web and data mediation processes but which did not label themselves as such. This latter category has been an especially rich vein to mine, with notable contributions from the humanities, biology and the physical sciences.

But the pace of discovery is now approaching its asymptote. Though I by no means believe I have comprehensively found all extant tools, I do believe that most new tools in future listings will come more from organic growth and new development than discovery of hidden gems. So, enjoy!

My view of what is required for the semantic Web vision to reach some degree of fruition begins with uncharacterized content, which then proceeds through a processing pipeline ultimately resulting in the storage of RDF triples that can be managed at scale. By necessity, such a soup-to-nuts vision embraces tools and requirements that, individually, might not constitute semantic technology strictly defined, but is nonetheless an integral part of the overall pipeline. By (somewhat arbitrary) category, here is the breakdown of the current listing of 500 tools:

No. of Tools / Category
43 Information Extraction
32 Ontology (general)
30 Parser or Converter
29 Composite App/Framework
29 Database/Datastore
26 Annotator
25 Programming Environment
23 Browser (RDF, OWL or semantic)
23 Language Processor
22 Reasoner/Inference Engine
22 Wiki- or blog-related
22 Wrapper (Web data extractor)
20 RDF (general)
19 Search Engine
15 Visualization
13 Query Language or Service
11 Ontology Mapper/Mediator
9 Ontology Editor
8 Data Language
8 Validator
6 NOT ACTIVE (???)
5 Semantic Desktop
4 Harvester
3 Description or Formal Logics
3 RDF Editor
2 RDF Generator
48 Miscellaneous
500 (Total)

I find it amusing that the diversity and sources of such tool listings — importantly, including what is properly in the domain or not — are themselves an interesting example of the difficulties facing semantic mediation and resolution. Alas, such is the real world.

Java is the Preferred Language

There are interesting historical trends as well in the primary development languages used for these tools. Older ones rely on C or C++ or, if they are logic or inference oriented, on the expected languages of Prolog or LISP.

One might be tempted to say that Java is the language of the semantic Web with about 50% of all tools, especially the more developed and prominent ones that embrace a broader spectrum of needs, but I’m not so sure. I’m seeing a trend in the more recently announced tools to use JavaScript or Ruby — these deserve real attention. And while the P languages (Perl, PHP, Python) are also showing some strength, it is not clear that this is anything specific to semantic Web needs but a general reflection of standard Web trends.

So, here is the listing of Sweet Tools apps by primary development language:

Sweet Tools Languages

About half of all apps are written in Java. The next most prevalent language is JavaScript, at 13%, roughly twice the share of the next leading choices, C/C++ and PHP, at about 6% each. As might be expected, the "major" apps are more likely to be written in Java or the C languages; user interface emphases tend to occur in the P languages; browser extensions or add-ons are generally in JavaScript; and logic applications are often in Lisp or Prolog.

An Alternative Simple Listing

I have also created and will maintain a simple listing of Sweet Tools that presents all 500 tools on a single page with live links and each tool’s category. This listing gives users a single access point and compensates for the fact that the Exhibit presentation is based on JavaScript, which virtually no search engine indexes adequately.

Posted by Mike Bergman on March 11, 2007 at 7:15 pm in Semantic Web Tools | Comments (4)
The URI link reference to this post is: https://www.mkbergman.com/347/listing-of-500-semantic-web-and-related-tools/
The URI to trackback this post is: https://www.mkbergman.com/347/listing-of-500-semantic-web-and-related-tools/trackback/
Posted: March 9, 2007

Firebug is hardly a secret weapon for aiding serious JavaScript development, but effectively learning what it is and how it works has been somewhat of a secret. That is, until now.

Firebug is a Firefox add-on that integrates development tools to edit, debug, and monitor CSS, HTML, and JavaScript live in any Web page. You can inspect and edit HTML, tweak CSS and styles, visualize CSS metrics with DOM highlighting, monitor page and JavaScript loading, debug and profile JavaScript, execute JavaScript on the fly and log its functions, quickly find errors, and explore the DOM. There is no more powerful tool to explain to you what is going on within your browser.
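As a small taste of what that means in practice, the lines below exercise a few of the console and debugging hooks Firebug understands (the variables and function are placeholders of my own; see the Firebug documentation for the full API):

    // Hypothetical values, just to have something to log and time.
    var user = "demo";
    var items = [1, 2, 3];
    function renderPage() { for (var i = 0; i < 100000; i++) { /* busy work */ } }

    // Log values, with printf-style substitution, to the Firebug console.
    console.log("loaded %d items for %s", items.length, user);

    // Time a block of code; timeEnd() reports the elapsed milliseconds.
    console.time("render");
    renderPage();
    console.timeEnd("render");

    // Drop into the Firebug script debugger at this statement.
    debugger;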

But the problem to date has been discovering and understanding many of the rich aspects of this tool. Firebug’s innovative developer, Joe Hewitt, presented the tool to Yahoo! developers about a month ago in this 48-minute screencast:

Joe Hewitt introduces Firebug to a developer audience at Yahoo!.

Source: Yahoo! Video (Flash) | download.yahoo.com (MP4; recommended)

This video provides a real power user’s tour of the recently released version 1.0 of Firebug. It is a must-see for Web developers. The download is advisable to better see Joe’s screen examples.

Joe is a Mozilla developer who also wrote the original Mozilla DOM Inspector, and a co-founder of Parakey, which I profiled in an earlier post.

Oh, by the way: Firebug is a most deserving winner of the AI3 Jewels & Doubloons award.

An AI3 Jewels & Doubloons Winner

Posted by Mike Bergman on March 9, 2007 at 11:40 am in Software Development | Comments (2)
The URI link reference to this post is: https://www.mkbergman.com/345/firebug-from-the-horses-mouth/
The URI to trackback this post is: https://www.mkbergman.com/345/firebug-from-the-horses-mouth/trackback/
Posted: February 26, 2007

Image courtesy of www.webdesign.org

To Reach Its Full Potential, JavaScript Collaboration Needs New Approaches and New Tools

This posting begins from the perspective: I’ve found a nifty JavaScript code base and would like to help extend it. How do I understand the code and where do I begin? It then takes that perspective and offers open source projects and developers some guidance on tools and approaches that might facilitate code collaboration.

Rich Internet applications, browser extensions and add-ons, and the emergence of sophisticated JavaScript frameworks and libraries such as Dojo, YUI and Prototype (among dozens of others, see here for a listing) are belying the language’s past reputation as a “toy.” These very same developments are increasingly leading to open source initiatives around these various sophisticated frameworks and applications. But there can be major barriers to open source collaboration with JavaScript.

JavaScript works great for single developers or single embedded functions. It also is easily malleable with the use of obfuscation, minifying (including white-space reductions), and short variable names to keep the scripts delivered by the server small and opaque. Yet as more JavaScript initiatives become ever more complex and seek the involvement of outside parties, these factors and the nature of the language itself work against easy code understanding.
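A tiny before-and-after illustration (contrived, but representative) shows why a minified script delivered to the browser teaches an outside reader almost nothing:

    // As the developer wrote it:
    function calculateTotal(prices, taxRate) {
      var subtotal = 0;
      for (var i = 0; i < prices.length; i++) {
        subtotal += prices[i];
      }
      return subtotal * (1 + taxRate);
    }

    // As typically delivered after minification -- same behavior, nothing left to learn from:
    function c(a,b){var t=0;for(var i=0;i<a.length;i++){t+=a[i]}return t*(1+b)}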

Open source projects are often notorious for poor, even non-existent, documentation. (I very much like Mike Pope’s statement, "If it isn’t documented, it doesn’t exist.") But the documentation curse is magnified for JavaScript. As Doug Crockford notes, the general state of standard language documentation and guide books is poor. Developers also often learn from others’ code examples, and there is much in the way of JavaScript snippets, especially historical ones, that is atrocious. Even widely adopted frameworks, such as Prototype, have for years been nearly undocumented, a shortcoming now being addressed. Rarely (in fact I know of none for major open source packages) do UML class or package diagrams exist for larger JS apps or frameworks. And, because much JavaScript is delivered to the browser as scripts from the server, embedded code comments are of generally poor quality or totally non-existent. All of these factors translate into the general absence of API documentation for larger apps (though most of the leading frameworks and libraries such as Prototype, YUI and Dojo are now beginning to provide professional documentation).

Some of the JavaScript language trade-offs and decisions hark back to early standards tensions between Microsoft and Netscape (and Sun). Some of JavaScript’s challenges arise from the very design of the language. The prototypal nature of inheritance in the language makes it difficult to trace class hierarchies, or even to discern class structure from standard functions via machine translation or interpreter inspection. Again, for embedded JS or small functions, this is not a general limitation. But as apps grow complex, indeed as a result of the inherent extensibility of the language itself, an undocumented prototypal structure can become a limitation to structural understanding. Also, because the language is loosely typed, it can often be difficult to discern intended datatypes. These language characteristics make it difficult to reverse engineer the code (see below) and therefore to tease out structure or round-trip the code with advanced IDEs.
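The short sketch below, written in the prototypal style common to many frameworks of the day (the names are invented for illustration), shows why: the "class" hierarchy exists only as runtime assignments, so there is little for a static analyzer or reverse-engineering tool to latch onto:

    // "Classes" are ordinary functions; the hierarchy is wired up at runtime.
    function Widget(id) {
      this.id = id;                       // nothing declares what type id should be
    }
    Widget.prototype.render = function () { /* ... */ };

    function Dialog(id, title) {
      Widget.call(this, id);              // a "super" call, by convention only
      this.title = title;
    }
    Dialog.prototype = new Widget();      // inheritance established by assignment, not declaration
    Dialog.prototype.open = function () { /* ... */ };

    // Methods can be bolted on anywhere, long after the "class" is defined.
    Dialog.prototype.close = function () { /* ... */ };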

Larger apps may often mix JavaScript with other languages, such as C++ or Java, in an attempt to segregate responsibilities in the code base according to perceived language strengths. Indeed, with the exception of some more recent complete frameworks such as the Tibco General Interface, most larger apps have mixed code bases. In these designs, open source contributors need to understand multiple languages, and the language interfaces can cause their own challenges and bottlenecks. Alternatively, more structured languages such as Java can be used for initial development, with the code then being translated to JS, as is the case with Rhino or the Google Web Toolkit (GWT). Again, though, external contributors may need multiple language skills to contribute productively.

Thus, some of JavaScript’s challenges to open source collaboration are inherent to the language or structural, others are due to inadequate best practices. I discuss below possible tools in three areas — code documentation, code profiling or reverse engineering, and code diagramming — that could aid a bit in overcoming some of these obstacles.

Understand the Code’s Structure and Model Architecture

In any approach to a new code base, especially one that is complex with many thousands of lines of code and multiple files, it is often useful to start with a visual overview of the architectural structure. Preferably, this understanding includes both the static structure (via class, component or package UML diagrams or equivalent) and data flows through the app. Component views of the file structure are one useful starting point for understanding the physical relationships among the JS, HTML, XHTML or XUL, and possible data files. A next desirable view would be to run the code base through a code visualizer such as MIT’s Relo, which has the usefulness of enabling interactive and gradual discovery and presentation of the class structure of a code base.

“Reverse engineering,” or alternatively round-tripping of code with CASE tools or IDEs, has a fairly rich history, with much of the academic work in computer science occurring in Europe. For structured languages such as Java or C#, there are many reverse engineering tools that will enable rapid construction of UML diagrams. There are some experimental reverse engineering approaches for Web sites and Web pages, including ReversiXML and WebUml, but these are limited to HTML with no JavaScript support and are not production quality. In fact, my investigations have discovered no tools capable for reverse engineering JavaScript code in any manner, let alone to UML.

This perhaps should not be all that surprising given the loose structure of the language. While Ruby is also a loosely typed language and there is a code visualizer for Rails, its operation apparently very much relies on the structure within the Rails scaffolding. And, while one might think that techniques used elsewhere for information extraction or data mining applied to unstructured data could also be adapted to infer the much greater structure in JavaScript, any such inferences still involve a considerable degree of guesswork according to Vineet Sinha, Relo’s developer.

So, what other automated options short of reverse engineering to UML are there for gaining this structural understanding of the code base? Frankly, not many.

One commercial product is JavaScript Reporter. It performs syntax checking, variable type and flow analysis on individual JavaScript files, and adds type analysis across files for a group of related JavaScript files. It does not produce program flowcharts or UML. While one can process JavaScript with the commercial Code Visual 2 Flowchart, the code analysis is limited to a single function at a time (not a complete file let alone multiple files) and the output is limited to a flowchart (no UML) in either a BMP or PNG bitmap format.

The Dojo project is moving fairly aggressively with its JS Linker project, the purpose of which is to process HTML/JavaScript code bases to prepare code for deployment: reducing file size, creating source code documentation, obfuscating source code to protect intellectual property, and gathering source code metrics for analysis and improvement. This effort may evolve to create the basis for UML generation from code, but it is still in the pre-release phases and its potential applicability to non-Dojo libraries is not known.

According to one review, "Reverse engineering is successful if the cost of extracting information about legacy systems plus effort to arrive [at] specifications is less than the cost of starting from scratch." If code generation or reverse engineering is essential, one choice could be to work with either Rhino or GWT directly against the Java source, prior to JavaScript generation. Besides the Java reverse engineering tools above, GWT Designer is a commercial visual designer and bi-directional code generator that is tailored specifically to GWT and Ajax applications. Of course, such an approach would require a commitment to Java development at the time of project initiation.

When All Else Fails, Go Manual

Thus, apparently no automatic UML diagramming tools exist that can be driven from existing JavaScript. So, what is an open source project to do that wants to use UML or architectural diagrams to promote external contributions?

The first thing a large, open source JavaScript project should do is document and annotate its file structure. It is surprising, but in my investigations of relevant JavaScript applications, only Zotero has provided such file documentation (though finding the page itself is somewhat hard). (There are likely others that provide file documentation of which I am not aware, but the practice is certainly not common.)

The next thing a project should do is provide an architectural schematic. Since no automated tools exist for JavaScript, the only option is to construct these diagrams manually. Re-constructing a project’s UML structure, though time-consuming and a high threshold to initial participation, is an excellent way to become familiar with the code base. Once created, such diagrams should be prominently posted for subsequent contributors. (New projects, of course, would be advised to include such diagramming and documentation from the outset.)

Manual diagramming, possibly with UML compliance, next raises the question of which drawing program to use. Like program editors, UML and drawing tools are highly varied and a matter of personal preference. I mostly hate selecting new tools since I only want to learn a few, but learn them well. That means tools represent real investments for me, which also means I need to spend substantial time identifying and then testing the candidates.

Dia is a GTK+ based diagram creation program for Linux, Unix and Windows released under the GPL license. Dia is roughly inspired by the commercial Windows program Visio. It is geared more towards informal diagrams for casual use, but it does have UML drawing templates. Graphviz is a general drawing and modeling program used by some other open source UML programs. I liked this program for its general usefulness and kept it on my system after all evaluations were done, but it is really not automated enough on its own for UML use.

In the UML realm, at one end of the spectrum are "light," simpler tools that have lower learning thresholds and fewer options. Some candidates here are Violet (which is quite clean and can reside in Eclipse, but has very limited formatting options) and UMLet (which does not snap to grid and also has limited color options). Getting a bit more complex, I liked UMLGraph, which allows the declarative specification and drawing of UML class and sequence diagrams. Among its related projects, LightUML integrates UMLGraph into Eclipse and XUG extends UMLGraph into a reverse engineering tool.

BOUML is another free tool, recently released in France. Program start-up is a little strange, requiring a unique ID to be set prior to beginning a project. Besides the relative developmental status of this offering, the English documentation is pretty awkward and hard to understand in some critical areas. StarUML has some interesting features, but is not Eclipse-ready. Its last major upgrade was more than one year ago. It has reverse engineering and code generation for the usual suspects, and has many diagramming options. Program settings are highly parameterized and there is a unique (within open source efforts) template to support Word integration. One quirk is that new class diagrams get provided with the (?) names of the Korean development team behind the project. Like all other UML tools investigated, StarUML does not have any support for JavaScript.

Since I use Eclipse as my general IDE, my actual preference is a UML tool integrated into that environment. There are about 40 or so UML options that support that platform, about a dozen of which are either free (generally "community versions" of commercial products) or open source. I reviewed all of them and installed and tested most.

LightUML is a little tricky to install, since it has external dependencies on Graphviz and the separate UMLGraph, though those are relatively straightforward to install. The biggest problem with LightUML for my purposes was its requirement to work off of Java packages or classes. I next checked out green. With green, UML diagrams are educational and easy to create, and it provides a round-tripping editor. I found it interesting as an academic project but somewhat lacking for my desired outputs. Omondo’s free version of EclipseUML looks quite good, but unfortunately will only work with bona fide Java projects, so it was also eliminated from contention. I found eUML2 to be slow, and it seemed to take over the Eclipse resources. I found a similar thing with Blueprint Software Modeler, which weighs in at 85 MB without the Eclipse SDK, rather silly for the degree of functionality I was seeking.

Papyrus is a free, very new, Eclipse plug-in. It has only 5 standard UML diagrams and sparse documentation, but its implementation so far looks clean and its support for the DI2 (diagram interchange) exchange format is a plus (I’ve kept it on my system and will check in on new releases later). Visual Paradigm for UML has a free version usable in Eclipse, which is professionally packaged, as is common for commercial vendors. However, this edition limits the number of diagram types per project, and the download requires all options and a 117 MB install before the actual free version can be selected; even then, the implications of the various choices are obscure. (These not uncommon tactics are not really bait-and-switch, but oftentimes feel so.) The limit of one diagram type per project eliminated this option from consideration; plus, it is incredibly complicated and a real resource hog. The community edition of MagicDraw UML operates similarly and with similar limitations.

After numerous Eclipse installs and deletes, I finally began to settle on ArgoEclipse. It had a very familiar interface and installed as a standard Eclipse plug-in, was generally positively reviewed by others, and appeared to have the basic features I initially thought necessary. However, after concerted efforts with my chosen JavaScript project, I soon found many limitations. First, it was limited to UML 1.4 and lacked a component UML view that I wanted. I found the interface to be very hard to work with for extended periods. I wanted to easily add new UML profiles (an extension mechanism for UML models) specific to my project and JavaScript, but could find no easy way to do so. Lastly, and what caused me to abandon the framework, I was unable to split large diagrams into multiple tiled pieces for coherent printing or distribution.

None of these tools support JS reverse engineering or round-tripping. However, keeping everything in the Eclipse IDE would aid later steps when adding in-line code documentation (see below) to the code base. On the other hand, because of the lack of integration, and because UML diagramming needed to occur before code commenting, I decided I could forego Eclipse integration for a standalone tool so long as it supported the standard XMI (XML metadata interchange) format.

This decision led me back to again try StarUML, which is the tool I ended up using to complete my UML diagramming. I remain concerned about StarUML’s lack of recent development and reliance on Delphi and some proprietary components that limit it as a full open-source tool. On the other hand, it is extensible via user-added profiles and tags using XML, it is very intuitive to use, and it has very flexible export and diagramming options. Because of strong XMI support, should better options later emerge, there is no risk of lock-in.

Looking to the future, because of my Eclipse environment and the growing availability of JS editors in Eclipse (Aptana, which I use, but also Spket and JSEclipse), one project worth monitoring is UML2, an EMF– (and GMF-) based implementation of the UML 2.x metamodel for the Eclipse platform due out in June. This effort is part of the broader Model Development Tools (MDT) Eclipse initiative. UML2 builds are presently available for testing with other bleeding-edge components, but I have not tested these.

Throughout this process of UML tools investigations, I came to discover a number of preferences and criteria for selecting a toolset:

  1) a free, open-source option;
  2) easy to use and modify, with flexible user preferences;
  3) support for UML 2.x and most or all UML diagrams;
  4) support for XMI import and export;
  5) extensible profiles and frameworks via an XML syntax, preferably with some profile-building utilities;
  6) operation within the Eclipse environment, measured by standard plug-in and update installs;
  7) clean functionality and user interface;
  8) the ability to handle large diagrams, particularly in tiling;
  9) a variety of output formats; and
  10) strength in the class, component and package diagrams of most use to me.

Of course, these factors may differ substantially for you.

Now, Document the Code as You Dive Deeper

Since JavaScript apps often have little or no embedded code documentation, one way to contribute is to document the code as you trace through it. Like creating UML diagrams, a requirement to provide in-line documentation is a high threshold to initial participation. However, for motivated contributors, documentation can be an excellent way to learn a code base. But whoever does the initial documentation, it should be done in a maintainable manner, with the ability to generate API documentation, and be posted prominently for other potential contributors.

There are a few emerging JavaScript code documentation systems, which are tools that parse inline documentation in JavaScript source files to produce (preferably) well formatted HTML with links. The “leading” ones are shown in the table below:

JSDoc
  Pros:
  • Best known and most common
  • Simple parser recognition
  • Operates similarly to Javadoc (but is a subset)
  Cons:
  • Requires separate Perl install
  • Doesn’t support multiple declarations for a single function
  • Tags perhaps too constrained or inadequate (for example, lacks the @alias tag)
  • Little formatting flexibility
  • Not workable with Prototype-style programming
  • Found (generally) to be inadequate for modern JavaScript (according to project leads)

JSDoc-2
  Pros:
  • Written in JavaScript
  • Simple parser recognition
  • Same original developer (Mike Mathews) as JSDoc
  • Backwards compatible with JSDoc (at least for now)
  • More open input process to initial design
  • Reports easily configurable
  Cons:
  • No code interpretation
  • Just released; only in beta
  • No existing report templates (as of yet)

ScriptDoc
  Pros:
  • Supported by the Aptana IDE
  • Separate sdoc file can keep documentation separate from source files
  Cons:
  • Has really not gotten traction in nearly 9 months
  • Appears too close to a single sponsor (Aptana)
  • Separate sdoc needs to be synchronized with source files
  • No known presentation templates or examples
  • Schizophrenic between the Aptana site and the ScriptDoc site

NaturalDocs
  Pros:
  • Targets any language; 4 now have full support, more than 15 (including JS) have basic support
  • Nice layouts
  • Flexibility in how inline documentation is written (e.g., any keyword: name)
  Cons:
  • Lack of rigid parsing rules can lead to ambiguous outputs
  • Little traction so far
  • Appears to be a single-person project
  • Perhaps too flexible
The most widely used approach is JSDoc, which is a Perl-based analog to Javadoc (click here for an example report). JSDoc is the most established of the systems and has an interface familiar to most Java developers. The first version is being phased out, with the version 2 release apparently pending. There is a Google Group on this option. JSDocumenter is a graphical user interface built for Javadoc-like documenting using JSDoc. Within this family, the JSDoc-2 option appears to be the choice, since the initial developers themselves have moved in that direction and recognize earlier problems with JSDoc that they are planning to overcome.
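For readers who have not used these systems, the fragment below shows the general Javadoc-like form of inline comment the JSDoc family parses. The function and its tags are illustrative only, and the exact set of supported tags differs between JSDoc and JSDoc-2, so check each tool's own documentation:

    /**
     * Creates a simple annotation object for a Web resource.
     * @param {String} subject    URI of the resource being annotated
     * @param {String} predicate  URI of the annotation property
     * @param {String} comment    free-text body of the annotation
     * @return {Object} the annotation just created
     */
    function annotate(subject, predicate, comment) {
      return { subject: subject, predicate: predicate, comment: comment };
    }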

ScriptDoc is an attempt to standardize documentation formats, and is being pushed and backed by the Aptana open source JS IDE. The major differentiator for the ScriptDoc approach is the ability to link IDs in the code base to external .sdoc files that could perhaps have more extensive commenting than what might be desirable in-line. (The other side of that coin is the synchronization and maintenance of parallel files.) To date, ScriptDoc is also tightly linked with Aptana, with no known other platform support. Also, there is a disappointing level of activity on the ScriptDoc standards site.

NaturalDocs is an open-source, extensible, multi-language documentation generator. You document your code in a natural syntax that reads like plain English. Natural Docs then scans your code and builds high-quality HTML documentation from it. Multiple languages are presently supported in either full or basic mode, including JavaScript. The flexible format for this system is both a strength and weakness.

There are a few other JS documentation systems. Qooxdoo, an open-source, multi-purpose Ajax system, provides a Python script utility that parses JavaScript and also understands Javadoc- and doxygen-like meta information to generate a set of HTML files and a class tree; see further this explanation. JSLinker for Dojo, mentioned above, also includes documentation aspects. For other information, Wikipedia has a fairly useful general comparison of documentation generators, most of which are not related to JavaScript.

The idea of keeping documentation text separate from the code has initial appeal for JavaScript, since it is often necessary to transfer the scripts in minimal form. But this very separation creates synchronization issues and many modern code editors can also support programmatic exclusion of the comments for transferred forms.

On this and other bases, I think JSDoc-2 is the best choice of the options above. The general Javadoc form and the expansion of necessary tags appear to have overcome earlier JSDoc limits, and the easy parser recognition for the comments appears sufficiently flexible. Plus, output formats can be tailored to be as fancy as desired.

Of course, once a documentation generator is selected, it is then necessary to establish in-line comment standards. These should be a standard product documentation requirement, ideally part of the project’s coding standards. Each project should therefore also publish its own coding and documentation standards, which is especially important when outside collaborators are involved. It is beyond the scope of this survey to suggest standards, but one place to start is with the JavaScript coding standards from Doug Crockford (which, however, include only minimal inline documentation recommendations).

Code Validation and Debugging

Though this survey regards efforts prior to coding collaboration, once coding begins there are some other essential tools any JavaScript developer should have. These include:

  • JSLint — a JavaScript syntax checker and validator
  • Firebug — a great inspection and debugging service for use within the Firefox browser, and
  • Venkman — the code name for another Firefox-based debugger, this one from Mozilla.

Since these tools are beyond the scope of this present review, more detailed discussion of them will be left to another day.

Posted by Mike Bergman on February 26, 2007 at 10:17 pm in Open Source, Software Development | Comments (2)
The URI link reference to this post is: https://www.mkbergman.com/341/overcoming-limitations-to-javascript-collaboration/
The URI to trackback this post is: https://www.mkbergman.com/341/overcoming-limitations-to-javascript-collaboration/trackback/