Posted: March 25, 2007

Like millions of other Americans, I love college basketball and believe the NCAA tournament to be the best of the annual sporting events. This year has been no exception, with some of the best play and closest games in my decades of memories of the event (even though our family alma mater, Duke, lost in the first round). As of the time of this writing, two of the Final Four teams have been decided (Ohio State and UCLA), with the last two slots to be determined later today. Goooood stuff!

And, like millions of Americans, I also enter various pools vying to pick the round and final winners over the tournament’s six rounds of play. This collective obsession has caused many to comment on the weeks of lost productivity within the American workforce at this time of year. Bring it on!

I remember fondly some of those prior years when I won some of those office pools and the accompanying few hundred bucks. But, as with so much else, the Internet has also changed how mad this March Madness has become. There are more than 2.9 million entries in ESPN’s $10,000 tournament bracket challenge, with other large competitions on Facebook (~1.6 million), CBS Sportsline (~0.5 million), Sports Illustrated, and literally dozens of other large online services. (Note these are games of chance and are free to enter; they are different from wagering money in informal office betting pools.)

According to R.J. Bell at Pregame.com, some 30 million Americans in total are participating in pools of various kinds, wagering about $2.5 billion, both numbers I can easily believe. Your chances of picking a perfect bracket with all correct winners? How about 9,223,372,036,854,775,808 to 1. Or, as Bell puts it, “if every man, woman, and child on the planet randomly filled out 10 million brackets each, the odds would be LESS than 1% that even one would have a perfect bracket.”
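For the curious, the arithmetic checks out. Here is a quick back-of-the-envelope sketch in Python, under my own assumptions of 63 independent coin-flip games per bracket and a world population of roughly 6.5 billion (Bell’s exact inputs aren’t given):

    # Back-of-the-envelope check on the bracket odds quoted above.
    # My assumptions: 63 games per bracket, each picked at random with a
    # 50% chance of being right, and ~6.5 billion people on the planet.
    import math

    total_outcomes = 2 ** 63                      # 63 coin-flip games
    print(f"{total_outcomes:,} to 1")             # 9,223,372,036,854,775,808 to 1

    brackets = 6_500_000_000 * 10_000_000         # everyone fills out 10 million brackets
    expected_perfect = brackets / total_outcomes  # expected number of perfect brackets (~0.007)

    # Probability that at least one bracket is perfect (Poisson approximation,
    # which sidesteps floating-point underflow in (1 - 1/2**63) ** brackets).
    p_at_least_one = -math.expm1(-expected_perfect)
    print(f"chance of at least one perfect bracket: {p_at_least_one:.2%}")  # ~0.70%, i.e., less than 1%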

Well, I’ve actually done pretty well in the ESPN pool this year, ranking in what at first sounds like an impressive 99.1 percentile as of this writing. But, while scoring in such high percentiles on standardized tests is pretty cool, it is humbling in these tournament pools. There are still more than 25,000 entries better than mine, and I have a 0% chance of winning anything! (I also assume that my percentile ranking will drop further as the final results come in. BTW, though there are tens of thousands better, my bracket is shown below.)

Of course, one shouldn’t enter these bracket games with any expectation other than to increase the enjoyment of watching the actual games and to have some fun with statistics. For better odds, play your office pool. Or, better still, try starting up a company that eventually gets venture backing and becomes profitable. At least there, your chances of winning improve to, say, 1,000 to 1.

Posted by AI3's author, Mike Bergman Posted on March 25, 2007 at 12:30 pm in Site-related | Comments (1)
The URI link reference to this post is: http://www.mkbergman.com/352/when-the-99th-percentile-is-not-good-enough/
The URI to trackback this post is: http://www.mkbergman.com/352/when-the-99th-percentile-is-not-good-enough/trackback/
Posted: February 7, 2007

How to Process Your Own Large Libraries into Thumbnails

When I decided to upgrade Sweet Tools, my listing of semantic Web and related tools, I wanted to add some images to make the presentation more attractive. It was also becoming the case that many metadata aggregation service providers were adopting image representations for data (see this D-Lib article). Since the focus of my listing is software, I could either install all of the programs and take screenshots (not doable given the numbers involved) or adopt what many others have used as a sort of visual index for content: thumbnails or, as they are specifically called when applied to Web pages, thumbshots.

Quick Review of Alternatives

Unless you get all of your Web content via feeds or have been living in a cave, you may have recently contracted a form of popup vertigo. Since its introduction just a few months back, the Snap Preview Anywhere thumbnail popup has become the eggplant that eats Chicago, with more than a half million sites now reported to be using the service. Since I don’t want this service for my own blog (see below), and therefore did not want to go through the effort of signing up for SPA or of restricting its use to just this posting (even though the signup appears clean and straightforward), I simply reproduce below what one of these Snap link-over popups looks like:

The sheer ubiquity of these popup thumbnails is creating its own backlash (check out this sample rant from UNEASYsilence and its comments), and early promoters, such as TechCrunch, have now gone to a clickable icon [ ] for a preview, rather than automatically popping up the image on link hover.

Not only had the novelty of these popups worn off for me, but my actual desired use for Sweet Tools was to present a gallery of images for multiple results simultaneously. So, besides its other issues, the Snap service was not suitable for my purpose.

I had earlier used a Firefox add-on called BetterSearch that places thumbnails on results pages when doing searches with Google (including international versions), Amazon, MSN Search, Yahoo!, A9, Answers.com, AllTheWeb, Dogpile.com, del.icio.us and Simpy.com. But, like the Snap service, I personally found this one to be distracting. I also didn’t like the fact that my use was potentially being logged and that promo messages were inserted on each screen. (There is another Firefox browser extension called GooglePreview that appears less intrusive, but I have not tried it.) As it turns out, both of these services themselves piggyback on a free (for some uses) thumbnail acquisition and serving service from Thumbshots.org.

Since my interest in thumbnails was limited and focused to a bounded roster of sites (not the dynamic results from a search query), I decided to cut out the middleman and try the Thumbshots.org source directly myself. However, my candidate sites are mostly obscure academic ones or semantic Web ones not generally in the top rankings, meaning that most of the Sweet Tools Web sites unfortunately had no thumbnails on Thumbshots.org.

Of course, throughout these investigations, I had always had the option of taking physical screen captures myself and converting them manually to thumbnails. This is a very straightforward process with standard graphics packages; I had done so often for other purposes using my standard Paint Shop Pro software. But with the number of the Sweet Tools growing into the hundreds, such a manual approach clearly wouldn’t scale.

Knowing there are literally hundreds of cheap or free graphics and image manipulation programs out there, I thus set out to see if I could find a utility that would provide most, if not all, of the automation required.

My Sweet Tools records don’t change frequently, so I could accept a batch mode approach. I wanted to also size the thumbnails to whatever displayed best in my Exhibit presentation. As well, if I was going to adopt a new utility, I decided I might as well seek other screen capture and display flexibilities for other purposes. I also importantly needed the individual file names created to be unique and readable (not just opaque IDs). Finally, like any tool I ultimately adopt, I wanted quality output and professional design.

Off and on I reviewed options and packages, mostly getting disgusted with the low quality of the dross out there, and appalled at how difficult it is to find such candidates with standard search services. (There truly are whole categories of content, such as products of all types, reviews, real data, market info and statistics, that are becoming nearly impossible to find effectively on the Web with current search engines; but those are topics for another day.)

Nonetheless, after much looking and trial runs of perhaps a dozen packages, I finally stumbled across a real gem, WebShot. (Reasons this product was difficult to find included its relatively recent vintage, apparent absence of any promotion, and the mismatch between the product name and Web site name.)

The WebShot Utility

WebShot is a program that allows you to take screenshots and thumbnails of web pages or whole websites. I find its GUI easy to use, but it also comes with a command line interface for advanced users or for high-volume services. WebShot can produce images in the JPG, GIF, PNG, or BMP formats. It was developed in C by Nathan Moinvaziri.

The program is free for use on Windows XP, though PayPal donations are encouraged. Nominal charges apply to other Windows versions and to use of the command line. Linux is not supported, and Internet Explorer must be installed.

The graphical UI on Windows XP has a standard tabbed design. Single thumbnails can be captured, or batches can be driven from a text file. Output files can be flexibly sized and written in any of the formats above. The screen capture itself can be based on required or min/max browser display dimensions. There are a variety of file-naming parameters, and system settings allow WebShot to work in Web-friendly ways. Here’s an example of the Image tab for the GUI:

The command-line version accepts about 20 different parameters.

Depending on settings, you can get a large variety of outputs. The long banner image to the left, for example, is a “complete” Web page dump of my Web site at the time of this posting, with about 8 consecutive posts shown (160 x ~2300). The system automatically stitches together the multiple long page screenshots, with the resolution in this case being set by the input width parameter of 160 pixels.

Another option is this sample “cropped” one (440 x 257) where I’m actually cutting the standard screen display to about 50% of its normal vertical (height) dimension:

And, then, the next example shows what I have chosen as my “standard” thumbnail size (160 x 120) (I added the image borders, not the program):

In batch mode, I set the destination parameter such that I got both a logical domain portion in the file name (%d) and a hashed portion (%m), since there were a few occasions of multiple, but different, Web pages from the same host domain.

As noted, download re-tries, delays and timeouts are all settable so you can be a good Web citizen while still getting acceptable results. With more-or-less standard settings, I was able to complete the 400 thumbnail downloads for the Sweet Tools dataset (without error, I should mention) in just a few minutes.

How I Do Bulk Thumbnails for Sweet Tools

Your use will obviously vary, but I kept notes for myself so that I could easily repeat or update this batch process (in fact, I have already done so a couple of times with the incremental updates to Sweet Tools). The general work flow is:

  1. Create a text file with host Web site URLs in spreadsheet order
  2. Run WebShot with these general settings:
    • destination switches of %d%m (core domain, plus hash)
    • image at 160w x 120h (my standard; could be anything as long as the proper aspect ratio is maintained)
    • use of Multiple tab, with a new destination directory for each incremental update
    • browser setting at 1024 x 768 required (most common aspect today); min of 800 x 600; highest quality image
  3. At completion, go to a command window and write out the image file names (images complete in the same order as submitted). (In Windows, this is the dir /o:d > listing.txt command; see the sketch after this list for a scripted alternative.) Then, copy the file names in the resulting text file back into the spreadsheet for the record <-> image correspondence
  4. Upload to the appropriate WordPress image directory.
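As referenced in step 3, the file-listing step can also be scripted. Below is a minimal Python sketch that mirrors the dir /o:d approach by sorting the generated thumbnails by modification time (the order in which WebShot wrote them) and dumping the bare file names to listing.txt; the directory path is a placeholder of my own, not anything WebShot requires:

    # A scripted stand-in for the "dir /o:d > listing.txt" step: list the
    # generated thumbnails oldest-first and write the bare file names to
    # listing.txt for pasting back into the spreadsheet.
    import os

    thumb_dir = r"C:\thumbshots\sweet-tools"   # hypothetical WebShot output directory

    names = sorted(
        (f for f in os.listdir(thumb_dir) if f.lower().endswith((".jpg", ".gif", ".png", ".bmp"))),
        key=lambda f: os.path.getmtime(os.path.join(thumb_dir, f)),
    )

    with open(os.path.join(thumb_dir, "listing.txt"), "w") as out:
        out.write("\n".join(names) + "\n")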

Some Other Tips

Like many such tools, the WebShot package comes with insufficient documentation. But, with some experimentation, it is in fact quite easy to accomplish a number of management or display options. Some of the ones I discovered are:

  • If harvesting multiple individual Web pages from the same domain, use the domain (%d) and hash (%m) options noted above
  • For complete capture of long Web pages (such as the image of my own Web site to the left), first decide on a desired resolution, set it via ‘width’ on the Image tab, leave height blank, and leave the browser settings open
  • For partial screen captures without distortion, set the image dimensions to the desired final size with height at the desired partial percentage value, then adjust the browser dimensions to equal the image aspect.
Jewels & Doubloons: An AI3 Jewels & Doubloon Winner
Posted: January 31, 2007

Adopting Standardized Identification and Metadata

Since it’s becoming all the rage, I decided to go legit and get myself (and my blog) a standard ID and enable the site for easy metadata exchange.

OpenID

OpenID is an open and free framework for users (and their blogs or Web sites) to establish a digital identity. OpenID starts with the concept that anyone can identify themselves on the Internet the same way websites do, via a URI, and that authentication is essential to prove ownership of that ID.

Sam Ruby provides a clear article on how to set up your own OpenID. One of the listing services is MyOpenID, from which you can get a free ID. With Sam’s instructions, you can also link a Web site or blog to this ID. When done, you can check whether your site is valid; you can do a quick check using my own ID (make sure to enter the full URI!). However, to get the valid ID message as shown below, you will need to have your standard (and secure) OpenID password handy.
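To make the delegation mechanics a bit more concrete, here is a minimal Python sketch of the discovery step an OpenID consumer performs: it fetches a page and looks for the openid.server and openid.delegate LINK tags that Sam’s instructions have you add to your site’s head. This is only an illustration, and whether any given site (mine included) still serves those tags is not guaranteed:

    # A minimal sketch of OpenID 1.x discovery: fetch a page and look for
    # the openid.server and openid.delegate <link> tags in its <head>.
    # The URL is only an example; substitute your own site.
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class OpenIDLinkFinder(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = {}

        def handle_starttag(self, tag, attrs):
            if tag == "link":
                a = dict(attrs)
                if a.get("rel") in ("openid.server", "openid.delegate"):
                    self.links[a["rel"]] = a.get("href")

    page = urlopen("http://www.mkbergman.com/").read().decode("utf-8", "replace")
    finder = OpenIDLinkFinder()
    finder.feed(page)
    print(finder.links)  # e.g. {'openid.server': '...', 'openid.delegate': '...'}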

unAPI

unAPI addresses a different problem — the need for a uniform, simple method for copying rich digital objects out of any Web application. unAPI is a tiny HTTP API for the few basic operations necessary to copy discrete, identified content from any kind of web application. An interesting unAPI background and rationale paper from the Ariadne online library ezine is “Introducing unAPI”.

To invoke the service, I installed the unAPI Server, a WordPress plug-in from Michael J. Giarlo. The server provides records for each WordPress post and page in the following formats: OAI-Dublin Core (oai_dc), Metadata Object Description Schema (MODS), SRW-Dublin Core (srw_dc), MARCXML, and RSS. The specification makes use of LINK tag auto-discovery and an unendorsed microformat for machine-readable metadata records.

Validation of the call is provided by the unAPI organization, with the return for this blog entry shown here (error free, and only partially reproduced):

The actual unAPI results do not appear anywhere on your page, since the entire intent is to be machine readable. However, to show what can be expected, the links below reproduce the internal metadata listings for each of the five formats:

MARCXML:

http://www.mkbergman.com/wp-content/plugins/unapi/server.php?id=http://www.mkbergman.com/?p=320&format=marcxml

MODS:

http://www.mkbergman.com/wp-content/plugins/unapi/server.php?id=http://www.mkbergman.com/?p=320&format=mods

OAI-DC:

http://www.mkbergman.com/wp-content/plugins/unapi/server.php?id=http://www.mkbergman.com/?p=320&format=oai_dc

RSS:

http://www.mkbergman.com/wp-content/plugins/unapi/server.php?id=http://www.mkbergman.com/?p=320&format=rss

SRW-DC:

http://www.mkbergman.com/wp-content/plugins/unapi/server.php?id=http://www.mkbergman.com/?p=320&format=srw_dc
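As a quick illustration of pulling one of these records programmatically, here is a minimal Python sketch of the unAPI two-step: ask the server which formats it offers for an identifier, then request the record in one of them. The endpoint and identifier are taken from the links above; whether they remain live indefinitely is, of course, not guaranteed:

    # The unAPI two-step against the server linked above:
    # (1) ask which formats are available for an identifier, then
    # (2) request the record in one of those formats (oai_dc here).
    from urllib.parse import urlencode
    from urllib.request import urlopen

    server = "http://www.mkbergman.com/wp-content/plugins/unapi/server.php"
    identifier = "http://www.mkbergman.com/?p=320"

    formats_xml = urlopen(f"{server}?{urlencode({'id': identifier})}").read()
    print(formats_xml.decode("utf-8", "replace"))   # <formats> list for this post

    record_xml = urlopen(f"{server}?{urlencode({'id': identifier, 'format': 'oai_dc'})}").read()
    print(record_xml.decode("utf-8", "replace"))    # the OAI-DC record itself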

unAPI adoption is just now beginning, and some apps like the Firefox Dublin Core Viewer add-on do not yet recognize the formats. But, because the specification is so simple and is being pushed by the library community, I suspect adoption will become very common. I encourage you to join!

Posted: January 4, 2007

After feeling like I was drowning in too much Campbell’s alphabet vegetable soup, I decided to bite the bullet (how can one bite a bullet while drinking soup?!) and provide an acronym look-up service to this AI3 blog site. So, welcome to the new permanent link to the Acronyms & Glossary shown in my standard main links to the upper left.

This permanent (and occasionally updated) page lists about 350 acronyms related to computer science, IT, the semantic Web, and information retrieval and processing. Most entries point to Wikipedia, with those that do not instead referencing their definitive standards site.

I welcome any suggestions for new acronyms to be added to this master list. Yabba dabba doo . . . . (oops, I meant YAML YAML ADO).

Posted by AI3's author, Mike Bergman Posted on January 4, 2007 at 10:48 am in Site-related | Comments (0)
The URI link reference to this post is: http://www.mkbergman.com/313/new-acronym-listing-from-ado-to-yaml/
The URI to trackback this post is: http://www.mkbergman.com/313/new-acronym-listing-from-ado-to-yaml/trackback/
Posted: January 3, 2007

Google Co-op Custom Search Engines (CSEs) Moving Forward at Internet Speed

Since its release a mere two months ago in late October, Google’s custom search engine (CSE) service, built on its Co-op platform, has gone through some impressive refinements and expansions. Clearly, the development team behind this effort is dedicated and capable.

I recently announced the release of my own CSE — SweetSearch — that is a comprehensive and authoritative search engine for all topics related to the semantic Web and Web 2.0. Like Ethan Zuckerman who published his experience in creating a CSE for Ghana in late October, I too have had some issues. Ethan’s first post was entitled, “What Google Coop Search Doesn't Do Well,” posted on October 27. Yet, by November 6, the Google Co-op team had responded sufficiently that Ethan was able to post a thankful update, “Google Fixes My Custom Search Problems.” I’m hoping some of my own issues get a similarly quick response.

Fast, Impressive Progress

It is impressive to note the progress and removal of some early issues in the last two months. For example, early limits of 1,000 URLs per CSE have been upped to 5,000 URLs, with wildcard pattern matches improving this limit still further. Initial limits to two languages have now been expanded to most common left-to-right languages (Arabic and Hebrew are still excluded). Many bugs have been fixed. The CSE blog has been a welcome addition, and file upload capabilities are quite stable (though not all eventual features are yet supported). The Google Co-op team actively solicits support and improvement comments (http://www.google.com/support/coop/) and a useful blog has been posted by the development team (http://googlecustomsearch.blogspot.com/).

In just a few short weeks, at least 2,100 new CSEs have been created (found by issuing the advanced search query, ‘site:http://google.com/coop/cse?cx=’ to Google itself, with cx representing the unique ID key for each CSE). This number is likely low, since newly created or unreleased CSEs do not appear in the results. This growth clearly shows the pent-up demand for vertical search engines and the desire of users to improve authoritativeness and quality. Over time, Google will certainly reap user-driven benefits from these CSEs in its own general search services.

My Pet Issues

So, in the spirit of continued improvement, I offer below my own observations and pet peeves with how the Google CSE service presently works. I know these points will not fall on deaf ears and perhaps other CSE authors may see some issues of their own importance in this listing.

  1. There is a bug in handling “dirty” URLs for results pages. Many standard CMSs or blog software packages, such as WordPress or Joomla!, provide options for both “pretty” URLs (SEO-friendly ones that contain title names in the URL string, such as http://www.mydomain.com/2007/jan/most-recent-blog-post.html) and “dirty” ones that label URLs with IDs or sequences with question marks (such as http://www.mydomain.com/?p=123). Often historical “dirty” URLs are difficult to convert easily to “pretty” ones. The Google CSE code unfortunately truncates the URL at the question mark when results are to be embedded in a local site using a “dirty” URL, which then causes the JavaScript for results presentation to fail (see also this Joomla! link). As Ahmed, one of the Google CSE users, points out, there is a relatively easy workaround for this bug, but you would pull your hair out if you did not know the trick.
  2. Results page font-size control is lacking. Though it is claimed that control is provided for this, it is apparently not possible to control results font sizes without resorting to the Google Ajax search API (see more below).
  3. There is a bug in applying filetype “refinements” to results, such as the advanced Google search operator filetype:pdf. Google Co-op staff acknowledge this as a bug and hopefully this will be corrected soon.
  4. Styling is limited to colors, borders and ad placement locations, short of resorting to the Google Ajax search API, and the API itself still lacks documentation or tutorials on how to style results or interactions with the native Google CSS. Admittedly, this is likely a difficult issue for Google, since too much control given to the user can undercut its own branding and image concerns. However, Google’s Terms of Service seem to be fairly comprehensive in such protections, and it would be helpful to see this documentation soon. There are frequent references to the Ajax search API by Google Co-op team members, but unfortunately too little useful online documentation to make this approach workable for mere mortals.
  5. It is vaguely stated that items called “attributes” can be included in CSE results and refinements (such as ‘A=Date’), but the direction is unclear and other forum comments seem to suggest this feature is not yet active. My own attempts show no issues in uploading CSE specifications that include attributes, but they are not yet retained in the actual specification currently used by Google. (Related to this topic is the fact that older forum postings may no longer be accurate as other improvements and bug fixes have been released.)
  6. Yes, there still remains a 5,000 “annotation” limit per CSE, which is the subject of complaint by some CSE authors. I personally have less concern with this limit now that URL pattern matching has been added. Also, there is considerable confusion about what this “annotation” limit really means. In my own investigations, an “annotation” is in fact equivalent to a single harvest-point URL (with or without wildcards) plus up to four labels or facets (with or without weighting or comments); a sketch of this format follows the list.
  7. While outside parties are attempting to provide general directory services, Google itself has a relatively poor way of announcing or listing new CSEs. The closest it comes is a posting page (http://groups-beta.google.com/group/google-co-op/web/your-custom-search) or the featured CSE engines (http://google.com/coop/cse/examples/GooglePicks), which are an impressive lot and filled with useful examples. Though there are a number of third parties trying to provide comprehensive directory listings, most have limited coverage.
  8. The best way to get a listing of current CSEs still appears to be using the Google site: query above matched with a topic description, though that approach is not browsable and does not link to CSEs hosted on external sites.

  9. I would like to see expanded support for additional input and export formats, including potentially OPML, microformats or GData itself. The current TSV and XML approaches are nice.
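To make item 6 concrete, here is a rough Python sketch that builds a single annotation in the Co-op annotations XML format as I understand it from the current documentation; treat the element names as my best reading rather than gospel, and the engine ID, labels, and URL as placeholders:

    # A rough sketch of what one "annotation" amounts to as Co-op
    # annotations XML: a harvest-point URL pattern, a label tying it to a
    # particular CSE, optional facet labels, and an optional score and
    # comment. The engine ID, label names, and URL below are made up.
    import xml.etree.ElementTree as ET

    annotations = ET.Element("Annotations")
    ann = ET.SubElement(annotations, "Annotation",
                        about="www.example-semweb-tool.org/*",  # URL pattern; wildcard allowed
                        score="1")                              # optional weighting
    ET.SubElement(ann, "Label", name="_cse_your-engine-id")     # ties the site to your CSE
    ET.SubElement(ann, "Label", name="semantic_web")            # a facet label (up to four)
    ET.SubElement(ann, "Comment").text = "Sweet Tools candidate site"

    print(ET.tostring(annotations, encoding="unicode"))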

Yet, despite these quibbles, this CSE service is pointing the way to entirely new technology and business models. It is interesting that the Amazon S3 service and Yahoo!’s Developer Network are experimenting with similar Internet and Web service approaches. Let the fun begin!

Posted by AI3's author, Mike Bergman Posted on January 3, 2007 at 2:46 pm in Searching, Site-related | Comments (2)
The URI link reference to this post is: http://www.mkbergman.com/311/googles-custom-search-engine-cse-impressive-start-but-some-quibbles-remain/
The URI to trackback this post is: http://www.mkbergman.com/311/googles-custom-search-engine-cse-impressive-start-but-some-quibbles-remain/trackback/