Posted:January 3, 2007

Google Co-op Custom Search Engines (CSEs) Moving Forward at Internet Speed

Since its release a mere two months ago in late October, Google’s custom search engine (CSE) service, built on its Co-op platform, has gone through some impressive refinements and expansions. Clearly, the development team behind this effort is dedicated and capable.

I recently announced the release of my own CSE — SweetSearch — that is a comprehensive and authoritative search engine for all topics related to the semantic Web and Web 2.0. Like Ethan Zuckerman who published his experience in creating a CSE for Ghana in late October, I too have had some issues. Ethan’s first post was entitled, “What Google Coop Search Doesn't Do Well,” posted on October 27. Yet, by November 6, the Google Co-op team had responded sufficiently that Ethan was able to post a thankful update, “Google Fixes My Custom Search Problems.” I’m hoping some of my own issues get a similarly quick response.

Fast, Impressive Progress

It is impressive to note the progress and removal of some early issues in the last two months. For example, early limits of 1,000 URLs per CSE have been upped to 5,000 URLs, with wildcard pattern matches improving this limit still further. Initial limits to two languages have now been expanded to most common left-to-right languages (Arabic and Hebrew are still excluded). Many bugs have been fixed. The CSE blog has been a welcome addition, and file upload capabilities are quite stable (though not all eventual features are yet supported). The Google Co-op team actively solicits support and improvement comments (http://www.google.com/support/coop/) and a useful blog has been posted by the development team (http://googlecustomsearch.blogspot.com/).

In just a few short weeks, at least 2,100 new CSEs have been created (found by issuing the advanced search query, ‘site:http://google.com/coop/cse?cx=‘ to Google itself, with cx representing the unique ID key for each CSE). This number is likely low since newly created or unreleased CSEs do not appear in the results. This growth clearly shows the pent up demand for vertical search engines and the desire for users to improve authoritativeness and quality. Over time, Google will certainly reap user-driven benefits from these CSEs in its own general search services.

My Pet Issues

So, in the spirit of continued improvement, I offer below my own observations and pet peeves with how the Google CSE service presently works. I know these points will not fall on deaf ears and perhaps other CSE authors may see some issues of their own importance in this listing.

  1. There is a bug in handling “dirty” URLs for results pages. Many standard CRMs or blog software, such as WordPress or Joomla!, provide options for both “pretty” URLs (SEO ones, that contain title names in the URL string, such as http://www.mydomain.com/2007/jan/most-recent-blog-post.html) v. “dirty” ones that label URLs with IDs or sequences with question marks (such as http://www.mydomain.com/?p=123). Often historical “dirty” URLs are difficult to easily convert to “pretty” ones. The Google CSE code unfortunately truncates the URL at the question mark when results are desired to be embedded in a local site using a “dirty” URL, which then causes the Javascript for results presentations to fail (see also this Joomla! link). As Ahmed, one of the Google CSE users points out, there is a relatively easy workaround for this bug, but you would pull your hair out if you did not know the trick.
  2. Results page font-size control is lacking. Though it is claimed that control is provided for this, it is apparently not possible to control results font sizes without resorting to the Google Ajax search API (see more below).
  3. There is a bug in applying filetype “refinements” to results, such as the advanced Google search operator filetype:pdf. Google Co-op staff acknowledge this as a bug and hopefully this will be corrected soon.
  4. Styling is limited to colors and borders and ad placement locations short of resorting to the Google Ajax search API, and the API itself still lacks documentation or tutorials on how to style results or interactions with the native Google CSS. Admittedly, this is likely a difficult issue for Google since too much control given to the user can undercut its own branding and image concerns. However, Google’s Terms of Service seem to be fairly comprehensive in such protections and it would be helpful to see this documentation soon. There is often reference to the Ajax search API by Google Co-op team members, but unfortunately too little useful online documentation to make this approach workable for mere mortals.
  5. It is vaguely stated that items called “attributes” can be included in CSE results and refinements (such as ‘A=Date’), but the direction is unclear and other forum comments seem to suggest this feature is not yet active. My own attempts show no issues in uploading CSE specifications that include attributes, but they are not yet retained in the actual specification currently used by Google. (Related to this topic is the fact that older forum postings may no longer be accurate as other improvements and bug fixes have been released.)
  6. Yes, there still remains a 5,000 “annotation” limit per CSE, which is the subject of complaint by some CSE authors. I personally have less concern with this limit now that the URL pattern matching has been added. Also, there is considerable confusion about what this “annotation” limit really means. In my own investigations, an “annotation” in fact is equivalent to a single harvest point URL (with or without wildcards) and up to four labels or facets (with or without weighting or comments) for each.
  7. While outside parties are attempting to provide general directory services, Google itself has a relatively poor way of announcing or listing new CSEs. The closest it comes is a posting page (http://groups-beta.google.com/group/google-co-op/web/your-custom-search) or the featured CSE engines (http://google.com/coop/cse/examples/GooglePicks), which are an impressive lot and filled with useful examples. Though there are a number of third parties trying to provide comprehensive directory listings, most have limited coverage:
  8. The best way to get a listing of current CSEs still appears to be using the Google site: query above matched with a topic description, though that approach is not browsable and does not link to CSEs hosted on external sites.

  9. I would like to see expanded support for additional input and export formats, including potentially OPML, microformats or Gdata itself. The current TSV and XML approaches are nice.

Yet, despite these quibbles, this CSE service is pointing the way to entirely new technology and business models. It is interesting that the Amazon S3 service and Yahoo!’s Developer Network are experimenting with similar Internet and Web service approaches. Let the fun begin!

Posted by AI3's author, Mike Bergman Posted on January 3, 2007 at 2:46 pm in Searching, Site-related | Comments (2)
The URI link reference to this post is: http://www.mkbergman.com/311/googles-custom-search-engine-cse-impressive-start-but-some-quibbles-remain/
The URI to trackback this post is: http://www.mkbergman.com/311/googles-custom-search-engine-cse-impressive-start-but-some-quibbles-remain/trackback/