Posted:January 3, 2007

Google’s Custom Search Engine (CSE): Impressive Start, but Some Quibbles Remain

Google Co-op Custom Search Engines (CSEs) Moving Forward at Internet Speed

Since its release a mere two months ago in late October, Google’s custom search engine (CSE) service, built on its Co-op platform, has gone through some impressive refinements and expansions. Clearly, the development team behind this effort is dedicated and capable.

I recently announced the release of my own CSE — SweetSearch — that is a comprehensive and authoritative search engine for all topics related to the semantic Web and Web 2.0. Like Ethan Zuckerman who published his experience in creating a CSE for Ghana in late October, I too have had some issues. Ethan’s first post was entitled, “What Google Coop Search Doesn't Do Well,” posted on October 27. Yet, by November 6, the Google Co-op team had responded sufficiently that Ethan was able to post a thankful update, “Google Fixes My Custom Search Problems.” I’m hoping some of my own issues get a similarly quick response.

Fast, Impressive Progress

It is impressive to note the progress and removal of some early issues in the last two months. For example, early limits of 1,000 URLs per CSE have been upped to 5,000 URLs, with wildcard pattern matches improving this limit still further. Initial limits to two languages have now been expanded to most common left-to-right languages (Arabic and Hebrew are still excluded). Many bugs have been fixed. The CSE blog has been a welcome addition, and file upload capabilities are quite stable (though not all eventual features are yet supported). The Google Co-op team actively solicits support and improvement comments (http://www.google.com/support/coop/) and a useful blog has been posted by the development team (http://googlecustomsearch.blogspot.com/).

In just a few short weeks, at least 2,100 new CSEs have been created (found by issuing the advanced search query, ‘site:http://google.com/coop/cse?cx=‘ to Google itself, with cx representing the unique ID key for each CSE). This number is likely low since newly created or unreleased CSEs do not appear in the results. This growth clearly shows the pent up demand for vertical search engines and the desire for users to improve authoritativeness and quality. Over time, Google will certainly reap user-driven benefits from these CSEs in its own general search services.

My Pet Issues

So, in the spirit of continued improvement, I offer below my own observations and pet peeves with how the Google CSE service presently works. I know these points will not fall on deaf ears and perhaps other CSE authors may see some issues of their own importance in this listing.

There is a bug in handling “dirty” URLs for results pages. Many standard CRMs or blog software, such as WordPress or Joomla!, provide options for both “pretty” URLs (SEO ones, that contain title names in the URL string, such as http://www.mydomain.com/2007/jan/most-recent-blog-post.html) v. “dirty” ones that label URLs with IDs or sequences with question marks (such as http://www.mydomain.com/?p=123). Often historical “dirty” URLs are difficult to easily convert to “pretty” ones. The Google CSE code unfortunately truncates the URL at the question mark when results are desired to be embedded in a local site using a “dirty” URL, which then causes the Javascript for results presentations to fail (see also this Joomla! link). As Ahmed, one of the Google CSE users points out, there is a relatively easy workaround for this bug, but you would pull your hair out if you did not know the trick.
Results page font-size control is lacking. Though it is claimed that control is provided for this, it is apparently not possible to control results font sizes without resorting to the Google Ajax search API (see more below).
There is a bug in applying filetype “refinements” to results, such as the advanced Google search operator filetype:pdf. Google Co-op staff acknowledge this as a bug and hopefully this will be corrected soon.
Styling is limited to colors and borders and ad placement locations short of resorting to the Google Ajax search API, and the API itself still lacks documentation or tutorials on how to style results or interactions with the native Google CSS. Admittedly, this is likely a difficult issue for Google since too much control given to the user can undercut its own branding and image concerns. However, Google’s Terms of Service seem to be fairly comprehensive in such protections and it would be helpful to see this documentation soon. There is often reference to the Ajax search API by Google Co-op team members, but unfortunately too little useful online documentation to make this approach workable for mere mortals.
It is vaguely stated that items called “attributes” can be included in CSE results and refinements (such as ‘A=Date’), but the direction is unclear and other forum comments seem to suggest this feature is not yet active. My own attempts show no issues in uploading CSE specifications that include attributes, but they are not yet retained in the actual specification currently used by Google. (Related to this topic is the fact that older forum postings may no longer be accurate as other improvements and bug fixes have been released.)
Yes, there still remains a 5,000 “annotation” limit per CSE, which is the subject of complaint by some CSE authors. I personally have less concern with this limit now that the URL pattern matching has been added. Also, there is considerable confusion about what this “annotation” limit really means. In my own investigations, an “annotation” in fact is equivalent to a single harvest point URL (with or without wildcards) and up to four labels or facets (with or without weighting or comments) for each.
While outside parties are attempting to provide general directory services, Google itself has a relatively poor way of announcing or listing new CSEs. The closest it comes is a posting page (http://groups-beta.google.com/group/google-co-op/web/your-custom-search) or the featured CSE engines (http://google.com/coop/cse/examples/GooglePicks), which are an impressive lot and filled with useful examples. Though there are a number of third parties trying to provide comprehensive directory listings, most have limited coverage:

The best way to get a listing of current CSEs still appears to be using the Google site: query above matched with a topic description, though that approach is not browsable and does not link to CSEs hosted on external sites.

I would like to see expanded support for additional input and export formats, including potentially OPML, microformats or Gdata itself. The current TSV and XML approaches are nice.

Yet, despite these quibbles, this CSE service is pointing the way to entirely new technology and business models. It is interesting that the Amazon S3 service and Yahoo!’s Developer Network are experimenting with similar Internet and Web service approaches. Let the fun begin!

Posted:January 2, 2007

Authoritative SweetSearch Semantic Web and Web 2.0 Custom Search Engine

How sweet it is!

I am pleased to announce the release of the most comprehensive and authoritative search engine yet available on all topics related to the semantic Web and Web 2.0. SweetSearch, as it is named, is a custom search engine (CSE) built using Google’s CSE service. SweetSearch can be found via this permanent link on the AI3 blog site. I welcome suggested site additions or improvements to the service by commenting below.

SweetSearch Statistics

SweetSearch is comprised of 3,736 unique host sites containing 4,038 expanded search locations (some hosts have multiple searchable components). Besides the most authoritative sites available, these sites include comprehensive access to 227 companies involved in the semantic Web or Web 2.0, more than 3100 Web 2.0 sites, 53 blogs relating specifically to these topics, 101 non-profit organizations, 219 specific semantic Web and related tools, 21 wikis and other goodies. Search results are also faceted into nine different categories, including papers, references, events, organizations, companies, tools, etc.

Other Semantic Web CSE Sites

SweetSearch is by no means the first Google CSE devoted to the semantic Web and related topics — but it may be the best and largest. Other related custom search engines (with the number of URLs they search) are Web20 (757 sites), the Web 2.0 Search Co-op (310), the University of Maryland’s Baltimore Campus (UMBC) Ebiquity service (65), Elias Torres’ site (160), Andreas Blumauer’s site (20), Web 20 Ireland (67), NextGen WWW (21), and Sr-Ultimate (4), among others that will surely emerge.

General Resources

Besides the general Google CSE site, the development team’s blog and a the user forum for a group of practitioners are also good places to learn more about CSEs.

Vik Singh’s AI Research site is also very helpful in related machine learning areas, plus he has written a fantastic tutorial on how to craft a powerful technology portal using the Google CSE service.

Contributors and Comments Welcomed!

I welcome any contributors who desire to add to SweetSearch. See this Google link for general information about this site; please contact me directly at the email address in the masthead if you desire to contribute. For suggested additional sites or other comments or refinements, please comment below. I will monitor these suggestions and make improvements on a frequent basis.

Posted:December 20, 2006

It’s Moving Day!

In my earlier Pro Blogging Guide (which is beginning to get long in the tooth, though it remains very popular) I documented the process of setting up my own instance of WordPress, the very popular open source blogging software. My memories of that effort were a little painful because of the need to set up a local server and sandbox for testing that site before it went live. In fact, one whole chapter in the Guide was devoted to that topic alone.

Well, I’ve just completed another hurdle, and that is moving from a company-hosted Web site and server to one that I own and manage on my own. I’m sure I could have made this easier on myself, but, actually, I wanted to learn the ropes and become self-reliant. I’ll be posting more of the specifics of this transfer, but here are the major areas that I needed to understand and embrace:

For a multitude of reasons, I decided I wanted complete control of my environment, but at acceptable cost. That led me to decide on going with a virtual private server (VPS, also sometimes called a virtual dedicated server) wherein the user/owner has total “root” control. This was not too dissimilar from the virtual private network experience I had with my previous company. What a VPS means is that you have a footprint and total software installation and configuration control as if you owned the server, all accomplished remotely. Thus, I needed to research providers, services, responsiveness, etc. Per my SOP, I created spreadsheets and weighting matrices to help decide my choices. (A great resource for such discussion is webhostingtalk.com.) My goal was to spend no more than $30 per month; mission accomplished!
Then, I also decided I did not want to be hooked into proprietary Microsoft software. I’m looking to low-cost scalability in my new venture, and clearly no one with a clue is using MS for Internet-based ventures. Thus, I needed to decide a flavor of Linux, needed to figure out a whole new raft of software and utilities geared to remote administration and standard computer management (editors, file managers, transfer utilities, etc., etc.), and then, most importantly, I needed to start learning CLI ( command-line interfaces). In so many ways, it felt like returning to the womb. I’ve been so GUI-ized and window-fied for more than 10 years that I felt like a stroke victim learning to speak and walk again! But wow, I like this stuff and it is cool! In fact, it feels “purer” and “cleaner” (including such excesses as using emacs or vi(m) again!)
These commitments made, then choices were necessary to start making the decisions actionable
We’re now into the real nitty-gritty of open source, where LAMP comes into play. First, you need the OS (Linux, CentOS 4 in my case). Then, you need the Web server (Apache). Then, because I’m using WordPress, PHP needs to be installed. This has the side benefit of also allowing phpMyAdmin, a useful MySQL management framework. Oh, of course, that also means that a database to support all of this is needed, which again for WP is MySQL. That requires utilities for database transfer, backup and restoration. Unix provides an entirely different way to understand and manage permissions and privileges, also meaning more learning. Then, all of this infrastructure environment needs to be tested and then verified as working with a clean WP install. (One useful guide I found, based on Windows but still applicable, is Jeff Lundborg’s Apache Guide.)
However, big problem, my WP database is apparently larger than most, about 150 MB in size! (I like long posts and attachments.) The standard mechanisms for most blogs fail at these scales, including the phyMyAdmin approaches and a WP plug-in called skippy. I was going to have to deal directly with MySQL, so I began learning its CLI syntax. Again, on-line guides for such backup and restore purposes can do the trick
Then, and only then, I began migrating my blog set-up specifics (pretty nicely isolated within WP) and the database. Actually, here is where I had my first pleasant surprise: the CLI utilities for MySQL (which are really the same bash-like stuff that makes the environment so productive) work beautifully!
These efforts then needed to be updated with respect to other dependencies within my blog referring to links, or other internal but no longer applicable references, to make the entire new environment now appropriately self-contained and integral. This actually requires cruising through the entire blog site, with specific attention to pages over posts, to ascertain integrity
Now, with all of that getting choked down, I also decided to update my static pages and some of the layout and also to upgrade to WP version 2.05 (from 2.04, very straightforward), and then
Finally, the transfer of the domain and name servers to create the new hosting presence (not to mention email accounts, a challenge of a different order not further mentioned here). The domain transfer may take a few days, complicated if you also need to transfer domain registrars (something which I also needed to do).

This latter point actually is a challenge. Internal WP links from your blog require your hosting URLs to be integral. However, if you understand this, and are able to use IP addresses (216.69.xxx.xxx in my blog’s case) during development, you can actually use the delayed transfer time between registrars to your benefit as you work out details. Again, it’s a matter of perspective. A delay in registration transfer actually gives you a free sandbox for getting the bugs worked out!

So, is this stuff painful? Yes, absolutely, if this is a one-off deal. In my case, however, the real pay-off will come (is coming) from using a transfer such as this as a real-world exercise in learning and exposure: Linux, own-hosting, tools, scripting. Seen in that light, this effort has been tremendously humbling and rewarding.

And, so, you are now seeing the fruits of this transfer! I will expand on specific steps in this process in future postings.

Posted:December 17, 2006

Trying ‘Web Scientist’ on for Size

After six fantastic years with BrightPlanet, I am no longer an employee (CTO) of the company nor chairman of the Board. I felt the company should go in one direction; the Board felt otherwise . . . . Such events, while not prosaic, are also not uncommon. I wish the company all possible success. It is now time to move on. . . .

Even though comparatively small, BrightPlanet is being challenged, as are all software companies today, in managing the transition to (still another) brave new world. Think about the major generational shifts of the past 15 years: personal computers, local networking, Internet browsers and thin clients, Internet ubiquity, open source, now Web 2.0. Certainly other items could be listed in that progression, but the general point remains that the pace of software and computing technology development has been furious and relentless.

These challenges are huge, and have resulted in technology shifts literally measured in months, not years. It is not for small reason that today’s buzzwords include agile, productive and efficient. My goodness, as few as eight years ago, choosing to commit to Java for production-scale enterprise development was considered by some risky and radical; today, some may argue that Java is becoming passé and dynamic languages such as Ruby and DSLs such as Rails hold the keys to the future.

Young Turk to Old Fart

I laugh now about that (truly) instantaneous moment when one morphs from being a Young Turk to an an Old Fart. (I myself passed that breakpoint long ago.) I remember with pride having the Y.T. moniker when in my teens and twenties. We see it still today. One of the things, however, that has blown my mind in the past 5-6 years is the age of the next successful generation. Look at the ages of Brin, Page, Ross, Cannon-Brookes, Farquhar, Hansson, and many others (please forgive me if your name is not on the list), who are (or will be) hauling down some serious dough at very young ages. Now, as an older guy (‘Old Fart’), I have to ask myself whether I can play in this new game. (I guess the best that I can say in that regard is that the world is not populated entirely with my daughter’s friends, but all of us can learn from this newest generation more efficient and agile ways of doing the old tasks.)

The Horizon Ahead

The horizon ahead is one of those places where I truly think I DO have a clue. When one sees multiple major shifts of stuff over many years, it is not too difficult (though some may miss it) to see some major trends. I don’t have the time (nor inclination nor luck nor skill) to write another Peter’s In Search of Excellence, the biggest business book of all time, even assuming I could write that simply or hit the lottery. But, a close reading of trends suggests the pending convergence of open source, semantic tagging and mediation, interoperability, agile development, social collaboration, and mechanisms to assign authoritativeness to information. This convergence will be democratic with a small D, disruptive and rapid. Fasten your seat belts . . . .

Trying ‘Web Scientist’ on for Size

As for myself, I am now on my own and not running a company for the first time in 12 years. I am striking out more directly into the semantic Web — directions that have clearly been my passion on this blog over the past few months. Though I have taken up the obligatory consultant shingle (after all, we all must eat) for the time being, I have also taken on the moniker of ‘Web Scientist’ on my new email signature.

As the person who first explicated and coined the term “deep Web”, the person who wrote the Web’s most popular search tutorial in its early years, and the person who helped bring into being many of the automation techniques and bots for accessing dynamic Web content, I feel pretty comfortable with that label. I also especially like that TBL and others have put a marker out there to give the title some legitimacy. (See Creating a Science of the Web by Tim Berners-Lee, Wendy Hall, James Hendler, Nigel Shadbolt and Daniel J. Weitzner in Science 313(11), 11 August 2006.) (See also this recent NYT article.)

I’ll now see how it feels to have the Web scientist label for a while.

For those of you who have been faithful readers since I put this blog out now more than a year ago, you know that my abiding passion has been effective information use and management in relation to the Internet. I look forward to further discussions with you on these very same topics in the months ahead.

Main Links

Search

Author: Mike Bergman