Posted: January 31, 2007

Adopting Standardized Identification and Metadata

Since it’s becoming all the rage, I decided to go legit and get myself (and my blog) a standard ID and enable the site for easy metadata exchange.


OpenID is an open, free framework that lets users (and their blogs or Web sites) establish a digital identity. It starts from the premise that anyone can identify themselves on the Internet the same way Web sites do, via a URI, with authentication proving ownership of that ID.

Sam Ruby provides a clear article on how to set up your own OpenID. One of the listing services is MyOpenID, from which you can get a free ID. With Sam’s instructions, you can also link a Web site or blog to this ID. When done, you can check whether your site is valid; you can do a quick check using my own ID (make sure to enter the full URI!). However, to get the valid-ID message, you will need to have your standard (and secure) OpenID password handy.
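The delegation setup Sam describes boils down to two LINK tags placed in your site’s HTML head, pointing at your provider. A minimal sketch, assuming MyOpenID as the provider (both URLs are illustrative, not my actual configuration):

```html
<!-- Placed inside <head>; server and delegate URLs are illustrative. -->
<!-- "openid.server" names the provider's authentication endpoint;    -->
<!-- "openid.delegate" names the identity URL you hold there.         -->
<link rel="openid.server" href="https://www.myopenid.com/server" />
<link rel="openid.delegate" href="https://yourname.myopenid.com/" />
```

With these tags in place, any OpenID consumer given your site’s URL can discover the provider and redirect you there for authentication.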


unAPI addresses a different problem — the need for a uniform, simple method for copying rich digital objects out of any Web application. unAPI is a tiny HTTP API for the few basic operations necessary to copy discrete, identified content from any kind of web application. An interesting unAPI background and rationale paper from the Ariadne online library ezine is “Introducing unAPI”.

To invoke the service, I installed the unAPI Server, a WordPress plug-in from Michael J. Giarlo. The server provides records for each WordPress post and page in the following formats: OAI-Dublin Core (oai_dc), Metadata Object Description Schema (MODS), SRW-Dublin Core (srw_dc), MARCXML, and RSS. The specification makes use of LINK tag auto-discovery and an unendorsed microformat for machine-readable metadata records.
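The unAPI protocol itself is small enough to show in a few lines. The sketch below, in Python, composes the three request shapes a client can make against an unAPI server; the endpoint path is hypothetical (the actual path depends on where the plug-in installs its server script):

```python
from urllib.parse import urlencode

def unapi_url(base, identifier=None, fmt=None):
    """Compose the three unAPI request shapes:
       base                     -> list of all formats the server supports
       base?id=URI              -> formats available for that one object
       base?id=URI&format=NAME  -> the metadata record itself
    """
    params = {}
    if identifier is not None:
        params["id"] = identifier
    if fmt is not None:
        params["format"] = fmt
    qs = urlencode(params)
    return f"{base}?{qs}" if qs else base

# Hypothetical endpoint and post URI, for illustration only.
base = "http://example.org/wp-content/plugins/unapi/server.php"
record_url = unapi_url(base, "http://example.org/?p=123", "oai_dc")
```

A client first fetches the bare endpoint to learn which formats (oai_dc, mods, srw_dc, marcxml, rss) are on offer, then requests the record in its chosen format.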

Validation of the call is provided by the unAPI organization; the return for this blog entry was error free.

The actual unAPI results do not appear anywhere on your page, since the entire intent is machine readability. However, to show what can be expected, the internal metadata listings for the five formats are reproduced at these links:






unAPI adoption is just now beginning, and some apps like the Firefox Dublin Core Viewer add-on do not yet recognize the formats. But, because the specification is so simple and is being pushed by the library community, I suspect adoption will become very common. I encourage you to join!

Posted: January 4, 2007

After feeling like I was drowning in too much Campbell’s alphabet vegetable soup, I decided to bite the bullet (how can one bite a bullet while drinking soup?!) and provide an acronym look-up service to this AI3 blog site. So, welcome to the new permanent link to the Acronyms & Glossary shown in my standard main links to the upper left.

This permanent (and occasionally updated) page lists about 350 acronyms related to computer science, IT, the semantic Web, and information retrieval and processing. Most entries point to Wikipedia; those that do not instead reference their definitive standards site.

I welcome any suggestions for new acronyms to be added to this master list. Yabba dabba doo . . . . (oops, I meant YAML YAML ADO).

Posted by AI3's author, Mike Bergman Posted on January 4, 2007 at 10:48 am in Site-related | Comments (0)
Posted: January 3, 2007

Google Co-op Custom Search Engines (CSEs) Moving Forward at Internet Speed

Since its release a mere two months ago in late October, Google’s custom search engine (CSE) service, built on its Co-op platform, has gone through some impressive refinements and expansions. Clearly, the development team behind this effort is dedicated and capable.

I recently announced the release of my own CSE — SweetSearch — that is a comprehensive and authoritative search engine for all topics related to the semantic Web and Web 2.0. Like Ethan Zuckerman who published his experience in creating a CSE for Ghana in late October, I too have had some issues. Ethan’s first post was entitled, “What Google Coop Search Doesn't Do Well,” posted on October 27. Yet, by November 6, the Google Co-op team had responded sufficiently that Ethan was able to post a thankful update, “Google Fixes My Custom Search Problems.” I’m hoping some of my own issues get a similarly quick response.

Fast, Impressive Progress

It is impressive to note the progress and the removal of early issues over the last two months. For example, the early limit of 1,000 URLs per CSE has been upped to 5,000 URLs, with wildcard pattern matches improving this limit still further. The initial limit of two languages has been expanded to most common left-to-right languages (Arabic and Hebrew are still excluded). Many bugs have been fixed, the CSE blog has been a welcome addition, and file upload capabilities are quite stable (though not all eventual features are yet supported). The Google Co-op team actively solicits support and improvement comments, and a useful blog has been posted by the development team.

In just a few short weeks, at least 2,100 new CSEs have been created (found by issuing the advanced search query, ‘site:‘, to Google itself, with cx representing the unique ID key for each CSE). This number is likely low, since newly created or unreleased CSEs do not appear in the results. The growth clearly shows the pent-up demand for vertical search engines and users’ desire to improve authoritativeness and quality. Over time, Google will certainly reap user-driven benefits from these CSEs in its own general search services.

My Pet Issues

So, in the spirit of continued improvement, I offer below my own observations and pet peeves about how the Google CSE service presently works. I know these points will not fall on deaf ears, and perhaps other CSE authors will recognize issues of their own importance in this listing.

  1. There is a bug in handling “dirty” URLs for results pages. Many standard CMSs and blog packages, such as WordPress or Joomla!, provide options for both “pretty” URLs (SEO-friendly ones that embed the post title in the URL string) and “dirty” ones that label URLs with IDs or sequences following a question mark. Historical “dirty” URLs are often difficult to convert to “pretty” ones. Unfortunately, when results are embedded in a local site using a “dirty” URL, the Google CSE code truncates the URL at the question mark, which causes the JavaScript for results presentation to fail (see also this Joomla! link). As Ahmed, one of the Google CSE users, points out, there is a relatively easy workaround for this bug, but you would pull your hair out if you did not know the trick.
  2. Results page font-size control is lacking. Though it is claimed that control is provided for this, it is apparently not possible to control results font sizes without resorting to the Google Ajax search API (see more below).
  3. There is a bug in applying filetype “refinements” to results, such as the advanced Google search operator filetype:pdf. Google Co-op staff acknowledge this as a bug, and hopefully it will be corrected soon.
  4. Styling is limited to colors, borders, and ad placement locations unless one resorts to the Google Ajax search API, and the API itself still lacks documentation or tutorials on how to style results or interact with the native Google CSS. Admittedly, this is likely a difficult issue for Google, since too much control given to the user can undercut its own branding and image concerns. However, Google’s Terms of Service seem fairly comprehensive in such protections, and it would be helpful to see this documentation soon. Google Co-op team members often refer to the Ajax search API, but there is unfortunately too little useful online documentation to make this approach workable for mere mortals.
  5. It is vaguely stated that items called “attributes” can be included in CSE results and refinements (such as ‘A=Date’), but the direction is unclear and other forum comments seem to suggest this feature is not yet active. My own attempts show no issues in uploading CSE specifications that include attributes, but they are not yet retained in the actual specification currently used by Google. (Related to this topic is the fact that older forum postings may no longer be accurate as other improvements and bug fixes have been released.)
  6. Yes, there still remains a 5,000 “annotation” limit per CSE, which is the subject of complaint by some CSE authors. I personally have less concern with this limit now that the URL pattern matching has been added. Also, there is considerable confusion about what this “annotation” limit really means. In my own investigations, an “annotation” in fact is equivalent to a single harvest point URL (with or without wildcards) and up to four labels or facets (with or without weighting or comments) for each.
  7. While outside parties are attempting to provide general directory services, Google itself has a relatively poor way of announcing or listing new CSEs. The closest it comes is a posting page and the featured CSE engines, which are an impressive lot and filled with useful examples. Though a number of third parties are trying to provide comprehensive directory listings, most have limited coverage:
  8. The best way to get a listing of current CSEs still appears to be using the Google site: query above matched with a topic description, though that approach is not browsable and does not link to CSEs hosted on external sites.

  9. I would like to see expanded support for additional input and export formats, including potentially OPML, microformats or Gdata itself. The current TSV and XML approaches are nice.
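To make point 1 above concrete, here is a small Python sketch (the URL is hypothetical) contrasting naive truncation at the question mark, the failure mode in the CSE code, with proper URL parsing, which preserves the post identifier:

```python
from urllib.parse import urlsplit, parse_qs

# A hypothetical "dirty" permalink of the kind WordPress emits by default.
dirty = "http://example.com/index.php?p=123"

# Naive truncation at the question mark -- the post ID p=123 is lost,
# so the results JavaScript can no longer resolve the page:
truncated = dirty.split("?")[0]

# Proper URL parsing keeps the query string, so the identifier survives:
parts = urlsplit(dirty)
query = parse_qs(parts.query)
```

Any code that treats “?” as the end of the URL will silently break every ID-based permalink; parsing the URL into its components avoids the problem entirely.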

Yet, despite these quibbles, this CSE service is pointing the way to entirely new technology and business models. It is interesting that the Amazon S3 service and Yahoo!’s Developer Network are experimenting with similar Internet and Web service approaches. Let the fun begin!

Posted by AI3's author, Mike Bergman Posted on January 3, 2007 at 2:46 pm in Searching, Site-related | Comments (2)
Posted: January 2, 2007

How sweet it is!

I am pleased to announce the release of the most comprehensive and authoritative search engine yet available on all topics related to the semantic Web and Web 2.0. SweetSearch, as it is named, is a custom search engine (CSE) built using Google’s CSE service. SweetSearch can be found via this permanent link on the AI3 blog site. I welcome suggested site additions or improvements to the service by commenting below.

SweetSearch Statistics

SweetSearch comprises 3,736 unique host sites containing 4,038 expanded search locations (some hosts have multiple searchable components). Besides the most authoritative sites available, these include comprehensive access to 227 companies involved in the semantic Web or Web 2.0, more than 3,100 Web 2.0 sites, 53 blogs relating specifically to these topics, 101 non-profit organizations, 219 specific semantic Web and related tools, 21 wikis, and other goodies. Search results are also faceted into nine different categories, including papers, references, events, organizations, companies, and tools.

Other Semantic Web CSE Sites

SweetSearch is by no means the first Google CSE devoted to the semantic Web and related topics — but it may be the best and largest. Other related custom search engines (with the number of URLs they search) are Web20 (757 sites), the Web 2.0 Search Co-op (310), the University of Maryland, Baltimore County (UMBC) Ebiquity service (65), Elias Torres’ site (160), Andreas Blumauer’s site (20), Web 20 Ireland (67), NextGen WWW (21), and Sr-Ultimate (4), among others that will surely emerge.

General Resources

Besides the general Google CSE site, the development team’s blog and the user forum for practitioners are also good places to learn more about CSEs.

Vik Singh’s AI Research site is also very helpful in related machine learning areas, plus he has written a fantastic tutorial on how to craft a powerful technology portal using the Google CSE service.

Contributors and Comments Welcomed!

I welcome any contributors who desire to add to SweetSearch. See this Google link for general information about this site; please contact me directly at the email address in the masthead if you desire to contribute. For suggested additional sites or other comments or refinements, please comment below. I will monitor these suggestions and make improvements on a frequent basis.

Posted by AI3's author, Mike Bergman Posted on January 2, 2007 at 8:07 pm in Searching, Semantic Web, Site-related | Comments (1)
Posted: December 20, 2006

In my earlier Pro Blogging Guide (which is beginning to get long in the tooth, though it remains very popular) I documented the process of setting up my own instance of WordPress, the very popular open source blogging software. My memories of that effort were a little painful because of the need to set up a local server and sandbox for testing that site before it went live. In fact, one whole chapter in the Guide was devoted to that topic alone.

Well, I’ve just completed another hurdle, and that is moving from a company-hosted Web site and server to one that I own and manage on my own. I’m sure I could have made this easier on myself, but, actually, I wanted to learn the ropes and become self-reliant. I’ll be posting more of the specifics of this transfer, but here are the major areas that I needed to understand and embrace:

  • For a multitude of reasons, I decided I wanted complete control of my environment, but at acceptable cost. That led me to a virtual private server (VPS, also sometimes called a virtual dedicated server), wherein the user/owner has total “root” control. This was not too dissimilar from the virtual private network experience I had with my previous company. What a VPS means is that you have a footprint and total software installation and configuration control as if you owned the server, all accomplished remotely. Thus, I needed to research providers, services, responsiveness, etc. Per my SOP, I created spreadsheets and weighting matrices to help weigh my choices. My goal was to spend no more than $30 per month; mission accomplished!
  • Then, I also decided I did not want to be hooked into proprietary Microsoft software. I’m looking for low-cost scalability in my new venture, and clearly no one with a clue is using MS for Internet-based ventures. Thus, I needed to choose a flavor of Linux, figure out a whole new raft of software and utilities geared to remote administration and standard computer management (editors, file managers, transfer utilities, etc.), and then, most importantly, start learning CLIs (command-line interfaces). In so many ways, it felt like returning to the womb. I’ve been so GUI-ized and window-fied for more than 10 years that I felt like a stroke victim learning to speak and walk again! But wow, I like this stuff, and it is cool! In fact, it feels “purer” and “cleaner” (including such excesses as using emacs or vi(m) again!).
  • These commitments made, choices were then necessary to make the decisions actionable.
  • We’re now into the real nitty-gritty of open source, where LAMP comes into play. First, you need the OS (Linux, CentOS 4 in my case). Then, you need the Web server (Apache). Then, because I’m using WordPress, PHP needs to be installed. This has the side benefit of also allowing phpMyAdmin, a useful MySQL management framework. Oh, of course, that also means that a database to support all of this is needed, which again for WP is MySQL. That requires utilities for database transfer, backup and restoration. Unix provides an entirely different way to understand and manage permissions and privileges, also meaning more learning. Then, all of this infrastructure environment needs to be tested and then verified as working with a clean WP install. (One useful guide I found, based on Windows but still applicable, is Jeff Lundborg’s Apache Guide.)
  • However, big problem: my WP database is apparently larger than most, about 150 MB in size! (I like long posts and attachments.) The standard mechanisms for most blogs fail at these scales, including the phpMyAdmin approaches and a WP plug-in called Skippy. I was going to have to deal directly with MySQL, so I began learning its CLI syntax. Again, on-line guides for such backup and restore purposes can do the trick.
  • Then, and only then, I began migrating my blog set-up specifics (pretty nicely isolated within WP) and the database. Actually, here is where I had my first pleasant surprise: the CLI utilities for MySQL (which are really the same bash-like stuff that makes the environment so productive) work beautifully!
  • These efforts then needed to be followed by updating other dependencies within my blog, such as referring links or other internal but no-longer-applicable references, to make the entire new environment appropriately self-contained and integral. This actually requires cruising through the entire blog site, with specific attention to pages over posts, to ascertain integrity.
  • Now, with all of that choked down, I also decided to update my static pages and some of the layout, and to upgrade to WP version 2.0.5 (from 2.0.4, very straightforward), and then
  • Finally, the transfer of the domain and name servers to create the new hosting presence (not to mention email accounts, a challenge of a different order not further mentioned here). The domain transfer may take a few days, complicated if you also need to transfer domain registrars (something which I also needed to do).
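The direct-MySQL route for a large database boils down to one dump command on the old host and one restore command on the new one. The sketch below composes those shell commands in Python for clarity; the database name, user, and dump file name are hypothetical placeholders, not my actual configuration:

```python
import shlex

# Hypothetical names; substitute your own database, user, and file.
DB, USER, DUMPFILE = "wp_blog", "wp_admin", "wp_blog.sql"

def dump_command(db=DB, user=USER, out=DUMPFILE):
    # mysqldump serializes the whole database to a file of SQL statements;
    # --single-transaction takes a consistent snapshot of InnoDB tables.
    return (f"mysqldump -u {shlex.quote(user)} -p --single-transaction "
            f"{shlex.quote(db)} > {shlex.quote(out)}")

def restore_command(db=DB, user=USER, infile=DUMPFILE):
    # Restoring simply replays the dump through the mysql client
    # against an empty database created on the new host.
    return f"mysql -u {shlex.quote(user)} -p {shlex.quote(db)} < {shlex.quote(infile)}"
```

Because the dump is plain SQL, it transfers cleanly between hosts with scp or rsync, which is what makes this approach work where the phpMyAdmin upload limits fail.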

This latter point actually is a challenge. Internal WP links from your blog require your hosting URLs to be integral. However, if you understand this, and are able to use IP addresses directly during development (as I did for my blog), you can actually use the delayed transfer time between registrars to your benefit as you work out details. Again, it’s a matter of perspective: a delay in registration transfer actually gives you a free sandbox for getting the bugs worked out!

So, is this stuff painful? Yes, absolutely, if this is a one-off deal. In my case, however, the real pay-off will come (is coming) from using a transfer such as this as a real-world exercise in learning and exposure: Linux, own-hosting, tools, scripting. Seen in that light, this effort has been tremendously humbling and rewarding.

And, so, you are now seeing the fruits of this transfer! I will expand on specific steps in this process in future postings.

Posted by AI3's author, Mike Bergman Posted on December 20, 2006 at 12:46 pm in Site-related | Comments (1)