Posted: January 15, 2007

Pandora’s Music Genome Project is Another Example of the Pragmatic Semantic Web

An initiative now gaining buzz — and one that doesn’t once mention the semantic Web — is showing how metadata characterization can bring new discovery and benefits to music listening. Pandora, which bills itself as your personal DJ, offers a fantastic online music-listening experience organized as a series of shuffle-play “channels,” including ones you can design and share yourself. The system presently contains more than 400,000 songs from more than 20,000 artists, most of them contemporary.

The Music Genome Project is the basis for the system’s recommendations. It was started by a group of musicians in 2000 with the aim of assembling the most comprehensive database of music analysis yet created. According to its Web site:

Together we set out to capture the essence of music at the most fundamental level. We ended up assembling literally hundreds of musical attributes or “genes” into a very large Music Genome. Taken together these genes capture the unique and magical musical identity of a song – everything from melody, harmony and rhythm, to instrumentation, orchestration, arrangement, lyrics, and of course the rich world of singing and vocal harmony. It’s not about what a band looks like, or what genre they supposedly belong to, or about who buys their records – it’s about what each individual song sounds like.
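
Pandora has not published how its matching actually works, but the gene idea lends itself to a simple illustration. Below is a minimal sketch of attribute-vector similarity, assuming (purely hypothetically) a handful of named genes scored on a 0–5 scale; the songs, genes, scores, and cosine measure here are my own invention for illustration, not Pandora’s method.

```python
from math import sqrt

# Hypothetical "gene" scores on a 0-5 scale. Pandora's actual genes,
# scales, and matching method are not public; this is illustration only.
songs = {
    "Song A": {"blues_influence": 5, "electric_guitar": 5, "vocal_grit": 4},
    "Song B": {"blues_influence": 4, "electric_guitar": 5, "vocal_grit": 4},
    "Song C": {"blues_influence": 1, "acoustic_guitar": 5, "vocal_grit": 2},
}

def cosine(a, b):
    """Cosine similarity between two sparse gene-score vectors."""
    dot = sum(a.get(g, 0) * b.get(g, 0) for g in set(a) | set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Rank the catalog against a seed song, most similar first
seed = songs["Song A"]
for title, vec in sorted(songs.items(), key=lambda kv: cosine(seed, kv[1]), reverse=True):
    if vec is not seed:
        print(f"{title}: {cosine(seed, vec):.3f}")
```

Something in this spirit, applied across hundreds of genes and refined by listener feedback, would explain how a well-tuned channel can surface songs that sound right even when they share no genre label.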

The basic Pandora service is shown in this screen shot:

I’ve created my own radio “channel” emphasizing Eric Clapton, and specifically his blues interests alongside other blues artists (Clapton + Friends). The more you listen and refine, the more delightfully the shuffle play surprises. Pandora offers both free, persistent channels (with advertising) and paid channels (for a nominal fee) without advertising. (Refinement is not strictly necessary, but it overcomes the licensing limitation that prevents one from skipping too many songs in a given hour.)
Here’s the email that Pandora creates to send to friends (plus the specific reference to my Clapton channel), which you are welcome to try out:

Hi,

I wanted to let you know about Pandora, a free internet radio site based on the Music Genome Project. It helps you discover and listen to great music. Here’s the link if you’d like to check it out:

http://www.pandora.com

Just tell Pandora the name of your favorite song or artist, and it will create a radio station that plays songs with similar musical attributes.

Here’s a link to my profile page. From this page, you can listen to my stations and check out new music I’ve found on Pandora:

http://www.pandora.com/people/mike67150

BTW, I’d like to thank Michael Daconta, whose posting turned me on to the Pandora service.

Posted: January 3, 2007

Google Co-op Custom Search Engines (CSEs) Moving Forward at Internet Speed

Since its release a mere two months ago in late October, Google’s custom search engine (CSE) service, built on its Co-op platform, has gone through some impressive refinements and expansions. Clearly, the development team behind this effort is dedicated and capable.

I recently announced the release of my own CSE — SweetSearch — a comprehensive and authoritative search engine for all topics related to the semantic Web and Web 2.0. Like Ethan Zuckerman, who published his experience creating a CSE for Ghana in late October, I too have had some issues. Ethan’s first post, “What Google Coop Search Doesn't Do Well,” appeared on October 27. Yet by November 6, the Google Co-op team had responded sufficiently that Ethan was able to post a thankful update, “Google Fixes My Custom Search Problems.” I’m hoping some of my own issues get a similarly quick response.

Fast, Impressive Progress

The progress of the past two months, and the removal of some early issues, is impressive. For example, the early limit of 1,000 URLs per CSE has been raised to 5,000 URLs, with wildcard pattern matching extending that limit still further. The initial restriction to two languages has been expanded to most common left-to-right languages (Arabic and Hebrew are still excluded). Many bugs have been fixed, and file upload capabilities are quite stable (though not all eventual features are yet supported). The Google Co-op team actively solicits support and improvement comments (http://www.google.com/support/coop/), and the development team’s blog (http://googlecustomsearch.blogspot.com/) has been a welcome and useful addition.

In just a few short weeks, at least 2,100 new CSEs have been created (found by issuing the advanced search query 'site:http://google.com/coop/cse?cx=' to Google itself, with cx representing the unique ID key for each CSE). This number is likely low, since newly created or unreleased CSEs do not appear in the results. This growth clearly shows the pent-up demand for vertical search engines and users’ desire to improve authoritativeness and quality. Over time, Google will certainly reap user-driven benefits from these CSEs in its own general search services.

My Pet Issues

So, in the spirit of continued improvement, I offer below my own observations and pet peeves about how the Google CSE service presently works. I know these points will not fall on deaf ears, and perhaps other CSE authors will recognize some issues of their own in this listing.

  1. There is a bug in handling “dirty” URLs for results pages. Many standard CMSs and blog packages, such as WordPress or Joomla!, provide options for both “pretty” URLs (SEO-friendly ones that contain title names in the URL string, such as http://www.mydomain.com/2007/jan/most-recent-blog-post.html) vs. “dirty” ones that label URLs with IDs or query strings containing question marks (such as http://www.mydomain.com/?p=123). Historical “dirty” URLs are often difficult to convert to “pretty” ones. The Google CSE code unfortunately truncates the URL at the question mark when results are embedded in a local site using a “dirty” URL, which then causes the JavaScript for results presentation to fail (see also this Joomla! link). As Ahmed, one of the Google CSE users, points out, there is a relatively easy workaround for this bug, but you would pull your hair out if you did not know the trick.
  2. Results-page font-size control is lacking. Though control is claimed to be provided for this, it is apparently not possible to control results font sizes without resorting to the Google Ajax search API (see more below).
  3. There is a bug in applying filetype “refinements” to results, such as the advanced Google search operator filetype:pdf. Google Co-op staff acknowledge this as a bug and hopefully this will be corrected soon.
  4. Short of resorting to the Google Ajax search API, styling is limited to colors, borders, and ad placement locations, and the API itself still lacks documentation or tutorials on how to style results or interact with the native Google CSS. Admittedly, this is likely a difficult issue for Google, since too much control given to the user can undercut its own branding and image concerns. However, Google’s Terms of Service seem fairly comprehensive in such protections, and it would be helpful to see this documentation soon. Google Co-op team members often refer to the Ajax search API, but there is unfortunately too little useful online documentation to make this approach workable for mere mortals.
  5. It is vaguely stated that items called “attributes” can be included in CSE results and refinements (such as ‘A=Date’), but the directions are unclear, and other forum comments suggest this feature is not yet active. My own attempts show no issues in uploading CSE specifications that include attributes, but the attributes are not retained in the actual specification currently used by Google. (Relatedly, older forum postings may no longer be accurate, as improvements and bug fixes have since been released.)
  6. Yes, there still remains a 5,000 “annotation” limit per CSE, which is the subject of complaint by some CSE authors. I personally have less concern with this limit now that URL pattern matching has been added. There is also considerable confusion about what this “annotation” limit really means. In my own investigations, an “annotation” is in fact equivalent to a single harvest-point URL (with or without wildcards) plus up to four labels or facets (each with or without weighting or comments); a sketch of constructing one such annotation appears after this list.
  7. While outside parties are attempting to provide general directory services, Google itself has a relatively poor way of announcing or listing new CSEs. The closest it comes is a posting page (http://groups-beta.google.com/group/google-co-op/web/your-custom-search) and its featured CSE engines (http://google.com/coop/cse/examples/GooglePicks), which are an impressive lot, filled with useful examples. The third parties attempting comprehensive directory listings mostly have limited coverage.
  8. The best way to get a listing of current CSEs still appears to be the Google site: query above matched with a topic description, though that approach is not browsable and does not link to CSEs hosted on external sites.
  9. I would like to see expanded support for additional input and export formats, potentially including OPML, microformats, or GData itself. The current TSV and XML approaches are nice.
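
To make point 6 concrete, here is a minimal sketch of constructing a single annotation programmatically. It assumes the Annotations/Annotation/Label XML vocabulary used for CSE file uploads; the URL pattern, label names, and CSE ID below are invented placeholders, and the exact attributes accepted (e.g., for weighting) should be checked against Google’s current documentation.

```python
import xml.etree.ElementTree as ET

# One <Annotation> -- i.e., one entry against the 5,000 limit -- can
# cover an entire site section via a wildcard pattern and carry
# several facet labels. All names below are hypothetical placeholders.
root = ET.Element("Annotations")
ann = ET.SubElement(root, "Annotation", about="www.example.com/semweb/*")
ET.SubElement(ann, "Label", name="_cse_abcdefghijk")  # placeholder CSE ID label
ET.SubElement(ann, "Label", name="tools")             # a refinement facet
ET.SubElement(ann, "Comment").text = "Example semantic Web tools section"

ET.ElementTree(root).write("annotations.xml", encoding="utf-8",
                           xml_declaration=True)
```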

Yet, despite these quibbles, this CSE service is pointing the way to entirely new technology and business models. It is interesting that the Amazon S3 service and Yahoo!’s Developer Network are experimenting with similar Internet and Web service approaches. Let the fun begin!

Posted: January 2, 2007

How sweet it is!

I am pleased to announce the release of the most comprehensive and authoritative search engine yet available on all topics related to the semantic Web and Web 2.0. SweetSearch, as it is named, is a custom search engine (CSE) built using Google’s CSE service. SweetSearch can be found via this permanent link on the AI3 blog site. I welcome suggested site additions or improvements to the service by commenting below.

SweetSearch Statistics

SweetSearch comprises 3,736 unique host sites containing 4,038 expanded search locations (some hosts have multiple searchable components). Besides the most authoritative sites available, these include comprehensive access to 227 companies involved in the semantic Web or Web 2.0, more than 3,100 Web 2.0 sites, 53 blogs relating specifically to these topics, 101 non-profit organizations, 219 specific semantic Web and related tools, 21 wikis, and other goodies. Search results are also faceted into nine categories, including papers, references, events, organizations, companies, and tools.

Other Semantic Web CSE Sites

SweetSearch is by no means the first Google CSE devoted to the semantic Web and related topics — but it may be the best and largest. Other related custom search engines (with the number of URLs they search) are Web20 (757 sites), the Web 2.0 Search Co-op (310), the University of Maryland, Baltimore County (UMBC) Ebiquity service (65), Elias Torres’ site (160), Andreas Blumauer’s site (20), Web 20 Ireland (67), NextGen WWW (21), and Sr-Ultimate (4), among others that will surely emerge.

General Resources

Besides the general Google CSE site, the development team’s blog and the user forum for a group of practitioners are also good places to learn more about CSEs.

Vik Singh’s AI Research site is also very helpful in related machine learning areas, plus he has written a fantastic tutorial on how to craft a powerful technology portal using the Google CSE service.

Contributors and Comments Welcomed!

I welcome any contributors who desire to add to SweetSearch. See this Google link for general information about the site; please contact me directly at the email address in the masthead if you wish to contribute. For suggested additional sites or other comments or refinements, please comment below. I will monitor these suggestions and make improvements frequently.

Posted: September 27, 2006

Thanks to a NewsForge post, “Open source search technology goes beyond keywords,” I was directed to a description of the Semantic Indexing Project at Middlebury College. Aaron Coburn, the lead developer of the project, says his team is currently documenting its open source search toolkit and finishing a new desktop search application that should be released later this month. From the project Web site:

The National Institute for Technology in Liberal Education (NITLE) and Middlebury College have been experimenting with algorithms to help unstructured data organize itself into conceptually useful categories without human intervention. Part of our motivation is to find an alternative to spending prohibitive amounts of time and money on marking up course materials, documents, and online collections with metadata by hand. For many of the most common markup standards in use today, such as SCORM or Dublin Core, it can actually take longer to create markup than it did to create the course materials themselves.

The method being applied is a more scalable variant of latent semantic indexing that the team calls contextual network graphing. A PDF paper from the project, “Semantic Search of Unstructured Data using Contextual Network Graphs” by Maciej Ceglowski, Aaron Coburn and John Cuadrado, explains this promising technique in greater detail and notes its debt to a 1981 Ph.D. dissertation by Scott Preece at the University of Illinois, which described an almost identical technique under the name spreading activation search.
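
The paper explains the technique properly; as a rough illustration of the core idea only, here is a minimal sketch of spreading activation over a bipartite term-document graph. The index, weights, decay factor, and hop count are all invented for illustration, and real implementations add normalization and stopping criteria.

```python
from collections import defaultdict

# Hypothetical term -> {document: weight} index (weights might be TF-IDF)
index = {
    "ontology": {"doc1": 0.9, "doc2": 0.4},
    "semantic": {"doc1": 0.7, "doc3": 0.8},
    "search":   {"doc2": 0.6, "doc3": 0.5},
}

# Invert to document -> {term: weight} for the back-propagation step
inverted = defaultdict(dict)
for term, docs in index.items():
    for doc, w in docs.items():
        inverted[doc][term] = w

def spread(query_term, hops=2, decay=0.5):
    """Spread activation from a query term; return ranked documents.

    Energy flows term -> documents -> terms on each hop, attenuated by
    `decay`, so a document sharing no literal query term can still
    accumulate relevance through intermediate terms.
    """
    term_energy = {query_term: 1.0}
    doc_energy = defaultdict(float)
    for _ in range(hops):
        new_docs = defaultdict(float)
        for term, e in term_energy.items():
            for doc, w in index.get(term, {}).items():
                new_docs[doc] += e * w
        for doc, e in new_docs.items():
            doc_energy[doc] += e
        term_energy = defaultdict(float)
        for doc, e in new_docs.items():
            for term, w in inverted[doc].items():
                term_energy[term] += e * w * decay
    return sorted(doc_energy.items(), key=lambda kv: kv[1], reverse=True)

print(spread("ontology"))  # doc3 surfaces despite never containing "ontology"
```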

The Semantic Indexing Project is an umbrella effort over a number of subsidiary projects, including a blog census, a literary analysis tool, refinement of search and clustering algorithms, bioinformatics, the use of ontologies, and semantic relationship visualization through a Semantic Explorer, as this example shows:

All of the source code is available for download from the project, published under the terms of the GNU General Public License. The project’s core technology is the Semantic Engine, which is distributed with its C++ code, Perl bindings, and all the necessary code for building the GUI. A new desktop application, called the Standalone Engine, will be available later this month.

This work looks very promising as a step toward automating semantic Web markup, among the other advantages that derive from tagged documents.

Posted: September 9, 2006

The late Douglas Adams, of Doctor Who and The Hitchhiker’s Guide to the Galaxy fame, produced an absolutely fascinating, prescient and entertaining TV program for BBC2 sixteen years ago, presaging the Internet. Called Hyperland (see also the IMDb write-up), this self-labelled ‘fantasy documentary,’ a 50-minute video from 1990, can now be seen in its entirety on Google Video. Mind you, this was well in advance of the World Wide Web (remember the source for ‘www’?) and the browser, though both that name and hypertext are liberally sprinkled throughout the show.

The presentation, written by and starring Adams as the protagonist having a fantasy dream, features Tom, the semantic simulacrum (actually, Tom Baker from Doctor Who), the “obsequious, and fully customizable” personal software agent who introduces, anticipates and guides Adams through what is in actuality a semantic Web of interconnected information. Laptops (actually an early Apple), pointing devices, icons and avatars sprinkle this tour de force, an uncanny glimpse into the (now) future.

Sure, some details are wrong, and perhaps there is a bit too much emphasis (given today’s realities) on virtual reality, but the vision presented is exactly that promised by the semantic Web and an interconnected global digital library of information and multimedia. Wow! And entertaining and fun to boot!

This is definitely Must See TV!

I’d like to thank Buzzsort for first writing about the availability of this video. Apparently fans and aficionados have been clamoring for some time to see this show again, and it has only recently been posted. Indeed, access to an archived video such as this is itself a great example of Hyperland coming to reality.

An AI3 Jewels & Doubloons Winner