Some Quick Investigations Point to Promise, Disappointments
It has been clear for some time that Google has been assembling a war chest of entities and attributes. These first began to appear as structured results in its standard results listings, a trend I commented upon more than three years ago in Massive Muscle on the ABox at Google. Its purchase of Metaweb and its Freebase service in July 2010 only affirmed that trend.
This week, perhaps a bit rushed due to attention to the Facebook IPO, Google announced its addition of the Knowledge Graph (GKG) to its search results. It has been releasing this capability in a staged manner. Since I was fortunately one of the first to be able to see these structured results (due to luck of the draw and no special “ins”), I have spent a bit of time deconstructing what I have found.
What you get (see below) when you search on particular kinds of entities is in essence an “infobox”, similar in structure to what is found on Wikipedia. This infobox is a tabular presentation of key-value pairs, or attributes, for the kind of entity in the search. A ‘people’ search, for example, turns up birth and death dates and locations, some vital statistics, spouse or famous relations, pictures, and links to other relevant structured entities. The types of attributes shown vary by entity type. Here is an example for Stephen King, the writer (all links from here forward provide GKG results), which shows the infobox and its key-value pairs in the righthand column:
Reportedly these results are drawn from Freebase, Wikipedia, the CIA World Factbook and other unidentified sources. Some of the results may indeed be coming from Freebase, but I saw none as such. Most entities I found were from Wikipedia, though it is important to know that Freebase in its first major growth incarnation springboarded from a Wikipedia baseline. These early results may have been what was carried forward (since other contributed data to Freebase is known to be of highly variable quality).
The entity coverage appears to be spotty and somewhat disappointing in this first release. Entity types that I observed were in these categories:
Entity types that I expected to see, but did not find include:
This is clearly not rigorous testing, but it would appear that entity types along the lines of those in schema.org are what should be expected over time.
I have no way to gauge the accuracy of Google’s claims that it has offered up structured data on some 500 million entities (and 3.5 billion facts). However, given the lack of coverage in key areas of Wikipedia (which itself has about 3 million entities in the English version), I suspect much of that number comes from the local businesses and restaurants and such that Google has been rapidly adding to its listings in recent years. Coverage of broadly interesting stuff still seems quite lacking.
The much-touted knowledge graph is also kind of disappointing. Related links are provided, but they are close and obvious. So, an actor will have links to films she was in, or a person may have links to famous spouses, but anything approaching a conceptual or knowledge relationship is missing. I think, though, we can expect to see such links, types and entity expansions steadily creep in over time. Google certainly has the data basis for making these expansions. And, constant, incremental improvement has been Google’s modus operandi.
Deconstructing the URL
For some time, at the various meetings I attend, I have been at pains to ask Google representatives whether there is some unique, internal ID for entities within its databases. Sometimes the reps I have questioned just don’t know, and sometimes they are cagey.
But, clearly, anything like the structured data that Google has been working toward has to have some internal identifier. To see if some of this might now have surfaced with the Knowledge Graph, I did a bit of poking of the URLs shown in the searches and the affiliated entities in the infoboxes. Under most standard searches, the infobox appears directly if there is one for that object. But, by inspecting cross-referenced entities from the infoboxes themselves, it is possible to discern the internal key.
The first thing one can do in such inspections is to remove the parts of the URL that are local: cookie-like items related to one’s own use preferences or browser settings. Further tests can reveal other parts that can be removed. So, using the Stephen King example above, we can eliminate these aspects of the URL:
This actually conformed to my intuition, because the ‘&stick’ aspect was a new parameter for me. (Typically, in many of these dynamic URLs, the various parameters are separated from one another by a set designator character. In the case of Google, that is the ampersand &.)
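To make the ampersand-separated structure concrete, here is a minimal sketch that splits a search URL into its individual parameters. The URL is a hypothetical stand-in, not the one from my session: the `hl` and `client` parameters are assumed placeholders for the local, browser-related settings, and the `stick` value is the one listed in this post for The Green Mile.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical URL for illustration only. `hl` and `client` stand in for
# local/browser settings; the stick value is the one given in this post
# for The Green Mile.
url = ("https://www.google.com/search?q=the+green+mile&hl=en&client=example"
       "&stick=H4sIAAAAAAAAAONgVuLUz9U3MC62zC4AAGg8mEkNAAAA")

def split_params(search_url):
    """Return the query parameters as an ordered list of (name, value) pairs."""
    return parse_qsl(urlsplit(search_url).query, keep_blank_values=True)

params = split_params(url)
for name, value in params:
    print(f"{name} = {value}")
```

Listing the parameters this way makes it easy to drop them one at a time and see which ones actually affect the result.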
By simply repeating searches that resolve to the same entity references, I was able to confirm that the &stick parameter is what invokes the unique ID and infobox for each entity. We can decompose the value further; the critical, entity-specific part seems to be whatever is not included within the following: &stick=H4sIAAAAAAAAAONg . . [VuLQz9U3] . . AAAA. The stuff in the brackets varies less, and I suspect it might be related to the source, rather than the entity.
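One aside on the fixed prefix, offered as an observation rather than anything Google has documented: the leading characters H4sI are what Base64 produces for the bytes 0x1f 0x8b 0x08, which is how a gzip stream begins. That suggests, but does not prove, that the &stick value is a Base64-encoded, gzip-compressed payload. The sketch below checks only the header bytes of one identifier from this post (The Green Mile). The helper name and the padding handling are my own assumptions, and nothing here claims to know what the payload contains.

```python
import base64

# One &stick value copied from this post (The Green Mile).
STICK = "H4sIAAAAAAAAAONgVuLUz9U3MC62zC4AAGg8mEkNAAAA"

def decode_stick(value):
    """Base64-decode a stick value, adding '=' padding if it is missing."""
    padded = value + "=" * (-len(value) % 4)
    return base64.urlsafe_b64decode(padded)

raw = decode_stick(STICK)

# A gzip stream starts with the magic bytes 1f 8b, followed by the
# compression method byte (8 means deflate).
looks_like_gzip = raw[:2] == b"\x1f\x8b" and raw[2] == 8
print(raw[:10].hex(), looks_like_gzip)
```

If the header check holds for other identifiers too, the next step would be to try decompressing the payload, which I have not done here.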
I started to do some investigation on types and possible sources, but ran out of time. Listed below are some &stick identifiers for various types of entities (each is a live link):
|Movie||The Green Mile||&stick=H4sIAAAAAAAAAONgVuLUz9U3MC62zC4AAGg8mEkNAAAA|
|Albums||The White Album||&stick=H4sIAAAAAAAAAONgVuLSz9U3MMxIN0nKAADnd5clDgAAAA|
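The fixed and varying portions of these identifiers can be found mechanically. The following is a minimal sketch that computes the shared leading and trailing characters of the two identifiers shown above; the function names are mine, and with only two samples the split it reports should be treated as tentative.

```python
import os

# The two &stick values listed in the table above.
sticks = {
    "The Green Mile": "H4sIAAAAAAAAAONgVuLUz9U3MC62zC4AAGg8mEkNAAAA",
    "The White Album": "H4sIAAAAAAAAAONgVuLSz9U3MMxIN0nKAADnd5clDgAAAA",
}

def common_prefix(values):
    """Longest run of leading characters shared by all values."""
    return os.path.commonprefix(list(values))

def common_suffix(values):
    """Longest run of trailing characters shared by all values."""
    reversed_values = [v[::-1] for v in values]
    return os.path.commonprefix(reversed_values)[::-1]

prefix = common_prefix(sticks.values())
suffix = common_suffix(sticks.values())

print("shared prefix:", prefix)
print("shared suffix:", suffix)
for name, value in sticks.items():
    # Whatever lies between the shared ends is the part that varies.
    middle = value[len(prefix):len(value) - len(suffix)]
    print(f"{name}: {middle}")
```

Running the same comparison across many more entities and types would show whether the less-variable middle section really tracks the source or something else.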
You can verify that this ‘&stick’ reference is what is pulling in the infobox by looking at this modified query, which substitutes Marilyn Monroe’s &stick into the Stephen King URL string. Note that the standard search results in the lefthand results panel are the same as for Stephen King, but we have now fooled the Google engine into displaying Marilyn Monroe’s infobox.
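The substitution itself is easy to script. Below is a minimal sketch, assuming a simplified and hypothetical base URL; the `with_stick` helper is my own name, and the identifier swapped in is the one listed above for The White Album, used here purely as an illustration. Whether Google continues to honor a swapped parameter is not something this code can tell you.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def with_stick(search_url, stick):
    """Return search_url with its stick parameter replaced by the given value."""
    parts = urlsplit(search_url)
    # Keep every parameter except any existing stick, then append the new one.
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k != "stick"]
    params.append(("stick", stick))
    return urlunsplit(parts._replace(query=urlencode(params)))

# Hypothetical, simplified search URL carrying The Green Mile's identifier.
original = ("https://www.google.com/search?q=the+green+mile"
            "&stick=H4sIAAAAAAAAAONgVuLUz9U3MC62zC4AAGg8mEkNAAAA")

# Swap in the identifier listed above for The White Album.
swapped = with_stick(original, "H4sIAAAAAAAAAONgVuLSz9U3MMxIN0nKAADnd5clDgAAAA")
print(swapped)
```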
I’m sure that, over time, others will deconstruct this framework to a very precise degree. What would really be great, of course, as noted on many recent mailing lists, is for Google to expose all of this via an API. The Google listing could become the de facto source for Webby entity identifiers.
Some Concluding Thoughts
Sort of like when schema.org was first announced, there have been complaints from some in the semantic Web community that Google released this stuff without once saying the word “semantic”, that much had been ripped off from the original researchers and community without attribution, that a gorilla commercial entity like Google could only be expected to milk this stuff for profit, etc., etc.
That all sounds like sour grapes to me.
What we have here is what we are seeing across the board: the inexorable integration of semantic technology approaches into many existing products. Siri did it with voice commands; Bing, and now Google, are doing it too with search.
We should welcome these continued adoptions. The fact is, semWeb community, that what we are seeing in all of these endeavors is the right and proper role for these technologies: in the background, enhancing our search and information experience, and not something front and center or rammed down our throats. These are the natural roles of semantic technologies, and they are happening at a breakneck pace.
Welcome to the semantic technology space, Google! I look forward to learning much from you.