Posted:May 18, 2012

Deconstructing the Google Knowledge Graph

Google logoSome Quick Investigations Point to Promise, Disappointments

It has been clear for some time that Google has been assembling a war chest of entities and attributes. It first began to appear as structured results in its standard results listings, a trend I commented upon more than three years ago in Massive Muscle on the ABox at Google. Its purchase of Metaweb and its Freebase service in July 2010 only affirmed that trend.

This week, perhaps a bit rushed due to attention to the Facebook IPO, Google announced its addition of the Knowledge Graph (GKG) to its search results. It has been releasing this capability in a staged manner. Since I was fortunately one of the first to be able to see these structured results (due to luck of the draw and no special “ins”), I have spent a bit of time deconstructing what I have found.

Apparent Coverage

What you get (see below) when you search on particular kinds of entities is in essence an “infobox“, similar to the same structure as what is found on Wikipedia. This infobox is a tabular presentation of key-value pairs, or attributes, for the kind of entity in the search. A ‘people’ search, for example, turns up birth and death dates and locations, some vital statistics, spouse or famous relations, pictures, and links to other relevant structured entities. The types of attributes shown vary by entity type. Here is an example for Stephen King, the writer (all links from here forward provide GKB results), which shows the infobox and its key-value pairs in the righthand column: Google 'Stephen King' structured result

Reportedly these results are drawn from Freebase, Wikipedia, the CIA World Factbook and other unidentified sources. Some of the results may indeed be coming from Freebase, but I saw none as such. Most entities I found were from Wikipedia, though it is important to know that Freebase in its first major growth incarnation springboarded from a Wikipedia baseline. These early results may have been what was carried forward (since other contributed data to Freebase is known to be of highly variable quality).

The entity coverage appears to be spotty and somewhat disappointing in this first release. Entity types that I observed were in these categories:

  • People
  • Chemical Compounds
  • Directors
  • Some Companies
  • Some National Parks
  • Places
  • Musicians/Musical Groups
  • Actors
  • Some Government Agencies
  • Many Local Businesses
  • Animals
  • Movies
  • Albums
  • Notable Landmarks
  • Who Knows What Else

 

Entity types that I expected to see, but did not find include:

  • Products
  • Most Companies
  • Who Knows What Else
  • Songs
  • Most Government Agencies
  • Concepts
  • Non-government Organizations

This is clearly not rigorous testing, but it would appear that entity types along the lines of what is in schema.org is what should be expected over time.

I have no way to gauge the accuracy of Google’s claims that it has offered up structured data on some 500 million entities (and 3.5 billion facts). However, given the lack of coverage in key areas of Wikipedia (which itself has about 3 million entities in the English version), I suspect much of that number comes from the local businesses and restaurants and such that Google has been rapidly adding to its listings in recent years. Coverage of broadly interesting stuff still seems quite lacking.

The much-touted knowledge graph is also kind of disappointing. Related links are provided, but they are close and obvious. So, an actor will have links to films she was in, or a person may have links to famous spouses, but anything approaching a conceptual or knowledge relationship is missing. I think, though, we can see such links and types and entity expansions to steadily creep in over time. Google certainly has the data basis for making these expansions. And, constant, incremental improvement has been Google’s modus operandi.

Deconstructing the URL

For some time, and at various meetings I attend, I have always been at pains to question Google representatives whether there is some unique, internal ID for entities within its databases. Sometimes the reps I have questioned just don’t know, and sometimes they are cagey.

But, clearly, anything like the structured data that Google has been working toward has to have some internal identifier. To see if some of this might now have surfaced with the Knowledge Graph, I did a bit of poking of the URLs shown in the searches and the affiliated entities in the infoboxes. Under most standard searches, the infobox appears directly if there is one for that object. But, by inspecting cross-referenced entities from the infoboxes themselves, it is possible to discern the internal key.

The first thing one can do in such inspections is to remove that stuff that is local or cookie things related to one’s own use preferences or browser settings. Other tests can show other removals. So, using the Stephen King example above, we can eliminate these aspects of the URL:

https://www.google.com/search?num=100&hl=en&safe=off&q=stephen+king&stick=H4sIAAAAAAAAAONgVuLQz9U3MCvIKwQAcT1q1AwAAAA&sa=X&ei=Q-S2T6e1HYOw2wXR2OXOCQ&ved=0CPMGEJsTKAA&biw=1280&bih=827

This actually conformed to my intuition, because the ‘&stick’ aspect was a new parameter for me. (Typically, in many of these dynamic URLs, the various parameters are separated by one another by a set designator character. In the case of Google, that is the ampersand &.)

By simply doing repeated searches that result in the same entity references, I was able to confirm that the &stick parameter is what invokes the unique ID and infobox for each entity. Further, we can decompose that further, but the critical aspect seems to be what is not included within the following: &stick=H4sIAAAAAAAAAONg . . [VuLQz9U3] . . AAAA. The stuff in the brackets varies less, and I suspect might be related to the source, rather than the entity.

I started to do some investigation on types and possible sources, but ran out of time. Listed below are some &stick identifiers for various types of entities (each is a live link):

Type Entity Infobox Identifier
Place Los Angeles &stick=H4sIAAAAAAAAAONgVuLSz9U3MDYoTDIuAQD9ON7KDgAAAA
Place Brentwook &stick=H4sIAAAAAAAAAONgVuLUz9U3MKwsNEkDAN6nm-sNAAAA
Person Arthur Miller &stick=H4sIAAAAAAAAAONgVmLXz9U3qEwrAADRsxaaCwAAAA
Person Joe DiMaggio &stick=H4sIAAAAAAAAAONgVuLQz9U3SDFMqwAAAy8zdQwAAAA
Person Marilyn Monroe &stick=H4sIAAAAAAAAAONgVuLQz9U3MCkvLAIAW7x_LwwAAAA
Movie Shawshank Redemption &stick=H4sIAAAAAAAAAONgVuLQz9U3MM_KKwEAPYgNDQwAAAA
Movie The Green Mile &stick=H4sIAAAAAAAAAONgVuLUz9U3MC62zC4AAGg8mEkNAAAA
Movie Dr. Strangelove &stick=H4sIAAAAAAAAAONgVuLQz9U3MEopzwIAfsFCUQwAAAA
Animal Toucan &stick=H4sIAAAAAAAAAONgVuLQz9U3MM8wNAcA4g1_3AwAAAA
Animal Wolverine &stick=H4sIAAAAAAAAAONgUeLQz9U3sDAryQIAhJ3RUwwAAAA
Musicians/Groups The Beatles &stick=H4sIAAAAAAAAAONgVuLQz9U3ME82yAIAC_7r3AwAAAA
Albums The White Album &stick=H4sIAAAAAAAAAONgVuLSz9U3MMxIN0nKAADnd5clDgAAAA

 

You can verify that this ‘&stick‘ reference is what is pulling in the infobox by looking at this modified query that has substituted Marilyn Monroe’s &stick in the Stephen King URL string: Google 'Stephen King' + 'Marilyn Monroe' structured results Note the standard search results in the lefthand results panel are the same as for Stephen King, but we now have fooled the Google engine to display Marilyn Monroe’s infobox.

I’m sure over time that others will deconstruct this framework to a very precise degree. What would really be great, of course, as noted on many recent mailing lists, is for Google to expose all of this via an API. The Google listing could become the de facto source for Webby entity identifiers.

Some Concluding Thoughts

Sort of like when schema.org was first announced, there have been complaints from some in the semantic Web community that Google released this stuff without once saying the word “semantic”, that much had been ripped off from the original researchers and community without attribution, that a gorilla commercial entity like Google could only be expected to milk this stuff for profit, etc., etc.

That all sounds like sour grapes to me.

What we have here is what we are seeing across the board: the inexorable integration of semantic technology approaches into many existing products. Siri did it with voice commands; Bing, and now Google, are doing it too with search.

We should welcome these continued adoptions. The fact is, semWeb community, that what we are seeing in all of these endeavors is the right and proper role for these technologies: in the background, enhancing our search and information experience, and not something front and center or rammed down our throats. These are the natural roles of semantic technologies, and they are happening at a breakneck pace.

Welcome to the semantic technology space, Google! I look forward to learning much from you.

Schema.org Markup

headline:
Deconstructing the Google Knowledge Graph

alternativeHeadline:
Some Quick Investigations Point to Promise, Disappointments

author:

image:
http://www.mkbergman.com/wp-content/themes/ai3v2/images/2009Posts/090327_google_logo200.png

description:
This week Google announced the addition of the Knowledge Graph (GKG) to its search results. What you get when you search on particular kinds of entities is in essence an "infobox", similar to the same structure as what is found on Wikipedia.

articleBody:
see above

datePublished:

3 thoughts on “Deconstructing the Google Knowledge Graph

  1. Great first thoughts and observations, Mike!

    Indeed the information taken from Freebase is very partial. An easy example is that there is an infobox about the mainter Modigliani, including links to its paintings (under Artwork.) When you follow the one for “reclining nude” there is no infobox. And it is definitely the case the case that information about this painting is available in FreeBase, as we know from the so-called “Modigliani test” and our answer to it. One can play with Freebase data and such queries at factforge.net.

  2. Hi Naso,

    Thanks for confirming the Freebase observations. I should have thought to direct people to FactForge. ;)

    Hi Seth,

    Right, as I noted in the story (but perhaps not well ;) ), you don’t get the &stick parameter upon first search. You see it from cross-references in the infoboxes.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>