Posted:March 27, 2009

Google logo

The Recent ‘The Unreasonable Effectiveness of Data‘ Provides Important Hints

To even the most casual Web searcher, it must now be evident that Google is constantly introducing new structure into its search results. This past week three world-class computer scientists, all now research directors or scientists at Google, Alon Halevy, Peter Norvig and Fernando Pereira, published an opinion piece in the March/April 2009 issue of IEEE Intelligent Systems titled, ‘The Unreasonable Effectiveness of Data.’ It provides important framing and hints for what next may emerge in semantics from the Google search engine.

I had earlier covered Halevy and Google’s work on the deep Web. In this new piece, the authors describe the use of simple models working on very large amounts of data as means to trump fancier and more complicated algorithms.

“Unfortunately, the fact that the word ‘semantic’ appears in both ‘Semantic Web’ and ‘semantic interpretation’ means that the two problems have often been conflated, causing needless and endless consternation and confusion. The ‘semantics’ in Semantic Web services is embodied in the code that implements those services in accordance with the specifications expressed by the relevant ontologies and attached informal documentation.”

Some of the research they cite is related to WebTables [1] and similar efforts to extract structure from Web-scale data. The authors describe the use of such systems to create ‘schemata’ of attributes related to various types of instance records — in essence, figuring out the structure of ABoxes [2], for leading instance types such as companies or automobiles [3].

These observations, which they call the semantic interpretation problem and contrast with the Semantic Web, they generalize as being amenable to a kind of simple, brute-force, Web-scale analysis: “Relying on overt statistics of words and word co-occurrences has the further advantage that we can estimate models in an amount of time proportional to available data and can often parallelize them easily. So, learning from the Web becomes naturally scalable.”

Google had earlier posted their 1 terabyte database of n-grams, and I tend to agree that such large-scale incidence mining can lead to tremendous insights and advantages. The authors also helpfully point out that certain scale thresholds occur for doing such analysis, such that researchers need not have access to indexes the scale of Google to do meaningful work or to make meaningful advances. (Good news for the rest of us!)

As the authors challenge:

  1. Choose a representation that can use unsupervised learning on unlabeled data
  2. Represent the data with a non-parametric model, and
  3. Trust the important concepts will emerge from this analysis because human language has already evolved words for it.

My very strong suspicion is that we will see — and quickly — much more structured data for instance types (the ‘ABox’) rapidly emerge from Google in the coming weeks. They have the insights and approaches down, and clearly they have the data to drive the analysis! I also suspect many of these structured additions will just simply show up on the results listings to little fanfare.

The structured Web is growing all around us like stalagmites in a cave!


[1] Michael J. Cafarella, Alon Y. Halevy, Daisy Zhe Wang, Eugene Wu and Yang Zhang, 2008. “WebTables: Exploring the Power of Tables on the Web,” in the 34th International Conference on Very Large Databases (VLDB), Auckland, New Zealand, 2008. See http://web.mit.edu/y_z/www/papers/webtables-vldb08.pdf.
[2] As per our standard use:

"Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts."
[3] I very much like the authors’ use of ‘schemata’ as the way to describe the attribute structure of various instance record types for the ABox, in contrast to the more appropriate ‘ontology’ applied to the TBox.