The Recent ‘The Unreasonable Effectiveness of Data’ Provides Important Hints
To even the most casual Web searcher, it must now be evident that Google is constantly introducing new structure into its search results. This past week three world-class computer scientists, all now research directors or scientists at Google, Alon Halevy, Peter Norvig and Fernando Pereira, published an opinion piece in the March/April 2009 issue of IEEE Intelligent Systems titled ‘The Unreasonable Effectiveness of Data.’ It provides important framing and hints for what may emerge next in semantics from the Google search engine.
I had earlier covered Halevy and Google’s work on the deep Web. In this new piece, the authors describe the use of simple models working on very large amounts of data as a means to trump fancier and more complicated algorithms.
Some of the research they cite is related to WebTables and similar efforts to extract structure from Web-scale data. The authors describe the use of such systems to create ‘schemata’ of attributes related to various types of instance records: in essence, figuring out the structure of ABoxes for leading instance types such as companies or automobiles.
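To make the idea of a ‘schema’ of attributes concrete, here is a toy Python sketch of my own; it is not the WebTables method. It assumes the tables have already been extracted and labeled with an instance type (itself a hard problem), and simply keeps the column headers that recur across most tables of that type. The function names, the `min_share` threshold and the sample tables are all hypothetical.

```python
from collections import Counter, defaultdict


def attribute_frequencies(tables):
    """Map each instance type to the share of its tables using each attribute."""
    seen = defaultdict(Counter)
    totals = Counter()
    for instance_type, headers in tables:
        totals[instance_type] += 1
        # Normalize headers and count each at most once per table.
        seen[instance_type].update({h.strip().lower() for h in headers})
    return {
        t: {attr: n / totals[t] for attr, n in counts.items()}
        for t, counts in seen.items()
    }


def candidate_schema(frequencies, instance_type, min_share=0.5):
    """Keep attributes that appear in at least min_share of the tables."""
    attrs = frequencies.get(instance_type, {})
    return sorted(a for a, share in attrs.items() if share >= min_share)


# Hypothetical extracted tables: (instance type, column headers).
tables = [
    ("automobile", ["Make", "Model", "Year"]),
    ("automobile", ["make", "model", "Price"]),
    ("automobile", ["Make", "Color"]),
    ("company", ["Name", "Ticker"]),
]

freqs = attribute_frequencies(tables)
schema = candidate_schema(freqs, "automobile")
```

With only three automobile tables the result means little; the point is that nothing here is cleverer than counting, so the quality of the schema depends almost entirely on how many tables feed it.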
The authors frame these observations as the semantic interpretation problem, which they contrast with the Semantic Web, and they generalize them as being amenable to a kind of simple, brute-force, Web-scale analysis: “Relying on overt statistics of words and word co-occurrences has the further advantage that we can estimate models in an amount of time proportional to available data and can often parallelize them easily. So, learning from the Web becomes naturally scalable.”
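A minimal Python sketch may help show why such statistics parallelize so easily. This is my own illustration, not code from the paper: it counts word pairs that share a sentence, and because counts simply add, each shard of a corpus can be processed separately and merged afterward. The function names and the toy corpus are hypothetical, and the work is proportional to the data only if sentence length is bounded.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count unordered word pairs that appear in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        # Deduplicate and sort so each pair has one canonical form.
        words = sorted(set(sentence.lower().split()))
        counts.update(combinations(words, 2))
    return counts


def merge_counts(partials):
    """Combine per-shard counts; addition is order-independent."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical toy corpus, split into two "shards" as if on two machines.
shard_a = ["the red car", "the fast car"]
shard_b = ["the red house"]

sharded = merge_counts([cooccurrence_counts(shard_a), cooccurrence_counts(shard_b)])
single_pass = cooccurrence_counts(shard_a + shard_b)
```

Here `sharded` and `single_pass` hold identical counts, which is the property that lets the same computation be spread across as many machines as the data requires.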
Google had earlier posted their 1 terabyte database of n-grams, and I tend to agree that such large-scale incidence mining can lead to tremendous insights and advantages. The authors also helpfully point out that this kind of analysis becomes productive at certain scale thresholds, so researchers need not have access to indexes at the scale of Google's to do meaningful work or to make meaningful advances. (Good news for the rest of us!)
As the authors challenge:
- Choose a representation that can use unsupervised learning on unlabeled data
- Represent the data with a non-parametric model, and
- Trust that the important concepts will emerge from this analysis because human language has already evolved words for them.
My very strong suspicion is that we will see, and quickly, much more structured data for instance types (the ‘ABox’) emerge from Google in the coming weeks. They have the insights and approaches down, and clearly they have the data to drive the analysis! I also suspect many of these structured additions will simply show up on the results listings to little fanfare.
The structured Web is growing all around us like stalagmites in a cave!