The Recent ‘The Unreasonable Effectiveness of Data’ Paper Provides Important Hints
To even the most casual Web searcher, it must now be evident that Google is constantly introducing new structure into its search results. This past week, three world-class computer scientists who are now research directors or scientists at Google (Alon Halevy, Peter Norvig and Fernando Pereira) published an opinion piece in the March/April 2009 issue of IEEE Intelligent Systems titled ‘The Unreasonable Effectiveness of Data.’ It provides important framing and hints for what may emerge next in semantics from the Google search engine.
I had earlier covered Halevy and Google’s work on the deep Web. In this new piece, the authors describe the use of simple models applied to very large amounts of data as a means to trump fancier and more complicated algorithms.
Some of the research they cite is related to WebTables and similar efforts to extract structure from Web-scale data. The authors describe the use of such systems to create ‘schemata’ of attributes related to various types of instance records — in essence, figuring out the structure of ABoxes for leading instance types such as companies or automobiles.
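The WebTables approach can be sketched in miniature: mine attribute co-occurrence statistics across many extracted table headers, and let the frequently co-occurring attributes suggest a schema for an instance type. This is a toy illustration, not Google's actual pipeline; the sample tables and the `suggest_schema` helper are invented for this example.

```python
# Toy sketch of the WebTables idea: attribute co-occurrence counts across
# many extracted table headers suggest a 'schema' for an instance type.
from collections import Counter
from itertools import combinations

# Hypothetical stand-in for header rows extracted from Web tables.
tables = [
    ["name", "founded", "headquarters", "revenue"],
    ["name", "headquarters", "ceo"],
    ["make", "model", "year", "mpg"],
    ["name", "founded", "ceo", "revenue"],
]

pair_counts = Counter()
for headers in tables:
    attrs = sorted(set(h.lower() for h in headers))
    pair_counts.update(combinations(attrs, 2))

def suggest_schema(seed, min_count=2):
    """Attributes that co-occur with `seed` at least min_count times."""
    scores = Counter()
    for (a, b), c in pair_counts.items():
        if seed in (a, b):
            scores[b if a == seed else a] += c
    return [a for a, c in scores.most_common() if c >= min_count]

print(suggest_schema("name"))
```

At Web scale the same counting, run over hundreds of millions of tables, surfaces the dominant attribute sets for types like companies or automobiles.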
The authors frame these observations as the semantic interpretation problem, contrast it with the Semantic Web, and argue that it is amenable to a kind of simple, brute-force, Web-scale analysis: “Relying on overt statistics of words and word co-occurrences has the further advantage that we can estimate models in an amount of time proportional to available data and can often parallelize them easily. So, learning from the Web becomes naturally scalable.”
Google had earlier posted its 1-terabyte database of n-grams, and I tend to agree that such large-scale incidence mining can lead to tremendous insights and advantages. The authors also helpfully point out that such analysis becomes feasible past certain scale thresholds, so researchers need not have access to indexes at Google’s scale to do meaningful work or to make meaningful advances. (Good news for the rest of us!)
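The "naturally scalable" claim is easy to see in miniature: n-gram counting is embarrassingly parallel, since you can shard the corpus, count each shard independently, and merge the counts. A minimal sketch (the two toy shards are invented for illustration):

```python
# Minimal sketch of n-gram counting: count per shard (the map step),
# then merge Counters (the reduce step).
from collections import Counter

def ngrams(tokens, n=2):
    """Yield the n-grams of a token list as tuples."""
    return zip(*(tokens[i:] for i in range(n)))

def count_shard(lines, n=2):
    c = Counter()
    for line in lines:
        c.update(ngrams(line.split(), n))
    return c

# Two toy "shards" of a corpus; in practice each would run on its own machine.
shard_a = ["the cat sat on the mat"]
shard_b = ["the cat ate the fish"]
total = count_shard(shard_a) + count_shard(shard_b)
print(total[("the", "cat")])  # 2
```

Because the merge is just addition, the time to build the model grows linearly with the data, which is exactly the property the authors highlight.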
The authors challenge researchers to:
- Choose a representation that can use unsupervised learning on unlabeled data
- Represent the data with a non-parametric model, and
- Trust that the important concepts will emerge from this analysis, because human language has already evolved words for them.
My very strong suspicion is that we will see much more structured data for instance types (the ‘ABox’) emerge from Google quickly, in the coming weeks. They have the insights and approaches down, and clearly they have the data to drive the analysis! I also suspect many of these structured additions will simply show up in the results listings with little fanfare.
The structured Web is growing all around us like stalagmites in a cave!
One thought on “Massive Muscle on the ABox at Google”
Completely agree with the article. The eminent Jeff posted the following in an article from a couple of years ago (I changed the words a little):
When Does Ontological Classification Work Well?
Domain to be Organized
* Small corpus, Formal categories, Stable entities, Restricted entities, Clear edges
* Expert catalogers, Authoritative source of judgment, Coordinated users, Expert users
Where doesn’t it work?
* Large corpus, No formal categories, Unstable entities, Unrestricted entities, No clear edges
* Uncoordinated users, Amateur users, Naive catalogers, No Authority
The list of factors making ontology a bad fit is, also, an almost perfect description of the Web — largest corpus, most naive users, no global authority, and so on. The more you push in the direction of scale, spread, fluidity, flexibility, the harder it becomes to handle the expense of starting a cataloguing system and the hassle of maintaining it.
I think in many cases (like when trying to build a model of the Web and its contents) ontologies are best not pre-defined; ideally, the structures and hierarchies should emerge based on actual use and context. They are not static; they evolve and accumulate over time.
Since I began doing research in NLP, I have worked with non-parametric methods and non-static structures to model data. I built a toolkit based mainly on exploratory methods, including all kinds of clustering algorithms, with a special focus on fuzzy ones.
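Fuzzy clustering of the kind mentioned above can be sketched with a minimal fuzzy c-means in pure Python: unlike hard clustering, each point gets a degree of membership in every cluster. This is a generic textbook sketch, not the commenter's toolkit; the sample points are invented.

```python
# Minimal fuzzy c-means sketch: soft memberships instead of hard labels.
import random

def fuzzy_c_means(points, c=2, m=2.0, iters=50, seed=0):
    rng = random.Random(seed)
    # Random initial memberships, normalized so each point's row sums to 1.
    u = [[rng.random() for _ in range(c)] for _ in points]
    u = [[x / sum(row) for x in row] for row in u]
    for _ in range(iters):
        # Centroids are membership-weighted means of the points.
        centroids = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(len(points))]
            sw = sum(w)
            centroids.append(tuple(
                sum(w[i] * points[i][d] for i in range(len(points))) / sw
                for d in range(len(points[0]))))
        # Memberships are updated from relative inverse distances.
        for i, p in enumerate(points):
            d = [max(1e-12,
                     sum((p[k] - cj[k]) ** 2 for k in range(len(p))) ** 0.5)
                 for cj in centroids]
            for j in range(c):
                u[i][j] = 1.0 / sum((d[j] / d[l]) ** (2 / (m - 1))
                                    for l in range(c))
    return centroids, u

# Two well-separated toy clusters.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, u = fuzzy_c_means(points)
```

The fuzziness exponent `m` controls how soft the boundaries are; as `m` approaches 1 the algorithm behaves like ordinary k-means, which fits the commenter's point that non-static, graded structure is often a better model than rigid categories.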