To even the most casual Web searcher, it must now be evident that Google is constantly introducing new structure into its search results. This past week three world-class computer scientists, all now research directors or scientists at Google, Alon Halevy, Peter Norvig and Fernando Pereira, published an opinion piece in the March/April 2009 issue of IEEE Intelligent Systems titled ‘The Unreasonable Effectiveness of Data.’ It provides important framing and hints for what may emerge next in semantics from the Google search engine.
I had earlier covered Halevy and Google’s work on the deep Web. In this new piece, the authors describe the use of simple models working on very large amounts of data as a means to trump fancier and more complicated algorithms.
Some of the research they cite is related to WebTables and similar efforts to extract structure from Web-scale data. The authors describe the use of such systems to create ‘schemata’ of attributes related to various types of instance records; in essence, figuring out the structure of ABoxes for leading instance types such as companies or automobiles.
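To make the idea of attribute ‘schemata’ a bit more concrete, here is a minimal sketch that counts how often attribute names appear together in table headers. It is only an illustration of the general co-occurrence idea, not the WebTables method itself; the sample headers and the function names `attribute_cooccurrence` and `related_attributes` are made up for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical header rows, standing in for tables extracted from Web pages.
tables = [
    ["make", "model", "year", "price"],
    ["make", "model", "price"],
    ["company", "ticker", "ceo"],
    ["make", "model", "year"],
    ["company", "ceo", "headquarters"],
]

def attribute_cooccurrence(header_rows):
    """Count how often each pair of attribute names appears in the same table."""
    pair_counts = Counter()
    for header in header_rows:
        # Sort and de-duplicate so each unordered pair is counted once per table.
        for pair in combinations(sorted(set(header)), 2):
            pair_counts[pair] += 1
    return pair_counts

def related_attributes(pair_counts, attribute, top_n=3):
    """Return the attributes seen most often alongside the given one."""
    related = Counter()
    for (a, b), count in pair_counts.items():
        if a == attribute:
            related[b] += count
        elif b == attribute:
            related[a] += count
    return related.most_common(top_n)

counts = attribute_cooccurrence(tables)
print(related_attributes(counts, "make"))
```

With enough tables, attributes that routinely travel together (make, model, year) start to look like the schema for an instance type such as automobiles.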
These observations, which they call the semantic interpretation problem and contrast with the Semantic Web, they generalize as being amenable to a kind of simple, brute-force, Web-scale analysis: “Relying on overt statistics of words and word co-occurrences has the further advantage that we can estimate models in an amount of time proportional to available data and can often parallelize them easily. So, learning from the Web becomes naturally scalable.”
Google had earlier posted their 1 terabyte database of n-grams, and I tend to agree that such large-scale incidence mining can lead to tremendous insights and advantages. The authors also helpfully point out that certain scale thresholds occur for doing such analysis, such that researchers need not have access to indexes the scale of Google to do meaningful work or to make meaningful advances. (Good news for the rest of us!)
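For readers who have not worked with n-grams, the following is a minimal sketch of the kind of counting involved. The tiny corpus, the whitespace tokenization and the function names are assumptions for illustration only; nothing here reflects how Google built its n-gram database.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return each run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))

def count_ngrams(documents, n=2):
    """Tally n-grams across an iterable of text documents."""
    counts = Counter()
    for text in documents:
        tokens = text.lower().split()  # naive whitespace tokenization (an assumption)
        counts.update(ngrams(tokens, n))
    return counts

# A made-up corpus; real incidence mining would run over vastly more text.
corpus = [
    "the deep web is large",
    "the deep web is growing",
    "the structured web is growing",
]
bigram_counts = count_ngrams(corpus, n=2)
print(bigram_counts.most_common(3))
```

Because each document is tallied independently and `Counter` objects can simply be added together, the work splits naturally across machines, which is the scalability point the authors make.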
As the authors challenge:
My very strong suspicion is that we will see — and quickly — much more structured data for instance types (the ‘ABox’) rapidly emerge from Google in the coming weeks. They have the insights and approaches down, and clearly they have the data to drive the analysis! I also suspect many of these structured additions will just simply show up on the results listings to little fanfare.
The structured Web is growing all around us like stalagmites in a cave!
I’m pleased to wrap up a multi-part interview with the Federated Search Blog as part of their ongoing ‘Search Luminaries’ series. Sol Lederman, editor of the blog, does a thorough and comprehensive job! Over the past month on every Friday, I have answered some 25 or so of his detailed questions.
Federated Search Blog was particularly interested in the deep Web, its discovery and size. Many of the early questions deal with those themes. However, by Part 4 things get a bit more current, with the topics shifting to the semantic Web, linked data and Zitgist.
Here are the links to the series:
To give you a flavor of the interview, here is an example of one of the questions (and probably my favorite):
20. Tim Berners-Lee, credited with inventing the World Wide Web, has been talking about the importance and value of the Semantic Web for years yet common folks don't see much evidence of the Semantic Web gaining traction. Is there substance to the Semantic Web? What's happening with it now and what does its future look like?
No, actually, this is a very good question. As things go, I am a relative newbie to the semantic Web, only having studied and followed it closely since about 2005. I'm sure my perspective in coming later to the party may not be shared by those at the beginning, which dates to the mid-1990s as Berners-Lee's vision naturally progressed from a Web of documents, as most of us currently know the Web, to a Web of data.
I think there is indeed incredibly important substance to the semantic Web. But, as I have written elsewhere, the semantic Web is more of a vision than a discernable point in time or a milestone.
The basic idea of the semantic Web is to shift the focus from documents to data. Give data a unique Web address. Characterize that data with rich metadata. Describe how things are related to one another so that relationships and connections can be traced. Provide defined structures for what these things and relationships "mean"; this is what provides the semantics, with the structures and their defined vocabularies known as "ontologies" (which in one analog can be seen as akin to a relational database schema).
As these structures and definitions get put in place, the Web itself then becomes the infrastructure for relating information from everywhere and anywhere on any given topic or subject. While this vision may sound grandiose, just think back to what the Web itself has done for us and documents over the past decade or so. This same architecture and infrastructure can and should be extended to the actual information in those documents, the data. And, oh, by the way, conventional databases can now join this party as well. The vision is very powerful and very cool.
Progress has indeed been slow. Many advocates fairly point to how long it takes to get standards in place and for a while people spoke of the "chicken-and-egg" problem of getting over the threshold of having enough structured data to consume to make it worthwhile to create the tools and applications and showcases that consume that data.
From my perspective, the early visions of the semantic Web were too abstract, a bit off perhaps. First, there was the whole idea of artificial intelligence and machines using the data as opposed to better ways for humans to draw use from the data at hand. The fundamental and exciting engine underneath the semantic Web — the RDF (Resource Description Framework) data model — was not initially treated on its own. It got admixed with XML that made understanding difficult and distinctions vague. There is and remains too much academia and not enough pragmatics driving the bus.
But that is changing and fast.
There is now an immediate and practical "flavor" of the semantic Web called linked data. It has three simple bases:
(1) RDF as the simple but adaptable data model that can represent any information (structured or unstructured) as the basic "triple" statement of subject-predicate-object. That sounds fancy, but just substitute verb for predicate and noun for subject and object. In other words: Dick sees Jane; or the ball is round. It sounds like a kindergarten reader, but that is how data can be easily represented and built up into more complex structures and stories;
(2) Give all objects a unique Web identifier. Unique identifiers are common to any database; in linked data, we just make sure those identifiers conform to the same URIs we see constantly in the address bar of our Web browsers; and
(3) Post and expose this stuff as accessible on the Web (namely, via HTTP).
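The first two bases can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `http://example.org/` identifiers, the `hasShape` predicate and the helper names are invented for the example, and a real deployment would use an RDF library and properly published URIs.

```python
# Hypothetical URIs under example.org; none of these identifiers come from the interview.
EX = "http://example.org/"

# Each statement is a (subject, predicate, object) triple of Web identifiers.
triples = {
    (EX + "Dick", EX + "sees", EX + "Jane"),
    (EX + "Jane", EX + "sees", EX + "Spot"),
    (EX + "ball", EX + "hasShape", EX + "round"),
}

def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the given pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    )

def to_ntriples(triples):
    """Serialize as N-Triples-style lines: <subject> <predicate> <object> ."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for (s, p, o) in sorted(triples))

print(match(triples, predicate=EX + "sees"))
print(to_ntriples(triples))
```

The third basis is then a matter of serving that serialized text over HTTP at the same addresses the identifiers name.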
My company adds some essential "spice" to these flavors with respect to reference structures and concepts to give the information context, but these simple bases remain the foundation.
These are really not complex steps. They are really no different than the early phases of posting documents on the Web. Only now, we are exposing data.
More importantly, we can forget the chicken-and-egg problem. Each new data link we make brings value, in the similar way that adding a node to a network brings value according to Metcalfe's Law. Only with linked data, we already have the nodes — the data — we are just establishing the link connections (the verbs, predicates or relations) to flesh out the network graph. Same principle, only our focus is now to connect what is there rather than to add more nodes. (Of course, adding more linked nodes helps as well!)
The absolutely amazing thing about our current circumstance as Web users is that we truly now have simple and readily deployable mechanisms available to finally overcome the decades of enterprise stovepipes. The whole answer is so simple it can be mistaken as snake oil when first presented and not inspected a bit.
As an industry accustomed to hype and cynical about so much of this, I only ask that your readers check out these assertions for themselves and suspend their normal and expected disbelief. For me, in a career of more than 30 years focusing on information and access, I feel like we finally now have the tools, data model and architecture at hand to actually achieve data interoperability.
Thanks again to Sol and Federated Search Blog for this opportunity.
The past two weeks have seen an interesting emergence of new perspectives on the ‘deep Web’. The deep Web, a term Thane Paulsen and I coined for my oft-quoted study from 2000, The Deep Web: Surfacing Hidden Value, is the phenomenon of database-backed content served from interactive Web search forms.
Because deep Web content is dynamic and produced only on request, it has been difficult for traditional search engines to index. It is also huge and of high quality (though likely not the 100 to 500 times larger than the standard ‘surface’ Web that I estimated in that first study).
This is the most recent of the three notable events over the past two weeks, and came out on Tuesday. Maureen Flynn-Burhoe of the oceanflynn @ Digg blog has produced a very informative and comprehensive timeline of deep Web and related developments from 1980 to the present (database-backed content and early Web precursors, of course, precede the Web itself and the term ‘deep Web’).
I have been directly involved in this field since 1994 and have not yet seen such a comprehensive treatment. She cites studies noting “hundreds of thousands” of deep Web sites and the faster growth of dynamic (database-served) as opposed to static (‘surface’) content on the Web.
As someone directly involved in estimating the size of the deep Web, I appreciate the analytic difficulties and take all of the estimates (my own older ones included!) with a grain of salt. Nonetheless, the deep Web is important, its content is huge, often of unique and high quality, and it deserves serious attention by Web scientists.
Great job, Maureen! I always appreciate thorough researchers. (BTW, I suspect you might also like the Timeline of Information History.)
The next notable event was the publishing of Searching the Deep Web by Alex Wright in the Communications of the ACM (October 2008). Alex had first written about the deep Web for Salon magazine in 2004 and had given nice attention to my company at that time, BrightPlanet.
In this current update, Alex does an excellent job of characterizing current status and research in search techniques for the deep Web. I also liked the fact he used our fishing analogy of trawling for standard search crawlers versus direct angling in the deep Web (see our earlier figure at upper left).
As some may recall, Google has stepped up its activities in this area, an event I reported on a few months back. Those perspectives, and others from some other notable figures, are included in Alex’s piece as well.
My own contribution to the piece was to suggest that RDF and semantic Web approaches offered the next evolutionary stage in deep Web searching. Alex was able to take that theme and get some great perspectives on it. I also appreciate the accuracy of my quotes, which gives me confidence in the quality of the rest of the story.
Without a doubt there is high quality in the deep Web and bringing structure and semantic characterization to it through metadata is a task of some consequence.
For myself, I chose to move beyond the deep Web when its focus seemed stuck in a document-level perspective to retrieval and analysis. However, there is much to be learned from the techniques used to select and access deep Web content, which could be readily transferable to linked data.
Thanks, Alex, for making these prospects clearer! Maybe it is time to dust off some of my old stuff!
This emerging joining of deep Web and semantics is actually taking place through the efforts of a number of academic researchers. Prominent among recent contributors are James Geller from the New Jersey Institute of Technology and his colleagues Soon Ae Chun and Yoo Jung. Their recently published paper, Toward the Semantic Deep Web, shows how ontologies and semantic Web constructs can be combined to more effectively extract information from the deep Web. They call this combination the ‘semantic deep Web.’
The authors posit that the structured roots of deep Web content lend themselves to better ontology learning from the Web. They also point to the usefulness of deep Web structure to annotations.
That such confluences are occurring between the semantic and deep “Webs” is a function of focused academic attention and the growing maturity of both perspectives. This year, for example, saw the inauguration of the first Workshop on Advances in Accessing Deep Web (ADW 2008). As part of the International Conference on Business Information Systems (BIS 2008), this meeting saw a lot of elbow rubbing with semantic Web and enterprise topics.
It might seem strange (indeed, sometimes it does to me) to envision structured database content being served through a Web form and then converted via ontologies and other means to semantic Web formats. After all, why not go direct to the data?
And, of course, direct conversion is less lossy and more efficient.
But, one interesting point is that semantic Web techniques are increasingly working as a structure-extraction layer wrapping the standard Web. In that regard, starting with inherently structured source data — that is, the deep Web — can lead to higher quality inputs across the distributed, heterogeneous content of the Web.
Given the impossibility of everyone starting with the same premises and speaking the same languages and concepts, semantic Web mediation methods offer a way to overcome the Tower of Babel. And, when the starting content itself is inherently structured and (generally) of higher quality — that is, the deep Web — the logic of the combination becomes more obvious.
Interested in learning more about the deep Web? I first recommend the resources posted at the bottom of Flynn-Burhoe’s timeline. And, for a very thorough treatment, I also recommend Denis Shestakov’s Ph.D. thesis from earlier this year. It has a bibliography of some 115 references.
As late as 2002, no single search engine indexed the entire surface Web. Much has been written about that time, but the emergence of Google (and indeed others; it was a key battle at the time) worked to extend full search coverage to the Web, ending the need for so-called desktop metasearchers, until then the only option for getting full Web search coverage.
Strangely, though full coverage of document indexing had been conquered for the Web, dynamic Web sites and database-backed sites fronted by search forms were also emerging. Estimates as of about 2001, made by myself and others, suggested such ‘deep Web’ content was many, many times larger than the indexable document Web and was found in literally hundreds of thousands of sites.
Standard Web crawling is a different technique and technology from “probing” the contents of searchable databases, which requires a query to be issued to a site’s search form. The company I founded, BrightPlanet, and many others such as Copernic and Intelliseek, a number of which no longer exist, were formed with the specific aim of probing these thousands of valuable content sites.
From those companies’ standpoints, mine at that time as well, there was always the threat that the major search engines would draw a bead on deep Web content and use their resources and clout to appropriate this market. Yahoo, for example, struck arrangements with some publishers of deep content to index their content directly, but that still fell short of the different technology that deep Web retrieval requires.
It was always a bit surprising that this rich storehouse of deep Web content was being neglected. In retrospect, perhaps it was understandable: there was still the standard Web document content to index and conquer.
Today, however, Google posted Crawling through HTML forms on one of its developer blog sites. Written by Jayant Madhavan and Alon Halevy, a noted search and semantic Web researcher, the post announces Google’s new deep Web search:
In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn’t find and index for users who search on Google. Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.
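The URL-generation step described in that passage can be sketched roughly as follows. This is a guess at the general shape of the idea, not Google’s implementation: it assumes a form submitted via GET, the `http://example.org/search` address and field values are hypothetical, and the sketch neither fetches pages nor judges whether the results are valid or interesting.

```python
from itertools import product
from urllib.parse import urlencode

def candidate_urls(action_url, fields, max_urls=10):
    """Build GET URLs for combinations of candidate form values.

    fields maps each input name to the list of values to try for it.
    """
    names = sorted(fields)
    urls = []
    for combo in product(*(fields[name] for name in names)):
        urls.append(action_url + "?" + urlencode(dict(zip(names, combo))))
        if len(urls) >= max_urls:  # keep the number of probes small
            break
    return urls

# Hypothetical form: one text box (words drawn from the site) and one select menu.
form_fields = {
    "q": ["solar", "wind"],
    "region": ["us", "eu"],
}
for url in candidate_urls("http://example.org/search", form_fields):
    print(url)
```

The `max_urls` cap reflects the “small number of queries” in the announcement; the number of combinations grows multiplicatively with each added input, so some limit is needed in any case.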
To be sure, there are differences and nuances to retrieval from the deep Web. What is described here is neither truly directed nor comprehensive. But, the barrier has fallen. With time, and enough servers, the more inaccessible aspects of the deep Web will fall to the services of major engines such as Google.
And, this is a good thing for all consumers desiring full access to the Web of documents.
So, an era is coming to a close. And this, too, is appropriate. For we are also now transitioning into the complementary era of the Web of data.
Over the past few months I have increasingly been writing about and referring to the structured Web. I have done so purposefully, but, so far, with little background or explication. With the inauguration of this occasional series, I hope to bring more color and depth to this topic.
Literally, over the past year, I have been learning and documenting on AI3 my attempts to understand the basis, concepts and tools of the emerging semantic Web. In that process, I have come to define my own outlines of the Web past, present and future. Within this world view, I see the structured Web as today’s current imperative and reality.
Some Web pundits have embraced a versioning terminology of Web 2.0 and Web 3.0 to describe one such world view. I don’t personally agree with this silly versioning — indeed I poked fun in a tongue-in-cheek posting about Web 98.6 more than a year ago — but such terminology has gotten some traction and serves a purpose. I actually give my own definitions for such “versions” below if for no other reason than to close the gap with alternative world views.
We need not go back to the alternative early protocols of Usenet (and news groups), Gopher and FTP and their search engines of Veronica, WAIS, Jughead or Archie in 1991 when Tim Berners-Lee first publicly announced the World Wide Web and its combination of hypertext with the Internet. More likely, the release of the Mosaic browser and CERN’s decision to make access to the Web free in 1993 marked the true take-off point for the Web and the continued demise of the competing protocols.
Images and links in Web pages (“documents”) plus the HTML mark-up language to enable the styling and graphical design of those pages were very much in keeping with general trends, paralleling the earlier transition of personal computers to graphical interfaces and away from terminals. Mosaic became the foundation for the Netscape browser, best links compilations became a big hit through sites like Yahoo!, and the Lycos search engine, one of the first profitable Web ventures, indexed a mere 54,000 pages when it was publicly released in 1994.
This initial start to the Web (now referred to by some as ‘Web 1.0’) can be squarely timed to 1993-1994. By 1995, the Web was appearing on the covers of major news magazines and by 1996 the phenomenon was at full throttle. But, since these early beginnings, the Web has gone through many different “versions” and transitions, most not fitting with version numbers, as some of these examples show:
Despite these differences in viewpoint, language does matter. Though some may view language as a contest in “branding,” which can legitimately apply in other venues, I think the issue here goes well beyond “branding.” Language is also necessary to aid communication.
As I explain below and elaborate upon more fully throughout this series, I believe one of the correct terms for the current evolutionary state of the Web is the ‘structured Web.’
As noted, portions of these trends and changes are more broadly combining to represent another transitional change in the Web from one solely focused on documents to one that is more object- or data-centric. Evidence of this trend includes such factors as:
One of the most popular series of presentations at this year’s WWW2007 conference in Banff was from the Linked Open Data project of the SWEO interest group. The members of this LOD project, a group of accomplished advocates, developers and theorists, are providing the awareness, tools and example data that are showing how this emerging version may look. In fact, the group has just announced crossing the threshold of 1 billion ‘triples’ with 180,000 interlinks within its online DBpedia service, via these sources:
The LOD’s term for this effort is ‘linked data’, and a Web site has been established to promote it. Others, harking back to Tim Berners-Lee’s original definition, refer to current efforts as a ‘Web of data’ or the ‘Semantic Data Web.’ Kingsley Idehen has been promoting the idea of ‘data spaces’ (personal and collective), which is also a powerful metaphor.
Frankly, I think all of these terms are correct and useful. Yet I prefer the term structured Web because it is both more and less than some of these other terms.
The structured Web is more in that it pertains to any data formalism in use on the Web and includes the notion of extracting structure from uncharacterized content, by far the largest repository of potentially useful information on the Web. Yet the structured Web is also less because its ambition is solely to get that data into an interoperable framework and to forgo the full objectives of the ‘Semantic Web.’ In that regard, my concept of the structured Web is perhaps closest to the idea of linked data, though with less insistence on “correct” RDF and with specific attention to structure extraction from uncharacterized content.
One of today’s realities is that we have accomplished much but still have a long way to go to achieve the grand vision of the ‘Semantic Web’ (capitalized).
More than a year ago I wrote a piece on “Climbing the Data Federation Pyramid” that noted the tremendous progress that has been made in the last twenty years in overcoming many seemingly intractable issues in data interoperability, initially of a physical and hardware nature. The Internet and Web standards have made enormous contributions to that progress.
The diagram I used in that piece is shown below. Reaching the pyramid’s pinnacle could be argued as having achieved the grand vision of the Semantic Web. With the adoption of the Internet and Web protocols, all layers up through data representation have largely been solved. Data representation, data models, schema for different world views, and means for reconciling and mediating those different world views are much of the focus of today’s conceptual challenges.
Note, as we discuss the structured Web, that we are largely focusing on the layer dealing with data representation, with some minor portions (principally in disambiguation) dealing with semantics. Getting data into a canonical data representation or model still leaves very crucial challenges in what the data means (its semantics), reasoning over that data (inference and pragmatics), and whether the data is authoritative or can be trusted. These are the daunting, and largely remaining, challenges of the Semantic Web.
For example, let’s look solely at the layer of semantics, the immediate challenge after data representation. By semantics, we are referring to whether different statements from different sources indeed refer or not to the same entity or concept; in other words, have the same meaning. Such a determination is pivotal if we are to combine data from multiple sources.
The use of RDF, accurate name spaces and syntactically correct URIs aids this resolution, but does not completely solve it. Ultimately, semantic mediation (such as my “glad” is equivalent to your “happy”) means resolving or mediating potential heterogeneities across roughly 40 discrete categories of potential mismatches, including units of measure, terminology, language, and many others. These sources may derive from structure, domain, data or language, as shown in this table:
| | Generalization / Specialization | |
| | Internal Path Discrepancy | |
| | Missing Item | Content Discrepancy |
| | | Attribute List Discrepancy |
| DOMAIN | Schematic Discrepancy | Element-value to Element-label Mapping |
| | | Attribute-value to Element-label Mapping |
| | | Element-value to Attribute-label Mapping |
| | | Attribute-value to Attribute-label Mapping |
| | Scale or Units | |
| | Data Representation | Primitive Data Type |
| | ID Mismatch or Missing ID | |
| LANGUAGE | Encoding | Ingest Encoding Mismatch |
| | | Ingest Encoding Lacking |
| | | Query Encoding Mismatch |
| | | Query Encoding Lacking |
| | | Parsing / Morphological Analysis Errors (many) |
| | | Syntactical Errors (many) |
| | | Semantic Errors (many) |
Using the same data model (say, RDF) or the same name spaces (say, Dublin Core or FOAF) helps somewhat to remove some of these sources of heterogeneity, but not all. Undoubtedly, longer term, resolving these heterogeneities will prove tractable. But they are not easily so today.
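To give a feel for what mediating even two of these mismatch types involves, here is a minimal sketch covering terminology (“glad” versus “happy”) and scale or units. The mapping tables, record layout and function names are all hypothetical; a real mediator would draw on curated vocabularies and would have to handle far more of the categories above.

```python
# Hypothetical mappings; a real mediator would draw these from curated vocabularies.
SYNONYMS = {"glad": "happy", "joyful": "happy"}
TO_METERS = {"m": 1.0, "km": 1000.0, "ft": 0.3048}

def canonical_term(term):
    """Map a term to its canonical form (a terminology mismatch)."""
    term = term.strip().lower()
    return SYNONYMS.get(term, term)

def to_meters(value, unit):
    """Convert a length to meters (a scale-or-units mismatch)."""
    return value * TO_METERS[unit]

def normalize(record):
    """Rewrite a source record into one shared vocabulary and unit."""
    value, unit = record["distance"]
    return {
        "mood": canonical_term(record["mood"]),
        "distance_m": round(to_meters(value, unit), 6),
    }

# Two made-up sources describing the same thing in different terms and units.
source_a = {"mood": "glad", "distance": (1.0, "km")}
source_b = {"mood": "Happy", "distance": (1000, "m")}
print(normalize(source_a) == normalize(source_b))
```

Even this toy case needs two hand-built lookup tables, which hints at why resolving all of these categories across arbitrary sources is not easy today.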
This observation does not undercut the Semantic Web vision nor negate the massive labors in support of that vision taken to date. But, hopefully, this observation may bring some perspective to the task ahead to obtain that vision.
If nothing else, the reality of the past 15 years shows us that the Web is a “dirty,” chaotic place. If HTML coding can be screwed up, it will be. If loopholes in standards and protocols exist, they will be exploited. If there is ambiguity, all interpretations become possible, with many passionately held. Innovation and unintended uses occur everywhere.
This should not be surprising, and experienced Web designers, scientists and technologists should all know this by now. There can be no disconnect between workable standards and approaches and actual use in the “wild.” Nuanced arguments over the subtleties of standards and approaches are bound to fail. Robustness, simplicity and forgiveness must take precedence over elegance and theoretical completeness.
While there has been obvious growth in the sophistication of Web sites and the underlying technologies that support them, we see continued use of obsolete approaches that clearly should have been abandoned long ago (such as Web-safe colors, small displays, older browser versions, Web pages parked on some servers that have not been modified or looked at by their original authors in a decade, etc.). We also see slow uptake for clearly “better” new approaches. And we also sometimes see explosive uptake of approaches and ideas that seemingly come out of nowhere.
We also see that those approaches that enjoy the greatest success (blogging, tagging, microformats, RSS and widgets come most recently to mind) are those that the “citizen” user can easily and readily embrace. HTML was pretty foreign at first, but now millions of users modify their own code. Millions of users of various CMS platforms and Firefox have learned how to install plug-ins and add-ins and modify CSS themes and use administration consoles.
So, my observation and argument is not that we must always choose what is mindless and unchallenging. Rather, it is that we must accept real-world diversity and seek simplicity, robustness and clarity for what is new.
After nearly a decade of standards work, the basis for beginning the transition to the semantic Web is in place. But that vision itself sometimes appears too demanding, too intimidating. The vision at times appears all too unreachable.
Of course, this perception is wrong. Measured over many years, perhaps some decades, the vision of the semantic Web is reachable. Much remains to be worked on and understood regarding this vision in terms of mediating and resolving semantic heterogeneities, capturing and expressing world views through formal ontologies, making inferences between these views, and establishing trust and authoritativeness. And those challenges do not yet address the even more-exciting prospects of intelligent and autonomous agents.
Rather, the rationale for the structured Web is to tone down the vision, stay with the here and now, focus on what is achievable today. And what is achievable today is very great.
Though version numbers for the Web are silly, with ‘Web 3.0’ for the semantic Web possibly being the silliest of all, such attempts do speak to the need for providing handles and language for capturing the dynamic change, diversity and complexity of the Web.
Today, right now, and all around us, a fundamental transition is taking place in the Web from a document-centric to a data-centric environment. A confluence of standards, advocacies, and previous trends is fueling this transition. Since the practical building blocks already exist, we will see this structured Web unfold before us at amazing speed.
The concept of the structured Web is thus narrower and less ambitious in scope than the ‘Semantic Web.’ The structured Web is merely a transitional step on the journey to the vision of the semantic Web, albeit one that can be fully realized today with current technologies and current understandings.
The purpose of this new series is thus to give prominence to this transition and to highlight the pragmatic, practical building blocks available to contribute to this transition. By somewhat shifting boundary definitions, the idea of the structured Web also aims to give more prominence to the importance of usability and structure extraction from semi-structured and unstructured content. These, too, are exciting areas with much potential.
So, as a way to provide a touchstone for continued discussion on this matter, here is one working definition of the structured Web:
Some of the tentative topics that I plan to address in this series include discussion of what constitutes ‘structure’ in content, why structure is important, the various existing forms of structure, human v. machine bases for viewing and interpreting structure, the importance of finding “canonical” representation forms while also appreciating real-world diversity, the means to convert data forms and serializations, the means to extract structure from all types of content, transitioning to semantic understandings, and likely others.
Others may be added to this series over time and will be categorized under ‘Structured Web’ on the AI3 blog.
 News groups really did not have a good search engine until the launch of Deja News in 1995.
 Chris Sherman, "Happy Birthday, Lycos!," Search Engine Watch, August 14, 2002. See http://searchenginewatch.com/showPage.html?page=2160551.
 A fairly good summary of the History of the Web can be found on Wikipedia.
 Michael K. Bergman (Aug 2001). “The Deep Web: Surfacing Hidden Value”. The Journal of Electronic Publishing 7 (1). An earlier version of this paper was published by BrightPlanet Corp. in July 2000.
 While there are variations, Linux, Apache, MySQL and the scripting languages of either Python, PHP, or Perl are often referred to as ‘LAMP’, one central basis for much open source software and, more broadly, interoperable open-source application packages.
 This table builds on Pluempitiwiriyawej and Hammer's schema by adding the fourth major category of language. See Charnyote Pluempitiwiriyawej and Joachim Hammer, "A Classification Scheme for Semantic and Schematic Heterogeneities in XML Data Sources," Technical Report TR00-004, University of Florida, Gainesville, FL, 36 pp., September 2000. See ftp.dbcenter.cise.ufl.edu/Pub/publications/tr00-004.pdf.