<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>AI3:::Adaptive Information &#187; Deep Web</title>
	<atom:link href="http://www.mkbergman.com/category/deep-web/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.mkbergman.com</link>
	<description>Mike Bergman on the semantic Web and structured Web</description>
	<lastBuildDate>Tue, 24 Jan 2012 15:52:16 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Massive Muscle on the ABox at Google</title>
		<link>http://www.mkbergman.com/481/massive-muscle-on-the-abox-at-google/</link>
		<comments>http://www.mkbergman.com/481/massive-muscle-on-the-abox-at-google/#comments</comments>
		<pubDate>Fri, 27 Mar 2009 16:08:08 +0000</pubDate>
		<dc:creator>Mike Bergman</dc:creator>
				<category><![CDATA[Adaptive Information]]></category>
		<category><![CDATA[Deep Web]]></category>
		<category><![CDATA[Description Logics]]></category>
		<category><![CDATA[Searching]]></category>
		<category><![CDATA[Structured Web]]></category>
		<category><![CDATA[ABox]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[information extraction]]></category>

		<guid isPermaLink="false">http://www.mkbergman.com/?p=481</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Massive Muscle on the ABox at Google&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Adaptive Information&amp;rft.subject=Deep Web&amp;rft.subject=Description Logics&amp;rft.subject=Searching&amp;rft.subject=Structured Web&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2009-03-27&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/481/massive-muscle-on-the-abox-at-google/&amp;rft.language=English"></span>
The Recent &#8216;The Unreasonable Effectiveness of Data&#8217; Provides Important Hints To even the most casual Web searcher, it must now be evident that Google is constantly introducing new structure into its search results. This past week three world-class computer scientists, all now research directors or scientists at Google, Alon Halevy, Peter Norvig and Fernando Pereira, [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Massive Muscle on the ABox at Google&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Adaptive Information&amp;rft.subject=Deep Web&amp;rft.subject=Description Logics&amp;rft.subject=Searching&amp;rft.subject=Structured Web&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2009-03-27&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/481/massive-muscle-on-the-abox-at-google/&amp;rft.language=English"></span>
<p><img style="width: 200px; height: 70px; float: left;" title="Google logo" src="../wp-content/themes/ai3/images/2009Posts/090327_google_logo200.png" alt="Google logo" /></p>
<h2>The Recent &#8216;<span style="font-style: italic">The Unreasonable Effectiveness of Data</span>&#8217; Provides Important Hints</h2>
<p>To even the most casual Web searcher, it must now be evident that <a href="http://www.google.com/">Google</a> is constantly introducing new structure into its search results.  This past week three world-class computer scientists, all now research directors or scientists at Google, <a href="http://www.cs.washington.edu/homes/alon/">Alon Halevy</a>, <a href="http://norvig.com/">Peter Norvig</a> and <a href="http://www.cis.upenn.edu/~pereira/">Fernando Pereira</a>, published an opinion piece in the March/April 2009 issue of <a style="font-style: italic" href="http://www2.computer.org/portal/web/csdl/abs/mags/ex/2009/02/mex200902toc.htm">IEEE Intelligent Systems</a> titled, <a href="http://www.computer.org/portal/cms_docs_intelligent/intelligent/homepage/2009/x2exp.pdf">&#8216;The Unreasonable Effectiveness of Data.&#8217;</a> It provides important framing and hints for what next may emerge in semantics from the Google search engine.</p>
<p>I had earlier covered <a href="../?p=436">Halevy and Google&#8217;s work</a> on the <a href="http://en.wikipedia.org/wiki/Deep_web">deep Web</a>.  In this new piece, the authors describe the use of simple models working on very large amounts of data as a means to trump fancier and more complicated algorithms.</p>
<div>
<div class="boxGreenDotted" style="padding: 10px; width: 520px; text-align: center;"><big><span style="font-style: italic">&#8220;Unfortunately, the fact that the word &#8216;semantic&#8217; appears in both &#8216;Semantic Web&#8217; and &#8216;semantic interpretation&#8217; means that the two problems have often been conflated, causing needless and endless consternation and confusion. The &#8216;semantics&#8217; in Semantic Web services is embodied in the code that implements those services in accordance with the specifications expressed by the relevant ontologies and attached informal documentation.&#8221;<span class="double_u"> </span></span></big></div>
</div>
<p>Some of the research they cite is related to WebTables<a href="#goog1"> [1]</a> and similar efforts to extract structure from Web-scale data.  The authors describe the use of such systems to create &#8216;schemata&#8217; of attributes related to various types of instance records &#8212; in essence, figuring out the structure of ABoxes <a href="#goog2">[2]</a>, for leading instance types such as companies or automobiles <a href="#goog3">[3]</a>.</p>
<p>The authors call this task the <em>semantic interpretation problem</em>, contrast it with the <em>Semantic Web</em>, and argue that it is amenable to a kind of simple, brute-force, Web-scale analysis: &#8220;Relying on overt statistics of words and word co-occurrences has the further advantage that we can estimate models in an amount of time proportional to available data and can often parallelize them easily. So, learning from the Web becomes naturally scalable.&#8221;</p>
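<p>As a rough illustration of the kind of overt word co-occurrence statistics the authors have in mind, here is a minimal Python sketch. The toy documents, the function name <code>count_cooccurrences</code> and the window size are assumptions made purely for illustration; nothing here is taken from the paper itself. Each document is scanned once, so the work grows in proportion to the data, and counts from separate shards can simply be added together, which is what makes this style of analysis easy to parallelize.</p>

```python
from collections import Counter


def count_cooccurrences(documents, window=3):
    """Count unordered word pairs that occur within `window` tokens of each other."""
    pair_counts = Counter()
    for text in documents:
        tokens = text.lower().split()
        for i, left in enumerate(tokens):
            # Pair this token with the next (window - 1) tokens only,
            # so each pair of positions is counted exactly once.
            for right in tokens[i + 1:i + window]:
                if left != right:
                    pair_counts[tuple(sorted((left, right)))] += 1
    return pair_counts


# Hypothetical toy corpus; a real run would stream far more text.
documents = ["deep web search", "deep web data", "structured web data"]

# One pass over everything.
counts = count_cooccurrences(documents, window=2)

# The same result from two independent shards whose counts are merged,
# as separate workers might do.
merged = (count_cooccurrences(documents[:2], window=2)
          + count_cooccurrences(documents[2:], window=2))
```

<p>Because the merged shard counts equal the single-pass counts, the same function could be run over many slices of a large corpus and the results summed afterwards.</p>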
<p>Google had earlier posted their <a href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13">1 terabyte database of n-grams</a>, and I tend to agree that such large-scale incidence mining can lead to tremendous insights and advantages.  The authors also helpfully point out that certain scale thresholds occur for doing such analysis, such that researchers need not have access to indexes at the scale of Google&#8217;s to do meaningful work or to make meaningful advances.  (Good news for the rest of us!)</p>
<p>As the authors challenge:</p>
<ol>
<li>Choose a representation that can use unsupervised learning on unlabeled data</li>
<li>Represent the data with a non-parametric model, and</li>
<li>Trust that the important concepts will emerge from this analysis, because human language has already evolved words for them.</li>
</ol>
<p>My very strong suspicion is that we will see &#8212; and quickly &#8212; much more structured data for instance types (the &#8216;ABox&#8217;) emerge from Google in the coming weeks.  They have the insights and approaches down, and clearly they have the data to drive the analysis! I also suspect many of these structured additions will simply show up in the results listings with little fanfare.</p>
<p>The structured Web is growing all around us like stalagmites in a cave!</p>
<hr style="margin: 15px 0px" size="1" />
<div style="margin: 10px 0pt; font-size: 90%"><a title="goog1" name="goog1"></a>[1] Michael J. Cafarella, Alon Y. Halevy, Daisy Zhe Wang, Eugene Wu and Yang Zhang, 2008. &#8220;WebTables: Exploring the Power of Tables on the Web,&#8221; in the <span style="font-style: italic">34th International Conference on Very Large Databases (VLDB)</span>, Auckland, New Zealand, 2008.  See <a href="http://web.mit.edu/y_z/www/papers/webtables-vldb08.pdf">http://web.mit.edu/y_z/www/papers/webtables-vldb08.pdf</a>.</div>
<div style="margin: 10px 0pt; font-size: 90%"><a title="goog2" name="goog2"></a>[2] As per our standard use:
<div class="boxGrayDotted">&quot;<a href="http://en.wikipedia.org/wiki/Description_logics">Description logics</a> and their semantics traditionally split <span style="font-style: italic">concepts</span> and their relationships from the different treatment of <span style="font-style: italic">instances</span> and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for <em>terminological</em> knowledge, the basis for <span style="font-style: italic">T</span> in <span style="font-style: italic">TBox</span>) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for <span style="font-style: italic">assertions</span>, the basis for <span style="font-style: italic">A</span> in <span style="font-style: italic">ABox</span>) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.&quot;</div>
</div>
<div style="margin: 10px 0pt; font-size: 90%"><a title="goog3" name="goog3"></a>[3] I very much like the authors&#8217; use of &#8216;schemata&#8217; as the way to describe the attribute structure of various instance record types for the ABox, in contrast to &#8216;ontology,&#8217; which is more appropriately applied to the TBox.</div>
]]></content:encoded>
			<wfw:commentRss>http://www.mkbergman.com/481/massive-muscle-on-the-abox-at-google/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Multi-part Federated Search Interview</title>
		<link>http://www.mkbergman.com/467/multi-part-federated-search-interview/</link>
		<comments>http://www.mkbergman.com/467/multi-part-federated-search-interview/#comments</comments>
		<pubDate>Fri, 14 Nov 2008 22:32:37 +0000</pubDate>
		<dc:creator>Mike Bergman</dc:creator>
				<category><![CDATA[Adaptive Information]]></category>
		<category><![CDATA[Deep Web]]></category>
		<category><![CDATA[Linked Data]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Structured Web]]></category>
		<category><![CDATA[UMBEL]]></category>
		<category><![CDATA[BrightPlanet]]></category>
		<category><![CDATA[federated search]]></category>
		<category><![CDATA[rdf]]></category>
		<category><![CDATA[zitgist]]></category>

		<guid isPermaLink="false">http://www.mkbergman.com/?p=467</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Multi-part Federated Search Interview&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Adaptive Information&amp;rft.subject=Deep Web&amp;rft.subject=Linked Data&amp;rft.subject=Semantic Web&amp;rft.subject=Structured Web&amp;rft.subject=UMBEL&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2008-11-14&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/467/multi-part-federated-search-interview/&amp;rft.language=English"></span>
Topics Range from the Deep Web to Semantic Web in this Search Luminaries Series I&#8217;m pleased to wrap up a multi-part interview with the Federated Search Blog as part of their ongoing &#8216;Search Luminaries&#8217; series. Sol Lederman, editor of the blog, does a thorough and comprehensive job! Over the past month on every Friday, I [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Multi-part Federated Search Interview&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Adaptive Information&amp;rft.subject=Deep Web&amp;rft.subject=Linked Data&amp;rft.subject=Semantic Web&amp;rft.subject=Structured Web&amp;rft.subject=UMBEL&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2008-11-14&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/467/multi-part-federated-search-interview/&amp;rft.language=English"></span>
<h2>Topics Range from the Deep Web to Semantic Web in this <span style="font-style: italic">Search Luminaries</span> Series</h2>
<p>I&#8217;m pleased to wrap up a multi-part interview with the <a href="http://federatedsearchblog.com/" style="font-style: italic">Federated Search Blog</a> as part of their ongoing &#8216;Search Luminaries&#8217; series. <a href="http://federatedsearchblog.com/about/">Sol Lederman</a>, editor of the blog, does a thorough and comprehensive job!  Every Friday over the past month, I have answered some 25 or so of his detailed questions.</p>
<p>Federated Search Blog was particularly interested in the <a href="http://en.wikipedia.org/wiki/Deep_web">deep Web</a>, its discovery and size.  Many of the early questions deal with those themes.  However, by <a href="http://federatedsearchblog.com/2008/11/14/michael-bergman-federated-search-luminary-part-iv/">Part 4</a> things get a bit more current, with the topics shifting to the <a href="http://en.wikipedia.org/wiki/Semantic_web">semantic Web</a>, <a href="http://en.wikipedia.org/wiki/Linked_Data">linked data</a> and <a href="http://www.zitgist.com/">Zitgist</a>.</p>
<p>Here are the links to the series:</p>
<ul>
<li><a href="http://federatedsearchblog.com/2008/10/17/luminary-interview-with-michael-bergman-a-preview/">Preview</a> (Oct. 17, 2008)</li>
<li><a href="http://federatedsearchblog.com/2008/10/24/michael-bergman-federated-search-luminary-part-i/">Part 1</a> (Oct. 24)</li>
<li><a href="http://federatedsearchblog.com/2008/10/31/michael-bergman-federated-search-luminary-part-ii/">Part 2</a> (Oct. 31)</li>
<li><a href="http://federatedsearchblog.com/2008/11/07/michael-bergman-federated-search-luminary-part-iii/">Part 3</a> (Nov. 7)</li>
<li><a href="http://federatedsearchblog.com/2008/11/14/michael-bergman-federated-search-luminary-part-iv/">Part 4</a> (Nov. 14)</li>
</ul>
<p>To give you a flavor of the interview, here is an example of one of the questions (and probably my favorite):</p>
<p style="font-style: italic"><span style="font-weight: bold">20.</span> Tim Berners-Lee, credited with inventing the World Wide Web, has been talking about the importance and value of the Semantic Web for years yet common folks don&apos;t see much evidence of the Semantic Web gaining traction. Is there substance to the Semantic Web? What&apos;s happening with it now and what does its future look like?</p>
<div class="boxGrayDotted"><p>Wow, in 10,000 words or less?</p>
<p>No, actually, this is a very good question. As things go, I am a relative newbie to the semantic Web, only having studied and followed it closely since about 2005. I&apos;m sure my perspective in coming later to the party may not be shared by those at the beginning, which dates to the mid-1990s as Berners-Lee&apos;s vision naturally progressed from a Web of documents, as most of us currently know the Web, to a Web of data.</p>
<p>I think there is indeed incredibly important substance to the semantic Web. But, as I have written elsewhere, the semantic Web is more of a vision than a discernable point in time or a milestone.</p>
<p>The basic idea of the semantic Web is to shift the focus from documents to data. Give data a unique Web address. Characterize that data with rich metadata. Describe how things are related to one another so that relationships and connections can be traced. Provide defined structures for what these things and relationships &quot;mean&quot;; this is what provides the semantics, with the structures and their defined vocabularies known as &quot;ontologies&quot; (which in one analog can be seen as akin to a relational database schema).</p>
<p>As these structures and definitions get put in place, the Web itself then becomes the infrastructure for relating information from everywhere and anywhere on any given topic or subject. While this vision may sound grandiose, just think back to what the Web itself has done for us and documents over the past decade or so. This same architecture and infrastructure can and should be extended to the actual information in those documents, the data. And, oh, by the way, conventional databases can now join this party as well. The vision is very powerful and very cool.</p>
<p>Progress has indeed been slow. Many advocates fairly point to how long it takes to get standards in place and for a while people spoke of the &quot;chicken-and-egg&quot; problem of getting over the threshold of having enough structured data to consume to make it worthwhile to create the tools and applications and showcases that consume that data.</p>
<p>From my perspective, the early visions of the semantic Web were too abstract, a bit off perhaps. First, there was the whole idea of artificial intelligence and machines using the data as opposed to better ways for humans to draw use from the data at hand. The fundamental and exciting engine underneath the semantic Web &#8212; the RDF (Resource Description Framework) data model &#8212; was not initially treated on its own. It got mixed up with XML, which made understanding difficult and distinctions vague. There is and remains too much academia and not enough pragmatics driving the bus.</p>
<p>But that is changing and fast.</p>
<p>There is now an immediate and practical &quot;flavor&quot; of the semantic Web called linked data. It has three simple bases:</p>
<p>(1) RDF as the simple but adaptable data model that can represent any information &#8212; structured or unstructured &#8212; as the basic &quot;triple&quot; statement of subject-predicate-object. That sounds fancy, but just substitute verb for predicate and noun for subject and object. In other words: Dick sees Jane; or the ball is round. It sounds like a kindergarten reader, but that is how data can be easily represented and built up into more complex structures and stories;</p>
<p>(2) Give all objects a unique Web identifier. Unique identifiers are common to any database; in linked data, we just make sure those identifiers conform to the same URIs we see constantly in the address bar of our Web browsers; and</p>
<p>(3) Post and expose this stuff as accessible on the Web (namely, HTTP).</p>
<p>My company adds some essential &quot;spice&quot; to these flavors with respect to reference structures and concepts to give the information context, but these simple bases remain the foundation.</p>
<p>These are really not complex steps. They are really no different than the early phases of posting documents on the Web. Only now, we are exposing data.</p>
<p>More importantly, we can forget the chicken-and-egg problem. Each new data link we make brings value, in much the same way that adding a node to a network brings value according to Metcalfe&apos;s Law. Only with linked data, we already have the nodes &#8212; the data &#8212; we are just establishing the link connections (the verbs, predicates or relations) to flesh out the network graph. Same principle, only our focus is now to connect what is there rather than to add more nodes. (Of course, adding more linked nodes helps as well!)</p>
<p>The absolutely amazing thing about our current circumstance as Web users is that we truly now have simple and readily deployable mechanisms available to finally overcome the decades of enterprise stovepipes. The whole answer is so simple it can be mistaken as snake oil when first presented and not inspected a bit.</p>
<p>Since ours is an industry accustomed to hype and cynical about so much of this, I only ask that your readers check out these assertions for themselves and suspend their normal and expected disbelief. For me, in a career of more than 30 years focusing on information and access, I feel like we finally now have the tools, data model and architecture at hand to actually achieve data interoperability.</p></div>
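<p>The first two bases in the answer above, triples and Web identifiers, can be sketched in a few lines of Python. This is only a minimal illustration using plain tuples rather than an RDF library; the <code>example.org</code> URIs, the predicate names and the helper <code>objects_of</code> are all hypothetical, and the third base (publishing the data over HTTP) is not shown.</p>

```python
from collections import namedtuple

# A triple is one subject-predicate-object statement.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hypothetical namespace; any HTTP URI base would serve as the
# unique Web identifier called for in base (2).
EX = "http://example.org/"

triples = [
    Triple(EX + "Dick", EX + "sees", EX + "Jane"),   # Dick sees Jane
    Triple(EX + "Jane", EX + "sees", EX + "ball"),   # Jane sees the ball
    Triple(EX + "ball", EX + "hasShape", "round"),   # the ball is round
]


def objects_of(graph, subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return [t.object for t in graph
            if t.subject == subject and t.predicate == predicate]


# Following links from one statement to the next builds up a larger story.
seen_by_dick = objects_of(triples, EX + "Dick", EX + "sees")
seen_by_jane = objects_of(triples, seen_by_dick[0], EX + "sees")
shape = objects_of(triples, seen_by_jane[0], EX + "hasShape")
```

<p>Because the object of one statement is the subject of another, each added triple extends the graph without any change to the statements already there.</p>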
<p>Thanks again to Sol and <span style="font-style: italic">Federated Search Blog</span> for this opportunity.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mkbergman.com/467/multi-part-federated-search-interview/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>New Currents in the &#8216;Deep Web&#8217;</title>
		<link>http://www.mkbergman.com/458/new-currents-in-the-deep-web/</link>
		<comments>http://www.mkbergman.com/458/new-currents-in-the-deep-web/#comments</comments>
		<pubDate>Fri, 10 Oct 2008 00:47:38 +0000</pubDate>
		<dc:creator>Mike Bergman</dc:creator>
				<category><![CDATA[Deep Web]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[rdf]]></category>
		<category><![CDATA[Structured Web]]></category>

		<guid isPermaLink="false">http://www.mkbergman.com/?p=458</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=New Currents in the &#8216;Deep Web&#8217;&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Deep Web&amp;rft.subject=Semantic Web&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2008-10-09&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/458/new-currents-in-the-deep-web/&amp;rft.language=English"></span>
Timelines, Semantics and Ontologies are Coming to the Fore The past two weeks have seen an interesting emergence of new perspectives on the &#8216;deep Web&#8217;. The deep Web, a term Thane Paulsen and I coined for my oft-quoted study from 2000, The Deep Web: Surfacing Hidden Value [1], is the phenomenon of database-backed content served [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=New Currents in the &#8216;Deep Web&#8217;&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Deep Web&amp;rft.subject=Semantic Web&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2008-10-09&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/458/new-currents-in-the-deep-web/&amp;rft.language=English"></span>
<p><a href="../wp-content/themes/ai3/images/2008Posts/081009_DeepWebNew.jpg"><img src="../wp-content/themes/ai3/images/2008Posts/081009_DeepWebNew.jpg" style="border: 0px solid ; margin-right: 10px; width: 280px; height: 220px; float: left" alt="Trawling the Deep Web" vspace="5" hspace="5" /></a></p>
<h2>Timelines, Semantics and Ontologies are Coming to the Fore</h2>
<p>The past two weeks have seen an interesting emergence of new perspectives on the &#8216;<a href="http://en.wikipedia.org/wiki/Deep_web">deep Web</a>&#8217;. The deep Web, a term <a href="http://www.paulsenagribranding.com/main.cfm">Thane Paulsen</a> and I coined for my oft-quoted study from 2000,  <a href="http://quod.lib.umich.edu/cgi/t/text/text-idx?c=jep;view=text;rgn=main;idno=3336451.0007.104" style="font-style: italic">The Deep Web:  Surfacing Hidden Value</a> <a href="#deep1">[1]</a>, is the phenomenon of database-backed content served from interactive Web search forms.</p>
<p>Because deep Web content is dynamic and produced only on request, it has been difficult for traditional search engines to index. It is also huge and of high quality (though likely not the 100 to 500 times larger than the standard &#8216;surface&#8217; Web that I estimated in that first study).</p>
<h3>Deep Web Timeline</h3>
<p>The most recent of the three notable events of the past two weeks came out on Tuesday. Maureen Flynn-Burhoe of the <a href="http://papergirls.wordpress.com/">oceanflynn @ Digg</a> blog has produced a very informative and comprehensive <a href="http://papergirls.wordpress.com/2008/10/07/timeline-deep-web/">timeline of deep Web</a> and related developments from 1980 to the present (database-backed content and early Web precursors, of course, precede the Web itself and the term &#8216;deep Web&#8217;).</p>
<p>I have been directly involved in this field since 1994 and have not yet seen such a comprehensive treatment. She  cites studies noting &#8220;hundreds of thousands&#8221; of deep Web sites and the faster growth of dynamic (database-served)  as opposed to static (&#8216;surface&#8217;) content on the Web.</p>
<p>As someone directly involved in estimating the size of the deep Web, I appreciate the analytic difficulties and  take all of the estimates (my <em>own</em> older ones included!) with a grain of salt. Nonetheless, the deep Web is important,  its content is huge, often of unique and high quality, and it deserves serious attention by Web scientists.</p>
<p>Great job, Maureen! I always appreciate thorough researchers. (BTW, I suspect you might also like the <a href="../?page_id=327">Timeline of Information History</a>.)</p>
<h3>Trends and Role in the Semantic Web</h3>
<p><a href="http://mags.acm.org/communications/200810/?CFID=5461527&amp;CFTOKEN=11076271"><img src="../wp-content/themes/ai3/images/2008Posts/081009_CommunicationsACM.jpg" style="border: 0px solid ; margin-left: 10px; float: right" alt="Communications of the ACM" vspace="5" width="218" align="right" border="0" height="284" hspace="5" /></a>The next notable event was the publishing of <em><a href="http://mags.acm.org/communications/200810/?CFID=5461527&amp;CFTOKEN=11076271">Searching the Deep Web</a></em> by <a href="http://www.alexwright.org/">Alex Wright</a> in the <a href="http://www.acm.org/">Communications  of the ACM</a> (October 2008) <a href="#deep2">[2]</a>. Alex had first written about the deep Web for <a href="http://archive.salon.com/tech/feature/2004/03/09/deep_web/index_np.html">Salon magazine in 2004</a> and had given  nice attention to my company at that time, <a href="http://www.brightplanet.com/">BrightPlanet</a> <a href="#deep3">[3]</a>.</p>
<p>In this current update, Alex does an excellent job of characterizing the current status of, and research into, search techniques for the deep Web. I also liked that he used our fishing analogy of trawling for standard search crawlers versus direct angling in the deep Web (see our earlier figure at upper left).</p>
<p>As some may recall, Google has stepped up its activities in this area, an event I <a href="../?p=436">reported on</a> a few months back. Those perspectives, and others from some other  notable figures, are included in Alex&#8217;s piece as well.</p>
<p>My own contribution to the piece was to suggest that RDF and semantic Web approaches offered the next evolutionary stage in deep Web searching. Alex was able to take that theme and get some great perspectives on it. I also appreciate the accuracy of my quotes, which gives me confidence in the quality of the rest of the story.</p>
<p>Without a doubt there is high quality in the deep Web and bringing structure and semantic characterization to it  through metadata is a task of some consequence.</p>
<p>For myself, I chose to move beyond the deep Web when its focus seemed stuck on a document-level perspective on retrieval and analysis. However, there is much to be learned from the techniques used to select and access deep Web content, which could be readily transferable to linked data.</p>
<p>Thanks, Alex, for making these prospects clearer! Maybe it is time to dust off some of my old stuff!</p>
<h3>Getting Deeper into the Semantics</h3>
<p><a href="http://www.computer.org/portal/cms_docs_computer/computer/homepage/Sep08/r9itsys.pdf"><img src="../wp-content/themes/ai3/images/2008Posts/081009_Computer.jpg" style="border: 0px solid ; margin-right: 10px; float: left" alt="Computer magazine" vspace="5" width="218" align="left" border="0" height="282" hspace="5" /></a></p>
<p>This emerging joining of deep Web and semantics is actually taking place through the efforts of a number of academic researchers. Recent and prominent among them are James Geller of the New Jersey Institute of Technology and his colleagues Soon Ae Chun and Yoo Jung An <a href="#deep4">[4]</a>. Their recently published paper, <a href="http://www.computer.org/portal/cms_docs_computer/computer/homepage/Sep08/r9itsys.pdf" style="font-style: italic">Toward the Semantic Deep Web</a>, shows how ontologies and semantic Web constructs can be combined to more effectively extract information from the deep Web. They call this combination the &#8216;semantic deep Web.&#8217;</p>
<p>The authors posit that the structured roots of deep Web content lend themselves to better ontology learning  from the Web. They also point to the usefulness of deep Web structure to annotations.</p>
<p>That such confluences are occurring between the semantic and deep &#8220;Webs&#8221; is a function of focused academic attention and the growing maturity of both perspectives. This year, for example, saw the first Workshop on <a href="http://bis.kie.ae.poznan.pl/11th_bis/wscfp.php?ws=adw2008">Advances in Accessing Deep Web</a> (ADW 2008). As part of the International Conference on Business Information Systems (<a href="http://bis.kie.ae.poznan.pl/11th_bis/">BIS 2008</a>), this meeting saw a lot of elbow rubbing with semantic Web and enterprise topics.</p>
<p>It might seem strange (indeed, sometimes it does to me <img src='http://www.mkbergman.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' />  ) to envision structured database content being served  through a Web form and then converted via ontologies and other means to semantic Web formats. After all, why not go  direct to the data?</p>
<p>And, of course, direct conversion is less lossy and more efficient.</p>
<p>But, one interesting point is that semantic Web techniques are increasingly working as a structure-extraction  layer wrapping the standard Web. In that regard, starting with inherently structured source data &#8212; that is, the  deep Web &#8212; can lead to higher quality inputs across the distributed, heterogeneous content of the Web.</p>
<p>Given the impossibility of everyone starting with the same premises and speaking the same languages and  concepts, semantic Web mediation methods offer a way to overcome the Tower of Babel. And, when the starting content  itself is inherently structured and (generally) of higher quality &#8212; that is, the deep Web &#8212; the logic of the  combination becomes more obvious.</p>
<h3>For More Information</h3>
<p>Interested in learning more about the deep Web? I first recommend the resources posted at the bottom of Flynn-Burhoe&#8217;s <a href="http://papergirls.wordpress.com/2008/10/07/timeline-deep-web/">timeline</a>. And, for a very thorough treatment, I also recommend Denis Shestakov&#8217;s Ph.D. thesis from earlier this year <a href="#deep5">[5]</a>. It has a bibliography of some 115 references.</p>
<hr style="margin: 15px 0px" size="1" width="33%" align="left" />
<div style="margin: 10px 0pt; font-size: 90%">  <a title="deep1" name="deep1" id="deep1"></a> [1] Michael K. Bergman, 2001. <a href="http://quod.lib.umich.edu/cgi/t/text/text-idx?c=jep;view=text;rgn=main;idno=3336451.0007.104">The Deep Web:  Surfacing Hidden Value</a>, <em>Journal of Electronic Publishing</em>. 7:1. Note, this publication was an update of an internal BrightPlanet study first published on July 26, 2000.</div>
<div style="margin: 10px 0pt; font-size: 90%"> <a title="deep2" name="deep2" id="deep2"></a> [2] Alex Wright, 2008. &#8220;Searching the Deep Web,&#8221; in <span style="font-style: italic">Communications of the ACM</span>, pp. 14-15, October 2008. See <a href="http://mags.acm.org/communications/200810/?CFID=5461527&amp;CFTOKEN=11076271">http://mags.acm.org/communications/200810/?CFID=5461527&amp;CFTOKEN=11076271</a>.</div>
<div style="margin: 10px 0pt; font-size: 90%"><a title="deep3" name="deep3" id="deep3"></a> [3] Alex is also the author of <a href="http://www.alexwright.org/book.html" style="font-style: italic">GLUT: Mastering Information Through the Ages</a> (<span class="date">Joseph  Henry Press, 296 pp., July 2007;</span> <span class="date">ISBN 0309102383</span><span class="date">). BTW, I had <a href="../?p=408">earlier reviewed</a> this book with some criticisms, which should go a long way to prove Alex&#8217;s fairness and chops as a journalist.</span></div>
<div style="margin: 10px 0pt; font-size: 90%"><a title="deep4" name="deep4" id="deep4"></a> [4] James Geller, Soon Ae Chun and Yoo Jung An, 2008. <a href="http://www2.computer.org/portal/web/csdl/doi/10.1109/MC.2008.402">&#8220;Toward the Semantic Deep Web,&#8221;</a> in <span style="font-style: italic">Computer</span>, vol. 41, no. 9, pp. 95-97, Sept. 2008.</div>
<div style="margin: 10px 0pt; font-size: 90%"><a title="deep5" name="deep5" id="deep5"></a> [5] Denis Shestakov, 2008. <a href="https://oa.doria.fi/bitstream/handle/10024/38506/diss2008shestakov.pdf?sequence=3" style="font-style: italic" target="_blank">Search Interfaces on the Web: Querying and Characterizing</a>, Ph.D. dissertation from the University of Turku Centre for Computer Science, Finland, 153 pp., May 2008.</div>
]]></content:encoded>
			<wfw:commentRss>http://www.mkbergman.com/458/new-currents-in-the-deep-web/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Another Deep Web Barrier Falls</title>
		<link>http://www.mkbergman.com/436/another-deep-web-barrier-falls/</link>
		<comments>http://www.mkbergman.com/436/another-deep-web-barrier-falls/#comments</comments>
		<pubDate>Sat, 12 Apr 2008 04:51:52 +0000</pubDate>
		<dc:creator>Mike Bergman</dc:creator>
				<category><![CDATA[Deep Web]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[search engines]]></category>

		<guid isPermaLink="false">http://www.mkbergman.com/?p=436</guid>
		<description><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Another Deep Web Barrier Falls&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Deep Web&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2008-04-11&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/436/another-deep-web-barrier-falls/&amp;rft.language=English"></span>
As late as 2002, no single search engine indexed the entire surface Web. Much has been written about that time, but the emergence of Google (and others; it was a key battle at the time) extended full search coverage to the Web, ending the need for so-called desktop metasearchers, then the [...]]]></description>
			<content:encoded><![CDATA[	
	<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&amp;rfr_id=info%3Asid%2Focoins.info%3Agenerator&amp;rft.title=Another Deep Web Barrier Falls&amp;rft.aulast=Bergman&amp;rft.aufirst=Mike&amp;rft.subject=Deep Web&amp;rft.source=AI3:::Adaptive Information&amp;rft.date=2008-04-11&amp;rft.type=blogPost&amp;rft.format=text&amp;rft.identifier=http://www.mkbergman.com/436/another-deep-web-barrier-falls/&amp;rft.language=English"></span>
<p>As late as 2002, no single search engine indexed the entire <a href="http://en.wikipedia.org/wiki/Surface_web">surface Web</a>.  Much has been written about that time, but the emergence of Google (and others; it was a key battle at the time) extended full search coverage to the Web, ending the need for so-called desktop <a href="http://en.wikipedia.org/wiki/Metasearch">metasearchers</a>, then the only option for getting full Web search coverage.</p>
<p>Yet even as full coverage of document indexing was being conquered for the Web, dynamic Web sites and database-backed sites fronted by search forms were also emerging.  Estimates as of about 2001, made by <a href="http://www.press.umich.edu/jep/07-01/bergman.html">myself</a> and others, suggested such &#8216;<a href="http://en.wikipedia.org/wiki/Deep_web">deep Web</a>&#8217; content was many, many times larger than the indexable document Web and was found in literally hundreds of thousands of sites.</p>
<p>Standard Web crawling is a different technique and technology from &#8220;probing&#8221; the contents of searchable databases, which requires a query to be issued to a site&#8217;s search form.  A company I founded, <a href="http://www.brightplanet.com/">BrightPlanet</a>, and many others such as <a href="http://www.copernic.com/">Copernic</a> and <a href="http://web.archive.org/web/20060105033921/http://www.intelliseek.com/">Intelliseek</a>, many of which no longer exist, were formed with the specific aim of probing these thousands of valuable content sites.</p>
<p>From those companies&#8217; standpoint, mine at that time as well, there was always the threat that the major search engines would draw a bead on deep Web content and use their resources and clout to appropriate this market.  Yahoo, for example, struck arrangements with some publishers of deep content to index their content directly, but that still fell short of the different technology that deep Web retrieval requires.</p>
<p>It was always a bit surprising that this rich storehouse of deep Web content was being neglected.  In retrospect, perhaps it was understandable:  there was still the standard Web document content to index and conquer.</p>
<p>Today, however, Google announced its new deep Web search in a post on one of its developer blog sites, <a href="http://googlewebmastercentral.blogspot.com/2008/04/crawling-through-html-forms.html">Crawling through HTML forms</a>, written by Jayant Madhavan and Alon Halevy, a noted search and semantic Web researcher:</p>
<p class="boxGreenDotted">In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn&#8217;t find and index for users who search on Google. Specifically, when we encounter a &lt;FORM&gt; element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.</p>
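<p>As a rough illustration only, the procedure in that passage can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Google&#8217;s crawler: the function name, its parameters, and the example.com form are all hypothetical, and it assumes a form submitted by GET. It takes candidate words for text boxes and the listed values for menus and buttons, then builds a capped number of query URLs.</p>

```python
from itertools import islice, product
from urllib.parse import urlencode


def generate_probe_urls(action_url, text_fields, choice_fields, site_words, max_queries=5):
    """Build a small number of GET URLs that fill in a search form.

    Every name here is hypothetical; this only illustrates the idea in
    the quoted passage and is not Google's implementation.
    """
    # Text boxes: candidate values are words taken from the site itself.
    candidates = {name: list(site_words) for name in text_fields}
    # Select menus, check boxes, radio buttons: values listed in the HTML.
    candidates.update({name: list(values) for name, values in choice_fields.items()})

    names = sorted(candidates)
    combos = product(*(candidates[name] for name in names))

    # Cap the combinations tried, mirroring "a small number of queries".
    urls = []
    for combo in islice(combos, max_queries):
        query = urlencode(list(zip(names, combo)))
        urls.append(action_url + "?" + query)
    return urls


# Illustrative form: one text box "q" and one select menu "format".
urls = generate_probe_urls(
    "http://example.com/search",
    text_fields=["q"],
    choice_fields={"format": ["book", "map"]},
    site_words=["atlas", "census"],
    max_queries=3,
)
for url in urls:
    print(url)
```

<p>With these inputs the sketch produces three URLs, one per combination of field values, up to the cap. The later steps in the quoted passage, deciding whether a result page is valid, interesting, and new to the index, are not modeled here.</p>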
<p>To be sure, there are differences and nuances to retrieval from the deep Web.  What is described here is neither truly directed nor comprehensive.  But the barrier has fallen.  With time, and enough servers, the more inaccessible aspects of the deep Web will fall to the services of major engines such as Google.</p>
<p>And, this is a good thing for all consumers desiring full access to the Web of documents.</p>
<p>So, an era is coming to a close.  And this, too, is appropriate.  For we are also now transitioning into the complementary era of the Web of data.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.mkbergman.com/436/another-deep-web-barrier-falls/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

