<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: The Murky Depths of the &#8216;Deep Web&#8217;</title>
	<atom:link href="http://www.mkbergman.com/343/the-murky-depths-of-the-deep-web/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.mkbergman.com/343/the-murky-depths-of-the-deep-web/</link>
	<description>Mike Bergman on the semantic Web and structured Web</description>
	<lastBuildDate>Tue, 02 Mar 2010 05:09:22 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Bill Breitmayer</title>
		<link>http://www.mkbergman.com/343/the-murky-depths-of-the-deep-web/comment-page-1/#comment-44983</link>
		<dc:creator>Bill Breitmayer</dc:creator>
		<pubDate>Wed, 09 May 2007 18:52:23 +0000</pubDate>
		<guid isPermaLink="false">http://www.mkbergman.com/?p=343#comment-44983</guid>
		<description>A heroic effort to quantify a very complex phenomenon.

A separate but related issue is the rankings: how a reference to content on a particular site gets to the top of the list.  Even if something interesting appears toward the top, how likely is it that someone has the patience to get through the 900 or so references listed?

Which raises the issue of Google faking us out about what we are seeing in the results ... I&#039;ve never been able to get past reference 900; nothing seems to exist beyond the first 900 references despite the 9,373,622 hits claimed by the engine.

For example, using the very general search term &quot;financial investments&quot; the Google engine reported 59,600,000 results.  In this case, dividing the 900 visible results by the 59,600,000 total results, what I&#039;m seeing on the Google results page is only about .0015% of everything purported to be in the index.

One should acknowledge the implicit assumption that each result is unique, rarely true of course.  If .15% of the unique results are displayed by Google, what one is actually seeing is still only about .0015% of everything out there.  By whatever method one computes it, the real ratio of references in the surface web to the deep web is something like one in many thousands ...

So, search engines have become a serious &quot;knowledge bottleneck&quot;, to use an old-fashioned AI term.  In fact, Google seems to be very aware of it and is eager to be part of whatever solution emerges. One presumes this would be a large-scale collaborative effort of some sort, akin to a Wiki.

Actually, your list of &quot;500 Semantic Web Tools&quot; is probably as close as anything I could name to what the Semantic Web would do to break up the search engine logjam. In one version of a solution, many individual lists like your &quot;500 SemWeb Tools&quot; would be linkable by pre-defined semantic classifiers and tags. The function of the search engine would be to integrate across the many more or less static references compiled by many people on a given subject.  Something like FOAF sharing of resources.  Maybe ...  

Thanks for another interesting article.

- Bill</description>
		<content:encoded><![CDATA[<p>A heroic effort to quantify a very complex phenomenon.</p>
<p>A separate but related issue is the rankings: how a reference to content on a particular site gets to the top of the list.  Even if something interesting appears toward the top, how likely is it that someone has the patience to get through the 900 or so references listed?</p>
<p>Which raises the issue of Google faking us out about what we are seeing in the results &#8230; I&#8217;ve never been able to get past reference 900; nothing seems to exist beyond the first 900 references despite the 9,373,622 hits claimed by the engine.</p>
<p>For example, using the very general search term &#8220;financial investments&#8221; the Google engine reported 59,600,000 results.  In this case, dividing the 900 visible results by the 59,600,000 total results, what I&#8217;m seeing on the Google results page is only about .0015% of everything purported to be in the index.</p>
<p>One should acknowledge the implicit assumption that each result is unique, rarely true of course.  If .15% of the unique results are displayed by Google, what one is actually seeing is still only about .0015% of everything out there.  By whatever method one computes it, the real ratio of references in the surface web to the deep web is something like one in many thousands &#8230;</p>
<p>So, search engines have become a serious &#8220;knowledge bottleneck&#8221;, to use an old-fashioned AI term.  In fact, Google seems to be very aware of it and is eager to be part of whatever solution emerges. One presumes this would be a large-scale collaborative effort of some sort, akin to a Wiki.</p>
<p>Actually, your list of &#8220;500 Semantic Web Tools&#8221; is probably as close as anything I could name to what the Semantic Web would do to break up the search engine logjam. In one version of a solution, many individual lists like your &#8220;500 SemWeb Tools&#8221; would be linkable by pre-defined semantic classifiers and tags. The function of the search engine would be to integrate across the many more or less static references compiled by many people on a given subject.  Something like FOAF sharing of resources.  Maybe &#8230;  </p>
<p>Thanks for another interesting article.</p>
<p>- Bill</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: IB Weblog &#187; Blog Archive &#187; Replik auf Library Hi Tech paper</title>
		<link>http://www.mkbergman.com/343/the-murky-depths-of-the-deep-web/comment-page-1/#comment-40135</link>
		<dc:creator>IB Weblog &#187; Blog Archive &#187; Replik auf Library Hi Tech paper</dc:creator>
		<pubDate>Mon, 26 Feb 2007 13:12:18 +0000</pubDate>
		<guid isPermaLink="false">http://www.mkbergman.com/?p=343#comment-40135</guid>
		<description>[...] Michael K. Bergman&#8217;s reply &#8220;The murky depths of the &#8216;deep web&#8217;&#8221; to our Invisible Web paper. As noted, I generally agree with these criticisms. For example, since the time of original publication, we have seen the power distribution nature of most things on the Internet, including popularity and traffic. Exponential distributions will always result in overestimates using calculations based on means rather than medians. I also think that meaningful content types were both overused (more database-like records) and underused (PDF content that is now routinely indexed) in my original analysis. [...]</description>
		<content:encoded><![CDATA[<p>[...] Michael K. Bergman&#8217;s reply &#8220;The murky depths of the &#8216;deep web&#8217;&#8221; to our Invisible Web paper. As noted, I generally agree with these criticisms. For example, since the time of original publication, we have seen the power distribution nature of most things on the Internet, including popularity and traffic. Exponential distributions will always result in overestimates using calculations based on means rather than medians. I also think that meaningful content types were both overused (more database-like records) and underused (PDF content that is now routinely indexed) in my original analysis. [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mike</title>
		<link>http://www.mkbergman.com/343/the-murky-depths-of-the-deep-web/comment-page-1/#comment-39156</link>
		<dc:creator>Mike</dc:creator>
		<pubDate>Thu, 22 Feb 2007 15:54:56 +0000</pubDate>
		<guid isPermaLink="false">http://www.mkbergman.com/?p=343#comment-39156</guid>
		<description>Hi,

The Gale Directory of Databases provides detailed information on publicly available databases and database products accessible either online or through various media such as DVD, CD-ROM, magnetic tape, etc.  Access to the directory requires a subscription; many universities and corporations have them, but to my knowledge there is no free, online access for general users or the non-subscribing public.

Thus, if you&#039;re an academic, you can do the sorts of analysis that Lewandowski and Mayr did, but individuals without subscription access cannot do so independently.

Hope this is helpful!  If you&#039;re really interested, see if a large local library can give you access.

Mike</description>
		<content:encoded><![CDATA[<p>Hi,</p>
<p>The Gale Directory of Databases provides detailed information on publicly available databases and database products accessible either online or through various media such as DVD, CD-ROM, magnetic tape, etc.  Access to the directory requires a subscription; many universities and corporations have them, but to my knowledge there is no free, online access for general users or the non-subscribing public.</p>
<p>Thus, if you&#8217;re an academic, you can do the sorts of analysis that Lewandowski and Mayr did, but individuals without subscription access cannot do so independently.</p>
<p>Hope this is helpful!  If you&#8217;re really interested, see if a large local library can give you access.</p>
<p>Mike</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Koudesnik</title>
		<link>http://www.mkbergman.com/343/the-murky-depths-of-the-deep-web/comment-page-1/#comment-39136</link>
		<dc:creator>Koudesnik</dc:creator>
		<pubDate>Thu, 22 Feb 2007 12:45:46 +0000</pubDate>
		<guid isPermaLink="false">http://www.mkbergman.com/?p=343#comment-39136</guid>
		<description>Nice post!

You mentioned that Dirk Lewandowski and Phillip Mayr in their work use the &quot;Gale Directory of Databases&quot;. I had a look at the description of the directory (at http://library.dialog.com/bluesheets/html/bl0230.html) and, shame on me, was not able to work out what it is or how one can access it. Can you search somehow for something in this &quot;Gale Directory of Databases&quot;? Or perhaps there is a DVD containing the &quot;Gale Directory of Databases&quot;? I mean, according to the description it looks like they (Thompson) put all the directory&#039;s content on paper - so, in fact, it is just two paper books (two volumes), but in that case it is kind of useless ...</description>
		<content:encoded><![CDATA[<p>Nice post!</p>
<p>You mentioned that Dirk Lewandowski and Phillip Mayr in their work use the &#8220;Gale Directory of Databases&#8221;. I had a look at the description of the directory (at <a href="http://library.dialog.com/bluesheets/html/bl0230.html" rel="nofollow">http://library.dialog.com/bluesheets/html/bl0230.html</a>) and, shame on me, was not able to work out what it is or how one can access it. Can you search somehow for something in this &#8220;Gale Directory of Databases&#8221;? Or perhaps there is a DVD containing the &#8220;Gale Directory of Databases&#8221;? I mean, according to the description it looks like they (Thompson) put all the directory&#8217;s content on paper &#8211; so, in fact, it is just two paper books (two volumes), but in that case it is kind of useless &#8230;</p>
]]></content:encoded>
	</item>
</channel>
</rss>
