Posted: February 21, 2007

It’s Taken Too Many Years to Re-visit the ‘Deep Web’ Analysis

It’s been seven years since Thane Paulsen and I first coined the term ‘deep Web‘, perhaps representing a couple of full generational cycles for the Internet. What we knew then and what “Web surfers” did then has changed markedly. And, of course, our coining of the term and BrightPlanet’s publishing of the first quantitative study on the deep Web did nothing to create the phenomenon of dynamic content itself — we merely gave it a name and helped promote a bit of understanding within the general public of some powerful subterranean forces driving the nature and tectonics of the emerging Web.

The first public release of The Deep Web: Surfacing Hidden Value (courtesy of the Internet Archive’s Wayback Machine), in July 2000, opened with a bold claim:

BrightPlanet has uncovered the “deep” Web — a vast reservoir of Internet content that is 500 times larger than the known “surface” World Wide Web. What makes the discovery of the deep Web so significant is the quality of content found within. There are literally hundreds of billions of highly valuable documents hidden in searchable databases that cannot be retrieved by conventional search engines.

The day the study was released we had to increase our server capacity nine-fold to meet demand after CNN, and eventually some 300 major news outlets, picked up the story. By 2001, when the University of Michigan’s Journal of Electronic Publishing and its wonderful editor, Judith A. Turner, decided to give the topic renewed thrust, we were able to clean up the presentation and language quite a bit, but did little to actually update the statistics. (That version, in fact, is the one mostly cited today.)

Over the years there have been some books published and other estimates put forward, more often citing lower amounts for the deep Web than my original estimates, but, with one exception (see below), none of these were backed by new analysis. I was asked numerous times to update the study, and indeed had begun collating new analysis at a couple of points, but the work required to complete it was substantial, always took a back seat to other duties, and so was never finished.

Recent Updates and Criticisms

It was thus with some surprise and pleasure that I first found reference yesterday to Dirk Lewandowski and Philipp Mayr’s 2006 paper, “Exploring the Academic Invisible Web” [Library Hi Tech 24(4), 529-539], which takes direct aim at the analysis in my original paper. (Actually, they worked from the 2001 JEP version, but, as noted, that analysis is virtually identical to the original 2000 version.) The authors pretty soundly criticize some of the methodology in my original paper and, for the most part, I agree with them.

My original analysis combined a manual evaluation of the “top 60” then-extant Web databases with an estimate of the total number of searchable databases (estimated at about 200,000, which they incorrectly cite as 100,000) and assessments of the mean size of each database based on a random sampling of those databases. Lewandowski and Mayr note conceptual flaws in the analysis at these levels:

  • First, using mean database size rather than median size overestimates the total,
  • Second, databases whose content is of questionable relevance to their interest in academic material (such as weather records from NOAA or satellite Earth-survey data) skewed my estimates upward, and
  • Third, my estimates were based on database size estimates (in GBs) and not internal record counts.

On the other hand, the authors also criticized my definition of deep content as too narrow, overlooking certain content types, such as PDFs, that are now routinely indexed and retrieved on the surface Web. We have also seen uncertain but tangible growth in standard search engine content, with the last cited figures at about 20 billion documents since Google and Yahoo! ceased their war over index numbers.

Though not offering a full-blown analysis of their own, the authors use the Gale Directory of Databases to derive an alternative estimate of perhaps 20 billion to 100 billion deep Web documents of interest for academic purposes, which they later seem to imply must be further discounted to isolate the “word-oriented” and “full-text or bibliographic” records they deem appropriate.

My Assessment of the Criticisms

As noted, I generally agree with these criticisms. For example, since the time of the original publication we have seen that most things on the Internet, including popularity and traffic, follow power-law distributions. Such heavy-tailed distributions will always produce overestimates when calculations are based on means rather than medians. I also think that meaningful content types were both overused (more database-like records) and underused (PDF content that is now routinely indexed) in my original analysis.
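To see the effect concretely, here is a small, purely hypothetical Python sketch: it draws a heavy-tailed (Pareto) sample as a stand-in for database sizes and compares mean-based and median-based figures. The shape parameter and sample count are arbitrary assumptions, not values from either study.

```python
import random

# Hypothetical, heavy-tailed "database sizes"; the Pareto shape (1.5)
# and the count (200,000) are arbitrary stand-ins, not study figures.
random.seed(2007)
sizes = [random.paretovariate(1.5) for _ in range(200_000)]

mean_size = sum(sizes) / len(sizes)
median_size = sorted(sizes)[len(sizes) // 2]

# A total extrapolated as (number of databases x mean size) exceeds one
# based on the median, because a few very large databases pull the mean up.
print(f"mean size:   {mean_size:.2f}")
print(f"median size: {median_size:.2f}")
print(f"mean is {mean_size / median_size:.1f}x the median")
```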

However, the authors’ third criticism is patently wrong: three different methods were used to estimate internal database record counts and the average size of the records they contained. I would also have preferred a more careful reading of my actual paper by the authors, since their discussion contains numerous other citation errors and mischaracterizations.

On an epistemological level, I disagree with the authors’ use of the term “invisible Web”, a label that we tried hard in the paper to overturn and that is fading as a current term of art. Internet Tutorials (initially, SUNY at Albany Library) addresses this topic head-on, preferring “deep Web” on a number of compelling grounds, including that “there is no such thing as recorded information that is invisible. Some information may be more of a challenge to find than others, but this is not the same as invisibility.”

Finally, I am not compelled by the authors’ simplistic, partial alternative estimate based solely on the Gale directory, but they readily acknowledge that they did not conduct a full-blown analysis and had different objectives in mind. I agree with the authors in calling for a full, alternative analysis. I think we all agree that such an undertaking is non-trivial and could itself be subject to new methodological pitfalls.

So, What is the Quantitative Update?

Within a couple of years of my paper’s initial publication, I suspected that the “500 times” claim for the size of the deep Web relative to what is discoverable by search engines may have been too high. Indeed, in later corporate literature and PowerPoint presentations, I backed off the initial 2000-2001 claims and began speaking of a range from a “few times” to as much as “100 times” larger.

In the last seven years, the only other quantitative study of its kind of which I am aware is the paper “Structured Databases on the Web: Observations and Implications,” by Chang et al., conducted in April 2004 and published in ACM SIGMOD Record, which estimated 330,000 deep Web sources with over 1.2 million query forms, a rapid three- to seven-fold increase in the four years since my original paper. Unlike the Lewandowski and Mayr partial analysis, this effort and others by that group suggest an even larger deep Web than my initial estimates!

The truth is, we didn’t know then, and we don’t know now, the actual size of the dynamic Web. (And, aside from a sound bite, does it really matter? It is huge by any measure.) Heroic efforts such as these quantitative analyses, or the still-more ambitious How Much Information? studies from UC Berkeley’s SIMS, still have a role in helping to bound our understanding of information overload. As long as such studies gain news traction, they will be pursued. So, what might today’s story look like?

First, the methodological problems in my original analysis remain and (I believe today) resulted in overestimates. Another factor now leading to a potential overestimate of the deep Web vs. the surface Web is that much “deep” content is being increasingly exposed to standard search engines, whether through Google Scholar, Yahoo!’s library relationships, individual site indexing and sharing such as through search appliances, or the other “gray” factors we noted in our 2000-2001 studies. These factors, and certainly more, act to narrow the difference between exposed search engine content (the “surface Web”) and what we have termed the “deep Web.”

However, countering these facts are two newer trends. First, foreign language content is growing at much higher rates and is often under-sampled. Second, blogs and other democratized sources of content are exploding. What these trends may be doing to content balances is, frankly, anyone’s guess.

So, while awareness of the qualitative nature of Web content has grown tremendously in the past near-decade, our quantitative understanding remains weak. Improvements in technology and harvesting can now overcome earlier limits.

Perhaps there is another Ph.D. candidate or three out there who may want to tackle this question in a better (and more definitive) way. According to Chang and Cho in their paper, “Accessing the Web: From Search to Integration,” presented at the 2006 ACM SIGMOD International Conference on Management of Data in Chicago:

On the other hand, for the deep Web, while the proliferation of structured sources has promised unlimited possibilities for more precise and aggregated access, it has also presented new challenges for realizing large scale and dynamic information integration. These issues are in essence related to data management, in a large scale, and thus present novel problems and interesting opportunities for our research community.

Who knows? For the right researcher with the right methodology, there may be a Science or Nature paper in prospect!

Posted by AI3's author, Mike Bergman, on February 21, 2007 at 1:22 pm in Deep Web, Document Assets | Comments (4)
The URI link reference to this post is: https://www.mkbergman.com/343/the-murky-depths-of-the-deep-web/
The URI to trackback this post is: https://www.mkbergman.com/343/the-murky-depths-of-the-deep-web/trackback/
Posted: February 20, 2007

Douglas Crockford, Yahoo!’s resident JavaScript guru and the developer of JSON, has provided a much-needed service with a three-part lecture series on the language. JavaScript is reportedly the second-fastest-growing language behind Ruby and has enjoyed a renaissance because of Ajax and rich Internet applications. JavaScript has always suffered from its unfortunate (and inaccurate) name, but it also suffers from poor and outdated documentation and from deployed code that is generally stripped of comments or obfuscated (“minified”) to keep transferred script sizes small.

The first part in the lecture series (111 minutes) covers the basics of the language, its history and quirks:

Douglas Crockford provides a comprehensive introduction to the JavaScript Programming Language.

The second part places JavaScript and its development in relation to the DOM and browser evolution (77 min):

Then, the third part covers more advanced language topics such as debugging, patterns and the interesting alternative to traditional object classing using prototypal inheritance (67 min):

Crockford is also the developer of the JSLint code-checking utility and has a Web site chock-full of other tips and aids, including a very good set of JS coding standards. I highly recommend you find the three or four hours necessary to give these tutorials your undivided attention.

Crockford is generally disparaging about the state of most JavaScript books, though he does recommend the fifth edition of David Flanagan’s well-known JavaScript: The Definitive Guide. I know Doug has a real job, but all of us can hope he personally takes up the challenge of finally writing the definitive JavaScript guide. Crockford is definitely the guy to do it.

Because of the excellence of this series, it gets a J & D award.

An AI3 Jewels & Doubloons Winner

Posted by AI3's author, Mike Bergman, on February 20, 2007 at 1:08 pm in Software Development | Comments (1)
The URI link reference to this post is: https://www.mkbergman.com/339/doug-crockfords-javascript-lectures/
The URI to trackback this post is: https://www.mkbergman.com/339/doug-crockfords-javascript-lectures/trackback/
Posted: February 19, 2007


We’re So Focused on Plowing Ahead We Often Don’t See The Value Around Us

For some time now I have been wanting to dedicate a specific category on this blog to showcasing tools or notable developments. It is clear that tools compilations for the semantic Web — such as the comprehensive Sweet Tools listing — or new developments have become one focus of this site. But as I thought about this focus, I was not really pleased with the idea of a simple tag of “review” or “showcase” or anything of that sort. The reason such terms did not turn my crank was my own sense that the items that were (and are) capturing my interest were also items of broader value.

Please, don’t get me wrong. One observation (among many) over the past few months has been the amazing diversity, breadth and number of communities and, most importantly, the brilliance and innovation I was seeing. My general sense in this process of discovery is that I have stumbled, somewhat blindly, into many established and sometimes mature communities that have been around for some time, but of which I was not a part nor privy to their insights and advances. These communities are seemingly endless and extend to topics such as the semantic Web and its constituent components, Web 2.0, agile development, Ruby, domain-specific languages, behavior-driven development, Ajax, JavaScript frameworks and toolkits, Rails, extractors/wrappers/data mining, REST, why the lucky stiff, you name it.

Announcing ‘Jewels & Doubloons’

As I have told development teams in the past, as you cross the room to your workstation each morning, look down and around you. The floor is literally strewn with jewels, pearls and doubloons, tremendous riches based on work that has come before, and all we have to do is take the time to look, bend over, investigate and pocket those riches. It is that metaphor, plus the fact that tomorrow is Fat Tuesday, that leads me to name my site awards ‘Jewels & Doubloons.’

Jewels & Doubloons (or J & D for short) may be awarded to individual tools, techniques, programming frameworks, screencasts, seminal papers and even blog entries — in short, anything that deserves bending over, inspecting and taking some time with, and perhaps even adopting. In general, the items picked will be more obscure (at least to me, though they may be very well known to their specific communities), but ones I feel to be of broader cross-community interest. Selection is not based on anything formal.

Why So Many Hidden Riches?

I’ll also talk on occasion about why these riches, of such potential advantage and productivity to the craft of software development, may be so poorly known or overlooked by the general community. In fact, while many can easily pick up the mantra of adhering to DRY (“don’t repeat yourself”), perhaps as great a problem is NIH (“not invented here”): reinventing a software wheel due to pride, ignorance, discontent, or simply the desire to create for creation’s sake. Each of these can lead to a lack of awareness, and thus a lack of use, of existing high-value work.

Some ways and techniques for finding and evaluating hidden gems are better than others. One of the first things any Mardi Gras partygoer realizes is not to reach down with one’s hand to pick up the doubloons and plastic necklaces flung from the krewes’ floats. Ouch! And count the loss of fingers! Real swag aficionados at Mardi Gras learn how to air snatch and foot stomp the manna dropping from heaven. Indeed, with proper technique, one can end up with enough necklaces to look like a neon clown and enough doubloons to trade for free drinks and bar-top dances. Proper technique in evaluating Jewels & Doubloons is one way to keep all ten fingers while getting rich the lazy man’s way.

Jewels & Doubloons are designated with either a medium-sized or small-sized icon (see below) and are also tagged as such.

Past Winners

I’ve gone back over the posts on AI3 and have postdated J & D awards for these items:

Posted: February 8, 2007

The last 24 hours have seen a flurry of postings on the newly released Yahoo! Pipes service, an online IDE for wiring together and managing data feeds on the Web. Tim O’Reilly has called the Pipes service “a milestone in the history of the internet.” Rather than repeat, go to Jeremy Zawodny’s posting, Yahoo! Pipes: Unlocking the Data Web, where he has assembled a pretty comprehensive listing of what others are saying about this new development:

Using the Pipes editor, you can fetch any data source via its RSS, Atom or other XML feed, extract the data you want, combine it with data from another source, apply various built-in filters (sort, unique (with the “ue” this time:-), count, truncate, union, join, as well as user-defined filters), and apply simple programming tools like for loops. In short, it’s a good start on the Unix shell for mashups. It can extract dates and locations and what it considers to be “text entities.” You can solicit user input and build URL lines to submit to sites. The drag and drop editor lets you view and construct your pipeline, inspecting the data at each step in the process. And of course, you can view and copy any existing pipes, just like you could with shell scripts and later, web pages.
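For readers who think more readily in scripts than in drag-and-drop boxes, here is a rough, hypothetical Python analogue of such a pipe: fetch a couple of feeds, union them, filter, de-duplicate, sort and truncate. The feed URLs are placeholders and the sketch relies on the third-party feedparser library; it illustrates the pipeline idea only and is not Yahoo!'s implementation.

```python
import feedparser  # third-party library: pip install feedparser

# Placeholder feed URLs; substitute any RSS or Atom sources of interest.
FEEDS = [
    "https://example.com/feed-a.rss",
    "https://example.org/feed-b.atom",
]

def pipe(feed_urls, keyword=None, limit=10):
    """Fetch, union, filter, de-duplicate, sort and truncate feed entries,
    loosely mirroring the operators Pipes exposes visually."""
    entries = []
    for url in feed_urls:                      # fetch + union
        entries.extend(feedparser.parse(url).entries)

    if keyword:                                # simple keyword filter on titles
        entries = [e for e in entries
                   if keyword.lower() in e.get("title", "").lower()]

    seen, unique = set(), []                   # 'unique' operator, keyed on link
    for e in entries:
        link = e.get("link")
        if link and link not in seen:
            seen.add(link)
            unique.append(e)

    # Crude sort on the raw 'published' string, newest first, then truncate.
    unique.sort(key=lambda e: e.get("published", ""), reverse=True)
    return unique[:limit]

for entry in pipe(FEEDS, keyword="semantic"):
    print(entry.get("title"), "->", entry.get("link"))
```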

Posted by AI3's author, Mike Bergman, on February 8, 2007 at 10:39 am in Adaptive Innovation, Semantic Web, Semantic Web Tools | Comments (1)
The URI link reference to this post is: https://www.mkbergman.com/336/yahoo-pipes-adding-a-voice-to-the-chorus/
The URI to trackback this post is: https://www.mkbergman.com/336/yahoo-pipes-adding-a-voice-to-the-chorus/trackback/
Posted: February 7, 2007

How to Process Your Own Large Libraries into Thumbnails

When I decided to upgrade my Sweet Tools listing of semantic Web and related tools, I wanted to add some images to make the presentation more attractive. It was also becoming the case that many metadata aggregation service providers were adopting image representations for data (see this D-Lib article). Since the focus of my listing is software, I could either install all of the programs and take screenshots (not doable given the numbers involved) or adopt what many others use as a sort of visual index for content: thumbnails or, as they are specifically called when applied to Web pages, thumbshots.

Quick Review of Alternatives

Unless you get all of your Web content via feeds or have been living in a cave, you may recently have contracted a form of popup vertigo. Since its introduction just a few months back, the Snap Preview Anywhere thumbnail popup has become the eggplant that eats Chicago, with more than a half-million sites now reported to be using the service. Since I don’t want the service for my own blog (see below), and did not want to go to the effort of signing up for SPA or restricting its use to just this posting (even though the signup appears clean and straightforward), I reproduce below what one of these Snap link-over popups looks like:

The sheer ubiquity of these popup thumbnails is creating its own backlash (check out this sample rant from UNEASYsilence and its comments), and early promoters, such as TechCrunch, have now gone to using a clickable icon for a preview rather than automatically popping up the image on link hover.

Not only had the novelty of these popups worn off for me, but my actual desired use for Sweet Tools was to present a gallery of images for multiple results simultaneously. So, besides its other issues, the Snap service was not suitable for my purpose.

I had earlier used a Firefox add-on called BetterSearch that places thumbnails on results pages when searching with Google (including international versions), Amazon, MSN Search, Yahoo!, A9, Answers.com, AllTheWeb, Dogpile.com, del.icio.us and Simpy.com. But, like the Snap service, I personally found it distracting. I also didn’t like that my use was potentially being logged and that promo messages were inserted on each screen. (There is another Firefox browser extension called GooglePreview that appears less intrusive, but I have not tried it.) As it turns out, both of these services piggyback on a free (for some uses) thumbnail acquisition and serving service from Thumbshots.org.

Since my interest in thumbnails was limited to a bounded roster of sites (not the dynamic results of a search query), I decided to cut out the middleman and try the Thumbshots.org source directly. However, my candidate sites are mostly obscure academic or semantic Web ones not generally in the top rankings, meaning that most of the Sweet Tools Web sites unfortunately had no thumbnails on Thumbshots.org.

Of course, throughout these investigations, I always had the option of taking screen captures myself and converting them manually to thumbnails. This is a very straightforward process with standard graphics packages; I had done it often for other purposes using my standard Paint Shop Pro software. But with the number of Sweet Tools entries growing into the hundreds, such a manual approach clearly wouldn’t scale.

Knowing there are literally hundreds of cheap or free graphics and image manipulation programs out there, I thus set out to see if I could find a utility that would provide most, if not all, of the automation required.

My Sweet Tools records don’t change frequently, so I could accept a batch-mode approach. I also wanted to size the thumbnails to whatever displayed best in my Exhibit presentation. As well, if I was going to adopt a new utility, I decided I might as well seek other screen capture and display flexibilities for other purposes. Importantly, I also needed the individual file names created to be unique and readable (not just opaque IDs). Finally, like any tool I ultimately adopt, I wanted quality output and professional design.

Off and on I reviewed options and packages, mostly getting disgusted with the low quality of the dross out there, and appalled at how difficult it is to find such candidates with standard search services. (Whole categories of content, such as products of all types, reviews, real data, market info and statistics, are becoming nearly impossible to find effectively on the Web with current search engines; but those are topics for another day.)

Nonetheless, after much looking and trial runs of perhaps a dozen packages, I finally stumbled across a real gem, WebShot. (Reasons this product was difficult to find included its relatively recent vintage, apparent absence of any promotion, and the mismatch between the product name and Web site name.)

The WebShot Utility

WebShot is a program that allows you to take screenshots and thumbnails of web pages or whole websites. I find its GUI easy to use, but it also comes with a command line interface for advanced users or for high-volume services. WebShot can produce images in the JPG, GIF, PNG, or BMP formats. It was developed in C by Nathan Moinvaziri.

The program is free for use on Windows XP, though PayPal donations are encouraged. Nominal charges apply for other Windows versions and for use of the command line. Linux is not supported, and Internet Explorer must be installed.

The graphical UI on Windows XP has a standard tabbed design. It can produce single thumbnails or run batches driven from a text file. Output files can be flexibly sized in any of the above formats. The screen capture itself can be based on mandatory or maximum and minimum browser display parameters. There are a variety of file-naming parameters, and system settings allow WebShot to work in Web-friendly ways. Here’s an example of the Image tab of the GUI:

The command-line version accepts about 20 different parameters.

Depending on settings, you can get a large variety of outputs. The long banner image to the left, for example, is a “complete” Web page dump of my Web site at the time of this posting, with about 8 consecutive posts shown (160 x ~2300). The system automatically stitches together the multiple long page screenshots, with the resolution in this case being set by the input width parameter of 160 pixels.

Another option is this sample “cropped” one (440 x 257) where I’m actually cutting the standard screen display to about 50% of its normal vertical (height) dimension:

And, then, the next example shows what I have chosen as my “standard” thumbnail size (160 x 120) (I added the image borders, not the program):

In batch mode, I set the destination parameter so that I got both a logical domain portion in the file name (%d) and a hashed portion (%m), since there were a few cases of multiple, different Web pages from the same host domain.

As noted, download re-tries, delays and timeouts are all settable to be a good Web citizen while getting acceptable results. With more-or-less standard settings, I was able to complete the 400 thumbnail downloads (without error, I should mention) in just a few minutes for the Sweet Tools dataset.

How I Do Bulk Thumbnails for Sweet Tools

Your use will obviously vary, but I kept notes for myself so that I could easily repeat or update this batch process (in fact, I have already done so a couple of times with incremental updates to Sweet Tools). The general work flow (a small scripted sketch of steps 1 and 3 follows the list) is:

  1. Create a text file with host Web site URLs in spreadsheet order
  2. Run WebShot with these general settings:
    • destination switches of %d%m (core domain, plus hash)
    • image at 160w x 120h (my standard; could be anything as long as proper aspect maintained)
    • use of Multiple tab, with a new destination directory for each incremental update
    • browser setting at 1024 x 768 required (most common aspect today); min of 800 x 600; highest quality image
  3. At completion, go to a command window and write out the image file names (the images complete in the same order as submitted). (In Windows, this is the dir /o:d > listing.txt command.) Then copy the file names from the resulting text file back into the spreadsheet to maintain the record-to-image correspondence
  4. Upload to the appropriate WordPress image directory.
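As a convenience, steps 1 and 3 can themselves be scripted. The following Python sketch is hypothetical (the file and directory names are my own placeholders); it assumes the spreadsheet has been exported to CSV and that WebShot has already written its images to a destination directory:

```python
import csv
import pathlib

# Hypothetical names; adjust to your own layout.
SPREADSHEET = "sweet_tools.csv"       # exported spreadsheet with a "url" column
URL_LIST = "urls.txt"                 # text file fed to WebShot's batch mode
THUMB_DIR = pathlib.Path("thumbs")    # WebShot's destination directory

# Step 1: dump the site URLs, in spreadsheet order, to a plain text file.
with open(SPREADSHEET, newline="", encoding="utf-8") as src:
    urls = [row["url"] for row in csv.DictReader(src)]
with open(URL_LIST, "w", encoding="utf-8") as dst:
    dst.write("\n".join(urls))

# ... run WebShot in batch (Multiple) mode against URL_LIST ...

# Step 3: list the generated images oldest-first (capture order matches
# submission order) and write out the record-to-image correspondence.
images = sorted((p for p in THUMB_DIR.iterdir() if p.is_file()),
                key=lambda p: p.stat().st_mtime)
with open("record_image_map.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["url", "image_file"])
    writer.writerows(zip(urls, (p.name for p in images)))
```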

Some Other Tips

Like many such tools, the WebShot package has insufficient documentation. But, with some experimentation, it is in fact quite easy to accomplish a number of management or display options. Some of the ones I discovered are:

  • If harvesting multiple individual Web pages from the same domain, use the domain (%d) and hash (%m) options noted above
  • For complete capture of long Web pages (such as the image of my own Web site to the left), first decide on a desired resolution set via ‘width’ on the Image tab, leave height blank, and leave the browser settings open
  • For partial screen captures without distortion, set the image dimensions to the desired final size (with height set to the desired partial percentage), then adjust the browser dimensions to match the image aspect ratio.

An AI3 Jewels & Doubloons Winner

Posted by AI3's author, Mike Bergman, on February 7, 2007 at 8:43 pm in Information Automation, Open Source, Site-related | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/334/down-to-me-webshot-has-come-its-under-my-thumb/
The URI to trackback this post is: https://www.mkbergman.com/334/down-to-me-webshot-has-come-its-under-my-thumb/trackback/