Posted:February 26, 2007

Image courtesy of www.webdesign.orgTo Reach Its Full Potential, JavaScript Collaboration Needs New Approaches and New Tools

This posting begins from the perspective: I’ve found a nifty JavaScript code base and would like to help extend it. How do I understand the code and where do I begin? It then takes that perspective and offers some guidance to open source projects and developers of some tools and approaches that might be undertaken to facilitate code collaboration.

Rich Internet applications, browser extensions and add-ons, and the emergence of sophisticated JavaScript frameworks and libraries such as Dojo, YUI and Prototype (among dozens of others, see here for a listing) are belying the language’s past reputation as a “toy.” These very same developments are increasingly leading to open source initiatives around these various sophisticated frameworks and applications. But there can be major barriers to open source collaboration with JavaScript.

JavaScript works great for single developers or single-embedded functions. It also is easily malleable with the use of obfuscation, minifying (including white-space reductions), and short variable names to keep scripts communicated by the server small and opaque. Yet as more JavaScript initiatives become ever complex and seek the involvement of outside parties, these factors and the nature of the language itself work against easy code understanding.

Open source projects are often notorious for poor, even non-existent, documentation. (I very much like Mike Pope’s statement, “If it isn’t documented, it doesn’t exist.”) But the documentation curse is magnified for JavaScript . As Doug Crockford notes, the general state of standard language documentaton and guide books is poor. Developers also often learn from others’ code examples, and there is much in the way of JavaScript snippets, especially historical, that is atrocious. Even widely adopted frameworks, such as Prototype, have for years been nearly undocumented, a shortcoming now being addressed. Rarely (in fact I know of none for major open source packages) does there exist UML class or package diagrams for larger JS apps or frameworks. And, because much JavaScript is communicated to the browser as server-side scripts, embedded code comments are of generally poor quality or totally non-existent. All of these factors translate into the general absence of APIs for larger apps (though most of the leading frameworks and libraries such as Prototype, YUI and Dojo are now beginning to provide professional documentation).

Some of the JavaScript language trade-offs and decisions hark back to early standards tensions between Microsoft and Netscape (and Sun). Some of JavaScript’s challenges arise from the very design of the language. The prototypal nature of inheritance in the language makes it difficult to trace class hierarchies, or even to discern class structure from standard functions via machine translation or interpreter inspection. Again, for embedded JS or small functions, this is not a general limitation. But as apps grow complex, indeed as a result of the inherent extensibility of the language itself, an undocumented prototypal structure can become a limitation to structural understanding. Also, as a loosely-typed language, it often can be difficult to discern intended datatypes. These language characteristics make it difficult to reverse engineer the code (see below) and therefore to tease out structure or round-trip the code with advanced IDEs.

Larger apps may often mix JavaScript with other languages, such as C++ or Java, in an attempt to segregate responsibilities in the code base in relation to perceived responsibilities and language strengths. Indeed, with the exception of some more recent complete frameworks such as the Tibco General Interface, most larger apps have mixed code bases. In these designs, open source contributors need to understand multiple languages and the language interfaces can cause their own challenges and bottlenecks. Alternatively, more structured languages such as Java can be used for initial development, with the code then being translated to JS, as is the case in Rhino or the Google Web Toolkit (GWT). Again, though, external contributors may need to have multiple language skills to contribute productively.

Thus, some of JavaScript’s challenges to open source collaboration are inherent to the language or structural, others are due to inadequate best practices. I discuss below possible tools in three areas — code documentation, code profiling or reverse engineering, and code diagramming — that could aid a bit in overcoming some of these obstacles.

Understand the Code’s Structure and Model Architecture

In any approach to a new code base, especially one that is complex with many thousands of lines of code and multiple files, it is often useful to start with a visual overview of the architectural structure. Preferably, this understanding includes both the static structure (via class, component or package UML diagrams or equivalent) and data flows through the app. Component views of the file structure are one useful starting point for understanding the physical relationships between the JS, and HTML, XHTML or XUL, and possible data files. Then, a next desirable view would be to run the code base through a code visualizer such as MIT’s Relo, which has the usefulness of enabling interactive and gradual discovery and presentation of the class structure of a code base.

“Reverse engineering,” or alternatively round-tripping of code with CASE tools or IDEs, has a fairly rich history, with much of the academic work in computer science occurring in Europe. For structured languages such as Java or C#, there are many reverse engineering tools that will enable rapid construction of UML diagrams. There are some experimental reverse engineering approaches for Web sites and Web pages, including ReversiXML and WebUml, but these are limited to HTML with no JavaScript support and are not production quality. In fact, my investigations have discovered no tools capable for reverse engineering JavaScript code in any manner, let alone to UML.

This perhaps should not be all that surprising given the loose structure of the language. While Ruby is also a loosely typed language and there is a code visualizer for Rails, its operation apparently very much relies on the structure within the Rails scaffolding. And, while one might think that techniques used elsewhere for information extraction or data mining applied to unstructured data could also be adapted to infer the much greater structure in JavaScript, any such inferences still involve a considerable degree of guesswork according to Vineet Sinha, Relo’s developer.

So, what other automated options short of reverse engineering to UML are there for gaining this structural understanding of the code base? Frankly, not many.

One commercial product is JavaScript Reporter. It performs syntax checking, variable type and flow analysis on individual JavaScript files, and adds type analysis across files for a group of related JavaScript files. It does not produce program flowcharts or UML. While one can process JavaScript with the commercial Code Visual 2 Flowchart, the code analysis is limited to a single function at a time (not a complete file let alone multiple files) and the output is limited to a flowchart (no UML) in either a BMP or PNG bitmap format.

The Dojo project is moving fairly aggressively with its JS Linker project, with the purpose of processing HTML/JavaScript code bases to prepare code for deployment by reducing file size, create source code documentation, obfuscate source code to protect intellectual property, and help gather source code metrics for source code analysis and improvements. This effort may evolve to create the basis for UML generation from code, but is still in the pre-release phases and its potential applicability to non-Dojo libraries is not known.

According to one review, “Reverse engineering is successful if the cost of extracting information about legacy systems plus effort to arrive specifications is less than the cost of starting from scratch.” If code generation or reverse engineering is essential, one choice could be to work either with Rhino or GWT directly with the Java source, prior to JavaScript generation. Besides the Java reverse engineering tools above, GWT Designer is a commercial visual designer and bi-directional code generator that is tailored specifically to GWT and Ajax applications. Of course, such an approach would require a commitment to Java development at time of project initiation.

When All Else Fails, Go Manual

Thus, apparently no automatic UML diagramming tools exist that can be driven from existing JavaScript. So, what is an open source project to do that wants to use UML or architectural diagrams to promote external contributions?

The first thing a large, open source JavaScript project should do is document and annotate its file structure. It is surprising, but in my investigations of relevant JavaScript applications, only Zotero has provided such file documentation (though finding the page itself is somewhat hard). (There are likely others that provide file documentation of which I am not aware, but the practice is certainly not common.)

The next thing a project should do is provide an architectural schematic. Since no automated tools exist for JavaScript, the only option is to construct these diagrams manually. Re-constructing a project’s UML structure, though time-consuming and a high threshold to initial participation, is an excellent way to become familiar with the code base. Once created, such diagrams should be prominently posted for subsequent contributors. (New projects, of course, would be advised to include such diagramming and documentation from the outset.)

Manual diagramming, possibly with UML compliance, next begs the question of which drawing program to use. Like program editors, UML and drawing tools are highly varied and a personal preference. I mostly hate selecting new tools since I only want to learn a few, but learn them well. That means tools represent real investments for me, which also means I need to spend substantial time identifying and then testing out the candidates.

Dia is a GTK+ based diagram creation program for Linux, Unix and Windows released under the GPL license. Dia is roughly inspired by the commercial Windows program Visio. It is more geared towards informal diagrams for casual use, but it does have UML drawing templates. Graphviz is a general drawing and modeling program used by some other UML open source programs. I liked this program for its general usefulness and kept it on my system after all evaluations were done, but is really not automated enough alone for UML use.

In the UML realm, one end of the spectrum are “light”, simpler tools that have lower learning thresholds and fewer options. Some candidates here are Violet (which is quite clean and can reside in Eclipse, but has very limited formatting options) and UMLet (which does not snap to grid and also has limited color options). Getting a bit more complex I liked UMLGraph, which allows the declarative specification and drawing of UML class and sequence diagrams. Among its related projects, LightUML integrates UMLGraph into Eclipse and XUG extends UMLGraph into a reverse engineering tool.

BOUML is a another free tool, recently released in France. Program start-up is a little strange, requiring a unique ID to be set prior to beginning a project. Besides the relative developmental status of this offering, English documentation is pretty awkward and hard to understand in some critical areas. StarUML has some interesting features, but is not Eclipse-ready. Its last major upgrade was more than one year ago. It has reverse engineering and code generation for the usual suspects, and has many diagramming options. Program settings are highly parameterized and there is a unique (within open source efforts) template to support Word integration. One quirk is that new class diagrams get provided with the (?) names of the Korean development team behind the project. Like all other of the UML tools investigated, StarUML does not have any support for JavaScript.

Since I use Eclipse as my general IDE, my actual preference is a UML tool integrated into that environment. There are about 40 or UML options that support that platform, about a dozen of which are either free (generally “community versions” of commercial products) or open source. I reviewed all of them and installed and tested most.

LightUML is a little tricky to install, since it has external dependencies on Graphviz and the separate UMLGraph, though those are relatively straightforward to install. The biggest problem with LightUML for my purposes was its requirement to work off of Java packages or classes. I next checked out green. With green, UML diagrams are educational and easy to create and it provides a round-tripping editor. I found it interesting as an academic project but somewhat lacking for my desired outputs. Omondo’s free version of EclipseUML looks quite good, but unfortunately will only work with bona fide Java projects, so it was also eliminated from contention. I found eUML2 to be slow and it seemed to take over the Eclipse resources. I found a similar thing with Blueprint Software Modeler, 85 MB without the Eclipse SDK, which seems rather silly for the degree of functionality I was seeking.

Papyrus is a free, very new, Eclipse plug-in. It has only 5 standard UML diagrams and sparse documentation, but its implementation so far looks clean and its support for the DI2 (diagram interchange) exchange format is a plus (I’ve kept it on my system and will check in on new releases later). Visual Paradigm for UML has a free version usable in Eclipse, which is professionally packaged as is common for commercial vendors. However, this edition limits the number of diagram types per project and the download requires all options and a 117 MB install before the actual free version can be selected, and even then implications of the various choices are obscure. (These not uncommon tactics are not really bait-and-switch, but often times feel so.) The limit of one diagram type per project eliminate this option from consideration, plus it is incredibly complicated and a real resource hog. The community edition of MagicDraw UML operates similarly and with similar limitations.

After numerous Eclipse installs and deletes, I finally began to settle on ArgoEclipse. It had a very familiar interface and installed as a standard Eclipse plug-in, was generally positively reviewed by others, and appeared to have the basic features I initially thought necessary. However, after concerted efforts with my chosen JavaScript project, I soon found many limitations. First, it was limited to UML 1.4 and lacked a component UML view that I wanted. I found the interface to be very hard to work with for extended periods. I wanted to easily add new UML profiles (an extension mechanism for UML models) specific to my project and JavaScript, but could find no easy way to do so. Lastly, and what caused me to abandon the framework, I was unable to split large diagrams into multiple tiled pieces for coherent printing or distribution.

None of these tools support JS reverse engineering or round-tripping. However, keeping everything in the Eclipse IDE would aid later steps when adding in-line code documentation (see below) to the code base. On the other hand, because of the lack of integration, and because UML diagramming needed to occur before code commenting, I decided I could forego Eclipse integration for a standalone tool so long as it supported the standard XMI (XML metadata interchange) format.

This decision led me back to again try StarUML, which is the tool I ended up using to complete my UML diagramming. I remain concerned about StarUML’s lack of recent development and reliance on Delphi and some proprietary components that limit it as a full open-source tool. On the other hand, it is extensible via user-added profiles and tags using XML, it is very intuitive to use, and it has very flexible export and diagramming options. Because of strong XMI support, should better options later emerge, there is no risk of lock-in.

Looking to the future, because of my Eclipse environment and the growing availability of JS editors in Eclipse (Aptana, which I use, but also Spket and JSEclipse), one project worth monitoring is UML2, an EMF- (and GMF-) based implementation of the UML 2.x metamodel for the Eclipse platform due out in June. This effort is part of the broader Model Development Tools (MDT) Eclipse initiative. UML2 builds are presently available for testing with other bleeding-edge components, but I have not tested these.

Throughout this process of UML tools investigations, I came to discover a number of preferences and criteria for selecting a toolset that include: 1) a free, open-source option; 2) easy to use and modify, with flexible user preferences; 3) support for UML 2.x and most or all UML diagrams; 4) support for XMI import and export; 5) extensible profiles and frameworks via an XML syntax, preferably with some profile building utilities; 6) operation within the Eclipse environment, measured by standard plug-in and update installs; 7) clean functionality and user interface; 8 ) the ability to handle large diagrams, particularly in tiling; 9) a variety of output formats; and 10) strength in the class, component and package diagrams of most use to me. Of course, these factors may differ substantially for you.

Now, Document the Code as You Dive Deeper

Since JavaScript apps often have little or no embedded code documentation, one way to contribute is to document it as you trace though it. Like creating UML diagrams, a requirement to provide in-line documentation is a high threshold to initial participation. However, for motivated contributors, documentation can be an excellent way to learn a code base. But whomever does the initial documentation, it should be done in a maintainable manner with the ability to generate API documentation and be posted prominently for other potential contributors.

There are a few emerging JavaScript code documentation systems, which are tools that parse inline documentation in JavaScript source files to produce (preferably) well formatted HTML with links. The “leading” ones are shown in the table below:

  • Best known and most common
  • Simple parser recognition
  • Operates similarly to Javadoc (but is a subset)
  • Requires separate Perl install
  • Doesn’t support multiple declarations for a single function
  • Tags perhaps too constrained, inadequate (for example, lacks the @alias tag)
  • Little formatting flexibility
  • Not workable with Prototype-style programming
  • Found (generally) to be inadequate for modern JavaScript (according to project leads)
  • Written in JavaScript
  • Simple parser recognition
  • Same original developer (Mike Mathews) as JSDoc
  • Backwards compatible with JSDoc (at least for now)
  • More open input process to initial design
  • Reports easily configurable
  • No code interpretation
  • Just released; only in beta
  • No existing report templates (as of yet)
  • Supported by Aptana IDE
  • Separate sdoc file can keep documentation separate from source files
  • Has really not gotten traction in nearly 9 months
  • Appears to close to single sponsor (Aptana)
  • Separate sdoc needs to be synchronized with source files
  • No known presentation templates or examples
  • Schizophrenic between Aptana site and ScriptDoc site
  • Targeting any language; 4 now have full support, more than 15 (including JS) have basic support
  • Nice layouts
  • Flexibility for how inline documentation is written (e.g., any keyword: name)
  • Lack of rigid parsing rules can lead to ambiguous outputs
  • Little traction so far
  • Appears to be a single-person project
  • Perhaps too flexible

The most widely used approach is JSDoc, which is a Perl-based analog to Javadoc (click here for an example report). JSDoc is the most established of the systems and has an interface familiar to most Java developers. The first version is being phased out, with version 2 release apparently pending. There is a Google Group on this option. JSDocumenteris a graphical user interface built for Javadoc-like documenting using JSDoc. Within this family, the JSDoc-2 option appears to be the choice, since the initial developers themselves have moved in that direction and recognize earlier problems with JSDoc they are planning to overcome.

ScriptDoc is an attempt to standardize documentation formats, and is being pushed and backed by the Aptana open source JS IDE. The major differentiator for the ScriptDoc approach is the ability to link IDs in the code base to external .sdoc files that could perhaps have more extensive commenting than what might be desirable in-line. (The other side of that coin is the synchronization and maintenance of parallel files.) To date, ScriptDoc is also tightly linked with Aptana, with no known other platform support. Also, there is a disappointing level of activity on the ScriptDoc standards site.

NaturalDocs is an open-source, extensible, multi-language documentation generator. You document your code in a natural syntax that reads like plain English. Natural Docs then scans your code and builds high-quality HTML documentation from it. Multiple languages are presently supported in either full or basic mode, including JavaScript. The flexible format for this system is both a strength and weakness.

There are a few other JS documentation systems. Qooxdoo, an open-source, multi-purpose Ajax system, provides a Python script utility that parses JavaScript and also understands Javadoc- and doxygen-like meta information to generate a set of HTML files and a class tree; see further this explanation. JSLinker for Dojo, mentioned above, also includes documentation aspects. For other information, Wikipedia has a fairly useful general comparison of documentation generators, most of which are not related to JavaScript.

The idea of keeping documentation text separate from the code has initial appeal for JavaScript, since it is often necessary to transfer the scripts in minimal form. But this very separation creates synchronization issues and many modern code editors can also support programmatic exclusion of the comments for transferred forms.

On this and other bases, I think JSDoc-2 is the best choice of the options above. The general Javadocs form and expansion of necessary tags appears to have overcome earlier JSDoc limits, and the easy parser recognition for the comments appears sufficiently flexible. Plus, output formats can be tailored to be as fancy as desired.

Of course, once a documentation generator is selected it is then necessary to establish in-line comment standards. These should be a standard product documentation requirement, best a part of standard coding standards. Each project should therefore also publish its own coding and documentation standards, which is especially important when outside collaborators are involved. It is beyond the scope of this survey to suggest standards, but one place to start is with the JavaScript coding standards from Doug Crockford (but which also includes only minimal inline documentation recommendations).

Code Validation and Debugging

Though this survey regards efforts prior to coding collaboration, once coding begins there are some other essential tools any JavaScript developer should have. These include:

  • JSLint — a JavaScript syntax checker and validator
  • Firebug — a great inspection and debugging service for use within the Firefox browser, and
  • Venkman — the code name for another Firefox-based debugger, this one from Mozilla.

Since these tools are beyond the scope of this present review, more detailed discussion of them will be left to another day.

Posted by AI3's author, Mike Bergman Posted on February 26, 2007 at 10:17 pm in Open Source, Software Development | Comments (1)
The URI link reference to this post is:
The URI to trackback this post is:
Posted:February 21, 2007

Deep WebIt’s Taken Too Many Years to Re-visit the ‘Deep Web’ Analysis

It’s been seven years since Thane Paulsen and I first coined the term ‘deep Web‘, perhaps representing a couple of full generational cycles for the Internet. What we knew then and what “Web surfers” did then has changed markedly. And, of course, our coining of the term and BrightPlanet’s publishing of the first quantitative study on the deep Web did nothing to create the phenomenon of dynamic content itself — we merely gave it a name and helped promote a bit of understanding within the general public of some powerful subterranean forces driving the nature and tectonics of the emerging Web.

The first public release of The Deep Web: Surfacing Hidden Value (courtesy of the Internet Archive’s Wayback Machine), in July 2000, opened with a bold claim:

BrightPlanet has uncovered the "deep" Web — a vast reservoir of Internet content that is 500 times larger than the known "surface" World Wide Web. What makes the discovery of the deep Web so significant is the quality of content found within. There are literally hundreds of billions of highly valuable documents hidden in searchable databases that cannot be retrieved by conventional search engines.

The day the study was released we needed to increase our servers nine-fold to meet news demand after CNN and then 300 major news outlets eventually picked up the story. By 2001 when the University of Michigan’s Journal of Electronic Publishing and its wonderful editor, Judith A. Turner, decided to give the topic renewed thrust, we were able to clean up the presentation and language quite a bit, but did little to actually update many of the statistics. (That version, in fact, is the one mostly cited today.)

Over the years there have been some books published and other estimates put forward, more often citing lower amounts in the deep Web than my original estimates, but, with one exception (see below), none of these were backed by new analysis. I was asked numerous times to update the study, and indeed had even begun collating new analysis at a couple of points, but the effort to complete the work was substantial and the effort always took a back seat to other duties and so was never completed.

Recent Updates and Criticisms

It was thus with some surprise and pleasure that I first found reference yesterday to Dirk Lewandowski’s and Phillip Mayr’s 2006 paper, “Exploring the Academic Invisible Web” [Library Hi Tech 24(4), 529-539], that takes direct aim at the analysis in my original paper. (Actually, they worked from the 2001 JEP version, but, as noted, the analysis is virtually identical to the original 2000 version.) The authors pretty soundly criticize some of the methodology in my original paper and, for the most part, I agree with them.

My original analysis combined a manual evaluation of the “top 60″ then-extant Web databases with an estimate of the total number of searchable databases (estimated at about 200,000, which they incorrectly cite as 100,000) and assessments of the mean size of each database based on a random sampling of those databases. Lewandowski and Mayr note conceptual flaws in the analysis at these levels:

  • First, by use of mean database size rather than median size, the size is overestimated,
  • Second, databases of questionable content to their interests in academic content (such as weather records from NOAA or Earth survey data by satellite) skewed my estimates upward, and
  • Third, my estimates were based on database size estimates (in GBs) and not internal record counts.

On the other hand, the authors also criticized that my definition of deep content was too narrow, and overlooked certain content types such as PDFs now routinely indexed and retrieved on the surface Web. We also have had uncertain, but tangible growth in standard search engine content — with the last cited amounts about 20 billion documents since Google and Yahoo! ceased their war of index numbers.

Though not really offering an alternative, full-blown analysis, the authors use the Gale Directory of Databases to derive an alternative estimate of perhaps 20 billion to 100 billion documents on the deep Web of interest for academic purposes, which they later seem to imply also needs to be discounted by further percentages to get at “word-oriented” and “full-text or bibliographic” records that they deem appropriate.

My Assessment of the Criticisms

As noted, I generally agree with these criticisms. For example, since the time of original publication, we have seen the power distribution nature of most things on the Internet, including popularity and traffic. Exponential distributions will always result in overestimates using calculations based on means rather than medians. I also think that meaningful content types were both overused (more database-like records) and underused (PDF content that is now routinely indexed) in my original analysis.

However, the authors’ third criticism is patently wrong, since three different methods were used to estimate internal database record counts and the average sizes of each record they contained. I would also have preferred a more careful reading by the authors of my actual paper, since there are numerous other citations in error and mis-characterizations.

On an epistemological level, I disagree with the authors’ use of the term “invisible Web”, a label that we tried hard in the paper to overturn and that is fading as a current term of art. Internet Tutorials (initially, SUNY at Albany Library) addresses this topic head-on, preferring “deep Web” on a number of compelling grounds, including that “there is no such thing as recorded information that is invisible. Some information may be more of a challenge to find than others, but this is not the same as invisibility.”

Finally, I am not compelled by the author’s simplistic, alternate partial estimate based solely on the Gale database, but they readily acknowledge to not doing a full-blown analysis and to having different objectives in mind. I agree with the authors in calling for a full, alternative analysis. I think we all agree that is a non-trivial undertaking and could itself be subject to newer methodological pitfalls.

So, What is the Quantitative Update?

Within a couple of years after the initial publication of my paper, I suspected the “500 times” claim for the greater size of the deep Web in comparison to what is discoverable by search engines may have been too high. Indeed, in later corporate literature and Powerpoint presentations, I backed off the initial 2000-2001 claims and began speaking in ranges from a “few times” to as high as “100 times” greater for the size of the deep Web.

In the last seven years, the only other quantitative study of its kind of which I am aware is documented in the paper, “Structured Databases on the Web: Observations and Implications,” conducted by Chang et al. in April 2004 and published in the ACM SIGMOD, that estimated 330,000 deep Web sources with over 1.2 million query forms, reflecting a fast 3-7 times increase in 4 years from the date of my original paper. Unlike the Lewandowski and Mayr partial analysis, this effort and others by that group suggests an even larger deep Web than my initial estimates!

The truth is, we didn’t know then — and we don’t know now — what the actual size of the dynamic Web truly is. (And, aside from a sound bite, does it really matter? It is huge by any measure.) Heroic efforts such as these quantitative analyses or the still-more ambitious efforts of UC Berkeley’s SIM School on How Much Information? still have a role in helping to bound our understanding of information overload. As long as such studies gain news traction, they will be pursued. So, what might today’s story look like?

First, the methodological problems in my original analysis remain and (I believe today) resulted in overestimates. Another factor today leading to a potential overestimate of the deep Web v. the surface Web would be the fact that much “deep” content is being more exposed to standard search engines, be it through Google’s Scholar, Yahoo!’s library relationships, individual site indexing and sharing such as through search appliances, and other “gray” factors we noted in our 2000-2001 studies. These factors, and certainly more, act to narrow the difference between exposed search engine content (“surface Web”) and what we have termed the “deep Web.”

However, countering these facts are two newer trends. First, foreign language content is growing at much higher rates and is often under-sampled. Second, blogs and other democratized sources of content are exploding. What these trends may be doing to content balances is, frankly, anyone’s guess.

So, while awareness of the qualitative nature of Web content has grown tremendously in the past near-decade, our quantitative understanding remains weak. Improvements in technology and harvesting can now overcome earlier limits.

Perhaps there is another Ph.D. candidate or three out there that may want to tackle this question in a better (and more definitive) way. According to Chang and Cho in their paper, “Accessing the Web: From Search to Integration,” presented at the 2006 ACM SIGMOD International Conference on Management of Data in Chicago:

On the other hand, for the deep Web, while the proliferation of structured sources has promised unlimited possibilities for more precise and aggregated access, it has also presented new challenges for realizing large scale and dynamic information integration. These issues are in essence related to data management, in a large scale, and thus present novel problems and interesting opportunities for our research community.

Who knows? For the right researcher with the right methodology, there may be a Science or Nature paper in prospect!

Posted by AI3's author, Mike Bergman Posted on February 21, 2007 at 1:22 pm in Deep Web, Document Assets | Comments (7)
The URI link reference to this post is:
The URI to trackback this post is:
Posted:February 20, 2007

Jewels & DoubloonsDouglas Crockford, Yahoo!’s resident JavaScript guru and the developer of JSON, has provided a much-needed service in a three-part lecture series on the language. JavaScript is reportedly the second-fastest growing language to Ruby and has enjoyed a renaissance because of Ajax and rich Internet applications. JavaScript has always suffered from its unfortunate (and inaccurate) name, but it also suffers because of poor and outdated documentation and (general) lack of program comments or obfuscation to “minifying” to keep transferred script sizes small.

The first part in the lecture series (111 minutes) covers the basics of the language, its history and quirks:

Douglas Crockford provides a comprehensive introduction to the JavaScript Programming Language.

The second part places JavaScript and its development in relation to the DOM and browser evolution (77 min):

Then, the third part covers more advanced language topics such as debugging, patterns and the interesting alternative to traditional object classing using prototypal inheritance (67 min):

Crockford is also the developer of the JSLint code-checking utility and has a Web site chocked full of other tips and aids, including a very good set of JS coding standards. I highly recommend you find the three or four hours necessary to give these tutorials your undivided attention.

JavaScript: The Definitive Guide Crockford is generally disparaging about the state of most JavaScript books, though he does recommend the fifth edition of Dave Flanagan’s well-known JavaScript: The Definitive Guide. I know Doug has a real job, but all of us can hope he personally takes up the challenge of finally writing the definitive JavaScript guide. Crockford is definitely the guy to do it.

Because of the excellence of this series, it gets a J & D award.

Jewels & Doubloons An AI3 Jewels & Doubloon Winner

Posted by AI3's author, Mike Bergman Posted on February 20, 2007 at 1:08 pm in Jewels & Doubloons, Software Development | Comments (1)
The URI link reference to this post is:
The URI to trackback this post is:
Posted:February 19, 2007

Jewels & Doubloons

We’re So Focused on Plowing Ahead We Often Don’t See The Value Around Us

For some time now I have been wanting to dedicate a specific category on this blog to showcasing tools or notable developments. It is clear that tools compilations for the semantic Web — such as the comprehensive Sweet Tools listing — or new developments have become one focus of this site. But as I thought about this focus, I was not really pleased with the idea of a simple tag of “review” or “showcase” or anything of that sort. The reason such terms did not turn my crank was my own sense that the items that were (and are) capturing my interest were also items of broader value.

Please, don’t get me wrong. One (among many) observations within the past few months has been the amazing diversity, breadth and number of communities, and most importantly, the brilliance and innovation that I was seeing. My general sense in this process of discovery is that I have kind of stumbled blindly into many existing — and sometimes mature — communities that have existed for some time, but for which I was not part of nor privy to their insights and advances. These communities are seemingly endless and extend to topics such as semantic Web and its constituent components, Web 2.0, agile development, Ruby, domain-specific languages, behavior driven development, Ajax, JavaScript frameworks and toolkits, Rails, extractors/wrappers/data mining, REST, why the lucky stiff, you name it.

Announcing ‘Jewels & Doubloons’

As I have told development teams in the past, as you cross the room to your workstation each morning look down and around you. The floor is literally strewn with jewels, pearls and doubloons –tremendous riches based on work that has come before — and all we have to do is take the time to look, bend over, investigate and pocket those riches. It is that metaphor, plus in honor of Fat Tuesday tomorrow, that I name my site awards ‘Jewels & Doubloons.’

Jewels & Doubloons (or J & D for short) may get awarded to individual tools, techniques, programming frameworks, screencasts, seminal papers and even blog entries — in short, anything that deserves bending over, inspecting and taking some time with, and perhaps even adopting. In general, the items so picked will be more obscure (at least to me, though they may be very well known to their specific communities), but what I feel to be of broader cross-community interest. Selection is not based on anything formal.

Why So Many Hidden Riches?

I’ll also talk on occasion as to why these riches of such potential advantage and productivity to the craft of software development may be so poorly known or overlooked by the general community. In fact, while many can easily pick up the mantra of adhering to DRY, perhaps as great of a problem is NIH — reinventing a software wheel due to pride, ignorance, discontent, or simply the desire to create for creation’s sake. Each of these reasons can cause the lack of awareness and thus lack of use of existing high value.

There are better ways and techniques than others to find and evaluate hidden gems. One of the first things any Mardi Gras partygoer realizes is not to reach down with one’s hand to pick up the doubloons and plastic necklaces flung from the krewes’ floats. Ouch! and count the loss of fingers! Real swag aficionados at Mardi Gras learn how to air snatch and foot stomp the manna dropping from heaven. Indeed, with proper technique, one can end up with enough necklaces to look like a neon clown and enough doubloons to trade for free drinks and bar top dances. Proper technique in evaluating Jewels & Doubloons is one way to keep all ten fingers while getting rich the lazy man’s way.
Jewels & Doubloons are designated with either a medium-sized Jewels & Doubloons or small-sized (see below) icon and also tagged as such.

Past Winners

I’ve gone back over the posts on AI3 and have postdated J & D awards for these items:

Posted:February 8, 2007

The last 24 hours have seen a flurry of postings on the newly released Yahoo! Pipes service, an online IDE for wiring together and managing data feeds on the Web. Tim O’Reilly has called the Pipes service “a milestone in the history of the internet.” Rather than repeat, go to Jeremy Zawodny’s posting, Yahoo! Pipes: Unlocking the Data Web, where he has assembled a pretty comprehensive listing of what others are saying about this new development:

Using the Pipes editor, you can fetch any data source via its RSS, Atom or other XML feed, extract the data you want, combine it with data from another source, apply various built-in filters (sort, unique (with the “ue” this time:-), count, truncate, union, join, as well as user-defined filters), and apply simple programming tools like for loops. In short, it’s a good start on the Unix shell for mashups. It can extract dates and locations and what it considers to be “text entities.” You can solicit user input and build URL lines to submit to sites. The drag and drop editor lets you view and construct your pipeline, inspecting the data at each step in the process. And of course, you can view and copy any existing pipes, just like you could with shell scripts and later, web pages.