Understand the Code’s Structure and Model Architecture
In any approach to a new code base, especially one that is complex with many thousands of lines of code and multiple files, it is often useful to start with a visual overview of the architectural structure. Preferably, this understanding includes both the static structure (via class, component or package UML diagrams or equivalent) and data flows through the app. Component views of the file structure are one useful starting point for understanding the physical relationships among the JS files, the HTML, XHTML or XUL files, and any data files. A desirable next view would come from running the code base through a code visualizer such as MIT’s Relo, which enables interactive and gradual discovery and presentation of the class structure of a code base.
So, what other automated options short of reverse engineering to UML are there for gaining this structural understanding of the code base? Frankly, not many.
When All Else Fails, Go Manual
Manual diagramming, possibly with UML compliance, next raises the question of which drawing program to use. Like program editors, UML and drawing tools are highly varied and a matter of personal preference. I mostly hate selecting new tools since I only want to learn a few, but learn them well. That means tools represent real investments for me, which also means I need to spend substantial time identifying and then testing out the candidates.
Dia is a GTK+ based diagram creation program for Linux, Unix and Windows released under the GPL license. Dia is roughly inspired by the commercial Windows program Visio. It is more geared towards informal diagrams for casual use, but it does have UML drawing templates. Graphviz is a general drawing and modeling program used by some other UML open source programs. I liked this program for its general usefulness and kept it on my system after all evaluations were done, but it is really not automated enough on its own for UML use.
In the UML realm, at one end of the spectrum are “light”, simpler tools that have lower learning thresholds and fewer options. Some candidates here are Violet (which is quite clean and can reside in Eclipse, but has very limited formatting options) and UMLet (which does not snap to grid and also has limited color options). Getting a bit more complex, I liked UMLGraph, which allows the declarative specification and drawing of UML class and sequence diagrams. Among its related projects, LightUML integrates UMLGraph into Eclipse and XUG extends UMLGraph into a reverse engineering tool.
Since I use Eclipse as my general IDE, my actual preference is a UML tool integrated into that environment. There are about 40 or so UML options that support that platform, about a dozen of which are either free (generally “community versions” of commercial products) or open source. I reviewed all of them and installed and tested most.
LightUML is a little tricky to install, since it has external dependencies on Graphviz and the separate UMLGraph, though those are relatively straightforward to install. The biggest problem with LightUML for my purposes was its requirement to work off of Java packages or classes. I next checked out green. With green, UML diagrams are educational and easy to create and it provides a round-tripping editor. I found it interesting as an academic project but somewhat lacking for my desired outputs. Omondo’s free version of EclipseUML looks quite good, but unfortunately will only work with bona fide Java projects, so it was also eliminated from contention. I found eUML2 to be slow and it seemed to take over the Eclipse resources. I found a similar thing with Blueprint Software Modeler, 85 MB without the Eclipse SDK, which seems rather silly for the degree of functionality I was seeking.
Papyrus is a free, very new, Eclipse plug-in. It has only 5 standard UML diagrams and sparse documentation, but its implementation so far looks clean and its support for the DI2 (diagram interchange) exchange format is a plus (I’ve kept it on my system and will check in on new releases later). Visual Paradigm for UML has a free version usable in Eclipse, which is professionally packaged as is common for commercial vendors. However, this edition limits the number of diagram types per project and the download requires all options and a 117 MB install before the actual free version can be selected, and even then the implications of the various choices are obscure. (These not uncommon tactics are not really bait-and-switch, but oftentimes feel so.) The limit of one diagram type per project eliminates this option from consideration, plus it is incredibly complicated and a real resource hog. The community edition of MagicDraw UML operates similarly and with similar limitations.
None of these tools support JS reverse engineering or round-tripping. However, keeping everything in the Eclipse IDE would aid later steps when adding in-line code documentation (see below) to the code base. On the other hand, because of the lack of integration, and because UML diagramming needed to occur before code commenting, I decided I could forego Eclipse integration for a standalone tool so long as it supported the standard XMI (XML metadata interchange) format.
This decision led me back to again try StarUML, which is the tool I ended up using to complete my UML diagramming. I remain concerned about StarUML’s lack of recent development and reliance on Delphi and some proprietary components that limit it as a full open-source tool. On the other hand, it is extensible via user-added profiles and tags using XML, it is very intuitive to use, and it has very flexible export and diagramming options. Because of strong XMI support, should better options later emerge, there is no risk of lock-in.
Looking to the future, because of my Eclipse environment and the growing availability of JS editors in Eclipse (Aptana, which I use, but also Spket and JSEclipse), one project worth monitoring is UML2, an EMF- (and GMF-) based implementation of the UML 2.x metamodel for the Eclipse platform due out in June. This effort is part of the broader Model Development Tools (MDT) Eclipse initiative. UML2 builds are presently available for testing with other bleeding-edge components, but I have not tested these.
Throughout this process of UML tools investigations, I came to discover a number of preferences and criteria for selecting a toolset that include: 1) a free, open-source option; 2) easy to use and modify, with flexible user preferences; 3) support for UML 2.x and most or all UML diagrams; 4) support for XMI import and export; 5) extensible profiles and frameworks via an XML syntax, preferably with some profile building utilities; 6) operation within the Eclipse environment, measured by standard plug-in and update installs; 7) clean functionality and user interface; 8) the ability to handle large diagrams, particularly in tiling; 9) a variety of output formats; and 10) strength in the class, component and package diagrams of most use to me. Of course, these factors may differ substantially for you.
Now, Document the Code as You Dive Deeper
The most widely used approach is JSDoc, which is a Perl-based analog to Javadoc. JSDoc is the most established of the systems and has an interface familiar to most Java developers. The first version is being phased out, with the version 2 release apparently pending. There is a Google Group on this option. JSDocumenter is a graphical user interface built for Javadoc-like documenting using JSDoc. Within this family, the JSDoc-2 option appears to be the choice, since the initial developers themselves have moved in that direction and recognize earlier problems with JSDoc they are planning to overcome.
ScriptDoc is an attempt to standardize documentation formats, and is being pushed and backed by the Aptana open source JS IDE. The major differentiator for the ScriptDoc approach is the ability to link IDs in the code base to external .sdoc files that could perhaps have more extensive commenting than what might be desirable in-line. (The other side of that coin is the synchronization and maintenance of parallel files.) To date, ScriptDoc is also tightly linked with Aptana, with no known other platform support. Also, there is a disappointing level of activity on the ScriptDoc standards site.
On this and other bases, I think JSDoc-2 is the best choice of the options above. The general Javadoc form and expansion of necessary tags appear to have overcome earlier JSDoc limits, and the easy parser recognition for the comments appears sufficiently flexible. Plus, output formats can be tailored to be as fancy as desired.
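For readers who have not seen the Javadoc-like comment form discussed here, the following is a minimal sketch of what such in-line documentation looks like in JS. The function `formatNodeLabel` and its parameters are hypothetical, invented only for illustration, and the `@param` and `@return` tags are the common Javadoc-style ones; which tags a particular JSDoc release recognizes is an assumption you should check against that release.

```javascript
/**
 * Builds a display label for a node in a tree view.
 * (Hypothetical helper, used only to illustrate the comment form.)
 *
 * @param {string} name The node's name.
 * @param {number} childCount The number of direct children.
 * @return {string} The name followed by the child count in parentheses.
 */
function formatNodeLabel(name, childCount) {
  return name + " (" + childCount + ")";
}

console.log(formatNodeLabel("docs", 3));
```

The comment block sits directly above the function it describes, which is what lets a documentation parser pair the two without any other markers in the code.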
Code Validation and Debugging
Since these tools are beyond the scope of this present review, more detailed discussion of them will be left to another day.
It’s Taken Too Many Years to Re-visit the ‘Deep Web’ Analysis
It’s been seven years since Thane Paulsen and I first coined the term ‘deep Web’, perhaps representing a couple of full generational cycles for the Internet. What we knew then and what “Web surfers” did then have changed markedly. And, of course, our coining of the term and BrightPlanet’s publishing of the first quantitative study on the deep Web did nothing to create the phenomenon of dynamic content itself; we merely gave it a name and helped promote a bit of understanding within the general public of some powerful subterranean forces driving the nature and tectonics of the emerging Web.
BrightPlanet has uncovered the "deep" Web — a vast reservoir of Internet content that is 500 times larger than the known "surface" World Wide Web. What makes the discovery of the deep Web so significant is the quality of content found within. There are literally hundreds of billions of highly valuable documents hidden in searchable databases that cannot be retrieved by conventional search engines.
The day the study was released we needed to increase our servers nine-fold to meet news demand after CNN and then 300 major news outlets eventually picked up the story. By 2001 when the University of Michigan’s Journal of Electronic Publishing and its wonderful editor, Judith A. Turner, decided to give the topic renewed thrust, we were able to clean up the presentation and language quite a bit, but did little to actually update many of the statistics. (That version, in fact, is the one mostly cited today.)
Over the years there have been some books published and other estimates put forward, more often citing lower amounts in the deep Web than my original estimates, but, with one exception (see below), none of these were backed by new analysis. I was asked numerous times to update the study, and indeed had even begun collating new analysis at a couple of points, but the effort to complete the work was substantial and the effort always took a back seat to other duties and so was never completed.
Recent Updates and Criticisms
It was thus with some surprise and pleasure that I first found reference yesterday to Dirk Lewandowski’s and Philipp Mayr’s 2006 paper, “Exploring the Academic Invisible Web” [Library Hi Tech 24(4), 529-539], that takes direct aim at the analysis in my original paper. (Actually, they worked from the 2001 JEP version, but, as noted, the analysis is virtually identical to the original 2000 version.) The authors pretty soundly criticize some of the methodology in my original paper and, for the most part, I agree with them.
My original analysis combined a manual evaluation of the “top 60” then-extant Web databases with an estimate of the total number of searchable databases (estimated at about 200,000, which they incorrectly cite as 100,000) and assessments of the mean size of each database based on a random sampling of those databases. Lewandowski and Mayr note conceptual flaws in the analysis at these levels:
On the other hand, the authors also argued that my definition of deep content was too narrow, and overlooked certain content types such as PDFs now routinely indexed and retrieved on the surface Web. We also have had uncertain but tangible growth in standard search engine content, with the last cited amounts at about 20 billion documents since Google and Yahoo! ceased their war of index numbers.
Though not really offering an alternative, full-blown analysis, the authors use the Gale Directory of Databases to derive an alternative estimate of perhaps 20 billion to 100 billion documents on the deep Web of interest for academic purposes, which they later seem to imply also needs to be discounted by further percentages to get at “word-oriented” and “full-text or bibliographic” records that they deem appropriate.
My Assessment of the Criticisms
As noted, I generally agree with these criticisms. For example, since the time of original publication, we have seen the power-law distribution of most things on the Internet, including popularity and traffic. With such heavily skewed distributions, calculations based on means rather than medians will tend to produce overestimates. I also think that meaningful content types were both overused (more database-like records) and underused (PDF content that is now routinely indexed) in my original analysis.
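To make the mean-versus-median point concrete, here is a minimal sketch using made-up record counts. These numbers are purely hypothetical and come from neither my study nor the Lewandowski and Mayr paper; they only show how one very large database in a sample can pull the mean far above the median.

```javascript
// Arithmetic mean of a list of numbers.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Median: the middle value of the sorted list, or the average
// of the two middle values when the list length is even.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// Hypothetical sample of database sizes (records per database).
// Most are small; one is enormous, as in a heavily skewed distribution.
const sampleRecordCounts = [
  100, 200, 300, 400, 500, 1000, 2000, 5000, 50000, 1000000,
];

const sampleMean = mean(sampleRecordCounts);
const sampleMedian = median(sampleRecordCounts);

console.log("mean:", sampleMean);
console.log("median:", sampleMedian);
```

In this toy sample the single largest database accounts for most of the mean, so multiplying a database count by that mean says more about the outlier than about the typical source.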
However, the authors’ third criticism is patently wrong, since three different methods were used to estimate internal database record counts and the average sizes of each record they contained. I would also have preferred a more careful reading by the authors of my actual paper, since there are numerous other citations in error and mis-characterizations.
On an epistemological level, I disagree with the authors’ use of the term “invisible Web”, a label that we tried hard in the paper to overturn and that is fading as a current term of art. Internet Tutorials (initially, SUNY at Albany Library) addresses this topic head-on, preferring “deep Web” on a number of compelling grounds, including that “there is no such thing as recorded information that is invisible. Some information may be more of a challenge to find than others, but this is not the same as invisibility.”
Finally, I am not compelled by the authors’ simplistic, alternate partial estimate based solely on the Gale database, but they readily acknowledge not doing a full-blown analysis and having different objectives in mind. I agree with the authors in calling for a full, alternative analysis. I think we all agree that is a non-trivial undertaking and could itself be subject to newer methodological pitfalls.
So, What is the Quantitative Update?
Within a couple of years after the initial publication of my paper, I suspected the “500 times” claim for the greater size of the deep Web in comparison to what is discoverable by search engines may have been too high. Indeed, in later corporate literature and PowerPoint presentations, I backed off the initial 2000-2001 claims and began speaking in ranges from a “few times” to as high as “100 times” greater for the size of the deep Web.
In the last seven years, the only other quantitative study of its kind of which I am aware is documented in the paper, “Structured Databases on the Web: Observations and Implications,” conducted by Chang et al. in April 2004 and published in the ACM SIGMOD, that estimated 330,000 deep Web sources with over 1.2 million query forms, reflecting a fast 3-7 times increase in 4 years from the date of my original paper. Unlike the Lewandowski and Mayr partial analysis, this effort and others by that group suggest an even larger deep Web than my initial estimates!
The truth is, we didn’t know then — and we don’t know now — what the actual size of the dynamic Web truly is. (And, aside from a sound bite, does it really matter? It is huge by any measure.) Heroic efforts such as these quantitative analyses or the still-more ambitious efforts of UC Berkeley’s SIM School on How Much Information? still have a role in helping to bound our understanding of information overload. As long as such studies gain news traction, they will be pursued. So, what might today’s story look like?
First, the methodological problems in my original analysis remain and (I believe today) resulted in overestimates. Another factor today leading to a potential overestimate of the deep Web v. the surface Web would be the fact that much “deep” content is being more exposed to standard search engines, be it through Google Scholar, Yahoo!’s library relationships, individual site indexing and sharing such as through search appliances, and other “gray” factors we noted in our 2000-2001 studies. These factors, and certainly more, act to narrow the difference between exposed search engine content (“surface Web”) and what we have termed the “deep Web.”
However, countering these factors are two newer trends. First, foreign language content is growing at much higher rates and is often under-sampled. Second, blogs and other democratized sources of content are exploding. What these trends may be doing to content balances is, frankly, anyone’s guess.
So, while awareness of the qualitative nature of Web content has grown tremendously in the past near-decade, our quantitative understanding remains weak. Improvements in technology and harvesting can now overcome earlier limits.
Perhaps there is another Ph.D. candidate or three out there who may want to tackle this question in a better (and more definitive) way. According to Chang and Cho in their paper, “Accessing the Web: From Search to Integration,” presented at the 2006 ACM SIGMOD International Conference on Management of Data in Chicago:
On the other hand, for the deep Web, while the proliferation of structured sources has promised unlimited possibilities for more precise and aggregated access, it has also presented new challenges for realizing large scale and dynamic information integration. These issues are in essence related to data management, in a large scale, and thus present novel problems and interesting opportunities for our research community.
The first part in the lecture series (111 minutes) covers the basics of the language, its history and quirks:
Then, the third part covers more advanced language topics such as debugging, patterns and the interesting alternative to traditional object classing using prototypal inheritance (67 min):
Crockford is also the developer of the JSLint code-checking utility and has a Web site chock-full of other tips and aids, including a very good set of JS coding standards. I highly recommend you find the three or four hours necessary to give these tutorials your undivided attention.
Because of the excellence of this series, it gets a J & D award.
An AI3 Jewels & Doubloon Winner
We’re So Focused on Plowing Ahead We Often Don’t See The Value Around Us
For some time now I have been wanting to dedicate a specific category on this blog to showcasing tools or notable developments. It is clear that tools compilations for the semantic Web — such as the comprehensive Sweet Tools listing — or new developments have become one focus of this site. But as I thought about this focus, I was not really pleased with the idea of a simple tag of “review” or “showcase” or anything of that sort. The reason such terms did not turn my crank was my own sense that the items that were (and are) capturing my interest were also items of broader value.
Announcing ‘Jewels & Doubloons’
As I have told development teams in the past, as you cross the room to your workstation each morning look down and around you. The floor is literally strewn with jewels, pearls and doubloons (tremendous riches based on work that has come before), and all we have to do is take the time to look, bend over, investigate and pocket those riches. It is for that metaphor, and in honor of Fat Tuesday tomorrow, that I name my site awards ‘Jewels & Doubloons.’
Jewels & Doubloons (or J & D for short) may get awarded to individual tools, techniques, programming frameworks, screencasts, seminal papers and even blog entries — in short, anything that deserves bending over, inspecting and taking some time with, and perhaps even adopting. In general, the items so picked will be more obscure (at least to me, though they may be very well known to their specific communities), but what I feel to be of broader cross-community interest. Selection is not based on anything formal.
Why So Many Hidden Riches?
I’ll also talk on occasion as to why these riches of such potential advantage and productivity to the craft of software development may be so poorly known or overlooked by the general community. In fact, while many can easily pick up the mantra of adhering to DRY, perhaps as great a problem is NIH: reinventing a software wheel due to pride, ignorance, discontent, or simply the desire to create for creation’s sake. Each of these reasons can lead to a lack of awareness, and thus a lack of use, of existing high-value work.
Some ways and techniques are better than others for finding and evaluating hidden gems. One of the first things any Mardi Gras partygoer realizes is not to reach down with one’s hand to pick up the doubloons and plastic necklaces flung from the krewes’ floats. Ouch! and count the loss of fingers! Real swag aficionados at Mardi Gras learn how to air snatch and foot stomp the manna dropping from heaven. Indeed, with proper technique, one can end up with enough necklaces to look like a neon clown and enough doubloons to trade for free drinks and bar top dances. Proper technique in evaluating Jewels & Doubloons is one way to keep all ten fingers while getting rich the lazy man’s way.
Jewels & Doubloons are designated with either a medium-sized or small-sized (see below) icon and also tagged as such.
I’ve gone back over the posts on AI3 and have postdated J & D awards for these items:
The last 24 hours have seen a flurry of postings on the newly released Yahoo! Pipes service, an online IDE for wiring together and managing data feeds on the Web. Tim O’Reilly has called the Pipes service “a milestone in the history of the internet.” Rather than repeat, go to Jeremy Zawodny’s posting, Yahoo! Pipes: Unlocking the Data Web, where he has assembled a pretty comprehensive listing of what others are saying about this new development:
Using the Pipes editor, you can fetch any data source via its RSS, Atom or other XML feed, extract the data you want, combine it with data from another source, apply various built-in filters (sort, unique (with the “ue” this time:-), count, truncate, union, join, as well as user-defined filters), and apply simple programming tools like for loops. In short, it’s a good start on the Unix shell for mashups. It can extract dates and locations and what it considers to be “text entities.” You can solicit user input and build URL lines to submit to sites. The drag and drop editor lets you view and construct your pipeline, inspecting the data at each step in the process. And of course, you can view and copy any existing pipes, just like you could with shell scripts and later, web pages.
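The “Unix shell for mashups” idea in the passage above can be sketched in a few lines of plain JS. This is not the Pipes service or its API; the feed items, dates and helper names (`union`, `uniqueBy`, `sortByDescending`, `truncate`, `pipe`) are all hypothetical, and the “feeds” are in-memory arrays standing in for fetched RSS or Atom data. It only illustrates chaining small filters so that each step's output feeds the next.

```javascript
// Hypothetical in-memory "feeds"; a real pipeline would fetch RSS or Atom.
const feedA = [
  { title: "Pipes launches", source: "A", date: "2007-02-08" },
  { title: "Deep Web revisited", source: "A", date: "2007-02-05" },
];
const feedB = [
  { title: "Pipes launches", source: "B", date: "2007-02-08" },
  { title: "JSDoc tips", source: "B", date: "2007-02-07" },
];

// Combine any number of feeds into one list of items.
const union = (...feeds) => [].concat(...feeds);

// Keep only the first item seen for each distinct value of `key`.
const uniqueBy = (key) => (items) => {
  const seen = new Set();
  return items.filter((item) => {
    if (seen.has(item[key])) return false;
    seen.add(item[key]);
    return true;
  });
};

// Sort a copy of the items so the largest `key` value comes first.
const sortByDescending = (key) => (items) =>
  [...items].sort((a, b) => (a[key] < b[key] ? 1 : a[key] > b[key] ? -1 : 0));

// Keep at most the first `n` items.
const truncate = (n) => (items) => items.slice(0, n);

// Chain steps left to right, like `a | b | c` in a shell.
const pipe = (...steps) => (input) =>
  steps.reduce((data, step) => step(data), input);

const newestTwo = pipe(uniqueBy("title"), sortByDescending("date"), truncate(2));
const result = newestTwo(union(feedA, feedB));

console.log(result.map((item) => item.title));
```

Each step takes a list and returns a new list, so steps can be reordered or inspected one at a time, much as the drag-and-drop editor lets you inspect the data at each stage.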