Posted: June 19, 2007

Sweet Tools Listing

AI3's Sweet Tools Listing Updated to Version 9

This AI3 blog maintains Sweet Tools, the largest listing of about 800 semantic Web and -related tools available. Most are open source. Click here to see the current listing!

AI3's listing of semantic Web and -related tools has just been updated to version 9. This version adds 42 new tools since the last update on March 11, bringing the total to 542 tools. As noted in the last update, the compilation is now largely complete. These new listings are mostly tools announced in the last three months, though some are stragglers missed from previous listings.

As before, the updated Sweet Tools listing is provided both as a filterable and sortable Exhibit display (thanks again, David Huynh and MIT’s Simile program) and as a simple table for quick download and copying. (Hint: sort on ‘Posted’ for 6/19/07 to see the 42 latest additions.)

Background on prior listings and earlier statistics may be found in previous posts, with interim updates periodically over that period.

Posted: April 26, 2007


The Structured Web is But an Early Hurdle to the Semantic Web

About a year ago (May 30, 2006), Dr. Douglas Lenat, the president and CEO of Cycorp, Inc., gave a great talk at Google called Computers Versus Common Sense. Doug, a former professor at Stanford and Carnegie Mellon, has been working in artificial intelligence and machine learning his entire career. Since 1984 he has been working on the project Cyc, which subsequently formed the basis for starting Cycorp in 1994. Cycorp, as may become apparent below, does much work for the defense and intelligence communities.

Google Research selected this 70-min video as one of its best of 2006, and I have to heartily concur. Doug is very informative and is also an entertaining and humorous speaker.

But that is not the main reason for my recommendation. For what Doug presents in this video are some of the real common sense challenges of semantic matching and reasoning by computer. These are the threshold hurdles of intelligent agents and real-time answering and (perhaps) forecasting, the ultimate objectives that some equate to the “Semantic Web” (title case).

Because of the reasoning objectives Cycorp and its clients have set for the system, threshold conditions require not only more direct deductive reasoning (logical inference from known facts or premises), but also inductive (inferred based on the likelihood or belief of multiple assertions) and even abductive reasoning (likely or most probabilistic explanation or hypothesis given available facts or assertions). Doug makes clear the devilishly difficult challenges of determining semantic relevance when complete machine-based reasoning is an objective.

As I listened to the video, I interpreted the attempt to reach these objectives as bringing at least four major, and linked, design implications.

The first design implication is that the reasoning basis requires many facts and assertions about the world, the basis of the “common sense” in the knowledge base. An early lesson that some AI practitioners in the “common sense” camp came to hold was that learning systems that did not know much, could not learn much. When Cyc was started more than 20 years ago, Lenat and Marvin Minsky estimated on the back of an envelope that it would take on the order of 1000 person-years to create a knowledge base with sufficient world knowledge to enable meaningful reasoning thereafter. This is what Lenat has called “priming the pump” of the knowledge base with common sense to resolve so many classes of semantic and contextual ambiguities.

However, this large number of assertions has a second design implication, if I understood Lenat correctly, for the need for higher-order predicate calculus. These higher orders for quantification over subsets and relations or even relations over relations are designed to reduce the number of potential “facts” that need to be queried for certain questions. This makes the knowledge base (KB) as a whole more computationally tractable, and able to provide second or sub-second response times.
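To make that distinction concrete, here is a rough sketch in generic logic notation (not actual CycL): first-order logic quantifies only over individuals, while a higher-order statement can quantify over the relations themselves.

```latex
% First-order: quantification ranges over individuals only
\forall x \,\bigl(\mathit{Person}(x) \rightarrow \mathit{Mortal}(x)\bigr)

% Higher-order: quantify over relations, defining "transitive" once for all of them
\forall R \,\Bigl(\mathit{Transitive}(R) \leftrightarrow
  \forall x\,\forall y\,\forall z\,\bigl(R(x,y) \wedge R(y,z) \rightarrow R(x,z)\bigr)\Bigr)
```

A single assertion such as Transitive(R) then stands in for a separate transitivity rule for every such relation, which is the kind of compression that helps keep query times tractable.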

A third implication is that, again to maintain computational tractability, reasoning should be local (with local ontologies) and with specialized reasoning modules. Today, Cyc has more than 1000 such modules and reasoning in some local areas may not actually infer correctly across the global KB. Lenat likens this to the observation that local geography appears flat even though we know the entire Earth is a globe. This enables local simplifications to make the inferences and reasoning tractable.

Finally, the fourth implication is a very much larger number of predicates than in other knowledge bases or ontologies, actually more than 16,000 in the current Cyc KB. This large number comes about because of:

  1. simplifying some of the more complex expressions that frequently repeat themselves in some of the higher-order patterns noted above as new predicates, again to speed processing time, and
  2. providing more precise meanings to certain language verbs that humans can disambiguate because of context, but which pose problems to computer processing. For example, Cyc contains 23 different predicates relating to the word “in”.

Growing the Knowledge Base

By roughly the year 2000, sufficient “pump priming” had taken place that Cyc could itself be used to extend its knowledge base through machine learning. A couple of critical enablers for this process were the querying of Web search engines and the engagement of volunteers and others to test the reasonableness of new assertions for addition to the KB. The basic learning and expansion process is now generally:

formal predicate calculus language -> natural language queries -> issue to Web search engine -> translate results back to predicate language -> present inferences for human review -> accept / reject result (50%) -> add to KB (knowledge base)

The engagement of volunteers is also coming about through the use of online “games”. The open source release (see below) is another recent avenue. (BTW, the 50% acceptance rate does not mean that there are so many wrong “facts” on the Web, but rather that human language can employ context, humor, sarcasm or effect that does not actually lead to a “correct” common-sense assertion. Such observations do give pause to unbounded information extraction.)
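To summarize that loop in code form, here is a minimal, purely illustrative Python sketch; every helper name below is a hypothetical stand-in for a pipeline stage, not a Cyc API.

```python
# Illustrative only: hypothetical stand-ins for the pipeline stages described above.

def to_natural_language(assertion):
    """Render a (predicate, subject, object) candidate as a natural language query."""
    predicate, subject, obj = assertion
    return f"{subject} {predicate} {obj}?"

def search_web(query):
    """A real system would issue the query to a Web search engine here."""
    return ["page 1 snippet", "page 2 snippet"]

def to_candidate_assertions(hits, assertion):
    """Translate search results back into candidate assertions in predicate form."""
    return [assertion for _ in hits]

def human_review(candidate):
    """Volunteers accept roughly half of the candidates presented to them."""
    return hash(candidate) % 2 == 0

def grow_kb(kb, candidates):
    for assertion in candidates:
        hits = search_web(to_natural_language(assertion))
        for inferred in to_candidate_assertions(hits, assertion):
            if human_review(inferred):
                kb.add(inferred)      # only reviewed assertions enter the KB
    return kb

kb = set()
grow_kb(kb, [("isLocatedIn", "Chicago", "Illinois")])
print(kb)
```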

Thus, today, Cyc has grown to have a very significant scope, as some of these metrics indicate:

  • 16,000 predicates
  • 1,000 reasoning modules
  • 300,000 concepts
  • 4,000 physical devices
  • 400 event-participant relationships
  • 11,000 event types
  • 171,000 “names” (chemicals, persons, places, etc.)
  • 1,100 geospatial classes, 500 geospatial predicates
  • 3.2 million assertions

As might be expected, the overall KB continues to grow, and at an accelerating rate.

Threshold Conditions and the Structured Web

This all appears daunting. And, when viewed through the lens of the decades it has taken to get the knowledge base to this scale and the ambitiousness of its objectives, it is.

But semantic advantage and the semantic Web is not an either/or proposition. It is a spectrum of use and benefit, with Cyc representing a relatively extreme end of that spectrum.

In my recent post on Structurizing the Web with RDF, I noted a number of key areas in which the structured Web would bring benefit, including more targeted, filtered search results, better results presentation, and the ability to search on objects and entities, not simply documents. Structurizing the Web, short of full reasoning, is both an essential first step and a source of significant benefits in its own right.

The use of RDF is also not necessarily limiting compared to Cyc’s internal predicate calculus language. The subject-predicate-object “triple” of RDF and its reliance on first-order logic for deductive inference can also be used to express propositions about nested contexts, resulting in metalanguages for modal and higher-order logic. OWL itself is expressed as a metalanguage of RDF encoded in XML; and virtually all mathematics can be described through RDF.

Thus, Cyc, like DBpedia and related efforts to express Wikipedia as RDF, can itself be a rich source of structure for this evolving Web. Like other formats and frameworks, parts can be used in greater or lesser scope and complexity depending on circumstances. The sheer scope of Cyc’s world view in its knowledge base is a tremendous asset.

OpenCyc as a Useful Knowledge Base

The value of this asset increased enormously with the release of OpenCyc in early 2002. The OpenCyc knowledge base and APIs are available under the Apache License. There are presently about 100,000 users of this open source KB, which has the same ontology as the commercial Cyc, but is limited to about 1 million assertions. The most recent release is v. 1.02. An OWL version is also available. The Cyc Foundation has also been formed to help promote further extensions and development around OpenCyc.

There has been some hint on the OpenCyc mailing list of others interested in creating RDF versions of OpenCyc or portions thereof, with benefits similar to those of the RDF versions of Wikipedia via SPARQL endpoints and retrievals, mashups and RDF browsing. OpenCyc would also obviously be a tremendous boon to better inferencing for the datasets that are rapidly becoming exposed via RDF.

The idea of the local ontologies and reasoners that Lenat discusses — and the broader collaboration mechanisms emerging around other semantic Web issues and tools — bode well for a resurgence of interest in Cyc. After more than two decades, perhaps its time has truly come.

BTW, some interesting slides with pretty good overlap with Lenat’s Google talk can be found here from the Cyc Foundation.

Posted: April 23, 2007

In response to my review last week of OpenLink’s offerings, Kingsley Idehen posted some really cool examples of what extracted RDF looks like. Kingsley used the content of my review to generate this structure; he also provided some further examples using DBpedia (which I have also recently discussed).

What is really neat about these examples is how amazingly easy it is to create RDF, and the “value added” that results when you do so. Below, I again discuss structure and RDF (only this time more on why it is important), describe how to create it from standard Web content, and link to a couple of other recent efforts to structurize content.

Why is Structure Important?

The first generation of the Web, what some now refer to as Web 1.0, was document-centric. Resources or links almost always referred to the single Web page (or document) that displayed in your browser. Or, stated another way, the basic, most atomic unit of organizing information was the document. Though today’s so-called Web 2.0 has added social collaboration and tagging, search and display still largely occur in this document-centric mode.

Yet, of course, a document or Web page almost always refers to many entities or objects, often thousands, which are the real atomic units of information. If we could extract out these entities or objects — that is the structure within the document, or what one might envision as the individual Lego © bricks from which the document is constructed — then we could manipulate things at a more meaningful level, a more atomic and granular level. Entities now would become the basis of manipulation, not the more jumbled up hodge-podge of the broader documents. Some have called this more atomic level of object information the “Web of data,” others use “Web 3.0.” Either term is OK, I guess, but not sufficiently evocative in my opinion to explain why all of this stuff is important.

So, what does this entity structure give us, what is this thing I’ve been calling the structured Web?

First, let’s take simple search. One problem with conventional text indexing, the basis for all major search engines, is the ambiguity of words. For example, the term ‘driver’ could refer to a printer driver, Big Bertha golf driver, NASCAR driver, a driver of screws, family car driver, a force for change, or other meanings. Entity extraction from a document can help disambiguate what “class” (often referred to as a “facet” when applied to search or browsing) is being discussed in the document (that is, its context) such that spurious search results can be removed. Extracted structure thus helps to filter unwanted search results.
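As a toy illustration of that filtering step (the documents and class labels below are invented), entity classes attached to each document let the ambiguous query be narrowed by facet:

```python
# Toy example: each document carries the entity classes extracted from it,
# so a search on the ambiguous term "driver" can be filtered by facet.
documents = [
    {"title": "Installing a printer driver", "classes": {"software"}},
    {"title": "Big Bertha driver review",    "classes": {"golf equipment"}},
    {"title": "NASCAR driver standings",     "classes": {"person", "motorsport"}},
]

def search(term, facet=None):
    hits = [d for d in documents if term in d["title"].lower()]
    if facet:
        hits = [d for d in hits if facet in d["classes"]]
    return [d["title"] for d in hits]

print(search("driver"))                          # all three results, ambiguous
print(search("driver", facet="golf equipment"))  # only the golf club review
```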

Extracted entities may also enable narrowing search requests to, say, documents published only in the last month or from certain locations or by certain authors or from certain blogs or publishers or journals. Such retrieval options are not features of most search engines. Extracted structure thus adds new dimensions to characterize candidate results.

Second, in the structured Web, the basis of information retrieval and manipulation becomes the entity. Assembling all of the relevant information — irrespective of the completeness or location of source content sites — becomes easy. In theory, with a single request, we could collate the entire corpus of information about an entity, say Albert Einstein, from life history to publications to Nobel prizes to who links to these or any other relationship. The six degrees of Kevin Bacon would become child’s play. But sadly today, knowledge workers spend too much of their time assembling such information from disparate sources, with notable incompleteness, imprecision and inefficiency.

Third, the availability of such structured information makes for meaningful information display and presentation. One of the emerging exemplars of structured presentation (among others) is ZoomInfo. This, for example, is what a search on my name produces from ZoomInfo’s person search:

Example ZoomInfo Structured Result

Granted, the listing is a bit out of date. But it is a structured view of what can be found in bits and pieces elsewhere about me, being built from about 50 contributing Web sources. And, the structure it provides in terms of name, title, employers, education, etc., is also more useful than a long page of results links with (mostly) unusable summary abstracts.

Presentations such as ZoomInfo’s will become common as we move to structured entity information on the Web as opposed to documents. And, we will see it for many more classes of entities beyond the categories of people, companies or jobs used by ZoomInfo. We are, for example, seeing such liftoff occurring in other categories of structured data within sources like DBpedia.

Fourth, we can get new types of mash-ups and data displays when this structure is combined, from calendars to tabular reports to timelines, maps, graphs of relatedness and topic clustering. We can also follow links and “explore” or “skate” this network of inter-relatedness, discovering new meanings and relationships.

And, fifth, where all of this is leading, of course, is to the semantic Web. We will be able to apply description logics and draw inferences based on these relationships, resulting in the derivation of new information and connections not directly found in any of the atomic parts. However, note that much value comes from the first four areas of the structured Web alone, achievable immediately, short of this full-blown semantic Web vision.

OK, So What is this RDF Stuff Again?

As my earlier DBpedia review described, RDF — Resource Description Framework — is the data representation model at the heart of these trends. It uses a “triple” of subject-predicate-object, as generally defined by the W3C’s standard RDF model, to represent these informational entities or objects. In such triples, subject denotes the resource, and the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object. (You can think of subjects and objects as nouns, predicates as verbs, and even think of the triples themselves as simple Dick-and-Jane sentences from a beginning reader.)

Resources are given a URI (as may also be given to predicates or objects that are not specified with a literal) so that there is a single, unique reference for each item. (OK, so here’s a tip: the length and complexity of the URIs themselves make these simple triple structures appear more complicated than they truly are! ‘Dick’ seems much more complicated when it is expressed as http://www.dick-is-the-subject-of-this-discussion.com/identity/dickResolver/OpenID.xml.)

These URI lookups can themselves be an individual assertion, an entire specification (as is the case, for example, when referencing the RDF or XML standards), or a complete or partial ontology for some domain or world-view. While the RDF data is often stored and displayed using XML syntax, that is not a requirement. Other RDF forms may include N3 or Turtle syntax, and variants or more schematic representations of RDF also exist.
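To make the model concrete, here is a small sketch using the Python rdflib library. The URIs and the linksTo property are invented for illustration (they are not the actual triples extracted from my post), but the serialization calls are standard rdflib:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS

EX = Namespace("http://example.org/terms/")                  # invented vocabulary
post = URIRef("http://example.org/blog/my-openlink-review")  # invented subject URI

g = Graph()
g.add((post, DCTERMS.title, Literal("An example blog post")))        # literal object
g.add((post, EX.linksTo, URIRef("http://www.openlinksw.com/")))      # URI object

# The same graph can be written out in several RDF syntaxes
print(g.serialize(format="turtle"))
print(g.serialize(format="xml"))
```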

Here are some sample statements (among a few hundred generated, see later) from my reference blog piece on OpenLink that illustrate RDF triples:

The first four items have the post itself as the subject. The last statement is an entity referenced within my subject blog post. In all cases, the specific subjects of the triple statements are resources.

In all statements, the predicates point to reference URIs that precisely define the schema or controlled vocabularies used in that triple statement. For readability, such links are sometimes aliased, such as created at (time), links to, has title, is within topic, and has label, respectively, for the five example instances. These predicates form the edges or connecting lines between nodes in the conceptual RDF graph.

Lastly, note that the object, the other node in the triple besides the subject, may be either a URI reference or a literal. Depending on the literal type, the material can be full-text indexed (one triple, for example, may point to the entire text of the blog posting, while others point to each post image) or can be used to mash-up or display information in different display formats (such as calendars or timelines for date/time data or maps where the data refer to geo-coordinates).

[Depending on provenance, source format, use of aliases, or other changes to make the display of triples more readable, it may at times be necessary to "dereference" what is displayed to obtain the URI values to trace or navigate the actual triple linkages. Dereferencing in this case means translating the displayed portion (the "reference") of a triple to its actual value and storage location, which means providing its linkable URI value. Note that literals are already actual values and thus are not "dereferenced".]

The absolutely great thing about RDF is how well it lends itself through subsequent logic (not further discussed here) to map and mediate concepts from different sources into an unambiguous semantic representation [my 'glad' == (is the same as) your 'happy' OR my 'glad' is your 'glad']. Further, with additional structure (such as through RDF-S or the various dialects of OWL), drawing inferences and machine reasoning based on the data through more formal ontologies and description logics is also possible.
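A minimal sketch of that kind of mapping, again with rdflib and invented URIs: owl:sameAs is the usual device for asserting that two identifiers denote the same individual (for classes or concepts, owl:equivalentClass or skos:exactMatch would play the analogous role).

```python
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

my_glad = URIRef("http://example.org/mine/glad")       # invented identifiers
your_happy = URIRef("http://example.org/yours/happy")

g = Graph()
g.add((my_glad, OWL.sameAs, your_happy))  # assert the two identifiers denote the same thing
print(g.serialize(format="n3"))
```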

How is This Structure Extracted?

The structure extraction necessary to construct an RDF “triple” is thus pivotal, and may require multiple steps. Depending on the nature of the starting content and the participation or not of the site publisher, there is a range of approaches.

Generally, the highest quality and richest structure occurs when the site publisher provides it. This can be done either through various APIs with a variety of data export formats, in which case converters or translators to canonical RDF may be required by the consumers of that data, or through the direct provision of RDF itself. That is why the conversion of Wikipedia to RDF (done by DBpedia or System One with Wikipedia3) is so helpful.

I anticipate that, beyond Freebase, other sources, many of them public, will also become available as RDF or convertible with straightforward translators. We are at the cusp of a veritable explosion of such large-scale, high-quality RDF sources.

The next level of structure extractors are “RDFizers.” These extractors take other internal formats or metadata and convert them to RDF. Depending on the source, more or less structure may be extractable. For example, publishing a Web site with Dublin Core metadata or providing SIOC characterization for a blog (both of which I do for this blog site with available plugins, especially SIOC Plugin by Uldis Bojars or the Zotero COinS Metadata Exposer by Sean Takats), adds considerable structure automatically. For general listings of RDFizers, see my recent OpenLink review or the MIT Simile RDFizer site.

We next come to the general area of direct structure extraction, the least developed — but very exciting — area of gleaning RDF structure. There is a spectrum of challenges here.

At one end of the spectrum are documents or Web sources that are published much like regular data records. Like the ZoomInfo listing above or category-oriented sites such as Amazon, eBay or the Internet Movie Database (IMDb), information is already presented in a record format with labels and internal HTML structure useful for extraction purposes.

Most of the so-called Web “wrappers” or extractors (such as Zotero’s translators or the Solvent extractor associated with Simile’s Semantic Bank and Piggy Bank) evaluate a Web page’s internal DOM structure or use various regular expression filters and parsers to find and extract info from structure. It is in this manner, for example, that an ISBN number or price can be readily extracted from an Amazon book catalog listing. In general, such tools rely heavily on the partial structure within semi-structured documents for such extractions.
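As a trivially simplified sketch of what such a wrapper does (the HTML fragment below is made up, and real wrappers such as the Zotero translators work against the page's DOM or site templates rather than regular expressions alone):

```python
import re

# A made-up, Amazon-like fragment of semi-structured HTML
html = """
<li><b>ISBN-10:</b> 0123456789</li>
<li><b>List Price:</b> $29.95</li>
"""

# Pull out the labeled fields by exploiting the page's internal structure
isbn = re.search(r"ISBN-10:</b>\s*([\dX]{10})", html)
price = re.search(r"List Price:</b>\s*\$([\d.]+)", html)

print(isbn.group(1), price.group(1))   # 0123456789 29.95
```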

The most challenging type of direct structure extraction is from unstructured documents. Approaches here use a family of possible information extraction (IE) techniques including named entity extraction (proper names and places, for example), event detection, and other structural patterns such as zip codes, phone numbers or email addresses. These techniques are most often applied to standard text, but newer approaches have emerged for images, audio and video.

While IE is the least developed of all structural extraction approaches, recent work is showing it to be possible to do so at scale with acceptable precision, and via semi-automated means. This is a key area of development with tremendous potential for payoff, since 80% to 85% of all content falls into this category.

Structure Extraction with OpenLink is a Snap

In the case of my own blog, I have a relatively well-known framework in WordPress with the two resident plugins noted above that do automatic metadata creation in the background. With this minimum of starting material, Kingsley was able to produce two RDF extractions from my blog post using OpenLink’s Virtuoso Sponger (see earlier post). Sponger added to the richness of the baseline RDF extraction by first mapping to the SIOC ontology, followed by mapping to all tags via the SKOS (Simple Knowledge Organization System) ontology and to all Web documents via the FOAF (friend-of-a-friend) ontology.

In the case of Kingsley’s first demo using the OpenLink RDF Browser Session (which gives a choice of browser, raw triples, SVG graph, Yahoo map or timeline views), you can do the same yourself for any URL with these steps:

  1. Go to http://demo.openlinksw.com/DAV/JS/rdfbrowser/index.html
  2. Enter the URL of your blog post or other page as the ‘Data Source URI’, then
  3. Go to Session | Save via the menu system or just click on permalink (which produces a bookmark-friendly URL that can then be kept permanently or shared with others).

It is that simple. You really should use the demo directly yourself.

But here is the graph view for my blog post (note we are not really mashing up anything in this example, so the RDF graph structure has few external linkages, resulting in an expected star aspect with the subject resource of my blog post at the center):

Example OpenLink RDF Browser View

If you then switch to the raw triples view, and are actually working with the live demo, you can click on any URI link within a triple and get a JavaScript popup that gives you these further options:

Example View of 'Explore' Popup

The ‘Explore’ option on this popup enables you to navigate to that URI, which is often external, and then display its internal RDF triples rather than the normal page. In this manner, you can “skate” across the Web based on any of the linkages within the RDF graph model, navigating based on objects and their relationships and not documents.

The second example Kingsley provided for my write-up was the Dynamic Data Web Page. To create your own, follow these steps:

  1. Go to http://dbpedia.openlinksw.com:8890/isparql
  2. Enter your post URL as the Data Source URI
  3. Execute
  4. Click on the ‘Post URI’ in the results table, then
  5. Pick the ‘Dereference’ option (since the data is non-local; otherwise ‘Explore’ would suffice).

Again, it is that simple. Here is an example screenshot from this demo (a poor substitute for working with the live version):

Example OpenLink Dynamic Page View

This option, too, includes the ‘Explore’ popup.

These examples, plus other live demos frequently found on Kingsley’s blog (none of which requires more than your browser), show the power of RDF structuring and what can be done to view data and produce rich interrelationships “on the fly”.

Come, Join in the Fun

With the amount of RDF data now emerging, rapid events are occurring in viewing and posting such structure. Here are some options for you to join in the fun:

  • Go ahead and test and view your own URLs using the OpenLink demo sites noted above
  • Check out the new DBpedia faceted search browser created by Georgi Kobilarov, one of the founding members of the project. It will soon be announced; keep tabs on Georgi’s Web site
  • And, review this post by Patrick Gosetti-Murrayjohn and Jim Groom on ways to use the structure and searching power of RDF to utilize tags on blogs. They combined RSS feeds from about 10 people with some SIOC data using RAP and Dave Beckett’s Triplr. This example shows how RSS feeds are themselves a rich source of structure.
Posted: February 7, 2007

How to Process Your Own Large Libraries into Thumbnails

When I decided to upgrade my Sweet Tools semantic Web and -related tools listing, I wanted to add some images to make the presentation more attractive. It was also becoming the case that many metadata aggregation service providers were adopting image representations for data (see this D-Lib article). Since the focus of my listing is software, I could either install all of the programs and take screenshots (not doable given the numbers involved) or adopt what many others have used as a sort of visual index for content: thumbnails, or, as they are specifically called when applied to Web pages, thumbshots.

Quick Review of Alternatives

Unless you get all of your Web content via feeds or have been living in a cave, you may have recently contracted a form of popup vertigo. Since its introduction just a few months back, the Snap Preview Anywhere thumbnail popup has become the eggplant that eats Chicago, with more than a half million sites now reported to be using the service. Since I don’t want this service for my own blog (see below), I did not want to go through the effort of signing up for SPA or of restricting its use to just this posting (even though the signup appears clean and straightforward). So, I reproduce below what one of these Snap link-over popups looks like:

The sheer ubiquity of these popup thumbnails is creating its own backlash (check out this sample rant from UNEASYsilence and its comments), and early promoters such as TechCrunch have now gone to the use of a clickable icon [ ] for a preview, rather than automatically popping up the image from the link hover.

Not only had the novelty of these popups worn off for me, but my actual desired use for Sweet Tools was to present a gallery of images for multiple results simultaneously. So, besides its other issues, the Snap service was not suitable for my purpose.

I had earlier used a Firefox add-on called BetterSearch that places thumbnails on results pages when doing searches with Google (including international versions), Amazon, MSN Search, Yahoo!, A9, Answers.com, AllTheWeb, Dogpile.com, del.icio.us and Simpy.com. But, like the Snap service, I personally found this service to be distracting. I also did not like the fact that my use was potentially being logged and promo messages were inserted on each screen. (There is another Firefox browser extension called GooglePreview that appears less intrusive, but I have not tried it.) As it turns out, both of these services piggyback on a free (for some uses) thumbnail acquisition and serving service from Thumbshots.org.

Since my interest in thumbnails was limited and focused on a bounded roster of sites (not the dynamic results from a search query), I decided to cut out the middleman and try the Thumbshots.org source directly myself. However, my candidate sites are mostly obscure academic or semantic Web ones not generally in the top rankings, meaning that most of the Sweet Tools Web sites unfortunately had no thumbnails on Thumbshots.org.

Of course, throughout these investigations, I had always had the option of taking physical screen captures myself and converting them manually to thumbnails. This is a very straightforward process with standard graphics packages; I had done so often for other purposes using my standard Paint Shop Pro software. But with the number of Sweet Tools entries growing into the hundreds, such a manual approach clearly wouldn’t scale.

Knowing there are literally hundreds of cheap or free graphics and image manipulation programs out there, I thus set out to see if I could find a utility that would provide most, if not all, of the automation required.

My Sweet Tools records don’t change frequently, so I could accept a batch mode approach. I also wanted to size the thumbnails to whatever displayed best in my Exhibit presentation. As well, if I was going to adopt a new utility, I decided I might as well seek other screen capture and display flexibilities for other purposes. Importantly, I also needed the individual file names created to be unique and readable (not just opaque IDs). Finally, like any tool I ultimately adopt, I wanted quality output and professional design.

Off and on I reviewed options and packages, mostly getting disgusted with the low quality of the dross that exists out there, and appalled at the difficulty of using standard search services to find such candidates. (There are whole categories of content, such as products of all types, reviews, real data, market info and statistics, that are becoming nearly impossible to find effectively on the Web with current search engines; but those are topics for another day.)

Nonetheless, after much looking and trial runs of perhaps a dozen packages, I finally stumbled across a real gem, WebShot. (Reasons this product was difficult to find included its relatively recent vintage, apparent absence of any promotion, and the mismatch between the product name and Web site name.)

The WebShot Utility

WebShot is a program that allows you to take screenshots and thumbnails of web pages or whole websites. I find its GUI easy to use, but it also comes with a command line interface for advanced users or for high-volume services. WebShot can produce images in the JPG, GIF, PNG, or BMP formats. It was developed in C by Nathan Moinvaziri.

The program is free for use on Windows XP, though PayPal donations are encouraged. Nominal charges apply for other Windows versions and for use of the command line. Linux is not supported, and Internet Explorer must be installed.

The graphical UI on Windows XP has a standard tabbed design. Single thumbnails, or batches driven from a text file, may be produced. Output files can be flexibly sized and written in any of the above formats. The screen capture itself can be based on required or maximum and minimum browser display dimensions. There are a variety of file naming parameters, and system settings allow WebShot to work in Web-friendly ways. Here’s an example of the Image tab for the GUI:

The command-line version accepts about 20 different parameters.

Depending on settings, you can get a large variety of outputs. The long banner image to the left, for example, is a “complete” Web page dump of my Web site at the time of this posting, with about 8 consecutive posts shown (160 x ~2300). The system automatically stitches together the multiple long page screenshots, with the resolution in this case being set by the input width parameter of 160 pixels.

Another option is this sample “cropped” one (440 x 257) where I’m actually cutting the standard screen display to about 50% of its normal vertical (height) dimension:

And, then, the next example shows what I have chosen as my “standard” thumbnail size (160 x 120) (I added the image borders, not the program):

In batch mode, I set the destination parameter such that I got both a logical domain portion in the file name (%d) and a hashed portion (%m), since there were a few occasions of multiple, different Web pages from the same host domain.

As noted, download re-tries, delays and timeouts are all settable to be a good Web citizen while getting acceptable results. With more-or-less standard settings, I was able to complete the 400 thumbnail downloads (without error, I should mention) in just a few minutes for the Sweet Tools dataset.

How I Do Bulk Thumbnails for Sweet Tools

Your use will obviously vary, but I kept notes for myself so that I could easily repeat or update this batch process (in fact, I have done so already a couple of times with the incremental updates to Sweet Tools). This general work flow is:

  1. Create a text file with host Web site URLs in spreadsheet order
  2. Run WebShot with these general settings:
    • destination switches of %d%m (core domain, plus hash)
    • image at 160w x 120h (my standard; could be anything as long as proper aspect maintained)
    • use of Multiple tab, with a new destination directory for each incremental update
    • browser setting at 1024 x 768 required (most common aspect today); min of 800 x 600; highest quality image
  3. At completion, go to a command window and write out the image file names (images complete in the same order as submitted). (In Windows, this is the dir /o:d > listing.txt command.) Then, copy the file names in the resulting text file back into the spreadsheet for the record-to-image correspondence (a scripted sketch of this pairing step appears after this list).
  4. Upload to the appropriate WordPress image directory.
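For anyone who would rather script step 3 than use the command window, here is a small illustrative Python sketch (the file and directory names are hypothetical); it lists the thumbnails in date order, mirroring dir /o:d, and pairs them with the submitted URLs:

```python
import csv
from pathlib import Path

# Hypothetical locations: the URL list submitted to WebShot and its output directory
urls = [line.strip() for line in open("sweet_tools_urls.txt") if line.strip()]
thumbs = sorted(Path("thumbnails").glob("*.jpg"), key=lambda p: p.stat().st_mtime)

# Images complete in the same order as submitted, so a date-ordered listing
# lines up with the original URL order
with open("url_to_thumbnail.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["url", "thumbnail"])
    for url, thumb in zip(urls, thumbs):
        writer.writerow([url, thumb.name])
```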

Some Other Tips

Like many such tools, there is insufficient documentation for the WebShot package. But, with some experimentation, it is in fact quite easy to accomplish a number of management or display options. Some of the ones I discovered are:

  • If harvesting multiple individual Web pages from the same domain, use the domain (%d) and hash (%m) options noted above
  • For complete capture of long Web pages (such as the image of my own Web site to the left), first decide on a desired resolution set via ‘width’ on the Image tab, leave height blank, and leave the browser settings open
  • For partial screen captures without distortion, set the image dimensions to the desired final size with height at the desired partial percentage value, then adjust the browser dimensions to match the image aspect.
Jewels & Doubloons An AI3 Jewels & Doubloon Winner
Posted: January 24, 2007

Structure for the Masses

or

Instant Mashups between WordPress, Google Spreadsheets and Exhibit

The past couple of days have seen a flurry of activity and much excitement revolving around a new “database-free” mashup and publication system called Exhibit. Another in a string of sneaky-cool software from MIT’s Simile program (and written by David Huynh, a pragmatic semantic Web developer of the first order), Exhibit (and its sure-to-follow rapid innovations) will truly revolutionize Web publishing and the visualization and presentation of structured data. Exhibit is quite simply “structure for the masses.”

What is It?

With just a few simple steps, even the most novice blog author can now embed structured data — such as sortable and filtered table displays, thumbnails, maps, timelines and histograms — in her blog posts and blog pages. Using Exhibit, you can now create rich data visualizations of web pages using only HTML (and optional CSS and Javascript code).

Exhibit requires no traditional database technology, no server-side code, and no need for a web server. Here is a sampling of Exhibit’s current capabilities:

  • No external databases or hassles
  • Data filtering and sorting
  • Simple HTML embeds and calls
  • Automatic and dynamic page layouts and results rendering
  • Completely tailorable (with CSS and Javascript)
  • Direct updates and presentations (mashups) from Google spreadsheets
  • Pre-prepared timelines, map mashups, tabular display options, and Web page formatting
  • Easily embedded in WordPress blogs (see the first tutorial here).

Exhibit is as simple as defining a spreadsheet; after that you have a complete database! And, if you want to get wild and crazy with presentation and display, then that is easy as well!

What Are Some Examples?

Though Exhibit was released barely one month ago, already there are some pretty impressive examples:

What Are People Saying?

Granted, we’re only talking about the last 24 hours or so, but interesting people are noticing and commenting on this phenomenon:

  • Ajaxian: “Exhibit is a new project that lets you build rich sorting and filtering data applications in a simple way.”
  • Danny Ayers: “Although the person using these tools doesn’t need to think about gobbledegook like RDF, when they use the tools, they are putting first class data on the web in a Semantic Web-friendly fashion.”
  • David Huynh: “Now, we’re not just talking about the usual suspects of domains like photos and bookmarks. We’re talking about any arbitrary domain of data — Yes, the real world data. The data that people care about. I am hoping that we can create tools that let the average blogger easily publish structured data without ever having to coin URIs or design or pick ontologies — But there it is: this is, in my humble opinion, a beginning of something great.”
  • Kyler: “If you don’t see the value of this, you are a fool.”
  • Derek Kinsman: “Exhibit is an amazing web app … I am beginning to work alongside my WordPress mates in the hopes that we can create some sort of Administration area inside WordPress that connects to the Google accounts. Right inside WP. Also, we’re attempting to create some sort of plugin or downloadable template to which Exhibit can run mostly out of the box.”

What is Coming?

Johan Sundström has created an Instant Google Spreadsheets Exhibit, which lets you turn any Google spreadsheet (with certain formatting requirements) into an “exhibit” just by pasting in its public feed URL with immediate faceted browsing; maps and timelines are forthcoming.

Well, a WordPress plug-in is in the works (to be announced, with Derek helping to take the lead on it). Though incorporation into a blog is easy, it does require the author to have system administration rights and access to the WordPress server. A plug-in could remove those hurdles and make usage still easier.

Exhibit’s very helpful online tutorials are being expanded, particularly with more examples and more templates. For those seriously interested in the technology, definitely monitor the Simile project site.

There continues to be activity and expansion of the Babel translation formats. You can now convert BibTeX, Excel, Notation 3 (N3), RDF/XML or tab-separated values (TSV) to a choice of Exhibit JSON, N3 or RDF/XML. And, since Exhibit itself internally stores its data representation as triples, it is tantalizing to think that another Simile project, RDFizers, with its impressive storehouse of RDF converters, may also be more closely tied with Babel. Is it possible that Exhibit JSON may become the lingua franca of small-scale data representation formats?
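As a rough sketch of what such a conversion amounts to (the file names are hypothetical, and Babel itself should be used for anything beyond the simplest cases), Exhibit's JSON format is essentially an object with an "items" array, each item carrying at least a "label":

```python
import csv, json

# Read a tab-separated listing (hypothetical file) and emit Exhibit-style JSON
with open("sweet_tools.tsv", newline="") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

# Each row becomes an item; here a "Name" column is assumed to supply the label
exhibit_data = {"items": [dict(row, label=row.get("Name", "")) for row in rows]}

with open("sweet_tools.json", "w") as out:
    json.dump(exhibit_data, out, indent=2)
```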

And, within the project team of Huynh and his Ph.D. thesis advisor, David Karger, there are also efforts underway to extend the syntax and functionality of Exhibit. We’ve just seen the expansion to direct Google spreadsheet support, and support for more spreadsheet functionality is desired, including possible string concatenation and numeric operations.

Exhibit itself has been designed with extensibility in mind. Its linkage to Timeline is one such example. What will be critical in the weeks and months ahead is the development of a developer and user community surrounding Exhibit. There is presently a fairly active mailing list, and I’m sure the MIT folks would welcome serious contributions.

Finally, other aspects of the Simile project itself and related initiatives at MIT have direct and growing ties to Exhibit, both in terms of team members and developers and in terms of philosophy. You may want to check out these additional MIT projects, including Longwell, Piggy Bank, Solvent, Semantic Bank, Welkin, DSpace, Haystack, Dwell, Ajax, Sifter, Relo plugin, Re:Search, Chickenfoot, and LAPIS. This is a program on the move, to which the closest attention is warranted.

Expected Growing Pains

There are some known issues with display in the Safari and Opera browsers; these are being worked on and should be resolved shortly. There are also some style issues and conflicts when embedding in blogs (easily fixed with CSS modifications). There are likely performance problems when data sets get into the hundreds or thousands, but that exceeds Exhibit’s lightweight objectives anyway. There may be other problems that emerge as use broadens.

These issues are to be expected and should not diminish playing with the system immediately. You’ll be amazed at what you can do, and how rapidly with so little code.

It has been a fun few days. It’s exciting to be able to be a camp follower during one of those seminal moments in Web development. And, so I say to David and colleagues at MIT and the band of merry collaborators on their mailing list: Thanks! This is truly cool.

Jewels & Doubloons An AI3 Jewels & Doubloon Winner