Evolution
AI³
Adaptive Information
Adaptive Innovation
Adaptive Infrastructure
a·dap·tive adj. Showing or having a capacity to make fit for new or special situations; flexible; a successful adjustment.

Blogasbörd (cloud version):
Send Email   Get SIOC Profile   Get FOAF Profile   Syndicate full contents for this site using RSS 20
Main Links
Categories
Calendar
February 2013
S M T W T F S
« Jan    
 12
3456789
10111213141516
17181920212223
2425262728  
Archives
More . . .  
Credits
Blog software courtesy of WordPress Site Meter View Mike's profile on LinkedIn
6295
Search
Date:   July 18, 2007
Image from Paul Thiessan
The structured Web is object-level data within Internet documents and databases that can be extracted, converted from available forms, represented in standard ways, shared, re-purposed, combined, viewed, analyzed and qualified without respect to originating form or provenance.

Over the past few months I have increasingly been writing about and referring to the structured Web. I have done so purposefully, but, so far, with little background or explication. With the inauguration of this occasional series, I hope to bring more color and depth to this topic [1].

Literally, over the past year, I have been learning and documenting on AI3 my attempts to understand the basis, concepts and tools of the emerging semantic Web. In that process, I have come to define my own outlines of the Web past, present and future. Within this world view, I see the structured Web as today’s current imperative and reality.

Confusing Terminology Surrounding Obvious Change

Some Web pundits have embraced a versioning terminology of Web 2.0 and Web 3.0 to describe one such world view. I don’t personally agree with this silly versioning — indeed I poked fun in a tongue-in-cheek posting about Web 98.6 more than a year ago — but such terminology has gotten some traction and serves a purpose. I actually give my own definitions for such “versions” below if for no other reason than to close the gap with alternative world views.

We need not go back to the alternative early protocols of Usenet (and news groups), Gopher and FTP and their search engines of Veronica, WAIS, Jughead or Archie in 1991 [2] when Tim Berners-Lee first publicly announced the World Wide Web and its combination of hypertext with the Internet. More likely, the release of the Mosaic browser and CERN‘s decision to make access to the Web free in 1993 marked the true take-off point for the Web and the continued demise of the competing protocols.

Images and links in Web pages (“documents”) plus the HTML mark-up language to enable the styling and graphical design of those pages were very much in keeping with general trends, paralleling the earlier transition of personal computers to graphical interfaces and away from terminals. Mosaic became the foundation for the Netscape browser, best links compilations became a big hit through sites like Yahoo!, and the Lycos search engine, one of the first profitable Web ventures, indexed a mere 54,000 pages when it was publicly released in 1994 [3].

This initial start to the Web — today now referred to by some as ‘Web 1.0′ — can be squarely timed to 1993-1994. By 1995, the Web was appearing on the covers of major news magazines and by 1996 the phenomenon was at full throttle. But, since these early beginnings, the Web has gone through many different “versions” and transitions, most not fitting with version numbers, as some of these examples show:

  • Academic v. Commercial Web — magazines like Wired, Red Herring, Business 2.0 and the mainstream press showered us with names such as e-commerce, dot-com and the gold rush for companies to establish a Web presence, B2B, etc. in the latter part of the 1990s. In fact, for some early architects of the Web, this was a period of some trauma and handwringing, since the “pure” open and academic roots of the Internet and the Web were being taken over by mainstream use, commercialization and the monied dominance of venture capital. This first major change in the Web, its first major new ‘version’ if you will, came back down to earth as a result of the “dot-com bust” of the bubble in 2001 [4]
  • Static v. Dynamic Web — all initial Web content was based on documents created by hand and posted as individual and hyperlinked Web pages. The relatively few documents of the early Web meant that hand-compiled “best of” listings such as Yahoo! worked pretty well; ‘metasearchers‘ also emerged to overcome the limited indexing coverage of early search engines. These trends, however, were also masking another version sea-change for the Web. With growth and more content, many larger sites were moving to dynamic page generation with retrieval via search forms. This dynamic portion of the Web, called at times either the ‘deep Web‘ or ‘invisible Web,’ acted like standard search engines and therefore was generally overlooked until I first popularized this change in 2000 [5]. I would argue that the shift to dynamic content, with certainly hundreds of thousands of such database-backed sites now in existence — and content many times larger than what is indexed by standard search engines — was also a major version shift for the Web
  • Open Source and Open Data –the open source Linux and the Apache Web server have been two software foundations to the growth of the Web, and MySQL has had a leading role in supporting sites and software with database-backed designs [6]. It is beyond the scope of this piece, but I believe that the dot-com frenzy, the demise of Netscape by Internet Explorer and other tensions with commercial interests, plus the very empowering nature of the Internet itself are also leading to a version change of the Web from commercial software products to open source ones. Further, proprietary publishers and data sources have only had limited success on the Web; we are now seeing strong trends to open data as well. Additionally, the very nature of open source software lends itself to interoperability and modularity based on naturally selected building blocks. This “open” infrastructural basis of the Web is more subtle and hard to see, but provides some powerful drivers for how more surface-oriented trends express themselves
  • Social Networking Web — the same early software that enabled dynamic Web pages and database-backed designs naturally lent themselves to early blogs, wikis and content-management systems, many backed by MySQL, which in turn led to more community-oriented designs and services such as del.icio.us for bookmarking, Flickr for photos, later YouTube for videos, and literally thousands of others. This trend, resulting from changed practices and the use of different tools and ways to harness user-generated content, and not resulting from any changes to standards per se, was first called ‘Web 2.0′ by Tim O’Reilly in about 2003
  • Ajax and Widgets — some would include Web services, APIs and ‘mashups‘ in the Web 2.0, often as expressed through embedded Web ‘widgets‘ and the use of Ajax or similar dynamic scripting approaches. These considerations were not part of the original Web 2.0 term, but usage today likely embraces aspects of these in many definitions of Web 2.0. In any case, there is certainly a change within the Web to more interactive, attractive, full-featured user interfaces, with interface updates no longer requiring a full Web page refresh
  • Document-centric Web v. Data-centric Web — however, in any event, portions of these trends and changes are more broadly combining to represent another version change in the Web from one solely focused on documents to one that is more data-centric; this topic, the basis for the term ‘structured Web,’ is more fully discussed below
  • Web 3.0 — Wikipedia states, “Web 3.0 is a term that has been coined with different meanings to describe the evolution of Web usage and interaction among several separate paths. These include transforming the Web into a database, a move towards making content accessible by multiple non-browser applications, the leveraging of artificial intelligence technologies, the Semantic web, or the Geospatial Web.” Of all current terms, this one is fully the silliest, since there is no consensus on what it represents nor its endpoints
  • Semantic Web — the glossary at W3C states that the semantic Web is “the Web of data with meaning in the sense that a computer program can learn enough about what the data means to process it.” Elsewhere, the vision of the semantic Web is described by the Education and Outreach working group (SWEO) of the W3C “to extend principles of the Web from documents to data. This extension will allow to fulfill more of the Web's potential, in that it will allow data to be shared effectively by wider communities, and to be processed automatically by tools as well as manually.” Note the importance of computer processing and autonomy in these statements, not to mention the pivotal term of ‘semantics.’ This is an expansive and wide-embracing vision, some challenges of which I more fully describe below, and
  • Visions of the Web — the semantic Web vision is matched with other visions, including voice activation, autonomous agents doing our bidding in the background, wireless interlinked everything, and other versions of the Web that are sometimes portrayed in science fiction. Whenever such transitions occur, they will all surely rely on all the various “versions” of the Web that have occurred in the short past 15 years of the Web’s existence.

Despite these differences in viewpoint, language does matter. Though some may view language as a contest in “branding,” which can legitimately apply in other venues, I think the issue here goes well beyond “branding.” Language is also necessary to aid communication.

As I explain below and elaborate upon more fully throughout this series, I believe one of the correct terms for the current evolutionary state of the Web is the ‘structured Web.’

A Clear Transition to a Data-centric Web

As noted, portions of these trends and changes are more broadly combining to represent another transitional change in the Web from one solely focused on documents to one that is more object- or data-centric. Evidence of this trend includes such factors as:

  • Broad database-backed Web site designs, with content re-purposed and served up dynamically, the trend first noted as the ‘deep Web’
  • ‘Mashups’ of data from multiple sources, such as in maps, timelines, etc.
  • The exposure of Web services and APIs. The programmableweb.com, for example, documents a doubling of such sources in the past nine months via its listing (as of July 2007) of about 500 APIs and more than 2,100 mashups
  • Huge growth and availability of large, often public, data sources, from US government and social sources like DBpedia, an RDF data extraction from Wikipedia (and others)
  • The emergence of entire data-centric sources, services and mashup platforms such as Freebase, Yahoo! Pipes, Google Base, Teqlo, QEDwiki, Ning, and OpenKapow
  • The rapid — and now almost universal — availability of data format converters (mostly to RDF) such as the listings of the W3C’s RDF Converters and MIT’s ‘RDFizers,’ the GRDDL initiative, Triplr, and the like
  • Soon, other to-be announced major data source look-up references, directories and conversion and filtering services.

One of the most popular series of presentations at this year’s WWW2007 conference in Banff was from the Linked Open Data project of the SWEO interest group. The members of this LOD project — comprised of accomplished advocates, developers and theorists — are providing the awareness, tools and example data that are showing how this emerging version may look. In fact, the group has just announced crossing the threshold of 1 billion ‘triples’ with 180,000 interlinks within its online DBpedia service, via these sources:

The LOD’s term for this effort is ‘linked data‘, and a Web site has been established to promote it. Others, harking back to Tim Berners-Lee’s original definition, refer to current efforts as a ‘Web of data’ or the ‘Semantic Data Web.’ Kingsley Idehen has been promoting the idea of ‘data spaces‘ — personal and collective — that is also a powerful metaphor.

Frankly, I think all of these terms are correct and useful. Yet I prefer the term structured Web because it is both more and less than some of these other terms.

The structured Web is more in that it pertains to any data formalism in use on the Web and includes the notion of extracting structure from uncharacterized content, by far the largest repository of potentially useful information on the Web. Yet the structured Web is also less because its ambition is solely to get that data into an interoperable framework and to forgo the full objectives of the ‘Semantic Web.’ In that regard, my concept of the structured Web is perhaps closest to the idea of linked data, though with less insistence on “correct” RDF and with specific attention to structure extraction from uncharacterized content.

Remarkable Progress on a Still Incomplete Journey

One of today’s realities is that we have accomplished much but still have a long way to go to achieve the grand vision of the ‘Semantic Web’ (capitalized).

More than a year ago I wrote a piece on “Climbing the Data Federation Pyramid” that noted the tremendous progress that has been made in the last twenty years in overcoming many seemingly intractable issues in data interoperability, initially of a physical and hardware nature. The Internet and Web standards have made enormous contributions to that progress.

The diagram I used in that piece is shown below [7]. Reaching the pyramid’s pinnacle could be argued as having achieved the grand vision of the Semantic Web. With the adoption of the Internet and Web protocols, all layers up through data representation have largely been solved. Data representation, data models, schema for different world views, and means for reconciling and mediating those different world views are much of the focus of today’s conceptual challenges.

Note, as we discuss the structured Web that we are largely focusing on the layer dealing with data representation, with some minor portions (principally in disambiguation) dealing with semantics. Getting data into a canonical data representation or model still leaves very crucial challenges in what does the data mean (its semantics), reasoning over that data (inference and pragmatics), and whether the data is authoritative or can be trusted. These are the daunting — and largely remaining challenges — of the Semantic Web.

For example, let’s look solely at the layer of semantics, the immediate challenge after data representation. By semantics, we are referring to whether different statements from different sources indeed refer or not to the same entity or concept; in other words, have the same meaning. Such a determination is pivotal if we are to combine data from multiple sources.

The use of RDF, accurate name spaces and syntactically correct URIs aid this resolution, but do not completely solve it. Ultimately, semantic mediation (such as my “glad” is equivalent to your “happy”) means resolving or mediating potential heterogeneities from on the order of 40 discrete categories of potential mismatches from units of measure, terminology, language, and many others. These sources may derive from structure, domain, data or language, as shown in this table [8]:

Class Category Sub-category
STRUCTURAL

Naming

Case Sensitivity
Synonyms
Acronyms
Homonyms
Generalization / Specialization
Aggregation Intra-aggregation
Inter-aggregation
Internal Path Discrepancy
Missing Item Content Discrepancy
Attribute List Discrepancy
Missing Attribute
Missing Content
Element Ordering
Constraint Mismatch
Type Mismatch
DOMAIN Schematic Discrepancy Element-value to Element-label Mapping
Attribute-value to Element-label Mapping
Element-value to Attribute-label Mapping
Attribute-value to Attribute-label Mapping
Scale or Units
Precision
Data Representation Primitive Data Type
Data Format
DATA Naming Case Sensitivity
Synonyms
Acronyms
Homonyms
ID Mismatch or Missing ID
Missing Data
Incorrect Spelling
LANGUAGE Encoding Ingest Encoding Mismatch
Ingest Encoding Lacking
Query Encoding Mismatch
Query Encoding Lacking
Languages Script Mismatches
Parsing / Morphological Analysis Errors (many)
Syntactical Errors (many)
Semantic Errors (many)

Using the same data model (say, RDF) or the same name spaces (say, Dublin Core or FOAF) helps somewhat to remove some of these sources of heterogeneity, but not all. Undoubtedly, longer term, resolving these heterogeneities will prove tractable. But they are not easily so today.

This observation does not undercut the Semantic Web vision nor negate the massive labors in support of that vision taken to date. But, hopefully, this observation may bring some perspective to the task ahead to obtain that vision.

Lowering Our Sights

If nothing else, the reality of the past 15 years shows us that the Web is a “dirty,” chaotic place. If HTML coding can be screwed up, it will. If loopholes in standards and protocols exist, they will be exploited. If there is ambiguity, all interpretations become possible, with many passionately held. Innovation and unintended uses occur everywhere.

This should not be surprising, and experienced Web designers, scientists and technologists should all know this by now. There can be no disconnect between workable standards and approaches and actual use in the “wild.” Nuanced arguments over the subtleties of standards and approaches are bound to fail. Robustness, simplicity and forgiveness must take precedence over elegance and theoretical completeness.

While there has been obvious growth in the sophistication of Web sites and the underlying technologies that support them, we see continued use of obsolete approaches that clearly should have been abandoned long ago (such as Web-safe colors, small displays, older browser versions, Web pages parked on some servers that have not been modified or looked at by their original authors in a decade, etc.). We also see slow uptake for clearly “better” new approaches. And we also sometimes see explosive uptake of approaches and ideas that seemingly come out of nowhere.

We also see that those approaches that enjoy the greatest success — blogging, tagging, microformats, RSS, widgets, for example, come most recently to mind — are those that the “citizen” user can easily and readily embrace. HTML was pretty foreign at first, but now millions of users modify their own code. Millions of users of various CMS systems and Firefox have learned how to install plug-ins and add-ins and modify CSS themes and use administration consoles.

So, my observation and argument is not that we must always choose what is mindless and unchallenging. But my argument is that we must accept real-world diversity and seek simplicity, robustness and clarity for what is new.

After nearly a decade of standards work, the basis for beginning the transition to the semantic Web is in place. But that vision itself sometimes appears too demanding, too intimidating. The vision at times appears all too unreachable.

Of course, this perception is wrong. Measured over many years, perhaps some decades, the vision of the semantic Web is reachable. Much remains to be worked on and understood regarding this vision in terms of mediating and resolving semantic heterogeneities, capturing and expressing world views through formal ontologies, making inferences between these views, and establishing trust and authoritativeness. And those challenges do not yet address the even more-exciting prospects of intelligent and autonomous agents.

Rather, the rationale for the structured Web is to tone down the vision, stay with the here and now, focus on what is achievable today. And what is achievable today is very great.

Why This Series on the ‘Structured Web‘?

Though version numbers for the Web are silly, with ‘Web 3.0′ for the semantic Web possibly being the silliest of all, such attempts do speak to the need for providing handles and language for capturing the dynamic change, diversity and complexity of the Web.

Today, right now, and all around us, a fundamental transition is taking place in the Web from a document-centric to a data-centric environment. A confluence of standards, advocacies, and previous trends are fueling this transition. Since the practical building blocks already exist, we will see this structured Web unfold before us at amazing speed.

The concept of the structured Web is thus narrower and less ambitious in scope than the ‘Semantic Web.’ The structured Web is merely a transitional step on the journey to the vision of the semantic Web, albeit one that can be fully realized today with current technologies and current understandings.

The purpose of this new series is thus to give prominence to this transition and to highlight the pragmatic, practical building blocks available to contribute to this transition. By somewhat shifting boundary definitions, the idea of the structured Web also aims to give more prominence to the importance of usability and structure extraction from semi-structured and unstructured content. These, too, are exciting areas with much potential.

So, as a way to provide a touchstone for continued discussion on this matter, here is one working definition of the structured Web:

The structured Web is object-level data within Internet documents and databases that can be extracted, converted from available forms, represented in standard ways, shared, re-purposed, combined, viewed, analyzed and qualified without respect to originating form or provenance.

Anticipated Topics in this Series

Some of the tentative topics that I plan to address in this series include discussion of what constitutes ‘structure’ in content, why structure is important, the various existing forms of structure, human v. machine bases for viewing and interpreting structure, the importance of finding “canonical” representation forms while also appreciating real-world diversity, the means to convert data forms and serializations, the means to extract structure from all types of content, transitioning to semantic understandings, and likely others.

Others may be added to this series over time and will be categorized under ‘Structured Web‘ on the AI3 blog.

This posting is the first part of a new, occasional series on the Structured Web, which also has its own new category. There are some additional prior topics in this series.

[1] You will note a heavy emphasis on Wikipedia definitions and histories in this piece, in keeping with the general theme of versions and transitions on the Web.

[2] News groups really did not have a good search engine until the launch of Deja News in 1995.

[3] Chris Sherman, "Happy Birthday, Lycos!," Search Engine Watch, August 14, 2002. See http://searchenginewatch.com/showPage.html?page=2160551.

[4] A fairly good summary of the History of the Web can be found on Wikipedia.

[5] Michael K. Bergman (Aug 2001). “The Deep Web: Surfacing Hidden Value“. The Journal of Electronic Publishing 7 (1). An earlier version of this paper was published by BrightPlanet Corp. in July 2000.

[6] While there are variations, Linux, Apache, MySQL and the scripting languages of either Python, PHP, or Perl are often referred to as ‘LAMP‘, one central basis for much open source software and, more broadly, interoperable open-source application packages.

[7] I would make a few changes today, notably in deprecating XML somewhat.

[8] This table builds on Pluempitiwiriyawej and Hammer's schema by adding the fourth major category of language. See Charnyote Pluempitiwiriyawej and Joachim Hammer, "A Classification Scheme for Semantic and Schematic Heterogeneities in XML Data Sources," Technical Report TR00-004, University of Florida, Gainesville, FL, 36 pp., September 2000. See ftp.dbcenter.cise.ufl.edu/Pub/publications/tr00-004.pdf.

Posted by AI3's author, Mike Bergman

Posted on July 18, 2007 at 12:05 pm in Adaptive Information, Deep Web, Information Automation, Open Source, Semantic Web, Structured Web | Comments (8)
The URI link reference to this post is: http://www.mkbergman.com/390/what-is-the-structured-web/
The URI to trackback this post is: http://www.mkbergman.com/390/what-is-the-structured-web/trackback/
Date:   July 16, 2007
Since the progression of WordPress beyond version 2.5x, the Advanced TinyMCE plug-in has reached its E.O.L. A better alternative that is being kept current is TinyMCE Advanced from Andrew Ozz. Advanced TinyMCE is still available for download for older WP versions.
Announcing version 0.5.0 !
Download from here

I am pleased to announce a new update to the Advanced TinyMCE Editor for WordPress v. 2.2x. This new version — 0.5.0 — is now much easier to configure and customize thanks to the contributions of Chris Carson of Navy Road Software. You can download the plug-in and get detailed installation and documentation from the Advanced TinyMCE Editor page.

As with the previous version, the Advanced TinyMCE Editor enables you to turn your standard WordPress editor into this WYSIWYG powerhouse:

Enhanced Rich Text Editor for WP 2.2

The Advanced TinyMCE Editor plug-in:

  • Doubles the editor's available functions to more than 60, adding important functions such as tables, styles, inserts, and others
  • Improves existing functionality in handling images, links, etc., and
  • Corrects errors in the standard WordPress visual editor.

And, now, with Chris’ contributions, you can easily configure the Editor via a new Control Panel:

Advanced TinyMCE Control Panel
[Click on image for full-size pop-up]

This panel:

  • Allows you to add and enter new style sheets without needing to modify code
  • Allows you to configure which advanced buttons appear on or off in the Editor
  • Provides guidance on earlier width problems, esp. for small screen (800 x 600) older laptops.

Again, you can get this free WordPress plug-in from here.

TinyMCE and its advanced options are from Moxiecode Systems AB. Please note that the Advanced TinyMCE Editory plug-in has not been tested in WP versions prior to 2.2 and has not been tested in all browsers beyond Firefox and IE.

Posted by AI3's author, Mike Bergman

Posted on July 16, 2007 at 12:01 am in Blogs and Blogging, Open Source | Comments (5)
The URI link reference to this post is: http://www.mkbergman.com/389/announcing-advanced-tinymce-editor-v-050/
The URI to trackback this post is: http://www.mkbergman.com/389/announcing-advanced-tinymce-editor-v-050/trackback/
Date:   April 26, 2007

Image from http://newsimg.bbc.co.uk/media/images/41466000/jpg/_41466772_sotherton300.jpg

The Structured Web is But an Early Hurdle to the Semantic Web

About a year ago (May 30, 2006), Dr. Douglas Lenat, the president and CEO of Cycorp, Inc., gave a great talk at Google called Computers Versus Common Sense. Doug, a former professor at Stanford and Carnegie Mellon, has been working in artificial intelligence and machine learning his entire career. Since 1984 he has been working on the project Cyc, which subsequently formed the basis for starting Cycorp in 1994. Cycorp, as may become apparent below, does much work for the defense and intelligence communities.

Google Research selected this 70-min video as one of its best of 2006, and I have to heartily concur. Doug is very informative and is also an an entertaining and humorous speaker.

But that is not the main reason for my recommendation. For what Doug presents in this video are some of the real common sense challenges of semantic matching and reasoning by computer. These are the threshold hurdles of intelligent agents and real-time answering and (perhaps) forecasting, the ultimate objectives that some equate to the “Semantic Web” (title case):

Because of the reasoning objectives Cycorp and its clients have set for the system, threshold conditions require not only more direct deductive reasoning (logical inference from known facts or premises), but also inductive (inferred based on the likelihood or belief of multiple assertions) and even abductive reasoning (likely or most probabilistic explanation or hypothesis given available facts or assertions). Doug makes clear the devilishly difficult challenges of determining semantic relevance when complete machine-based reasoning is an objective.

As I listened to the video, I interpreted the attempt to reach these objectives as bringing at least four major, and linked, design implications.

The first design implication is that the reasoning basis requires many facts and assertions about the world, the basis of the “common sense” in the knowledge base. An early lesson that some AI practitioners in the “common sense” camp came to hold was that learning systems that did not know much, could not learn much. When Cyc was started more than 20 years ago, Lenat and Marvin Minsky estimated on the back of an envelope that it would take on the order of 1000 person-years to create a knowledge base with sufficient world knowledge to enable meaningful reasoning thereafter. This is what Lenat has called “priming the pump” of the knowledge base with common sense to resolve so many classes of semantic and contextual ambiguities.

However, this large number of assertions has a second design implication, if I understood Lenat correctly, for the need for higher-order predicate calculus. These higher orders for quantification over subsets and relations or even relations over relations are designed to reduce the number of potential “facts” that need to be queried for certain questions. This makes the knowledge base (KB) as a whole more computationally tractable, and able to provide second or sub-second response times.

A third implication is that, again to maintain computational tractability, reasoning should be local (with local ontologies) and with specialized reasoning modules. Today, Cyc has more than 1000 such modules and reasoning in some local areas may not actually infer correctly across the global KB. Lenat likens this to the observation that local geography appears flat even though we know the entire Earth is a globe. This enables local simplifications to make the inferences and reasoning tractable.

Finally, the fourth implication is a very much larger number of predicates than in other knowledge bases or ontologies, actually more than 16,000 in the current Cyc KB. This large number comes about because of:

  1. simplifying some of the more complex expressions that frequently repeat themselves in some of the higher-order patterns noted above as new predicates, again to speed processing time, and
  2. providing more precise meanings to certain language verbs that humans can disambiguate because of context, but which pose problems to computer processing. For example, Cyc contains 23 different predicates relating to the word “in”.

Growing the Knowledge Base

Roughly about the year 2000, sufficient “pump priming” had taken place such that Cyc could itself be used to extend its knowledge base through machine learning. A couple of critical enablers for this process were the querying of Web search engines and the engagement of volunteers and others to test the reasonableness of new assertions for addition to the KB. The basic learning and expansion process is now generally:

formal predicate calculus language → natural language queries → issue to Web search engine → translate results back to predicate language → present inferences for human review → accept / reject result (50%) → add to KB (knowledge base)

The engagement of volunteers is also coming about through the use of online “games”. Open source (see below) is another recent tool. (BTW, the 50% acceptance rate is not that there are so many wrong “facts” on the Web, but that context, humor, sarcasm or effect can be used in human language that does not actually lead to a “correct” common-sense assertion. Such observations do give pause to unbounded information extraction.)

Thus, today, Cyc has grown to have a very significant scope, as some of these metrics indicate:

  • 16,000 predicates
  • 1,000 reasoning modules
  • 300,000 concepts
  • 4,000 physical devices
  • 400 event-participant relationships
  • 11,000 event types
  • 171,000 “names” (chemicals, persons, places, etc.)
  • 1,100 geospatial classes, 500 goespatial predicates
  • 3.2 million assertions

As might be expected, the overall KB continues to grow, and on an accelerating rate.

Threshold Conditions and the Structured Web

This all appears daunting. And, when viewed through the lens of the decades to get the knowledge base to this scale and in terms of the ambitiousness of its objectives, it is.

But semantic advantage and the semantic Web is not an either/or proposition. It is a spectrum of use and benefit, with Cyc representing a relatively extreme end of that spectrum.

In my recent post on Structurizing the Web with RDF, I noted a number of key areas in which the structured Web would bring benefit, including more targeted, filtered search results, better results presentation, and the ability to search on objects and entities not simply documents. Structurizing the Web, short of full reasoning, is both an essential first step and will bring its own significant benefits.

The use of RDF is also not unnecessarily limiting compared to Cyc’s internal predicate calculus language. The subject-predicate-object “triple” of RDF and its reliance on first-order logic for deduced inferences can also be used to express propositions about nested contexts resulting in metalanguages for modal and higher-order logic. OWL itself is expressed as a metalanguage of RDF encoded in XML; and virtually all mathematics can be described through RDF.

Thus, Cyc, like DBPedia and related efforts to express Wikipedia as RDF, can itself be a rich source of structure for this evolving Web. Like other formats and frameworks, parts can be used in greater or lesser scope and complexity depending on circumstances. The sheer scope of Cyc’s world view in its knowledge base is a tremendous asset.

OpenCyc as a Useful Knowledge Base

The value of this asset increased enormously with the release of OpenCyc in early 2002. The OpenCyc knowledge base and APIs are available under the Apache License. There are presently about 100,000 users of this open source KB, which has the same ontology as the commercial Cyc, but is limited to about 1 million assertions. The most recent release is v. 1.02. An OWL version is also available. The Cyc Foundation has also been formed to help promote further extensions and development around OpenCyc.

There has been some hint on the OpenCyc mailing list of others interested in creating RDF versions of OpenCyc or portions thereof, with the similar benefits to RDF versions of Wikipedia via SPARQL endpoints and retrievals, mashups and RDF browsing. OpenCyc would also obviously be a tremendous boon to better inferencing for the datasets that are rapidly becoming exposed via RDF.

The idea of the local ontologies and reasoners that Lenat discusses — and the broader collaboration mechanisms emerging around other semantic Web issues and tools — bode well for a resurgence of interest in Cyc. After more than two decades, perhaps its time has truly come.

BTW, some interesting slides with pretty good overlap with Lenat’s Google talk can be found here from the Cyc Foundation.

Posted by AI3's author, Mike Bergman

Posted on April 26, 2007 at 3:26 pm in Adaptive Information, Information Automation, Open Source, Semantic Web | Comments (1)
The URI link reference to this post is: http://www.mkbergman.com/365/priming-the-pump-and-threshold-conditions/
The URI to trackback this post is: http://www.mkbergman.com/365/priming-the-pump-and-threshold-conditions/trackback/
Date:   April 16, 2007

String of PearlsVirtuoso and Related Tools String Together a Surprisingly Complete Score

NOTE: This is a lengthy post.

Jewels & DoubloonsOne of my objectives in starting the comprehensive Sweet Tools listing of 500+ semantic Web and -related tools was to surface hidden gems deserving of more attention. Partially to that end I created this AI3 blog’s Jewels and Doubloons award to recognize such treasures among the many offerings. The idea was and is that there is much of value out there; all we need to do is spend the time to find it.

I am now pleased to help better expose one of those treasures — in fact, a whole string of pearls — from OpenLink Software.

The semantic Web through its initial structured Web expression is not about to come — it has arrived. Mark your calendar. And OpenLink’s offerings are a key enabler of this milestone.

Though having been in existence since 1992 as a provider of ODBC middleware and then virtual databases, since at least 2003 OpenLink has been active in the semantic Web space [1]. The company was an early supporter of open source beginning with iODBC in 1999. But OpenLink’s decision to release its main product lines under dual-license open source in mid-2006 really marked a significant milestone for semantic Web toolsets. Though generally known mostly to the technology cognoscenti and innovators, these events now poise OpenLink for much higher visibility and prominence in the structured Web explosion now unfolding.

OpenLink’s current tools and technologies span the entire spectrum from RDF browsing and structure extraction to create RDF-enabled Web content, to format conversions and basic middleware operations, to ultimate data storage, query and management. (It is also important to emphasize that the company’s offerings support virtual databases, procedure hosting, and general SQL and XML data systems that provide value independent of RDF or the semantic Web.) There are numerous alternatives in other semantic Web tools at specific spots across this spectrum, most of which are also open source [2].

What sets OpenLink apart is the breadth and consistency of vision — correct in my view — that governs the company’s commitment and total offerings. It is this breadth of technology and responsiveness to emerging needs in the structured Web that signals to me that OpenLink will be a deserving player in this space for many years to come [3].

An Overview of OpenLink Products [4]

OpenLink SoftwareOpenLink provides a combination of open source and commercial software and associated services. While most attention in the remainder of this piece is given to its open source offerings, the company’s longevity obviously stems from its commercial success. And that success first built from its:

  • Universal Data Access Drivers — high-performance data access drivers for ODBC, JDBC, ADO.NET, and OLE DB that provide transparent access to enterprise databases. These UDA products also were the basis for OpenLink’s first foray into open source with the iODBC initiative (see below). This history of understanding relational database management systems (RDBMSs) and the needs of data transfer and interoperability are key underpinnings to the company’s present strengths.

This basis in data federation and interoperability then led to a constant evolution to more complex data integration and application needs, resulting in major expansion of the company’s product lines:

  • OpenLink Virtuoso — is a cross-platform ‘Universal Server’ for SQL, XML, and RDF data, including data management, that also includes a powerful virtual database engine, full-text indexing, native hosting of existing applications, Web Services (WS*) deployment platform, Web application server, and bridges to numerous existing programming languages. Now in version 5.0, Virtuoso is also offered in an open source version; it is described in focus below. The basic architecture of Virtuoso and its robust capabilities, as depicted by OpenLink, is:
Virtuoso Architecture
[Click on image for full-size pop-up]
  • OpenLink Data Spaces — also known as ODS, these are a series of distributed collaboration components built on top of Virtuoso that provide semantic Web and Web 2.0 support. Integrated components are provided for Web blogs, wikis, bookmark managers, discussion forums, photo galleries, feed aggregators and general Web site publishing using a clean scripting language. Support for multiple data formats, query languages, feed formats, and data services such as SPARQL, GData, GRDDL, microformats, OpenSearch, SQLX and SQL is included, among others (see below). This remarkable (but little known!) platform is extensible and provided both commercially and as open source; it is described in focus below.
OpenLink Software Demos
OpenLink has a number of useful (and impressive) online demos of capabilities. Here are a few of them, with instructions (if appropriate) on how best to run them:

  • OAT components – this series of demos, at http://demo.openlinksw.com/DAV/JS/demo/index.html?about, provide live use and JavaScript for all components in the toolkit. Simply choose one of the options in the menu to the left and then run the widget or app
  • Extract RDF from a standard Web page — in this demo, simply pasting a URL will allow you to extract RDF information from any arbitrary Web page. Go to http://demo.openlinksw.com/DAV/JS/rdfbrowser/index.html (demo / demo for username and password) and enter a URL of your choice (include http://!) in the Data Source URI box and then hit query. Alternatively, you can enter http://www.openlinksw.com/blog/~kidehen/ which will show all use options
  • Browse DBPedia RDF — this demo uses the OpenLink iSPARQL query builder and viewer, found at http://demo3.openlinksw.com:8890/isparql/ (demo / demo for username and password). At entry, then pick QBE > Open from the menu, then choose one of the listed queries. At this point, begin to explore the results. Note that clicking on a link in the lower pane will pull up the ‘explorer’, which allows you to navigate graph links between resources
  • Construct a SPARQL query — in the same example above, use the graph or generate options to either create a new query with the SPARQL language directly or via modifying the node graph in the middle pane (using its corresponding symbol buttons).
  • OAT (OpenLink Ajax Toolkit) — is a JavaScript-based toolkit for browser-independent rich-Internet application development. It includes a complete collection of UI widgets and controls, a platform-independent data access layer called Ajax Database Connectivity, and full-blown applications including unique database and table creation utilities, and RDF browser and SPARQL query builder, among others. It is offered as open source and is described in focus below.
  • iODBC — is another open source initiative that provides a comprehensive data access SDK for ODBC. iODBC is both a C-based API and cross-language hooks to C++, Java, Perl, Python, TCL and Ruby. iODBC has been ported to most modern platforms.
  • odbc-rails — is an open-source data adapter for Ruby on Rail‘s ActiveRecord that provides ODBC access to databases such as Oracle, Informix, Ingres, Virtuoso, SQL Server, Sybase, MySQL, PostgreSQL, DB2, Firebird, Progress, and others.
  • And, various other benchmarking and diagnostic utilities.

Relation to the Overall Semantic Web

One remarkable aspect of the OpenLink portfolio is its support for nearly the full spectrum of semantic Web requirements. Without the need to consider tools from any other company or entity, it is now possible to develop, test and deploy real semantic Web instantiations today, and with no direct software cost. Sure, some of these pieces are rougher than others, and some are likely not the “best” as narrowly defined, and there remain some notable and meaningful gaps. (After all, the semantic Web and its supporting tools will be a challenge for years to come.) But it is telling about the actual status of the semantic Web and its readiness for serious attention that most of the tools to construct a meaningful semantic Web deployment are available today — from a single supplier [5].

We can see the role and placement of OpenLink tools within the broader context of the overall semantic Web. The following diagram represents the W3C’s overall development roadmap (as of early 2006), with the roles played by OpenLink components shown in the opaque red ovals. While this is not necessarily the diagram I would personally draw to represent the current structured Web, it does represent consensus of key players and does represent a comprehensive view of the space. By this measure, OpenLink components contribute in many, many areas and across the full spectrum of the semantic Web architecture (note the full-size diagram is quite large):

OpenLink in Relation to W3C Architecture
[Click on image for full-size pop-up]

This status is no accident. It can be argued that OpenLink’s strong background in structured data, data federation and virtual databases and ODBC were excellent foundations. The company has also been focused on semantic Web issues and RDF support since at least 2003; we are now seeing the obvious fruits of these years of effort. As well, the early roots for some of the basic technology and approaches taken by the company extend back to Lisp and various AI (artificial intelligence) interests. One could also argue that perhaps a certain mindset comes from a self-defined role in “middleware” that lends itself to connecting the dots. And, finally, a vision and a belief in the centrality of data as an organizing principal for the next-generation Web has infused this company’s efforts from the beginning.

This placing of the scope of OpenLink’s offerings in context now allows us to examine some of the company’s individual “pearls” in more depth.

In Focus: Bringing Structure to the Web

No matter how much elegance and consistency of logic some may want to bring to the idea of the semantic Web, the fact will always remain that the medium is chaotic with multiple standards, motivations and statuses of development at work at any point in time. Do you want a consistent format, a consistent schema, a consistent ontology or world view? Forget it. It isn’t going to happen unless you can force it internally or with your suppliers (if you are a hegemonic enterprise, and, likely, not even then).

This real-world context has been a challenge for some semantic Web advocates. (I find it telling and amusing that the standard phrase for real-world applications is “in the wild”.) Many come from an academic background and naturally gravitate to theory or parsimony. Actually, there is nothing wrong with this; quite the opposite. What all of us now enjoy with the Internet would not have come about without the brilliance of understanding of simple and direct standards that early leaders brought to bear. My own mental image is that standards provide the navigation point on the horizon even if there is much tacking through the wind to get there.

But as theory meets practice the next layer of innovation comes from the experienced realists. And it is here that OpenLink (plus many others who recognize such things) provide real benefit. A commitment to data, to data federation, and an understanding of real world data formats and schemas “in the wild” naturally lead to a predilection to conversion and transformation. It may not be grand theory, but real practitioners are the ones who will lead the next phases with prejudices more to workability through such things as “pipeline” models or providing working code [6].

I’ve spoken previously about how RDF has emerged as the canonical storage form for the structured and semantic Web; I’m not sure this was self-evident to many observers up until recently. But it was evident to the folks at OpenLink, and — given their experience in heterogeneous data formats — they acted upon it.

Today, out of the box, you can translate the following formats and schema using OpenLink’s software alone. My guess is that the breadth of the table below would be a surprise to many current semantic Web researchers:

Accepted Data Formats / Schemas Query / Access / Transport Protocols Output Formats
  • RDF
  • XML
  • JSON
  • HTML

(Note, some of these items may reside in multiple categories and thus have been somewhat arbitrarily placed.)

In addition, OpenLink is developing or in beta with these additional formats, application sources or standards: Triplr, Jena, WordPress, Drupal, Zotero (with its support for major citation schemas such as CSL, COinS, etc.), Relax NG, phpBB, MediaWiki, XBRL, and eCRM.

A critical piece in these various transformations is the new Virtuoso Sponger, formally released in version 5.0 (though portions had been in use for quite some time). Depending on the file or format type detected at ingest, Sponger may apply metadata extractors to binary files (a multitude of built-in extractors, Open Office, images, audio, video, etc., plus an API to add new ones), “cartridges” for mapping REST-style Web services and query languages (see above), or cartridges for mapping microformats or metadata or structure embedded in HTML (basic XHTML, eRDF, RDFa, GRDDL profiles, or other sources suitable for XSLT). There is also a UI for simply adding your own cartridges via the Virtuoso Conductor, the system administration console for Virtuoso.

Detection occurs at the time of content negotiation instigated by the retrieval user agent. Sponger first tests for RDF (including N3 or Turtle), then scans for microformats or GRDDL. (If it is GRDDL-based the associated XSLT is used; otherwise Virtuoso’s built-in XSLT processors are used. If it is a microformat, Virtuoso uses its own XSLT to transform and map to RDF.) The next fallback is scanning of the HTML header for different Web 2.0 types or RSS 1.1, RSS 2.0, Atom, etc. In these cases, the format (if supported) is transformed to RDF on the fly using other built-in XSLT processors (via an internal table that associates data sources with XSLT similar to GRDDL patterns but without the dependency on GRRDL profiles). Failing those tests, the scan then uses standard Web 1.0 rules to search in the header tags for metadata (typically Dublin Core) and transform to RDF; other HTTP response header data may also be transformed to RDF.

“Transform” as used above includes the fact that OpenLink generates RDF based on an internal mapping table that associates the data sources with schemas and ontologies. This mapping will vary depending on if you are using Virtuoso with or without the ODS layer (see below). If using the ODS layer, OpenLink maps further to SIOC, SKOS, FOAF, AtomOWL, Annotea bookmarks, Annotea annotations, etc.. depending on the data source [7]. Getting all “scrubbed” or “sponged” data from these Web sources into RDF enables fast and easy mash-ups and, of course, an appropriate canonical form for querying and inference. Sponger is also fully extensible.

The result of these capabilities is simple: from any URL, it is now possible to get a minimum of RDF, and often quite rich RDF. The bridge is now made between the Web and the structured Web. Though we are now seeing such transformation utilities or so-called “RDFizers” rapidly emerge, none to my knowledge offer the breadth of formats, relation to ontology structure, or ease of integration and extension as do these Sponger capabilities from OpenLink.

In Focus: OAT and Tools

Though the youngest of the major product releases from Open Link, OAT — the OpenLink Ajax Toolkit — is also one of the most accessible and certainly flashiest of the company’s offerings. OAT is a JavaScript-based toolkit for browser-independent rich-Internet application development. It includes a robust set of standalone applications, database connectivity utilities, complete and supplemental UI (user interface) widgets and controls, an event management system, and supporting libraries. It works on a broad range of Ajax-capable web browsers in a standalone mode, and has notable on-demand library loading to reduce the total amount of downloaded JavaScript code.

OAT is provided as open source under the GPL license, and was first released in August 2006. It is now at version OAT 1.2 with the most recent release including integration of the iSPARQL QBE (see below) into the OAT Form Designer application. The project homepage is found at http://sourceforge.net/projects/oat; source code may be downloaded from http://sourceforge.net/projects/oat/files.

OAT is one of the first toolkits that fully conforms to the OpenAjax Alliance standards; see the conformance test page. OAT’s development team, led by Ondrej Zara, also recently incorporated the OAT data access layer as a module to the Dojo datastore.

OpenLink Ajax Toolkit (OAT) Overview

There are many Ajax toolkits, some with robust widget sets. What really sets OAT apart is its data access with breadth. This is very much in keeping with OpenLink’s data federation and middleware strengths. OAT’s Ajax database connectivity layer supports data binding to the following data source types:

  1. RDF — via SPARQL (with support in its query language, protocol, and result set serialization in any of the RDF/XML, RDF/N3, RDF/Turtle, XML, or JSON formats)
  2. Other Web data — via GData or OpenSearch
  3. SQL — via XMLA (a somewhat forgotten SOAP protocol for SQL data access that can sit atop ODBC, ADO.NET, OLE-DB, and even JDBC), and
  4. XML — via SOAP or REST style Web services.

Most all of the OAT components are data aware, and thus have the same broad data format flexibilities.The table below shows the full listing of OAT components, with live links to their online demos, selectable from a expanding menu, which itself is an example of the OAT Panelbar (menu) (Notes: use demo / demo for username and password if requested; pick “DSN=Local_Instance” if presented with a dropdown that asks for ‘Connection String’):

Standalone Applications
Forms Designer [see below]
DB Designer [see below]
SQL QBE [see below]
iSPARQL [see below]
RDF Browser [see below]
Libraries
Ajax DB Connectivity
Mapping (with support for Google, Yahoo!, OpenLayers, Microsoft Visual Earth)
Complete Widgets
Supplemental Widgets
Date Picker (calendar)

Not all of these components are supported on all major browsers; please refer to the OpenLink OAT compatibility chart for any restrictions.

The following screen shot shows one of the OAT complete widgets, the pie charting component:

OAT - Example Pie Graph
[Click on image for full-size pop-up]

One of the more complete widgets is the pivot table, which offers general Excel-level functionality including sorts, totals and subtotals, column and row switching, and the like:

OAT - Example Pie Graph
[Click on image for full-size pop-up]

Another one of the more interesting controls is the WebDAV Browser. WebDAV extends the basic HTTP protocols to allow moves, copies, file accesses and so forth on remote servers and in support of multiple authoring and versioning environments. OAT’s WebDAV Browser provides a virtual file navigation system across one or more Web-connected servers. Any resource that can be designated by a URI/IRI (the Internationalized Resource Identifier is a Unicode generalization of the Uniform Resource Identifier, or URI) can be organized and managed via this browser:

OAT - WebDAV Browser
[Click on image for full-size pop-up]

Again, there are online demos for each of the other standard widgets in the OAT toolkit.

The next subsections cover each of the major standalone applications contained in OAT; most are themselves combinations of multiple OAT components and widgets.

Forms Designer

The Forms Designer is the major UI design component within OAT. The full range of OAT widgets noted above — plus more basic controls such as labels, inputs, text areas, checkboxes, lines, URLs, images, tag clouds and others — can be placed by the designer into a savable and reusable composition. Links to data sources can be specified in any accessible network location with a choice of SQL, SPARQL or GData (and the various data forms associated with them as noted above).

The basic operation of the Forms Designer is akin to other standard RAD-type tools; the screenshot below shows a somewhat complicated data form pop-up overlaid on a map control (there is also a screencast showing this designer’s basic operations):

OAT - Using the Forms Designer
[Click on image for full-size pop-up]

When completed, this design produces the following result. Another screencast shows using the composite mash-up in action:

OAT - Mash-up Result of Using the Forms Designer
[Click on image for full-size pop-up]

Because of the common data representation at the core of OAT (which comes from the OpenLink vision), the ease of creating these mash-ups is phenomenal, the simplicity of which can be missed when viewing static examples. The real benefits actually become apparent when you make these mash-ups on your own.

Database (DB) Designer

The OAT Database (DB) Designer (also the basis for the Data Modeller) addresses a very different problem: laying out and designing a full database, including table structures and schemas. This is one of the truly unique applications within OAT, not shared (to my knowledge) by any other Ajax toolkit.

One starts either by importing a schema using XMLA or begins from scratch or from an existing template. Tables can be easily added, named and modified. As fields are added, it is easy to select data types, which also become color coded based on type on the design palette. Key relations between the tables are shown automatically:

OAT - Database Designer
[Click on image for full-size pop-up]

The result is a very easy, interactive data design environment. Of course, once completed, there are numerous options for saving or committing to actual data population.

SQL Query by Example

Once created and populated, the database is now available for querying and display. The SQL QBE (query-by-example) application within OAT is a fairly standard QBE interface. That may be a comfort to those used to such tools, but it has a more important role in being an exemplar for moving to RDF queries with SPARQL (see next):

OAT - Query-by-Example
[Click on image for full-size pop-up]

OpenLink also provides a screencast for how to make these table linkages and to modify the results display with a now-common SQL format.

iSPARQL Query Builder

OpenLink’s analog to QBE for SQL applied to RDF is its visual iSPARQL Query Builder (SVG-based, which of course is also an XML format). We are now dealing with graphs, not tables. And we are dealing with SPARQL and triples, not standard two-dimensional relational constructs. Some might claim that with triples we are now challenging the ability of standard users to grasp semantic Web concepts. Well, I say, yeah, OK, but I really don’t see how a relational table and schema with its lousy names is any easier than looking at a graph with URLs as the labeled nodes. Look at the following and judge for yourself:

iSPARQL Query Builder - Sample View
[Click on image for full-size pop-up]

OK, it looks pretty foreign. (Therefore, this is the one demo you should really try.) In my opinion, the community is still groping for constructs, visuals and paradigms to make the use of SPARQL intuitive. Though OpenLink has a presentation as good as anyone’s, the challenge of finding a better paradigm remains out there.

RDF Browser

Direct inspection of RDF is also hard to describe and even harder to present visually. What does it mean to have data represented as triples that enables such things such as graph (network) traversals or inference [8]? The broad approach that OpenLink takes for viewing RDF is via what it calls the RDF Browser. Actually, the browser presents a variety of RDF views, some in mash-up mode. The first view in the RDF Browser is for basic triples:

RDF Browser - Browse View
[Click on image for full-size pop-up]

This particular example is quite rich, with nearly 4000 triples with many categories and structure (see full size view). Note the tabs in the middle of the screen; these are where the other views below are selected.

For example, that same information can also be shown in standard RDF “triples” view (subject-object-predicate):

RDF Browser - Triples View
[Click on image for full-size pop-up]

The basic RDF structural relationships can also be shown in a graph view, itself with many settings for manipulating and viewing the data, with mouseovers showing the content of each graph node:

RDF Browser - Graph View
[Click on image for full-size pop-up]

The data can also be used for “mash-ups”; in this case, plotting the subject’s location on a map:

RDF Browser - Map View
[Click on image for full-size pop-up]

Or, another mash-up shows the dates of his blog postings on a timeline:

RDF Browser - Timeline View
[Click on image for full-size pop-up]

Since the RDF Browser uses SVG, some of these views may be incompatible with IE (6 or 7) and Safari. Firefox (1.5+), Opera 9.x, WebKit (Open Source Safari), and Camino all work well.

These are early views about how to present such “linked” information. Of course, the usability of such interfaces is important and can make or break user acceptance (again, see [8]). OpenLink — and all of us in the space — has the challenge to do better.

Yet what I find most telling about this current OpenLink story is that the data and its RDF structure now fully exist to manipulate this data in meaningful ways. The mash-ups are as easy as A connects to B. The only step necessary to get all of this data working within these various views is simply to point the RDF Browser to the starting Web site’s URL. Now that is easy!

So,we are not fully there in terms of compelling presentation. But in terms of the structured Web, we are already there in surprising measure to do meaningful things with Web data that began as unstructured raw grist.

In Focus: Virtuoso

OK, so we just got some sizzle; now it is time for the steak. OpenLink Virtuoso is the core component of the company’s offerings. The diagram at the top of this write-up provides an architectural overview of this product.

Virtuoso can best be described as a complete deployment environment with its own broad data storage and indexing engine; a virtual database server for interacting with all leading data types, third-party database management systems (DBMSs) and Internet “endpoints” (data-access points); and a virtual Web and application server for hosting its own and external applications written in other leading languages. To my knowledge, this “universal server” is the first cross-platform product that implements Web, file, and database server functionality in a single package.

The Virtuoso architecture exposes modular tools that can be strung together in a very versatile information-processing pipeline. Via the huge variety of structure and data format transformations that the product supports (see above), the developer only need worry about interacting with Virtuoso’s canonical formats and APIs. The messy details of real-world diversities and heterogeneities are largely abstracted from view.

Data and application interactions occur through the system’s virtual or federated database server. This core provides internal storage and application facilities, the ability to transparently expose tables and views from external databases, and the capability of exposing external application logic in a homogeneous way [9]. The variety of data sources supported by Virtuoso can be efficiently joined in any number of ways in order to provide a cohesive view of disparate data from virtually any source and in any form.

Since storage is supported for unstructured, semi-structured and structured data in its various forms, applications and users have a choice of retrieval and query constructs. Free-text searching is provided for unstructured data, conventional documents and literal objects within RDF triples; SQL is provided for conventional structured data; and SPARQL is provided for RDF and -related graph data. These forms are also supplemented with a variety of Web service protocols for retrievals across the network.

The major functional components within Virtuoso are thus:

  1. A DBMS engine (object-relational like PostgreSQL and relational like MySQL)
  2. XML data management (with support for XQuery, XPath, XSLT, and XML Schema)
  3. An RDF triple store (or database) that supports the SPARQL query language, transport protocol, and various serialization formats
  4. A service-oriented architecture (SOA) that combines a BPEL engine with an enterprise service bus (ESB)
  5. A Web application server (supporting both HTTP and WebDAV), and
  6. An NNTP-compliant discussion server.

Virtuoso provides its own scripting language (VSP, similar to Microsoft’s ASP) and Web application scripting language (VSPX, similar to Microsoft’s ASPX or PHP) [12]. Virtuoso Conductor is an accompanying system administrator’s console.

Virtuoso currently runs on Windows (XP/2000/2003), Linux (Redhat, Suse) Mac OS X, FreeBSD, Solaris, and other UNIX variants (including AIX and HP-UX and 32- and 64-bit platforms). Exclusive of documentation, a typical install of Virtuoso application code is about 100 MB (with help, documentation and examples, about 300 MB).

The development of Virtuoso first began in 1998. A major re-write with its virtual aspects occurred in 2001. WebDAV was added in early 2004, and RDF support with release of an open source version in 2006. Additional description of the Virtuoso history is available, plus a basic product FAQ and comprehensive online documentation. The most recent version 5.0 was released in April 2007.

RDF and SPARQL Enhancements

For the purposes of the structured Web and the semantic Web, Virtuoso’s addition of RDF and SPARQL support in early 2006 was probably the product’s most important milestone. This update required an expansion of Virtuoso’s native database design to accommodate RDF triples and a mapping and transformation of SPARQL syntax to Virtuoso’s native SQL engine.

For those more technically inclined, OpenLink offers an online paper, Implementing a SPARQL Compliant RDF Triple Store using a SQL-ORDBMS, that provides the details of how RDF and its graph IRI structure were superimposed on a conventional relational table design [10].

Virtuoso’s approach adds the IRI as a built-in and distinct data type. Virtuoso’s ANY type then allows for a single-column representation of a triple’s object (o). Since the graph (g), subject (s) and predicate (p) are always IRIs, they are declared as such. Since an ANY value is a valid key part with a well-defined collation order between non-comparable data types, indices can be built using the object (o) of the triple.

While text indexing is not directly supported in SPARQL, Virtuoso easily added it as an option for selected objects. Virtuoso also adds other SPARQL extensions to deal both with the mapping and transformation to native SQL and for other requirements. Though type cast rules of SPARQL and SQL differ, Virtuoso deals with this by offering a special SQL compiler directive that enables efficient SPARQL translation without introducing needless extra type tests. Virtuoso also extends SPARQL with SQL-like aggregate and group by functionality. With respect to storage, Virtuoso allows a mix of storage options for graphs in a single table or graph-specific tables; in some cases, the graph component does not have to be written in the table at all. Work on a system for declaring storage formats per graph is ongoing.

OpenLink provides an entire section on its RDF and SPARQL implementation in Virtuoso. In addition, there is an interactive SPARQL demo showing these capabilities at http://demo.openlinksw.com/isparql/.

Open Source Version and Differences

OpenLink released Virtuoso in an open source edition in mid-2006. According to Jon Udell at that time, to have “Virtuoso arrive on the scene as an open source implementation of a full-fledged SQL/XML hybrid out into the wild is a huge step because there just hasn’t been anything like that.” And, of course, now with RDF support and the Sponger, the open source uniqueness and advantages (especially for the semantic Web) are even greater.

Virtuoso’s open source home page is at http://virtuoso.openlinksw.com/wiki/main/ with downloads available from http://virtuoso.openlinksw.com/wiki/main/Main/VOSDownload.

Virtually all aspects of Virtuoso as described in this paper are available as open source. The one major component found in the commercial version, but not in the open source version, is the virtual database federation at the back-end, wherein a single call can access multiple database sources. This is likely extremely important to larger enterprises, but can be overcome in a Web setting via alternative use of inbound Web services or accessing multiple Internet “endpoints.”

Latest Release

The April 2007 version 5.0 release of Virtuoso demonstrates OpenLink’s commitment to continued improvements, especially in the area of the structured Web. There was a re-factoring of the database engine resulting in significant performance improvements. In addition, RDF-related enhancements included:

  • Added the built-in Sponger (see above) for transforming non-RDF into RDF “on the fly”
  • Full-text indexing of literal objects within triples
  • Added basic inferencing (subclass and subproperty support)
  • Added SPARQL aggregate functions
  • Improved SPARQL language support (updates, inserts, deletes)
  • Improved XML Schema support (including complex types through a bif:xcontains predicate)
  • Enhanced the SPARQL to SQL compiler’s optimizer, and
  • Further performance improvements to RDF views (SQL to RDF mapping).

In Focus: Open Data Spaces (ODS)

With the emergence of Web 2.0 and its many collaborative frameworks such as blogs, wikis and discussion forums, OpenLink came to realize that each of these locations was a “data space” of meaningful personal and community content, but that the systems were isolated “silos.” Each space was unconnected to other spaces, the protocols and standards for communicating between the spaces were fractured or non-existent, and each had its own organizing framework, means for interacting with it, and separate schema (if it had one at all).

Since the Virtuoso platform itself was designed to overcome such differences, it became clear that an application layer could be placed over the core Virtuoso system that would break down the barriers to these siloed “data spaces,” while at the same time providing all varieties of spaces similiar functionality and standards. Thus was started what became the OpenLink Data Space (or ODS) collaboration application, first released in open source in late 2006.

ODS is now provided as a packaged solution (in both open source and commercial versions) for use on either the Web or behind corporate firewalls, with security and single sign-on (SSO) capabilities. Mitko Iliev is OpenLink’s ODS program manager.

ODS is highly configurable and customizable. ODS has ten or so basic components, which can be included or not on a deployment-wide level, or by community, or by individual. Customizing the look and feel of ODS is also quite easy via CSS and the simple Virtuoso Web application scripting language, VSPX. (See here for examples of the language, which is based on XML.) A sample intro screen for ODS, largely as configured out of the box, is shown by this diagram:

OpenLink Data Space - General View
[Click on image for full-size pop-up]

Note the listing of ODS components on the menu bar across the top. These baseline ODS components are:

  1. Blog — this component is a full-featured blogging platform with support for all standard blogging features; it supports all the major publishing protocols (Atom, Moveable Type, Meta Weblog, and Blogger) and includes automatic generation of content syndication supporting RSS 1.0, RSS 2.0, Atom, SIOC, OPML, OCS (an older Microsoft format), and others
  2. Wiki — a full-Wiki implementation that supports the standard feeds and publishing protocols, plus supports Twiki, MediaWiki and Creole markup and WYSIWYG editing
  3. Briefcase — an integrated location to store and share virtually any electronic resource online; metadata is also extracted and stored
  4. Feed Aggregator — is an integrated application that allows you to store, organize, read, and share content including blogs, news and other information sources; provides automatic extraction of tags and comments from feeds; supports feeds in.RSS 1.0, 2.0, Atom, OPML, or OCS formats
  5. Discussion — this is, in essence, a forum platform that allows newsgroups to be monitored, posts viewed and responded to by thread, and other general discussion features
  6. Image Gallery — is an application to store and to share photos. The photos can be viewed as online albums or slide shows
  7. Polls — a straightforward utility for posting polls online and viewing voted results; organized by calendar as well
  8. Bookmark Manager –is a component for constructing, maintaining, and sharing vast collections of Web resource URLs; supports XBEL-based or Mozilla HTML bookmark imports; also mapped to the Annotea bookmark ontology
  9. Email plaftorm — this is a robust Web-based email client with standard email management and viewing functionality
  10. Community — a general portal showing all group and community listings within the current deployment, as well as means for joining or logging into them.

Here, for example, is the “standard” screen (again, modifiable if desired) when accessing an individual ODS community:

OpenLink Data Space - Community View
[Click on image for full-size pop-up]

And, here is another example, this time showing the “standard” photo image gallery:

OpenLink Data Space - Gallery View
[Click on image for full-size pop-up]

You can check out for yourself these various integrated components by going to http://myopenlink.net/ods/sfront.vspx.

There are subtle, but powerful, consistencies underlying this suite of components that I am amazed more people don’t talk about. While each individual component performs and has functionality quite similar to its standalone brethren, it is the shared functionality and common adherence to standards across all ODS components where the app really shines. All ODS components, for example, share these capabilities where it natively makes sense to the individual component:

  • Single sign-on with security
  • OpenID and Yadis compliance
  • Data exposed via SQL, RDF, XML or related serializations
  • Feed reading formats including RSS 1.0, RSS 2.0, Atom, OPML
  • Content syndication formats including RSS 1.0, RSS 2.0, Atom, OPML
  • Content publishing using Atom, Moveable Type, MetaWeblog, or Blogger protocols
  • Full-text indexing and search
  • WebDAV attachments and management
  • Automatic tagging and structurizing via numerous relevant ontologies (microformats, FOAF, SIOC, SKOS, Annotea, etc.)
  • Conversation (NNTP) awareness
  • Data management and query including SQL, RDF, XML and free text
  • Data access protocols including SQL, SPARQL, GData, Web services (SOAP or REST styles), WebDAV/HTTP
  • Use of structured Web middleware (see full alphabet soup above), and
  • OpenLink Ajax Toolkit (OAT) functionality.

This is a very powerful list, and it doesn’t end there. OpenLink has an API to extend ODS, which has not yet been published, but will likely be so in the near future.

So, with the use of ODS, you can immediately become “semantic Web-ready” in all of your postings and online activities. Or, as an enterprise or community leader, you can immediately connect the dots and remove artificial format, syntax and semantic barriers between virtually all online activities by your constituents. I mean, we’re talking here about hurdles that generally appear daunting, if not unsurmountable, that can be downloaded, installed and used today. Hello!?

Finally, OpenLink Data Spaces are provided as a standard inclusion with the open source Virtuoso.

Performance and Interoperability

I mean, what can one say? With all of this scope and power, why aren’t more people using OpenLink’s software? Why doesn’t the world better know about these capabilities? Is there some hidden flaw in all of this due to lack of performance or interoperability or something else?

I think it’s worth decomposing these questions from a number of perspectives because of what it may say about the state of the structured Web.

First, by way of disclaimer, I have not benchmarked OpenLink’s software v. other applications. There are tremendous offerings from other entities, many on my “to do” list for detailed coverage and perhaps hosannas. Some of those alternatives may be superior or better fits in certain environments.

Second, though, OpenLink is an experienced middleware vendor with data access and interoperability as its primary business. As a relatively small player, it can only compete with the big boys based on performance, responsiveness and/or cost. OpenLink understands this better than I; indeed, the company has both a legacy and reputation for being focused on benchmarks and performance testing.

I believe it is telling that the company has been an active developer and participant in performance testing standards, and is assiduous in posting performance results for its various offerings. This speaks to intellectual and technical integrity. For example, Orri Erling, the OpenLink Virtuoso program manager and lead developer, posts on a frequent basis on both how to improve standards and how OpenLink’s current releases stand up to them. The company builds and releases benchmarking utilities in open source. OpenLink is an active participant in the THALIA data integration benchmarking initiative and the LUBM semantic Web repository benchmark. While RDF performance and benchmarks are obviously still evolving, I think we can readily dismiss performance concerns as a factor limiting early uptake [11].

Third, as for functionality or scope, I honestly can not point to any alternative that spans as broad of a spectrum of structured Web requirements than does OpenLink. Further, the sheer scope of data formats and schemas that OpenLink transforms to RDF is the broadest I know. Indeed, breadth and scope are real technical differentiators for OpenLink and Virtuoso.

Fourth, while back-end operability with other triple stores such as Sesame, Jena or Mulgara/Kowari is not yet active with Virtuoso, there apparently has not yet been a demand for it, the ability to add it is clearly within Open Link’s virtual database and data federation strengths, and no other option — commercial or open source — currently does so. We’re at the cusp of real interoperability demands, but are not yet quite there.

And, last, and this is very recent, we are only now beginning to see large and meaningful RDF datasets begin to appear for which these capabilities are naturally associated. Many have been active in these new announcements. OpenLink has taken an enlightened self-interest role in its active participation in the Open Data movement and in its strong support for the Linking Open Data effort of the SWEO (Semantic Web Outreach and Education) community of the W3C.

So, via a confluence of many threads intersecting at once I think we have the answer: The time was not ripe until now.

What we are seeing at this precise moment is the very forming of the structured Web, the first wave of the semantic Web visible to the general public. Offerings such as OpenLink’s with real and meaningful capabilities are being released; large structured datastores such as DBpedia and Freebase have just been announced; and the pundit community is just beginning to evangelize this new market and new phase for the Web. Though some prognosticators earlier pointed to 2007 as the breakthrough year for the semantic Web, I actually think that moment is now. (Did you blink?)

The semantic Web through its initial structured Web expression is not about to come — it has arrived. Mark your calendar. And OpenLink’s offerings are a key enabler of this most significant Internet milestone.

A Deserving Jewels and Doubloons Winner

OpenLink, its visionary head, Kingsley Idehen, and full team are providing real leadership, services and tools to make the semantic Web a reality. Providing such a wide range of capable tools, middleware and RDF database technology as open source means that many of the prior challenges of linking together a working production pipeline have now been met and are affordable. For these reasons, I gladly announce OpenLink Software as a deserving winner of the (highly coveted, but still cheesy! :) ) AI3 Jewels & Doubloons award.

Jewels & Doubloons An AI3 Jewels & Doubloons Winner

[1] A general history of OpenLink Software and the Virtuoso product may be found at http://virtuoso.openlinksw.com/wiki/main/Main/VOSHistory.

[2] Most, if not all, of my reviews and testing in this AI3 blog focus on open source. That is because of the large percentage of such tools in the semantic Web space, plus the fact that I can install and test all comparable alternatives locally. Thus, there may indeed be better performing or more functional commercial software than what I typically cover.

[3] In terms of disclosure, I have no formal relationship with OpenLink Software, though I do informally participate with some OpenLink staff on some open source and open data groups of mutual interest. I am also actively engaged in testing the company’s products in relation to other products for my own purposes.

[4] An informative introduction to OpenLink and its CEO and founder, Kingsley Idehen, can be found in this April 28, 2006 podcast with Jon Udell, http://weblog.infoworld.com/udell/2006/04/28.html#a1437. Other useful OpenLink screencasts include:

[5] I am not suggesting that single-supplier solutions are unqualifiably better. Some components will be missing, and other third-party components may be superior. But it is also likely the case that a first, initial deployment can be up and running faster the fewer the interfaces that need to be reconciled.

[6] I have to say that Stefano Mazzocchi comes to mind with efforts such as Apache Cocoon and “RDFizers”.

[7] Thus, with ODS you get the additional benefit of having SIOC act as a kind of “gluing” ontology that provides a generic data model of Containers, Items, Item Types, and Associations / Relationships between Items). This approach ends up using RDF (via SIOC) to produce the instance data that makes the model conceptually concrete. (According to OpenLink, this is similar to what Apple formerly used in a closed form in the days of EOF and what Microsoft is grappling with ADO.net; it is also a similar model to what is used by Freebase, Dabble, Ning, Google Base, eBay, Amazon, Facebook, Flickr, and others who use Web services in combination or not with proprietary languages to expose data. Dave Beckett’s recently announced Triplr works in a similar manner.) This similarity of approach is why OpenLink can so readily produce RDF data from such services.

[8] I think these not-yet-fully formed constructs for interacting with RDF and using SPARQL are a legacy from the early semantic Web emphasis on “machines talking to machines via data”. While that remains a goal since it will eventually lead to autonomous agents that work in the background serving our interests, it is clear that we humans need to get intimately involved in inspecting, transforming and using the data directly. It is really only in the past year or three that workable user interfaces for these issues have even become a focus of attention. Paradoxically, I believe that sufficient semantic enabling of Web data to support a vision of intelligent and autonomous agents will only occur after we have innovated fun and intuitive ways for us humans to interact and manipulate that data. Whooping it up in the data playpen will also provide tremendous emphasis to bring still-further structure to Web content. These are actually the twin challenges of the structured Web.

[9] Stored procedures first innovated for SQL have been abstracted to support applications written in most of the leading programming languages. Virtuoso presently supports stored procedures or classes for SQL, Java, .Net (Mono), PERL, Python and Ruby.

[10] Many relational databases have been used for storing RDF triples and graphs. Also dedicated non-relational approaches, such as using bitmap indices as primary storage medium for triples, have been implemented. At present, there is no industry consensus on what constitutes the optimum storage format and set of indices for RDF.

[11] The W3C has a very informative article on RDF and traditional SQL databases; also, there is a running tally of large triple store scaling.

[12] Thanks to Ted Thibodeau, Jr, of the OpenLink staff, “This is basically true, but incomplete. In its Application Server role, Virtuoso can host runtime environments for PHP, Perl, Python, JSP, CLR, and others — including ASP and ASPX. Some of these are built-in; some are handled by dynamically loading local library resources. One of the bits of magic here — IIS is not required for ASP or ASPX hosting. Depending on the functionality used in the pages in question, Windows may not be necessary, either. CLR hosting may mandate Windows and Microsoft's .NET Frameworks, because Mono remains an incomplete implementation of the CLI.” Very cool.

Posted by AI3's author, Mike Bergman

Posted on April 16, 2007 at 2:33 pm in Adaptive Information, Jewels & Doubloons, Open Source, Semantic Web, Semantic Web Tools, Structured Web | Comments (8)
The URI link reference to this post is: http://www.mkbergman.com/355/openlink-plugs-the-gaps-in-the-structured-web/
The URI to trackback this post is: http://www.mkbergman.com/355/openlink-plugs-the-gaps-in-the-structured-web/trackback/
Date:   March 15, 2007

Bill Aron's The Scribe, from http://www.puckergallery.comOK, you lurkers. You know who you are. You hang out on open source forums, learning, gleaning, scheming . . . . You want to dive in, make contributions, but the truth is that the key developers on the project are really quite remarkable and have programming skills you can’t touch. And, so, you lurk, and maybe occasionally comment.

On the other hand, you are likely a user and implementer. And you probably share with me the observation, which I make frequently here and elsewhere, that one of the biggest weaknesses of most open source projects is their (relative) lack of documentation.

Maybe you’re also no great shakes as a programmer. But you can write! (Actually, overcoming the fear to write is a still lower threshold condition to contributing code to an open source project). I don’t know if this picture is true for you, but it is true for me.

But there’s hope and there’s a role waiting for you. If you begin contributing documentation to a project, it won’t be long before you start discovering some valuable secrets:

Secret #1. Those developers you admire so much hate to document and will thank you for it. It is actually the rare case where a great developer is also a great writer and communicator.

Secret #2. While you’re afraid of actually touching the code, getting into the app to write about how to use it or its nuances will bring you up the learning chain tremendously! Just as it is a truism that to learn a subject one needs to teach it, to learn about software code one should document it.

Secret #3. If, like me, you are not a natural programmer, then influencing those who are able to tackle some of the development issues of key concern to you is perhaps a more effective use of time than a direct frontal assault on the code itself. Writing documentation is one way to perhaps play to your greater strengths while still making a valuable contribution.

Secret #4. Committed developers behind every worthwhile open source project are typically creative and innovative. It is always rewarding to interact with smart, creative people. Starting to become a documentation cog within a broader open source wheel is personally and socially rewarding.

Finally, the total scope of documentation required by a project also includes user support and response to user forums. If you can also pick up some of the slack answering questions from newbies, you will also be doing the overall project a favor by freeing up valuable developer time.

Every open source project worth its promise has way, way too much that needs to be done and the community desires. You don’t have to be a world-class code jockey to make a meaningful contribution. So, lurkers and writers unite! Roll up your sleeves and get that quill wet.

Posted by AI3's author, Mike Bergman

Posted on March 15, 2007 at 3:50 pm in Open Source, Software Development | Comments (0)
The URI link reference to this post is: http://www.mkbergman.com/348/lurkers-and-writers-unite/
The URI to trackback this post is: http://www.mkbergman.com/348/lurkers-and-writers-unite/trackback/
Page 11 of 17« First...910111213...Last »
Copyright © 2004–2013 Michael K. Bergman.   This work is licensed under a Creative Commons License