Posted:July 18, 2007
Image from Paul Thiessan
The structured Web is object-level data within Internet documents and databases that can be extracted, converted from available forms, represented in standard ways, shared, re-purposed, combined, viewed, analyzed and qualified without respect to originating form or provenance.

Over the past few months I have increasingly been writing about and referring to the structured Web. I have done so purposefully, but, so far, with little background or explication. With the inauguration of this occasional series, I hope to bring more color and depth to this topic [1].

Literally, over the past year, I have been learning and documenting on AI3 my attempts to understand the basis, concepts and tools of the emerging semantic Web. In that process, I have come to define my own outlines of the Web past, present and future. Within this world view, I see the structured Web as today’s current imperative and reality.

Confusing Terminology Surrounding Obvious Change

Some Web pundits have embraced a versioning terminology of Web 2.0 and Web 3.0 to describe one such world view. I don’t personally agree with this silly versioning — indeed I poked fun in a tongue-in-cheek posting about Web 98.6 more than a year ago — but such terminology has gotten some traction and serves a purpose. I actually give my own definitions for such “versions” below if for no other reason than to close the gap with alternative world views.

We need not go back to the alternative early protocols of Usenet (and news groups), Gopher and FTP and their search engines of Veronica, WAIS, Jughead or Archie in 1991 [2] when Tim Berners-Lee first publicly announced the World Wide Web and its combination of hypertext with the Internet. More likely, the release of the Mosaic browser and CERN‘s decision to make access to the Web free in 1993 marked the true take-off point for the Web and the continued demise of the competing protocols.

Images and links in Web pages (“documents”) plus the HTML mark-up language to enable the styling and graphical design of those pages were very much in keeping with general trends, paralleling the earlier transition of personal computers to graphical interfaces and away from terminals. Mosaic became the foundation for the Netscape browser, best links compilations became a big hit through sites like Yahoo!, and the Lycos search engine, one of the first profitable Web ventures, indexed a mere 54,000 pages when it was publicly released in 1994 [3].

This initial start to the Web — today now referred to by some as ‘Web 1.0′ — can be squarely timed to 1993-1994. By 1995, the Web was appearing on the covers of major news magazines and by 1996 the phenomenon was at full throttle. But, since these early beginnings, the Web has gone through many different “versions” and transitions, most not fitting with version numbers, as some of these examples show:

  • Academic v. Commercial Web — magazines like Wired, Red Herring, Business 2.0 and the mainstream press showered us with names such as e-commerce, dot-com and the gold rush for companies to establish a Web presence, B2B, etc. in the latter part of the 1990s. In fact, for some early architects of the Web, this was a period of some trauma and handwringing, since the “pure” open and academic roots of the Internet and the Web were being taken over by mainstream use, commercialization and the monied dominance of venture capital. This first major change in the Web, its first major new ‘version’ if you will, came back down to earth as a result of the “dot-com bust” of the bubble in 2001 [4]
  • Static v. Dynamic Web — all initial Web content was based on documents created by hand and posted as individual and hyperlinked Web pages. The relatively few documents of the early Web meant that hand-compiled “best of” listings such as Yahoo! worked pretty well; ‘metasearchers‘ also emerged to overcome the limited indexing coverage of early search engines. These trends, however, were also masking another version sea-change for the Web. With growth and more content, many larger sites were moving to dynamic page generation with retrieval via search forms. This dynamic portion of the Web, called at times either the ‘deep Web‘ or ‘invisible Web,’ acted like standard search engines and therefore was generally overlooked until I first popularized this change in 2000 [5]. I would argue that the shift to dynamic content, with certainly hundreds of thousands of such database-backed sites now in existence — and content many times larger than what is indexed by standard search engines — was also a major version shift for the Web
  • Open Source and Open Data –the open source Linux and the Apache Web server have been two software foundations to the growth of the Web, and MySQL has had a leading role in supporting sites and software with database-backed designs [6]. It is beyond the scope of this piece, but I believe that the dot-com frenzy, the demise of Netscape by Internet Explorer and other tensions with commercial interests, plus the very empowering nature of the Internet itself are also leading to a version change of the Web from commercial software products to open source ones. Further, proprietary publishers and data sources have only had limited success on the Web; we are now seeing strong trends to open data as well. Additionally, the very nature of open source software lends itself to interoperability and modularity based on naturally selected building blocks. This “open” infrastructural basis of the Web is more subtle and hard to see, but provides some powerful drivers for how more surface-oriented trends express themselves
  • Social Networking Web — the same early software that enabled dynamic Web pages and database-backed designs naturally lent themselves to early blogs, wikis and content-management systems, many backed by MySQL, which in turn led to more community-oriented designs and services such as for bookmarking, Flickr for photos, later YouTube for videos, and literally thousands of others. This trend, resulting from changed practices and the use of different tools and ways to harness user-generated content, and not resulting from any changes to standards per se, was first called ‘Web 2.0′ by Tim O’Reilly in about 2003
  • Ajax and Widgets — some would include Web services, APIs and ‘mashups‘ in the Web 2.0, often as expressed through embedded Web ‘widgets‘ and the use of Ajax or similar dynamic scripting approaches. These considerations were not part of the original Web 2.0 term, but usage today likely embraces aspects of these in many definitions of Web 2.0. In any case, there is certainly a change within the Web to more interactive, attractive, full-featured user interfaces, with interface updates no longer requiring a full Web page refresh
  • Document-centric Web v. Data-centric Web — however, in any event, portions of these trends and changes are more broadly combining to represent another version change in the Web from one solely focused on documents to one that is more data-centric; this topic, the basis for the term ‘structured Web,’ is more fully discussed below
  • Web 3.0 — Wikipedia states, “Web 3.0 is a term that has been coined with different meanings to describe the evolution of Web usage and interaction among several separate paths. These include transforming the Web into a database, a move towards making content accessible by multiple non-browser applications, the leveraging of artificial intelligence technologies, the Semantic web, or the Geospatial Web.” Of all current terms, this one is fully the silliest, since there is no consensus on what it represents nor its endpoints
  • Semantic Web — the glossary at W3C states that the semantic Web is “the Web of data with meaning in the sense that a computer program can learn enough about what the data means to process it.” Elsewhere, the vision of the semantic Web is described by the Education and Outreach working group (SWEO) of the W3C “to extend principles of the Web from documents to data. This extension will allow to fulfill more of the Web's potential, in that it will allow data to be shared effectively by wider communities, and to be processed automatically by tools as well as manually.” Note the importance of computer processing and autonomy in these statements, not to mention the pivotal term of ‘semantics.’ This is an expansive and wide-embracing vision, some challenges of which I more fully describe below, and
  • Visions of the Web — the semantic Web vision is matched with other visions, including voice activation, autonomous agents doing our bidding in the background, wireless interlinked everything, and other versions of the Web that are sometimes portrayed in science fiction. Whenever such transitions occur, they will all surely rely on all the various “versions” of the Web that have occurred in the short past 15 years of the Web’s existence.

Despite these differences in viewpoint, language does matter. Though some may view language as a contest in “branding,” which can legitimately apply in other venues, I think the issue here goes well beyond “branding.” Language is also necessary to aid communication.

As I explain below and elaborate upon more fully throughout this series, I believe one of the correct terms for the current evolutionary state of the Web is the ‘structured Web.’

A Clear Transition to a Data-centric Web

As noted, portions of these trends and changes are more broadly combining to represent another transitional change in the Web from one solely focused on documents to one that is more object- or data-centric. Evidence of this trend includes such factors as:

  • Broad database-backed Web site designs, with content re-purposed and served up dynamically, the trend first noted as the ‘deep Web’
  • ‘Mashups’ of data from multiple sources, such as in maps, timelines, etc.
  • The exposure of Web services and APIs. The, for example, documents a doubling of such sources in the past nine months via its listing (as of July 2007) of about 500 APIs and more than 2,100 mashups
  • Huge growth and availability of large, often public, data sources, from US government and social sources like DBpedia, an RDF data extraction from Wikipedia (and others)
  • The emergence of entire data-centric sources, services and mashup platforms such as Freebase, Yahoo! Pipes, Google Base, Teqlo, QEDwiki, Ning, and OpenKapow
  • The rapid — and now almost universal — availability of data format converters (mostly to RDF) such as the listings of the W3C’s RDF Converters and MIT’s ‘RDFizers,’ the GRDDL initiative, Triplr, and the like
  • Soon, other to-be announced major data source look-up references, directories and conversion and filtering services.

One of the most popular series of presentations at this year’s WWW2007 conference in Banff was from the Linked Open Data project of the SWEO interest group. The members of this LOD project — comprised of accomplished advocates, developers and theorists — are providing the awareness, tools and example data that are showing how this emerging version may look. In fact, the group has just announced crossing the threshold of 1 billion ‘triples’ with 180,000 interlinks within its online DBpedia service, via these sources:

The LOD’s term for this effort is ‘linked data‘, and a Web site has been established to promote it. Others, harking back to Tim Berners-Lee’s original definition, refer to current efforts as a ‘Web of data’ or the ‘Semantic Data Web.’ Kingsley Idehen has been promoting the idea of ‘data spaces‘ — personal and collective — that is also a powerful metaphor.

Frankly, I think all of these terms are correct and useful. Yet I prefer the term structured Web because it is both more and less than some of these other terms.

The structured Web is more in that it pertains to any data formalism in use on the Web and includes the notion of extracting structure from uncharacterized content, by far the largest repository of potentially useful information on the Web. Yet the structured Web is also less because its ambition is solely to get that data into an interoperable framework and to forgo the full objectives of the ‘Semantic Web.’ In that regard, my concept of the structured Web is perhaps closest to the idea of linked data, though with less insistence on “correct” RDF and with specific attention to structure extraction from uncharacterized content.

Remarkable Progress on a Still Incomplete Journey

One of today’s realities is that we have accomplished much but still have a long way to go to achieve the grand vision of the ‘Semantic Web’ (capitalized).

More than a year ago I wrote a piece on “Climbing the Data Federation Pyramid” that noted the tremendous progress that has been made in the last twenty years in overcoming many seemingly intractable issues in data interoperability, initially of a physical and hardware nature. The Internet and Web standards have made enormous contributions to that progress.

The diagram I used in that piece is shown below [7]. Reaching the pyramid’s pinnacle could be argued as having achieved the grand vision of the Semantic Web. With the adoption of the Internet and Web protocols, all layers up through data representation have largely been solved. Data representation, data models, schema for different world views, and means for reconciling and mediating those different world views are much of the focus of today’s conceptual challenges.

Note, as we discuss the structured Web that we are largely focusing on the layer dealing with data representation, with some minor portions (principally in disambiguation) dealing with semantics. Getting data into a canonical data representation or model still leaves very crucial challenges in what does the data mean (its semantics), reasoning over that data (inference and pragmatics), and whether the data is authoritative or can be trusted. These are the daunting — and largely remaining challenges — of the Semantic Web.

For example, let’s look solely at the layer of semantics, the immediate challenge after data representation. By semantics, we are referring to whether different statements from different sources indeed refer or not to the same entity or concept; in other words, have the same meaning. Such a determination is pivotal if we are to combine data from multiple sources.

The use of RDF, accurate name spaces and syntactically correct URIs aid this resolution, but do not completely solve it. Ultimately, semantic mediation (such as my “glad” is equivalent to your “happy”) means resolving or mediating potential heterogeneities from on the order of 40 discrete categories of potential mismatches from units of measure, terminology, language, and many others. These sources may derive from structure, domain, data or language, as shown in this table [8]:

Class Category Sub-category


Case Sensitivity
Generalization / Specialization
Aggregation Intra-aggregation
Internal Path Discrepancy
Missing Item Content Discrepancy
Attribute List Discrepancy
Missing Attribute
Missing Content
Element Ordering
Constraint Mismatch
Type Mismatch
DOMAIN Schematic Discrepancy Element-value to Element-label Mapping
Attribute-value to Element-label Mapping
Element-value to Attribute-label Mapping
Attribute-value to Attribute-label Mapping
Scale or Units
Data Representation Primitive Data Type
Data Format
DATA Naming Case Sensitivity
ID Mismatch or Missing ID
Missing Data
Incorrect Spelling
LANGUAGE Encoding Ingest Encoding Mismatch
Ingest Encoding Lacking
Query Encoding Mismatch
Query Encoding Lacking
Languages Script Mismatches
Parsing / Morphological Analysis Errors (many)
Syntactical Errors (many)
Semantic Errors (many)

Using the same data model (say, RDF) or the same name spaces (say, Dublin Core or FOAF) helps somewhat to remove some of these sources of heterogeneity, but not all. Undoubtedly, longer term, resolving these heterogeneities will prove tractable. But they are not easily so today.

This observation does not undercut the Semantic Web vision nor negate the massive labors in support of that vision taken to date. But, hopefully, this observation may bring some perspective to the task ahead to obtain that vision.

Lowering Our Sights

If nothing else, the reality of the past 15 years shows us that the Web is a “dirty,” chaotic place. If HTML coding can be screwed up, it will. If loopholes in standards and protocols exist, they will be exploited. If there is ambiguity, all interpretations become possible, with many passionately held. Innovation and unintended uses occur everywhere.

This should not be surprising, and experienced Web designers, scientists and technologists should all know this by now. There can be no disconnect between workable standards and approaches and actual use in the “wild.” Nuanced arguments over the subtleties of standards and approaches are bound to fail. Robustness, simplicity and forgiveness must take precedence over elegance and theoretical completeness.

While there has been obvious growth in the sophistication of Web sites and the underlying technologies that support them, we see continued use of obsolete approaches that clearly should have been abandoned long ago (such as Web-safe colors, small displays, older browser versions, Web pages parked on some servers that have not been modified or looked at by their original authors in a decade, etc.). We also see slow uptake for clearly “better” new approaches. And we also sometimes see explosive uptake of approaches and ideas that seemingly come out of nowhere.

We also see that those approaches that enjoy the greatest success — blogging, tagging, microformats, RSS, widgets, for example, come most recently to mind — are those that the “citizen” user can easily and readily embrace. HTML was pretty foreign at first, but now millions of users modify their own code. Millions of users of various CMS systems and Firefox have learned how to install plug-ins and add-ins and modify CSS themes and use administration consoles.

So, my observation and argument is not that we must always choose what is mindless and unchallenging. But my argument is that we must accept real-world diversity and seek simplicity, robustness and clarity for what is new.

After nearly a decade of standards work, the basis for beginning the transition to the semantic Web is in place. But that vision itself sometimes appears too demanding, too intimidating. The vision at times appears all too unreachable.

Of course, this perception is wrong. Measured over many years, perhaps some decades, the vision of the semantic Web is reachable. Much remains to be worked on and understood regarding this vision in terms of mediating and resolving semantic heterogeneities, capturing and expressing world views through formal ontologies, making inferences between these views, and establishing trust and authoritativeness. And those challenges do not yet address the even more-exciting prospects of intelligent and autonomous agents.

Rather, the rationale for the structured Web is to tone down the vision, stay with the here and now, focus on what is achievable today. And what is achievable today is very great.

Why This Series on the ‘Structured Web‘?

Though version numbers for the Web are silly, with ‘Web 3.0′ for the semantic Web possibly being the silliest of all, such attempts do speak to the need for providing handles and language for capturing the dynamic change, diversity and complexity of the Web.

Today, right now, and all around us, a fundamental transition is taking place in the Web from a document-centric to a data-centric environment. A confluence of standards, advocacies, and previous trends are fueling this transition. Since the practical building blocks already exist, we will see this structured Web unfold before us at amazing speed.

The concept of the structured Web is thus narrower and less ambitious in scope than the ‘Semantic Web.’ The structured Web is merely a transitional step on the journey to the vision of the semantic Web, albeit one that can be fully realized today with current technologies and current understandings.

The purpose of this new series is thus to give prominence to this transition and to highlight the pragmatic, practical building blocks available to contribute to this transition. By somewhat shifting boundary definitions, the idea of the structured Web also aims to give more prominence to the importance of usability and structure extraction from semi-structured and unstructured content. These, too, are exciting areas with much potential.

So, as a way to provide a touchstone for continued discussion on this matter, here is one working definition of the structured Web:

The structured Web is object-level data within Internet documents and databases that can be extracted, converted from available forms, represented in standard ways, shared, re-purposed, combined, viewed, analyzed and qualified without respect to originating form or provenance.

Anticipated Topics in this Series

Some of the tentative topics that I plan to address in this series include discussion of what constitutes ‘structure’ in content, why structure is important, the various existing forms of structure, human v. machine bases for viewing and interpreting structure, the importance of finding “canonical” representation forms while also appreciating real-world diversity, the means to convert data forms and serializations, the means to extract structure from all types of content, transitioning to semantic understandings, and likely others.

Others may be added to this series over time and will be categorized under ‘Structured Web‘ on the AI3 blog.

This posting is the first part of a new, occasional series on the Structured Web, which also has its own new category. There are some additional prior topics in this series.

[1] You will note a heavy emphasis on Wikipedia definitions and histories in this piece, in keeping with the general theme of versions and transitions on the Web.

[2] News groups really did not have a good search engine until the launch of Deja News in 1995.

[3] Chris Sherman, "Happy Birthday, Lycos!," Search Engine Watch, August 14, 2002. See

[4] A fairly good summary of the History of the Web can be found on Wikipedia.

[5] Michael K. Bergman (Aug 2001). “The Deep Web: Surfacing Hidden Value“. The Journal of Electronic Publishing 7 (1). An earlier version of this paper was published by BrightPlanet Corp. in July 2000.

[6] While there are variations, Linux, Apache, MySQL and the scripting languages of either Python, PHP, or Perl are often referred to as ‘LAMP‘, one central basis for much open source software and, more broadly, interoperable open-source application packages.

[7] I would make a few changes today, notably in deprecating XML somewhat.

[8] This table builds on Pluempitiwiriyawej and Hammer's schema by adding the fourth major category of language. See Charnyote Pluempitiwiriyawej and Joachim Hammer, "A Classification Scheme for Semantic and Schematic Heterogeneities in XML Data Sources," Technical Report TR00-004, University of Florida, Gainesville, FL, 36 pp., September 2000. See

Posted:July 12, 2007


A Reference for Data Set Interoperability, Look Up and Retrieval

UMBEL is a lightweight way to describe the subject(s) of Web content, akin to the relationship “isAbout”. Its subject reference structure is meant to be simple, universally applicable, and agnostic to the form or schema of source data. UMBEL does not replace formal domain or upper ontologies and has little or no inferential power. It is merely a pool of consensus ‘proxies’ to initially describe what subjects data sets are about.

UMBEL’s design includes binding mechanisms that work with HTML, tagging or other standard practices, including various RDF schema and more formal ontologies. Its reference subject ‘backbone’ is derived from the intersection of common subjects found on popularly used Web sites and other accepted subject references. Access and easy adoption is given preference over inferential or logical elegance.

In addition to its core reference subjects, the UMBEL project will provide look up, query, registration, pinging, and related services. The project is completely open and supported by a community process. All project products are made available without charge under Creative Commons licenses. UMBEL’s development is being backed by a number of leading open data efforts and entities; see the last section for how to get involved.

The UMBEL project stands for the Upper-level Mapping and Binding Exchange Layer. UMBEL is pronounced like “humble” — in keeping with its nature — except without the “h”. The name has the same Latin root as umbrella (umbra for shade, or umbella for parasol), meant to convey the umbrella-like nature of UMBEL’s subject bindings.

What’s the Problem?

With dozens of protocols and hundreds of thousands of potentially useful data sets, there are many challenges to getting Web data to interoperate. Two of these problems are foundational.

First, there are dozens of formalisms, schema, models and serializations for characterizing and communicating data and data content on the Web, ranging from the simplest Web page to the most formal OWL ontologies. A universal mechanism is lacking for how these variations can describe or publish to one other what they are about. This mechanism must be simple, neutral, broadly applicable and widely accepted.

Second, even if this publication mechanism existed, there is no accepted set of subjects for referencing what this diverse content is about. No attempt to date to provide a reference subject structure has been widely accepted.

Combined, these twin problems mean there are few road signs and poor road maps for how to find relevant data sets on the Web. UMBEL provides simple — but necessary — first steps to address these basic problems.

Simple, with Low Expectations

Advocates and users of various models and formalisms on the Web have their real-world reasons for embracing each form. Domain experts and various communities have their own world views, represented by their own vocabularies and structure. Only by understanding and respecting those differences can means to bridge them become widely accepted.

There is, of course, no such thing as complete objectivity or neutrality. But, from the standpoint of UMBEL and its purpose, keeping its approach simple with a minimum of structure poses the least challenge to the world views of existing publishers and data sets on the Web — and therefore the best likelihood of wide acceptance. Where choices are necessary, such as the selection of the reference subjects themselves, building from accepted Web practices and norms helps minimize bias and arbitrariness.

Thus, by necessity, UMBEL must be simple with limited ambitions. Its reference structure is merely a ‘bag of subjects’, with each subject reference only acting as a ‘proxy’ to a set of concepts that specific users may describe and refer to in their own ways. UMBEL’s core structure is completely flat, with no implied hierarchy or structure amongst its reference subjects. UMBEL’s reference subjects are simply that, proxy references and no more.

UMBEL thus has no or minimal inference power (though some disambiguation is possible). Inferencing, usefulness and authoritativeness are the responsibility of others. UMBEL is meant only to be a map to possible subjects, not whether those destinations are worthwhile or, indeed, even correct.

Consensus and Use Determine the Subject Pool

The selection of the actual subject proxies within the UMBEL core are to be based on consensus use. The subjects of existing and popular Web subject portals such as Wikipedia and the Open Directory Project (among others) will be intersected with other widely accepted subject reference systems such as WordNet and library classification systems (among others) in order to derive the candidate pool of UMBEL subject proxies. The actual methodology and sources of this process are still being determined (see further the project specification).

The objective, in any case, is to provide a simple and transparent method for subject selection that reflects current use and consensus to the maximum extent possible. The anticipation is that the first subject candidate pool will number in the many hundreds to the low thousands of proxies.

UMBEL as a general subject ‘backbone’ is meant to be useful as a reference by more specific domains or ontologies, but not fully descriptive for any of them. The core, internal UMBEL ontology is to be based on RDF and written in the RDF Schema vocabulary of SKOS (Simple Knowledge Organization System).

Universal Applicability

Very simple binding mechanisms will be developed and extended to the most widely employed approaches on the Web. UMBEL will, at minimum, support Atom, microformats, OPML, OWL, RDF, RDFa, RDF Schema, RSS, tags (via Tag Commons), and topic maps in its first release. The simplicity of the ontology and approach will enable other formats to be easily added.

Ping, update and registration protocols will also be provided for these formats. Existing project sponsors already possess a variety of ping, update, conversion and translation utilities for such purposes.

Additional UMBEL Initiatives

Besides the core structure, the UMBEL project will also develop a second ‘unofficial’ structure of hierarchical and interlinked subject relationships. This ‘unofficial’ structure will be used solely for look up and browsing functions, and will reside external to the core UMBEL subject and binding structure. Indeed, we anticipate that many such look-up structures from other parties may evolve over time for specific purposes and viewpoints.

Finally, besides development of the UMBEL ontology, the project will also be providing a data set registration service, information and collaboration Web site, tools clearinghouse, and support for language translations and some tools development.

How to Help and Get Involved

The initial project site is at, including this project introduction, the draft project specification (, and other helpful background information. A more interactive Web site is currently under development and will be announced shortly.

A mailing list you can monitor or join to become part of the project is at

Posted by AI3's author, Mike Bergman Posted on July 12, 2007 at 3:36 pm in Adaptive Information, Semantic Web, Structured Web, UMBEL | Comments (3)
The URI link reference to this post is:
The URI to trackback this post is:
Posted:June 19, 2007

Sweet Tools Listing

AI3′s Sweet Tools Listing Updated to Version 9

This AI3 blog maintains Sweet Tools, the largest listing of about 800 semantic Web and -related tools available. Most are open source. Click here to see the current listing!

AI3‘s listing of semantic Web and -related tools has just been updated to version 9. This version adds 42 new tools since the last update on March 11, bringing the total to 542 tools. As noted in the last update, the compilation is now largely complete. These new listings are mostly newly announced tools in the last three months, though some are stragglers missed from previous listings.

As before, the updated Sweet Tools listing is provided both as a filterable and sortable Exhibit display (thanks again, David Huynh and MIT’s Simile program) and as a simple table for quick download and copying. (Hint: sort on ‘Posted’ for 6/19/07 to see the 42 latest additions.)

Background on prior listings and earlier statistics may be found on these previous posts:

With interim updates periodically over that period.

Posted:June 1, 2007

A remarkable 7-min video demonstration from a recent TED talk by Blaise Agüera y Arcas has just been posted online. If any of you have wondered what benefits a linked Web of data might bring, this is the talk to see. Simply stated, all I can say is, Wow!:

Photosynth TED Presentation

The actual TED talk link for Blaise’s presentation is found here.

The first remarkable piece of technology in this demo is from Seadragon, founded by Blaise and acquired last year by Microsoft. This part of the technology allows visual information to be smoothly browsed, panned or zoomed regardless of the amount of data involved or the bandwidth of the network. Watch as you zoom from all the pages of an entire book to an individual letter! This part of the technology has obvious applications to maps, but frankly, to any large data space.

The second portion of the technology, and perhaps an even more remarkable part of the demo, marries this part with a means to aggregate multiple images into an interactive, immersive, 3-D whole. This technology, originally developed at the University of Washington, is called Photosynth.

The Photosynth software takes a large collection of photos of a place or an object, analyzes them for similarities, and then displays the photos in a reconstructed three-dimensional space, showing you how each one relates to the next. This software is capable of assembling static photos — in the case of the demo, hundreds of photos of the Notre Dame cathedral publicly available on Flickr — into a synergy of zoomable, navigable, interactive and “immersive” spaces.

This example shows how individual data “objects” sharing a similar tag can be aggregated and put to a use as a new “whole” that is much, much greater than the sum of its parts. It is this kind of emergent quality that gives the promise of the structured Web and the semantic Web its power: much more can be done with linked data than with individual documents.

To better understand the amazing technology behind all of this magic, I recommend the geeks in the crowd check out this earlier and longer (37 min) video on Blaise and his work at Microsoft Live Labs; it is found at the Channel 9 MSDN site.

Thanks to Christian Long and his think:lab blog for first posting about this TED talk.

Posted by AI3's author, Mike Bergman Posted on June 1, 2007 at 11:41 am in Adaptive Information, Semantic Web | Comments (0)
The URI link reference to this post is:
The URI to trackback this post is:
Posted:May 29, 2007

Donald Knuth's Road Sign

The Why, How and What of the Semantic Web are Becoming Clear — Yet the Maps and Road Signs to Guide Our Way are Largely Missing

There has been much recent excitement surrounding RDF, its linked data, “RDFizers” and GRDDL to convert existing structured information to RDF. The belief is that 2007 is the breakout year for the semantic Web.

The why of the semantic Web is now clear. The how of the semantic Web is now appearing clear, built around RDF as the canonical data model and the availability and maturation of a growing set of tools. And the what is also becoming clear, with the massive new datastores of DBpedia, Wikipedia3, the HCLS demo, Musicbrainz, Freebase, and the Encyclopedia of Life only being some of the most recent and visible exemplars.

Yet the where aspect seems to be largely missing.

By where I mean: Where do we look for our objects or data? If we have new objects or data, where do we plug into the system? Amongst all possible information and domains, where do we fit?

These questions seem simplistic or elemental in their basic nature. It is almost incomprehensible to wonder where all of this data now emerging in RDF relates to each other — what the overall frame of reference is — but my investigations seem to point to such a gap. In other words, a key piece of the emerging semantic Web infrastructure — where is this stuff — seems to be missing. This gap is across domains and across standard ontologies.

What are the specific components of this missing where, this missing infrastructure? I believe them to be:

  • Lack of a central look-up point for where to find this RDF data in reference to desired subject matter
  • Lack of a reference subject context for where this relevant RDF data fits; where can we place this data in a contextual frame of reference — animal, mineral, vegetable?
  • Lack of a open means where any content structure — from formal ontologies to “RDFized” documents to complete RDF data sets — can “bind” or “map” to other data sets relevant to its subject domain
  • Lack of a registration or publication mechanism for data sets that do become properly placed, the where of finding SPARQL or similar query endpoints, and
  • In filling these gaps, the need for a broad community process to give these essential infrastructure components legitimacy.

I discuss these missing components in a bit more detail below, concluding with some preliminary thoughts on how the problem of this critical infrastructure can be redressed. The good news, I believe, is that these potholes on the road to the semantic Web can be relatively easily and quickly filled.

The Lack of Road Signs Causes Collisions and Missed Turns

I think it is fair to say that structure on the current Web is a jumbled mess.

As my recent Intrepid Guide to Ontologies pointed out, there are at least 40 different approaches (or types of ontologies, loosely defined) extant on the Web for organizing information. These approaches embrace every conceivable domain and subject. The individual data sets using these approaches span many, many orders of magnitude in size and range of scope. Diversity and chaos we have aplenty, as the illustrative diagram of this jumbled structural mess shows below.

Jumbled and Diverse Formalisms
[Click on image for full-size pop-up]

Mind you, we are not yet even talking about whether one dot is equivalent or can be related to another dot and in what way (namely, connecting the dots via real semantics), but rather at a more fundamental level. Does one entire data set have a relation to any other data set?

Unfortunately, the essential precondition of getting data into the canonical RDF data model — a challenge in its own right — does little to guide us as to where these data sets exist or how they may relate. Even in RDF form, all of this wonderful RDF data exists as isolated and independent data sets, bouncing off of one another in some gross parody of Brownian motion.

What this means, of course, is that useful data that could be of benefit is overlooked or not known. As with problems of data silos everywhere, that blindness leads to unnecessary waste, incomplete analysis, inadequate understanding, and duplicated effort [1].

These gaps were easy to overlook when the focus of attention was on the why, what and how of the semantic Web. But, now that we are seeing RDF data sets emerge in meaningful numbers, the time is ripe to install the road signs and print up the maps. It is time to figure out where we want to go.

The Need for a Lightweight Subject Mapping Layer

As I discussed in an earlier posting, there’s not yet enough backbone to the structured Web. I believe this structure should firstly be built around a lightweight subject- or topic-oriented reference layer.

An umbrella subject reference becomes the “super-structure” to which other specific ontologies can place themselves in an “info-spatial” context.

Unlike traditional upper-level ontologies (see the Intrepid Guide), this backbone is not meant to be comprised of abstract concepts or a logical completeness of the “nature of knowledge”. Rather, it is meant to be only the thinnest veneer of (mostly) hierarchically organized subjects and topic references (see more below).

This subject or topic vocabulary (at least for the backbone) is meant to be quite small, likely more than a few hundred reference subjects, but likely less than many thousands. (There may be considerably more terms in the overall controlled vocabulary to assist context and disambiguation.)

This “umbrella” subject structure could be thought of as the reference subject “super-structure” to which other specific ontologies could place themselves in a sort of locational or “info-spatial” context.

One way to think of these subject reference nodes is as the major destinations — the key cities, locations or interchanges — on the broader structured Web highway system. A properly constructed subject structure could also help disambiguate many common misplacements by virtue of the context of actual subject mappings.

For example, an ambiguous term such as “driver” becomes unambiguous once it is properly mapped to one of its possible topics such as golf, printers, automobiles, screws, NASCAR, or whatever. In this manner, context is also provided for other terms in that contributing domain. (For example, we would now know how to disambiguate “cart” as a term for that domain.)

A high-level and lightweight subject mapping layer does not warrant difficult (and potentially contentious) specificity. The point is not to comprehensively define the scope of all knowledge, but to provide the fewest choices necessary for what subject or subjects a given domain ontology may appropriately reference. We want a listing of the major destinations, not every town and parish in existence.

(That is not to say that more specific subject references won’t emerge or be appropriate for specific domains. Indeed, the hope is that an “umbrella” reference subject structure might be a tie-in point for such specific maps. The more salient issue addressed here is to create such an “umbrella” backbone in the first place.)

This subject reference “super-structure” would in no way impose any limits on what a specific community might do itself with respect to its own ontology scope, definition, format, schema or approach. Moreover, there would be no limit to a community mapping its ontology to multiple subject references (or “destinations”, if you will).

The reason for this high-level subject structure, then, is simply to provide a reference map for where we might want to go — no more, no less. Such a reference structure would greatly aid finding, viewing and querying actual content ontologies — of whatever scope and approach — wherever that content may exist on the Web.

This is not a new idea. About the year 2000 the topic map community was active with published subject indicators (PSIs) [2] and other attempts at topic or subject landmarks. For example, that report stated:

The goal of any application which aggregates information, be it a simple back-of-book index, a library classification system, a topic map or some other kind of application, is to achieve the “collocation objective;” that is, to provide binding points from which everything that is known about a given subject can be reached. In topic maps, binding points take the form of topics; for a topic map application to fully achieve the collocation objective there must be an exact one-to-one correspondence between subjects and topics: Every topic must represent exactly one subject and every subject must be represented by exactly one topic.
When aggregating information (for example, when merging topic maps), comparing ontologies, or matching vocabularies, it is crucially important to know when two topics represent the same subject, in order to be able to combine them into a single topic. To achieve this, the correspondence between a topic and the subject that it represents needs to be made clear. This in turn requires subjects to be identified in a non-ambiguous manner.

The identification of subjects is not only critical to individual topic map applications and to interoperability between topic map applications; it is also critical to interoperability between topic map applications and other applications that make explicit use of abstract representations of subjects, such as RDF.

From that earlier community, Bernard Vatant has subsequently spoken of the need and use of “hubjects” as organizing and binding points, as has Jack Park and Patrick Durusau using the related concept of “subject maps” [3]. An effort that has some overlap with a subject structure is also the Metadata Registry being maintained by the National Science Digital Library (NSDL).

However, while these efforts support the idea of subjects as partial binding or mapping targets, none of them actually proposed a reference subject structure. Actual subject structures may be a bit of a “third rail” in ontology topics due to the historical artifact of wanting to avoid the pitfalls of older library classification systems such as the Dewey Decimal Classification or the Library of Congress Subject Headings.

Be that as it may. I now think the timing is right for us to close this subject gap.

A General Conceptual Model

This mapping layer lends itself to a three-tiered general conceptual model. The first tier is the subject structure, the conceptualization embracing all possible subject content. This referential layer is the lookup point that provides guidance for where to search and find “stuff.”

The second layer is the representation layer, made up of informal to “formal” ontologies. Depending on the formalism, the ontology provides more or less understanding about the subject matter it represents, but at minimum binds to the major subject concepts in the top subject mapping layer.

The third layer are the data sets and “data spaces” [4] that provide the actual content instantiations of these subjects and their ontology representations. This data space layer is the actual source for getting the target information.

Here is a diagram of this general conceptual model:

Three-tiered Conceptual Model
[Click on image for full-size pop-up]

The layers in this general conceptual model progress from the more abstract and conceptual at the upper level, useful for directing where traffic needs to go, to concrete information and data at the lower level, the real object of manipulation and analysis.

The data spaces and ontologies of various formalisms in the lower two tiers exist in part today. The upper mapping layer does not.

By its nature the general upper mapping layer, in its role as a universal reference “backbone,” must be somewhat general in the scope of its reference subjects. As this mapping infrastructure begins to be fleshed out, it is also therefore likely that additional intermediate mapping layers will emerge for specific domains, which will have more specific scopes and terminology for more accurate and complete understanding with their contributing specific data spaces.

Six Principles for a Possible Lightweight Binding Mechanism

So, let’s beg the question for a moment of what an actual reference subject structure might be. Let’s assume one exists. What should we do with this structure? How should we bind to it?

  • First, we need to assume that the various ontologies that might bind to this structure reside in the real world, and have a broad diversity of domains, topics and formality of structure. Therefore, we should: a) provide a binding mechanism responsive to the real-world range of formalisms (that is, make no grand assumptions or requirements of structure; each subject structure will be be provided as is); and b) thus place the responsibility to register or bound the subject mapping assignment(s) to the publisher of the contributing content ontology [5].
  • Second, we can assume that the reference subject structure (light green below) and its binding ontology basis (dark green) are independent actors. As with many other RESTful services, this needs to work in a peer-to-peer (P2P) manner.
  • Third, as the Intrepid Guide argues, RDF and its emergent referential schema provide the natural data model and characterization “middle ground” across all Web ontology formalisms. This observation leads to SKOS as the organizing schema for ontology integration, supplemented by the related RDF schema of DOAP, SIOC, FOAF and Geonames for the concepts of projects, communities, people and places, respectively. Other standard referents may emerge, which should also be able to be incorporated.
  • Fourth, the actual binding point of “subjects” are themselves only that: binding points. In other words, this makes them representative “proxies” not to be confused with the actual subjects themselves. This implies no further semantics than a possible binding, and no assertion about the accuracy, relevance or completeness of the binding. How such negotiation may resolve itself needs to be outside of this scope of a simple mapping and binding reference layer (see again [3]).
  • Fifth, the binding structure and its subject structure needs to have community acceptance; no “wise guys” here, just well-intentioned bozos.
  • And, sixth, keep it simple. While not all publishers of Web sites need comply — and the critical threshold is to get some of the major initial publishers to comply — the truth like everything else on the Web is that the network effect makes thinks go. By keeping it simple, individual publishers and tool makers are more likely to use and contribute to the system.

How this might look in a representative subject binding to two candidate ontology data sets is shown below:

Example Binding Structure

This diagram shows two contributing data sets and their respective ontologies (no matter how “formal”) binding to a given “subject.” This subject proxy may not be the only one bound by a given data set and its ontology. Also note the “subject” is completely arbitrary and, in fact, is only a proxy for the binding to the potential topic.

One Possible Road Map

Clearly, this approach does not provide a powerful inferential structure.

But, what it does provide is a quite powerful organizational structure. Access and the where for the sake of simplicity and adoption are given preference over inferential elegance. Thus, assuming now, again, a subject structure backbone, matched with these principles and the same jumbled structures as first noted above, we can now see an organizational order emerge from the chaos:

Lightweight Binding to an Upper Subject Structure Can Bring Order
[Click on image for full-size pop-up]

The assumptions to get to this point are not heroic. Simple binding mechanisms matched with a high-level subject “backbone” are all that is required mechanically for such an approach to emerge. All we have done to achieve this road map is to follow the trails above.

Use Existing Consensus to Achieve Authority

So, there is nothing really revolutionary in any of the discussion to this point. Indeed, many have cited the importance of reference structures previously. Why hasn’t such a subject structure yet been developed?

One explanation is that no one accepts any global “external authority” for such subject identifications and organization. The very nature of the Web is participatory and democratic (with a small “d”). Everyone’s voice is equal, and any structure that suggests otherwise will not be accepted.

It is not unusual, therefore, that some of the most accepted venues on the Web are the ones where everyone has an equal chance to participate and contribute: Wikipedia, Flickr, eBay, Amazon, Facebook, YouTube, etc. Indeed, figuring out what generates such self-sustaining “magic” is the focus of many wannabe ventures.

While that is not our purpose here, our purpose is to set the preconditions for what would constitute a referential subject structure that can achieve broad acceptance on the Web. And in the paragraph above, we already have insight into the answer: Build a subject structure from already accepted sources rich in subject content.

A suitable subject structure must be adaptable and self-defining. These criteria reflect expressions of actual social usage and practice, which of course changes over time as knowledge increases and technologies evolve.

One obvious foundation to building a subject structure is thus Wikipedia. That is because the starting basis of Wikipedia information has been built entirely from the bottom up — namely, what is a deserving topic. This has served Wikipedia and the world extremely well, with now nearly 1.8 million articles online in English alone (versions exist for about 100 different languages) [6]. There is also a wealth of internal structure within Wikipedia’s “infobox” templates, structure that has been utilized by DBpedia (among others) to actually transform Wikipedia into an RDF database (as I described in an earlier article). As socially-driven and -evolving, I foresee Wikipedia to continue to be the substantive core at the center of a knowledge organizational framework for some time to come.

But Wikipedia was never designed with an organizing, high-level subject structure in mind. For my arguments herein, creating such an organizing (yes, in part, hierarchical) structure is pivotal.

One innovative approach to provide a more hierarchical structural underpinning to Wikipedia has been YAGO (“yet another great ontology”), an effort from the Max-Planck-Institute Saarbrücken [7]. YAGO matches key nouns between Wikipedia and WordNet, and then uses WordNet’s well-defined taxonomy of synsets to superimpose the hierarchical class structure. The match is more than 95% accurate; YAGO is also designed for extensibility with other quality data sets.

I believe YAGO or similar efforts show how the foundational basis of Wikipedia can be supplemented with other accepted lexicons to derive a suitable subject structure with appropriate high-level “binding” attributes. In any case, however constructed, I believe that a high-level reference subject structure must evolve from the global community of practice, as has Wikipedia and WordNet.

I have previously described this formula as W + W + S + ? (for Wikipedia + WordNet + SKOS + other?). There indeed may need to be “other” contributing sources to construct this high-level reference subject structure. Other such potential data sets could be analyzed for subject hierarchies and relationships using fairly well accepted ontology learning methods. Additional techniques will also be necessary for multiple language versions. Those are important details to be discussed and worked out.

The real point, however, is that existing and accepted information systems already exist on the Web that can inform and guide the construction of a high-level subject map. As the contributing sources evolve over time, so could periodic updates and new versions of this subject structure be generated.

Though the choice of the contributing data sets from which this subject structure could be built will never be unanimous, using sources that have already been largely selected through survival of the “fittest” by large portions of the Web-using public will go a long ways to establishing authoritativeness. Moreover, since the subject structure is only intended as a lightweight reference structure — and not a complete closed-world definition — we are also setting realistic thresholds for acceptance.

Conclusion and Next Steps

The specific topic of this piece has been on a subject reference mapping and binding layer that is lightweight, extensible, and reflects current societal practice (broadly defined). In the discussion, there has been recognition of existing schema in such areas as people (FOAF), projects (DOAP), communities (SIOC) and geographical places (Geonames) that might also contribute to the overall binding structure. There very well may need to be some additional expansions in other dimensions such as time and events, organizations, products or whatever. I hope that a consensus view on appropriate high-level dimensions emerges soon.

There are a number of individuals presently working on a draft proposal for an open process to create this subject structure. What we are working quickly to draft and share with the broader community is a proposal related to:

  1. A reference umbrella subject binding ontology, with its own high-level subject structure
  2. Lightweight mechanisms for binding subject-specific community ontologies to this structure
  3. Identification of existing data sets for high-level subject extraction
  4. Codification of high-level subject structure extraction techniques
  5. Identification and collation of tools to work with this subject structure, and
  6. A public Web site for related information, collaboration and project coordination.

We believe this to be an exciting and worthwhile endeavor. Prior to the unveiling of our public Web site and project, I encourage any of you with interest in helping to further this cause to contact me directly at mike at mkbergman dot com [8].

[1] I attempted to quantify this problem in a white paper from about two years ago, Untapped Assets: The $3 Trillion Value of U.S. Enterprise Documents, BrightPlanet Corporation, 42 pp., July 20, 2005. Some reasons for how such waste occurs were documented in a four-part series on this AI3 blog, Why Are $800 Billion in Document Assets Wasted Annually?, beginning in October 2005 through parts two, three and four concluding in November 2005.

[2] See Published Subjects: Introduction and Basic Requirements (OASIS Published Subjects Technical Committee Recommendation, 2003-06-24.

[3] See especially Park and Durusau, Avoiding Hobson’s Choice In Choosing An Ontology and Towards Subject-centric Merging of Ontologies.

[4] The concept of “data spaces” has been well-articulated by Kingsley Idehen of OpenLink Software and Frédérick Giasson of Zitgist LLC. A “data space” can be personal, collective or topical, and is a virtual “container” for related information irrespective of storage location, schema or structure.

[5] If the publisher gets it wrong, and users through the reference structure don’t access their desired content, there will be sufficient motivation to correct the mapping.

[6] See Wikipedia’s statistics sections.

[7] Fabian M. Suchanek, Gjergji Kasneci and Gerhard Weikum, “Yago – A Core of Semantic Knowledge” (also in bib or ppt). Presented at the 16th International World Wide Web Conference (WWW 2007) in Banff, Alberta, on May 8-12, 2007. YAGO contains over 900,000 entities (like persons, organizations, cities, etc.) and 6 million facts about these entities, organized under a hierarchical schema. YAGO is available for download (400Mb) and converters are available for XML, RDFS, MySQL, Oracle and Postgres. The YAGO data set may also be queried directly online.

[8] I’d especially like to thank Frédérick Giasson and Bernard Vatant of Mondeca for their reviews of a draft of this posting. Fred was also instrumental in suggesting the ideas behind the figure on the general conceptual model.

Posted by AI3's author, Mike Bergman Posted on May 29, 2007 at 5:14 pm in Adaptive Information, Semantic Web, Structured Web | Comments (3)
The URI link reference to this post is:
The URI to trackback this post is: