Posted: August 8, 2007

Donald Knuth's Road Sign

Structured Data and UMBEL Will Benefit from a Standard Registration Format

One implication of the structured Web is that, as data rapidly proliferates, it becomes harder to find what is relevant. That problem is what stimulated the initiation of the UMBEL (Upper Mapping and Binding Exchange Layer) project.

The original specification for UMBEL recognized the need for a reference set of subject “proxies” to help describe what each data set “was about” as well as the need for a variety of binding mechanisms depending on scope and data structure of the source dataset.

At its most general level, the intent of UMBEL is to provide four components in its ‘core’ ontology [1]:

  1. A set of reference subject “proxies” and properties and relations around them
  2. A means of binding the ontologies, classes or subsets of data within each contributing dataset to the ‘core’ UMBEL ontology and those subject proxies
  3. Characterizing a given dataset via metadata, and
  4. Describing access methods and endpoints for getting at that data.

The first component on subject proxies is largely left to another discussion. The topic of this posting is mostly related to the latter three components of dataset binding and registration mechanisms.

How such road signs might work, the contributions of possible analogs, their differences in providing solutions in and of themselves, and a first-cut outline of the resulting ‘core’ UMBEL ontology are described below.

The Why and How of These Road Signs

‘Road signs’ are simply a shorthand for how to find stuff. The normative case is to have sufficient characterization of datasets such that a central registry can aid their discovery, look-up and productive access and use. Yet registration can be an onerous task, and one not generally easily or willingly undertaken by publishers or providers.

These challenges lead to two important design considerations. First, only minimal characterization should be required for initially registering a dataset. The remaining characteristics should be optional. The incentive over time for such optional fields to be completed is its indication to consumers that fully characterized datasets may be more dependable or authoritative. It is possible, for example, to envision external qualification rules or routines that “score” competing datasets providing similar information based on the completeness of dataset characterization.
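The “score” idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the required and optional field names here are hypothetical (this post does not fix which fields are mandatory), and the score is a simple fraction of optional fields filled in, not a proposed UMBEL rule.

```python
# Hypothetical field names; the post does not specify which fields are required.
REQUIRED_FIELDS = ("title", "access_url")
OPTIONAL_FIELDS = ("publisher", "rights", "subject_proxies",
                   "serialization", "query_language")


def validate_registration(record: dict) -> list:
    """Return the required fields that are missing or empty."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


def completeness_score(record: dict) -> float:
    """Fraction of optional fields that are filled in, from 0.0 to 1.0."""
    filled = sum(1 for name in OPTIONAL_FIELDS if record.get(name))
    return filled / len(OPTIONAL_FIELDS)


# Two registrations of similar datasets: one minimal, one more fully described.
minimal = {"title": "Sample dataset", "access_url": "http://example.org/data"}
fuller = dict(minimal, publisher="Example Org", rights="CC-BY",
              subject_proxies=["Music"])

# Rank competing datasets so the better-characterized one comes first.
ranked = sorted([minimal, fuller], key=completeness_score, reverse=True)
```

Both records pass the minimal registration check; only the ranking distinguishes them, which is the incentive described above.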

Second, any party should be allowed to register and characterize a dataset. There may be motivations by non-publishers to do so, for altruistic or other reasons. However, in the case of disputes over the accuracy of characterization, the owner or publisher should have final say. Another open question is whether competing characterizations or different registrations should be allowed for the same dataset.

These considerations are further complicated by the range of scope, scale, data content and formalism found on the real-world Web.

In the spirit of not re-inventing the wheel, we began a process to discover what other communities have done faced with similar problems. The two closest analogs are, firstly, the library community and its need to describe digital archives and collections and, secondly, the general approaches devoted to Web services, including its dedicated language WSDL (Web Services Description Language).

Digital Collections and Archives

Librarians and information architects have been active for at least the past decade in efforts to describe and relate digital collections to one another. These efforts have been geared to search, general look-up and interlibrary loans and sharing. This community of practice, while embracing a variety of somewhat competing and overlapping schemes, has also (from an outsider’s viewpoint) come up with a general consensus view as to how to describe these archives and collections.

(The library community still tends to use the terminology of ‘metadata’ and descriptors, whereas the ontology and RDF communities tend to speak more of classes, properties and instances. However, the net intent and outcome still appears much the same.)

A common reference point to these schemes is the Dublin Core Collection Application Profile (DCCAP), a specification of how metadata terms from the Dublin Core metadata initiative (DCMI) and other vocabularies can be used to construct a description of a collection in accordance with the DCMI Abstract Model. There is also a more easily read summary of the DCCAP [2]. Though, again, there are differences in terminology, the presentation of this scheme is very much in keeping with the format of a W3C specification (such as for WSDL, see below).

One of the first widely embraced efforts of the community is the Open Archives Initiative – Protocol for Metadata Harvesting (OAI-PMH), begun in the late 1990s. OAI-PMH provides specifications for both data providers and service providers in how to assign and describe collection metadata. The OAI Protocol has become widely adopted by many digital libraries, institutional repositories, and digital archives, with total sources registered numbering into the thousands [3]. These large institutional repositories are also increasingly being indexed by large search engines (such as Google Scholar).
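For readers unfamiliar with the protocol, an OAI-PMH harvest is just an HTTP request carrying a `verb` and a few arguments. The sketch below only builds such request URLs; the repository base URL is a made-up placeholder, while `Identify`, `ListRecords` and the `oai_dc` metadata prefix are standard parts of the protocol.

```python
from urllib.parse import urlencode


def oai_request_url(base_url: str, verb: str, **arguments: str) -> str:
    """Build an OAI-PMH request URL from a verb and its arguments."""
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Hypothetical repository endpoint, used for illustration only.
BASE_URL = "http://repository.example.org/oai"

# Ask the repository to describe itself.
identify_url = oai_request_url(BASE_URL, "Identify")

# Ask for records in unqualified Dublin Core.
list_url = oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
```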

The National Information Standards Organization (NISO) began a MetaSearch Initiative (NISO-MS) that resulted in the development of a collection description schema. Though NISO recently reorganized its content and collection activities, the draft NISO Collection Description Specification [4] remains a readable overall reference for these initiatives. Like related initiatives, the draft also uses the DCCAP format.

A similar effort was initiated in the United Kingdom called the Information Environment Service Registry. IESR was designed to make it easier for other applications to discover and use materials which will help their users’ learning, teaching and research. IESR’s various terms (namespaces, classes and properties) and controlled vocabularies are very helpful to UMBEL.

Another example is the Ockham Digital Library Service Registry (DLSR), which enables service-based digital libraries, funded by the National Science Digital Library (NSDL) initiative, to interoperate. Efforts such as this, in turn, have led to interest in exploiting “light-weight” protocols and open source tools in the community. For example, there is an interesting discussion of tools to implement digital library collections and services in D-Lib Magazine [5].

These efforts, among many across the digital library community including the related National Digital Information Infrastructure and Preservation Program (NDIIPP) at the Library of Congress, represent tremendous efforts to describe digital collections. Clearly, subsets of this learning can be applied directly to UMBEL in relation to registry and dataset metadata.

Almost all of these schemes use XML data serializations. Our investigations to date have not turned up any RDF representations, though they are sure to come. Fortunately, all of the DCCAP-based efforts have an RDF-like design and most properties have URIs and defined namespaces.

Web Services and Bindings

The Web Services Description Language (WSDL) is an XML-based language for describing how to communicate, and therefore interoperate, with Web services. WSDL defines a service as accessible Internet endpoints (or ports), supported operations and related messages. WSDL 1.1 was submitted to the W3C as a Note in March 2001; version 2.0 became a W3C Recommendation in June 2007.

A service is a definition of what kinds of operations can be performed and the messages involved. A port is defined by associating a network address with a reusable binding, with a collection of such ports constituting the service. Messages are abstract descriptions of the data being exchanged, and port types are abstract collections of supported operations. The combination of a specific network protocol with a specific message and data format for a particular port type constitutes a reusable binding.

These definitions are kept abstract from any concrete use or instance, enabling the service definition to be reused and to act as a public interface to the service.
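To make the relationships among these terms concrete, here is a minimal Python sketch of the WSDL 1.1 vocabulary described above. It is a data model for illustration only, not a WSDL parser; the example service, operation and addresses are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Abstract description of the data being exchanged."""
    name: str
    parts: tuple


@dataclass(frozen=True)
class Operation:
    name: str
    input: Message
    output: Message


@dataclass(frozen=True)
class PortType:
    """Abstract collection of supported operations."""
    name: str
    operations: tuple


@dataclass(frozen=True)
class Binding:
    """A protocol and message format applied to one port type; reusable."""
    name: str
    port_type: PortType
    protocol: str
    message_format: str


@dataclass(frozen=True)
class Port:
    """A network address associated with a reusable binding."""
    name: str
    binding: Binding
    address: str


@dataclass(frozen=True)
class Service:
    """A collection of ports."""
    name: str
    ports: tuple


# Invented example: one abstract interface, one binding, two endpoints.
request = Message("LookupRequest", ("term",))
response = Message("LookupResponse", ("records",))
lookup = PortType("LookupPortType", (Operation("lookup", request, response),))
soap_http = Binding("LookupSoapBinding", lookup, "HTTP", "SOAP")
service = Service("LookupService", (
    Port("Primary", soap_http, "http://example.org/lookup"),
    Port("Mirror", soap_http, "http://mirror.example.org/lookup"),
))
```

The point of the abstraction shows in the last lines: the same binding is reused at two addresses without redefining the messages or operations.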

Though there are some important differences from datasets (see next subsection), there has now been sufficient use and exposure of WSDL to inform how to construct sufficiently abstract and reusable interfaces and bindings on the Web. Especially useful is the recent WSDL 2.0 and its draft Web Services Description Language (WSDL) Version 2.0: RDF Mapping specification [6].

WSDL, by its use of XML schema, is not well suited to combining vocabularies and definitions. As a supplement, various discussions on data binding, including from Microsoft (with respect to data source objects — DSOs) and from others such as the W3C on XML data bindings help provide additional perspective [7].

WS-Notification and Topic Maps

Another perspective on this problem comes from efforts surrounding adding topic structure to Web services. This standards effort, called WS-Notification, was an effort of the OASIS group completed in late 2006. According to its published standards [8]:

WS-Notification is a family of related specifications that define a standard Web services approach to notification using a topic-based publish/subscribe pattern. It provides standard message exchanges to be implemented by service providers that wish to participate in Notifications, standard message exchanges for a notification broker service provider (allowing publication of messages from entities that are not themselves service providers), operational requirements expected of service providers and requestors that participate in notifications, and an XML model that describes topics. The WS-Notification family of documents includes three normative specifications: WS-BaseNotification, WS-BrokeredNotification, and WS-Topics.

There are some similarities to the binding mechanisms and topic relations required by UMBEL. WS-BaseNotification bears some resemblance to the binding mechanisms portions, and WS-Topics has some relation to the subject requirements. Again, however, the perspective is limited to Web services and has as its intent the general interchange of topic structures, not the use of a proxy reference set.
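The topic-based publish/subscribe pattern that WS-Notification standardizes can be illustrated with a small in-memory broker. This sketch shows the pattern only; it does not implement the WS-Notification message exchanges, and the class and topic names are invented.

```python
from collections import defaultdict


class NotificationBroker:
    """Minimal in-memory broker: subscribers register interest by topic."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a callable to be invoked for each message on a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver a message to every subscriber of the topic.

        Returns the number of subscribers notified.
        """
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


broker = NotificationBroker()
received = []
broker.subscribe("datasets/music", received.append)

delivered = broker.publish("datasets/music", "new dataset registered")
ignored = broker.publish("datasets/film", "no one is listening")
```

Note the contrast with UMBEL's aim: here topics are arbitrary strings agreed between publisher and subscriber, not entries in a shared proxy reference set.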

Why the Need for a ‘Third Way’?

So, we can see that the library community has made much progress in defining how to characterize a digital collection with regard to its source, ownership, scope and nature, while the Web applications community has made much progress with respect to service definitions and binding mechanisms. Some components have direct applicability to UMBEL.

Yet a lightweight dataset binding and subject reference structure for the Web — namely, UMBEL’s intended objective — has a number of very important differences from these other efforts. These differences prevent direct adoption of any current schema. Some of these summary distinctions as they apply to general Web data are:

  • There are a variety of potential registrants for UMBEL datasets including original owners and developers (the case for the other approaches), but also third-parties and consumers
  • Therefore, there is a broader spectrum of possible knowledge and ability to characterize the datasets, which suggests more flexibility, more optional items and fewer mandatory items
  • Relatedly, those performing the characterizations may be untrained in formal metadata and cataloging techniques or may need to rely on automated and semi-automated characterization methods; this increases the risk of error, imprecision and uncertainty
  • The relevant data may or may not be part of a dedicated resource; it may be fragmentary or embedded
  • The source data resides in an extreme range in possible scale, from a single datum (granted, generally of quite limited value) to the largest of online databases
  • The existence of a huge diversity of data formats and protocols (WSDL has a similar challenge)
  • The need to accommodate a choice of many different data serializations (WSDL has a similar challenge)
  • WSDL’s service and endpoint considerations never explicitly accounted for data federation
  • The desire to get all forms into other formats or canonical forms for mashups, true federation, etc.

These differences suggest that the UMBEL ontology needs to be both broader and less prescriptive than other approaches.

Some Initial Design Considerations

We can thus combine the best transferable elements of existing schemes with the unique requirements and perspective of UMBEL. Initially, we are not adopting specific definitions or portions of possibly contributing schema. Rather, in this first cut, we are only attempting to capture the necessary scope and concepts. Later, after definitions and closer inspection of specific schema, we will refine this organization and relate it to particular namespaces.

The major “superclass” in this organization is the:

  • Profile — this definition is similar to that used for the DCCAPs (indeed, even adopts the “profile” label!), and is closely allied to the idea of Description in WSDL. The Profile represents the broad metadata characteristics of a dataset including ownership, rights and access policies, and other administrative aspects. Generally, a single dataset no matter of what size or scope may have a single profile, though a federated knowledge base from multiple sources may contain multiples of these. This class does not include specific details regarding interface format or subject scope. Note the next classes are themselves subclasses of this profile

The remaining classes are subsidiary to the Profile and inherit and refer to its metadata. The first two subclasses are also largely administrative in nature:

  • Annotator — this set of properties describes the annotator of the dataset metadata. UMBEL is designed to allow third-parties to describe others’ datasets with optional levels of detail. In the case of disputes, the dataset owner characterizations would hold sway, but there also may be circumstances where multiple characterizations are desirable and allowed
  • Rights — these properties describe the use and access rights for the data. Of course, only the owner may set such conditions, but third parties may provide this characterization if the source site spells out these conditions.

The remaining three classes contain the real guts of the data aspects:

  • Interface — the technical details of the data schema and structure within the dataset (or portions thereof) are defined in the Interface properties. (Interface is similar to the idea of the Interface and portions of the Service classes within WSDL, as with similar analogs for data exchange). The endpoints and access methods for accessing the actual data are by definition part of this Interface class. There is little or no consensus regarding how to classify and organize these details, so that it is likely much of the terminology in this area will be actively discussed and revised. See further [9] for one of the more comprehensive surveys
  • Binding — the Binding properties set the mechanisms for relating the dataset or portions thereof to one or more subject proxies. There may be more than one binding for a given profile or different portions of a dataset
  • SubjectProxy — finally, the SubjectProxy class, representing a likely extension to the core UMBEL for the enumeration of the subject proxies, becomes the linkage to the subject coverage of the datasets.

These classes have a hierarchical relationship similar to the following, with multiple Interface, Binding and SubjectProxy mappings allowable for any given Profile:

Profile -------- Annotator
Interface --------- Binding --------- SubjectProxy
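One way to read this organization is as a set of nested records. The Python sketch below is a rough rendering under stated assumptions: it uses composition rather than true subclassing, and apart from `about`, `binding_scope` and `annotator_locator` (which echo properties in the outline below), every attribute name is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SubjectProxy:
    label: str
    synonyms: List[str] = field(default_factory=list)


@dataclass
class Binding:
    about: List[SubjectProxy]          # the subject proxies this binding points to
    binding_scope: str = "dataset"     # hypothetical value; could be a subset


@dataclass
class Interface:
    endpoint: str                      # hypothetical attribute
    serialization: str                 # hypothetical attribute
    bindings: List[Binding] = field(default_factory=list)


@dataclass
class Annotator:
    annotator_locator: str


@dataclass
class Rights:
    statement: str                     # hypothetical attribute


@dataclass
class Profile:
    title: str
    annotator: Optional[Annotator] = None
    rights: Optional[Rights] = None
    interfaces: List[Interface] = field(default_factory=list)

    def subject_labels(self) -> List[str]:
        """All subject proxy labels reachable from this profile, sorted."""
        return sorted({proxy.label
                       for interface in self.interfaces
                       for binding in interface.bindings
                       for proxy in binding.about})


music = SubjectProxy("Music", synonyms=["songs"])
history = SubjectProxy("History")
profile = Profile(
    title="Sample dataset",
    annotator=Annotator("http://example.org/people/annotator"),
    interfaces=[Interface("http://example.org/sparql", "RDF/XML",
                          bindings=[Binding([music]), Binding([history, music])])],
)
```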

Presented below in simple outline form only are these first-proposed classes, and the associated properties and instances of those properties informing the development of the ‘core’ UMBEL ontology. Some definitions of classes are also shown:

Class / subClass [1] / Property (asPredicate [2]) [Note]: Definition

Profile: the broad metadata characteristics of a dataset including ownership, rights and access policies, and other administrative aspects; generally only one per dataset, though there will be multiples in a repository
  Annotator [21]: description of the entity that has provided the current Profile description (may be third parties, but with deference to the owner when there are differences)
    annotatorLocator (hasAnnotatorLocator) [9]
  Binding: the linkage made between the set or subsets of data within the datasets to the actual subject proxy(ies); may be multiples for a given dataset
    about (isAbout) [10,24]: cross-reference to the actual subject proxy IDs; may be multiples
    bindingScope (hasBindingScope) [5,12]
  Interface [13]: the technical characteristics of the dataset that provide the essential information for enabling retrieval and interoperability; analogous to Interface in WSDL
  Rights [21]: various rights and restrictions to accessing, using or reproducing the subject data
  SubjectProxy [22]: a preferred label that acts as a proxy to the topic concept(s) for which the given dataset content is bound; may be multiples, and backed with ‘synset’ synonyms

General table notes are provided under the endnotes [10].

Please note that the specific subject proxies and their defining classes and properties are being handled in a separate document. This outline, as revised, is informing the first N3 version of the ‘core’ UMBEL ontology.

This structure is still quite preliminary. (For example, data type definitions and interface constructs are still in active discussion, without accepted standards.) Comments on this draft UMBEL ‘core’ ontology outline are welcomed either at the UMBEL discussion forum on Google or at the specific outline page on the UMBEL wiki.

Revisiting the ‘Lightweight’ Designation

We can thus see that there is only minimal semantics in the potential linkage between UMBEL datasets.

One way to place this system is through an interesting approach called the Levels of Conceptual Interoperability Model. These levels can be viewed through the following conceptual diagram [11]:


Under this model, UMBEL resides right at the interface between Levels 2 and 3, where syntactic interoperability is achieved but with only limited semantic understanding. In fact, this represents a clear analog to AI3’s discussion of the structured Web, which is very much related to the syntactic level with the negotiation of semantics the next challenge.

This posting is part of a new, occasional series on the Structured Web.

[1] The UMBEL ontology has two parts. The first ‘core’ part is a flat listing, or pool, of concrete subject topics that are the proxy binding points for external data sets. The second ‘unofficial’ part is a reference look-up structure of hierarchical and interlinked subject relationships.

[2] The “Dublin” in the name refers to Dublin, Ohio, where the work originated from an invitational workshop hosted in 1995 by the Online Computer Library Center (OCLC), a library consortium that has its headquarters there. The “Core” refers to the fact that the metadata element set is a basic but expandable “core” list, used in a similar way to the UMBEL ‘core’.

[3] There are several large registries of OAI-compliant repositories: the OAI registry at the University of Illinois at Urbana-Champaign, the Open Archives list of registered OAI repositories, the Celestial OAI registry, Eprint’s Institutional Archives Registry, the European Guide to OAI-PMH compliant repositories in the world, and a worldwide service and registry.

[4] The Standards Committee BB (Task Group 2): Collection & Service Descriptions, NISO Z39.91-200x, Collection Description Specification, November 2005. It also specifies an XML binding for serializing such descriptions for interchange between applications.

[5] Xiaorong Xiang and Eric Lease Morgan, “Exploiting ‘Light-weight’ Protocols and Open Source Tools to Implement Digital Library Collections and Services,” D-Lib Magazine 11(10), October 2005. Also see its MyLibrary reference for examples of facets applied to collections.

[6] Jacek Kopecký, Ed., Web Services Description Language (WSDL) Version 2.0: RDF Mapping, W3C Working Group Note, 26 June 2007.

[7] Data binding from the Microsoft perspective is described in its documentation on data source objects (DSOs). The W3C’s perspective on XML data binding is described, for example, in Paul Downey, ed., XML Schema Patterns for Common Data Structures.

[8] Also, there is an entire corpus related to topic maps. In specific reference to Web services, there is Web Services Topics 1.3 (WS-Topics), OASIS Standard, 1 October 2006, which is used in conjunction with WS-Notification. See also Web Services Base Notification 1.3 (WS-BaseNotification), OASIS Standard, 1 October 2006. PDFs of these documents are also available.

[9] For an initial introduction with a focus on Xcerpt, see François Bry, Tim Furche, and Benedikt Linse, “Let’s Mix It: Versatile Access to Web Data in Xcerpt,” in Proceedings of 3rd Workshop on Information Integration on the Web (IIWeb 2006), Edinburgh, Scotland, 22nd May 2006; also as REWERSE-RP-2006-034. For a more detailed treatment, see T. Furche, F. Bry, S. Schaffert, R. Orsini, I. Horrocks, M. Krauss, and O. Bolzer, Survey over Existing Query and Transformation Languages, Deliverable I4-D1a Revision 2.0, REWERSE, 225 pp., April 2006.

[10] Here are the general table notes:

[1] SubClasses have the advantage of inheritance and shared metadata; see main text for full subClass path
[2] hasPredicate is actually my preferred format; thoughts?
[3] Standard ISO languages
[4] extendable base listing including NISO, IESR, DLSR, DCMI, etc.; need completion
[5] uses the idea of an extended base class per XML Schema; the enumerated listings thus only need be partially complete; see below for most listings
[6] number of records or TBD metric?
[7] possibly unnecessary; can not see enumeration of profile Types
[8] related to idea of Fresnel or Zitgist “preferred” viewing format or XSLT-type stylesheet
[9] should there be other types than FOAF (what of other formal listings or organizations v. individuals?)?
[10] not sure how to do this; need a x-ref between two class categories (e.g., Binding <-> Interface, Binding <-> SubjectProxy)
[11] are there patterns for bindingTypes or a likely enumerated listing?


[12] need an enumerated list (?) going from individual annotation / metadata (a la RDFa) to complete dataset


[13] perhaps not best name; related to Interface and Services in WSDL, could also be called Construct, Composition, others
[14] see possible dataFormalisms below; needs completion; could be named differently (schema, format, etc.)

Other (unspecified)
Other Ontology
OWL Full
OWL Lite
Topic Map

[15] see possible endpointTypes below

other Query formats ???

[16] patterns are fairly prominent in WSDL and XML Schema; applicable here?
[17] need to discuss
[18] see possible queryLanguages below; needs completion

IR (standard text search)
[19] see possible serializations below; needs completion


[20] GRDDL, RDFizers, and various converters/translators; likely needs an expandable, enumerated list
[21] unlike the other subClasses, this is closely aligned with the standard Profile metadata
[22] likely a separate namespace (e.g., ‘umbels’) that will contain additional information such as synsets, etc. See text.
[23] SKOS concept
[24] need to check on overlap/replacement/use with SKOS subjectIndicator property
[25] should names or IDs be used for subjectProxys? (IDs have the advantage of changing labels and use in other languages)
[26] seems similar or identical to SKOS scopeNote
[27] see possible annotatorTypes below; needs completion


[28] similar to dc:language, but the language of the resource (its metadata characterization) must be kept separate from the language of the actual subject proxies

[11] A. Tolk, S.Y. Diallo, C.D. Turnitsa and L.S. Winters, “Composable M&S Web Services for Net-centric Applications,” Journal for Defense Modeling & Simulation (JDMS), Volume 3 Number 1, pp. 27-44, January 2006.

Posted: July 24, 2007

Huynh Adds to His Winning Series of Lightweight Structured Data Tools

David Huynh, a Ph.D. grad student developer par excellence from MIT’s Simile program, has just announced the beta availability of Potluck. Potluck allows casual users to mash up data on the Web using direct manipulation and simultaneous editing techniques, generally (but not exclusively!) based on Exhibit-powered pages.

Besides Potluck and Exhibit, David has also been the lead developer on such innovative Simile efforts as Piggy Bank, Timeline, Ajax, Babel, and Sifter, as well as a contributor to Longwell and Solvent. Each merits a look. Those familiar with these other projects will notice David’s distinct interface style in Potluck.

Taking Your First Bites

There is a helpful 6-min movie on Potluck that gives a basic overview of use and operation. I recommend you start here. Those who want more details can also read the Potluck paper in PDF, just accepted for presentation at ISWC 2007. And, after playing with the online demo, you can also download the beta source code directly from the Simile site.

Please note that Firefox is the browser of choice for this beta; Internet Explorer support is limited.

To invoke Potluck, you simply go to the demo page, enter two or more appropriate source URLs for mashup, and press Mix Data:

Potluck Entry Screen

(You can also get to the movie from this page.)

Once the datasets are loaded, all fields from the respective sources are rendered as field tags. To combine different fields from different datasets, the respective field tags (color coded by dataset) to be matched are simply dragged to a new column. Differences in field value formats between datasets can be edited with an innovative approach to simultaneous group editing (see below). Once fields are aligned, they then may be assigned as browsing facets. The last step in working with the Potluck mashup is choosing either tabular or map views for the results display.

Potluck is designed to mash up existing Exhibit displays (JSON format), and is therefore lightweight in design. (Generally, Exhibit should be limited to about 500 data records or so per set.)

However, with the addition of the appropriate type name when specifying one of the sources to mash up, you can also use spreadsheet (xls), BibTeX, N3 or RDF/XML formats. The demo page contains a few sample data links. Additional sample data files for different mime types are (note entry using a space with type designator at end):

Besides the standard tabular display, you can also map results. For example, use the BibTeX example above and drop the “address” field into the first drop target area. Then, choose Map at the top of the display to get a mapping of conference locations.

In my own case, I mashed up this source and the xls sample on presidents, and then plotted out location in the US:

Example Potluck Map

Given the capabilities in some of the other Simile tool sets, incorporating timelines or other views should be relatively straightforward.

Pragmatic Lessons and Cautions with Semantic Mashups

Different datasets name similar or identical things differently and characterize their data differently. You can’t combine data from different datasets without resolving these differences. These various heterogeneities — which by some counts can be 40 or so classes of possible differences — were tabulated in one of my recent structured Web posts.

There has been considerable discussion in recent days on various ontology and semantic Web mailing lists about how some practices may solve or not questions of semantic matching. Some express sentiments that proper use of URIs, use of similar namespaces and use of some predicates like owl:sameAs may largely resolve these matters.

However, discussion in David’s ISWC 2007 paper and use of the Potluck demo readily show the pragmatic issues in such matches. Section 2 in the paper presents a readable scenario for real-world challenges in how a historian without programming skills would go about matching and merging data. Despite best practices, and even if all are pursued, actually “meshing” data together from different sources requires judgment and reconciliation. One of the great values of Potluck is as a heuristic and learning tool for making prominent these real-world semantic heterogeneities.

The complementary value of Potluck is its innovative interface design for actually doing such meshing. Potluck is a case argument that pragmatic solutions and designs only come about by just “doing it.”

Easy, Simultaneous Editing

(Note: Though a diagram illustrates some points below, it is no substitute for using Potluck yourself.)

Potluck uses a simple drag-and-drop model for matching fields from different datasets. In the left-hand oval in the diagram below, the user clicks on a field name in a record, drags it to a column, and then repeats that process for matching fields in the records of a different dataset. In the instance below, we are matching the field names of “address” and “birth-place”, which then also get color coded by dataset:

Potluck's Simultaneous Group Editing

This process can be repeated for multiple field matches. The merged fields themselves can be subsequently dragged-and-dropped to new columns for renaming or still further merging.
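Stripped of its interface, the field matching described above amounts to renaming differently named fields to one shared name. Here is a rough Python sketch of that idea; it is not Potluck's code (Potluck is JavaScript), and apart from the “address” and “birth-place” field names the sample records and the merged field name are invented.

```python
def merge_fields(records, merged_name, source_fields):
    """Copy each record, moving the first matching source field to merged_name."""
    merged = []
    for record in records:
        row = dict(record)              # leave the input records untouched
        for source in source_fields:
            if source in row:
                row[merged_name] = row.pop(source)
                break
        merged.append(row)
    return merged


# Invented sample records from two differently structured datasets.
presidents = [{"label": "A president", "birth-place": "Boston, MA, USA"}]
papers = [{"label": "A conference paper", "address": "Busan, Korea"}]

combined = merge_fields(presidents + papers, "location",
                        ["address", "birth-place"])
```

After the merge both records carry a `location` field, but the values still differ in format (one ends in a country, one does not), which is exactly the problem the simultaneous editing step addresses.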

The core innovation at the heart of Potluck is what happens next. By clicking on Edit for any record in a merged field, the dialog shown above pops up. This dialog supports simultaneous group editing based on LAPIS, another MIT tool for editing text with lightweight structure developed by Rob Miller and team.

As implemented in JavaScript in Potluck, LAPIS first groups data items by similar patterned structure; this initial grouping is what determines the various columns in the above display. Then, when the user highlights any pattern in a column, these are repeated (see same cursors and shading in the right-hand oval) for all entries in the column. They can then be deleted (for pruning, in this case removing ‘USA’), or cut-and-pasted (such as for changing first- and last-name order) for all items in a column. (Single item editing is obviously also an option.)

The first grouping mostly ensures that data formatted differently in different datasets are displayed in their own column. One data form is used for the merged field, and all other columns are group edited to conform. The actual patterns are based on runs of digits, letters, white spaces, or individual punctuation marks and symbols, which are then “greedy” aligned for first the column grouping and then for cursor alignment within columns on highlighted patterns.
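The grouping step can be approximated in a few lines. The Python sketch below is my own simplification, not the LAPIS or Potluck implementation: it tokenizes each value into runs of digits, letters and white space plus single punctuation marks, groups values that share the same token pattern, and applies one token-level deletion to every member of a group. It omits the “greedy” alignment entirely.

```python
import re
from collections import defaultdict

# Runs of digits (D), letters (L), white space (S), or one other character (P).
TOKEN = re.compile(r"(?P<D>\d+)|(?P<L>[^\W\d_]+)|(?P<S>\s+)|(?P<P>.)", re.DOTALL)


def tokens(value):
    """Split a value into its pattern tokens."""
    return [match.group() for match in TOKEN.finditer(value)]


def signature(value):
    """Pattern of a value: class letters for runs, literal punctuation kept."""
    return "".join(match.group() if match.lastgroup == "P" else match.lastgroup
                   for match in TOKEN.finditer(value))


def group_by_pattern(values):
    """Group values whose token patterns are identical."""
    groups = defaultdict(list)
    for value in values:
        groups[signature(value)].append(value)
    return dict(groups)


def delete_tokens(value, start, end):
    """Remove tokens[start:end] from a value and rejoin the rest."""
    parts = tokens(value)
    return "".join(parts[:start] + parts[end:])


values = ["Cambridge, MA, USA", "Boston, MA, USA", "Austin, TX"]
groups = group_by_pattern(values)

# One edit applied to the whole group: prune the trailing ", USA" tokens.
pruned = [delete_tokens(v, 4, 7) for v in groups["L,SL,SL"]]
```

Because every value in a group has the same token layout, an edit expressed in token positions can be replayed on all of them, which is the essence of the simultaneous editing described above.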

The net result is very fast and efficient bulk editing. This approach points the way to more complicated pattern matches and other substitution possibilities (such as unit changes or date and time formats).

Rough Spots and A Hope

I was tempted to award Potluck one of AI3’s Jewels and Doubloons Awards, but the tool is still premature, with rough spots and gaps. For example, IE and other browser support needs to be improved, and it would be helpful to be able to delete a record from inclusion in the mashup. (Sometimes only after combining is it clear some records don’t belong together.)

One big issue is that the system does not yet work well with all external sites. For example, my own Sweet Tools Exhibit refused to load and the one from the European Space Agency’s Advanced Concept Team caused JavaScript errors.

Another big issue is that whole classes of functionality, such as writing out combined results or more data view options, are missing.

Of course, this code is not claimed to be commercial grade. What is most important is its pathbreaking approach to semantic mashups (actually, what some others such as Jonathan Lathem have called ‘smashups’) and interfaces and approaches to group editing techniques.

I hope that others pick up on this tool in earnest. David Huynh is himself getting close to completing his degree and may not have much time in the foreseeable future to continue Potluck development. Besides Potluck’s potential to evolve into a real production-grade utility, I think its potential to act as a learning test bed for new UI approaches and techniques for resolving semantic heterogeneities is even greater.

Posted:July 22, 2007

McCruffy Pattern Blocks

Language is Essential to Communication

I recently began a series on the structured Web and its role in the continued evolution of the Internet. This next installment in the series probes in greater depth the question of What is structure? in reference to data and Web expressions, with an emphasis on terminology and definitions.

This post ties in with a new best-practices guide published by Chris Bizer, Richard Cyganiak, and Tom Heath — called the Linked Data Publishing Tutorial — that provides definitions and viewpoints from the perspective of the use of RDF (Resource Description Framework) and W3C practices. My initial post in this series and their tutorial occasioned Kingsley Idehen to post his own Linked Data and the Web Information BUS entry, adding the valuable perspective of practices and terminology going back to the early 1990s in object and relational database systems and standards such as ODBC.

All of these efforts share a desire to craft practices, language and terminology to help promote the availability and interoperability of useful data on the Web.

A challenge for the semantic Web community is to craft language that is clear and understandable to the lay public and Web developers, something which it has often done poorly.

Some problematic terms include information resource, non-information resource, dereferencing, bnode, content negotiation, representation, URL re-writing, and others.

This piece only tackles ‘non-information resource’ head on. Discussion of other problematic terms awaits another day.

These posts caused Kingsley and me to engage in a prolonged discussion about definitions and terms. I acted as the unofficial scribe, and I attempt to capture and argue that discussion more generally below. If you like the ideas below, you may credit both of us; if you don’t, ascribe any errors or omissions to me alone [1].

The Basic Framework and Argument

The basic observation related to the structured Web is that it is a transition phase from the initial document-centric Web to the eventual semantic Web. In this transitional phase, the Web is becoming much more data-centric. The idea of ‘linked data’ also is a component of this transition, but is more precise in meaning because by definition the data must be expressed as RDF in order to aid interoperability.

The challenge is to convert existing Web pages and data into the structured Web with every resource accessible via an unambiguous URI. Insofar as this conversion also occurs to RDF, it will promote linked data interoperability.

Transitions of such a profound nature, even if short of the full vision of the semantic Web, create the need for new language and terminology to aid understanding and communication. Sometimes, as well, longstanding terms and practices may need to be refined or challenged. In any event, notions of simplistic versioning such as ‘Web 3.0’ add little to understanding and communication.

Let’s Begin with Data Structure

Independent of the Internet or the Web, let’s begin our discussion about the nature of structure in its application to data [2,3]. Peter Wood, a professor of computer science at Birkbeck College at the University of London, provides succinct definitions of the “structure” of various types of data [4]:

  • Structured Data — in this form, data is organized in semantic chunks or entities, with similar entities grouped together in relations or classes. Entities in the same group have the same descriptions (or attributes), while descriptions for all entities in a group (or schema): a) have the same defined format; b) have a predefined length; c) are all present; and d) follow the same order. Structured data are what is normally associated with conventional databases such as relational transactional ones where information is organized into rows and columns within tables. Spreadsheets are another example. Nearly all understood database management systems (DBMS) are designed for structured data
  • Unstructured Data — in this form, data can be of any type and do not necessarily follow any format or sequence, do not follow any rules, are not predictable, and can generally be described as “free form.” Examples of unstructured data include text, images, video or sound (the latter two also known as “streaming media”). Generally, “search engines” are used for retrieval of unstructured data via querying on keywords or tokens that are indexed at time of the data ingest, and
  • Semi-structured Data — the idea of semi-structured data predates XML but not HTML (with the actual genesis better associated with SGML, see below). Semi-structured data are intermediate between the two forms above wherein “tags” or “structure” may be associated or embedded within unstructured data. Semi-structured data are organized in semantic entities, similar entities are grouped together, entities in the same group may not have same attributes, the order of attributes is not necessarily important, not all attributes may be required, and the size or type of some attributes in a group may differ. To be organized and searched, semi-structured data should be provided electronically from database systems, file systems (e.g., bibliographic data, Web data) or via data exchange formats (e.g., EDI, scientific data, XML).
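The distinction drawn in the definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: records are modeled as Python dicts, the helper names `shape` and `classify` are hypothetical, and the check covers only “same attributes, same order, same type” (it does not test predefined lengths).

```python
def shape(record):
    """Describe a record by its attribute names, their order, and value types."""
    return [(key, type(value).__name__) for key, value in record.items()]

def classify(records):
    """Label a collection of dict records as 'structured' or 'semi-structured'.

    'structured' here means every record has the same attributes, in the
    same order, with the same value types; anything else is treated as
    'semi-structured'.
    """
    shapes = [shape(r) for r in records]
    if all(s == shapes[0] for s in shapes):
        return "structured"
    return "semi-structured"
```

For example, rows from a single relational table would all share one shape, whereas a set of records in which some entries omit an attribute, reorder attributes, or change an attribute's type would be reported as semi-structured.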

We can thus view data structure as residing on a spectrum (also shown with “typical” storage and indexing frameworks on the top line):

For the past several decades, structured data has typically been managed by database management systems (DBMSs) or spreadsheets, while unstructured data has been handled by text indexing systems such as those used by search engines, or left unindexed in file systems or repositories.

Semi-structured data models are sometimes called “self-describing” (or schema-less) [5]. The first known definition of semi-structured data dates to 1993 [6] by Peter Schäuble: “We call a data collection semistructured if there exists a database scheme which specifies both normalized attributes (e.g., dates or employee numbers) and non-normalized attributes (e.g., full text or images).” More current usage (see the Wood definition above) also includes the notion of labeled graphs or trees with the data stored at the leaves, with the schema information contained in the edge labels of the graph. Semi-structured representations also lend themselves well to data exchange or the integration of heterogeneous data sources.
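The “labeled tree with data at the leaves” view can be made concrete with a short Python sketch. The nested record and the function name `leaf_paths` are hypothetical; the only assumption taken from the text is that schema information lives in the edge labels while the values sit at the leaves.

```python
def leaf_paths(node, path=()):
    """Yield (edge-label path, leaf value) pairs from a nested dict/list tree."""
    if isinstance(node, dict):
        for label, child in node.items():
            # Each dict key is an edge label leading to a child node.
            yield from leaf_paths(child, path + (label,))
    elif isinstance(node, list):
        for child in node:
            # List items repeat the same edge label.
            yield from leaf_paths(child, path)
    else:
        # Anything else is a leaf holding data.
        yield path, node

# Hypothetical record mixing normalized and free-text attributes.
record = {
    "employee": {
        "number": 1042,
        "hired": "1993-06-01",
        "notes": ["free text", "more free text"],
    }
}
paths = list(leaf_paths(record))
```

Walking the tree this way recovers the implicit schema (the distinct label paths) without any separate schema declaration, which is the sense in which such data is called “self-describing.”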

HTML tags within Web documents are a prime example of semi-structured data, as are text-based data transfer protocols or serializations [7]. Semi-structured data is also a natural intermediate form when “structure” is desired to be extracted from standard text through techniques generally called ‘information extraction’ (IE). For example, here is possible structure shown in yellow as might be extracted from a death notice or obituary [8]:

John A. Smith of Salem, MA died Friday at Deaconess Medical Center in Boston after a bout with cancer. He was 67. Born in Revere, he was raised and educated in Salem, MA. He was a member of St. Mary’s Church in Salem, MA, and is survived by his wife, Jane N., and two children, John A., Jr., and Lily C., both of Winchester, MA. A memorial service will be held at 10:00 AM at St. Mary’s Church in Salem.

This notice contains a great deal of information including names, places, dates and relationships, which once extracted, can be separately indexed or manipulated. Virtually all text-based documents can have similar structure extracted.
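To give a feel for what such extraction involves, here is a deliberately naive Python sketch that pulls a few entities from the notice above with hand-written regular expressions. It is not how production IE engines work; the patterns and the field names (`name`, `age`, `places`, `service_time`) are assumptions chosen for this one example, and co-references such as “he” are not resolved.

```python
import re

NOTICE = (
    "John A. Smith of Salem, MA died Friday at Deaconess Medical Center in "
    "Boston after a bout with cancer. He was 67. Born in Revere, he was raised "
    "and educated in Salem, MA. He was a member of St. Mary’s Church in Salem, "
    "MA, and is survived by his wife, Jane N., and two children, John A., Jr., "
    "and Lily C., both of Winchester, MA. A memorial service will be held at "
    "10:00 AM at St. Mary’s Church in Salem."
)

def extract(text):
    """Extract a handful of entities from a death notice using regexes."""
    # Leading "First M. Last of ..." is taken as the deceased's name.
    name = re.match(r"([A-Z][a-z]+(?: [A-Z]\.)? [A-Z][a-z]+) of ", text)
    # "was 67." style age statement.
    age = re.search(r"\bwas (\d{1,3})\.", text)
    # "City, ST" style places (two-letter state code).
    places = re.findall(r"\b[A-Z][a-z]+, [A-Z]{2}\b", text)
    # Clock time such as "10:00 AM".
    time = re.search(r"\b\d{1,2}:\d{2} [AP]M\b", text)
    return {
        "name": name.group(1) if name else None,
        "age": int(age.group(1)) if age else None,
        "places": list(dict.fromkeys(places)),  # de-duplicate, keep order
        "service_time": time.group(0) if time else None,
    }

entities = extract(NOTICE)
```

Even this toy version misses places given without a state (Boston, Revere) and all of the survivors, which hints at why robust extraction is difficult.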

The Web has been a prime source of growth of semi-structured data, most often through text-based data serializations [9] and mark-up languages [10].

Depending on point of view or definition, RDF can either be called semi-structured or structured data. It resides squarely at the transition point between these two categories on this structural continuum.

A Variety of Web ‘Resources’

A couple of Web concepts often cause confusion and difficulty for some users: 1) Uniform Resource Locators (URLs) v Uniform Resource Identifiers (URIs); and 2) the concept of “resources” themselves. As it happens, there was a pretty accessible discussion on URIs that was recently posted; I recommend that discussion on that topic. Instead, we’ll focus here on “resources.”

The concept of a resource is basic to the Web’s architecture and is used in the definition of its fundamental elements, including obviously URL and URI above. The essence of the semantic Web parlance, as well, is built around the notion of abstract resources and their semantic properties. The data model and languages of the Resource Description Framework (RDF) squarely revolve around this “resource” concept.

The first explicit definition of resource related to the Web is found in RFC 2396 in August 1998:

A resource can be anything that has identity. Familiar examples include an electronic document, an image, a service (e.g., “today’s weather report for Los Angeles”), and a collection of other resources.

However, in the context of the Internet, not all resources so defined (such as a person or a company) can be retrieved, while electronic resources like an image or Web page can. Thus, a first challenge arises in the concept of resource and its locational address: some are actual and physical, others are abstract or referential.

According to Wikipedia’s discussion of resources:

The concept of resource has evolved during the Web’s history, from the early notion of a static addressable document or file, to a more generic and abstract definition, now encompassing every thing or entity that can be identified, named, addressed or handled, in any way whatsoever, in the Web at large, or in any networked information system. The declarative aspects of a resource (identification and naming) and its functional aspects (addressing and technical handling) were not clearly distinct in the early specifications of the Web . . . .

The need to somehow make this distinction between actual or physical resources vs. abstract or referential resources led to much discussion and controversy sometimes known as the httpRange-14 issue [11], resolved by the W3C’s Technical Architecture Group (TAG) in 2005. The TAG defined a distinction between two resource types:

  • information resource — this category pertains to the original sense of resources on the Web, such as files, documents or other resources to which a URL can be assigned (including images and non-document media). Though it has been proposed that such resources be designated as “slash” URIs, such as http:///, this is not enforceable and some resources in the next category do not adhere. A “slash” URI, however, still is “better practice” (if not “best practice”) for such traditional resources. Note that this category of information resource is very much in keeping with the nature of the early, document-centric Web
  • non-information resource — this category is the new one to deal with all of the “abstract” cases such as classes, etc.; this category is especially important to the data-centric Web. As with the other resource category, it was proposed to use the “pound sign” fragment identifiers used by anchor tags, such as http:///, for example, to signal this different resource type, but, again, it is not enforceable and at most could be “better practice.”

A successful HTTP request for an information resource results in a 200 message (“OK”, followed by transfer) from the Web server; if the request is for a resource that the Web server recognizes as existing but of the wrong type requested, the publisher can use the 303 redirect response to provide the correct URI [12].
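The publisher's side of this exchange can be sketched as follows. This is a minimal illustration, not a real Web server: the paths `/doc/salem` and `/id/salem`, the two lookup tables, and the function `respond` are all hypothetical.

```python
# Hypothetical URIs: a document, and a non-document "thing" that the
# document describes.
DOCUMENTS = {"/doc/salem": "<html>About Salem, MA</html>"}
DESCRIBED_BY = {"/id/salem": "/doc/salem"}

def respond(path):
    """Return (status, headers, body) for a GET request on path."""
    if path in DOCUMENTS:
        # An information resource: send it with 200 OK.
        return 200, {"Content-Type": "text/html"}, DOCUMENTS[path]
    if path in DESCRIBED_BY:
        # Recognized, but not itself a document: redirect with 303 See Other.
        return 303, {"Location": DESCRIBED_BY[path]}, ""
    # Unknown to this server.
    return 404, {}, ""
```

A client requesting `/id/salem` would receive the 303 and then issue a second GET against the `Location` value to obtain the descriptive document.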

These resource distinctions are very, very unfortunate in their labeling, if not on more fundamental grounds. It is nonsensical to call one category “information” and the other not, when everything is informational. Moreover, the distinctions bring absolutely no clarity. (Important note: actually, the provenance of the term ‘non-information resource’ appears to be quite recent, as well as wrong and unfortunate [13]).

However, getting standards bodies to change labels is a long and uncertain task. The approach taken below is to stick with the ‘information resource’ term, but to provide the alias of ‘structured data resource’ in place of ‘non-information resource’ and to add some additional sub-category distinctions [14].

The Relation Between Resources and Structure

Even though a Web ‘resource’ has an address scheme and other requirements, these are details and specifics that can mask the fundamental purpose of a resource to act as a “container” for encapsulating information of some sort. This encapsulation is what enables access to its “payload” information (Kingsley’s term) on the Web or the broader Internet via standard protocols (HTTP and TCP/IP). The mechanisms of the encapsulation constitute the details and specifics.

Inherent to the transition from the document Web to the structured Web is the increased importance of that most confusing category: non-information resource. That is because, like fragment identifiers, we are talking about objects more granular (subsidiary) to document-level resources and it is because we are now referencing structure that includes such “abstract” notions as classes, properties, types, namespaces, ontologies, schema, etc.

One paradox in all of this is that the very category of resource designed to deal with these data issues is itself called by many a ‘non-information resource’. This term is nonsense [13]. We use instead ‘structured data resource.’

There is much that needs to be cleaned up and clarified over time regarding uses and nomenclature regarding resources. However, from the standpoint of the structured Web, we can probably for the time being concentrate on those items shown in bold in this table:

Information Resource (current standard Web term): unstructured and semi-structured data
  • Document Resource: text and markup within standard Web pages
  • ‘Other’ Resources: non-text resources with a URL (images, streaming media, non-text indexable files)
Non-information Resource (aka Structured Data Resource) [see text; 13] (current standard Web term; non-sensical): semi-structured and structured data
  • Structured Data: all non-RDF data-oriented resources, including non-RDF namespaces, etc.
  • Linked Data: all RDF

Note that the two main resource categories used in practice are maintained. The ‘information resource’ category retains its traditional understanding. From the standpoint of the structured Web, the document resources are the unstructured and semi-structured data content from which information extraction (IE) techniques and software can extract the eventual structure.

The category of document resource likely represents the majority of potentially useful structural content on the Web and is most often overlooked in discussions of linked data or the semantic Web. This content, if subjected to IE and therefore structure creation, then becomes a URI resource better handled as a ‘structured data resource.’

‘Structured data resources’ (that is, the poorly labeled ‘non-information resources’) are the building blocks for the structured Web. In all cases, these resources are either semi-structured or structured data. There are two sub-categories of resources in this category, differentiated by whether the structured data is expressed in RDF (or RDF-based languages) or not. All RDF variants are called ‘linked data’; all other forms are termed ‘structured data.’

For most general purposes, putting aside nuances and subtle technicalities, the best shorthand for thinking about these resource distinctions is simply Documents v. Data.

Thus, we can see the path to the structured Web taking a number of different branches.

The first branch, and the one necessary for the largest portion of content, is to use a combination of IE techniques to extract entity information from unstructured text or to use structure extraction on the semi-structure of the document [15] to create the structured data resource for that document. Of all variants, this is the longest path, the one least developed, but one with potentially great value.

The second branch is to publish the structured data directly as a resource or to provide access to it through a Web service or API. This is the current basis for most of the structured data resources presently available on the Web. (It is also the outcome of IE for the first branch.)

The third branch is really just a complete variant of the other two: ensuring that the structured data resource is available as interoperable RDF linked data. There are two ways to proceed down this branch. One way is for the publisher to create and post the resource directly in a form of RDF. (Though the actual data can be serialized in a variety of ways such as RDF/XML, N3, Atom or Turtle, conversion between these forms is relatively straightforward.) The other way, less direct, is for the publisher or a third party to convert non-RDF structured data into RDF with the rich and growing list of available ‘RDFizers’ [16].
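The second, less direct way can be pictured with a toy conversion. The Python sketch below is not any of the actual RDFizers; it is a minimal stand-in that turns a flat record into N-Triples lines. The function name `to_ntriples`, the `http://example.org/...` URIs, and the choice to emit every value as a plain string literal are all assumptions made for illustration, and attribute names are assumed to be safe to append to a URI.

```python
def to_ntriples(subject_uri, record, vocab="http://example.org/vocab/"):
    """Serialize a flat dict as N-Triples, one triple per attribute."""
    lines = []
    for key, value in record.items():
        # Escape backslashes, quotes, and newlines inside the literal.
        literal = (
            str(value)
            .replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )
        lines.append(f'<{subject_uri}> <{vocab}{key}> "{literal}" .')
    return "\n".join(lines)

triples = to_ntriples(
    "http://example.org/id/salem",
    {"name": "Salem", "state": "MA"},
)
```

Real converters additionally have to map attribute names onto shared vocabularies and mint stable URIs, which is where most of the actual work lies.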

The Web in Transition

This material, plus the earlier introduction to the structured Web, can now be brought together as a picture of the Web in transition. While there are no real beginning and end points, there is a steady progression from a document-centric Web to one that is data-centric, including the mediation of semantics:

Transition in Web Structure

Document Web
  • Document-centric
  • Document resources
  • Unstructured data and semi-structured data
  • HTML
  • URL-centric
  • circa 1993
Structured Web
  • Data-centric
  • Structured data
  • Semi-structured data and structured data
  • XML, JSON, RDF, etc
  • URI-centric
  • circa 2003
Structured Web (Linked Data)
  • Data-centric
  • Linked data
  • Semi-structured data and structured data
  • RDF, RDF-S
  • URI-centric
  • circa 2006
Semantic Web
  • Data-centric
  • Linked data
  • Semi-structured data and structured data
  • URI-centric
  • circa ???

The basic argument of this series is that we are in the midst of a transition phase — the structured Web — that marks the beginning of the dominance of data on the Web. In its broadest definition, the structured Web has many different data forms and serializations. A subset of the structured Web — namely, linked data — is a direct precursor to the semantic Web with its emphasis on RDF and data interoperability and services.

Another argument of this series is — despite the promise of linked data — that structured data resources in many forms will co-exist and provide alternatives. This diversity is natural. For RDF and linked data advocates, tools are now largely in place to convert these diverse forms. Though the ability to see large-scale availability of RDF data appears clear, the longer-term resolution of mediating heterogeneous semantics remains cloudy.

Brief Glossary

To re-cap, and to aid language and understanding, here is a brief glossary of the key terms used in this discussion:

  • dereferencing — the act of locating and transmitting structured data, in a requested format (representation), exposed by a URL
  • document resource — a Web resource designated by a URL that contains unstructured and semi-structured data
  • information extraction — any of a variety of techniques for extracting structure and entities from content
  • information resource — any Web resource that can be retrieved via a URL
  • linked data — a structured data resource in RDF that can be obtained via a URI
  • non-information resource — this is “any resource” that is not an information resource; preference is to deprecate its use and use structured data resource in its stead
  • semantic Web — the Web of data with sufficient meaning associated with that data such that a computer program may learn enough to meaningfully process it
  • semi-structured data — data that includes semantic entities that may have attributes or groups, but which are not all required or presented in a set order or in a set size or type; may be embedded or interspersed in unstructured data
  • structured data — data organized into semantic chunks or entities, with similar entities grouped together in relations or classes, and presented in a patterned manner
  • structured data resource — a Web resource containing semi-structured or structured data that can be obtained via a URI
  • structured Web — the data-centric Web of structured data resources and structured data in various forms; formally defined as, object-level data within Internet documents and databases that can be extracted, converted from available forms, represented in standard ways, shared, re-purposed, combined, viewed, analyzed and qualified without respect to originating form or provenance
  • unstructured data — data can be of any type and does not necessarily follow any format or sequence or rules, can generally be described as “free form;” includes text, images, video or sound
  • URI — uniform resource identifier
  • URL — uniform resource locator
This posting is part of a new, occasional series on the Structured Web.

[1] You will note a heavy emphasis on Wikipedia definitions in keeping with Web usage.

[2] Of course, the word ‘structure’ has a broad range of meanings; we are only concerned here about data structure and its specific applicability to Web-related information.

[3] Much of the discussion in this sub-section is derived from an earlier AI3 posting, Semi-structured Data: Happy 10th Birthday!, from November 2005; some of the original information is a bit dated. It is also aided by Kingsley’s Structured Data v. Unstructured Data posting from June 2006. That posting notes the frequent confusion and ambiguity between the terms “structured data” and “unstructured data,” and the importance when speaking of structure to keep separate the structure of the data itself (the focus herein), the structure of the container that hosts the data, and the structure of the access method used to access the data.

[4] Peter Wood, School of Computer Science and Information Systems, Birkbeck College, the University of London. See

[5] The earliest known recorded mention of “semi-structured data” occurred in 1992 from N. J. Belkin and Croft, W. B., “Information filtering and information retrieval: two sides of the same coin?,” in Communications of the ACM: Special Issue on Information Filtering, vol. 35(12), pp. 29 – 38, with the first known definition from [6]. The next two mentions were in 1995 from D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman and J. Widom, “Querying semistructured heterogeneous information,” presented at Deductive and Object-Oriented Databases (DOOD ’95), LNCS, No. 1013, pp. 319-344, Springer, and M. Tresch, N. Palmer, and A. Luniewski, “Type classification of semi-structured data,” in Proceedings of the International Conference on Very Large Data Bases (VLDB). However, the real popularization of the term “semi-structured data” occurred through the seminal 1997 papers from S. Abiteboul, “Querying semi-structured data,” in International Conference on Data Base Theory (ICDT), pp. 1-18, Delphi, Greece, 1997 ( and P. Buneman, “Semistructured data,” in ACM Symposium on Principles of Database Systems (PODS), pp. 117-121, Tucson, Arizona, May 1997 ( Of course, semi-structured data had existed prior to these early references, only it had not been named as such.

[6] P. Schäuble, “SPIDER: a multiuser information retrieval system for semistructured and dynamic data,” in Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318 – 327, 1993.

[7] Such protocols first received serious computer science study in the late 1970s and early 1980s. In the financial realm, one early standard was electronic data interchange (EDI). In science, there were literally tens of exchange forms proposed with varying degrees of acceptance, notably abstract syntax notation (ASN.1), TeX (a typesetting system created by Donald Knuth and its variants such as LaTeX), hierarchical data format (HDF), CDF (common data format), and the like, as well as commercial formats such as Postscript, PDF (portable document format), RTF (rich text format), and the like. One of these proposed standards was the “standard generalized markup language” (SGML), first published in 1986. SGML was flexible enough to represent either formatting or data exchange. However, with its flexibility came complexity. Only when two simpler forms arose, namely HTML (HyperText Markup Language) for describing Web pages and XML (eXtensible Markup Language) for data exchange, did variants of the SGML form emerge as widely used common standards. The XML standard was first published by the W3C in February 1998, rather after the semi-structured data term had achieved some impact ( Dan Suciu was the first to publish on the linkage of XML to semi-structured data in 1998 (D. Suciu, “Semistructured data and XML,” in International Conference on Foundations of Data Organization (FODO), Kobe, Japan, November 1998; see PDF option from, a reference that remains worth reading to this day.

[8] David Loshin, “Simple semi-structured data,” Business Intelligence Network, October 17, 2005. See This example is actually quite complex and demonstrates the challenges facing IE software. Extracted entities most often relate to the nouns or “things” within a document. Note also, for example, how many of the entities involve internal “co-referencing,” or the relation of subjects such as “he” or clock times such as “10 a.m” to specific dates. A good entity extraction engine helps resolve these so-called “within document co-references.”

[9] Example text-based data serializations and formats used on the Web include Atom, Gdata, JSON, N3, pickle, RDF/XML, RSS, Turtle, XML, YaML.

[10] Example mark-up languages used on the Web include HTML, Wikitext, XHTML, XML.

[11] It was called the httpRange-14 issue by virtue of the agenda label at the TAG’s meetings.

[12] Of course, nothing compels the publisher to provide these instructions, but they are integral to “best practices” and publishers desirous of attracting consumers have incentives to follow them.

Depending on the nature of the information resource, an HTTP GET returns a 200 OK status and sends the resource (for example a request for a Web page or an RDF file) if the request is for the correct type. If the request is for the wrong type, the publisher can include in the header a 303 (See Other) redirect response to send the requester to the appropriate information resource URL. If the request is for an unknown URI or resource, a variety of 4xx error responses may result. (See further [13].)

One advantage of posting structured data resources as RDF- linked data is that this redirect can be sent to a REST-style Web Service that does a best-effort DESCRIBE of the resource using SPARQL. Since SPARQL is a query language, protocol, and results representation scheme, the redirect can come in the form of a URL that can directly query the structured data resource. In this manner, large structured data resource data sets can act as ‘endpoints’ for context-specific information linked anywhere on the Web.
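A redirect target of this kind can be assembled in a few lines of Python. The endpoint `http://example.org/sparql` and the resource URI are hypothetical, as is the function name `describe_url`; the only things assumed about SPARQL are the `DESCRIBE` query form and the `query` parameter used for GET requests in the SPARQL protocol.

```python
from urllib.parse import urlencode

def describe_url(endpoint, resource_uri):
    """Build a GET URL asking a SPARQL endpoint to DESCRIBE a resource."""
    query = f"DESCRIBE <{resource_uri}>"
    # urlencode percent-encodes the query text for use in a URL.
    return f"{endpoint}?{urlencode({'query': query})}"

url = describe_url(
    "http://example.org/sparql",       # hypothetical endpoint
    "http://example.org/id/salem",     # hypothetical resource URI
)
```

A publisher could place such a URL in the `Location` header of the 303 response, so that following the redirect runs the query directly.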

[13] The TAG’s 2005 ‘compromise solution’ was reported by Roy Fielding, with his public notice reproduced in full:


That we provide advice to the community that they may mint
“http” URIs for any resource provided that they follow this
simple rule for the sake of removing ambiguity:
a) If an “http” resource responds to a GET request with a
2xx response, then the resource identified by that URI
is an information resource;
b) If an “http” resource responds to a GET request with a
303 (See Other) response, then the resource identified
by that URI could be any resource;
c) If an “http” resource responds to a GET request with a
4xx (error) response, then the nature of the resource
is unknown.


I believe that this solution enables people to name arbitrary
resources using the “http” namespace without any dependence on
fragment vs non-fragment URIs, while at the same time providing
a mechanism whereby information can be supplied via the 303
redirect without leading to ambiguous interpretation of such
information as being a representation of the resource (rather,
the redirection points to a different resource in the same way
as an external link from one resource to the other).

Note that point “a” discusses an “information resource,” with the contrasting treatment in point “b” as potentially “any resource”.
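Read from the client's side, the three-part rule in the notice amounts to a small decision function. The Python sketch below simply restates points a) through c); the function name `resource_nature` and the wording of the returned labels are assumptions, and status codes the notice does not mention are reported as not covered.

```python
def resource_nature(status):
    """Apply the TAG's 2005 rule to the status code of a GET on an http URI."""
    if 200 <= status < 300:
        return "information resource"   # point a): 2xx response
    if status == 303:
        return "any resource"           # point b): 303 See Other
    if 400 <= status < 500:
        return "unknown"                # point c): 4xx error
    return "not covered by the rule"    # e.g. other 3xx or 5xx responses
```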

To my knowledge, the first time that the unfortunate term ‘non-information resource’ was introduced to cover these point “b” conditions was in a draft (that is, unofficial) TAG finding from two years later, on May 31 of this year, “Dereferencing HTTP URIs.” Besides its unfortunate continuation of the ‘dereferencing’ term (a discussion for another day), it introduces the even-worse ‘non-information resource’ terminology. That draft TAG finding in Sec 2 talks about ‘information resources’ (as does the 2005 TAG finding), and in Sec 3 about ‘other Web resources’ (or the “any” category from the 2005 notice). In Sec 4, however, the document switches from ‘other Web’ to ‘non-information’, which is then continued through the rest of the document.

It is not too late for the community to cease using this term; our replacement suggestion is ‘structured data resource.’

[14] You may also want to see Leo Sauermann, Richard Cyganiak, and Max Völkel, “Cool URIs for the Semantic Web,” or Dan Connolly, “A Pragmatic Theory of Reference for the Web.”

[15] Web documents are typically semi-structured with embedded tag, metadata, and presentation structure in such things as headers, tables, layouts, labels, dropdown lists, etc. For the unstructured text content, traditional information extraction techniques are applied. But for the semi-structure, various scraping or extraction techniques may either be crafted by hand or be semi-automatically or automatically applied through regular expression processing, pattern matching, inspection of the Web page DOM or other techniques. One posting in this structured Web series is to be devoted to this topic.

[16] There is an impressive and growing list of data conversion protocols and tools, most of which support multiple input and output forms, including RDF and various serializations of RDF. This list includes Virtuoso Sponger, GRDDL, Babel, RDFizers, general converters, Triplr, etc. One posting in this structured Web series is to be devoted to this topic.

Posted by AI3's author, Mike Bergman Posted on July 22, 2007 at 3:58 pm in Adaptive Information, Semantic Web, Structured Web | Comments (2)
Posted:July 18, 2007
Image from Paul Thiessan
The structured Web is object-level data within Internet documents and databases that can be extracted, converted from available forms, represented in standard ways, shared, re-purposed, combined, viewed, analyzed and qualified without respect to originating form or provenance.

Over the past few months I have increasingly been writing about and referring to the structured Web. I have done so purposefully, but, so far, with little background or explication. With the inauguration of this occasional series, I hope to bring more color and depth to this topic [1].

Literally, over the past year, I have been learning and documenting on AI3 my attempts to understand the basis, concepts and tools of the emerging semantic Web. In that process, I have come to define my own outlines of the Web past, present and future. Within this world view, I see the structured Web as today’s current imperative and reality.

Confusing Terminology Surrounding Obvious Change

Some Web pundits have embraced a versioning terminology of Web 2.0 and Web 3.0 to describe one such world view. I don’t personally agree with this silly versioning — indeed I poked fun in a tongue-in-cheek posting about Web 98.6 more than a year ago — but such terminology has gotten some traction and serves a purpose. I actually give my own definitions for such “versions” below if for no other reason than to close the gap with alternative world views.

We need not go back to the alternative early protocols of Usenet (and news groups), Gopher and FTP and their search engines of Veronica, WAIS, Jughead or Archie in 1991 [2] when Tim Berners-Lee first publicly announced the World Wide Web and its combination of hypertext with the Internet. More likely, the release of the Mosaic browser and CERN’s decision to make access to the Web free in 1993 marked the true take-off point for the Web and the continued demise of the competing protocols.

Images and links in Web pages (“documents”) plus the HTML mark-up language to enable the styling and graphical design of those pages were very much in keeping with general trends, paralleling the earlier transition of personal computers to graphical interfaces and away from terminals. Mosaic became the foundation for the Netscape browser, best links compilations became a big hit through sites like Yahoo!, and the Lycos search engine, one of the first profitable Web ventures, indexed a mere 54,000 pages when it was publicly released in 1994 [3].

This initial start to the Web — now referred to by some as ‘Web 1.0’ — can be squarely timed to 1993-1994. By 1995, the Web was appearing on the covers of major news magazines and by 1996 the phenomenon was at full throttle. But, since these early beginnings, the Web has gone through many different “versions” and transitions, most not fitting with version numbers, as some of these examples show:

  • Academic v. Commercial Web — magazines like Wired, Red Herring, Business 2.0 and the mainstream press showered us with names such as e-commerce, dot-com and the gold rush for companies to establish a Web presence, B2B, etc. in the latter part of the 1990s. In fact, for some early architects of the Web, this was a period of some trauma and handwringing, since the “pure” open and academic roots of the Internet and the Web were being taken over by mainstream use, commercialization and the monied dominance of venture capital. This first major change in the Web, its first major new ‘version’ if you will, came back down to earth as a result of the “dot-com bust” of the bubble in 2001 [4]
  • Static v. Dynamic Web — all initial Web content was based on documents created by hand and posted as individual and hyperlinked Web pages. The relatively few documents of the early Web meant that hand-compiled “best of” listings such as Yahoo! worked pretty well; ‘metasearchers’ also emerged to overcome the limited indexing coverage of early search engines. These trends, however, were also masking another version sea-change for the Web. With growth and more content, many larger sites were moving to dynamic page generation with retrieval via search forms. This dynamic portion of the Web, called at times either the ‘deep Web’ or ‘invisible Web,’ acted like standard search engines and therefore was generally overlooked until I first popularized this change in 2000 [5]. I would argue that the shift to dynamic content, with certainly hundreds of thousands of such database-backed sites now in existence — and content many times larger than what is indexed by standard search engines — was also a major version shift for the Web
  • Open Source and Open Data — the open source Linux and the Apache Web server have been two software foundations to the growth of the Web, and MySQL has had a leading role in supporting sites and software with database-backed designs [6]. It is beyond the scope of this piece, but I believe that the dot-com frenzy, the demise of Netscape at the hands of Internet Explorer and other tensions with commercial interests, plus the very empowering nature of the Internet itself are also leading to a version change of the Web from commercial software products to open source ones. Further, proprietary publishers and data sources have had only limited success on the Web; we are now seeing strong trends to open data as well. Additionally, the very nature of open source software lends itself to interoperability and modularity based on naturally selected building blocks. This “open” infrastructural basis of the Web is more subtle and hard to see, but provides some powerful drivers for how more surface-oriented trends express themselves
  • Social Networking Web — the same early software that enabled dynamic Web pages and database-backed designs naturally lent itself to early blogs, wikis and content-management systems, many backed by MySQL, which in turn led to more community-oriented designs and services such as for bookmarking, Flickr for photos, later YouTube for videos, and literally thousands of others. This trend, resulting from changed practices and the use of different tools and ways to harness user-generated content, and not resulting from any changes to standards per se, was first called ‘Web 2.0’ by Tim O’Reilly in about 2003
  • Ajax and Widgets — some would include Web services, APIs and ‘mashups’ in Web 2.0, often as expressed through embedded Web ‘widgets’ and the use of Ajax or similar dynamic scripting approaches. These considerations were not part of the original Web 2.0 term, but usage today likely embraces aspects of these in many definitions of Web 2.0. In any case, there is certainly a change within the Web to more interactive, attractive, full-featured user interfaces, with interface updates no longer requiring a full Web page refresh
  • Document-centric Web v. Data-centric Web — however, in any event, portions of these trends and changes are more broadly combining to represent another version change in the Web from one solely focused on documents to one that is more data-centric; this topic, the basis for the term ‘structured Web,’ is more fully discussed below
  • Web 3.0 — Wikipedia states, “Web 3.0 is a term that has been coined with different meanings to describe the evolution of Web usage and interaction among several separate paths. These include transforming the Web into a database, a move towards making content accessible by multiple non-browser applications, the leveraging of artificial intelligence technologies, the Semantic web, or the Geospatial Web.” Of all current terms, this one is fully the silliest, since there is no consensus on what it represents nor its endpoints
  • Semantic Web — the glossary at W3C states that the semantic Web is “the Web of data with meaning in the sense that a computer program can learn enough about what the data means to process it.” Elsewhere, the vision of the semantic Web is described by the Education and Outreach working group (SWEO) of the W3C “to extend principles of the Web from documents to data. This extension will allow to fulfill more of the Web's potential, in that it will allow data to be shared effectively by wider communities, and to be processed automatically by tools as well as manually.” Note the importance of computer processing and autonomy in these statements, not to mention the pivotal term of ‘semantics.’ This is an expansive and wide-embracing vision, some challenges of which I more fully describe below, and
  • Visions of the Web — the semantic Web vision is matched with other visions, including voice activation, autonomous agents doing our bidding in the background, wireless interlinked everything, and other versions of the Web that are sometimes portrayed in science fiction. Whenever such transitions occur, they will all surely rely on all the various “versions” of the Web that have occurred in the short past 15 years of the Web’s existence.

Despite these differences in viewpoint, language does matter. Though some may view language as a contest in “branding,” which can legitimately apply in other venues, I think the issue here goes well beyond “branding.” Language is also necessary to aid communication.

As I explain below and elaborate upon more fully throughout this series, I believe one of the correct terms for the current evolutionary state of the Web is the ‘structured Web.’

A Clear Transition to a Data-centric Web

As noted, portions of these trends and changes are more broadly combining to represent another transitional change in the Web from one solely focused on documents to one that is more object- or data-centric. Evidence of this trend includes such factors as:

  • Broad database-backed Web site designs, with content re-purposed and served up dynamically, the trend first noted as the ‘deep Web’
  • ‘Mashups’ of data from multiple sources, such as in maps, timelines, etc.
  • The exposure of Web services and APIs. The, for example, documents a doubling of such sources in the past nine months via its listing (as of July 2007) of about 500 APIs and more than 2,100 mashups
  • Huge growth and availability of large, often public, data sources, from US government and social sources like DBpedia, an RDF data extraction from Wikipedia (and others)
  • The emergence of entire data-centric sources, services and mashup platforms such as Freebase, Yahoo! Pipes, Google Base, Teqlo, QEDwiki, Ning, and OpenKapow
  • The rapid — and now almost universal — availability of data format converters (mostly to RDF) such as the listings of the W3C’s RDF Converters and MIT’s ‘RDFizers,’ the GRDDL initiative, Triplr, and the like
  • Soon, other to-be-announced major data source look-up references, directories and conversion and filtering services.

One of the most popular series of presentations at this year’s WWW2007 conference in Banff was from the Linked Open Data project of the SWEO interest group. The members of this LOD project — comprised of accomplished advocates, developers and theorists — are providing the awareness, tools and example data that are showing how this emerging version may look. In fact, the group has just announced crossing the threshold of 1 billion ‘triples’ with 180,000 interlinks within its online DBpedia service, via these sources:

The LOD’s term for this effort is ‘linked data’, and a Web site has been established to promote it. Others, harking back to Tim Berners-Lee’s original definition, refer to current efforts as a ‘Web of data’ or the ‘Semantic Data Web.’ Kingsley Idehen has been promoting the idea of ‘data spaces’ (personal and collective), which is also a powerful metaphor.

Frankly, I think all of these terms are correct and useful. Yet I prefer the term structured Web because it is both more and less than some of these other terms.

The structured Web is more in that it pertains to any data formalism in use on the Web and includes the notion of extracting structure from uncharacterized content, by far the largest repository of potentially useful information on the Web. Yet the structured Web is also less because its ambition is solely to get that data into an interoperable framework and to forgo the full objectives of the ‘Semantic Web.’ In that regard, my concept of the structured Web is perhaps closest to the idea of linked data, though with less insistence on “correct” RDF and with specific attention to structure extraction from uncharacterized content.

Remarkable Progress on a Still Incomplete Journey

One of today’s realities is that we have accomplished much but still have a long way to go to achieve the grand vision of the ‘Semantic Web’ (capitalized).

More than a year ago I wrote a piece on “Climbing the Data Federation Pyramid” that noted the tremendous progress that has been made in the last twenty years in overcoming many seemingly intractable issues in data interoperability, initially of a physical and hardware nature. The Internet and Web standards have made enormous contributions to that progress.

The diagram I used in that piece is shown below [7]. Reaching the pyramid’s pinnacle could be argued as having achieved the grand vision of the Semantic Web. With the adoption of the Internet and Web protocols, all layers up through data representation have largely been solved. Data representation, data models, schema for different world views, and means for reconciling and mediating those different world views are much of the focus of today’s conceptual challenges.

Note, as we discuss the structured Web, that we are largely focusing on the layer dealing with data representation, with some minor portions (principally in disambiguation) dealing with semantics. Getting data into a canonical data representation or model still leaves crucial challenges in what the data means (its semantics), reasoning over that data (inference and pragmatics), and whether the data is authoritative or can be trusted. These are the daunting — and largely remaining — challenges of the Semantic Web.

For example, let’s look solely at the layer of semantics, the immediate challenge after data representation. By semantics, we are referring to whether different statements from different sources indeed refer or not to the same entity or concept; in other words, have the same meaning. Such a determination is pivotal if we are to combine data from multiple sources.

The use of RDF, accurate name spaces and syntactically correct URIs aids this resolution, but does not completely solve it. Ultimately, semantic mediation (such as my “glad” is equivalent to your “happy”) means resolving potential heterogeneities across on the order of 40 discrete categories of mismatch, including units of measure, terminology, language, and many others. These sources may derive from structure, domain, data or language, as shown in this table [8]:



STRUCTURAL
  • Case Sensitivity
  • Generalization / Specialization
  • Internal Path Discrepancy
  • Missing Item: Content Discrepancy; Attribute List Discrepancy; Missing Attribute; Missing Content
  • Element Ordering
  • Constraint Mismatch
  • Type Mismatch

DOMAIN
  • Schematic Discrepancy: Element-value to Element-label Mapping; Attribute-value to Element-label Mapping; Element-value to Attribute-label Mapping; Attribute-value to Attribute-label Mapping
  • Scale or Units
  • Data Representation: Primitive Data Type; Data Format

DATA
  • Naming: Case Sensitivity
  • ID Mismatch or Missing ID
  • Missing Data
  • Incorrect Spelling

LANGUAGE
  • Encoding: Ingest Encoding Mismatch; Ingest Encoding Lacking; Query Encoding Mismatch; Query Encoding Lacking
  • Languages: Script Mismatches; Parsing / Morphological Analysis Errors (many); Syntactical Errors (many); Semantic Errors (many)

Using the same data model (say, RDF) or the same name spaces (say, Dublin Core or FOAF) helps somewhat to remove some of these sources of heterogeneity, but not all. Undoubtedly, longer term, resolving these heterogeneities will prove tractable. But they are not easily so today.
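To make the mediation problem concrete, here is a minimal Python sketch that reconciles two records differing in case, terminology and units, three of the mismatch types listed above. The field names, the synonym table and the unit table are hypothetical and chosen only for illustration; real mediation would have to cover many more categories than these.

```python
# Hypothetical synonym table: maps each source's term to one shared label.
SYNONYMS = {"glad": "happy", "happy": "happy"}

# Hypothetical unit table: factor that converts each unit to metres.
TO_METRES = {"m": 1.0, "cm": 0.01, "ft": 0.3048}


def normalize(record):
    """Return a copy of record with case, terminology and units reconciled."""
    mood = record["mood"].strip().lower()   # case sensitivity
    mood = SYNONYMS.get(mood, mood)         # terminology (synonyms)
    metres = record["height"] * TO_METRES[record["unit"]]  # scale or units
    return {"mood": mood, "height_m": round(metres, 3)}


mine = {"mood": "Glad", "height": 180, "unit": "cm"}
yours = {"mood": "happy", "height": 1.8, "unit": "m"}

# The raw records differ, yet they describe the same thing once mediated.
print(normalize(mine) == normalize(yours))
```

Even in this toy case, every mapping had to be supplied by hand, which is why sharing a data model or name space removes only some of the heterogeneity.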

This observation does not undercut the Semantic Web vision nor negate the massive labors in support of that vision taken to date. But, hopefully, this observation may bring some perspective to the task ahead to obtain that vision.

Lowering Our Sights

If nothing else, the reality of the past 15 years shows us that the Web is a “dirty,” chaotic place. If HTML coding can be screwed up, it will be. If loopholes in standards and protocols exist, they will be exploited. If there is ambiguity, all interpretations become possible, with many passionately held. Innovation and unintended uses occur everywhere.

This should not be surprising, and experienced Web designers, scientists and technologists should all know this by now. There can be no disconnect between workable standards and approaches and actual use in the “wild.” Nuanced arguments over the subtleties of standards and approaches are bound to fail. Robustness, simplicity and forgiveness must take precedence over elegance and theoretical completeness.

While there has been obvious growth in the sophistication of Web sites and the underlying technologies that support them, we see continued use of obsolete approaches that clearly should have been abandoned long ago (such as Web-safe colors, small displays, older browser versions, Web pages parked on some servers that have not been modified or looked at by their original authors in a decade, etc.). We also see slow uptake for clearly “better” new approaches. And we also sometimes see explosive uptake of approaches and ideas that seemingly come out of nowhere.

We also see that those approaches that enjoy the greatest success — blogging, tagging, microformats, RSS, widgets, for example, come most recently to mind — are those that the “citizen” user can easily and readily embrace. HTML was pretty foreign at first, but now millions of users modify their own code. Millions of users of various CMS systems and Firefox have learned how to install plug-ins and add-ins and modify CSS themes and use administration consoles.

So, my observation and argument is not that we must always choose what is mindless and unchallenging. But my argument is that we must accept real-world diversity and seek simplicity, robustness and clarity for what is new.

After nearly a decade of standards work, the basis for beginning the transition to the semantic Web is in place. But that vision itself sometimes appears too demanding, too intimidating. The vision at times appears all too unreachable.

Of course, this perception is wrong. Measured over many years, perhaps some decades, the vision of the semantic Web is reachable. Much remains to be worked on and understood regarding this vision in terms of mediating and resolving semantic heterogeneities, capturing and expressing world views through formal ontologies, making inferences between these views, and establishing trust and authoritativeness. And those challenges do not yet address the even more-exciting prospects of intelligent and autonomous agents.

Rather, the rationale for the structured Web is to tone down the vision, stay with the here and now, focus on what is achievable today. And what is achievable today is very great.

Why This Series on the ‘Structured Web’?

Though version numbers for the Web are silly, with ‘Web 3.0’ for the semantic Web possibly being the silliest of all, such attempts do speak to the need for providing handles and language for capturing the dynamic change, diversity and complexity of the Web.

Today, right now, and all around us, a fundamental transition is taking place in the Web from a document-centric to a data-centric environment. A confluence of standards, advocacies, and previous trends is fueling this transition. Since the practical building blocks already exist, we will see this structured Web unfold before us at amazing speed.

The concept of the structured Web is thus narrower and less ambitious in scope than the ‘Semantic Web.’ The structured Web is merely a transitional step on the journey to the vision of the semantic Web, albeit one that can be fully realized today with current technologies and current understandings.

The purpose of this new series is thus to give prominence to this transition and to highlight the pragmatic, practical building blocks available to contribute to this transition. By somewhat shifting boundary definitions, the idea of the structured Web also aims to give more prominence to the importance of usability and structure extraction from semi-structured and unstructured content. These, too, are exciting areas with much potential.

So, as a way to provide a touchstone for continued discussion on this matter, here is one working definition of the structured Web:

The structured Web is object-level data within Internet documents and databases that can be extracted, converted from available forms, represented in standard ways, shared, re-purposed, combined, viewed, analyzed and qualified without respect to originating form or provenance.

Anticipated Topics in this Series

Some of the tentative topics that I plan to address in this series include discussion of what constitutes ‘structure’ in content, why structure is important, the various existing forms of structure, human v. machine bases for viewing and interpreting structure, the importance of finding “canonical” representation forms while also appreciating real-world diversity, the means to convert data forms and serializations, the means to extract structure from all types of content, transitioning to semantic understandings, and likely others.

Others may be added to this series over time and will be categorized under ‘Structured Web’ on the AI3 blog.

This posting is the first part of a new, occasional series on the Structured Web, which also has its own new category. There are some additional prior topics in this series.

[1] You will note a heavy emphasis on Wikipedia definitions and histories in this piece, in keeping with the general theme of versions and transitions on the Web.

[2] News groups really did not have a good search engine until the launch of Deja News in 1995.

[3] Chris Sherman, "Happy Birthday, Lycos!," Search Engine Watch, August 14, 2002.

[4] A fairly good summary of the History of the Web can be found on Wikipedia.

[5] Michael K. Bergman (Aug 2001). “The Deep Web: Surfacing Hidden Value”. The Journal of Electronic Publishing 7 (1). An earlier version of this paper was published by BrightPlanet Corp. in July 2000.

[6] While there are variations, Linux, Apache, MySQL and the scripting languages of either Python, PHP, or Perl are often referred to as ‘LAMP’, one central basis for much open source software and, more broadly, interoperable open-source application packages.

[7] I would make a few changes today, notably in deprecating XML somewhat.

[8] This table builds on Pluempitiwiriyawej and Hammer's schema by adding the fourth major category of language. See Charnyote Pluempitiwiriyawej and Joachim Hammer, "A Classification Scheme for Semantic and Schematic Heterogeneities in XML Data Sources," Technical Report TR00-004, University of Florida, Gainesville, FL, 36 pp., September 2000.

Posted:July 12, 2007


A Reference for Data Set Interoperability, Look Up and Retrieval

UMBEL is a lightweight way to describe the subject(s) of Web content, akin to the relationship “isAbout”. Its subject reference structure is meant to be simple, universally applicable, and agnostic to the form or schema of source data. UMBEL does not replace formal domain or upper ontologies and has little or no inferential power. It is merely a pool of consensus ‘proxies’ to initially describe what subjects data sets are about.

UMBEL’s design includes binding mechanisms that work with HTML, tagging or other standard practices, including various RDF schema and more formal ontologies. Its reference subject ‘backbone’ is derived from the intersection of common subjects found on popularly used Web sites and other accepted subject references. Access and easy adoption are given preference over inferential or logical elegance.

In addition to its core reference subjects, the UMBEL project will provide look up, query, registration, pinging, and related services. The project is completely open and supported by a community process. All project products are made available without charge under Creative Commons licenses. UMBEL’s development is being backed by a number of leading open data efforts and entities; see the last section for how to get involved.

The UMBEL project stands for the Upper-level Mapping and Binding Exchange Layer. UMBEL is pronounced like “humble” — in keeping with its nature — except without the “h”. The name has the same Latin root as umbrella (umbra for shade, or umbella for parasol), meant to convey the umbrella-like nature of UMBEL’s subject bindings.

What’s the Problem?

With dozens of protocols and hundreds of thousands of potentially useful data sets, there are many challenges to getting Web data to interoperate. Two of these problems are foundational.

First, there are dozens of formalisms, schema, models and serializations for characterizing and communicating data and data content on the Web, ranging from the simplest Web page to the most formal OWL ontologies. A universal mechanism is lacking for how these variations can describe or publish to one another what they are about. This mechanism must be simple, neutral, broadly applicable and widely accepted.

Second, even if this publication mechanism existed, there is no accepted set of subjects for referencing what this diverse content is about. No attempt to date to provide a reference subject structure has been widely accepted.

Combined, these twin problems mean there are few road signs and poor road maps for how to find relevant data sets on the Web. UMBEL provides simple — but necessary — first steps to address these basic problems.

Simple, with Low Expectations

Advocates and users of various models and formalisms on the Web have their real-world reasons for embracing each form. Domain experts and various communities have their own world views, represented by their own vocabularies and structure. Only by understanding and respecting those differences can means to bridge them become widely accepted.

There is, of course, no such thing as complete objectivity or neutrality. But, from the standpoint of UMBEL and its purpose, keeping its approach simple with a minimum of structure poses the least challenge to the world views of existing publishers and data sets on the Web — and therefore the best likelihood of wide acceptance. Where choices are necessary, such as the selection of the reference subjects themselves, building from accepted Web practices and norms helps minimize bias and arbitrariness.

Thus, by necessity, UMBEL must be simple with limited ambitions. Its reference structure is merely a ‘bag of subjects’, with each subject reference only acting as a ‘proxy’ to a set of concepts that specific users may describe and refer to in their own ways. UMBEL’s core structure is completely flat, with no implied hierarchy or structure amongst its reference subjects. UMBEL’s reference subjects are simply that, proxy references and no more.

UMBEL thus has no or minimal inference power (though some disambiguation is possible). Inferencing, usefulness and authoritativeness are the responsibility of others. UMBEL is meant only to be a map to possible subjects, not whether those destinations are worthwhile or, indeed, even correct.
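As a rough illustration of the ‘bag of subjects’ idea, the Python sketch below keeps a flat pool of subject proxies and records “isAbout” bindings from datasets to those proxies. The subject names, the dataset URLs and the helper functions `bind` and `datasets_about` are all hypothetical; nothing here is drawn from the UMBEL specification itself.

```python
# A flat pool of subject proxies: no hierarchy, no relations among subjects.
SUBJECTS = {"astronomy", "cooking", "baseball"}  # hypothetical proxies

# Each binding is a (dataset, "isAbout", subject) statement.
bindings = set()


def bind(dataset, subject):
    """Record that a dataset is about a subject proxy from the pool."""
    if subject not in SUBJECTS:
        raise ValueError(f"unknown subject proxy: {subject}")
    bindings.add((dataset, "isAbout", subject))


def datasets_about(subject):
    """Look up every dataset bound to the given subject proxy."""
    return sorted(d for d, _, s in bindings if s == subject)


bind("http://example.org/stars.rdf", "astronomy")
bind("http://example.org/recipes.xml", "cooking")
bind("http://example.org/planets.html", "astronomy")

print(datasets_about("astronomy"))
```

Because the pool is a plain set, the sketch can answer “which datasets are about this subject?” but cannot infer anything further, such as whether one subject is broader than another. That matches the limited ambition described above.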

Consensus and Use Determine the Subject Pool

The selection of the actual subject proxies within the UMBEL core is to be based on consensus use. The subjects of existing and popular Web subject portals such as Wikipedia and the Open Directory Project (among others) will be intersected with other widely accepted subject reference systems such as WordNet and library classification systems (among others) in order to derive the candidate pool of UMBEL subject proxies. The actual methodology and sources of this process are still being determined (see further the project specification).

The objective, in any case, is to provide a simple and transparent method for subject selection that reflects current use and consensus to the maximum extent possible. The anticipation is that the first subject candidate pool will number in the many hundreds to the low thousands of proxies.
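Since the methodology is still being determined, the following Python sketch shows only one possible reading of the intersection step: a subject becomes a candidate if, after simple label normalization, it appears in at least one portal and in at least one reference system. The function names and the toy subject lists are invented for illustration and do not reflect the real contents of any of these sources.

```python
def normalize(label):
    """Fold case and whitespace so 'Solar  System' matches 'solar system'."""
    return " ".join(label.lower().split())


def candidate_pool(portal_sources, reference_sources):
    """Subjects found in at least one portal AND at least one reference system."""
    portals = {normalize(s) for src in portal_sources for s in src}
    references = {normalize(s) for src in reference_sources for s in src}
    return sorted(portals & references)


# Toy, invented subject lists standing in for the real sources.
wikipedia = ["Astronomy", "Baseball", "Pokémon"]
open_directory = ["Cooking", "baseball"]
wordnet = ["astronomy", "cooking", "gerund"]
library_classes = ["Baseball", "Astronomy"]

pool = candidate_pool([wikipedia, open_directory], [wordnet, library_classes])
print(pool)
```

A stricter rule, such as requiring a subject to appear in every source, would shrink the pool; where to set that threshold is exactly the kind of choice the consensus process would need to make.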

UMBEL as a general subject ‘backbone’ is meant to be useful as a reference by more specific domains or ontologies, but not fully descriptive for any of them. The core, internal UMBEL ontology is to be based on RDF and written in the RDF Schema vocabulary of SKOS (Simple Knowledge Organization System).

Universal Applicability

Very simple binding mechanisms will be developed and extended to the most widely employed approaches on the Web. UMBEL will, at minimum, support Atom, microformats, OPML, OWL, RDF, RDFa, RDF Schema, RSS, tags (via Tag Commons), and topic maps in its first release. The simplicity of the ontology and approach will enable other formats to be easily added.

Ping, update and registration protocols will also be provided for these formats. Existing project sponsors already possess a variety of ping, update, conversion and translation utilities for such purposes.

Additional UMBEL Initiatives

Besides the core structure, the UMBEL project will also develop a second ‘unofficial’ structure of hierarchical and interlinked subject relationships. This ‘unofficial’ structure will be used solely for look up and browsing functions, and will reside external to the core UMBEL subject and binding structure. Indeed, we anticipate that many such look-up structures from other parties may evolve over time for specific purposes and viewpoints.

Finally, besides development of the UMBEL ontology, the project will also be providing a data set registration service, information and collaboration Web site, tools clearinghouse, and support for language translations and some tools development.

How to Help and Get Involved

The initial project site is at, including this project introduction, the draft project specification (, and other helpful background information. A more interactive Web site is currently under development and will be announced shortly.

A mailing list you can monitor or join to become part of the project is at

Posted by AI3's author, Mike Bergman Posted on July 12, 2007 at 3:36 pm in Adaptive Information, Semantic Web, Structured Web, UMBEL | Comments (3)