Structured Data and UMBEL Will Benefit from a Standard Registration Format
One implication of the structured Web is that, with the rapid proliferation of the data, how do you find what is relevant? That purpose is what stimulated the initiation of the UMBEL (Upper Mapping and Binding Exchange Layer) project.
The original specification for UMBEL recognized the need for a reference set of subject “proxies” to help describe what each data set “was about” as well as the need for a variety of binding mechanisms depending on scope and data structure of the source dataset.
At its most general level, the intent of UMBEL is to provide four components in its ‘core’ ontology :
- A set of reference subject “proxies” and properties and relations around them
- A means of binding the ontologies, classes or subsets of data within each contributing dataset to the ‘core’ UMBEL ontology and those subject proxies
- Characterizing a given dataset via metadata, and
- Describing access methods and endpoints for getting at that data.
The first component on subject proxies is largely left to another discussion. The topic of this posting is mostly related to the latter three components of dataset binding and registration mechanisms.
How such road signs might work, the contributions of possible analogs, their differences in providing solutions in and of themselves, and a first-cut outline of the resulting ‘core’ UMBEL ontology are described below.
The Why and How of These Road Signs
‘Road signs’ are simply a shorthand for how to find stuff. The normative case is to have sufficient characterization of datasets such that a central registry can aid their discovery, look-up and productive access and use. Yet registration can be an onerous task, and one not generally easily or willingly undertaken by publishers or providers.
These challenges lead to two important design considerations. First, only minimal characterization should be required for initially registering a dataset. The remaining characteristics should be optional. The incentive over time for such optional fields to be completed is its indication to consumers that fully characterized datasets may be more dependable or authoritative. It is possible, for example, to envision external qualification rules or routines that “score” competing datasets providing similar information based on the completeness of dataset characterization.
Second, any party should be allowed to register and characterize a dataset. There may be motivations by non-publishers to do so, for altruistic or other reasons. However, in the case of disputes over the accuracy of characterization, the owner or publisher should have final say. Another open question is whether competing characterizations or different registrations should be allowed for the same dataset.
These considerations are made still further complicated by the range of scope and scale and data content and formalism on the real-world Web.
In the spirit of not re-inventing the wheel, we began a process to discover what other communities have done faced with similar problems. The two closest analogs are, firstly, the library community and its need to describe digital archives and collections and, secondly, the general approaches devoted to Web services, including its dedicated language WSDL (Web Services Description Language).
Digital Collections and Archives
Librarians and information architects have been active for at least the past decade in efforts to describe and relate digital collections to one another. These efforts have been geared to search, general look-up and interlibrary loans and sharing. This community of practice, while embracing a variety of somewhat competing and overlapping schemes, has also (from an outsider’s viewpoint) come up with a general consensus view as to how to describe these archives and collections.
(The library community still tends to use the terminology of ‘metadata’ and descriptors, whereas the ontology and RDF communities tend to speak more of classes, properties and instances. However, the net intent and outcome still appears much the same.)
A common reference point to these schemes is the Dublin Core Collection Application Profile (DCCAP), a specification of how metadata terms from the Dublin Core metadata initiative (DCMI) and other vocabularies can be used to construct a description of a collection in accordance with the DCMI Abstract Model. There is also a more easily read summary of the DCCAP . Though, again, there are differences in terminology, the presentation of this scheme is very much in keeping with the format of a W3C specification (such as for WSDL, see below).
One of the first widely embraced efforts of the community is the Open Archives Initiative – Protocol for Metadata Harvesting (OAI-PMH), begun in the late 1990s. OAI-PMH provides specifications for both data providers and service providers in how to assign and describe collection metadata. The OAI Protocol has become widely adopted by many digital libraries, institutional repositories, and digital archives, with total sources registered numbering into the thousands . These large institutional repositories are also increasingly being indexed by large search engines (such as Google Scholar).
The National Information Standards Organization (NISO) began a MetaSearch Initiative (NISO-MS) that resulted in the development of a collection description schema. Though NISO recently reorganized its content and collection activities, the draft NISO Collection Description Specification  remains a readable overall reference for these initiatives. Like related initiatives, the draft also uses the DCCAP format.
A similar effort was initiated in the United Kingdom called the Information Environment Service Registry. IESR was designed to make it easier for other applications to discover and use materials which will help their users’ learning, teaching and research. IESR’s various terms (namespaces, classes and properties) and controlled vocabularies are very helpful to UMBEL.
Another example is the Ockham Digital Library Service Registry (DLSR), that enables service-based digital libraries, funded by the National Science Digital Library (NSDL) initiative, to interoperate. Efforts such as this, in turn, have led to interest in exploiting “light-weight” protocols and open source tools in the community . For example, there is an interesting discussion of tools to Implement Digital Library Collections and Services from DLib magazine .
These efforts, among many across the digital library community including the related National Digital Information Infrastructure and Preservation Program (NDIIPP) at the Library of Congress, represent tremendous efforts to describe digital collections. Clearly, subsets of this learning can be applied directly to UMBEL in relation to registry and dataset metadata.
Most all of these schemes use XML data serializations. Our investigations to date have not been able to turn up any RDF representations, though they are surely to come. Fortunately, all of the DCCAP-based efforts have an RDF-like design and most properties have URIs and defined namespaces.
Web Services and Bindings
The Web Services Description Language (WSDL) is an XML-based language for how to communicate with, and therefore interoperate, Web services. WSDL defines a service as accessible Internet endpoints (or ports), supported operations and related messages. It was first proposed to the W3C in March 2001; though it is a recommendation, it is not yet an official W3C standard.
A service is a definition of what kinds of operations can be performed and the messages involved. A port is defined by associating a network address with a reusable binding, with a collection of such ports constituting the service. Messages are abstract descriptions of the data being exchanged, and port types are abstract collections of supported operations. The combination of a specific network protocol with a specific message and data format for a particular port type constitutes a reusable binding.
These definitions are kept abstract from any concrete use or instance, enabling the service definition to be reused and to act as a public interface to the service.
Though there are some important differences from datasets (see next subsection), there has now been sufficient use and exposure of WSDL to inform how to construct sufficiently abstract and reusable interfaces and bindings on the Web. Especially useful is the recent WSDL 2.0 and its draft Web Services Description Language (WSDL) Version 2.0: RDF Mapping specification .
WSDL, by its use of XML schema, is not well suited to combining vocabularies and definitions. As a supplement, various discussions on data binding, including from Microsoft (with respect to data source objects — DSOs) and from others such as the W3C on XML data bindings help provide additional perspective .
WS-Notification and Topic Maps
Another perspective on this problem comes from efforts surrounding adding topic structure to Web services. This standards effort, called WS-Notification, was an effort of the OASIS group completed in late 2006. According to its published standards :
There are some similarities to the binding mechanisms and topic relations required by UMBEL. WS-BaseNotification bears some resemblance to the binding mechanisms portions, and WS-Topics has some relation to the subject requirements. Again, however, the perspective is limited to Web services and has as its intent the general interchange of topic structures, not the use of a proxy reference set.
Why the Need for a ‘Third Way’?
So, we can see that the library community has made much progress in defining how to characterize a digital collection with regard to its source, ownership, scope and nature, while the Web applications community has made much progress with respect to service definitions and binding mechanisms. Some components have direct applicability to UMBEL.
Yet a lightweight dataset binding and subject reference structure for the Web — namely, UMBEL’s intended objective — has a number of very important differences from these other efforts. These differences prevent direct adoption of any current schema. Some of these summary distinctions as they apply to general Web data are:
- There are a variety of potential registrants for UMBEL datasets including original owners and developers (the case for the other approaches), but also third-parties and consumers
- Therefore, there is a broader spectrum of possible knowledge and ability to characterize the datasets, which suggests more flexibility, more optional items and fewer mandatory items
- Relatedly, those performing the characterizations may be untrained in formal metadata and cataloging techniques or may need to rely on automated and semi-automated characterization methods; this increases the risk of error, imprecision and uncertainty
- The relevant data may or may not be part of a dedicated resource; it may be fragmentary or embedded
- The source data resides in an extreme range in possible scale, from a single datum (granted, generally of quite limited value) to the largest of online databases
- The existence of a huge diversity of data formats and protocols (WSDL has a similar challenge)
- The need to accommodate a choice of many different data serializations (WSDL has a similar challenge)
- WSDL’s service and endpoint considerations never explicitly accounted for data federation
- The desire to get all forms into other formats or canonical forms for mashups, true federation, etc.
These differences suggest that the UMBEL ontology needs to be both broader and less prescriptive than other approaches.
Some Initial Design Considerations
We can thus combine the best transferable elements of existing schemes with the unique requirements and perspective of UMBEL. Initially, we are not adopting specific definitions or portions of possibly contributing schema. Rather, in this first cut, we are only attempting to capture the necessary scope and concepts. Later, after definitions and closer inspection of specific schema, we will refine this organization and relate it to particular namespaces.
The major “superclass” in this organization is the:
- Profile — this definition is similar to that used for the DCCAPs (indeed, even adopts the “profile” label!), and is closely allied to the idea of Description in WSDL. The Profile represents the broad metadata characteristics of a dataset including ownership, rights and access policies, and other administrative aspects. Generally, a single dataset no matter of what size or scope may have a single profile, though a federated knowledge base from multiple sources may contain multiples of these. This class does not include specific details regarding interface format or subject scope. Note the next classes are themselves subclasses of this profile
The remaining classes are subsidiary to the Profile and inherit and refer to its metadata. The first two subclasses are also largely administrative in nature:
- Annotator — this set of properties describes the annotator of the dataset metadata. UMBEL is designed to allow third-parties to describe others’ datasets with optional levels of detail. In the case of disputes, the dataset owner characterizations would hold sway, but there also may be circumstances where multiple characterizations are desirable and allowed
- Rights — these properties describe the use and access rights for the data. Of course, only the owner may set such conditions, but third parties may provide this characterization if the source site spells out these conditions.
The remaining three classes contain the real guts of the data aspects:
- Interface — the technical details of the data schema and structure within the dataset (or portions thereof) are defined in the Interface properties. (Interface is similar to the idea of the Interface and portions of the Service classes within WSDL, as with similar analogs for data exchange). The endpoints and access methods for accessing the actual data are by definition part of this Interface class. There is little or no consensus regarding how to classify and organize these details, so that it is likely much of the terminology in this area will be actively discussed and revised. See further  for one of the more comprehensive surveys
- Binding — the Binding properties set the mechanisms for relating the dataset or portions thereof to one or more subject proxies. There may be more than one binding for a given profile or different portions of a dataset
- SubjectProxy — finally, the SubjectProxy class, representing a likely extension to the core UMBEL for the enumeration of the subject proxies, becomes the linkage to the subject coverage of the datasets.
These classes have a hierarchical relationship similar to the following, with multiple Interface, Binding and SubjectProxy mappings allowable for any given Profile:
Profile -------- Annotator
Interface --------- Binding --------- SubjectProxy
Presented below in simple outline form only are these first-proposed classes, and the associated properties and instances of those properties informing the development of the ‘core’ UMBEL ontology. Some definitions of classes are also shown:
|Profile||the broad metadata characteristics of a dataset including ownership, rights and access policies, and other administrative aspects; generally only one per dataset, though there will be multiples in a repository|
|Annotator||||description of the entity that has provided the current Profile description (may be third parties; but deferrence to owner when there are differences)|
|Binding||the linkage made between the set or subsets of data within the datasets to the actual subject proxy(ies); may be multiples for a given dataset|
|about||isAbout||[10,24]||cross-reference to the actual subject proxy IDs; may be multiples|
|Interface||||the technical characteristics of the dataset that provide the essential information for enabling retrieval and interoperability; analogous to Interface in WSDL|
|Rights||||various rights and restrictions to accessing, using or reproducing the subject data|
|SubjectProxy||||a preferred label that acts as a proxy to the topic concept(s) for which the given dataset content is bound; may be multiples, and backed with ‘synset’ synonyms|
General table notes are provided under the endnotes .
Please note that the specific subject proxies and their defining classes and properties is being handled in a separate document. This outline, as being revised, is informing the first N3 version of the ‘core’ UMBEL ontology.
Revisiting the ‘Lightweight’ Designation
We can thus see that there is only minimal semantics in the potential linkage between UMBEL datasets.
Under this model, UMBEL resides right at the interface between Levels 2 and 3, where syntactic interoperability is achieved but with only limited semantic understanding. In fact, this represents a clear analog to AI3‘s discussion of the structured Web, which is very much related to the syntactic level with the negotiation of semantics the next challenge.
 The UMBEL ontology has two parts. The first ‘core’ part is a flat listing, or pool, of concrete subject topics that are the proxy binding points for external data sets. The second ‘unofficial’ part is a reference look-up structure of hierarchical and interlinked subject relationships.
 The “Dublin” in the name refers to Dublin, Ohio, where the work originated from an invitational workshop hosted in 1995 by the Online Computer Library Center (OCLC), a library consortium that has its headquarters there. The “Core” refers to the fact that the metadata element set is a basic but expandable “core” list, used is a similar way to the UMBEL ‘core’.
 There are several large registries of OAI-compliant repositories: The OAI registry at University of Illinois at Urbana-Champaign, The Open Archives list of registered OAI repositories, The Celestial OAI registry, Eprint’s Institutional Archives Registry, Openarchives.eu The European Guide to OAI-PMH compliant repositories in the world, and the ScientificCommons.org A worldwide service and registry.
 The Standards Committee BB (Task Group 2): Collection & Service Descriptions, NISO Z39.91-200x, Collection Description Specification, November 2005. It also specifies an XML binding for serializing such descriptions for interchange between applications.
 Xiaorong Xiang and Eric Lease Morgan, “Exploiting ‘Light-weight’ Protocols and Open Source Tools to Implement Digital Library Collections and Services,’ D-Lib Magazine 11(10), October 2005. See http://www.dlib.org/dlib/october05/morgan/10morgan.html. Also see its MyLibrary reference for examples of facets applied to collections.
 Jacek Kopecký, Ed., Web Services Description Language (WSDL) Version 2.0: RDF Mapping, W3C Working Group Note, 26 June 2007. See http://www.w3.org/TR/wsdl20-rdf.
 Data binding from the Microsoft perspective is described at http://msdn2.microsoft.com/en-us/library/ms531387.aspx. The W3C’s perspective on XML data binding is described, for example, in Paul Downey, ed., XML Schema Patterns for Common Data Structures, see http://www.w3.org/2005/07/xml-schema-patterns.html more goes here.
 Also, there is an entire corpus related to topic maps. In specific reference to Web services, there is the so-called WS-Topics, Web Services Topics 1.3 (WS-Topics) OASIS Standard 1 October 2006; see http://docs.oasis-open.org/wsn/wsn-ws_topics-1.3-spec-os.htm more here. This is used in conjunction with WS-Notification. Also WS-BaseNotification Web Services Base Notification 1.3 (WS-BaseNotification) OASIS Standard 1 October 2006 see http://docs.oasis-open.org/wsn/wsn-ws_base_notification-1.3-spec-os.htm. PDFs of these documents are also available.
 For an intial introduction with a focus on Xcerpt, see François Bry, Tim Furche, and Benedikt Linse, “Let’s Mix It: Versatile Access to Web Data in Xcerpt,” in Proceedings of 3rd Workshop on Information Integration on the Web (IIWeb 2006), Edinburgh, Scotland, 22nd May 2006; also as REWERSE-RP-2006-034, see http://rewerse.net/publications/download/REWERSE-RP-2006-034.pdf. For a more detailed treatment, see T. Furche, F. Bry, S. Schaffert, R. Orsini, I. Horrocks, M. Krauss, and O. Bolzer. Survey over Existing Query and Transformation Languages. Deliverable I4-D1a Revision 2.0, REWERSE, 225 pp., April 2006. See http://rewerse.net/deliverables/m24/i4-d9a.pdf.
 hasPredicate is actually my preferred format; thoughts?
 Standard ISO languages
 extendable base listing including NISO, IESR, DLSR, DCMI, etc.; need completion
 uses the idea of an extended base class per XML Schema; the enumerated listings thus only need be partially complete; see below for most listings
 number of records or TBD metric?
 possibly unnecessary; can not see enumeration of profile Types
 related to idea of Fresnel or Zitgist “preferred” viewing format or XSLT-type stylesheet
 should there be other types than FOAF (what of other formal listings or organizations v. individuals?)?
 not sure how to do this; need a x-ref between two class categories (e.g., Binding <-> Interface, Binding <-> SubjectProxy)
 are there patterns for bindingTypes or a likely enumerated listing?
 need an enumerated list (?) going from individual annotation / metadata (a la RDFa) to complete dataset
 perhaps not best name; related to Interface and Services in WSDL, could also be called Construct, Composition, others
 see possible dataFormalisms below; needs completion; could be named differently (schema, format, etc.)
 see possible endpointTypes below
other Query formats ???
 patterns are fairly prominent in WSDL and XML Schema; applicable here?
 need to discuss
 see possible queryLanguages below; needs completion
IR (standard text search)
 GRDDL, RDFizers, and various converters/translators; likely needs an expandable, enumerated list
 unlike the other subClasses, this is closely aligned with the standard Profile metadata
 likely a separate namespace (e.b., ‘umbels’) that will contain additional information such as synsets, etc. See text.
 SKOS concept
 need to check on overlap/replacement/use with SKOS subjectIndicator property
 should names or IDs be used for subjectProxys? (IDs have the advantage of changing labels and use in other languages)
 seems similar or identical to SKOS scopeNote
 see possible annotatorTypes below; needs completion
 similar to dc:language, but must be kept separate from language of the resource (its metadata characterization) from the actual subject proxies
 A Tolk, S.Y. Diallo, C.D. Turnitsa and L.S. Winters LS, “Composable M&S Web Services for Net-centric Applications,” Journal for Defense Modeling & Simulation (JDMS), Volume 3 Number 1, pp. 27-44, January 2006.