Posted: September 16, 2007

Sweet Tools Listing

AI3's Sweet Tools Listing Updated to Version 10

This AI3 blog maintains Sweet Tools, the largest listing of about 800 semantic Web and -related tools available. Most are open source. Click here to see the current listing!

AI3's listing of semantic Web and -related tools has just been updated to version 10. This version adds 36 new tools since the last update on June 19, bringing the new total to 578 tools.

This version 10 update of Sweet Tools also includes an upgrade to version 2 of the lightweight Exhibit display (thanks again, MIT's Simile program and David Huynh, plus congratulations on your Ph.D., David!) and is separately provided as a simple table for quick download and copying.

Background on prior listings and earlier statistics may be found on these previous posts:

With interim updates periodically over that period.

Because comments have expired on prior posts, this entry is now the new location for suggesting a new tool. Simply provide your information in the comments section, and the tool will be included in the next update.

Posted by AI3's author, Mike Bergman Posted on September 16, 2007 at 11:08 pm in Uncategorized | Comments (7)
The URI link reference to this post is: http://www.mkbergman.com/403/new-release-577-semantic-web-and-related-tools/
The URI to trackback this post is: http://www.mkbergman.com/403/new-release-577-semantic-web-and-related-tools/trackback/
Posted: September 10, 2007

Astoria is Whistling Past the Graveyard to Irrelevance

I was pleased to see in my blog reader this morning a post from the Microsoft Astoria team on anticipated data formats for its pending formal release. I have been working on modeling Web data models and hoped to see some insight in the piece.

As the project team states,

The goal of Astoria is to make data available to loosely coupled systems for querying and manipulation. In order to do that we need to use protocols that define the interaction model between the producer and the consumer of that data, and of course we have to serialize the data in some form that all the involved parties understand. So protocols and formats are an important topic in our design process.

With that said, the team announced that the first formal Astoria release will support these three formats (with the single HTTP protocol):

  • ATOM / APP
  • JSON, the JavaScript Object Notation, and
  • Web3S, a Microsoft marketing wonder that as far as I know is only used by the MS Live group.

The latter is a strange mapping of a tree data model to the record base of Astoria, in the process also abandoning the straight XML implementation of earlier versions.

Also notable for its absence is RDF (Resource Description Framework). The defensive response of the Astoria team to this absence speaks for itself:

The May [announcement on Astoria] included support for RDF. While we got positive comments about the fact we supported it, we didn't see any early user actually using it and we haven't seen a particular popular scenario where RDF was a must-have. So we are thinking that we may not include RDF as a format in the first release of Astoria, and focus on the other 3 formats (which are already a bunch from the development/testing perspective).

My personal take is that while I understand how RDF fits in the picture of the semantic web and related tools, the semantic web goes well beyond a particular format. The point is to have well-defined, derivable semantics from services. I believe that Astoria does this independently of the format being used. That, combined with the fact that we didn't see a strong demand for it, put RDF lower in our priority lists for formats.

There was a funny Glenn Ford movie from 1964 called “Advance to the Rear”. The problem is, this is not a movie, but the largest software company in the world taking two steps back for each one forward. Congratulations on alienating still further many thought leaders on the Web.

This is yet another stunning and lame attempt by Microsoft to replace open standards with proprietary ones. Get a clue, Redmond!

Posted by AI3's author, Mike Bergman Posted on September 10, 2007 at 9:52 am in Uncategorized | Comments (2)
The URI link reference to this post is: http://www.mkbergman.com/402/typical-microsoft-bullt/
The URI to trackback this post is: http://www.mkbergman.com/402/typical-microsoft-bullt/trackback/
Posted: August 18, 2007

RDF123

UMBC’s Ebiquity Program Creates Another Great Tool

In a strange coincidence, I encountered a new project called RDF123 from UMBC’s Ebiquity program a few days back while researching ways to more easily create RDF specifications. (I was looking in the context of easier ways to test out variations of the UMBEL ontology.) I put it on my to-do list for testing, use and a possible review.

Then, this morning, I saw that Tim Finin had posted up a more formal announcement of the project, including a demo of converting my own Sweet Tools to RDF using the very same tool! Thanks, Tim, and also for accelerating my attention on this. Folks, we have another winner!

RDF123, developed by Lushan Han with funding from NSF [1], improves upon earlier efforts from the University of Maryland’s Mindswap lab, which had developed Excel2RDF and the more flexible ConvertToRDF a number of years back. Unlike RDF123, these other tools were limited to creating an instance of a given class for each row in the spreadsheet. RDF123, on the other hand, allows users to define mappings to arbitrary graphs and different templates by row.

It is curious that so little work has been done on spreadsheets as an input and specification mechanism for RDF, given the huge use and ubiquity (pun on purpose!) of the format. According to the Ebiquity technical report [1], TopBraid Composer has a spreadsheet utility (one that I have not tested), and there is a new plug-in for Protégé version 4.0 from Jay Kola, with support for imports of OWL and RDF Schema. The plug-in was also on my to-do list for testing, though it requires upgrading to the beta version of Protégé.

I have also been working with the Linking Open Data group at the W3C regarding converting the Sweet Tools listing to RDF, and have indeed had an RDF/XML listing available for quite some time [2]. You may want to compare this version with the N3 version produced by RDF123 [3]. The specification for creating this RDF123 file, also in N3 format, is really quite simple:

@prefix d: <http://spreadsheets.google.com/feeds/list/o15870720903820235800 etc., etc.> .
@prefix mkbm: <http://www.mkbergman.com/> .
@prefix exhibit: <http://simile.mit.edu/2006/11/exhibit#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix : <#> .
@prefix e: <http://spreadsheets.google.com/feeds/list/o15870720903820235800 etc., etc.> .
<Ex:e+$1>
  a exhibit:Item ;
  rdfs:label "Ex:$1" ;
  exhibit:origin "Ex:mkbm+'#'+$1^^string" ;
  d:Category "Ex:$5" ;
  d:Existing "Ex:$7" ;
  d:FOSS "Ex:$4" ;
  d:Language "Ex:$6" ;
  d:Posted "Ex:$8" ;
  d:URL "Ex:$2^^string" ;
  d:Updated "Ex:$9" ;
  d:description "Ex:$3" ;
  d:thumbnail "Ex:@If($10='';'';mkbm+@Substr($10,12,@Sub(@Length($10),4)))^^string" .

The UMBC approach is somewhat like GRDDL for converting other formats to RDF, but is more direct because it bypasses the need to first convert the spreadsheet to XML and then transform it with XSLT. This means updates can be automatic, and the difficulty of writing XSLT is replaced by a simple notation, as above, for properly replacing label names.
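To make the row-by-row idea concrete, here is a minimal Python sketch of how a per-row template with $n column references might be expanded into N3-style statements. It is an illustration only, not RDF123's implementation or its expression language: the names TEMPLATE, expand_refs, escape_literal and row_to_n3, the example.org subject base and the sample row are all hypothetical, and only plain $n substitution is handled (no @If, @Substr or concatenation).

```python
import re

# Hypothetical, simplified template: predicate -> value pattern, where $n
# refers to the n-th spreadsheet column (1-based). Loosely modeled on the
# map shown above; this is NOT the RDF123 expression language.
TEMPLATE = {
    "rdfs:label": "$1",
    "d:URL": "$2",
    "d:description": "$3",
    "d:FOSS": "$4",
}


def expand_refs(pattern, row):
    """Replace each $n in pattern with the n-th cell of row (1-based)."""
    def lookup(match):
        index = int(match.group(1))
        return row[index - 1]
    return re.sub(r"\$(\d+)", lookup, pattern)


def escape_literal(text):
    """Escape backslashes and double quotes for a quoted literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def row_to_n3(row, subject_base="http://example.org/tools#"):
    """Render one spreadsheet row as a single N3-style statement block."""
    subject = "<" + subject_base + row[0].replace(" ", "_") + ">"
    lines = [subject, "  a exhibit:Item ;"]
    items = list(TEMPLATE.items())
    for position, (predicate, pattern) in enumerate(items):
        value = escape_literal(expand_refs(pattern, row))
        terminator = " ." if position == len(items) - 1 else " ;"
        lines.append('  %s "%s"%s' % (predicate, value, terminator))
    return "\n".join(lines)


if __name__ == "__main__":
    sample_row = ["Example Tool", "http://example.org/tool",
                  "A made-up entry", "Yes"]
    print(row_to_n3(sample_row))
```

Because the template is applied once per row, re-running it over an updated spreadsheet regenerates the output without any intermediate XML or XSLT step, which is the point made above.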

RDF123 has the option of two interfaces in its four versions. The first interface, used by the application versions, is a graphical interface that allows users to create their mapping in an intuitive manner. The second is a Web service that takes as input a combined URL string to a Google spreadsheet or CSV file and an RDF123 map and output specification [3].

The four versions of the software are the:

RDF123 is a tremendous addition to the RDF tools base, and one with promise for further development for easy use by standard users (non-developers). Thanks NSF, UMBC and Lushan!

And, Thanks Josh for the Census RDF

Along with last week’s tremendous announcement by Josh Tauberer for making 2000 US Census data available as nearly 1 billion RDF triples, this dog week of August in fact has proven to be a stellar one on the RDF front! These two events should help promote an explosion of RDF in numeric data.


[1] Lushan Han, Tim Finin, Cynthia Parr, Joel Sachs, and Anupam Joshi, RDF123: A Mechanism to Translate Spreadsheets to RDF, Technical Report from the Computer Science and Electrical Engineering Dept., University of Maryland, Baltimore County, August 2007, 17 pp. See http://ebiquity.umbc.edu/paper/html/id/368/RDF123-a-mechanism-to-translate-spreadsheets-to-RDF; also, a PDF version of the report is available. The effort was supported by a grant from the National Science Foundation.

[2] This version was created using Exhibit, the lightweight data publishing framework for Sweet Tools. It allows RDF/XML to be copied from the online Exhibit, though it has a few encoding issues, which required manual adjustments to produce valid RDF/XML. A better RDF export service is apparently in the works for Exhibit version 2.0, slated for release soon.

[3] N3 stands for Notation 3 and is a more easily read serialization of RDF. For direct comparison with my native RDF/XML, you can convert the N3 file at RDFabout.com. Alternatively, you can directly create the RDF/XML output with the slightly different instructions to the online service of: http://rdf123.umbc.edu/server/?src=http://spreadsheets.google.com/pub?key=pGFSSSZMgQNxIJUCX6VO3Ww&map=http://rdf123.umbc.edu/map/demo1.N3&out=XML; note the last statement changing the output format from N3 to XML. Also note the UMBC service address, followed by the spreadsheet address, followed by the specification address (the listing of which is shown above), then ending with the output form. This RDF/XML output validates with the W3C’s RDF validation service, unlike the original RDF/XML created from Sweet Tools, which had some encoding issues that required manual fixing.
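The way the request URL in note [3] is put together can be sketched in Python as follows. The service address and the parameter names src, map and out are taken from that example URL; the function names build_request_url and split_request_url are my own, and whether the service accepts percent-encoded parameter values (which urlencode produces, unlike the raw URL in the note) is an assumption.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Service address as it appears in the example URL of note [3].
SERVICE = "http://rdf123.umbc.edu/server/"


def build_request_url(spreadsheet_url, map_url, output="XML"):
    """Compose a request URL: service, then spreadsheet, map and output."""
    query = urlencode([
        ("src", spreadsheet_url),  # Google spreadsheet or CSV file
        ("map", map_url),          # RDF123 map (specification) file
        ("out", output),           # output form; the note mentions N3 and XML
    ])
    return SERVICE + "?" + query


def split_request_url(url):
    """Return (service address, {parameter: value}) for a request URL."""
    parts = urlsplit(url)
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    service = parts.scheme + "://" + parts.netloc + parts.path
    return service, params


if __name__ == "__main__":
    url = build_request_url(
        "http://spreadsheets.google.com/pub?key=pGFSSSZMgQNxIJUCX6VO3Ww",
        "http://rdf123.umbc.edu/map/demo1.N3",
    )
    print(url)
    print(split_request_url(url))
```

Switching the output form is then a matter of passing output="N3" instead of the default.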

Posted by AI3's author, Mike Bergman Posted on August 18, 2007 at 1:29 pm in Uncategorized | Comments (1)
The URI link reference to this post is: http://www.mkbergman.com/394/rdf123-makes-generating-flexible-rdf-a-snap/
The URI to trackback this post is: http://www.mkbergman.com/394/rdf123-makes-generating-flexible-rdf-a-snap/trackback/
Posted: August 8, 2007

Donald Knuth's Road Sign

Structured Data and UMBEL Will Benefit from a Standard Registration Format

One implication of the structured Web is that, with the rapid proliferation of data, it becomes ever harder to find what is relevant. Addressing that problem is what stimulated the initiation of the UMBEL (Upper Mapping and Binding Exchange Layer) project.

The original specification for UMBEL recognized the need for a reference set of subject “proxies” to help describe what each data set “was about” as well as the need for a variety of binding mechanisms depending on scope and data structure of the source dataset.

At its most general level, the intent of UMBEL is to provide four components in its ‘core’ ontology [1]:

  1. A set of reference subject “proxies” and properties and relations around them
  2. A means of binding the ontologies, classes or subsets of data within each contributing dataset to the ‘core’ UMBEL ontology and those subject proxies
  3. Characterizing a given dataset via metadata, and
  4. Describing access methods and endpoints for getting at that data.

The first component on subject proxies is largely left to another discussion. The topic of this posting is mostly related to the latter three components of dataset binding and registration mechanisms.

How such road signs might work, the contributions of possible analogs, their differences in providing solutions in and of themselves, and a first-cut outline of the resulting ‘core’ UMBEL ontology are described below.

The Why and How of These Road Signs

‘Road signs’ are simply a shorthand for how to find stuff. The normative case is to have sufficient characterization of datasets such that a central registry can aid their discovery, look-up and productive access and use. Yet registration can be an onerous task, and one not generally easily or willingly undertaken by publishers or providers.

These challenges lead to two important design considerations. First, only minimal characterization should be required for initially registering a dataset. The remaining characteristics should be optional. The incentive over time for completing such optional fields is the signal they send to consumers that fully characterized datasets may be more dependable or authoritative. It is possible, for example, to envision external qualification rules or routines that “score” competing datasets providing similar information based on the completeness of dataset characterization.
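One such scoring routine might look like the following Python sketch. It is purely illustrative: the split between required and optional fields is an assumption of mine (no such split has been decided), the field names are borrowed from the draft Profile outline later in this post, and the function names are hypothetical.

```python
# Hypothetical field lists: which fields are mandatory versus optional is
# an assumption made for illustration, not part of any specification.
REQUIRED_FIELDS = ("title", "location")
OPTIONAL_FIELDS = ("description", "owner", "language",
                   "size", "version", "sitemap")


def is_registrable(profile):
    """A dataset can be registered once the minimal fields are present."""
    return all(profile.get(field) for field in REQUIRED_FIELDS)


def completeness(profile):
    """Fraction of optional fields that have been filled in (0.0 to 1.0)."""
    filled = sum(1 for field in OPTIONAL_FIELDS if profile.get(field))
    return filled / len(OPTIONAL_FIELDS)


def rank_by_completeness(profiles):
    """Order registrable profiles from most to least fully characterized."""
    candidates = [p for p in profiles if is_registrable(p)]
    return sorted(candidates, key=completeness, reverse=True)


if __name__ == "__main__":
    datasets = [
        {"title": "Sparse", "location": "http://example.org/sparse"},
        {"title": "Rich", "location": "http://example.org/rich",
         "description": "A made-up dataset", "owner": "Example Org",
         "language": "en"},
    ]
    for profile in rank_by_completeness(datasets):
        print(profile["title"], completeness(profile))
```

A real qualification rule would likely weight fields differently; an unweighted fraction is simply the easiest form to reason about.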

Second, any party should be allowed to register and characterize a dataset. There may be motivations by non-publishers to do so, for altruistic or other reasons. However, in the case of disputes over the accuracy of characterization, the owner or publisher should have final say. Another open question is whether competing characterizations or different registrations should be allowed for the same dataset.

These considerations are further complicated by the range of scope, scale, data content and formalism found on the real-world Web.

In the spirit of not re-inventing the wheel, we began a process to discover what other communities have done faced with similar problems. The two closest analogs are, firstly, the library community and its need to describe digital archives and collections and, secondly, the general approaches devoted to Web services, including its dedicated language WSDL (Web Services Description Language).

Digital Collections and Archives

Librarians and information architects have been active for at least the past decade in efforts to describe and relate digital collections to one another. These efforts have been geared to search, general look-up and interlibrary loans and sharing. This community of practice, while embracing a variety of somewhat competing and overlapping schemes, has also (from an outsider’s viewpoint) come up with a general consensus view as to how to describe these archives and collections.

(The library community still tends to use the terminology of ‘metadata’ and descriptors, whereas the ontology and RDF communities tend to speak more of classes, properties and instances. However, the net intent and outcome still appears much the same.)

A common reference point to these schemes is the Dublin Core Collection Application Profile (DCCAP), a specification of how metadata terms from the Dublin Core metadata initiative (DCMI) and other vocabularies can be used to construct a description of a collection in accordance with the DCMI Abstract Model. There is also a more easily read summary of the DCCAP [2]. Though, again, there are differences in terminology, the presentation of this scheme is very much in keeping with the format of a W3C specification (such as for WSDL, see below).

One of the first widely embraced efforts of the community is the Open Archives Initiative – Protocol for Metadata Harvesting (OAI-PMH), begun in the late 1990s. OAI-PMH provides specifications for both data providers and service providers in how to assign and describe collection metadata. The OAI Protocol has become widely adopted by many digital libraries, institutional repositories, and digital archives, with total sources registered numbering into the thousands [3]. These large institutional repositories are also increasingly being indexed by large search engines (such as Google Scholar).

The National Information Standards Organization (NISO) began a MetaSearch Initiative (NISO-MS) that resulted in the development of a collection description schema. Though NISO recently reorganized its content and collection activities, the draft NISO Collection Description Specification [4] remains a readable overall reference for these initiatives. Like related initiatives, the draft also uses the DCCAP format.

A similar effort was initiated in the United Kingdom called the Information Environment Service Registry. IESR was designed to make it easier for other applications to discover and use materials which will help their users’ learning, teaching and research. IESR’s various terms (namespaces, classes and properties) and controlled vocabularies are very helpful to UMBEL.

Another example is the Ockham Digital Library Service Registry (DLSR), which enables service-based digital libraries, funded by the National Science Digital Library (NSDL) initiative, to interoperate. Efforts such as this, in turn, have led to interest in exploiting “light-weight” protocols and open source tools in the community [5]. For example, there is an interesting discussion of tools to Implement Digital Library Collections and Services from D-Lib Magazine [5].

These efforts, among many across the digital library community including the related National Digital Information Infrastructure and Preservation Program (NDIIPP) at the Library of Congress, represent tremendous efforts to describe digital collections. Clearly, subsets of this learning can be applied directly to UMBEL in relation to registry and dataset metadata.

Almost all of these schemes use XML data serializations. Our investigations to date have not turned up any RDF representations, though they are sure to come. Fortunately, all of the DCCAP-based efforts have an RDF-like design and most properties have URIs and defined namespaces.

Web Services and Bindings

The Web Services Description Language (WSDL) is an XML-based language for describing how to communicate, and therefore interoperate, with Web services. WSDL defines a service as accessible Internet endpoints (or ports), supported operations and related messages. Version 1.1 was submitted to the W3C as a Note in March 2001; version 2.0 became a W3C Recommendation in June 2007.

A service is a definition of what kinds of operations can be performed and the messages involved. A port is defined by associating a network address with a reusable binding, with a collection of such ports constituting the service. Messages are abstract descriptions of the data being exchanged, and port types are abstract collections of supported operations. The combination of a specific network protocol with a specific message and data format for a particular port type constitutes a reusable binding.

These definitions are kept abstract from any concrete use or instance, enabling the service definition to be reused and to act as a public interface to the service.
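The relationships among these WSDL terms can be sketched as plain Python data classes. This is a reading aid under my own naming, not the WSDL schema itself: the class and attribute names, the example protocols and the example.org addresses are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Message:
    """Abstract description of the data being exchanged."""
    name: str


@dataclass
class Operation:
    """An abstract action, described by its input and output messages."""
    name: str
    input_message: Message
    output_message: Message


@dataclass
class PortType:
    """Abstract collection of supported operations."""
    name: str
    operations: List[Operation] = field(default_factory=list)


@dataclass
class Binding:
    """A specific protocol and data format chosen for one port type."""
    name: str
    port_type: PortType
    protocol: str
    data_format: str


@dataclass
class Port:
    """A network address associated with a reusable binding."""
    address: str
    binding: Binding


@dataclass
class Service:
    """A collection of ports."""
    name: str
    ports: List[Port] = field(default_factory=list)

    def addresses_for(self, port_type_name):
        """List the addresses at which a given port type is reachable."""
        return [port.address for port in self.ports
                if port.binding.port_type.name == port_type_name]


if __name__ == "__main__":
    request = Message("GetItemRequest")
    response = Message("GetItemResponse")
    catalog = PortType("Catalog", [Operation("getItem", request, response)])
    soap = Binding("CatalogSoap", catalog, "SOAP over HTTP", "XML")
    plain = Binding("CatalogHttp", catalog, "HTTP GET", "XML")
    service = Service("CatalogService", [
        Port("http://example.org/soap", soap),
        Port("http://example.org/http", plain),
    ])
    print(service.addresses_for("Catalog"))
```

The same abstract port type is bound twice here, which is the kind of reuse the abstraction is meant to allow.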

Though there are some important differences from datasets (see next subsection), there has now been sufficient use and exposure of WSDL to inform how to construct sufficiently abstract and reusable interfaces and bindings on the Web. Especially useful is the recent WSDL 2.0 and its draft Web Services Description Language (WSDL) Version 2.0: RDF Mapping specification [6].

WSDL, by its use of XML schema, is not well suited to combining vocabularies and definitions. As a supplement, various discussions on data binding, including from Microsoft (with respect to data source objects — DSOs) and from others such as the W3C on XML data bindings help provide additional perspective [7].

WS-Notification and Topic Maps

Another perspective on this problem comes from efforts surrounding adding topic structure to Web services. This standards effort, called WS-Notification, was an effort of the OASIS group completed in late 2006. According to its published standards [8]:

WS-Notification is a family of related specifications that define a standard Web services approach to notification using a topic-based publish/subscribe pattern. It provides standard message exchanges to be implemented by service providers that wish to participate in Notifications, standard message exchanges for a notification broker service provider (allowing publication of messages from entities that are not themselves service providers), operational requirements expected of service providers and requestors that participate in notifications, and an XML model that describes topics. The WS-Notification family of documents includes three normative specifications: WS-BaseNotification, WS-BrokeredNotification, and WS-Topics.

There are some similarities to the binding mechanisms and topic relations required by UMBEL. WS-BaseNotification bears some resemblance to the binding mechanisms portions, and WS-Topics has some relation to the subject requirements. Again, however, the perspective is limited to Web services and has as its intent the general interchange of topic structures, not the use of a proxy reference set.

Why the Need for a ‘Third Way’?

So, we can see that the library community has made much progress in defining how to characterize a digital collection with regard to its source, ownership, scope and nature, while the Web applications community has made much progress with respect to service definitions and binding mechanisms. Some components have direct applicability to UMBEL.

Yet a lightweight dataset binding and subject reference structure for the Web — namely, UMBEL’s intended objective — has a number of very important differences from these other efforts. These differences prevent direct adoption of any current schema. Some of these summary distinctions as they apply to general Web data are:

  • There are a variety of potential registrants for UMBEL datasets including original owners and developers (the case for the other approaches), but also third-parties and consumers
  • Therefore, there is a broader spectrum of possible knowledge and ability to characterize the datasets, which suggests more flexibility, more optional items and fewer mandatory items
  • Relatedly, those performing the characterizations may be untrained in formal metadata and cataloging techniques or may need to rely on automated and semi-automated characterization methods; this increases the risk of error, imprecision and uncertainty
  • The relevant data may or may not be part of a dedicated resource; it may be fragmentary or embedded
  • The source data spans an extreme range of possible scale, from a single datum (granted, generally of quite limited value) to the largest of online databases
  • The existence of a huge diversity of data formats and protocols (WSDL has a similar challenge)
  • The need to accommodate a choice of many different data serializations (WSDL has a similar challenge)
  • WSDL’s service and endpoint considerations never explicitly accounted for data federation
  • The desire to get all forms into other formats or canonical forms for mashups, true federation, etc.

These differences suggest that the UMBEL ontology needs to be both broader and less prescriptive than other approaches.

Some Initial Design Considerations

We can thus combine the best transferable elements of existing schemes with the unique requirements and perspective of UMBEL. Initially, we are not adopting specific definitions or portions of possibly contributing schema. Rather, in this first cut, we are only attempting to capture the necessary scope and concepts. Later, after definitions and closer inspection of specific schema, we will refine this organization and relate it to particular namespaces.

The major “superclass” in this organization is the:

  • Profile — this definition is similar to that used for the DCCAPs (indeed, even adopts the “profile” label!), and is closely allied to the idea of Description in WSDL. The Profile represents the broad metadata characteristics of a dataset including ownership, rights and access policies, and other administrative aspects. Generally, a single dataset, no matter what its size or scope, has a single profile, though a federated knowledge base from multiple sources may contain multiples of these. This class does not include specific details regarding interface format or subject scope. Note that the next classes are themselves subclasses of this profile

The remaining classes are subsidiary to the Profile and inherit and refer to its metadata. The first two subclasses are also largely administrative in nature:

  • Annotator — this set of properties describes the annotator of the dataset metadata. UMBEL is designed to allow third-parties to describe others’ datasets with optional levels of detail. In the case of disputes, the dataset owner characterizations would hold sway, but there also may be circumstances where multiple characterizations are desirable and allowed
  • Rights — these properties describe the use and access rights for the data. Of course, only the owner may set such conditions, but third parties may provide this characterization if the source site spells out these conditions.

The remaining three classes contain the real guts of the data aspects:

  • Interface — the technical details of the data schema and structure within the dataset (or portions thereof) are defined in the Interface properties. (Interface is similar to the idea of the Interface and portions of the Service classes within WSDL, as with similar analogs for data exchange). The endpoints and access methods for accessing the actual data are by definition part of this Interface class. There is little or no consensus regarding how to classify and organize these details, so that it is likely much of the terminology in this area will be actively discussed and revised. See further [9] for one of the more comprehensive surveys
  • Binding — the Binding properties set the mechanisms for relating the dataset or portions thereof to one or more subject proxies. There may be more than one binding for a given profile or different portions of a dataset
  • SubjectProxy — finally, the SubjectProxy class, representing a likely extension to the core UMBEL for the enumeration of the subject proxies, becomes the linkage to the subject coverage of the datasets.

These classes have a hierarchical relationship similar to the following, with multiple Interface, Binding and SubjectProxy mappings allowable for any given Profile:

Profile -------- Annotator
|
Rights
|
Interface --------- Binding --------- SubjectProxy
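As a rough illustration of this structure, the following Python sketch nests the classes so that one Profile can carry multiple Interface, Binding and SubjectProxy entries. It uses composition purely for convenience, whereas the outline treats these as subclasses of Profile; the attribute names are adapted from the draft property names in the outline, and all values are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SubjectProxy:
    proxy_id: str
    pref_label: str


@dataclass
class Binding:
    binding_name: str
    about: List[SubjectProxy] = field(default_factory=list)


@dataclass
class Interface:
    endpoint_location: str
    serialization: str
    bindings: List[Binding] = field(default_factory=list)


@dataclass
class Annotator:
    annotator_name: str
    annotator_type: str


@dataclass
class Rights:
    license: str


@dataclass
class Profile:
    title: str
    annotator: Optional[Annotator] = None
    rights: Optional[Rights] = None
    interfaces: List[Interface] = field(default_factory=list)

    def subject_labels(self):
        """Collect the distinct preferred labels reachable from this profile."""
        labels = []
        for interface in self.interfaces:
            for binding in interface.bindings:
                for proxy in binding.about:
                    if proxy.pref_label not in labels:
                        labels.append(proxy.pref_label)
        return labels


if __name__ == "__main__":
    tools = SubjectProxy("proxy-1", "Software tools")
    web = SubjectProxy("proxy-2", "Semantic Web")
    profile = Profile(
        title="Example dataset",
        annotator=Annotator("A third party", "non-owner"),
        rights=Rights("Example license"),
        interfaces=[
            Interface("http://example.org/data.rdf", "RDF/XML",
                      [Binding("whole-dataset", [tools, web])]),
            Interface("http://example.org/data.n3", "N3",
                      [Binding("tools-only", [tools])]),
        ],
    )
    print(profile.subject_labels())
```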

Presented below in simple outline form only are these first-proposed classes, and the associated properties and instances of those properties informing the development of the ‘core’ UMBEL ontology. Some definitions of classes are also shown:

| Class | subClass [1] | Property | asPredicate [2] | Note | Definition |
|---|---|---|---|---|---|
| Profile | | | | | the broad metadata characteristics of a dataset including ownership, rights and access policies, and other administrative aspects; generally only one per dataset, though there will be multiples in a repository |
| | | abstract | hasAbstract | | |
| | | alternativeTitle | hasAlternativeTitle | | |
| | | collection | isPartofCollection | | |
| | | conceptScheme | hasConceptScheme | [23] | |
| | | crawling | hasCrawlingPolicy | | |
| | | dateSubmitted | hasSubmittedDate | | |
| | | description | hasDescription | | |
| | | language | hasLanguage | [3] | |
| | | location | hasLocation | | |
| | | modified | wasModifiedOn | | |
| | | namespace | hasNamespace | | |
| | | ontology | hasOntology | | |
| | | owner | hasOwner | | |
| | | registry | isListedOnRegistry | [4,5] | |
| | | sitemap | hasSitemap | | |
| | | size | hasSize | [6] | |
| | | title | hasTitle | | |
| | | type | isOfType | [7] | |
| | | version | hasVersion | | |
| | | view | isBestViewedUsing | [8] | |
| | Annotator | | | [21] | description of the entity that has provided the current Profile description (may be third parties; but deference to owner when there are differences) |
| | | annotationDate | hasAnnotationDate | | |
| | | annotationNote | hasAnnotationNote | | |
| | | annotatorLocator | hasAnnotatorLocator | [9] | |
| | | annotatorName | hasAnnotatorName | | |
| | | annotatorType | isAnnotatorType | [5,27] | |
| | Binding | | | | the linkage made between the set or subsets of data within the datasets to the actual subject proxy(ies); may be multiples for a given dataset |
| | | about | isAbout | [10,24] | cross-reference to the actual subject proxy IDs; may be multiples |
| | | bindingName | hasBindingName | [10] | |
| | | bindingScope | hasBindingScope | [5,12] | |
| | | bindingType | hasBindingType | [5,11] | |
| | Interface | | | [13] | the technical characteristics of the dataset that provide the essential information for enabling retrieval and interoperability; analogous to Interface in WSDL |
| | | bindingName | hasBindingName | [10] | |
| | | dataFormalism | hasDataFormalism | [5,14] | |
| | | endpointLocation | hasEndpointLocation | | |
| | | endpointType | hasEndpointType | [5,15] | |
| | | pattern | usesPattern | [16] | |
| | | pingType | hasPingType | [17] | |
| | | queryLanguage | usesQueryLanguage | [5,18] | |
| | | serialization | hasSerialization | [5,19] | |
| | | translator | usesTranslator | [20] | |
| | Rights | | | [21] | various rights and restrictions to accessing, using or reproducing the subject data |
| | | accessRight | hasAccessRight | | |
| | | copyright | hasCopyright | | |
| | | license | hasLicense | | |
| | | rightsNote | hasRightsNote | | |
| | SubjectProxy | | | [22] | a preferred label that acts as a proxy to the topic concept(s) for which the given dataset content is bound; may be multiples, and backed with ‘synset’ synonyms |
| | | altLabel | isAlternateLabel | [23] | |
| | | bindingName | hasBindingName | [10] | |
| | | prefLabel | isPreferredLabel | [23] | |
| | | primarySubject | isPrimarySubjectOf | [23] | |
| | | proxyID | hasProxyID | [25] | |
| | | subjectLanguage | hasSubjectLanguage | [28] | |
| | | subjectNote | hasSubjectNote | [26] | |

General table notes are provided under the endnotes [10].

Please note that the specific subject proxies and their defining classes and properties are being handled in a separate document. This outline, as it is revised, is informing the first N3 version of the ‘core’ UMBEL ontology.

This structure is still quite preliminary. (For example, data type definitions and interface constructs are still in active discussion, without accepted standards.) Comments on this draft UMBEL ‘core’ ontology outline are welcomed either at the UMBEL discussion forum on Google or at the specific outline page on the UMBEL wiki.

Revisiting the ‘Lightweight’ Designation

We can thus see that there is only minimal semantics in the potential linkage between UMBEL datasets.

One way to place this system is through an interesting approach called the Levels of Conceptual Interoperability Model, whose levels are shown in the following conceptual diagram [11]:

[Diagram: Levels of Conceptual Interoperability Model]

Under this model, UMBEL resides right at the interface between Levels 2 and 3, where syntactic interoperability is achieved but with only limited semantic understanding. In fact, this represents a clear analog to AI3's discussion of the structured Web, which is very much related to the syntactic level, with the negotiation of semantics the next challenge.

This posting is part of a new, occasional series on the Structured Web.

[1] The UMBEL ontology has two parts. The first ‘core’ part is a flat listing, or pool, of concrete subject topics that are the proxy binding points for external data sets. The second ‘unofficial’ part is a reference look-up structure of hierarchical and interlinked subject relationships.

[2] The “Dublin” in the name refers to Dublin, Ohio, where the work originated from an invitational workshop hosted in 1995 by the Online Computer Library Center (OCLC), a library consortium that has its headquarters there. The “Core” refers to the fact that the metadata element set is a basic but expandable “core” list, used in a similar way to the UMBEL ‘core’.

[3] There are several large registries of OAI-compliant repositories: The OAI registry at University of Illinois at Urbana-Champaign, The Open Archives list of registered OAI repositories, The Celestial OAI registry, Eprint’s Institutional Archives Registry, Openarchives.eu The European Guide to OAI-PMH compliant repositories in the world, and the ScientificCommons.org A worldwide service and registry.

[4] The Standards Committee BB (Task Group 2): Collection & Service Descriptions, NISO Z39.91-200x, Collection Description Specification, November 2005. It also specifies an XML binding for serializing such descriptions for interchange between applications.

[5] Xiaorong Xiang and Eric Lease Morgan, “Exploiting ‘Light-weight’ Protocols and Open Source Tools to Implement Digital Library Collections and Services,” D-Lib Magazine 11(10), October 2005. See http://www.dlib.org/dlib/october05/morgan/10morgan.html. Also see its MyLibrary reference for examples of facets applied to collections.

[6] Jacek Kopecký, Ed., Web Services Description Language (WSDL) Version 2.0: RDF Mapping, W3C Working Group Note, 26 June 2007. See http://www.w3.org/TR/wsdl20-rdf.

[7] Data binding from the Microsoft perspective is described at http://msdn2.microsoft.com/en-us/library/ms531387.aspx. The W3C’s perspective on XML data binding is described, for example, in Paul Downey, ed., XML Schema Patterns for Common Data Structures; see http://www.w3.org/2005/07/xml-schema-patterns.html.

[8] Also, there is an entire corpus related to topic maps. In specific reference to Web services, there is the so-called WS-Topics: Web Services Topics 1.3 (WS-Topics), OASIS Standard, 1 October 2006; see http://docs.oasis-open.org/wsn/wsn-ws_topics-1.3-spec-os.htm. This is used in conjunction with WS-Notification. Also see Web Services Base Notification 1.3 (WS-BaseNotification), OASIS Standard, 1 October 2006, at http://docs.oasis-open.org/wsn/wsn-ws_base_notification-1.3-spec-os.htm. PDFs of these documents are also available.

[9] For an initial introduction with a focus on Xcerpt, see François Bry, Tim Furche, and Benedikt Linse, “Let’s Mix It: Versatile Access to Web Data in Xcerpt,” in Proceedings of 3rd Workshop on Information Integration on the Web (IIWeb 2006), Edinburgh, Scotland, 22nd May 2006; also as REWERSE-RP-2006-034, see http://rewerse.net/publications/download/REWERSE-RP-2006-034.pdf. For a more detailed treatment, see T. Furche, F. Bry, S. Schaffert, R. Orsini, I. Horrocks, M. Krauss, and O. Bolzer. Survey over Existing Query and Transformation Languages. Deliverable I4-D1a Revision 2.0, REWERSE, 225 pp., April 2006. See http://rewerse.net/deliverables/m24/i4-d9a.pdf.

[10] Here are the general table notes:

[1] SubClasses have the advantage of inheritance and shared metadata; see main text for full subClass path
[2] hasPredicate is actually my preferred format; thoughts?
[3] Standard ISO languages
[4] extendable base listing including NISO, IESR, DLSR, DCMI, etc.; need completion
[5] uses the idea of an extended base class per XML Schema; the enumerated listings thus only need be partially complete; see below for most listings
[6] number of records or TBD metric?
[7] possibly unnecessary; can not see enumeration of profile Types
[8] related to idea of Fresnel or Zitgist “preferred” viewing format or XSLT-type stylesheet
[9] should there be other types than FOAF (what of other formal listings or organizations v. individuals?)?
[10] not sure how to do this; need a x-ref between two class categories (e.g., Binding <-> Interface, Binding <-> SubjectProxy)
[11] are there patterns for bindingTypes or a likely enumerated listing?

HTTP
JDBC
ODBC
RPC
RPI
SOAP
XML-RPC

[12] need an enumerated list (?) going from individual annotation / metadata (a la RDFa) to complete dataset

webPage
dataSet
dataRecords

[13] perhaps not best name; related to Interface and Services in WSDL, could also be called Construct, Composition, others
[14] see possible dataFormalisms below; needs completion; could be named differently (schema, format, etc.)

Atom
eRDF
Microformats
OPML
Other (unspecified)
Other Ontology
OWL DL
OWL Full
OWL Lite
RDF
RDFa
RDF-S
RSS
Spreadsheet
Topic Map
WebPage
XFML

[15] see possible endpointTypes below

dropdownList
fileExport
other Query formats ???
queryBox
SPARQL

[16] patterns are fairly prominent in WSDL and XML Schema; applicable here?
[17] need to discuss
[18] see possible queryLanguages below; needs completion

DQL
IR (standard text search)
N3QL
R-DEVICE
RDFQ
RDQ
RDQL
RQL
SeRQL
SPARQL
SQL
Versa
Xcerpt
XPath
XPointer
XQuery
XSLT
XUL

[19] see possible serializations below; needs completion

Atom
Gbase
html
JSON
JSON-P
N3
RDF/A
RDF/XML
Turtle
XML

[20] GRDDL, RDFizers, and various converters/translators; likely needs an expandable, enumerated list
[21] unlike the other subClasses, this is closely aligned with the standard Profile metadata
[22] likely a separate namespace (e.g., ‘umbels’) that will contain additional information such as synsets, etc. See text.
[23] SKOS concept
[24] need to check on overlap/replacement/use with SKOS subjectIndicator property
[25] should names or IDs be used for subjectProxys? (IDs have the advantage of changing labels and use in other languages)
[26] seems similar or identical to SKOS scopeNote
[27] see possible annotatorTypes below; needs completion

archivist
bot
owner
repository
third-party

[28] similar to dc:language, but the language of the resource (its metadata characterization) must be kept separate from the language of the actual subject proxies

[11] A. Tolk, S.Y. Diallo, C.D. Turnitsa and L.S. Winters, “Composable M&S Web Services for Net-centric Applications,” Journal for Defense Modeling & Simulation (JDMS), Volume 3 Number 1, pp. 27-44, January 2006.

Posted by AI3's author, Mike Bergman Posted on August 8, 2007 at 7:53 pm in Uncategorized | Comments (0)
The URI link reference to this post is: http://www.mkbergman.com/393/erecting-road-signs-on-the-structured-web/
The URI to trackback this post is: http://www.mkbergman.com/393/erecting-road-signs-on-the-structured-web/trackback/
Posted:July 22, 2007

McCruffy Pattern Blocks

Language is Essential to Communication

I recently began a series on the structured Web and its role in the continued evolution of the Internet. This next installment in the series probes in greater depth the question of What is structure? in reference to data and Web expressions, with an emphasis on terminology and definitions.

This post ties in with a new best-practices guide published by Chris Bizer, Richard Cyganiak, and Tom Heath — called the Linked Data Publishing Tutorial — that provides definitions and viewpoints from the perspective of the use of RDF (Resource Description Framework) and W3C practices. My initial post in this series and their tutorial occasioned Kingsley Idehen to post his own Linked Data and the Web Information BUS entry, adding the valuable perspective of practices and terminology going back to the early 1990s in object and relational database systems and standards such as ODBC.

All of these efforts share a desire to craft practices, language and terminology to help promote the availability and interoperability of useful data on the Web.

A challenge for the semantic Web community is to craft language that is clear and understandable to the lay public and Web developers, something which it has often done poorly.

Some problematic terms include information resource, non-information resource, dereferencing, bnode, content negotiation, representation, URL re-writing, and others.

This piece only tackles ‘non-information resource‘ head on. Discussion of other problematic terms awaits another day.

These posts caused Kingsley and me to engage in a prolonged discussion about definitions and terms. I acted as the unofficial scribe, and I attempt to capture and argue the results more generally below. If you like the ideas below, you may credit both of us; if you don’t, ascribe any errors or omissions to me alone [1].

The Basic Framework and Argument

The basic observation related to the structured Web is that it is a transition phase from the initial document-centric Web to the eventual semantic Web. In this transitional phase, the Web is becoming much more data-centric. The idea of ‘linked data‘ also is a component of this transition, but is more precise in meaning because by definition the data must be expressed as RDF in order to aid interoperability.

The challenge is to convert existing Web pages and data into the structured Web with every resource accessible via an unambiguous URI. Insofar as this conversion also occurs to RDF, it will promote linked data interoperability.

Transitions of such a profound nature — even if short of the full vision of the semantic Web — create the need for new language and terminology to aid understanding and communication. Sometimes, as well, longstanding terms and practices may need to be refined or challenged. In any event, notions of simplistic versioning such as ‘Web 3.0‘ add little to understanding and communication.

Let’s Begin with Data Structure

Independent of the Internet or the Web, let’s begin our discussion about the nature of structure in its application to data [2,3]. Peter Wood, a professor of computer science at Birkbeck College at the University of London, provides succinct definitions of the “structure” of various types of data [4]:

  • Structured Data — in this form, data is organized in semantic chunks or entities, with similar entities grouped together in relations or classes. Entities in the same group have the same descriptions (or attributes), while descriptions for all entities in a group (or schema): a) have the same defined format; b) have a predefined length; c) are all present; and d) follow the same order. Structured data are what is normally associated with conventional databases such as relational transactional ones where information is organized into rows and columns within tables. Spreadsheets are another example. Nearly all widely understood database management systems (DBMS) are designed for structured data,
  • Unstructured Data — in this form, data can be of any type and do not necessarily follow any format or sequence, do not follow any rules, are not predictable, and can generally be described as “free form.” Examples of unstructured data include text, images, video or sound (the latter two also known as “streaming media”). Generally, “search engines” are used for retrieval of unstructured data via querying on keywords or tokens that are indexed at time of the data ingest, and
  • Semi-structured Data — the idea of semi-structured data predates XML but not HTML (with the actual genesis better associated with SGML, see below). Semi-structured data are intermediate between the two forms above wherein “tags” or “structure” may be associated or embedded within unstructured data. Semi-structured data are organized in semantic entities, similar entities are grouped together, entities in the same group may not have same attributes, the order of attributes is not necessarily important, not all attributes may be required, and the size or type of some attributes in a group may differ. To be organized and searched, semi-structured data should be provided electronically from database systems, file systems (e.g., bibliographic data, Web data) or via data exchange formats (e.g., EDI, scientific data, XML).
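
As a rough illustration of the three forms above, the sketch below holds similar information in each of them. The `Employee` type, its fields and the sample values are all hypothetical; they are only meant to make the attribute rules in the definitions concrete.

```python
from typing import NamedTuple

# Structured: every record has the same attributes, in the same order,
# with the same types, and all of them are present.
class Employee(NamedTuple):
    emp_id: int
    name: str
    hire_year: int

structured = [
    Employee(1, "Ann Lee", 1999),
    Employee(2, "Raj Patel", 2004),
]

# Semi-structured: entities are still grouped, but attributes may be missing,
# may appear in any order, and may differ in type from record to record.
semi_structured = [
    {"name": "Ann Lee", "emp_id": 1, "phones": ["555-0100", "555-0101"]},
    {"emp_id": "E-2", "name": "Raj Patel"},  # no phones; emp_id is a string here
]

# Unstructured: free-form text with no declared attributes at all.
unstructured = "Ann Lee joined in 1999; Raj Patel followed five years later."

def shared_attributes(records):
    """Return the attribute names that are present in every record."""
    keys = [set(record) for record in records]
    return set.intersection(*keys) if keys else set()

print(shared_attributes(semi_structured))
```

Only `name` and `emp_id` are common to both semi-structured records, which is exactly the kind of partial regularity that a fixed schema cannot express.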

We can thus view data structure as residing on a spectrum (also shown with “typical” storage and indexing frameworks on the top line):

For the past decades, structured data has typically been managed by database management systems (DBMSs) or spreadsheets, while unstructured data has been handled by text indexing systems, such as those used by search engines, or left unindexed in file systems or repositories.

Semi-structured data models are sometimes called “self-describing” (or schema-less) [5]. The first known definition of semi-structured data dates to 1993 [6] by Peter Schäuble: “We call a data collection semistructured if there exists a database scheme which specifies both normalized attributes (e.g., dates or employee numbers) and non-normalized attributes (e.g., full text or images).” More current usage (see the Wood definition above) also includes the notion of labeled graphs or trees with the data stored at the leaves, with the schema information contained in the edge labels of the graph. Semi-structured representations also lend themselves well to data exchange or the integration of heterogeneous data sources.
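
The labeled-tree view can be sketched in a few lines. This is only a minimal illustration, assuming a nested Python dictionary stands in for the tree; the `leaf_paths` helper and the sample record are hypothetical. The edge labels (the keys) carry the schema information, and the data sits at the leaves.

```python
def leaf_paths(node, path=()):
    """Yield (edge-label path, leaf value) pairs from a nested dict/list tree."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from leaf_paths(child, path + (label,))
    elif isinstance(node, list):
        for child in node:
            # Items in a list share the edge label of their parent.
            yield from leaf_paths(child, path)
    else:
        yield path, node

record = {
    "person": {
        "name": "Ann Lee",
        "phone": ["555-0100", "555-0101"],
        "employer": {"name": "Acme"},
    }
}

for path, value in leaf_paths(record):
    print("/".join(path), "=", value)
```

No separate schema is declared anywhere; it can be read off the paths themselves, which is why such data is called “self-describing.”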

HTML tags within Web documents are a prime example of semi-structured data, as are text-based data transfer protocols or serializations [7]. Semi-structured data is also a natural intermediate form when “structure” is desired to be extracted from standard text through techniques generally called ‘information extraction’ (IE). For example, here is possible structure shown in yellow as might be extracted from a death notice or obituary [8]:

John A. Smith of Salem, MA died Friday at Deaconess Medical Center in Boston after a bout with cancer. He was 67. Born in Revere, he was raised and educated in Salem, MA. He was a member of St. Mary’s Church in Salem, MA, and is survived by his wife, Jane N., and two children, John A., Jr., and Lily C., both of Winchester, MA. A memorial service will be held at 10:00 AM at St. Mary’s Church in Salem.

This notice contains a great deal of information including names, places, dates and relationships, which once extracted, can be separately indexed or manipulated. Virtually all text-based documents can have similar structure extracted.
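
To give a feel for the extraction step, here is a deliberately naive sketch that pulls a few entities out of the notice above with regular expressions. The patterns and the `extract_entities` name are my own assumptions for illustration; real IE software relies on far more robust techniques (and must handle the co-reference problems noted in [8]).

```python
import re

NOTICE = (
    "John A. Smith of Salem, MA died Friday at Deaconess Medical Center in "
    "Boston after a bout with cancer. He was 67. Born in Revere, he was raised "
    "and educated in Salem, MA. He was a member of St. Mary's Church in Salem, "
    "MA, and is survived by his wife, Jane N., and two children, John A., Jr., "
    "and Lily C., both of Winchester, MA. A memorial service will be held at "
    "10:00 AM at St. Mary's Church in Salem."
)

def extract_entities(text):
    """Extract a few entity types with simple, illustrative patterns."""
    # "City, ST" pairs such as "Salem, MA"; duplicates are collapsed.
    places = sorted(set(re.findall(r"\b([A-Z][a-z]+), ([A-Z]{2})\b", text)))
    # An age stated as "He was NN".
    ages = re.findall(r"\bHe was (\d+)\b", text)
    # Clock times such as "10:00 AM".
    times = re.findall(r"\b\d{1,2}:\d{2} [AP]M\b", text)
    return {"places": places, "ages": ages, "times": times}

print(extract_entities(NOTICE))
```

Note what the patterns miss: Boston and Revere (no state given), all of the personal names, and every relationship between the entities.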

The Web has been a prime source of growth of semi-structured data, most often through text-based data serializations [9] and mark-up languages [10].

Depending on point of view or definition, RDF can either be called semi-structured or structured data. It resides squarely at the transition point between these two categories on this structural continuum.

A Variety of Web ‘Resources’

A couple of Web concepts often cause confusion and difficulty for some users: 1) Uniform Resource Locators (URLs) v Uniform Resource Identifiers (URIs); and 2) the concept of “resources” themselves. As it happens, there was a pretty accessible discussion on URIs that was recently posted; I recommend that discussion on that topic. Instead, we’ll focus here on “resources.”

The concept of a resource is basic to the Web’s architecture and is used in the definition of its fundamental elements, including obviously URL and URI above. The essence of the semantic Web parlance, as well, is built around the notion of abstract resources and their semantic properties. The data model and languages of the Resource Description Framework (RDF) squarely revolve around this “resource” concept.

The first explicit definition of resource related to the Web is found in RFC 2396 in August 1998:

A resource can be anything that has identity. Familiar examples include an electronic document, an image, a service (e.g., “today’s weather report for Los Angeles”), and a collection of other resources.

However, in the context of the Internet, not all resources so defined (such as a person or a company) can be retrieved, while electronic resources like an image or Web page can. Thus, a first challenge arises in the concept of resource and its locational address: some are actual and physical, others are abstract or referential.

According to Wikipedia’s discussion of resources:

The concept of resource has evolved during the Web’s history, from the early notion of a static addressable document or file, to a more generic and abstract definition, now encompassing every thing or entity that can be identified, named, addressed or handled, in any way whatsoever, in the Web at large, or in any networked information system. The declarative aspects of a resource (identification and naming) and its functional aspects (addressing and technical handling) were not clearly distinct in the early specifications of the Web . . . .

The need to somehow make this distinction between actual or physical resources vs. abstract or referential resources led to much discussion and controversy sometimes known as the httpRange-14 issue [11], resolved by the W3C’s Technical Architecture Group (TAG) in 2005. The TAG defined a distinction between two resource types:

  • information resource — this category pertains to the original sense of resources on the Web, such as files, documents or other resources to which a URL can be assigned (including images and non-document media). Though it has been proposed that such resources be designated as “slash” URIs, such as http://www.example.com/home/index.html, this is not enforceable and some resources in the next category do not adhere. A “slash” URI, however, still is “better practice” (if not “best practice”) for such traditional resources. Note that this category of information resource is very much in keeping with the nature of the early, document-centric Web
  • non-information resource — this category is the new one to deal with all of the “abstract” cases such as classes, etc.; this category is especially important to the data-centric Web. As with the other resource category, it was proposed to use the “pound sign” fragment identifiers used by anchor tags, such as http://www.example.com/data#bigClass, for example, to signal this different resource type, but, again, it is not enforceable and at most could be “better practice.”

A successful HTTP request for an information resource results in a 200 message (“OK”, followed by transfer) from the Web server; if the request is for a resource that the Web server recognizes as existing but of the wrong type requested, the publisher can use the 303 redirect response to provide the correct URI [12].
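
The status-code convention can be written down as a small lookup. The sketch below follows the three cases of the TAG resolution quoted in [13]; the function name and the returned labels are my own, and no network request is made.

```python
def classify_by_status(status_code):
    """Map an HTTP GET status code to what it says about the resource,
    following the three cases of the 2005 TAG resolution."""
    if 200 <= status_code < 300:
        # Case a): a 2xx response identifies an information resource.
        return "information resource"
    if status_code == 303:
        # Case b): 303 (See Other) means the URI could identify any resource.
        return "any resource"
    if 400 <= status_code < 500:
        # Case c): a 4xx error leaves the nature of the resource unknown.
        return "unknown"
    return "not covered by the resolution"

for code in (200, 303, 404, 500):
    print(code, "->", classify_by_status(code))
```

A client that receives the 303 case would then follow the redirect to obtain a description of the resource from a different URI.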

These resource distinctions are very, very unfortunate in their labeling, if not on more fundamental grounds. It is non-sensical to call one category “information” and the other not, when everything is informational. Moreover, the distinctions bring absolutely no clarity. (Important note: actually, the provenance of the term ‘non-information resource‘ appears to be quite recent, as well as wrong and unfortunate [13]).

However, getting standards bodies to change labels is a long and uncertain task. The approach taken below is to stick with the ‘information resource’ term, but to provide the alias of ‘structured data resource’ in place of ‘non-information resource’ and to add some additional sub-category distinctions [14].

The Relation Between Resources and Structure

Even though a Web ‘resource‘ has an address scheme and other requirements, these are details and specifics that can mask the fundamental purpose of a resource to act as a “container” for encapsulating information of some sort. This encapsulation is what enables access to its “payload” information (Kingsley’s term) on the Web or the broader Internet via standard protocols (HTTP and TCP/IP). The mechanisms of the encapsulation constitute the details and specifics.

Inherent to the transition from the document Web to the structured Web is the increased importance of that most confusing category: non-information resource. That is because, as with fragment identifiers, we are talking about objects more granular than (subsidiary to) document-level resources, and because we are now referencing structure that includes such “abstract” notions as classes, properties, types, namespaces, ontologies, schema, etc.

One paradox in all of this is that the very category of resource designed to deal with these data issues is itself called by many a ‘non-information resource’. This term is non-sense [13]. We use instead ‘structured data resource.’

There is much that needs to be cleaned up and clarified over time regarding uses and nomenclature regarding resources. However, from the standpoint of the structured Web, we can probably for the time being concentrate on those items shown in bold in this table:

Information Resource (current standard Web term): unstructured and semi-structured data
  • Document Resource: text and markup within standard Web pages
  • ‘Other’ Resources: non-text resources with a URL (images, streaming media, non-text indexable files)

Non-information Resource, aka Structured Data Resource [see text; 13] (current standard Web term; non-sensical): semi-structured and structured data
  • Structured Data: all non-RDF data-oriented resources, including non-RDF namespaces, etc.
  • Linked Data: all RDF

Note that the two main resource categories used in practice are maintained. The ‘information resource’ category retains its traditional understanding. From the standpoint of the structured Web, the document resources are the unstructured and semi-structured data content from which information extraction (IE) techniques and software can extract the eventual structure.

The category of document resource likely represents the majority of potentially useful structural content on the Web and is most often overlooked in discussions of linked data or the semantic Web. This content, if subjected to IE and therefore structure creation, then becomes a URI resource better handled as a ‘structured data resource.’

‘Structured data resources’ (that is, the poorly labeled ‘non-information resources’) are the building blocks for the structured Web. In all cases, these resources are either semi-structured or structured data. There are two sub-categories of resources in this category, differentiated by whether the structured data is expressed in RDF (or RDF-based languages) or not. All RDF variants are called ‘linked data’; all other forms are termed ‘structured data.’

For most general purposes, putting aside nuances and subtle technicalities, the best shorthand for thinking about these resource distinctions is simply Documents v. Data.

Thus, we can see the path to the structured Web taking a number of different branches.

The first branch, and the one necessary for the largest portion of content, is to use a combination of IE techniques to extract entity information from unstructured text or to use structure extraction on the semi-structure of the document [15] to create the structured data resource for that document. Of all variants, this is the longest path, the one least developed, but one with potentially great value.

The second branch is to publish the structured data directly as a resource or to provide access to it through a Web service or API. This is the current basis for most of the structured data resources presently available on the Web. (It is also the outcome of IE for the first branch.)

The third branch is really just a complete variant of the other two: ensuring that the structured data resource is available as interoperable RDF linked data. There are two ways to proceed down this branch. One way is for the publisher to create and post the resource directly in a form of RDF. (Though the actual data can be serialized in a variety of ways such as RDF/XML, N3, Atom or Turtle, conversion between these forms is relatively straightforward.) The other way, less direct, is for the publisher or a third party to convert non-RDF structured data into RDF with the rich and growing list of available ‘RDFizers’ [16].
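
To make the conversion path a little more concrete, here is a toy sketch of what the simplest possible ‘RDFizer’ might do: turn one flat record into N-Triples statements. Everything here is an assumption for illustration (the function names, the example.com URIs and the rule that any value starting with http is a URI); it does not describe how any existing RDFizer works, and it ignores datatypes, language tags and blank nodes.

```python
def escape_literal(value):
    """Escape a value for use inside a double-quoted N-Triples literal."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )

def record_to_ntriples(subject_uri, record, vocab_base):
    """Emit one triple per key/value pair of a flat record.

    Keys are appended to vocab_base to form predicate URIs, so they are
    assumed to be plain names that are safe to use in a URI.
    """
    lines = []
    for key, value in record.items():
        predicate = f"<{vocab_base}{key}>"
        if isinstance(value, str) and value.startswith(("http://", "https://")):
            obj = f"<{value}>"  # treat as a link to another resource
        else:
            obj = f'"{escape_literal(value)}"'  # treat as a plain literal
        lines.append(f"<{subject_uri}> {predicate} {obj} .")
    return "\n".join(lines)

record = {"name": "John A. Smith", "homepage": "http://www.example.com/smith"}
print(record_to_ntriples(
    "http://www.example.com/data#smith",
    record,
    "http://www.example.com/vocab#",
))
```

The second triple is the interesting one: because its object is itself a URI, the record now links to another resource rather than merely describing itself.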

The Web in Transition

This material, plus the earlier introduction to the structured Web, can now be brought together as a picture of the Web in transition. While there are no real beginning and end points, there is a steady progression from a document-centric Web to one that is data-centric, including the mediation of semantics:

Transition in Web Structure

Document Web
  • Document-centric
  • Document resources
  • Unstructured data and semi-structured data
  • HTML
  • URL-centric
  • circa 1993

Structured Web
  • Data-centric
  • Structured data
  • Semi-structured data and structured data
  • XML, JSON, RDF, etc
  • URI-centric
  • circa 2003

Linked Data
  • Data-centric
  • Linked data
  • Semi-structured data and structured data
  • RDF, RDF-S
  • URI-centric
  • circa 2006

Semantic Web
  • Data-centric
  • Linked data
  • Semi-structured data and structured data
  • RDF, RDF-S, OWL
  • URI-centric
  • circa ???

The basic argument of this series is that we are in the midst of a transition phase — the structured Web — that marks the beginning of the dominance of data on the Web. In its broadest definition, the structured Web has many different data forms and serializations. A subset of the structured Web — namely, linked data — is a direct precursor to the semantic Web with its emphasis on RDF and data interoperability and services.

Another argument of this series is — despite the promise of linked data — that structured data resources in many forms will co-exist and provide alternatives. This diversity is natural. For RDF and linked data advocates, tools are now largely in place to convert these diverse forms. Though the ability to see large-scale availability of RDF data appears clear, the longer-term resolution of mediating heterogeneous semantics remains cloudy.

Brief Glossary

To re-cap, and to aid language and understanding, here is a brief glossary of the key terms used in this discussion:

  • dereferencing — the act of locating and transmitting structured data, in a requested format (representation), exposed by a URL
  • document resource — a Web resource designated by a URL that contains unstructured and semi-structured data
  • information extraction — any of a variety of techniques for extracting structure and entities from content
  • information resource — any Web resource that can be retrieved via a URL
  • linked data — a structured data resource in RDF that can be obtained via a URI
  • non-information resource — this is “any resource” that is not an information resource; preference is to deprecate its use and use structured data resource in its stead
  • semantic Web — the Web of data with sufficient meaning associated with that data such that a computer program may learn enough to meaningfully process it
  • semi-structured data — data that includes semantic entities that may have attributes or groups, but which are not all required or presented in a set order or in a set size or type; may be embedded or interspersed in unstructured data
  • structured data — data organized into semantic chunks or entities, with similar entities grouped together in relations or classes, and presented in a patterned manner
  • structured data resource — a Web resource containing semi-structured or structured data that can be obtained via a URI
  • structured Web — the data-centric Web of structured data resources and structured data in various forms; formally defined as, object-level data within Internet documents and databases that can be extracted, converted from available forms, represented in standard ways, shared, re-purposed, combined, viewed, analyzed and qualified without respect to originating form or provenance
  • unstructured data — data that can be of any type and does not necessarily follow any format, sequence or rules; can generally be described as “free form”; includes text, images, video or sound
  • URI — uniform resource identifier
  • URL — uniform resource locator

This posting is part of a new, occasional series on the Structured Web.

[1] You will note a heavy emphasis on Wikipedia definitions in keeping with Web usage.

[2] Of course, the word ‘structure‘ has a broad range of meanings; we are only concerned here about data structure and its specific applicability to Web-related information.

[3] Much of the discussion in this sub-section is derived from an earlier AI3 posting, Semi-structured Data: Happy 10th Birthday!, from November 2005; some of the original information is a bit dated. It is also aided by Kingsley’s Structured Data v. Unstructured Data posting from June 2006. That posting notes the frequent confusion and ambiguity between the terms “structured data” and “unstructured data,” and the importance when speaking of structure to keep separate the structure of the data itself (the focus herein), the structure of the container that hosts the data, and the structure of the access method used to access the data.

[4] Peter Wood, School of Computer Science and Information Systems, Birkbeck College, the University of London. See http://www.dcs.bbk.ac.uk/~ptw/teaching/ssd/toc.html.

[5] The earliest known recorded mention of “semi-structured data” occurred in 1992 from N. J. Belkin and W. B. Croft, “Information filtering and information retrieval: two sides of the same coin?,” in Communications of the ACM: Special Issue on Information Filtering, vol. 35(12), pp. 29 – 38, with the first known definition from [6]. The next two mentions were in 1995 from D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman and J. Widom, “Querying semistructured heterogeneous information,” presented at Deductive and Object-Oriented Databases (DOOD ’95), LNCS, No. 1013, pp. 319-344, Springer, and M. Tresch, N. Palmer, and A. Luniewski, “Type classification of semi-structured data,” in Proceedings of the International Conference on Very Large Data Bases (VLDB). However, the real popularization of the term “semi-structured data” occurred through the seminal 1997 papers from S. Abiteboul, “Querying semi-structured data,” in International Conference on Data Base Theory (ICDT), pp. 1-18, Delphi, Greece, 1997 (http://dbpubs.stanford.edu:8090/pub/1996-19) and P. Buneman, “Semistructured data,” in ACM Symposium on Principles of Database Systems (PODS), pp. 117-121, Tucson, Arizona, May 1997 (http://db.cis.upenn.edu/DL/97/Tutorial-Peter/tutorial-semi-pods.ps.gz). Of course, semi-structured data had existed prior to these early references, only it had not been named as such.

[6] P. Schäuble, “SPIDER: a multiuser information retrieval system for semistructured and dynamic data,” in Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318 – 327, 1993.

[7] Such protocols first received serious computer science study in the late 1970s and early 1980s. In the financial realm, one early standard was electronic data interchange (EDI). In science, there were literally tens of exchange forms proposed with varying degrees of acceptance, notably abstract syntax notation (ASN.1), TeX (a typesetting system created by Donald Knuth and its variants such as LaTeX), hierarchical data format (HDF), CDF (common data format), and the like, as well as commercial formats such as Postscript, PDF (portable document format), RTF (rich text format), and the like. One of these proposed standards was the “standard generalized markup language” (SGML), first published in 1986. SGML was flexible enough to represent either formatting or data exchange. However, with its flexibility came complexity. Only when two simpler forms arose, namely HTML (HyperText Markup Language) for describing Web pages and XML (eXtensible Markup Language) for data exchange, did variants of the SGML form emerge as widely used common standards. The XML standard was first published by the W3C in February 1998, rather after the semi-structured data term had achieved some impact (http://www.w3.org/XML/hist2002). Dan Suciu was the first to publish on the linkage of XML to semi-structured data in 1998 (D. Suciu, “Semistructured data and XML,” in International Conference on Foundations of Data Organization (FODO), Kobe, Japan, November 1998; see PDF option from http://citeseer.ist.psu.edu/suciu98semistructured.html), a reference that remains worth reading to this day.

[8] David Loshin, “Simple semi-structured data,” Business Intelligence Network, October 17, 2005. See http://www.b-eye-network.com/view/1761. This example is actually quite complex and demonstrates the challenges facing IE software. Extracted entities most often relate to the nouns or “things” within a document. Note also, for example, how many of the entities involve internal “co-referencing,” or the relation of subjects such as “he” or clock times such as “10 a.m” to specific dates. A good entity extraction engine helps resolve these so-called “within document co-references.”

[9] Example text-based data serializations and formats used on the Web include Atom, Gdata, JSON, N3, pickle, RDF/XML, RSS, Turtle, XML, YaML.

[10] Example mark-up languages used on the Web include HTML, Wikitext, XHTML, XML.

[11] It was called the httpRange-14 issue by virtue of the agenda label at the TAG’s meetings.

[12] Of course, nothing compels the publisher to provide these instructions, but they are integral to “best practices” and publishers desirous of attracting consumers have incentives to follow them.

Depending on the nature of the information resource, an HTTP GET returns a 200 OK status and sends the resource (for example a request for a Web page or an RDF file) if the request is for the correct type. If the request is for the wrong type, the publisher can include in the header a 303 (See Other) redirect response to send the requester to the appropriate information resource URL. If the request is for an unknown URI or resource, a variety of 4xx error responses may result. (See further [13].)

One advantage of posting structured data resources as RDF-linked data is that this redirect can be sent to a REST-style Web Service that does a best-effort DESCRIBE of the resource using SPARQL. Since SPARQL is a query language, protocol, and results representation scheme, the redirect can come in the form of a URL that can directly query the structured data resource. In this manner, large structured data resource data sets can act as ‘endpoints’ for context-specific information linked anywhere on the Web.
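A redirect URL of this kind might be built as in the sketch below. The endpoint address and resource URI are placeholders, and the sketch assumes the endpoint accepts the query in the `query` parameter of a GET request, as the SPARQL protocol allows.

```python
from urllib.parse import urlencode


def describe_url(endpoint: str, resource_uri: str) -> str:
    """Build a URL that asks a SPARQL endpoint to DESCRIBE one resource."""
    # A bare DESCRIBE leaves the choice of returned triples to the endpoint,
    # which is why the result is only "best effort".
    query = f"DESCRIBE <{resource_uri}>"
    # urlencode percent-encodes the query so it is safe in a URL.
    return f"{endpoint}?{urlencode({'query': query})}"


# Hypothetical endpoint and resource, for illustration only.
print(describe_url("http://example.org/sparql", "http://example.org/id/paris"))
```

The resulting string is what a publisher could place in the Location header of the 303 response.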

[13] The TAG’s 2005 ‘compromise solution’ was reported by Roy Fielding, with his public notice reproduced in full:

<TAG type=”RESOLVED”>

That we provide advice to the community that they may mint
“http” URIs for any resource provided that they follow this
simple rule for the sake of removing ambiguity:
a) If an “http” resource responds to a GET request with a
2xx response, then the resource identified by that URI
is an information resource;
b) If an “http” resource responds to a GET request with a
303 (See Other) response, then the resource identified
by that URI could be any resource;
c) If an “http” resource responds to a GET request with a
4xx (error) response, then the nature of the resource
is unknown.

</TAG>

I believe that this solution enables people to name arbitrary
resources using the “http” namespace without any dependence on
fragment vs non-fragment URIs, while at the same time providing
a mechanism whereby information can be supplied via the 303
redirect without leading to ambiguous interpretation of such
information as being a representation of the resource (rather,
the redirection points to a different resource in the same way
as an external link from one resource to the other).

Note that point “a” discusses an “information resource,” with the contrasting treatment in point “b” as potentially “any resource”.
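Seen from the consumer side, the three-part rule in the notice amounts to a small lookup on the status code of a GET. The sketch below is only an illustration: the function name and returned labels are my own, and status codes outside the three cases are simply reported as not covered by the rule.

```python
def classify_resource(status: int) -> str:
    """Map the HTTP status of a GET onto the TAG's 2005 rule."""
    if 200 <= status < 300:
        # Point a): a 2xx response means an information resource.
        return "information resource"
    if status == 303:
        # Point b): 303 See Other means the URI could identify any resource.
        return "any resource"
    if 400 <= status < 500:
        # Point c): a 4xx error leaves the nature of the resource unknown.
        return "unknown"
    # The rule says nothing about other codes (for example 301 or 500).
    return "not covered by the rule"
```

Note that the rule says what a consumer may infer from the response; it does not say what the resource actually is.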

To my knowledge, the unfortunate term ‘non-information resource’ was first introduced to cover these point “b” conditions in a draft (that is, unofficial) TAG finding from two years later, on May 31 of this year, “Dereferencing HTTP URIs.” Besides its continuation of the unfortunate ‘dereferencing’ term (a discussion for another day), it introduces the even-worse ‘non-information resource’ terminology. That draft TAG finding in Sec 2 talks about ‘information resources’ (as does the 2005 TAG finding), and in Sec 3 about ‘other Web resources’ (or the “any” category from the 2005 notice). In Sec 4, however, the document switches from ‘other Web’ to ‘non-information’, which is then continued through the rest of the document.

It is not too late for the community to cease using this term; our replacement suggestion is ‘structured data resource.’

[14] You may also want to see Leo Sauermann, Richard Cyganiak, and Max Völkel, “Cool URIs for the Semantic Web,” or Dan Connolly, “A Pragmatic Theory of Reference for the Web.”

[15] Web documents are typically semi-structured with embedded tag, metadata, and presentation structure in such things as headers, tables, layouts, labels, dropdown lists, etc. For the unstructured text content, traditional information extraction techniques are applied. But for the semi-structure, various scraping or extraction techniques may either be crafted by hand or be semi-automatically or automatically applied through regular expression processing, pattern matching, inspection of the Web page DOM or other techniques. One posting in this structured Web series is to be devoted to this topic.

[16] There is an impressive and growing list of data conversion protocols and tools, most of which support multiple input and output forms, including RDF and various serializations of RDF. This list includes Virtuoso Sponger, GRDDL, Babel, RDFizers, general converters, Triplr, etc. One posting in this structured Web series is to be devoted to this topic.

Posted by AI3's author, Mike Bergman Posted on July 22, 2007 at 3:58 pm in Uncategorized | Comments (2)
The URI link reference to this post is: http://www.mkbergman.com/391/more-structure-more-terminology-and-hopefully-more-clarity/
The URI to trackback this post is: http://www.mkbergman.com/391/more-structure-more-terminology-and-hopefully-more-clarity/trackback/