Posted: August 23, 2007

Was the Industrial Revolution Truly the Catalyst?

Why, beginning roughly in 1820, did historical economic growth rates skyrocket?

This is a question of no small import, and one that has occupied economic historians for many decades. We know what some of the major transitions have been in recorded history: the printing press, Renaissance, Age of Reason, Reformation, scientific method, Industrial Revolution, and so forth. But, which of these factors were outcomes, and which were causative?

This is not a new topic for me. Some of my earlier posts have discussed Paul Ormerod’s Why Most Things Fail: Evolution, Extinction and Economics, David Warsh’s Knowledge and the Wealth of Nations: A Story of Economic Discovery, David M. Levy’s Scrolling Forward: Making Sense of Documents in the Digital Age, Elizabeth Eisenstein’s classic Printing Press, Joel Mokyr’s Gifts of Athena: Historical Origins of the Knowledge Economy, Daniel R. Headrick’s When Information Came of Age: Technologies of Knowledge in the Age of Reason and Revolution, 1700-1850, and Yochai Benkler’s The Wealth of Networks: How Social Production Transforms Markets and Freedom. Thought-provoking references, all.

But, in my opinion, none of them gets at the central point.

Statistical Leaps of Faith

Statistics (originally derived from the concept of information about the state) really only began to be collected in France in the 1700s. For example, the first true population census (as opposed to the enumerations of biblical times) occurred in Spain in that same century, with the United States being the first country to set forth a decennial census beginning around 1790. Pretty much everything of a quantitative historical basis prior to that point is a guesstimate, and often a lousy one to boot.

Because no data was collected — indeed, the idea of data and statistics did not exist — attempts in our modern times to re-create economic and population assessments for earlier centuries are truly heroic, estimation-laden exercises. Nonetheless, the renowned economic historian Angus Maddison, who has written a number of definitive OECD studies, and his team have prepared economic and population growth estimates for the world and various regions going back to AD 1 [1].

One summary of their results shows:

Year AD   Ave Per Capita GDP (1990 $)   Ave Annual Growth Rate   Yrs Required for Doubling
1                461
1000             450                        -0.002%                   N/A
1500             566                         0.046%                 1,504
1600             596                         0.051%                 1,365
1700             615                         0.032%                 2,167
1820             667                         0.067%                 1,036
1870             874                         0.542%                   128
1900           1,262                         1.235%                    56
1913           1,526                         1.470%                    47
1950           2,111                         0.881%                    79
1967           3,396                         2.836%                    25
1985           4,764                         1.898%                    37
2003           6,432                         1.682%                    42

Note that through at least 1000 AD, economic growth per capita (as well as population growth) was approximately flat. Indeed, up to the nineteenth century, Maddison estimates that a doubling of economic well-being per capita occurred only every 3,000 to 4,000 years. But from about 1820 onward, this doubling accelerated at warp speed to every 50 years or so.
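
(For those wanting to check the arithmetic, the doubling times in the table follow directly from the growth rates: years to double = ln 2 / ln(1 + r). At the 1870 rate of 0.542% per year that works out to ln 2 / ln(1.00542), or roughly 128 years; at the pre-1820 rates of a few hundredths of a percent per year, the same formula yields the millennia-scale doubling times shown.)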

Looking at a Couple of Historical Breakpoints

The first historical shift in millennial trends occurred at roughly 1000 AD, when flat or negative growth began to accelerate slightly. The growth trend looks comparatively impressive in the figure below, but that is only because the doubling of per capita economic wealth has now dropped to about every 1,000 to 2,000 years (note the relatively small differences in the income scale). These are annual growth rates about 30 times lower than today’s, which, with compounding, prove anemic indeed (see estimated rates in the table above).

Nonetheless, at about 1000 AD there is an inflection point, though a small one. It is also one that corresponds somewhat to the adoption of raw linen paper vs. skins and vellum (among other correlations that might be drawn).

When the economic growth scale is expanded to include today, these optics change considerably. Yes, there was a bit of a growth inflection around 1000 AD, but it is almost lost in the noise over the longer historical horizon. The real discontinuity in economic growth appears to have occurred in the early 1800s, compared to all previous recorded history. At this major inflection point, historically flat income averages skyrocketed. Why?

The fact that this inflection point does not correspond to earlier events such as the invention of the printing press or the Reformation (or other earlier notable transitions) — and does more closely correspond to the era of the Industrial Revolution — has tended to cement in popular histories and the public’s mind the idea that machinery and mechanization were the causative factors behind economic growth.

Had a notable transition occurred in the mid-1400s to 1500s, it would have been obvious to ascribe modern economic growth trends to the availability of information and the printing press. And, while the printing press indeed had massive effects, as Elizabeth Eisenstein has shown, the empirical record of changes in economic growth is not directly linked with its adoption. Moreover, as the graph above shows, something huge did happen in the early 1800s.

Pulp Paper and Mass Media

In its earliest incarnations, the printing press was an instrument of broader idea dissemination, but still largely to and through a relatively small and elite educated class. That is because books and printed material were still too expensive — I would submit largely due to the exorbitant cost of paper — even though somewhat more available to the wealthy classes. Ideas were fermenting, but the relative percentage of participants in that direct ferment was small. The overall situation was better than monks laboriously scribing manuscripts, but not disruptively so.

However, by the 1800s, those base conditions changed, as reflected in the figure above. The combination of mechanical presses and paper production with the innovation of cheaper “pulp” paper were the factors that truly brought information to the “masses.” Yet some have taken “mass media” to be its own pejorative. But look closely at what that term means and its importance in bringing information to the broader populace.

In Paul Starr’s The Creation of the Media, he notes how in the 15 years from 1835 to 1850 the cost of setting up a mass-circulation paper increased from $10,000 to over $2 million (in 2005 dollars). True, mechanization was increasing publishers’ setup costs, but from the standpoint of consumers, the cost of information content was dropping toward zero and approaching near-time immediacy. The concept of “news” was coined, delivered by the “press” for a now-emerging “mass media.” Hmmm.

Mass publishing and pulp paper were thus emerging to bring an ever-increasing storehouse of content and information to the public at levels never before seen. Though mass media may prove to be an historical artifact, its role in bringing literacy and information to the “masses” was generally an unalloyed good and the basis for an improvement in economic well-being the likes of which had never been seen.

More recent trends show an upward blip in growth shortly after the turn of the 20th century, corresponding to electrification, but then a much larger discontinuity beginning after World War II:

In keeping with my thesis, I would posit that organizational information efforts and the early electromechanical and then electronic computers that came out of the war effort, which in turn led to more efficient processing of information, were possible factors in this post-WWII growth increase.

It is silly, of course, to point to single factors or offer simplistic slogans about why this growth occurred and when. Indeed, the scientific revolution, industrial revolution, increase in literacy, electrification, printing press, Reformation, rise in democracy, and many other plausible and worthy candidates have been brought forward to explain these historical inflections in accelerated growth. For my own lights, I believe each and every one of these factors had its role to play.

But at a more fundamental level, I believe the drivers for this growth change came from the global increase in, and access to, prior human information. Surely, the printing press helped to increase absolute volumes. Declining paper costs (a factor I believe to be greatly overlooked, but also coterminous with the growth spurt and the transition from rag to pulp paper in the early 1800s) made information access affordable and universal. With accumulations in information volume came the need for better means to organize and present that information — title pages, tables of contents, indexes, glossaries, encyclopedias, dictionaries, journals, logs, ledgers, etc., all innovations of relatively recent times — that themselves worked to further fuel growth and development.

Of course, were I an economic historian, I would need to argue and document my thesis in a 400-page book. And, even then, my arguments would appropriately be subject to debate and scrutiny.

Information, Not Machines

Tools and physical artifacts distinguish us from other animals. When we see the lack of a direct correlation between growth changes and the invention of the printing press, or see growth changes roughly coincide with the age of machines of the Industrial Revolution, it is easy and natural for us humans to attribute such changes to the tangible device. Indeed, our current fixation on technology is in part due to our comfort as tool makers. But is this association with the technology and the tangible reliable, or (hehe) “artifactual”?

Information, specifically non-biological information passed on through cultural means, is what truly distinguishes us humans from other animals. We have been easily distracted looking at the tangible, when it is the information artifacts (“symbols”) that make us the humans who we truly are.

So, the confluence of cheaper machines (steam printing presses) with cheaper paper (pulp) brought information to the masses. And, in that process, more people learned, more people shared, and more people could innovate. And, yes, folks, we innovated like hell, and continue to do so today.

If the nature of the biological organism is to contain within it genetic information from which adaptations arise that it can pass to offspring via reproduction — an information volume that is inherently limited and only transmittable by single organisms — then the nature of human cultural information is a massive shift to an entirely different plane.

With the fixity and permanence of printing and cheap paper — and now cheap electrons — all prior discovered information across the entire species can now be accumulated and passed on to subsequent generations. Our storehouse of available information is thus accreting in an exponential way, and available to all. These factors make the fitness of our species a truly quantum shift from all prior biological beings, including early humans.

What Now Internet?

The means by which information itself is produced and disseminated are changing and growing. This is an infrastructural innovation that applies multiplier benefits on top of the standard multiplier benefit of information. In other words, innovation in the very basis of information use and dissemination is itself disruptive. Over history, writing systems, paper, the printing press, mass paper, and electronic information have all had such multiplier effects.

The Internet is but the latest example of such innovations in the infrastructural groundings of information. The Internet will continue to support the inexorable trend to more adaptability, more wealth and more participation. The multiplier effect of information itself will continue to empower and strengthen the individual, not in spite of mass media or any other ideologically based viewpoint but due to the freeing and adaptive benefits of information itself. Information is the natural antidote to entropy and, longer term, to the concentrations of wealth and power.

If many of these arguments of the importance of the availability of information prove correct, then we should conclude that the phenomenon of the Internet and global information access promises still more benefits to come. We are truly seeing access to meaningful information leapfrog anything seen before in history, with soon nearly every person on Earth contributing to the information dialog and leverage.

Endnote: And, oh, to answer the rhetorical question of this piece: No, it is information that has been the source of economic growth. The Industrial Revolution was but a natural expression of then-current information and, through its innovations, a source of still newer information, all continuing to feed economic growth.


[1] The historical data were originally developed in three books by Angus Maddison: Monitoring the World Economy 1820-1992, OECD, Paris 1995; The World Economy: A Millennial Perspective, OECD Development Centre, Paris 2001; and The World Economy: Historical Statistics, OECD Development Centre, Paris 2003. All these contain detailed source notes. Figures for 1820 onwards are annual, wherever possible.

For earlier years, benchmark figures are shown for 1 AD, 1000 AD, 1500, 1600 and 1700. These figures have been updated to 2003 and may be downloaded by spreadsheet from the Groningen Growth and Development Centre (GGDC), a research group of economists and economic historians at the Economics Department of the University of Groningen headed by Maddison. See http://www.ggdc.net/.

Posted: August 18, 2007

RDF123

UMBC’s Ebiquity Program Creates Another Great Tool

In a strange coincidence, I encountered a new project called RDF123 from UMBC’s Ebiquity program a few days back while researching ways to more easily create RDF specifications. (I was looking in the context of easier ways to test out variations of the UMBEL ontology.) I put it on my to-do list for testing, use and a possible review.

Then, this morning, I saw that Tim Finin had posted up a more formal announcement of the project, including a demo of converting my own Sweet Tools to RDF using the very same tool! Thanks, Tim, and also for accelerating my attention on this. Folks, we have another winner!

RDF123, developed by Lushan Han with funding from NSF [1], improves upon earlier efforts from the University of Maryland’s Mindswap lab, which had developed Excel2RDF and the more flexible ConvertToRDF a number of years back. Unlike RDF123, these other tools were limited to creating an instance of a given class for each row in the spreadsheet. RDF123, on the other hand, allows users to define mappings to arbitrary graphs and different templates by row.

It is curious that so little work has been done on spreadsheets as an input and specification mechanism for RDF, given the huge use and ubiquity (pun on purpose!) of the format. According to the Ebiquity technical report [1], TopBraid Composer has a spreadsheet utility (one that I have not tested), and there is a new plug-in for Protégé version 4.0 from Jay Kola, also on my to-do list for testing (it requires upgrading to the beta version of Protégé), that supports imports of OWL and RDF Schema.

I have also been working with the Linking Open Data group at the W3C regarding converting the Sweet Tools listing to RDF, and have indeed had an RDF/XML listing available for quite some time [2]. You may want to compare this version with the N3 version produced by RDF123 [3]. The specification for creating this RDF123 file, also in N3 format, is really quite simple:

@prefix d: <http://spreadsheets.google.com/feeds/list/o15870720903820235800 etc., etc.> .
@prefix mkbm: <https://www.mkbergman.com/> .
@prefix exhibit: <http://simile.mit.edu/2006/11/exhibit#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix : <#> .
@prefix e: <http://spreadsheets.google.com/feeds/list/o15870720903820235800 etc., etc.> .
<Ex:e+$1>
  a exhibit:Item ;
  rdfs:label "Ex:$1" ;
  exhibit:origin "Ex:mkbm+'#'+$1^^string" ;
  d:Category "Ex:$5" ;
  d:Existing "Ex:$7" ;
  d:FOSS "Ex:$4" ;
  d:Language "Ex:$6" ;
  d:Posted "Ex:$8" ;
  d:URL "Ex:$2^^string" ;
  d:Updated "Ex:$9" ;
  d:description "Ex:$3" ;
  d:thumbnail "Ex:@If($10='';'';mkbm+@Substr($10,12,@Sub(@Length($10),4)))^^string" .
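
To make the mapping concrete, here is the kind of N3 output such a specification would produce for a single spreadsheet row. The row values below (tool name, category, and so on) are invented purely for illustration and are not drawn from the actual Sweet Tools listing:

e:ExampleTool
  a exhibit:Item ;
  rdfs:label "ExampleTool" ;
  exhibit:origin "https://www.mkbergman.com/#ExampleTool" ;
  d:Category "RDF editor" ;
  d:FOSS "Yes" ;
  d:Language "Java" ;
  d:URL "http://example.org/exampletool" ;
  d:description "A hypothetical RDF authoring tool, used only to illustrate the mapping." .

(The remaining columns, such as d:Posted, d:Updated and d:thumbnail, would be filled in the same row-by-row manner.)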

The UMBC approach is somewhat like GRDDL for converting other formats to RDF, but is more direct because it bypasses the need to first convert the spreadsheet to XML and then transform it with XSLT. This means updates can be automatic, and the difficulty of writing XSLT is replaced with a simple notation, as shown above, for properly mapping column values to labels and properties.

RDF123 has the option of two interfaces in its four versions. The first interface, used by the application versions, is a graphical interface that allows users to create their mapping in an intuitive manner. The second is a Web service that takes as input a combined URL string to a Google spreadsheet or CSV file and an RDF123 map and output specification [3].

The four versions of the software are:

RDF123 is a tremendous addition to the RDF tools base, and one with promise for further development for easy use by standard users (non-developers). Thanks NSF, UMBC and Lushan!

And, Thanks Josh for the Census RDF

Along with last week’s tremendous announcement by Josh Tauberer that he has made 2000 US Census data available as nearly 1 billion RDF triples, these dog days of August have in fact proven to be stellar ones on the RDF front! These two events should help promote an explosion of RDF in numeric data.


[1] Lushan Han, Tim Finin, Cynthia Parr, Joel Sachs, and Anupam Joshi, RDF123: A Mechanism to Translate Spreadsheets to RDF, Technical Report from the Computer Science and Electrical Engineering Dept., University of Maryland, Baltimore County, August 2007, 17 pp. See http://ebiquity.umbc.edu/paper/html/id/368/RDF123-a-mechanism-to-translate-spreadsheets-to-RDF; also, a PDF version of the report is available. The effort was supported by a grant from the National Science Foundation.

[2] This version was created using Exhibit, the lightweight data publishing framework used for Sweet Tools. It allows RDF/XML to be copied from the online Exhibit, though it has a few encoding issues, which required manual adjustments to produce valid RDF/XML. A better RDF export service is apparently in the works for Exhibit version 2.0, slated for release soon.

[3] N3 stands for Notation 3 and is a more easily read serialization of RDF. For direct comparison with my native RDF/XML, you can convert the N3 file at RDFabout.com. Alternatively, you can directly create the RDF/XML output with the slightly different instructions to the online service of: http://rdf123.umbc.edu/server/?src=http://spreadsheets.google.com/pub?key=pGFSSSZMgQNxIJUCX6VO3Ww&map=http://rdf123.umbc.edu/map/demo1.N3&out=XML; note the last statement changing the output format from N3 to XML. Also note the UMBC service address, followed by the spreadsheet address, followed by the specification address (the listing of which is shown above), then ending with the output form. This RDF/XML output validates with the W3C’s RDF validation service, unlike the original RDF/XML created from Sweet Tools that had some encoding issues that required the manual fixing.

Posted: August 8, 2007

Structured Data and UMBEL Will Benefit from a Standard Registration Format

One implication of the structured Web is the question of how, amidst the rapid proliferation of data, you find what is relevant. That purpose is what stimulated the initiation of the UMBEL (Upper Mapping and Binding Exchange Layer) project.

The original specification for UMBEL recognized the need for a reference set of subject “proxies” to help describe what each data set “was about” as well as the need for a variety of binding mechanisms depending on scope and data structure of the source dataset.

At its most general level, the intent of UMBEL is to provide four components in its ‘core’ ontology [1]:

  1. A set of reference subject “proxies” and properties and relations around them
  2. A means of binding the ontologies, classes or subsets of data within each contributing dataset to the ‘core’ UMBEL ontology and those subject proxies
  3. A means of characterizing a given dataset via metadata, and
  4. A means of describing access methods and endpoints for getting at that data.

The first component on subject proxies is largely left to another discussion. The topic of this posting is mostly related to the latter three components of dataset binding and registration mechanisms.

How such road signs might work, the contributions of possible analogs, their differences in providing solutions in and of themselves, and a first-cut outline of the resulting ‘core’ UMBEL ontology are described below.

The Why and How of These Road Signs

‘Road signs’ are simply a shorthand for how to find stuff. The normative case is to have sufficient characterization of datasets such that a central registry can aid their discovery, look-up and productive access and use. Yet registration can be an onerous task, and one not generally easily or willingly undertaken by publishers or providers.

These challenges lead to two important design considerations. First, only minimal characterization should be required for initially registering a dataset; the remaining characteristics should be optional. The incentive over time to complete such optional fields is to signal to consumers that fully characterized datasets may be more dependable or authoritative. It is possible, for example, to envision external qualification rules or routines that “score” competing datasets providing similar information based on the completeness of dataset characterization.

Second, any party should be allowed to register and characterize a dataset. There may be motivations by non-publishers to do so, for altruistic or other reasons. However, in the case of disputes over the accuracy of characterization, the owner or publisher should have final say. Another open question is whether competing characterizations or different registrations should be allowed for the same dataset.

These considerations are further complicated by the range of scope, scale, data content and formalism found on the real-world Web.

In the spirit of not re-inventing the wheel, we began a process to discover what other communities have done when faced with similar problems. The two closest analogs are, firstly, the library community and its need to describe digital archives and collections and, secondly, the general approaches devoted to Web services, including its dedicated language, WSDL (Web Services Description Language).

Digital Collections and Archives

Librarians and information architects have been active for at least the past decade in efforts to describe and relate digital collections to one another. These efforts have been geared to search, general look-up and interlibrary loans and sharing. This community of practice, while embracing a variety of somewhat competing and overlapping schemes, has also (from an outsider’s viewpoint) come up with a general consensus view as to how to describe these archives and collections.

(The library community still tends to use the terminology of ‘metadata’ and descriptors, whereas the ontology and RDF communities tend to speak more of classes, properties and instances. However, the net intent and outcome still appears much the same.)

A common reference point to these schemes is the Dublin Core Collection Application Profile (DCCAP), a specification of how metadata terms from the Dublin Core metadata initiative (DCMI) and other vocabularies can be used to construct a description of a collection in accordance with the DCMI Abstract Model. There is also a more easily read summary of the DCCAP [2]. Though, again, there are differences in terminology, the presentation of this scheme is very much in keeping with the format of a W3C specification (such as for WSDL, see below).

One of the first widely embraced efforts of the community is the Open Archives Initiative – Protocol for Metadata Harvesting (OAI-PMH), begun in the late 1990s. OAI-PMH provides specifications for both data providers and service providers in how to assign and describe collection metadata. The OAI Protocol has become widely adopted by many digital libraries, institutional repositories, and digital archives, with total sources registered numbering into the thousands [3]. These large institutional repositories are also increasingly being indexed by large search engines (such as Google Scholar).

The National Information Standards Organization (NISO) began a MetaSearch Initiative (NISO-MS) that resulted in the development of a collection description schema. Though NISO recently reorganized its content and collection activities, the draft NISO Collection Description Specification [4] remains a readable overall reference for these initiatives. Like related initiatives, the draft also uses the DCCAP format.

A similar effort was initiated in the United Kingdom called the Information Environment Service Registry. IESR was designed to make it easier for other applications to discover and use materials which will help their users’ learning, teaching and research. IESR’s various terms (namespaces, classes and properties) and controlled vocabularies are very helpful to UMBEL.

Another example is the Ockham Digital Library Service Registry (DLSR), funded by the National Science Digital Library (NSDL) initiative, which enables service-based digital libraries to interoperate. Efforts such as this, in turn, have led to interest in exploiting “light-weight” protocols and open source tools in the community [5]. For example, there is an interesting discussion of tools to implement digital library collections and services in D-Lib Magazine [5].

These efforts, among many across the digital library community including the related National Digital Information Infrastructure and Preservation Program (NDIIPP) at the Library of Congress, represent tremendous efforts to describe digital collections. Clearly, subsets of this learning can be applied directly to UMBEL in relation to registry and dataset metadata.

Almost all of these schemes use XML data serializations. Our investigations to date have not been able to turn up any RDF representations, though they are surely coming. Fortunately, all of the DCCAP-based efforts have an RDF-like design, and most properties have URIs and defined namespaces.

Web Services and Bindings

The Web Services Description Language (WSDL) is an XML-based language for describing how to communicate with, and therefore interoperate with, Web services. WSDL defines a service as accessible Internet endpoints (or ports), supported operations and related messages. It was first proposed to the W3C in March 2001; only recently did WSDL 2.0 become a W3C Recommendation.

A service is a definition of what kinds of operations can be performed and the messages involved. A port is defined by associating a network address with a reusable binding, with a collection of such ports constituting the service. Messages are abstract descriptions of the data being exchanged, and port types are abstract collections of supported operations. The combination of a specific network protocol with a specific message and data format for a particular port type constitutes a reusable binding.

These definitions are kept abstract from any concrete use or instance, enabling the service definition to be reused and to act as a public interface to the service.

Though there are some important differences from datasets (see next subsection), there has now been sufficient use and exposure of WSDL to inform how to construct sufficiently abstract and reusable interfaces and bindings on the Web. Especially useful is the recent WSDL 2.0 and its draft Web Services Description Language (WSDL) Version 2.0: RDF Mapping specification [6].

WSDL, by its use of XML schema, is not well suited to combining vocabularies and definitions. As a supplement, various discussions on data binding, including from Microsoft (with respect to data source objects — DSOs) and from others such as the W3C on XML data bindings, help provide additional perspective [7].

WS-Notification and Topic Maps

Another perspective on this problem comes from efforts to add topic structure to Web services. This standards effort, called WS-Notification, was completed by the OASIS group in late 2006. According to its published standards [8]:

WS-Notification is a family of related specifications that define a standard Web services approach to notification using a topic-based publish/subscribe pattern. It provides standard message exchanges to be implemented by service providers that wish to participate in Notifications, standard message exchanges for a notification broker service provider (allowing publication of messages from entities that are not themselves service providers), operational requirements expected of service providers and requestors that participate in notifications, and an XML model that describes topics. The WS-Notification family of documents includes three normative specifications: WS-BaseNotification, WS-BrokeredNotification, and WS-Topics.

There are some similarities to the binding mechanisms and topic relations required by UMBEL. WS-BaseNotification bears some resemblance to the binding mechanisms portions, and WS-Topics has some relation to the subject requirements. Again, however, the perspective is limited to Web services and has as its intent the general interchange of topic structures, not the use of a proxy reference set.

Why the Need for a ‘Third Way’?

So, we can see that the library community has made much progress in defining how to characterize a digital collection with regard to its source, ownership, scope and nature, while the Web applications community has made much progress with respect to service definitions and binding mechanisms. Some components have direct applicability to UMBEL.

Yet a lightweight dataset binding and subject reference structure for the Web — namely, UMBEL’s intended objective — has a number of very important differences from these other efforts. These differences prevent direct adoption of any current schema. Some of these summary distinctions as they apply to general Web data are:

  • There are a variety of potential registrants for UMBEL datasets including original owners and developers (the case for the other approaches), but also third-parties and consumers
  • Therefore, there is a broader spectrum of possible knowledge and ability to characterize the datasets, which suggests more flexibility, more optional items and fewer mandatory items
  • Relatedly, those performing the characterizations may be untrained in formal metadata and cataloging techniques or may need to rely on automated and semi-automated characterization methods; this increases the risk of error, imprecision and uncertainty
  • The relevant data may or may not be part of a dedicated resource; it may be fragmentary or embedded
  • The source data resides in an extreme range in possible scale, from a single datum (granted, generally of quite limited value) to the largest of online databases
  • The existence of a huge diversity of data formats and protocols (WSDL has a similar challenge)
  • The need to accommodate a choice of many different data serializations (WSDL has a similar challenge)
  • WSDL’s service and endpoint considerations never explicitly accounted for data federation
  • The desire to get all forms into other formats or canonical forms for mashups, true federation, etc.

These differences suggest that the UMBEL ontology needs to be both broader and less prescriptive than other approaches.

Some Initial Design Considerations

We can thus combine the best transferable elements of existing schemes with the unique requirements and perspective of UMBEL. Initially, we are not adopting specific definitions or portions of possibly contributing schema. Rather, in this first cut, we are only attempting to capture the necessary scope and concepts. Later, after definitions and closer inspection of specific schema, we will refine this organization and relate it to particular namespaces.

The major “superclass” in this organization is the:

  • Profile — this definition is similar to that used for the DCCAPs (indeed, it even adopts the “profile” label!), and is closely allied to the idea of Description in WSDL. The Profile represents the broad metadata characteristics of a dataset including ownership, rights and access policies, and other administrative aspects. Generally, a single dataset, no matter what its size or scope, may have a single profile, though a federated knowledge base from multiple sources may contain multiples of these. This class does not include specific details regarding interface format or subject scope. Note that the next classes are themselves subclasses of this Profile

The remaining classes are subsidiary to the Profile and inherit and refer to its metadata. The first two subclasses are also largely administrative in nature:

  • Annotator — this set of properties describes the annotator of the dataset metadata. UMBEL is designed to allow third-parties to describe others’ datasets with optional levels of detail. In the case of disputes, the dataset owner characterizations would hold sway, but there also may be circumstances where multiple characterizations are desirable and allowed
  • Rights — these properties describe the use and access rights for the data. Of course, only the owner may set such conditions, but third parties may provide this characterization if the source site spells out these conditions.

The remaining three classes contain the real guts of the data aspects:

  • Interface — the technical details of the data schema and structure within the dataset (or portions thereof) are defined in the Interface properties. (Interface is similar to the idea of the Interface and portions of the Service classes within WSDL, as with similar analogs for data exchange). The endpoints and access methods for accessing the actual data are by definition part of this Interface class. There is little or no consensus regarding how to classify and organize these details, so that it is likely much of the terminology in this area will be actively discussed and revised. See further [9] for one of the more comprehensive surveys
  • Binding — the Binding properties set the mechanisms for relating the dataset or portions thereof to one or more subject proxies. There may be more than one binding for a given profile or different portions of a dataset
  • SubjectProxy — finally, the SubjectProxy class, representing a likely extension to the core UMBEL for the enumeration of the subject proxies, becomes the linkage to the subject coverage of the datasets.

These classes have a hierarchical relationship similar to the following, with multiple Interface, Binding and SubjectProxy mappings allowable for any given Profile:

Profile
 +--- Annotator
 +--- Rights
 +--- Interface ----- Binding ----- SubjectProxy

Presented below in simple outline form only are these first-proposed classes, and the associated properties and instances of those properties informing the development of the ‘core’ UMBEL ontology. Some definitions of classes are also shown:

Class / subClass [1], with its definition, followed by its property / asPredicate [2] pairs and note references:

Profile: the broad metadata characteristics of a dataset including ownership, rights and access policies, and other administrative aspects; generally only one per dataset, though there will be multiples in a repository
  abstract / hasAbstract
  alternativeTitle / hasAlternativeTitle
  collection / isPartofCollection
  conceptScheme / hasConceptScheme [23]
  crawling / hasCrawlingPolicy
  dateSubmitted / hasSubmittedDate
  description / hasDescription
  language / hasLanguage [3]
  location / hasLocation
  modified / wasModifiedOn
  namespace / hasNamespace
  ontology / hasOntology
  owner / hasOwner
  registry / isListedOnRegistry [4,5]
  sitemap / hasSitemap
  size / hasSize [6]
  title / hasTitle
  type / isOfType [7]
  version / hasVersion
  view / isBestViewedUsing [8]

Annotator [21]: description of the entity that has provided the current Profile description (may be third parties, but with deference to the owner when there are differences)
  annotationDate / hasAnnotationDate
  annotationNote / hasAnnotationNote
  annotatorLocator / hasAnnotatorLocator [9]
  annotatorName / hasAnnotatorName
  annotatorType / isAnnotatorType [5,27]

Binding: the linkage made between the set or subsets of data within the datasets and the actual subject proxy(ies); may be multiples for a given dataset
  about / isAbout [10,24] (cross-reference to the actual subject proxy IDs; may be multiples)
  bindingName / hasBindingName [10]
  bindingScope / hasBindingScope [5,12]
  bindingType / hasBindingType [5,11]

Interface [13]: the technical characteristics of the dataset that provide the essential information for enabling retrieval and interoperability; analogous to Interface in WSDL
  bindingName / hasBindingName [10]
  dataFormalism / hasDataFormalism [5,14]
  endpointLocation / hasEndpointLocation
  endpointType / hasEndpointType [5,15]
  pattern / usesPattern [16]
  pingType / hasPingType [17]
  queryLanguage / usesQueryLanguage [5,18]
  serialization / hasSerialization [5,19]
  translator / usesTranslator [20]

Rights [21]: various rights and restrictions to accessing, using or reproducing the subject data
  accessRight / hasAccessRight
  copyright / hasCopyright
  license / hasLicense
  rightsNote / hasRightsNote

SubjectProxy [22]: a preferred label that acts as a proxy to the topic concept(s) to which the given dataset content is bound; may be multiples, and backed with ‘synset’ synonyms
  altLabel / isAlternateLabel [23]
  bindingName / hasBindingName [10]
  prefLabel / isPreferredLabel [23]
  primarySubject / isPrimarySubjectOf [23]
  proxyID / hasProxyID [25]
  subjectLanguage / hasSubjectLanguage [28]
  subjectNote / hasSubjectNote [26]

General table notes are provided under the endnotes [10].

Please note that the specific subject proxies and their defining classes and properties are being handled in a separate document. This outline, as it is revised, is informing the first N3 version of the ‘core’ UMBEL ontology.
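
To give a flavor of where this is headed, here is a minimal, purely illustrative N3 sketch of what a single dataset registration might look like using the class and property names from the outline above. The namespace URI, instance names and literal values are all hypothetical placeholders, not part of any agreed specification:

@prefix umbel: <http://umbel.org/core#> .
@prefix : <#> .

:exampleProfile
  a umbel:Profile ;
  umbel:hasTitle "Sweet Tools" ;
  umbel:hasOwner "Michael Bergman" ;
  umbel:hasDescription "Hypothetical registration of a semantic Web tools listing, shown for illustration only" .

:exampleInterface
  a umbel:Interface ;
  umbel:hasDataFormalism "RDF" ;
  umbel:hasSerialization "N3" ;
  umbel:hasEndpointType "fileExport" .

:exampleBinding
  a umbel:Binding ;
  umbel:hasBindingType "HTTP" ;
  umbel:isAbout :exampleSubjectProxy .

:exampleSubjectProxy
  a umbel:SubjectProxy ;
  umbel:isPreferredLabel "semantic Web tools" .

Whether such properties attach directly to subclass instances as sketched here, or hang off a single Profile node, is exactly the kind of detail still under discussion.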

This structure is still quite preliminary. (For example, data type definitions and interface constructs are still in active discussion, without accepted standards.) Comments on this draft UMBEL ‘core’ ontology outline are welcomed either at the UMBEL discussion forum on Google or at the specific outline page on the UMBEL wiki.

Revisiting the ‘Lightweight’ Designation

We can thus see that there is only minimal semantics in the potential linkage between UMBEL datasets.

One way to place this system is through an interesting approach called the Levels of Conceptual Interoperability Model. These levels can be viewed through the following conceptual diagram [11]:

[Figure: Levels of Conceptual Interoperability Model]

Under this model, UMBEL resides right at the interface between Levels 2 and 3, where syntactic interoperability is achieved but with only limited semantic understanding. In fact, this represents a clear analog to AI3‘s discussion of the structured Web, which is very much concerned with the syntactic level, with the negotiation of semantics as the next challenge.

This posting is part of a new, occasional series on the Structured Web.

[1] The UMBEL ontology has two parts. The first ‘core’ part is a flat listing, or pool, of concrete subject topics that are the proxy binding points for external data sets. The second ‘unofficial’ part is a reference look-up structure of hierarchical and interlinked subject relationships.

[2] The “Dublin” in the name refers to Dublin, Ohio, where the work originated from an invitational workshop hosted in 1995 by the Online Computer Library Center (OCLC), a library consortium that has its headquarters there. The “Core” refers to the fact that the metadata element set is a basic but expandable “core” list, used in a similar way to the UMBEL ‘core’.

[3] There are several large registries of OAI-compliant repositories: the OAI registry at the University of Illinois at Urbana-Champaign, the Open Archives list of registered OAI repositories, the Celestial OAI registry, EPrints’ Institutional Archives Registry, Openarchives.eu (the European guide to OAI-PMH compliant repositories in the world), and ScientificCommons.org (a worldwide service and registry).

[4] The Standards Committee BB (Task Group 2): Collection & Service Descriptions, NISO Z39.91-200x, Collection Description Specification, November 2005. It also specifies an XML binding for serializing such descriptions for interchange between applications.

[5] Xiaorong Xiang and Eric Lease Morgan, “Exploiting ‘Light-weight’ Protocols and Open Source Tools to Implement Digital Library Collections and Services,” D-Lib Magazine 11(10), October 2005. See http://www.dlib.org/dlib/october05/morgan/10morgan.html. Also see its MyLibrary reference for examples of facets applied to collections.

[6] Jacek Kopecký, Ed., Web Services Description Language (WSDL) Version 2.0: RDF Mapping, W3C Working Group Note, 26 June 2007. See http://www.w3.org/TR/wsdl20-rdf.

[7] Data binding from the Microsoft perspective is described at http://msdn2.microsoft.com/en-us/library/ms531387.aspx. The W3C’s perspective on XML data binding is described, for example, in Paul Downey, ed., XML Schema Patterns for Common Data Structures; see http://www.w3.org/2005/07/xml-schema-patterns.html.

[8] Also, there is an entire corpus related to topic maps. In specific reference to Web services, there is the so-called WS-Topics: Web Services Topics 1.3 (WS-Topics), OASIS Standard, 1 October 2006; see http://docs.oasis-open.org/wsn/wsn-ws_topics-1.3-spec-os.htm. This is used in conjunction with WS-Notification. Also see WS-BaseNotification: Web Services Base Notification 1.3 (WS-BaseNotification), OASIS Standard, 1 October 2006; see http://docs.oasis-open.org/wsn/wsn-ws_base_notification-1.3-spec-os.htm. PDFs of these documents are also available.

[9] For an initial introduction with a focus on Xcerpt, see François Bry, Tim Furche, and Benedikt Linse, “Let’s Mix It: Versatile Access to Web Data in Xcerpt,” in Proceedings of the 3rd Workshop on Information Integration on the Web (IIWeb 2006), Edinburgh, Scotland, 22nd May 2006; also as REWERSE-RP-2006-034, see http://rewerse.net/publications/download/REWERSE-RP-2006-034.pdf. For a more detailed treatment, see T. Furche, F. Bry, S. Schaffert, R. Orsini, I. Horrocks, M. Krauss, and O. Bolzer, Survey over Existing Query and Transformation Languages, Deliverable I4-D1a Revision 2.0, REWERSE, 225 pp., April 2006. See http://rewerse.net/deliverables/m24/i4-d9a.pdf.

[10] Here are the general table notes:

[1] SubClasses have the advantage of inheritance and shared metadata; see main text for full subClass path
[2] hasPredicate is actually my preferred format; thoughts?
[3] Standard ISO languages
[4] extendable base listing including NISO, IESR, DLSR, DCMI, etc.; need completion
[5] uses the idea of an extended base class per XML Schema; the enumerated listings thus only need be partially complete; see below for most listings
[6] number of records or TBD metric?
[7] possibly unnecessary; can not see enumeration of profile Types
[8] related to idea of Fresnel or Zitgist “preferred” viewing format or XSLT-type stylesheet
[9] should there be other types than FOAF (what of other formal listings or organizations v. individuals?)?
[10] not sure how to do this; need a x-ref between two class categories (e.g., Binding <-> Interface, Binding <-> SubjectProxy)
[11] are there patterns for bindingTypes or a likely enumerated listing?

HTTP
JDBC
ODBC
RPC
RPI
SOAP
XML-RPC

[12] need an enumerated list (?) going from individual annotation / metadata (a la RDFa) to complete dataset

webPage
dataSet
dataRecords

[13] perhaps not best name; related to Interface and Services in WSDL, could also be called Construct, Composition, others
[14] see possible dataFormalisms below; needs completion; could be named differently (schema, format, etc.)

Atom
eRDF
Microformats
OPML
Other (unspecified)
Other Ontology
OWL DL
OWL Full
OWL Lite
RDF
RDFa
RDF-S
RSS
Spreadsheet
Topic Map
WebPage
XFML

[15] see possible endpointTypes below

dropdownList
fileExport
other Query formats ???
queryBox
SPARQL

[16] patterns are fairly prominent in WSDL and XML Schema; applicable here?
[17] need to discuss
[18] see possible queryLanguages below; needs completion

DQL
IR (standard text search)
N3QL
R-DEVICE
RDFQ
RDQ
RDQL
RQL
SeRQL
SPARQL
SQL
Versa
Xcerpt
XPath
XPointer
XQuery
XSLT
XUL

[19] see possible serializations below; needs completion

Atom
Gbase
html
JSON
JSON-P
N3
RDF/A
RDF/XML
Turtle
XML

[20] GRDDL, RDFizers, and various converters/translators; likely needs an expandable, enumerated list
[21] unlike the other subClasses, this is closely aligned with the standard Profile metadata
[22] likely a separate namespace (e.g., ‘umbels’) that will contain additional information such as synsets, etc. See text.
[23] SKOS concept
[24] need to check on overlap/replacement/use with SKOS subjectIndicator property
[25] should names or IDs be used for subjectProxys? (IDs have the advantage of changing labels and use in other languages)
[26] seems similar or identical to SKOS scopeNote
[27] see possible annotatorTypes below; needs completion

archivist
bot
owner
repository
third-party

[28] similar to dc:language, but the language of the resource (its metadata characterization) must be kept separate from that of the actual subject proxies

[11] A. Tolk, S.Y. Diallo, C.D. Turnitsa and L.S. Winters, “Composable M&S Web Services for Net-centric Applications,” Journal for Defense Modeling & Simulation (JDMS), Volume 3, Number 1, pp. 27-44, January 2006.

Posted: July 24, 2007

Huynh Adds to His Winning Series of Lightweight Structured Data Tools

David Huynh, a Ph.D. student and developer par excellence from MIT’s Simile program, has just announced the beta availability of Potluck. Potluck allows casual users to mash up data on the Web using direct manipulation and simultaneous editing techniques, generally (but not exclusively!) based on Exhibit-powered pages.

Besides Potluck and Exhibit, David has also been the lead developer on such innovative Simile efforts as Piggy Bank, Timeline, Ajax, Babel, and Sifter, as well as a contributor to Longwell and Solvent. Each merits a look. Those familiar with these other projects will notice David’s distinct interface style in Potluck.

Taking Your First Bites

There is a helpful 6-min movie on Potluck that gives a basic overview of use and operation. I recommend you start here. Those who want more details can also read the Potluck paper in PDF, just accepted for presentation at ISWC 2007. And, after playing with the online demo, you can also download the beta source code directly from the Simile site.

Please note that Firefox is the browser of choice for this beta; Internet Explorer support is limited.

To invoke Potluck, you simply go to the demo page, enter two or more appropriate source URLs for mashup, and press Mix Data:

[Screenshot: the Potluck entry screen]

(You can also get to the movie from this page.)

Once the datasets are loaded, all fields from the respective sources are rendered as field tags. To combine different fields from different datasets, the respective field tags (color coded by dataset) to be matched are simply dragged to a new column. Differences in field value formats between datasets can be edited with an innovative approach to simultaneous group editing (see below). Once fields are aligned, they then may be assigned as browsing facets. The last step in working with the Potluck mashup is choosing either tabular or map views for the results display.

Potluck is designed to mashup existing Exhibit displays (JSON format), and is therefore lightweight in design. (Generally, Exhibit should be limited to about 500 data records or so per set.)

However, with the addition of the appropriate type name when specifying one of the sources to mash up, you can also use spreadsheet (xls), BibTeX, N3 or RDF/XML formats. The demo page contains a few sample data links. Additional sample data files for different mime types are (note entry using a space with type designator at end):

Besides the standard tabular display, you can also map results. For example, use the BibTeX example above and drop the “address” field into the first drop target area. Then choose Map at the top of the display to get a mapping of conference locations.

In my own case, I mashed up this source and the xls sample on presidents, and then plotted out locations in the US:

[Screenshot: an example Potluck map]

Given the capabilities in some of the other Simile tool sets, incorporating timelines or other views should be relatively straightforward.

Pragmatic Lessons and Cautions with Semantic Mashups

Different datasets name similar or identical things differently and characterize their data differently. You can’t combine data from different datasets without resolving these differences. These various heterogeneities — which by some counts can be 40 or so classes of possible differences — were tabulated in one of my recent structured Web posts.

There has been considerable discussion in recent days on various ontology and semantic Web mailing lists about whether certain practices can or cannot solve questions of semantic matching. Some express sentiments that proper use of URIs, use of similar namespaces and use of some predicates like owl:sameAs may largely resolve these matters.

However, discussion in David’s ISWC 2007 paper and use of the Potluck demo readily show the pragmatic issues in such matches. Section 2 in the paper presents a readable scenario for real-world challenges in how a historian without programming skills would go about matching and merging data. Despite best practices, and even if all are pursued, actually “meshing” data together from different sources requires judgment and reconciliation. One of the great values of Potluck is as a heuristic and learning tool for making prominent these real-world semantic heterogeneities.

The complementary value of Potluck is its innovative interface design for actually doing such meshing. Potluck is a case argument that pragmatic solutions and designs only come about by just “doing it.”

Easy, Simultaneous Editing

(Note: Though a diagram illustrates some points below, it is no substitute for using Potluck yourself.)

Potluck uses a simple drag-and-drop model for matching fields from different datasets. In the left-hand oval in the diagram below, the user clicks on a field name in a record, drags it to a column, and then repeats that process for matching fields in a record of a different dataset. In the instance below, we are matching the field names “address” and “birth-place”, which then also get color coded by dataset:

[Screenshot: Potluck's simultaneous group editing]

This process can be repeated for multiple field matches. The merged fields themselves can be subsequently dragged-and-dropped to new columns for renaming or still further merging.

The core innovation at the heart of Potluck is what happens next. By clicking on Edit for any record in a merged field, the dialog shown above pops up. This dialog supports simultaneous group editing based on LAPIS, another MIT tool for editing text with lightweight structure, developed by Rob Miller and team.

As implemented in JavaScript in Potluck, LAPIS first groups data items by similar patterned structure; this initial grouping is what determines the various columns in the above display. Then, when the user highlights any pattern in a column, these are repeated (see same cursors and shading in the right-hand oval) for all entries in the column. They can then be deleted (for pruning, in this case removing ‘USA’), or cut-and-pasted (such as for changing first- and last-name order) for all items in a column. (Single item editing is obviously also an option.)

The first grouping mostly ensures that data formatted differently in different datasets are displayed in their own columns. One data form is used for the merged field, and all other columns are group edited to conform. The actual patterns are based on runs of digits, letters, white space, or individual punctuation marks and symbols, which are then greedily aligned, first for the column grouping and then for cursor alignment within columns on highlighted patterns.

The net result is very fast and efficient bulk editing. This approach points the way to more complicated pattern matches and other substitution possibilities (such as unit changes or date and time formats).

Rough Spots and A Hope

I was tempted to award Potluck one of AI3‘s Jewels and Doubloons Awards, but the tool is still premature, with rough spots and gaps. For example, IE and other browser support needs to be improved, and it would be helpful to be able to delete a record from inclusion in the mashup. (Sometimes only after combining is it clear that some records don’t belong together.)

One big issue is that the system does not yet work well with all external sites. For example, my own Sweet Tools Exhibit refused to load and the one from the European Space Agency’s Advanced Concept Team caused JavaScript errors.

Another big issue is that whole classes of functionality, such as writing out combined results or more data view options, are missing.

Of course, this code is not claimed to be commercial grade. What is most important is its pathbreaking approach to semantic mashups (actually, what some others such as Jonathan Lathem have called ‘smashups’) and interfaces and approaches to group editing techniques.

I hope that others pick up on this tool in earnest. David Huynh is himself getting close to completing his degree and may not have much time in the foreseeable future to continue Potluck development. Besides Potluck’s potential to evolve into a real production-grade utility, I think its potential to act as a learning test bed for new UI approaches and techniques for resolving semantic heterogeneities is even greater.

Posted: July 22, 2007

Language is Essential to Communication

I recently began a series on the structured Web and its role in the continued evolution of the Internet. This next installment in the series probes in greater depth the question of What is structure? in reference to data and Web expressions, with an emphasis on terminology and definitions.

This post ties in with a new best-practices guide published by Chris Bizer, Richard Cyganiak, and Tom Heath — called the Linked Data Publishing Tutorial — that provides definitions and viewpoints from the perspective of the use of RDF (Resource Description Framework) and W3C practices. My initial post in this series and their tutorial occasioned Kingsley Idehen to post his own Linked Data and the Web Information BUS entry, adding the valuable perspective of practices and terminology going back to the early 1990s in object and relational database systems and standards such as ODBC.

All of these efforts share a desire to craft practices, language and terminology to help promote the availability and interoperability of useful data on the Web.

A challenge for the semantic Web community is to craft language that is clear and understandable to the lay public and Web developers, something which it has often done poorly. Some problematic terms include information resource, non-information resource, dereferencing, bnode, content negotiation, representation, URL re-writing, and others.

This piece only tackles ‘non-information resource‘ head on. Discussion of other problematic terms awaits another day.

These posts caused Kingsley and me to engage in a prolonged discussion about definitions and terms. I acted as the unofficial scribe, which I attempt to more generally capture and argue below. If you like the ideas below, you may credit both of us; if you don’t, ascribe any errors or omissions to me alone [1].

The Basic Framework and Argument

The basic observation related to the structured Web is that it is a transition phase from the initial document-centric Web to the eventual semantic Web. In this transitional phase, the Web is becoming much more data-centric. The idea of ‘linked data‘ also is a component of this transition, but is more precise in meaning because by definition the data must be expressed as RDF in order to aid interoperability.

The challenge is to convert existing Web pages and data into the structured Web with every resource accessible via an unambiguous URI. Insofar as this conversion also occurs to RDF, it will promote linked data interoperability.
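
A trivial N3 fragment may help make the idea concrete (the URIs and values below are invented solely for illustration): each thing of interest gets its own URI, statements about it are expressed as triples, and links to other URIs are what make the data “linked”:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://example.org/people/jane#me>
  a foaf:Person ;
  foaf:name "Jane Example" ;
  foaf:knows <http://other.example.net/people/joe#me> ;
  rdfs:seeAlso <http://example.org/projects/structured-web> .

In the linked data practice described in the Bizer, Cyganiak and Heath tutorial, dereferencing each of those URIs should in turn return useful RDF about that resource, which is what allows such fragments from different sites to be crawled and joined.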

Transitions of such a profound nature — even if short of the full vision of the semantic Web — create the need for new language and terminology to aid understanding and communication. Sometimes, as well, longstanding terms and practices may need to be refined or challenged. In any event, notions of simplistic versioning such as ‘Web 3.0‘ add little to understanding and communication.

Let’s Begin with Data Structure

Independent of the Internet or the Web, let’s begin our discussion about the nature of structure in its application to data [2,3]. Peter Wood, a professor of computer science at Birkbeck College at the University of London, provides succinct definitions of the “structure” of various types of data [4]:

  • Structured Data — in this form, data is organized in semantic chunks or entities, with similar entities grouped together in relations or classes. Entities in the same group have the same descriptions (or attributes), while the descriptions for all entities in a group (the schema): a) have the same defined format; b) have a predefined length; c) are all present; and d) follow the same order. Structured data is what is normally associated with conventional databases, such as relational transactional ones, where information is organized into rows and columns within tables. Spreadsheets are another example. Nearly all conventional database management systems (DBMSs) are designed for structured data
  • Unstructured Data — in this form, data can be of any type, does not necessarily follow any format or sequence, does not follow any rules, is not predictable, and can generally be described as “free form.” Examples of unstructured data include text, images, video or sound (the latter two also known as “streaming media”). Generally, “search engines” are used for retrieval of unstructured data via querying on keywords or tokens that are indexed at the time of data ingest, and
  • Semi-structured Data — the idea of semi-structured data predates XML but not HTML (with the actual genesis better associated with SGML; see below). Semi-structured data is intermediate between the two forms above, wherein “tags” or “structure” may be associated with or embedded within unstructured data. Semi-structured data is organized in semantic entities, similar entities are grouped together, entities in the same group may not have the same attributes, the order of attributes is not necessarily important, not all attributes may be required, and the size or type of some attributes in a group may differ. To be organized and searched, semi-structured data should be provided electronically from database systems, file systems (e.g., bibliographic data, Web data) or via data exchange formats (e.g., EDI, scientific data, XML).

We can thus view data structure as residing on a spectrum (also shown with “typical” storage and indexing frameworks on the top line):

For the past few decades, structured data has typically been managed by database management systems (DBMSs) or spreadsheets, while unstructured data has been handled by the text indexing systems used by search engines, or has simply sat unindexed in file systems or repositories.

Semi-structured data models are sometimes called “self-describing” (or schema-less) [5]. The first known definition of semi-structured data comes from Peter Schäuble in 1993 [6]: “We call a data collection semistructured if there exists a database scheme which specifies both normalized attributes (e.g., dates or employee numbers) and non-normalized attributes (e.g., full text or images).” More current usage (see the Wood definition above) also includes the notion of labeled graphs or trees with the data stored at the leaves and the schema information contained in the edge labels of the graph. Semi-structured representations also lend themselves well to data exchange or the integration of heterogeneous data sources.
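To make this spectrum concrete, here is a small illustrative sketch of my own (the record, its field names and its values are entirely hypothetical) showing the same kind of information expressed in each of the three forms, using Python literals:

# Structured: a fixed schema where every record has the same ordered, typed attributes
structured_row = ("emp-1001", "Jane N. Smith", "2007-07-22")  # (id, name, hire_date)

# Semi-structured: self-describing; attributes are labeled but optional and unordered
semi_structured_record = {
    "name": "Jane N. Smith",
    "hired": "2007-07-22",
    "notes": "transferred from the Salem office",  # extra attribute other records may lack
}

# Unstructured: free-form text; any structure must be extracted rather than declared
unstructured_text = "Jane N. Smith, who transferred from Salem, was hired on July 22, 2007."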

HTML tags within Web documents are a prime example of semi-structured data, as are text-based data transfer protocols or serializations [7]. Semi-structured data is also a natural intermediate form when “structure” is to be extracted from standard text through techniques generally called ‘information extraction’ (IE). For example, here is the kind of structure (highlighted in yellow in the original posting) that might be extracted from a death notice or obituary [8]:

John A. Smith of Salem, MA died Friday at Deaconess Medical Center in Boston after a bout with cancer. He was 67. Born in Revere, he was raised and educated in Salem, MA. He was a member of St. Mary’s Church in Salem, MA, and is survived by his wife, Jane N., and two children, John A., Jr., and Lily C., both of Winchester, MA. A memorial service will be held at 10:00 AM at St. Mary’s Church in Salem.

This notice contains a great deal of information including names, places, dates and relationships, which once extracted, can be separately indexed or manipulated. Virtually all text-based documents can have similar structure extracted.
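As a purely hypothetical sketch of how such extraction might begin, the following Python fragment applies a few crude regular-expression rules to the opening sentences of the notice; real IE engines rely on much richer grammars, gazetteers and statistical models, but the principle of lifting entities out of free text into semi-structured form is the same:

import re

notice = ("John A. Smith of Salem, MA died Friday at Deaconess Medical Center "
          "in Boston after a bout with cancer. He was 67.")

entities = {
    "places": re.findall(r"\b[A-Z][a-z]+, [A-Z]{2}\b", notice),           # "City, ST" patterns
    "ages": re.findall(r"\bwas (\d{1,3})\b", notice),                     # "was NN" patterns
    "names": re.findall(r"\b[A-Z][a-z]+ [A-Z]\. [A-Z][a-z]+\b", notice),  # "First M. Last" patterns
}

print(entities)
# {'places': ['Salem, MA'], 'ages': ['67'], 'names': ['John A. Smith']}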

The Web has been a prime source of growth of semi-structured data, most often through text-based data serializations [9] and mark-up languages [10].

Depending on point of view or definition, RDF can be called either semi-structured or structured data. It resides squarely at the transition point between these two categories on this structural continuum.

A Variety of Web ‘Resources’

A couple of Web concepts often cause confusion and difficulty for some users: 1) Uniform Resource Locators (URLs) v. Uniform Resource Identifiers (URIs); and 2) the concept of “resources” themselves. As it happens, a pretty accessible discussion of URIs was recently posted, which I recommend. Here, instead, we’ll focus on “resources.”

The concept of a resource is basic to the Web’s architecture and is used in the definition of its fundamental elements, including, obviously, URL and URI above. The parlance of the semantic Web, as well, is built around the notion of abstract resources and their semantic properties. The data model and languages of the Resource Description Framework (RDF) squarely revolve around this “resource” concept.

The first explicit definition of resource related to the Web is found in RFC 2396 in August 1998:

A resource can be anything that has identity. Familiar examples include an electronic document, an image, a service (e.g., “today’s weather report for Los Angeles”), and a collection of other resources.

However, in the context of the Internet, not all resources so defined (such as a person or a company) can be retrieved, while electronic resources like an image or Web page can. Thus, a first challenge arises in the concept of resource and its locational address: some are actual and physical, others are abstract or referential.

According to Wikipedia’s discussion of resources:

The concept of resource has evolved during the Web’s history, from the early notion of a static addressable document or file, to a more generic and abstract definition, now encompassing every thing or entity that can be identified, named, addressed or handled, in any way whatsoever, in the Web at large, or in any networked information system. The declarative aspects of a resource (identification and naming) and its functional aspects (addressing and technical handling) were not clearly distinct in the early specifications of the Web . . . .

The need to somehow make this distinction between actual or physical resources vs. abstract or referential resources led to much discussion and controversy sometimes known as the httpRange-14 issue [11], resolved by the W3C’s Technical Architecture Group (TAG) in 2005. The TAG defined a distinction between two resource types:

  • information resource — this category pertains to the original sense of resources on the Web, such as files, documents or other resources to which a URL can be assigned (including images and non-document media). Though it has been proposed that such resources be designated with “slash” URIs, such as http://www.example.com/home/index.html, this is not enforceable and some resources in the next category do not adhere. A “slash” URI, however, still is “better practice” (if not “best practice”) for such traditional resources. Note that this category of information resource is very much in keeping with the nature of the early, document-centric Web
  • non-information resource — this category is the new one to deal with all of the “abstract” cases such as classes, etc.; this category is especially important to the data-centric Web. As with the other resource category, it was proposed to use “pound sign” fragment identifiers of the kind used by anchor tags, such as http://www.example.com/data#bigClass, to signal this different resource type, but, again, this is not enforceable and at most could be “better practice.”

A successful HTTP request for an information resource results in a 200 message (“OK”, followed by transfer) from the Web server; if the request is for a resource that the Web server recognizes as existing but of the wrong type requested, the publisher can use the 303 redirect response to provide the correct URI [12].
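As a minimal sketch of this convention from the consumer’s side (the URI is made up, and the use of the Python requests library is my own illustration rather than anything specified by the W3C), a client could inspect the raw status code before following any redirect:

import requests

def classify(uri):
    # Ask for the resource but do not follow redirects, so the raw status code is visible
    resp = requests.get(uri, allow_redirects=False, timeout=10)
    if 200 <= resp.status_code < 300:
        return "information resource (a representation was returned directly)"
    if resp.status_code == 303:
        # The 303 target points to a resource describing the (possibly abstract) one requested
        return "could be any resource; description at " + resp.headers.get("Location", "?")
    return "nature of the resource is unknown (HTTP %d)" % resp.status_code

print(classify("http://www.example.com/id/bigClass"))  # hypothetical URI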

These resource distinctions are very, very unfortunate in their labeling, if not on more fundamental grounds. It is non-sensical to call one category “information” and the other not, when everything is informational. Moreover, the distinctions bring absolutely no clarity. (Important note: actually, the provenance of the term ‘non-information resource‘ appears to be quite recent, as well as wrong and unfortunate [13]).

However, getting standards bodies to change labels is a long and uncertain task. The approach taken below is to stick with the ‘information resource‘ term, but to provide the alias of ‘structured data resource‘ in place of ‘non-informational resource‘ and to add some additional sub-category distinctions [14].

The Relation Between Resources and Structure

Even though a Web ‘resource‘ has an address scheme and other requirements, these are details and specifics that can mask the fundamental purpose of a resource to act as a “container” for encapsulating information of some sort. This encapsulation is what enables access to its “payload” information (Kingsley’s term) on the Web or the broader Internet via standard protocols (HTTP and TCP/IP). The mechanisms of the encapsulation constitute the details and specifics.

Inherent to the transition from the document Web to the structured Web is the increased importance of that most confusing category: the non-information resource. That is because, like fragment identifiers, we are talking about objects more granular than (subsidiary to) document-level resources, and because we are now referencing structure that includes such “abstract” notions as classes, properties, types, namespaces, ontologies, schema, etc.

One paradox in all of this is that the very category of resource designed to deal with these data issues is itself called by many a ‘non-information resource‘. This term is non-sense [13]. We use instead ‘structured data resource.’

There is much that needs to be cleaned up and clarified over time regarding uses and nomenclature for resources. However, from the standpoint of the structured Web, we can probably for the time being concentrate on the items in this table:

Information Resource (current standard Web term): unstructured and semi-structured data
  • Document Resource: text and markup within standard Web pages
  • ‘Other’ Resources: non-text resources with a URL (images, streaming media, non-text indexable files)

Non-information Resource, aka Structured Data Resource (current standard Web term; non-sensical) [see text; 13]: semi-structured and structured data
  • Structured Data: all non-RDF data-oriented resources, including non-RDF namespaces, etc.
  • Linked Data: all RDF

Note that the two main resource categories used in practice are maintained. The ‘information resource’ category retains its traditional understanding. From the standpoint of the structured Web, the document resources are the unstructured and semi-structured data content from which information extraction (IE) techniques and software can extract the eventual structure.

The category of document resource likely represents the majority of potentially useful structural content on the Web and is most often overlooked in discussions of linked data or the semantic Web. This content, if subjected to IE and therefore structure creation, then becomes a URI resource better handled as a ‘structured data resource.’

‘Structured data resources‘ (that is, the poorly labeled ‘non-information resources‘) are the building blocks for the structured Web. In all cases, these resources are either semi-structured or structured data. There are two sub-categories of resources in this category, differentiated by whether the structured data is expressed in RDF (or RDF-based languages) or not. All RDF variants are called ‘linked data‘; all other forms are termed ‘structured data.’

For most general purposes, putting aside nuances and subtle technicalities, the best shorthand for thinking about these resource distinctions is simply Documents v. Data.

Thus, we can see the path to the structured Web taking a number of different branches.

The first branch, and the one necessary for the largest portion of content, is to use a combination of IE techniques to extract entity information from unstructured text or to use structure extraction on the semi-structure of the document [15] to create the structured data resource for that document. Of all variants, this is the longest path, the one least developed, but one with potentially great value.

The second branch is to publish the structured data directly as a resource or to provide access to it through a Web service or API. This is the basis for most of the structured data resources presently available on the Web. (It is also the outcome of IE for the first branch.)

The third branch is really just a variant that completes the other two — ensuring that the structured data resource is available as interoperable RDF linked data. There are two ways to proceed down this branch. One way is for the publisher to create and post the resource directly in a form of RDF. (Though the actual data can be serialized in a variety of ways such as RDF/XML, N3, Atom or Turtle, conversions between these forms are relatively straightforward.) The other way, less direct, is for the publisher or a third party to convert non-RDF structured data into RDF with the rich and growing list of available ‘RDFizers’ [16].
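As a small sketch of that last conversion step, assuming the Python rdflib library and an entirely hypothetical namespace and record, a non-RDF structured record might be ‘RDFized’ along these lines:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://www.example.com/id/")    # hypothetical namespace
FOAF = Namespace("http://xmlns.com/foaf/0.1/")  # the well-known FOAF vocabulary

# A non-RDF structured record, e.g. one row from a spreadsheet or database table
record = {"id": "john-a-smith", "name": "John A. Smith", "town": "Salem, MA"}

g = Graph()
person = EX[record["id"]]
g.add((person, RDF.type, FOAF.Person))
g.add((person, FOAF.name, Literal(record["name"])))
g.add((person, EX.town, Literal(record["town"])))

# The same data, now expressed as linked data (RDF serialized as Turtle)
print(g.serialize(format="turtle"))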

The Web in Transition

This material, plus the earlier introduction to the structured Web, can now be brought together as a picture of the Web in transition. While there are no real beginning and end points, there is a steady progression from a document-centric Web to one that is data-centric, including the mediation of semantics:

Transition in Web Structure

Document Web (circa 1993)
  • Document-centric
  • Document resources
  • Unstructured data and semi-structured data
  • HTML
  • URL-centric

Structured Web (circa 2003)
  • Data-centric
  • Structured data
  • Semi-structured data and structured data
  • XML, JSON, RDF, etc.
  • URI-centric

Linked Data (circa 2006)
  • Data-centric
  • Linked data
  • Semi-structured data and structured data
  • RDF, RDF-S
  • URI-centric

Semantic Web (circa ???)
  • Data-centric
  • Linked data
  • Semi-structured data and structured data
  • RDF, RDF-S, OWL
  • URI-centric

The basic argument of this series is that we are in the midst of a transition phase — the structured Web — that marks the beginning of the dominance of data on the Web. In its broadest definition, the structured Web has many different data forms and serializations. A subset of the structured Web — namely, linked data — is a direct precursor to the semantic Web with its emphasis on RDF and data interoperability and services.

Another argument of this series is that — despite the promise of linked data — structured data resources in many forms will co-exist and provide alternatives. This diversity is natural. For RDF and linked data advocates, tools are now largely in place to convert these diverse forms. Though the path to large-scale availability of RDF data appears clear, the longer-term prospects for mediating heterogeneous semantics remain cloudy.

Brief Glossary

To re-cap, and to aid language and understanding, here is a brief glossary of the key terms used in this discussion:

  • dereferencing — the act of locating and transmitting structured data, in a requested format (representation), exposed by a URL
  • document resource — a Web resource designated by a URL that contains unstructured and semi-structured data
  • information extraction — any of a variety of techniques for extracting structure and entities from content
  • information resource — any Web resource that can be retrieved via a URL
  • linked data — a structured data resource in RDF that can be obtained via a URI
  • non-information resource — this is “any resource” that is not an information resource; preference is to deprecate its use and use structured data resource in its stead
  • semantic Web — the Web of data with sufficient meaning associated with that data such that a computer program may learn enough to meaningfully process it
  • semi-structured data — data that includes semantic entities that may have attributes or groups, but which are not all required or presented in a set order or in a set size or type; may be embedded or interspersed in unstructured data
  • structured data — data organized into semantic chunks or entities, with similar entities grouped together in relations or classes, and presented in a patterned manner
  • structured data resource — a resource of semi-structured or structured data that can be obtained via a URI; the preferred term in place of ‘non-information resource’
  • structured Web — the data-centric Web of structured data resources and structured data in various forms; formally defined as, object-level data within Internet documents and databases that can be extracted, converted from available forms, represented in standard ways, shared, re-purposed, combined, viewed, analyzed and qualified without respect to originating form or provenance
  • unstructured data — data that can be of any type and does not necessarily follow any format, sequence or rules; can generally be described as “free form;” includes text, images, video or sound
  • URI — uniform resource identifier
  • URL — uniform resource locator
This posting is part of a new, occasional series on the Structured Web.

[1] You will note a heavy emphasis on Wikipedia definitions in keeping with Web usage.

[2] Of course, the word ‘structure‘ has a broad range of meanings; we are only concerned here about data structure and its specific applicability to Web-related information.

[3] Much of the discussion in this sub-section is derived from an earlier AI3 posting, Semi-structured Data: Happy 10th Birthday!, from November 2005; some of the original information is a bit dated. It is also aided by Kingsley’s Structured Data v. Unstructured Data posting from June 2006. That posting notes the frequent confusion and ambiguity between the terms “structured data” and “unstructured data,” and the importance when speaking of structure to keep separate the structure of the data itself (the focus herein), the structure of the container that hosts the data, and the structure of the access method used to access the data.

[4] Peter Wood, School of Computer Science and Information Systems, Birkbeck College, the University of London. See http://www.dcs.bbk.ac.uk/~ptw/teaching/ssd/toc.html.

[5] The earliest known recorded mention of “semi-structured data” occurred in 1992 from N. J. Belkin and Croft, W. B., “Information filtering and information retrieval: two sides of the same coin?,” in Communications of the ACM: Special Issue on Information Filtering, vol. 35(12), pp. 29 – 38, with the first known definition from [6]. The next two mentions were in 1995 from D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman and J. Widom, “Querying semistructured heterogeneous information,” presented at Deductive and Object-Oriented Databases (DOOD ’95), LNCS, No. 1013, pp. 319-344, Springer, and M. Tresch, N. Palmer, and A. Luniewski, “Type classification of semi-structured data,” in Proceedings of the International Conference on Very Large Data Bases (VLDB). However, the real popularization of the term “semi-structured data” occurred through the seminal 1997 papers from S. Abiteboul, “Querying semi-structured data,” in International Conference on Data Base Theory (ICDT), pp. 1-18, Delphi, Greece, 1997 (http://dbpubs.stanford.edu:8090/pub/1996-19) and P. Buneman, “Semistructured data,” in ACM Symposium on Principles of Database Systems (PODS), pp. 117-121, Tucson, Arizona, May 1997 (http://db.cis.upenn.edu/DL/97/Tutorial-Peter/tutorial-semi-pods.ps.gz). Of course, semi-structured data had existed prior to these early references, only it had not been named as such.

[6] P. Schäuble, “SPIDER: a multiuser information retrieval system for semistructured and dynamic data,” in Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318 – 327, 1993.

[7] Such protocols first received serious computer science study in the late 1970s and early 1980s. In the financial realm, one early standard was electronic data interchange (EDI). In science, there were literally tens of exchange forms proposed with varying degrees of acceptance, notably abstract syntax notation (ASN.1), TeX (a typesetting system created by Donald Knuth and its variants such as LaTeX), hierarchical data format (HDF), CDF (common data format), and the like, as well as commercial formats such as Postscript, PDF (portable document format), RTF (rich text format), and the like. One of these proposed standards was the “standard generalized markup language” (SGML), first published in 1986. SGML was flexible enough to represent either formatting or data exchange. However, with its flexibility came complexity. Only when two simpler forms arose, namely HTML (HyperText Markup Language) for describing Web pages and XML (eXtensible Markup Language) for data exchange, did variants of the SGML form emerge as widely used common standards. The XML standard was first published by the W3C in February 1998, rather after the semi-structured data term had achieved some impact (http://www.w3.org/XML/hist2002) Dan Suciu was the first to publish on the linkage of XML to semi-structured data in 1998 (D. Suciu, “Semistructured data and XML,” in International Conference on Foundations of Data Organization (FODO), Kobe, Japan, November 1998; see PDF option from http://citeseer.ist.psu.edu/suciu98semistructured.html), a reference that remains worth reading to this day.

[8] David Loshin, “Simple semi-structured data,” Business Intelligence Network, October 17, 2005. See http://www.b-eye-network.com/view/1761. This example is actually quite complex and demonstrates the challenges facing IE software. Extracted entities most often relate to the nouns or “things” within a document. Note also, for example, how many of the entities involve internal “co-referencing,” or the relation of subjects such as “he” or clock times such as “10 a.m” to specific dates. A good entity extraction engine helps resolve these so-called “within document co-references.”

[9] Example text-based data serializations and formats used on the Web include Atom, GData, JSON, N3, pickle, RDF/XML, RSS, Turtle, XML, YAML.

[10] Example mark-up languages used on the Web include HTML, Wikitext, XHTML, XML.

[11] It was called the httpRange-14 issue by virtue of the agenda label at the TAG’s meetings.

[12] Of course, nothing compels the publisher to provide these instructions, but they are integral to “best practices” and publishers desirous of attracting consumers have incentives to follow them.

Depending on the nature of the information resource, an HTTP GET returns a 200 OK status and sends the resource (for example a request for a Web page or an RDF file) if the request is for the correct type. If the request is for the wrong type, the publisher can include in the header a 303 (See Other) redirect response to send the requester to the appropriate information resource URL. If the request is for an unknown URI or resource, a variety of 4xx error responses may result. (See further [13].)

One advantage of posting structured data resources as RDF-linked data is that this redirect can be sent to a REST-style Web service that does a best-effort DESCRIBE of the resource using SPARQL. Since SPARQL is a query language, protocol, and results representation scheme, the redirect can come in the form of a URL that can directly query the structured data resource. In this manner, large structured data resource data sets can act as ‘endpoints’ for context-specific information linked anywhere on the Web.
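As a hypothetical illustration of such a redirect target (the endpoint and resource URIs below are made up), the 303 Location header can simply be a URL that asks a SPARQL endpoint to DESCRIBE the requested resource:

from urllib.parse import urlencode

endpoint = "http://www.example.com/sparql"       # hypothetical SPARQL endpoint
resource = "http://www.example.com/id/bigClass"  # hypothetical structured data resource

# The redirect target packages a DESCRIBE query against the endpoint
redirect_target = endpoint + "?" + urlencode({"query": "DESCRIBE <%s>" % resource})
print(redirect_target)
# http://www.example.com/sparql?query=DESCRIBE+%3Chttp%3A%2F%2Fwww.example.com%2Fid%2FbigClass%3E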

The TAG’s 2005 ‘compromise solution’ was reported by Roy Fielding, with his public notice reproduced in full:

<TAG type="RESOLVED">

That we provide advice to the community that they may mint
“http” URIs for any resource provided that they follow this
simple rule for the sake of removing ambiguity:
a) If an “http” resource responds to a GET request with a
2xx response, then the resource identified by that URI
is an information resource;
b) If an “http” resource responds to a GET request with a
303 (See Other) response, then the resource identified
by that URI could be any resource;
c) If an “http” resource responds to a GET request with a
4xx (error) response, then the nature of the resource
is unknown.

</TAG>

I believe that this solution enables people to name arbitrary
resources using the “http” namespace without any dependence on
fragment vs non-fragment URIs, while at the same time providing
a mechanism whereby information can be supplied via the 303
redirect without leading to ambiguous interpretation of such
information as being a representation of the resource (rather,
the redirection points to a different resource in the same way
as an external link from one resource to the other).

Note that point “a” discusses an “information resource,” with the contrasting treatment in point “b” as potentially “any resource”.

To my knowledge, the unfortunate term ‘non-information resource’ was first introduced to cover these point “b” conditions in a draft (that is, unofficial) TAG finding from two years later, on May 31 of this year, “Dereferencing HTTP URIs.” Besides its unfortunate continuation of the ‘dereferencing’ term (a discussion for another day), it introduces the even-worse ‘non-information resource’ terminology. That draft TAG finding talks in Sec 2 about ‘information resources’ (as does the 2005 TAG finding), and in Sec 3 about ‘other Web resources’ (or the “any” category from the 2005 notice). In Sec 4, however, the document switches from ‘other Web’ to ‘non-information’, which is then continued through the rest of the document.

It is not too late for the community to cease using this term; our replacement suggestion is ‘structured data resource.’

[14] You may also want to see Leo Sauermann, Richard Cyganiak, and Max Völkel, “Cool URIs for the Semantic Web,” or Dan Connolly, “A Pragmatic Theory of Reference for the Web.”

[15] Web documents are typically semi-structured with embedded tag, metadata, and presentation structure in such things as headers, tables, layouts, labels, dropdown lists, etc. For the unstructured text content, traditional information extraction techniques are applied. But for the semi-structure, various scraping or extraction techniques may either be crafted by hand or be semi-automatically or automatically applied through regular expression processing, pattern matching, inspection of the Web page DOM or other techniques. One posting in this structured Web series is to be devoted to this topic.

[16] There is an impressive and growing list of data conversion protocols and tools, most of which support multiple input and output forms, including RDF and various serializations of RDF. This list includes Virtuoso Sponger, GRDDL, Babel, RDFizers, general converters, Triplr, etc. One posting in this structured Web series is to be devoted to this topic.

Posted by AI3's author, Mike Bergman, on July 22, 2007 at 3:58 pm in Adaptive Information, Semantic Web, Structured Web.
The URI link reference to this post is: https://www.mkbergman.com/391/more-structure-more-terminology-and-hopefully-more-clarity/