Posted: August 23, 2007

Was the Industrial Revolution Truly the Catalyst?

Why, beginning roughly in 1820, did historical economic growth rates skyrocket?

This is a question of no small import, and one that has occupied economic historians for many decades. We know what some of the major transitions have been in recorded history: the printing press, Renaissance, Age of Reason, Reformation, scientific method, Industrial Revolution, and so forth. But, which of these factors were outcomes, and which were causative?

This is not a new topic for me. Some of my earlier posts have discussed Paul Ormerod’s Why Most Things Fail: Evolution, Extinction and Economics, David Warsh’s Knowledge and the Wealth of Nations: A Story of Economic Discovery, David M. Levy’s Scrolling Forward: Making Sense of Documents in the Digital Age, Elizabeth Eisenstein’s classic Printing Press, Joel Mokyr’s Gifts of Athena: Historical Origins of the Knowledge Economy, Daniel R. Headrick’s When Information Came of Age: Technologies of Knowledge in the Age of Reason and Revolution, 1700-1850, and Yochai Benkler’s The Wealth of Networks: How Social Production Transforms Markets and Freedom. Thought-provoking references, all.

But, in my opinion, none of them gets at the central point.

Statistical Leaps of Faith

Statistics (originally derived from the concept of information about the state) really only began to be collected in France in the 1700s. For example, the first true population census (as opposed to the enumerations of biblical times) occurred in Spain in that same century, with the United States being the first country to set forth a decennial census beginning around 1790. Pretty much everything of a quantitative historical basis prior to that point is a guesstimate, and often a lousy one to boot.

Because no data was collected — indeed, the idea of data and statistics did not exist — attempts in our modern times to re-create economic and population assessments for earlier centuries are truly heroic, estimation-laden exercises. Nonetheless, the renowned economic historian Angus Maddison, who has written a number of definitive OECD studies, and his team have prepared economic and population growth estimates for the world and various regions going back to AD 1 [1].

One summary of their results shows:

Year AD   Ave Per Capita GDP (1990 $)   Ave Annual Growth Rate   Yrs Required for Doubling
1                461
1000             450                        -0.002%                   N/A
1500             566                         0.046%                 1,504
1600             596                         0.051%                 1,365
1700             615                         0.032%                 2,167
1820             667                         0.067%                 1,036
1870             874                         0.542%                   128
1900           1,262                         1.235%                    56
1913           1,526                         1.470%                    47
1950           2,111                         0.881%                    79
1967           3,396                         2.836%                    25
1985           4,764                         1.898%                    37
2003           6,432                         1.682%                    42

Note that through at least 1000 AD, economic growth per capita (as well as population growth) was approximately flat. Indeed, up to the nineteenth century, Maddison estimates that a doubling of economic well-being per capita occurred only every 3,000 to 4,000 years. But from about 1820 onward, this doubling accelerated at warp speed to every 50 years or so.
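
(For those wanting to check the arithmetic, the doubling times in the table follow directly from the growth rates: years to double = ln 2 / ln(1 + r). At the 1870 rate of 0.542% per year that works out to ln 2 / ln(1.00542), or roughly 128 years; at the pre-1820 rates of a few hundredths of a percent per year, the same formula yields the millennia-scale doubling times shown.)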

Looking at a Couple of Historical Breakpoints

The first historical shift in millennial trends occurred at roughly 1000 AD, when flat or negative growth began to accelerate slightly. The growth trend looks comparatively impressive in the figure below, but that is only because the doubling of per capita economic wealth has now dropped to about every 1,000 to 2,000 years (note the relatively small differences in the income scale). These are annual growth rates about 30 times lower than today’s, which, with compounding, prove anemic indeed (see estimated rates in the table above).

Nonetheless, at about 1000 AD there is an inflection point, though a small one. It is also one that corresponds somewhat to the adoption of raw linen paper vs. skins and vellum (among other correlations that might be drawn).

When the economic growth scale is expanded to include today, these optics change considerably. Yes, there was a bit of a growth inflection around 1000 AD, but it is almost lost in the noise over the longer historical horizon. The real discontinuity in economic growth appears to have occurred in the early 1800s, compared to all previous recorded history. At this major inflection point, historically flat income averages skyrocketed. Why?

The fact that this inflection point does not correspond to earlier events such as the invention of the printing press or the Reformation (or other earlier notable transitions) — and does more closely correspond to the era of the Industrial Revolution — has tended to cement in popular histories and the public’s mind the idea that machinery and mechanization were the causative factors behind economic growth.

Had a notable transition occurred in the mid-1400s to 1500s, it would have been obvious to ascribe modern economic growth trends to the availability of information and the printing press. And, while the printing press indeed had massive effects, as Elizabeth Eisenstein has shown, the empirical record of changes in economic growth is not directly linked with its adoption. Moreover, as the graph above shows, something huge did happen in the early 1800s.

Pulp Paper and Mass Media

In its earliest incarnations, the printing press was an instrument of broader idea dissemination, but still largely to and through a relatively small and elite educated class. That is because books and printed material were still too expensive — I would submit largely due to the exorbitant cost of paper — even though somewhat more available to the wealthy classes. Ideas were fermenting, but the relative percentage of participants in that direct ferment was small. The overall situation was better than monks laboriously scribing manuscripts, but not disruptively so.

However, by the 1800s, those base conditions changed, as reflected in the figure above. The combination of mechanical presses and paper production with the innovation of cheaper “pulp” paper were the factors that truly brought information to the “masses.” Yet some have taken “mass media” to be its own pejorative. But look closely at what that term means and its importance in bringing information to the broader populace.

In Paul Starr’s The Creation of the Media, he notes how in the 15 years from 1835 to 1850 the cost of setting up a mass-circulation paper increased from $10,000 to over $2 million (in 2005 dollars). True, mechanization was increasing publishers’ setup costs, but from the standpoint of consumers, the cost of information content was dropping toward zero and approaching near-time immediacy. The concept of “news” was coined, delivered by the “press” for a now-emerging “mass media.” Hmmm.

Mass publishing and pulp paper were thus emerging to bring an ever-increasing storehouse of content and information to the public at levels never before seen. Though mass media may prove to be an historical artifact, its role in bringing literacy and information to the “masses” was generally an unalloyed good and the basis for an improvement in economic well-being the likes of which had never been seen.

More recent trends show an upward blip in growth shortly after the turn of the 20th century, corresponding to electrification, but then a much larger discontinuity beginning after World War II:

In keeping with my thesis, I would posit that organizational information efforts and the early electromechanical and then electronic computers that came out of the war effort, which in turn led to more efficient processing of information, were possible factors in this post-WWII growth increase.

It is silly, of course, to point to single factors or offer simplistic slogans about why this growth occurred and when. Indeed, the scientific revolution, industrial revolution, increase in literacy, electrification, printing press, Reformation, rise in democracy, and many other plausible and worthy candidates have been brought forward to explain these historical inflections in accelerated growth. For my own lights, I believe each and every one of these factors had its role to play.

But at a more fundamental level, I believe the drivers for this growth change came from the global increase in, and access to, prior human information. Surely, the printing press helped to increase absolute volumes. Declining paper costs (a factor I believe to be greatly overlooked, but also coterminous with the growth spurt and the transition from rag to pulp paper in the early 1800s) made information access affordable and universal. With accumulations in information volume came the need for better means to organize and present that information — title pages, tables of contents, indexes, glossaries, encyclopedias, dictionaries, journals, logs, ledgers, etc., all innovations of relatively recent times — that themselves worked to further fuel growth and development.

Of course, were I an economic historian, I would need to argue and document my thesis in a 400-page book. And, even then, my arguments would appropriately be subject to debate and scrutiny.

Information, Not Machines

Tools and physical artifacts distinguish us from other animals. When we see the lack of a direct correlation between growth changes and the invention of the printing press, or see growth changes roughly coincide with the age of machines of the Industrial Revolution, it is easy and natural for us humans to attribute such changes to the tangible device. Indeed, our current fixation on technology is in part due to our comfort as tool makers. But is this association with the technology and the tangible reliable, or (hehe) “artifactual”?

Information, specifically non-biological information passed on through cultural means, is what truly distinguishes us humans from other animals. We have been easily distracted looking at the tangible, when it is the information artifacts (“symbols”) that make us the humans who we truly are.

So, the confluence of cheaper machines (steam printing presses) with cheaper paper (pulp) brought information to the masses. And, in that process, more people learned, more people shared, and more people could innovate. And, yes, folks, we innovated like hell, and continue to do so today.

If the nature of the biological organism is to contain within it genetic information from which adaptations arise that it can pass to offspring via reproduction — an information volume that is inherently limited and only transmittable by single organisms — then the nature of human cultural information is a massive shift to an entirely different plane.

With the fixity and permanence of printing and cheap paper — and now cheap electrons — all prior discovered information across the entire species can now be accumulated and passed on to subsequent generations. Our storehouse of available information is thus accreting in an exponential way, and available to all. These factors make the fitness of our species a truly quantum shift from all prior biological beings, including early humans.

What Now Internet?

The means by which information itself is produced and disseminated are changing and growing. This is an infrastructural innovation that applies multiplier benefits on top of the standard multiplier benefit of information. In other words, innovation in the very basis of information use and dissemination is itself disruptive. Over history, writing systems, paper, the printing press, mass paper, and electronic information have all had such multiplier effects.

The Internet is but the latest example of such innovations in the infrastructural groundings of information. The Internet will continue to support the inexorable trend to more adaptability, more wealth and more participation. The multiplier effect of information itself will continue to empower and strengthen the individual, not in spite of mass media or any other ideologically based viewpoint but due to the freeing and adaptive benefits of information itself. Information is the natural antidote to entropy and, longer term, to the concentrations of wealth and power.

If many of these arguments of the importance of the availability of information prove correct, then we should conclude that the phenomenon of the Internet and global information access promises still more benefits to come. We are truly seeing access to meaningful information leapfrog anything seen before in history, with soon nearly every person on Earth contributing to the information dialog and leverage.

Endnote: And, oh, to answer the rhetorical question of this piece: No, it is information that has been the source of economic growth. The Industrial Revolution was but a natural expression of then-current information and, through its innovations, a source of still newer information, all continuing to feed economic growth.


[1] The historical data were originally developed in three books by Angus Maddison: Monitoring the World Economy 1820-1992, OECD, Paris 1995; The World Economy: A Millennial Perspective, OECD Development Centre, Paris 2001; and The World Economy: Historical Statistics, OECD Development Centre, Paris 2003. All these contain detailed source notes. Figures for 1820 onwards are annual, wherever possible.

For earlier years, benchmark figures are shown for 1 AD, 1000 AD, 1500, 1600 and 1700. These figures have been updated to 2003 and may be downloaded by spreadsheet from the Groningen Growth and Development Centre (GGDC), a research group of economists and economic historians at the Economics Department of the University of Groningen headed by Maddison. See http://www.ggdc.net/.

Posted: August 18, 2007

RDF123

UMBC’s Ebiquity Program Creates Another Great Tool

In a strange coincidence, I encountered a new project called RDF123 from UMBC’s Ebiquity program a few days back while researching ways to more easily create RDF specifications. (I was looking in the context of easier ways to test out variations of the UMBEL ontology.) I put it on my to-do list for testing, use and a possible review.

Then, this morning, I saw that Tim Finin had posted up a more formal announcement of the project, including a demo of converting my own Sweet Tools to RDF using the very same tool! Thanks, Tim, and also for accelerating my attention on this. Folks, we have another winner!

RDF123, developed by Lushan Han with funding from NSF [1], improves upon earlier efforts from the University of Maryland’s Mindswap lab, which had developed Excel2RDF and the more flexible ConvertToRDF a number of years back. Unlike RDF123, these other tools were limited to creating an instance of a given class for each row in the spreadsheet. RDF123, on the other hand, allows users to define mappings to arbitrary graphs and different templates by row.

It is curious that so little work has been done on spreadsheets as an input and specification mechanism for RDF, given the huge use and ubiquity (pun on purpose!) of the format. According to the Ebiquity technical report [1], TopBraid Composer has a spreadsheet utility (one that I have not tested), and there is a new plug-in for Protégé version 4.0 from Jay Kola, also on my to-do list for testing (it requires upgrading to the beta version of Protégé), that supports imports of OWL and RDF Schema.

I have also been working with the Linking Open Data group at the W3C regarding converting the Sweet Tools listing to RDF, and have indeed had an RDF/XML listing available for quite some time [2]. You may want to compare this version with the N3 version produced by RDF123 [3]. The specification for creating this RDF123 file, also in N3 format, is really quite simple:

@prefix d: <http://spreadsheets.google.com/feeds/list/o15870720903820235800 etc., etc.> .
@prefix mkbm: <https://www.mkbergman.com/> .
@prefix exhibit: <http://simile.mit.edu/2006/11/exhibit#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix : <#> .
@prefix e: <http://spreadsheets.google.com/feeds/list/o15870720903820235800 etc., etc.> .
<Ex:e+$1>
  a exhibit:Item ;
  rdfs:label "Ex:$1" ;
  exhibit:origin "Ex:mkbm+'#'+$1^^string" ;
  d:Category "Ex:$5" ;
  d:Existing "Ex:$7" ;
  d:FOSS "Ex:$4" ;
  d:Language "Ex:$6" ;
  d:Posted "Ex:$8" ;
  d:URL "Ex:$2^^string" ;
  d:Updated "Ex:$9" ;
  d:description "Ex:$3" ;
  d:thumbnail "Ex:@If($10='';'';mkbm+@Substr($10,12,@Sub(@Length($10),4)))^^string" .
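
To make the mapping concrete, here is the kind of N3 output such a specification would produce for a single spreadsheet row. The row values below (tool name, category, and so on) are invented purely for illustration and are not drawn from the actual Sweet Tools listing:

e:ExampleTool
  a exhibit:Item ;
  rdfs:label "ExampleTool" ;
  exhibit:origin "https://www.mkbergman.com/#ExampleTool" ;
  d:Category "RDF editor" ;
  d:FOSS "Yes" ;
  d:Language "Java" ;
  d:URL "http://example.org/exampletool" ;
  d:description "A hypothetical RDF authoring tool, used only to illustrate the mapping." .

(The remaining columns, such as d:Posted, d:Updated and d:thumbnail, would be filled in the same row-by-row manner.)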

The UMBC approach is somewhat like GRDDL for converting other formats to RDF, but is more direct because it bypasses the need to first convert the spreadsheet to XML and then transform it with XSLT. This means updates can be automatic, and the difficulty of writing XSLT is replaced with a simple notation, as shown above, for properly mapping column values to labels and properties.

RDF123 has the option of two interfaces in its four versions. The first interface, used by the application versions, is a graphical interface that allows users to create their mapping in an intuitive manner. The second is a Web service that takes as input a combined URL string to a Google spreadsheet or CSV file and an RDF123 map and output specification [3].

The four versions of the software are:

RDF123 is a tremendous addition to the RDF tools base, and one with promise for further development for easy use by standard users (non-developers). Thanks NSF, UMBC and Lushan!

And, Thanks Josh for the Census RDF

Along with last week’s tremendous announcement by Josh Tauberer that he has made 2000 US Census data available as nearly 1 billion RDF triples, these dog days of August have in fact proven to be stellar ones on the RDF front! These two events should help promote an explosion of RDF in numeric data.


[1] Lushan Han, Tim Finin, Cynthia Parr, Joel Sachs, and Anupam Joshi, RDF123: A Mechanism to Translate Spreadsheets to RDF, Technical Report from the Computer Science and Electrical Engineering Dept., University of Maryland, Baltimore County, August 2007, 17 pp. See http://ebiquity.umbc.edu/paper/html/id/368/RDF123-a-mechanism-to-translate-spreadsheets-to-RDF; also, a PDF version of the report is available. The effort was supported by a grant from the National Science Foundation.

[2] This version was created using Exhibit, the lightweight data publishing framework used for Sweet Tools. It allows RDF/XML to be copied from the online Exhibit, though it has a few encoding issues, which required manual adjustments to produce valid RDF/XML. A better RDF export service is apparently in the works for Exhibit version 2.0, slated for release soon.

[3] N3 stands for Notation 3 and is a more easily read serialization of RDF. For direct comparison with my native RDF/XML, you can convert the N3 file at RDFabout.com. Alternatively, you can directly create the RDF/XML output with the slightly different instructions to the online service of: http://rdf123.umbc.edu/server/?src=http://spreadsheets.google.com/pub?key=pGFSSSZMgQNxIJUCX6VO3Ww&map=http://rdf123.umbc.edu/map/demo1.N3&out=XML; note the last statement changing the output format from N3 to XML. Also note the UMBC service address, followed by the spreadsheet address, followed by the specification address (the listing of which is shown above), then ending with the output form. This RDF/XML output validates with the W3C’s RDF validation service, unlike the original RDF/XML created from Sweet Tools that had some encoding issues that required the manual fixing.

Posted: August 8, 2007

Structured Data and UMBEL Will Benefit from a Standard Registration Format

One implication of the structured Web is the question of how, amidst the rapid proliferation of data, you find what is relevant. That purpose is what stimulated the initiation of the UMBEL (Upper Mapping and Binding Exchange Layer) project.

The original specification for UMBEL recognized the need for a reference set of subject “proxies” to help describe what each data set “was about” as well as the need for a variety of binding mechanisms depending on scope and data structure of the source dataset.

At its most general level, the intent of UMBEL is to provide four components in its ‘core’ ontology [1]:

  1. A set of reference subject “proxies” and properties and relations around them
  2. A means of binding the ontologies, classes or subsets of data within each contributing dataset to the ‘core’ UMBEL ontology and those subject proxies
  3. A means of characterizing a given dataset via metadata, and
  4. A means of describing access methods and endpoints for getting at that data.

The first component on subject proxies is largely left to another discussion. The topic of this posting is mostly related to the latter three components of dataset binding and registration mechanisms.

How such road signs might work, the contributions of possible analogs, their differences in providing solutions in and of themselves, and a first-cut outline of the resulting ‘core’ UMBEL ontology are described below.

The Why and How of These Road Signs

‘Road signs’ are simply a shorthand for how to find stuff. The normative case is to have sufficient characterization of datasets such that a central registry can aid their discovery, look-up and productive access and use. Yet registration can be an onerous task, and one not generally easily or willingly undertaken by publishers or providers.

These challenges lead to two important design considerations. First, only minimal characterization should be required for initially registering a dataset; the remaining characteristics should be optional. The incentive over time to complete such optional fields is to signal to consumers that fully characterized datasets may be more dependable or authoritative. It is possible, for example, to envision external qualification rules or routines that “score” competing datasets providing similar information based on the completeness of dataset characterization.

Second, any party should be allowed to register and characterize a dataset. There may be motivations by non-publishers to do so, for altruistic or other reasons. However, in the case of disputes over the accuracy of characterization, the owner or publisher should have final say. Another open question is whether competing characterizations or different registrations should be allowed for the same dataset.

These considerations are further complicated by the range of scope, scale, data content and formalism found on the real-world Web.

In the spirit of not re-inventing the wheel, we began a process to discover what other communities have done when faced with similar problems. The two closest analogs are, firstly, the library community and its need to describe digital archives and collections and, secondly, the general approaches devoted to Web services, including its dedicated language, WSDL (Web Services Description Language).

Digital Collections and Archives

Librarians and information architects have been active for at least the past decade in efforts to describe and relate digital collections to one another. These efforts have been geared to search, general look-up and interlibrary loans and sharing. This community of practice, while embracing a variety of somewhat competing and overlapping schemes, has also (from an outsider’s viewpoint) come up with a general consensus view as to how to describe these archives and collections.

(The library community still tends to use the terminology of ‘metadata’ and descriptors, whereas the ontology and RDF communities tend to speak more of classes, properties and instances. However, the net intent and outcome still appears much the same.)

A common reference point to these schemes is the Dublin Core Collection Application Profile (DCCAP), a specification of how metadata terms from the Dublin Core metadata initiative (DCMI) and other vocabularies can be used to construct a description of a collection in accordance with the DCMI Abstract Model. There is also a more easily read summary of the DCCAP [2]. Though, again, there are differences in terminology, the presentation of this scheme is very much in keeping with the format of a W3C specification (such as for WSDL, see below).

One of the first widely embraced efforts of the community is the Open Archives Initiative – Protocol for Metadata Harvesting (OAI-PMH), begun in the late 1990s. OAI-PMH provides specifications for both data providers and service providers in how to assign and describe collection metadata. The OAI Protocol has become widely adopted by many digital libraries, institutional repositories, and digital archives, with total sources registered numbering into the thousands [3]. These large institutional repositories are also increasingly being indexed by large search engines (such as Google Scholar).

The National Information Standards Organization (NISO) began a MetaSearch Initiative (NISO-MS) that resulted in the development of a collection description schema. Though NISO recently reorganized its content and collection activities, the draft NISO Collection Description Specification [4] remains a readable overall reference for these initiatives. Like related initiatives, the draft also uses the DCCAP format.

A similar effort was initiated in the United Kingdom called the Information Environment Service Registry. IESR was designed to make it easier for other applications to discover and use materials which will help their users’ learning, teaching and research. IESR’s various terms (namespaces, classes and properties) and controlled vocabularies are very helpful to UMBEL.

Another example is the Ockham Digital Library Service Registry (DLSR), funded by the National Science Digital Library (NSDL) initiative, which enables service-based digital libraries to interoperate. Efforts such as this, in turn, have led to interest in exploiting “light-weight” protocols and open source tools in the community [5]. For example, there is an interesting discussion of tools to implement digital library collections and services in D-Lib Magazine [5].

These efforts, among many across the digital library community including the related National Digital Information Infrastructure and Preservation Program (NDIIPP) at the Library of Congress, represent tremendous efforts to describe digital collections. Clearly, subsets of this learning can be applied directly to UMBEL in relation to registry and dataset metadata.

Almost all of these schemes use XML data serializations. Our investigations to date have not been able to turn up any RDF representations, though they are surely coming. Fortunately, all of the DCCAP-based efforts have an RDF-like design, and most properties have URIs and defined namespaces.

Web Services and Bindings

The Web Services Description Language (WSDL) is an XML-based language for describing how to communicate with, and therefore interoperate with, Web services. WSDL defines a service as accessible Internet endpoints (or ports), supported operations and related messages. It was first proposed to the W3C in March 2001; only recently did WSDL 2.0 become a W3C Recommendation.

A service is a definition of what kinds of operations can be performed and the messages involved. A port is defined by associating a network address with a reusable binding, with a collection of such ports constituting the service. Messages are abstract descriptions of the data being exchanged, and port types are abstract collections of supported operations. The combination of a specific network protocol with a specific message and data format for a particular port type constitutes a reusable binding.

These definitions are kept abstract from any concrete use or instance, enabling the service definition to be reused and to act as a public interface to the service.

Though there are some important differences from datasets (see next subsection), there has now been sufficient use and exposure of WSDL to inform how to construct sufficiently abstract and reusable interfaces and bindings on the Web. Especially useful is the recent WSDL 2.0 and its draft Web Services Description Language (WSDL) Version 2.0: RDF Mapping specification [6].

WSDL, by its use of XML schema, is not well suited to combining vocabularies and definitions. As a supplement, various discussions on data binding, including from Microsoft (with respect to data source objects — DSOs) and from others such as the W3C on XML data bindings, help provide additional perspective [7].

WS-Notification and Topic Maps

Another perspective on this problem comes from efforts to add topic structure to Web services. This standards effort, called WS-Notification, was completed by the OASIS group in late 2006. According to its published standards [8]:

WS-Notification is a family of related specifications that define a standard Web services approach to notification using a topic-based publish/subscribe pattern. It provides standard message exchanges to be implemented by service providers that wish to participate in Notifications, standard message exchanges for a notification broker service provider (allowing publication of messages from entities that are not themselves service providers), operational requirements expected of service providers and requestors that participate in notifications, and an XML model that describes topics. The WS-Notification family of documents includes three normative specifications: WS-BaseNotification, WS-BrokeredNotification, and WS-Topics.

There are some similarities to the binding mechanisms and topic relations required by UMBEL. WS-BaseNotification bears some resemblance to the binding mechanisms portions, and WS-Topics has some relation to the subject requirements. Again, however, the perspective is limited to Web services and has as its intent the general interchange of topic structures, not the use of a proxy reference set.

Why the Need for a ‘Third Way’?

So, we can see that the library community has made much progress in defining how to characterize a digital collection with regard to its source, ownership, scope and nature, while the Web applications community has made much progress with respect to service definitions and binding mechanisms. Some components have direct applicability to UMBEL.

Yet a lightweight dataset binding and subject reference structure for the Web — namely, UMBEL’s intended objective — has a number of very important differences from these other efforts. These differences prevent direct adoption of any current schema. Some of these summary distinctions as they apply to general Web data are:

  • There are a variety of potential registrants for UMBEL datasets including original owners and developers (the case for the other approaches), but also third-parties and consumers
  • Therefore, there is a broader spectrum of possible knowledge and ability to characterize the datasets, which suggests more flexibility, more optional items and fewer mandatory items
  • Relatedly, those performing the characterizations may be untrained in formal metadata and cataloging techniques or may need to rely on automated and semi-automated characterization methods; this increases the risk of error, imprecision and uncertainty
  • The relevant data may or may not be part of a dedicated resource; it may be fragmentary or embedded
  • The source data resides in an extreme range in possible scale, from a single datum (granted, generally of quite limited value) to the largest of online databases
  • The existence of a huge diversity of data formats and protocols (WSDL has a similar challenge)
  • The need to accommodate a choice of many different data serializations (WSDL has a similar challenge)
  • WSDL’s service and endpoint considerations never explicitly accounted for data federation
  • The desire to get all forms into other formats or canonical forms for mashups, true federation, etc.

These differences suggest that the UMBEL ontology needs to be both broader and less prescriptive than other approaches.

Some Initial Design Considerations

We can thus combine the best transferable elements of existing schemes with the unique requirements and perspective of UMBEL. Initially, we are not adopting specific definitions or portions of possibly contributing schema. Rather, in this first cut, we are only attempting to capture the necessary scope and concepts. Later, after definitions and closer inspection of specific schema, we will refine this organization and relate it to particular namespaces.

The major “superclass” in this organization is the:

  • Profile — this definition is similar to that used for the DCCAPs (indeed, it even adopts the “profile” label!), and is closely allied to the idea of Description in WSDL. The Profile represents the broad metadata characteristics of a dataset including ownership, rights and access policies, and other administrative aspects. Generally, a single dataset, no matter what its size or scope, may have a single profile, though a federated knowledge base from multiple sources may contain multiples of these. This class does not include specific details regarding interface format or subject scope. Note that the next classes are themselves subclasses of this Profile

The remaining classes are subsidiary to the Profile and inherit and refer to its metadata. The first two subclasses are also largely administrative in nature:

  • Annotator — this set of properties describes the annotator of the dataset metadata. UMBEL is designed to allow third-parties to describe others’ datasets with optional levels of detail. In the case of disputes, the dataset owner characterizations would hold sway, but there also may be circumstances where multiple characterizations are desirable and allowed
  • Rights — these properties describe the use and access rights for the data. Of course, only the owner may set such conditions, but third parties may provide this characterization if the source site spells out these conditions.

The remaining three classes contain the real guts of the data aspects:

  • Interface — the technical details of the data schema and structure within the dataset (or portions thereof) are defined in the Interface properties. (Interface is similar to the idea of the Interface and portions of the Service classes within WSDL, as with similar analogs for data exchange). The endpoints and access methods for accessing the actual data are by definition part of this Interface class. There is little or no consensus regarding how to classify and organize these details, so that it is likely much of the terminology in this area will be actively discussed and revised. See further [9] for one of the more comprehensive surveys
  • Binding — the Binding properties set the mechanisms for relating the dataset or portions thereof to one or more subject proxies. There may be more than one binding for a given profile or different portions of a dataset
  • SubjectProxy — finally, the SubjectProxy class, representing a likely extension to the core UMBEL for the enumeration of the subject proxies, becomes the linkage to the subject coverage of the datasets.

These classes have a hierarchical relationship similar to the following, with multiple Interface, Binding and SubjectProxy mappings allowable for any given Profile:

Profile
 +--- Annotator
 +--- Rights
 +--- Interface ----- Binding ----- SubjectProxy

Presented below in simple outline form only are these first-proposed classes, and the associated properties and instances of those properties informing the development of the ‘core’ UMBEL ontology. Some definitions of classes are also shown:

Class / subClass [1], with its definition, followed by its property / asPredicate [2] pairs and note references:

Profile: the broad metadata characteristics of a dataset including ownership, rights and access policies, and other administrative aspects; generally only one per dataset, though there will be multiples in a repository
  abstract / hasAbstract
  alternativeTitle / hasAlternativeTitle
  collection / isPartofCollection
  conceptScheme / hasConceptScheme [23]
  crawling / hasCrawlingPolicy
  dateSubmitted / hasSubmittedDate
  description / hasDescription
  language / hasLanguage [3]
  location / hasLocation
  modified / wasModifiedOn
  namespace / hasNamespace
  ontology / hasOntology
  owner / hasOwner
  registry / isListedOnRegistry [4,5]
  sitemap / hasSitemap
  size / hasSize [6]
  title / hasTitle
  type / isOfType [7]
  version / hasVersion
  view / isBestViewedUsing [8]

Annotator [21]: description of the entity that has provided the current Profile description (may be third parties, but with deference to the owner when there are differences)
  annotationDate / hasAnnotationDate
  annotationNote / hasAnnotationNote
  annotatorLocator / hasAnnotatorLocator [9]
  annotatorName / hasAnnotatorName
  annotatorType / isAnnotatorType [5,27]

Binding: the linkage made between the set or subsets of data within the datasets and the actual subject proxy(ies); may be multiples for a given dataset
  about / isAbout [10,24] (cross-reference to the actual subject proxy IDs; may be multiples)
  bindingName / hasBindingName [10]
  bindingScope / hasBindingScope [5,12]
  bindingType / hasBindingType [5,11]

Interface [13]: the technical characteristics of the dataset that provide the essential information for enabling retrieval and interoperability; analogous to Interface in WSDL
  bindingName / hasBindingName [10]
  dataFormalism / hasDataFormalism [5,14]
  endpointLocation / hasEndpointLocation
  endpointType / hasEndpointType [5,15]
  pattern / usesPattern [16]
  pingType / hasPingType [17]
  queryLanguage / usesQueryLanguage [5,18]
  serialization / hasSerialization [5,19]
  translator / usesTranslator [20]

Rights [21]: various rights and restrictions to accessing, using or reproducing the subject data
  accessRight / hasAccessRight
  copyright / hasCopyright
  license / hasLicense
  rightsNote / hasRightsNote

SubjectProxy [22]: a preferred label that acts as a proxy to the topic concept(s) to which the given dataset content is bound; may be multiples, and backed with ‘synset’ synonyms
  altLabel / isAlternateLabel [23]
  bindingName / hasBindingName [10]
  prefLabel / isPreferredLabel [23]
  primarySubject / isPrimarySubjectOf [23]
  proxyID / hasProxyID [25]
  subjectLanguage / hasSubjectLanguage [28]
  subjectNote / hasSubjectNote [26]

General table notes are provided under the endnotes [10].

Please note that the specific subject proxies and their defining classes and properties are being handled in a separate document. This outline, as it is revised, is informing the first N3 version of the ‘core’ UMBEL ontology.
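
To give a flavor of where this is headed, here is a minimal, purely illustrative N3 sketch of what a single dataset registration might look like using the class and property names from the outline above. The namespace URI, instance names and literal values are all hypothetical placeholders, not part of any agreed specification:

@prefix umbel: <http://umbel.org/core#> .
@prefix : <#> .

:exampleProfile
  a umbel:Profile ;
  umbel:hasTitle "Sweet Tools" ;
  umbel:hasOwner "Michael Bergman" ;
  umbel:hasDescription "Hypothetical registration of a semantic Web tools listing, shown for illustration only" .

:exampleInterface
  a umbel:Interface ;
  umbel:hasDataFormalism "RDF" ;
  umbel:hasSerialization "N3" ;
  umbel:hasEndpointType "fileExport" .

:exampleBinding
  a umbel:Binding ;
  umbel:hasBindingType "HTTP" ;
  umbel:isAbout :exampleSubjectProxy .

:exampleSubjectProxy
  a umbel:SubjectProxy ;
  umbel:isPreferredLabel "semantic Web tools" .

Whether such properties attach directly to subclass instances as sketched here, or hang off a single Profile node, is exactly the kind of detail still under discussion.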

This structure is still quite preliminary. (For example, data type definitions and interface constructs are still in active discussion, without accepted standards.) Comments on this draft UMBEL ‘core’ ontology outline are welcomed either at the UMBEL discussion forum on Google or at the specific outline page on the UMBEL wiki.

Revisiting the ‘Lightweight’ Designation

We can thus see that there is only minimal semantics in the potential linkage between UMBEL datasets.

One way to place this system is through an interesting approach called the Levels of Conceptual Interoperability Model. These levels can be viewed through the following conceptual diagram [11]:

[Figure: Levels of Conceptual Interoperability Model]

Under this model, UMBEL resides right at the interface between Levels 2 and 3, where syntactic interoperability is achieved but with only limited semantic understanding. In fact, this represents a clear analog to AI3‘s discussion of the structured Web, which is very much concerned with the syntactic level, with the negotiation of semantics as the next challenge.

This posting is part of a new, occasional series on the Structured Web.

[1] The UMBEL ontology has two parts. The first ‘core’ part is a flat listing, or pool, of concrete subject topics that are the proxy binding points for external data sets. The second ‘unofficial’ part is a reference look-up structure of hierarchical and interlinked subject relationships.

[2] The “Dublin” in the name refers to Dublin, Ohio, where the work originated from an invitational workshop hosted in 1995 by the Online Computer Library Center (OCLC), a library consortium that has its headquarters there. The “Core” refers to the fact that the metadata element set is a basic but expandable “core” list, used in a similar way to the UMBEL ‘core’.

[3] There are several large registries of OAI-compliant repositories: the OAI registry at the University of Illinois at Urbana-Champaign, the Open Archives list of registered OAI repositories, the Celestial OAI registry, EPrints’ Institutional Archives Registry, Openarchives.eu (the European guide to OAI-PMH compliant repositories in the world), and ScientificCommons.org (a worldwide service and registry).

[4] The Standards Committee BB (Task Group 2): Collection & Service Descriptions, NISO Z39.91-200x, Collection Description Specification, November 2005. It also specifies an XML binding for serializing such descriptions for interchange between applications.

[5] Xiaorong Xiang and Eric Lease Morgan, “Exploiting ‘Light-weight’ Protocols and Open Source Tools to Implement Digital Library Collections and Services,” D-Lib Magazine 11(10), October 2005. See http://www.dlib.org/dlib/october05/morgan/10morgan.html. Also see its MyLibrary reference for examples of facets applied to collections.

[6] Jacek Kopecký, Ed., Web Services Description Language (WSDL) Version 2.0: RDF Mapping, W3C Working Group Note, 26 June 2007. See http://www.w3.org/TR/wsdl20-rdf.

[7] Data binding from the Microsoft perspective is described at http://msdn2.microsoft.com/en-us/library/ms531387.aspx. The W3C’s perspective on XML data binding is described, for example, in Paul Downey, ed., XML Schema Patterns for Common Data Structures; see http://www.w3.org/2005/07/xml-schema-patterns.html.

[8] Also, there is an entire corpus related to topic maps. In specific reference to Web services, there is the so-called WS-Topics: Web Services Topics 1.3 (WS-Topics), OASIS Standard, 1 October 2006; see http://docs.oasis-open.org/wsn/wsn-ws_topics-1.3-spec-os.htm. This is used in conjunction with WS-Notification. Also see WS-BaseNotification: Web Services Base Notification 1.3 (WS-BaseNotification), OASIS Standard, 1 October 2006; see http://docs.oasis-open.org/wsn/wsn-ws_base_notification-1.3-spec-os.htm. PDFs of these documents are also available.

[9] For an initial introduction with a focus on Xcerpt, see François Bry, Tim Furche, and Benedikt Linse, “Let’s Mix It: Versatile Access to Web Data in Xcerpt,” in Proceedings of the 3rd Workshop on Information Integration on the Web (IIWeb 2006), Edinburgh, Scotland, 22nd May 2006; also as REWERSE-RP-2006-034, see http://rewerse.net/publications/download/REWERSE-RP-2006-034.pdf. For a more detailed treatment, see T. Furche, F. Bry, S. Schaffert, R. Orsini, I. Horrocks, M. Krauss, and O. Bolzer, Survey over Existing Query and Transformation Languages, Deliverable I4-D1a Revision 2.0, REWERSE, 225 pp., April 2006. See http://rewerse.net/deliverables/m24/i4-d9a.pdf.

[10] Here are the general table notes:

[1] SubClasses have the advantage of inheritance and shared metadata; see main text for full subClass path
[2] hasPredicate is actually my preferred format; thoughts?
[3] Standard ISO languages
[4] extendable base listing including NISO, IESR, DLSR, DCMI, etc.; need completion
[5] uses the idea of an extended base class per XML Schema; the enumerated listings thus only need be partially complete; see below for most listings
[6] number of records or TBD metric?
[7] possibly unnecessary; can not see enumeration of profile Types
[8] related to idea of Fresnel or Zitgist “preferred” viewing format or XSLT-type stylesheet
[9] should there be other types than FOAF (what of other formal listings or organizations v. individuals?)?
[10] not sure how to do this; need a x-ref between two class categories (e.g., Binding <-> Interface, Binding <-> SubjectProxy)
[11] are there patterns for bindingTypes or a likely enumerated listing?

HTTP
JDBC
ODBC
RPC
RPI
SOAP
XML-RPC

[12] need an enumerated list (?) going from individual annotation / metadata (a la RDFa) to complete dataset

webPage
dataSet
dataRecords

[13] perhaps not best name; related to Interface and Services in WSDL, could also be called Construct, Composition, others
[14] see possible dataFormalisms below; needs completion; could be named differently (schema, format, etc.)

Atom
eRDF
Microformats
OPML
Other (unspecified)
Other Ontology
OWL DL
OWL Full
OWL Lite
RDF
RDFa
RDF-S
RSS
Spreadsheet
Topic Map
WebPage
XFML

[15] see possible endpointTypes below

dropdownList
fileExport
other Query formats ???
queryBox
SPARQL

[16] patterns are fairly prominent in WSDL and XML Schema; applicable here?
[17] need to discuss
[18] see possible queryLanguages below; needs completion

DQL
IR (standard text search)
N3QL
R-DEVICE
RDFQ
RDQ
RDQL
RQL
SeRQL
SPARQL
SQL
Versa
Xcerpt
XPath
XPointer
XQuery
XSLT
XUL

[19] see possible serializations below; needs completion

Atom
Gbase
html
JSON
JSON-P
N3
RDF/A
RDF/XML
Turtle
XML

[20] GRDDL, RDFizers, and various converters/translators; likely needs an expandable, enumerated list
[21] unlike the other subClasses, this is closely aligned with the standard Profile metadata
[22] likely a separate namespace (e.g., ‘umbels’) that will contain additional information such as synsets, etc. See text.
[23] SKOS concept
[24] need to check on overlap/replacement/use with SKOS subjectIndicator property
[25] should names or IDs be used for subjectProxys? (IDs have the advantage of changing labels and use in other languages)
[26] seems similar or identical to SKOS scopeNote
[27] see possible annotatorTypes below; needs completion

archivist
bot
owner
repository
third-party

[28] similar to dc:language, but the language of the resource (its metadata characterization) must be kept separate from that of the actual subject proxies

[11] A. Tolk, S.Y. Diallo, C.D. Turnitsa and L.S. Winters, “Composable M&S Web Services for Net-centric Applications,” Journal for Defense Modeling & Simulation (JDMS), Volume 3, Number 1, pp. 27-44, January 2006.

Posted: July 24, 2007

Huynh Adds to His Winning Series of Lightweight Structured Data Tools

David Huynh, a Ph.D. student and developer par excellence from MIT’s Simile program, has just announced the beta availability of Potluck. Potluck allows casual users to mash up data on the Web using direct manipulation and simultaneous editing techniques, generally (but not exclusively!) based on Exhibit-powered pages.

Besides Potluck and Exhibit, David has also been the lead developer on such innovative Simile efforts as Piggy Bank, Timeline, Ajax, Babel, and Sifter, as well as a contributor to Longwell and Solvent. Each merits a look. Those familiar with these other projects will notice David’s distinct interface style in Potluck.

Taking Your First Bites

There is a helpful 6-min movie on Potluck that gives a basic overview of use and operation. I recommend you start here. Those who want more details can also read the Potluck paper in PDF, just accepted for presentation at ISWC 2007. And, after playing with the online demo, you can also download the beta source code directly from the Simile site.

Please note that Firefox is the browser of choice for this beta; Internet Explorer support is limited.

To invoke Potluck, you simply go to the demo page, enter two or more appropriate source URLs for mashup, and press Mix Data:

[Screenshot: the Potluck entry screen]

(You can also get to the movie from this page.)

Once the datasets are loaded, all fields from the respective sources are rendered as field tags. To combine different fields from different datasets, the respective field tags (color coded by dataset) to be matched are simply dragged to a new column. Differences in field value formats between datasets can be edited with an innovative approach to simultaneous group editing (see below). Once fields are aligned, they then may be assigned as browsing facets. The last step in working with the Potluck mashup is choosing either tabular or map views for the results display.

Potluck is designed to mashup existing Exhibit displays (JSON format), and is therefore lightweight in design. (Generally, Exhibit should be limited to about 500 data records or so per set.)

However, with the addition of the appropriate type name when specifying one of the sources to mash up, you can also use spreadsheet (xls), BibTeX, N3 or RDF/XML formats. The demo page contains a few sample data links. Additional sample data files for different mime types are (note entry using a space with type designator at end):

Besides the standard tabular display, you can also map results. For example, use the BibTeX example above and drop the “address” field into the first drop target area. Then choose Map at the top of the display to get a mapping of conference locations.

In my own case, I mashed up this source and the xls sample on presidents, and then plotted out locations in the US:

[Screenshot: an example Potluck map]

Given the capabilities in some of the other Simile tool sets, incorporating timelines or other views should be relatively straightforward.

Pragmatic Lessons and Cautions with Semantic Mashups

Different datasets name similar or identical things differently and characterize their data differently. You can’t combine data from different datasets without resolving these differences. These various heterogeneities — which by some counts can be 40 or so classes of possible differences — were tabulated in one of my recent structured Web posts.

There has been considerable discussion in recent days on various ontology and semantic Web mailing lists about whether certain practices can or cannot solve questions of semantic matching. Some express sentiments that proper use of URIs, use of similar namespaces and use of some predicates like owl:sameAs may largely resolve these matters.

However, discussion in David’s ISWC 2007 paper and use of the Potluck demo readily show the pragmatic issues in such matches. Section 2 in the paper presents a readable scenario for real-world challenges in how a historian without programming skills would go about matching and merging data. Despite best practices, and even if all are pursued, actually “meshing” data together from different sources requires judgment and reconciliation. One of the great values of Potluck is as a heuristic and learning tool for making prominent these real-world semantic heterogeneities.

The complementary value of Potluck is its innovative interface design for actually doing such meshing. Potluck is a case argument that pragmatic solutions and designs only come about by just “doing it.”

Easy, Simultaneous Editing

(Note: Though a diagram illustrates some points below, it is no substitute for using Potluck yourself.)

Potluck uses a simple drag-and-drop model for matching fields from different datasets. In the left-hand oval in the diagram below, the user clicks on a field name in a record, drags it to a column, and then repeats that process for matching fields in a record of a different dataset. In the instance below, we are matching the field names “address” and “birth-place”, which then also get color coded by dataset:

[Screenshot: Potluck's simultaneous group editing]

This process can be repeated for multiple field matches. The merged fields themselves can be subsequently dragged-and-dropped to new columns for renaming or still further merging.

The core innovation at the heart of Potluck is what happens next. By clicking on Edit for any record in a merged field, the dialog shown above pops up. This dialog supports simultaneous group editing based on LAPIS, another MIT tool for editing text with lightweight structure, developed by Rob Miller and team.

As implemented in JavaScript in Potluck, LAPIS first groups data items by similar patterned structure; this initial grouping is what determines the various columns in the above display. Then, when the user highlights any pattern in a column, these are repeated (see same cursors and shading in the right-hand oval) for all entries in the column. They can then be deleted (for pruning, in this case removing ‘USA’), or cut-and-pasted (such as for changing first- and last-name order) for all items in a column. (Single item editing is obviously also an option.)

The first grouping mostly ensures that data formatted differently in different datasets are displayed in their own columns. One data form is used for the merged field, and all other columns are group edited to conform. The actual patterns are based on runs of digits, letters, white space, or individual punctuation marks and symbols, which are then greedily aligned, first for the column grouping and then for cursor alignment within columns on highlighted patterns.

The net result is very fast and efficient bulk editing. This approach points the way to more complicated pattern matches and other substitution possibilities (such as unit changes or date and time formats).

Rough Spots and A Hope

I was tempted to award Potluck one of AI3‘s Jewels and Doubloons Awards, but the tool is still premature, with rough spots and gaps. For example, IE and other browser support needs to be improved, and it would be helpful to be able to delete a record from inclusion in the mashup. (Sometimes only after combining is it clear that some records don’t belong together.)

One big issue is that the system does not yet work well with all external sites. For example, my own Sweet Tools Exhibit refused to load and the one from the European Space Agency’s Advanced Concept Team caused JavaScript errors.

Another big issue is that whole classes of functionality, such as writing out combined results or more data view options, are missing.

Of course, this code is not claimed to be commercial grade. What is most important is its pathbreaking approach to semantic mashups (actually, what some others such as Jonathan Lathem have called ‘smashups’) and interfaces and approaches to group editing techniques.

I hope that others pick up on this tool in earnest. David Huynh is himself getting close to completing his degree and may not have much time in the foreseeable future to continue Potluck development. Besides Potluck’s potential to evolve into a real production-grade utility, I think its potential to act as a learning test bed for new UI approaches and techniques for resolving semantic heterogeneities is even greater.

Posted: July 22, 2007

Language is Essential to Communication

I recently began a series on the structured Web and its role in the continued evolution of the Internet. This next installment in the series probes in greater depth the question of What is structure? in reference to data and Web expressions, with an emphasis on terminology and definitions.

This post ties in with a new best-practices guide published by Chris Bizer, Richard Cyganiak, and Tom Heath — called the Linked Data Publishing Tutorial — that provides definitions and viewpoints from the perspective of the use of RDF (Resource Description Framework) and W3C practices. My initial post in this series and their tutorial occasioned Kingsley Idehen to post his own Linked Data and the Web Information BUS entry, adding the valuable perspective of practices and terminology going back to the early 1990s in object and relational database systems and standards such as ODBC.

All of these efforts share a desire to craft practices, language and terminology to help promote the availability and interoperability of useful data on the Web.

A challenge for the semantic Web community is to craft language that is clear and understandable to the lay public and Web developers, something which it has often done poorly. Some problematic terms include information resource, non-information resource, dereferencing, bnode, content negotiation, representation, URL re-writing, and others.

This piece only tackles ‘non-information resource‘ head on. Discussion of other problematic terms awaits another day.

These posts caused Kingsley and me to engage in a prolonged discussion about definitions and terms. I acted as the unofficial scribe, which I attempt to more generally capture and argue below. If you like the ideas below, you may credit both of us; if you don’t, ascribe any errors or omissions to me alone [1].

The Basic Framework and Argument

The basic observation related to the structured Web is that it is a transition phase from the initial document-centric Web to the eventual semantic Web. In this transitional phase, the Web is becoming much more data-centric. The idea of ‘linked data‘ also is a component of this transition, but is more precise in meaning because by definition the data must be expressed as RDF in order to aid interoperability.

The challenge is to convert existing Web pages and data into the structured Web with every resource accessible via an unambiguous URI. Insofar as this conversion also occurs to RDF, it will promote linked data interoperability.
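
A trivial N3 fragment may help make the idea concrete (the URIs and values below are invented solely for illustration): each thing of interest gets its own URI, statements about it are expressed as triples, and links to other URIs are what make the data “linked”:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://example.org/people/jane#me>
  a foaf:Person ;
  foaf:name "Jane Example" ;
  foaf:knows <http://other.example.net/people/joe#me> ;
  rdfs:seeAlso <http://example.org/projects/structured-web> .

In the linked data practice described in the Bizer, Cyganiak and Heath tutorial, dereferencing each of those URIs should in turn return useful RDF about that resource, which is what allows such fragments from different sites to be crawled and joined.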

Transitions of such a profound nature — even if short of the full vision of the semantic Web — create the need for new language and terminology to aid understanding and communication. Sometimes, as well, longstanding terms and practices may need to be refined or challenged. In any event, notions of simplistic versioning such as ‘Web 3.0‘ add little to understanding and communication.

Let’s Begin with Data Structure

Independent of the Internet or the Web, let’s begin our discussion about the nature of structure in its application to data [2,3]. Peter Wood, a professor of computer science at Birkbeck College at the University of London, provides succinct definitions of the “structure” of various types of data [4]:

  • Structured Data — in this form, data is organized in semantic chunks or entities, with similar entities grouped together in relations or classes. Entities in the same group have the same descriptions (or attributes), while the descriptions for all entities in a group (the schema): a) have the same defined format; b) have a predefined length; c) are all present; and d) follow the same order. Structured data is what is normally associated with conventional databases, such as relational transactional ones, where information is organized into rows and columns within tables. Spreadsheets are another example. Nearly all conventional database management systems (DBMSs) are designed for structured data
  • Unstructured Data — in this form, data can be of any type, does not necessarily follow any format or sequence, does not follow any rules, is not predictable, and can generally be described as “free form.” Examples of unstructured data include text, images, video or sound (the latter two also known as “streaming media”). Generally, “search engines” are used for retrieval of unstructured data via querying on keywords or tokens that are indexed at the time of data ingest, and
  • Semi-structured Data — the idea of semi-structured data predates XML but not HTML (with the actual genesis better associated with SGML; see below). Semi-structured data is intermediate between the two forms above, wherein “tags” or “structure” may be associated with or embedded within unstructured data. Semi-structured data is organized in semantic entities, similar entities are grouped together, entities in the same group may not have the same attributes, the order of attributes is not necessarily important, not all attributes may be required, and the size or type of some attributes in a group may differ. To be organized and searched, semi-structured data should be provided electronically from database systems, file systems (e.g., bibliographic data, Web data) or via data exchange formats (e.g., EDI, scientific data, XML).

We can thus view data structure as residing on a spectrum (also shown with “typical” storage and indexing frameworks on the top line):

For the past few decades, structured data has typically been managed by database management systems (DBMSs) or spreadsheets, while unstructured data has been handled by the text indexing systems used by search engines, or has simply sat unindexed in file systems or repositories.

Semi-structured data models are sometimes called “self-describing” (or schema-less) [5]. The first known definition of semi-structured data comes from Peter Schäuble in 1993 [6]: “We call a data collection semistructured if there exists a database scheme which specifies both normalized attributes (e.g., dates or employee numbers) and non-normalized attributes (e.g., full text or images).” More current usage (see the Wood definition above) also includes the notion of labeled graphs or trees with the data stored at the leaves and the schema information contained in the edge labels of the graph. Semi-structured representations also lend themselves well to data exchange or the integration of heterogeneous data sources.
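To make this spectrum concrete, here is a small illustrative sketch of my own (the record, its field names and its values are entirely hypothetical) showing the same kind of information expressed in each of the three forms, using Python literals:

# Structured: a fixed schema where every record has the same ordered, typed attributes
structured_row = ("emp-1001", "Jane N. Smith", "2007-07-22")  # (id, name, hire_date)

# Semi-structured: self-describing; attributes are labeled but optional and unordered
semi_structured_record = {
    "name": "Jane N. Smith",
    "hired": "2007-07-22",
    "notes": "transferred from the Salem office",  # extra attribute other records may lack
}

# Unstructured: free-form text; any structure must be extracted rather than declared
unstructured_text = "Jane N. Smith, who transferred from Salem, was hired on July 22, 2007."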

HTML tags within Web documents are a prime example of semi-structured data, as are text-based data transfer protocols or serializations [7]. Semi-structured data is also a natural intermediate form when “structure” is to be extracted from standard text through techniques generally called ‘information extraction’ (IE). For example, here is the kind of structure (highlighted in yellow in the original posting) that might be extracted from a death notice or obituary [8]:

John A. Smith of Salem, MA died Friday at Deaconess Medical Center in Boston after a bout with cancer. He was 67. Born in Revere, he was raised and educated in Salem, MA. He was a member of St. Mary’s Church in Salem, MA, and is survived by his wife, Jane N., and two children, John A., Jr., and Lily C., both of Winchester, MA. A memorial service will be held at 10:00 AM at St. Mary’s Church in Salem.

This notice contains a great deal of information including names, places, dates and relationships, which once extracted, can be separately indexed or manipulated. Virtually all text-based documents can have similar structure extracted.
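As a purely hypothetical sketch of how such extraction might begin, the following Python fragment applies a few crude regular-expression rules to the opening sentences of the notice; real IE engines rely on much richer grammars, gazetteers and statistical models, but the principle of lifting entities out of free text into semi-structured form is the same:

import re

notice = ("John A. Smith of Salem, MA died Friday at Deaconess Medical Center "
          "in Boston after a bout with cancer. He was 67.")

entities = {
    "places": re.findall(r"\b[A-Z][a-z]+, [A-Z]{2}\b", notice),           # "City, ST" patterns
    "ages": re.findall(r"\bwas (\d{1,3})\b", notice),                     # "was NN" patterns
    "names": re.findall(r"\b[A-Z][a-z]+ [A-Z]\. [A-Z][a-z]+\b", notice),  # "First M. Last" patterns
}

print(entities)
# {'places': ['Salem, MA'], 'ages': ['67'], 'names': ['John A. Smith']}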

The Web has been a prime source of growth of semi-structured data, most often through text-based data serializations [9] and mark-up languages [10].

Depending on point of view or definition, RDF can be called either semi-structured or structured data. It resides squarely at the transition point between these two categories on this structural continuum.

A Variety of Web ‘Resources’

A couple of Web concepts often cause confusion and difficulty for some users: 1) Uniform Resource Locators (URLs) v. Uniform Resource Identifiers (URIs); and 2) the concept of “resources” themselves. As it happens, a pretty accessible discussion of URIs was recently posted, which I recommend. Here, instead, we’ll focus on “resources.”

The concept of a resource is basic to the Web’s architecture and is used in the definition of its fundamental elements, including, obviously, URL and URI above. The parlance of the semantic Web, as well, is built around the notion of abstract resources and their semantic properties. The data model and languages of the Resource Description Framework (RDF) squarely revolve around this “resource” concept.

The first explicit definition of resource related to the Web is found in RFC 2396 in August 1998:

A resource can be anything that has identity. Familiar examples include an electronic document, an image, a service (e.g., “today’s weather report for Los Angeles”), and a collection of other resources.

However, in the context of the Internet, not all resources so defined (such as a person or a company) can be retrieved, while electronic resources like an image or Web page can. Thus, a first challenge arises in the concept of resource and its locational address: some are actual and physical, others are abstract or referential.

According to Wikipedia’s discussion of resources:

The concept of resource has evolved during the Web’s history, from the early notion of a static addressable document or file, to a more generic and abstract definition, now encompassing every thing or entity that can be identified, named, addressed or handled, in any way whatsoever, in the Web at large, or in any networked information system. The declarative aspects of a resource (identification and naming) and its functional aspects (addressing and technical handling) were not clearly distinct in the early specifications of the Web . . . .

The need to somehow make this distinction between actual or physical resources vs. abstract or referential resources led to much discussion and controversy sometimes known as the httpRange-14 issue [11], resolved by the W3C’s Technical Architecture Group (TAG) in 2005. The TAG defined a distinction between two resource types:

  • information resource — this category pertains to the original sense of resources on the Web, such as files, documents or other resources to which a URL can be assigned (including images and non-document media). Though it has been proposed that such resources be designated with “slash” URIs, such as http://www.example.com/home/index.html, this is not enforceable and some resources in the next category do not adhere. A “slash” URI, however, still is “better practice” (if not “best practice”) for such traditional resources. Note that this category of information resource is very much in keeping with the nature of the early, document-centric Web
  • non-information resource — this category is the new one to deal with all of the “abstract” cases such as classes, etc.; this category is especially important to the data-centric Web. As with the other resource category, it was proposed to use “pound sign” fragment identifiers of the kind used by anchor tags, such as http://www.example.com/data#bigClass, to signal this different resource type, but, again, this is not enforceable and at most could be “better practice.”

A successful HTTP request for an information resource results in a 200 message (“OK”, followed by transfer) from the Web server; if the request is for a resource that the Web server recognizes as existing but of the wrong type requested, the publisher can use the 303 redirect response to provide the correct URI [12].
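As a minimal sketch of this convention from the consumer’s side (the URI is made up, and the use of the Python requests library is my own illustration rather than anything specified by the W3C), a client could inspect the raw status code before following any redirect:

import requests

def classify(uri):
    # Ask for the resource but do not follow redirects, so the raw status code is visible
    resp = requests.get(uri, allow_redirects=False, timeout=10)
    if 200 <= resp.status_code < 300:
        return "information resource (a representation was returned directly)"
    if resp.status_code == 303:
        # The 303 target points to a resource describing the (possibly abstract) one requested
        return "could be any resource; description at " + resp.headers.get("Location", "?")
    return "nature of the resource is unknown (HTTP %d)" % resp.status_code

print(classify("http://www.example.com/id/bigClass"))  # hypothetical URI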

These resource distinctions are very, very unfortunate in their labeling, if not on more fundamental grounds. It is non-sensical to call one category “information” and the other not, when everything is informational. Moreover, the distinctions bring absolutely no clarity. (Important note: actually, the provenance of the term ‘non-information resource‘ appears to be quite recent, as well as wrong and unfortunate [13]).

However, getting standards bodies to change labels is a long and uncertain task. The approach taken below is to stick with the ‘information resource‘ term, but to provide the alias of ‘structured data resource‘ in place of ‘non-informational resource‘ and to add some additional sub-category distinctions [14].

The Relation Between Resources and Structure

Even though a Web ‘resource‘ has an address scheme and other requirements, these are details and specifics that can mask the fundamental purpose of a resource to act as a “container” for encapsulating information of some sort. This encapsulation is what enables access to its “payload” information (Kingsley’s term) on the Web or the broader Internet via standard protocols (HTTP and TCP/IP). The mechanisms of the encapsulation constitute the details and specifics.

Inherent to the transition from the document Web to the structured Web is the increased importance of that most confusing category: the non-information resource. That is because, like fragment identifiers, we are talking about objects more granular than (subsidiary to) document-level resources, and because we are now referencing structure that includes such “abstract” notions as classes, properties, types, namespaces, ontologies, schema, etc.

One paradox in all of this is that the very category of resource designed to deal with these data issues is itself called by many a ‘non-information resource‘. This term is non-sense [13]. We use instead ‘structured data resource.’

There is much that needs to be cleaned up and clarified over time regarding uses and nomenclature for resources. However, from the standpoint of the structured Web, we can probably for the time being concentrate on the items in this table:

Information Resource (current standard Web term): unstructured and semi-structured data
  • Document Resource: text and markup within standard Web pages
  • ‘Other’ Resources: non-text resources with a URL (images, streaming media, non-text indexable files)

Non-information Resource, aka Structured Data Resource (current standard Web term; non-sensical) [see text; 13]: semi-structured and structured data
  • Structured Data: all non-RDF data-oriented resources, including non-RDF namespaces, etc.
  • Linked Data: all RDF

Note that the two main resource categories used in practice are maintained. The ‘information resource’ category retains its traditional understanding. From the standpoint of the structured Web, the document resources are the unstructured and semi-structured data content from which information extraction (IE) techniques and software can extract the eventual structure.

The category of document resource likely represents the majority of potentially useful structural content on the Web and is most often overlooked in discussions of linked data or the semantic Web. This content, if subjected to IE and therefore structure creation, then becomes a URI resource better handled as a ‘structured data resource.’

‘Structured data resources‘ (that is, the poorly labeled ‘non-information resources‘) are the building blocks for the structured Web. In all cases, these resources are either semi-structured or structured data. There are two sub-categories of resources in this category, differentiated by whether the structured data is expressed in RDF (or RDF-based languages) or not. All RDF variants are called ‘linked data‘; all other forms are termed ‘structured data.’

For most general purposes, putting aside nuances and subtle technicalities, the best shorthand for thinking about these resource distinctions is simply Documents v. Data.

Thus, we can see the path to the structured Web taking a number of different branches.

The first branch, and the one necessary for the largest portion of content, is to use a combination of IE techniques to extract entity information from unstructured text or to use structure extraction on the semi-structure of the document [15] to create the structured data resource for that document. Of all variants, this is the longest path, the one least developed, but one with potentially great value.

The second branch is to publish the structured data directly as a resource or to provide access to it through a Web service or API. This is the basis for most of the structured data resources presently available on the Web. (It is also the outcome of IE for the first branch.)

The third branch is really just a variant that completes the other two — ensuring that the structured data resource is available as interoperable RDF linked data. There are two ways to proceed down this branch. One way is for the publisher to create and post the resource directly in a form of RDF. (Though the actual data can be serialized in a variety of ways such as RDF/XML, N3, Atom or Turtle, conversions between these forms are relatively straightforward.) The other way, less direct, is for the publisher or a third party to convert non-RDF structured data into RDF with the rich and growing list of available ‘RDFizers’ [16].
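As a small sketch of that last conversion step, assuming the Python rdflib library and an entirely hypothetical namespace and record, a non-RDF structured record might be ‘RDFized’ along these lines:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://www.example.com/id/")    # hypothetical namespace
FOAF = Namespace("http://xmlns.com/foaf/0.1/")  # the well-known FOAF vocabulary

# A non-RDF structured record, e.g. one row from a spreadsheet or database table
record = {"id": "john-a-smith", "name": "John A. Smith", "town": "Salem, MA"}

g = Graph()
person = EX[record["id"]]
g.add((person, RDF.type, FOAF.Person))
g.add((person, FOAF.name, Literal(record["name"])))
g.add((person, EX.town, Literal(record["town"])))

# The same data, now expressed as linked data (RDF serialized as Turtle)
print(g.serialize(format="turtle"))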

The Web in Transition

This material, plus the earlier introduction to the structured Web, can now be brought together as a picture of the Web in transition. While there are no real beginning and end points, there is a steady progression from a document-centric Web to one that is data-centric, including the mediation of semantics:

Transition in Web Structure

Document Web (circa 1993)
  • Document-centric
  • Document resources
  • Unstructured data and semi-structured data
  • HTML
  • URL-centric

Structured Web (circa 2003)
  • Data-centric
  • Structured data
  • Semi-structured data and structured data
  • XML, JSON, RDF, etc.
  • URI-centric

Linked Data (circa 2006)
  • Data-centric
  • Linked data
  • Semi-structured data and structured data
  • RDF, RDF-S
  • URI-centric

Semantic Web (circa ???)
  • Data-centric
  • Linked data
  • Semi-structured data and structured data
  • RDF, RDF-S, OWL
  • URI-centric

The basic argument of this series is that we are in the midst of a transition phase — the structured Web — that marks the beginning of the dominance of data on the Web. In its broadest definition, the structured Web has many different data forms and serializations. A subset of the structured Web — namely, linked data — is a direct precursor to the semantic Web with its emphasis on RDF and data interoperability and services.

Another argument of this series is that — despite the promise of linked data — structured data resources in many forms will co-exist and provide alternatives. This diversity is natural. For RDF and linked data advocates, tools are now largely in place to convert these diverse forms. Though the path to large-scale availability of RDF data appears clear, the longer-term prospects for mediating heterogeneous semantics remain cloudy.

Brief Glossary

To re-cap, and to aid language and understanding, here is a brief glossary of the key terms used in this discussion:

  • dereferencing — the act of locating and transmitting structured data, in a requested format (representation), exposed by a URL
  • document resource — a Web resource designated by a URL that contains unstructured and semi-structured data
  • information extraction — any of a variety of techniques for extracting structure and entities from content
  • information resource — any Web resource that can be retrieved via a URL
  • linked data — a structured data resource in RDF that can be obtained via a URI
  • non-information resource — this is “any resource” that is not an information resource; preference is to deprecate its use and use structured data resource in its stead
  • semantic Web — the Web of data with sufficient meaning associated with that data such that a computer program may learn enough to meaningfully process it
  • semi-structured data — data that includes semantic entities that may have attributes or groups, but which are not all required or presented in a set order or in a set size or type; may be embedded or interspersed in unstructured data
  • structured data — data organized into semantic chunks or entities, with similar entities grouped together in relations or classes, and presented in a patterned manner
  • structured data resource — a resource of semi-structured or structured data that can be obtained via a URI; the preferred term in place of ‘non-information resource’
  • structured Web — the data-centric Web of structured data resources and structured data in various forms; formally defined as, object-level data within Internet documents and databases that can be extracted, converted from available forms, represented in standard ways, shared, re-purposed, combined, viewed, analyzed and qualified without respect to originating form or provenance
  • unstructured data — data that can be of any type and does not necessarily follow any format, sequence or rules; can generally be described as “free form;” includes text, images, video or sound
  • URI — uniform resource identifier
  • URL — uniform resource locator
This posting is part of a new, occasional series on the Structured Web.

[1] You will note a heavy emphasis on Wikipedia definitions in keeping with Web usage.

[2] Of course, the word ‘structure‘ has a broad range of meanings; we are only concerned here about data structure and its specific applicability to Web-related information.

[3] Much of the discussion in this sub-section is derived from an earlier AI3 posting, Semi-structured Data: Happy 10th Birthday!, from November 2005; some of the original information is a bit dated. It is also aided by Kingsley’s Structured Data v. Unstructured Data posting from June 2006. That posting notes the frequent confusion and ambiguity between the terms “structured data” and “unstructured data,” and the importance when speaking of structure to keep separate the structure of the data itself (the focus herein), the structure of the container that hosts the data, and the structure of the access method used to access the data.

[4] Peter Wood, School of Computer Science and Information Systems, Birkbeck College, the University of London. See http://www.dcs.bbk.ac.uk/~ptw/teaching/ssd/toc.html.

[5] The earliest known recorded mention of “semi-structured data” occurred in 1992 from N. J. Belkin and Croft, W. B., “Information filtering and information retrieval: two sides of the same coin?,” in Communications of the ACM: Special Issue on Information Filtering, vol. 35(12), pp. 29 – 38, with the first known definition from [6]. The next two mentions were in 1995 from D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman and J. Widom, “Querying semistructured heterogeneous information,” presented at Deductive and Object-Oriented Databases (DOOD ’95), LNCS, No. 1013, pp. 319-344, Springer, and M. Tresch, N. Palmer, and A. Luniewski, “Type classification of semi-structured data,” in Proceedings of the International Conference on Very Large Data Bases (VLDB). However, the real popularization of the term “semi-structured data” occurred through the seminal 1997 papers from S. Abiteboul, “Querying semi-structured data,” in International Conference on Data Base Theory (ICDT), pp. 1-18, Delphi, Greece, 1997 (http://dbpubs.stanford.edu:8090/pub/1996-19) and P. Buneman, “Semistructured data,” in ACM Symposium on Principles of Database Systems (PODS), pp. 117-121, Tucson, Arizona, May 1997 (http://db.cis.upenn.edu/DL/97/Tutorial-Peter/tutorial-semi-pods.ps.gz). Of course, semi-structured data had existed prior to these early references, only it had not been named as such.

[6] P. Schäuble, “SPIDER: a multiuser information retrieval system for semistructured and dynamic data,” in Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318 – 327, 1993.

[7] Such protocols first received serious computer science study in the late 1970s and early 1980s. In the financial realm, one early standard was electronic data interchange (EDI). In science, there were literally tens of exchange forms proposed with varying degrees of acceptance, notably abstract syntax notation (ASN.1), TeX (a typesetting system created by Donald Knuth and its variants such as LaTeX), hierarchical data format (HDF), CDF (common data format), and the like, as well as commercial formats such as Postscript, PDF (portable document format), RTF (rich text format), and the like. One of these proposed standards was the “standard generalized markup language” (SGML), first published in 1986. SGML was flexible enough to represent either formatting or data exchange. However, with its flexibility came complexity. Only when two simpler forms arose, namely HTML (HyperText Markup Language) for describing Web pages and XML (eXtensible Markup Language) for data exchange, did variants of the SGML form emerge as widely used common standards. The XML standard was first published by the W3C in February 1998, rather after the semi-structured data term had achieved some impact (http://www.w3.org/XML/hist2002) Dan Suciu was the first to publish on the linkage of XML to semi-structured data in 1998 (D. Suciu, “Semistructured data and XML,” in International Conference on Foundations of Data Organization (FODO), Kobe, Japan, November 1998; see PDF option from http://citeseer.ist.psu.edu/suciu98semistructured.html), a reference that remains worth reading to this day.

[8] David Loshin, “Simple semi-structured data,” Business Intelligence Network, October 17, 2005. See http://www.b-eye-network.com/view/1761. This example is actually quite complex and demonstrates the challenges facing IE software. Extracted entities most often relate to the nouns or “things” within a document. Note also, for example, how many of the entities involve internal “co-referencing,” or the relation of subjects such as “he” or clock times such as “10 a.m” to specific dates. A good entity extraction engine helps resolve these so-called “within document co-references.”

[9] Example text-based data serializations and formats used on the Web include Atom, GData, JSON, N3, pickle, RDF/XML, RSS, Turtle, XML, YAML.

[10] Example mark-up languages used on the Web include HTML, Wikitext, XHTML, XML.

[11] It was called the httpRange-14 issue by virtue of the agenda label at the TAG’s meetings.

[12] Of course, nothing compels the publisher to provide these instructions, but they are integral to “best practices” and publishers desirous of attracting consumers have incentives to follow them.

Depending on the nature of the information resource, an HTTP GET returns a 200 OK status and sends the resource (for example a request for a Web page or an RDF file) if the request is for the correct type. If the request is for the wrong type, the publisher can include in the header a 303 (See Other) redirect response to send the requester to the appropriate information resource URL. If the request is for an unknown URI or resource, a variety of 4xx error responses may result. (See further [13].)

One advantage of posting structured data resources as RDF-linked data is that this redirect can be sent to a REST-style Web service that does a best-effort DESCRIBE of the resource using SPARQL. Since SPARQL is a query language, protocol, and results representation scheme, the redirect can come in the form of a URL that can directly query the structured data resource. In this manner, large structured data resource data sets can act as ‘endpoints’ for context-specific information linked anywhere on the Web.
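As a hypothetical illustration of such a redirect target (the endpoint and resource URIs below are made up), the 303 Location header can simply be a URL that asks a SPARQL endpoint to DESCRIBE the requested resource:

from urllib.parse import urlencode

endpoint = "http://www.example.com/sparql"       # hypothetical SPARQL endpoint
resource = "http://www.example.com/id/bigClass"  # hypothetical structured data resource

# The redirect target packages a DESCRIBE query against the endpoint
redirect_target = endpoint + "?" + urlencode({"query": "DESCRIBE <%s>" % resource})
print(redirect_target)
# http://www.example.com/sparql?query=DESCRIBE+%3Chttp%3A%2F%2Fwww.example.com%2Fid%2FbigClass%3E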

The TAG’s 2005 ‘compromise solution’ was reported by Roy Fielding, with his public notice reproduced in full:

<TAG type="RESOLVED">

That we provide advice to the community that they may mint
“http” URIs for any resource provided that they follow this
simple rule for the sake of removing ambiguity:
a) If an “http” resource responds to a GET request with a
2xx response, then the resource identified by that URI
is an information resource;
b) If an “http” resource responds to a GET request with a
303 (See Other) response, then the resource identified
by that URI could be any resource;
c) If an “http” resource responds to a GET request with a
4xx (error) response, then the nature of the resource
is unknown.

</TAG>

I believe that this solution enables people to name arbitrary
resources using the “http” namespace without any dependence on
fragment vs non-fragment URIs, while at the same time providing
a mechanism whereby information can be supplied via the 303
redirect without leading to ambiguous interpretation of such
information as being a representation of the resource (rather,
the redirection points to a different resource in the same way
as an external link from one resource to the other).

Note that point “a” discusses an “information resource,” with the contrasting treatment in point “b” as potentially “any resource”.

To my knowledge, the unfortunate term ‘non-information resource’ was first introduced to cover these point “b” conditions in a draft (that is, unofficial) TAG finding from two years later, on May 31 of this year, “Dereferencing HTTP URIs.” Besides its unfortunate continuation of the ‘dereferencing’ term (a discussion for another day), it introduces the even-worse ‘non-information resource’ terminology. That draft TAG finding talks in Sec 2 about ‘information resources’ (as does the 2005 TAG finding), and in Sec 3 about ‘other Web resources’ (or the “any” category from the 2005 notice). In Sec 4, however, the document switches from ‘other Web’ to ‘non-information’, which is then continued through the rest of the document.

It is not too late for the community to cease using this term; our replacement suggestion is ‘structured data resource.’

[14] You may also want to see Leo Sauermann, Richard Cyganiak, and Max Völkel, “Cool URIs for the Semantic Web,” or Dan Connolly, “A Pragmatic Theory of Reference for the Web.”

[15] Web documents are typically semi-structured with embedded tag, metadata, and presentation structure in such things as headers, tables, layouts, labels, dropdown lists, etc. For the unstructured text content, traditional information extraction techniques are applied. But for the semi-structure, various scraping or extraction techniques may either be crafted by hand or be semi-automatically or automatically applied through regular expression processing, pattern matching, inspection of the Web page DOM or other techniques. One posting in this structured Web series is to be devoted to this topic.

[16] There is an impressive and growing list of data conversion protocols and tools, most of which support multiple input and output forms, including RDF and various serializations of RDF. This list includes Virtuoso Sponger, GRDDL, Babel, RDFizers, general converters, Triplr, etc. One posting in this structured Web series is to be devoted to this topic.

Posted by AI3's author, Mike Bergman, on July 22, 2007 at 3:58 pm in Adaptive Information, Semantic Web, Structured Web.
The URI link reference to this post is: https://www.mkbergman.com/391/more-structure-more-terminology-and-hopefully-more-clarity/