Posted: August 3, 2009

Structure the World

Multiple Techniques and Data Structs can Make the Vision a Reality

Linked data and subject and domain ontologies provide the organizing framework. Techniques for converting, tagging and authoring structure provide the content. In combination, we now have in hand the necessary pieces to enable all of us to “structure the World.”

In this vision, the nature of the links or connections between data need not be complicated to gain tremendous benefit. Similar to Metcalfe’s Law for the increasing value of networks as more nodes (users) get added, adding connections to existing data is a powerful force multiplier.

We can call this the Linked Data Law: the value of a linked data network is proportional to the square of the number of links between data objects [1]. Further, if we purposefully include connective links where appropriate as we add more data (that is, nodes), this multiplier effect becomes even stronger.

Structured Dynamics is dedicated to helping make this prospect real. Meaningful progress in doing so requires only a relatively few moving parts or techniques. Yet, because we sometimes bounce from talking about or focusing on one part versus the others, we can lose context or sight of the overarching vision. The purpose of this article is to reset and calibrate that overall vision.

The Vision: Data Federation of Any Desired Content

The vision is to get all data and information to interoperate, regardless of legacy or form. Much of this data is already structured, either from databases or simpler forms of data structs. Some of this information is unstructured or semi-structured, requiring extraction and tagging techniques. And new information is being constantly generated, which warrants better means to author and stage for interchange and interoperability.

No matter the provenance, all information has context and scope. As a chunk from here, and a piece from there, gets added to our linked data mix, having means to characterize what that data is about and how it can be meaningfully inter-related becomes crucial. Sometimes these contexts are informed by existing schema; sometimes they are not. But, in any case, it is the role of ontologies to both position these datasets into an “aboutness” framework and to help guide how the data can be described and related to other data. This part of the vision invokes semantics and coherent structures (schema or ontologies) for positioning and mapping datasets to one another.

As both the means for representing any extant data format and as the means for describing these conceptual relationships or schema, RDF provides the canonical data model. A single target representation and common data model also means we can develop and design a smaller universe of tools to operate and provide functionality over all of this data. Indeed, because our RDF data model and its ontologies are so richly structured, we can design our tools with generic functionality, the specific operation and expression of which is based on the inherent structure within the data and its relationships. This vision of data-driven apps leads to extreme leverage, incredible flexibility, and inherent “meshup” capabilities for tools.

Further, because we use Web identifiers (URIs) for our data and concepts and because we expose and access this linked data via the Web, we use the proven and scalable architectures of the Web itself for how we design our systems. This Web-oriented architecture (WOA) provides a completely decentralized and loosely coupled deployment model that can work ranging from public and open to private and proprietary, applicable to data and participants alike.

From the outset, it is essential to recognize that thousands of contributors are enabling this vision. So, while Structured Dynamics naturally uses its own tools and techniques to flesh out the various parts of this vision below, realize there are many players and many tools from which to choose [2]. For that is another aspect of this vision that is quite powerful: providing choice and avoiding lock-in.

RDF: The Canonical Data Model

The core construct — or fulcrum, if you will — of the vision is the RDF (Resource Description Framework) data model [3]. I have written elsewhere on the Advantages and Myths of RDF, which explains more precisely the advantages of that model. RDF provides a common data model to which any external format or schema can be converted and represented. It also provides a logic model and basis for building vocabularies that can inform and drive generic tools.

In the context of data interoperability, a critical premise is that a single, canonical data model is highly desirable. Why?

Simply because of 2N v N². That is, a single reference (“canon”) structure means that fewer tool variants and converters need be developed to talk to the myriad of data formats in the wild. With a canonical data model, talking to external sources and formats (N) only requires converters to and from the canonical form (2N). Without a canonical model, the combinatorial explosion of required format converters becomes N² [4].
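The arithmetic behind this comparison can be checked with a short sketch. The function names are illustrative only. The pairwise count below uses N × (N − 1) directed converters, which grows on the order of N².

```python
def hub_converters(n_formats: int) -> int:
    """Converters needed with a canonical model: one to and one from the hub per format."""
    return 2 * n_formats


def pairwise_converters(n_formats: int) -> int:
    """Directed converters needed without a hub: every format to every other format."""
    return n_formats * (n_formats - 1)


if __name__ == "__main__":
    # Compare both counts as the number of external formats grows.
    for n in (3, 10, 150):
        print(n, hub_converters(n), pairwise_converters(n))
```

For three formats the two approaches cost the same (six converters each). The canonical form only pays off as the number of formats grows.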

Note, in general, such a canonical data model merely represents the agreed-upon internal representation. It need not affect data transfer formats. Indeed, in many cases, data systems employ quite different internal data models from what is used for data exchange. Many, in fact, have two or three favored flavors of data exchange such as XML, JSON or the like. More on this is discussed in a section below.

As this diagram shows, then, we have a single internal representation that is the target for all data and format converters and upon which all tools operate. These tools are themselves expressed as Web services so that they may be distributed and conform to general WOA guidelines. In addition, there may be multiple external “hubs” that represent alternative data models or formats or schema conversions (say, for relational databases). So long as we have converters between these alternate “hubs” and our canonical RDF form we can allow a thousand flowers to bloom:

Other canonical forms could be advocated. Yet RDF has the logical basis to represent any data form and any schema or conceptual structure. It is based on a robust set of open standards and languages and tools. It may be serialized in many formats. It can be grounded in description logics and, in appropriate forms, reasoned over and expressed in vocabularies and schema suitable for the most complex of conceptual structures and semantics. RDF is the data model explicitly designed for the Web, the clear global information basis for the foreseeable future.

For more than 30 years — since the widespread adoption of electronic information systems by enterprises — the Holy Grail has been complete, integrated access to all data. With the canonical RDF data model, that promise is now at hand.

Conversion: So Many Structs, So Little Time

Diversity is a truism of human communications as captured by the biblical Tower of Babel and the many thousands of current human languages. Diversity in data formats, serializations, notations and languages is a similar truism. We term the expression of each of these varied forms of data a struct.

While an internal canonical representation of data makes sense for the reasons noted above, pragmatic information systems must recognize the inherent diversity and chaos of data in the real world. Attempts to find single representations or to impose standards by fiat have singularly failed. That will continue to be so, due in part to inertia and legacy, sunk investments, existing infrastructure, and the differing purposes for the data.

In pursuing a vision of data interoperability, then, conversion is an essential glue for cementing understanding with what exists and will exist.

RDB-to-RDF

Arguably the largest source of structured data is enterprise and government information systems, with the predominant data representation being the relational data model managed by relational schema. Much of this data is also cleaner and more mission critical than other sources in the wild. Fortunately, there are many logical and conceptual affinities between the relational model and the one for RDF [5].

Just as there are many RDFizers for simpler forms of data structs (see next), there are also nice ways to convert relational schema to RDF automatically. Given these overall conceptual and logical affinities the W3C is also in the process of graduating an incubator group to an official work group, RDB2RDF, focused on methods and specifications for mapping relational schema to RDF.
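The row-to-triple mapping described in note [5] (subject from the primary key, predicate from the column identifier, object from the cell value) can be sketched in a few lines. This is a minimal illustration, not the D2RQ or RDB2RDF approach. The `http://example.org/` namespace, the `person` table and the function names are assumptions made up for the example.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace for minted identifiers


def table_to_triples(conn, table, primary_key):
    """Yield (subject, predicate, object) tuples for every non-null cell.

    The table and key names are interpolated into SQL, so they must come
    from trusted configuration, never from user input.
    """
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY {primary_key}")
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"{BASE}{table}/{record[primary_key]}"  # subject from primary key
        for column, value in record.items():
            if column == primary_key or value is None:
                continue  # NULL cells simply produce no triple
            yield (subject, f"{BASE}{table}#{column}", value)


def demo():
    """Build a tiny in-memory table and return its triples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    conn.execute("INSERT INTO person VALUES (1, 'Ada', 'London')")
    conn.execute("INSERT INTO person VALUES (2, 'Alan', NULL)")
    return list(table_to_triples(conn, "person", "id"))
```

Note that a missing value needs no placeholder on the RDF side: the triple is just absent.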

Amongst all techniques covered in this paper, Structured Dynamics views the layering of RDF ontologies over existing relational data stores as one of the most promising and important. Given the advantages of RDF for interoperability, this area should be a major emphasis of current and new vendors and service providers.

RDFizers

Much data, however, resides in much smaller datasets and often for less formal purposes than what is found in enterprise databases. Some of this data is geared for exchange or standardization; much is emerging from Web and Internet applications and uses; and much might be local or personal in nature, such as simple lists or spreadsheets.

RDF is well suited to convert (“RDFize”) these simpler and more naïve data formats. In my original census about 18 months ago, as reported in ‘Structs’: Naïve Data Formats and the ABox, I listed about 90 converters. My most recent update now lists about 150 converters, roughly two-thirds more [6]:

URN handlers (in addition to IRI and URI):

DOI
LSID
OAI

RDF

Serialization formats:
- N3
- RDF/XML
- Turtle
Languages and ontologies:
- AB Meta
- Annotea
- APML
- AtomOWL
- Bibliographic Ontology
- Creative Commons
- EXIF
- FOAF
- Java
- Javadoc
- MARC/MODS
- Meta Standards
- Music Ontology
- Natural Language
- Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
- Open Geospatial
- OWL
- SIOC
- SIOCT
- SKOS
- UMBEL
- vCard
- XML
- Others
(X)HTML pages
Embedded Microformats and GRDDL [7]:
- DC
- eRDF
- geoURL
- Google Base
- hAudio
- hCalendar
- hCard
- hListing
- hResume
- hReview
- HR-XML
- Ning
- RDFa
- relLicense
- SVG
- XBRL
- XFN
- xFolk
- XR-XML
- XSLT
Syndication Formats:
- Atom
- OPML
- OCS
- RSS 1.1
- RSS 2.0
- XBEL (for bookmarks)
REST-style Web service APIs:
- Amazon
- Apple
- Calais
- CrunchBase
- Del.icio.us
- Digg
- Discogs
- Disqus
- eBay
- Facebook
- Flickr
- Freebase (MQL)
- FriendFeed
- Garmin
- Get Satisfaction
- Google
- Hoover’s
- HTTP (raw)
- ISBN DB
- Last.fm
- Library Thing
- Magnolia
- Meetup
- MusicBrainz
- New York Times
- New York Times Campaign Finance (NYTCF)
- New York Times tags
- Open Library
- Open Social
- Open Street
- OpenLink (facets)
- O’Reilly
- Picasa
- Radio Pop (BBC)
- Rhapsody
- Salesforce
- Slideshare
- Slidy
- Technorati
- They Work For You
- Twine
- Twitter
- Weather
- Wikipedia
- World Bank
- Yahoo! Finance
- Yahoo! Maps
- Yahoo! Weather
- YouTube
- Zemanta
Files (multitude of file formats and MIME types, including):
- audio (general)
- BibJSON
- BibTEX and others
- BitTorrent
- CSV
- Fink
- Flat files
- JPEG
- JSON
- images
- MS Office
- OpenOffice
- Open Document Format
- Palm
- RDF123
- video
- XLS
- etc.

Metadata extractors:
- CRW
- DEB
- EXIF
- OCW
- RPM
- XMP
Email formats:
- EMail
- Outlook
- RFC822
Version control and related systems:
- Bugzilla
- Jira
- POM
- Subversion
Other Web service frameworks:
- BPEL
- WSDL
- XBRL
- XBEL
Data exchange formats:
- iCalendar
- LDIF
- vCalendar
- vCard
Relational databases and related:
- D2RQ
- D2RMAP
- RDF Views
Virtuoso VADs
OpenLink license files
Third party metadata extraction frameworks:
- Aperture
- Spotlight
Miscellaneous and other related converters:
- MPEG-7/CS → OWL
- Random
- XSD → OWL

Many of the sources above come from new and emerging Web-based APIs, which are also huge sources of content growth. Also note that alternative formats to RDF (e.g., microformats) or leading serializations and encodings (e.g., XML, JSON) also have many converter options.

For many typical naïve data structs, the data is represented as attribute-value pairs, which easily lend themselves to conversion to RDF as instance records [8]. See further the Authoring section below.
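As a rough illustration of why attribute-value pairs convert so readily, the sketch below turns one record into N-Triples lines. It is a simplification under stated assumptions: the `http://example.org/` namespace and the function names are hypothetical, every value is emitted as a plain string literal, and attribute names are assumed to be safe to embed in an IRI.

```python
BASE = "http://example.org/"  # hypothetical namespace


def escape_literal(text):
    """Escape the characters that N-Triples requires escaping in a literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def record_to_ntriples(record_id, attributes):
    """Serialize one attribute-value record as a list of N-Triples lines."""
    subject = f"<{BASE}record/{record_id}>"
    lines = []
    for attribute, value in attributes.items():
        # A list value becomes one triple per item.
        values = value if isinstance(value, list) else [value]
        for item in values:
            literal = '"' + escape_literal(str(item)) + '"'
            lines.append(f"{subject} <{BASE}attr/{attribute}> {literal} .")
    return lines
```

For example, `record_to_ntriples("42", {"name": "Sweet Tools", "tag": ["rdf", "owl"]})` yields three lines: one for the name and one for each tag.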

Tagging: The 80% Solution

An apocryphal statistic is that 80% to 85% of all information resides in unstructured text [9]. Besides lacking recent validation, this claim from a decade ago, often attributed to Merrill Lynch, also precedes much of the Internet and the emergence of metadata and tagging. Nevertheless, what is true is that written text content is ubiquitous and the majority of it remains untagged or uncharacterized by any form of metadata.

While such information can be searched, a search returns results only when its terms match the text exactly. This means that related information, particularly in the form of conceptual relationships and inferencing, cannot be applied to untagged text content.

While information extraction (the basis by which tags for entities and concepts can be obtained) has been an active topic of research for two decades, it is only recently that we have begun to see Web-scale extractors appear. Examples include Yahoo’s term extractor, Thomson Reuters’ Calais, or Google’s Squared, to name but a few.

In Structured Dynamics’ case we have been working on the scones (Subject Concepts Or Named EntitieS) extractor for quite a while. scones uses rather simple natural language processing (NLP) methods as informed by concept ontologies and named entity (instance record) dictionaries to help guide the extraction process. The co-occurrence of matches between concepts and entities also aids the disambiguation task (though additional modules may be invoked with alternative disambiguation methods). In prototype forms, the resulting tags can be managed separately or fed to user interfaces or re-injected back into the original content as RDFa.

There are literally dozens of such extractors and services presently available on the Web and many that are available as open source or commercial products. Some are mostly algorithm based using machine-learning techniques or statistics, while others are gazetteer- or dictionary-driven.
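To make the dictionary-driven style concrete, here is a minimal sketch of a gazetteer tagger. It is not the scones implementation, whose internals are not described here. The two dictionaries, their identifiers and the function name are assumptions for illustration, and no disambiguation is attempted.

```python
import re

# Hypothetical dictionaries; a real system would load these from concept
# ontologies and named entity (instance record) dictionaries.
CONCEPTS = {"ontology": "concept/Ontology", "search engine": "concept/SearchEngine"}
ENTITIES = {"virtuoso": "entity/Virtuoso", "solr": "entity/Solr"}


def tag_text(text, concepts=CONCEPTS, entities=ENTITIES):
    """Return (start, matched_text, kind, identifier) tuples sorted by position."""
    hits = []
    for kind, lexicon in (("concept", concepts), ("entity", entities)):
        for term, identifier in lexicon.items():
            # Whole-word, case-insensitive match of each dictionary term.
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                hits.append((match.start(), match.group(0), kind, identifier))
    return sorted(hits)
```

The resulting tuples carry character offsets, so the tags could be kept separately from the text or used to re-inject markup at the matched positions.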

These systems will lead to rapid tagging of existing content and the removal of some of the early “chicken-and-egg” challenges associated with the semantic Web. These systems will also be combined with the many existing bookmarking and tagging services.

So, just as we will see federation and interoperability of conventional data, we will also see linkages to relevant and supporting text content accompanying it. This combination, in turn, will also lead to richer browsing and discovery experiences.

Authoring: The Neglected Third Leg of the Stool

In addition to conversion and tagging, authoring is the third leg of the stool for exposing structured data. It is a neglected leg, yet an important one for making it easier to expose datasets as RDF linked data.

One of the reasons for the proliferation of data structs has been the interest in finding notations and conventions for easier reading and authoring of small datasets. There have literally been hundreds of various formats proposed over decades for conveying lightweight data structures. Most have been proprietary or limited to specific domains or users. Some, such as fielded text, structured text, simple declarative language (SDL), or more recently YAML or its simpler cousin JSON, have become more widely adopted and supported by formal specifications, tools or APIs. JSON, especially, is a preferred form for Web 2.0 applications.

What has been less clear or intuitive in these forms, again mostly based on an attribute-value pair orientation, is how to adequately relate them to a more capable data model, such as RDF. In JSON or YAML, for example, the notations include the concepts of objects, arrays and datatypes (among other conventions). Other structures lack even these constructs.

To take the case of JSON as might be related to RDF, there are a couple of efforts to define representation conventions from Talis and GBV for serializing RDF. There was a floated idea for an RDF version of JSON called RDFON that has now evolved into the TURF approach. JDIL (JSON data integration layer) instructs how to add namespaces to JSON to enable encoding RDF. Jim Ley, Kanzaki Masahide and Dave Beckett (likely among others) have written simple and straightforward RDF and Turtle parsers and converters for JSON. And, still further examples are Beckett’s Triplr and Sören Auer’s AKSW Triplify, lightweight conversion services involving many different formats.

Because JSON is easily readable, can drive many Web 2.0 applications and widgets, and lends itself to fast conversions and tools in various scripting languages, Structured Dynamics was commissioned by the Bibliographic Knowledge Network (BKN) to formalize a BibJSON specification suitable for BibTeX-like data records and citations with an extensible schema to be converted to RDF.

The emerging result of that BibJSON effort will be published shortly. The specification includes conventions and vocabularies for creating bibliographic and citation instance records, for specifying structural schema, and for creating linkage files between the attributes in the record files and existing and new schema. BibJSON is itself grounded in IRON, which is an instance record and object notation developed by Structured Dynamics that can be serialized as JSON (called irJSON), XML (called irXML) or comma-separated values (CSV, or comma-delimited files, called commON).

The purpose of these notations and serializations is to provide easier authoring environments and scripting support to RDF-ready datasets. This approach has the advantage of shielding most users from the nuances or lengthiness of RDF (though the N3 serialization also works well).

The design and development of commON was especially geared to using spreadsheets as authoring environments that would enable easy creation of instance record tables or simple hierarchical or outline structures. For example, here is a sample portion of Sweet Tools specified in a spreadsheet using the commON notation:

Once the philosophy and role of naïve data structs is embraced — with an appreciation of the many converters now available or easily written for translating to RDF — it becomes easier to determine data forms appropriate to the tools and natural work flow of the users and tasks at hand. Under this mindset, the role of RDF is to be the eventual conversion target, but not necessarily what is used for intermediate work tasks, and in particular not for authoring.

Getting it All Organized

OK, so now all of this stuff is converted, tagged or authored. How does it relate? What is the relation of one dataset to another dataset? Is there a context or framework for laying out these conceptual roadmaps?

Two years ago, as we looked at the state of RDF and the incipient semantic Web as promised via linked data, we saw that such a specific framework was lacking. (Though there were existing higher-level ontologies, either their complexity or their design was not well suited to these purposes.) It was at that time that Frédérick Giasson and I began to formulate the UMBEL (Upper Mapping and Binding Exchange Layer) ontology, which eventually led to our more formal business partnership and Structured Dynamics.

What we sought to achieve with UMBEL was a coherent reference framework of about 20,000 subject concepts, connected and acting like constellations in the information sky for orienting content and new datasets. At the same time, we wanted to create a general vocabulary and approach that would lend themselves to creation of domain-specific ontologies, which would also naturally tie in and inter-relate to the more general UMBEL structure.

This objective was achieved, though UMBEL deserves an upgrade to OWL 2 and some other pending improvements. A number of domain ontologies have been created and now relate to UMBEL. So, rather than being an end to itself, UMBEL was one of the necessary infrastructure pieces to help make the vision herein a reality.

Similar approaches may be taken by others with new domain ontologies based on the UMBEL vocabulary with tie-in as appropriate to existing subject concepts, or by mapping to the existing UMBEL structure.

Of course, UMBEL is not an absolute condition to the vision herein. However, insofar as users desire to see multiple datasets inter-related, including the use of existing public Web data, something akin to UMBEL and related domain ontologies will be necessary to provide a similar roadmap.

Making it All Available

The parts and techniques discussed so far pertain almost exclusively to data and content. But, these structures so created now can inform data-driven applications which also now must be deployed. To do so, Structured Dynamics is committed to what is known as a Web-oriented architecture (WOA):

WOA = SOA + WWW + REST

WOA is a subset of the service-oriented architectural style, wherein discrete functions are packaged into modular and shareable elements (“services”) that are made available in a distributed and loosely coupled manner. WOA generally uses the representational state transfer (REST) architectural style defined by Roy Fielding in his 2000 doctoral thesis; Fielding is also one of the principal authors of the Hypertext Transfer Protocol (HTTP) specification.

REST provides principles for how resources are defined and used and addressed with simple interfaces without additional messaging layers such as SOAP or RPC. The principles are couched within the framework of a generalized architectural style and are not limited to the Web, though they are a foundation to it.

Within this design we need a suite of generic functions and tools that are driven by the structure of the available datasets. The deployment vehicle and design we have implemented to provide this WOA design is structWSF [10].

structWSF is a platform-independent Web services framework for accessing and exposing structured RDF data. Its central organizing perspective is that of the dataset. These datasets contain instance records, with the structural relationships amongst the data and their attributes and concepts defined via ontologies (schema with accompanying vocabularies). The master or controlling Web service in the framework is the module for granting access and use rights to datasets based on permissions.

The structWSF middleware framework is generally RESTful in design and is based on HTTP and Web protocols and open standards. The initial structWSF framework comes packaged with a baseline set of about a dozen Web services in CRUD, browse, search and export and import. More services can readily be added to the system.

All Web services are exposed via APIs and SPARQL endpoints. Each request to an individual Web service returns an HTTP status and a document of resultsets (if the query result is not null). Each results document can be serialized in many ways, and may be expressed as either RDF or pure XML.
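As an illustration of what a client call to such a SPARQL endpoint might look like, the sketch below builds (but does not send) a request following the standard SPARQL Protocol convention of passing the query in a `query` parameter. The endpoint URL and function name are hypothetical. The actual structWSF paths and parameters are not described in this article and may differ.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://example.org/ws/sparql/"  # hypothetical endpoint URL


def build_sparql_request(query, endpoint=ENDPOINT):
    """Build, without sending, a SPARQL Protocol GET request for the query."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for SPARQL XML results; an RDF serialization could be requested instead.
    return Request(url, headers={"Accept": "application/sparql-results+xml"})


if __name__ == "__main__":
    request = build_sparql_request("SELECT * WHERE { ?s ?p ?o } LIMIT 5")
    print(request.get_method(), request.full_url)
```

A caller would pass the request to `urllib.request.urlopen`, then check the HTTP status before parsing the returned resultset document.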

In initial release, structWSF has direct interfaces to the Virtuoso RDF triple store (via ODBC, and later HTTP) and the Solr faceted, full-text search engine (via HTTP). However, structWSF has been designed to be fully platform-independent. The framework is open source (Apache 2 license) and designed for extensibility.

No End in Sight

As with all visions, there are many aspects to this one and many improvements possible. This vision is definitely a work-in-progress with no end in sight.

But, meaningful movement embracing the full scope of this vision is doable today. Structured Dynamics welcomes inquiries regarding any of these aspects, improvements to them, or application to your specific needs and problems.

We also welcome you to come back and visit our blogs (Fred’s is found here). We try to speak on various aspects of this vision in all of our posts and are pleased to share our experience and insights as gained.

[1] Metcalfe’s law states that the value of a telecommunications network is proportional to the square of the number of users of the system (n²), where the linkages between users (nodes) exist by definition. For information bases, the data objects are the nodes. Linked data works to add the connections between the nodes. We can thus modify the original sense to become the Linked Data Law: the value of a linked data network is proportional to the square of the number of links between the data objects. I first presented this formulation about a year ago in What is Linked Data?

[2] This piece introduces for the first time a couple of efforts-in-progress by Structured Dynamics. For a general tools listing, see my own Sweet Tools listing of about 800 semantic Web and -related tools.

[3] As quoted in The Lever, “Archimedes, however, in writing to King Hiero, whose friend and near relation he was, had stated that given the force, any given weight might be moved, and even boasted, we are told, relying on the strength of demonstration, that if there were another earth, by going into it he could remove this.” from Plutarch (c. 45-120 AD) in the Life of Marcellus, as translated by John Dryden (1631-1700).

[4] The canonical data model is especially prevalent in enterprise application integration. An interesting animated visualization of the canonical data model may be found at: http://soa-eda.blogspot.com/2008/03/canonical-data-model-visualized.html.

[5] For an excellent piece on those relations, see Andrew Newman, 2007. “A Relational View of the Semantic Web,” published on XML.com, March 14, 2007; http://www.xml.com/pub/a/2007/03/14/a-relational-view-of-the-semantic-web.html. RDF can be modeled relationally as a single table with three columns corresponding to the subject–predicate–object triple. Conversely, a relational table can be modeled in RDF with the subject IRI derived from the primary key or a blank node; the predicate from the column identifier; and the object from the cell value. Because of these affinities, it is also possible to store RDF data models in existing relational databases. (In fact, most RDF “triple stores” are RDBMSs with a tweak, sometimes as “quad stores” where the fourth element of the tuple is the graph.) Moreover, these affinities also mean that RDF stored in this manner can also take advantage of the historical learnings around RDBMS and SQL query optimizations.

[6] The largest source for RDFizers, which it calls Sponger cartridges, is from OpenLink Software in relation to its Virtuoso universal server. Most of its converters use XSLT stylesheets to translate to RDF, but the system has other conversion capabilities as well. Two additional OpenLink resources are a clickable diagram of converters and relationships with links and an online storehouse of available XSLT converters. In addition, two other sources — the W3C’s Semantic Web wiki with converter listings and MIT’s Simile program and listing of RDFizers — have a rich set of listings. Note that many of the categories shown on the table also have multiple sources of converters, so that the absolute number of converters has also grown faster than the unique formats supported.

[7] GRDDL (Gleaning Resource Descriptions from Dialects of Languages) is a W3C markup format for getting RDF data out of XML and XHTML documents using explicitly associated transformation algorithms, typically represented in XSLT. GRDDL accommodates a wide variety of dialects (see one listing) and can be combined with arbitrary transformation mechanisms (though currently mostly based on XSLTs).

[8] We characterize instance records as representing the “ABox”, in accordance with our working definition for description logics:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”

[9] One of the more recent discussions of this percentage is by Seth Grimes, Unstructured Data and the 80 Percent Rule, 2009.

[10] structWSF is also designed to integrate with third-party apps and content management systems (CMSs) to provide the user interfaces to these functions. The first implementation of this design is conStruct SCS, a structured content system that extends the basic Drupal content management framework. conStruct enables structured data and its controlling vocabularies (ontologies) to drive applications and user interfaces.

Posted: July 9, 2009

Five New Web Services Added to structWSF

Due to Fred Giasson’s great work and the support of the BKN (Bibliographic Knowledge Network) project, five new Web services and associated documentation have been added to structWSF. The five Web services are:

Search
Browse
SPARQL
Converter: BibTeX (import and export)
Converter: TSV/CSV (import and export).

Thanks, Fred, for more amazing, clean work!

The code for these will be posted soon to the structWSF SVN on Google Code. UPDATE: The code has now been posted.

Posted: July 2, 2009

structWSF: A Framework for Collaboration Networks

An Innovative, Distributed, Scalable Design with Dataset Access Rights

structWSF is a platform-independent Web services framework for accessing and exposing structured RDF data. Its central organizing perspective is that of the dataset. These datasets contain instance records, with the structural relationships amongst the data and their attributes and concepts defined via separate ontologies (schema with accompanying vocabularies).

The structWSF middleware framework is generally RESTful in design and is based on HTTP and Web protocols and open standards, conforming to what is known as a Web-oriented architecture. The initial structWSF framework comes packaged with a baseline set of about a dozen Web services in CRUD, browse, search and export and import. All Web services are exposed via APIs and SPARQL endpoints. It also has direct interfaces to the Virtuoso RDF triple store and the Solr faceted, full-text search engine.

This post follows the release of the alpha version of the open source structWSF code on the OpenStructs Web site. It is available for download under Apache 2 license.

But, Wait! There’s More!

These baseline capabilities are useful enough. But there is another foundation to structWSF that is quite innovative and exciting: Its explicit design to support collaboration networks. It is this aspect that is the focus of this current article.

The collaboration design is a result of the needs of the Bibliographic Knowledge Network (BibKN or BKN) [1]. BibKN has as one of its express purposes creating a network of collaborators in math and statistics, ranging from the individual researcher to departments and universities and various virtual organizations (VOs) representing different communities of interest. Moreover, this nucleus of researchers also has external collaborators ranging from major publishers to software and service providers of various sizes from around the globe.

Thus, one key requirement of the BKN project was to design an infrastructure responsive to this broad spectrum of interests, locations and organizations. And, besides questions of varying scale, locale and distribution, there was also the need to combine public and private data. In some cases, initial work products need to be kept within its sponsoring groups before being made public. Sometimes external publishers want to segregate network members by whether they are already paid subscribers or not. And, most importantly, the project had a mandate to create an easy and open framework for encouraging incipient collaborators and curators to add and take ownership of new datasets.

Boiled down, these requirements represent a completely fluid spectrum of scales, access rights, virtual groups and distributed locations. Establishing a workable and responsive framework to meet them was daunting indeed. But what has resulted from this mandate, structWSF, is a generalized solution that has applicability to collaboration within any knowledge network.

Four Exemplar Deployment Modes

BibKN anticipates four exemplar types of participants on the network (or “nodes”, which are not to be confused with the different meaning of node in Drupal):

VO Nodes — these “virtual organization” nodes are the main collaboration portals and are generally based on a content management system (CMS) [2]. VO nodes may also be publishers or consumers of datasets
Gateways — connections to existing external content in the native data formats of the publisher, which are converted and made available to the network in BKN-compliant forms [3]
Hubs — aggregate suppliers of useful datasets in BKN-compliant data formats, most often BibJSON [4], and
Individual dataset contributors and clients, generally located on a desktop machine.

Each of these nodes exposes its data to the rest of the network via a structWSF Web services framework. Each structWSF installation provides an access point and endpoint to the network. Through these installations, data is converted to “canonical” form for use by other nodes on the network with common tools and services provided.

In conceptual form, then, the network can be represented as follows:

Each node has a structWSF instance, the common network denominator, shown in blue.

A key aspect of each structWSF installation is dataset registration and access authorization. Only users with proper authorization may access a given dataset or exercise privileges such as write or update on it.

The other core Web services provided with structWSF are the CRUD functional services (create – read – update – delete), import and export, browse and search, and a basic templating system [see (3) in the next figure]. These are viewed as core services for any structured dataset. The current alpha release supports CSV, TSV, RDF/XML, RDF/N3 and XML, with JSON forthcoming shortly. (UPDATE: Now provided.)

Rights: The Intersection of Web Service, Dataset, Group, Role and CRUD

The controlling Web service in structWSF is the Authentication/Registration WS [see (2) in the figure below]. The current alpha version of structWSF uses registered IP addresses as the basis to grant access and privileges to datasets and functional Web services. Later versions will be expanded to include other authentication methods such as OpenID, keys (a la Amazon EC2), FOAF+SSL or OAuth. A secure channel (HTTPS, SSH) could also be included.

A simple but elegant system guides access and use rights. First, every Web service is characterized as to whether it supports one or more of the CRUD actions. Second, each user is characterized as to whether they first have access rights to a dataset and, if they do, which of the CRUD permissions they have [see (4, 5)]. We can thus characterize the access and use protocol simply as A + CRUD.

Thereafter, a mapping of dataset access and CRUD rights (see below) determines whether users see a given dataset and what Web services (“tools”) are presented to them and how they might manipulate that data. When expressed in standard user interfaces this leads to a simple contextual display of datasets and tools. For example, under standard search or browse activities the user would only see results sets drawn from the datasets for which they have access. Similarly, users only see the tools that their CRUD rights allow.
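The A + CRUD idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the structWSF code: the `WebService` and `Grant` classes, the service names, the dataset name and the IP addresses are all hypothetical, and the rule that a user must hold every CRUD right a tool requires is our simplifying assumption.

```python
from dataclasses import dataclass

CRUD = frozenset("CRUD")  # create, read, update, delete


@dataclass(frozen=True)
class WebService:
    """A tool, tagged with the CRUD actions it performs (hypothetical model)."""
    name: str
    actions: frozenset


@dataclass(frozen=True)
class Grant:
    """Access (A) to one dataset for one requester, plus the CRUD rights held."""
    requester: str   # e.g. a registered IP address, as in the alpha release
    dataset: str
    rights: frozenset


def allowed(grants, requester, dataset, service):
    """A + CRUD: the requester needs access to the dataset and must hold
    every CRUD right the service requires (an assumption of this sketch)."""
    for g in grants:
        if g.requester == requester and g.dataset == dataset:
            return service.actions <= g.rights
    return False  # no access record at all: the dataset stays invisible


def visible_tools(grants, requester, dataset, services):
    """Names of the tools a user interface would present for this dataset."""
    return [s.name for s in services if allowed(grants, requester, dataset, s)]


SERVICES = [
    WebService("search", frozenset("R")),
    WebService("browse", frozenset("R")),
    WebService("crud_create", frozenset("C")),
    WebService("crud_update", frozenset("U")),
    WebService("crud_delete", frozenset("D")),
]

GRANTS = [
    Grant("192.0.2.10", "math-preprints", CRUD),
    Grant("192.0.2.20", "math-preprints", frozenset("R")),
]

for ip in ("192.0.2.10", "192.0.2.20", "192.0.2.30"):
    print(ip, visible_tools(GRANTS, ip, "math-preprints", SERVICES))
```

The first requester sees all five tools, the second only search and browse, and the third, having no access record, sees nothing at all.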

At the Web service layer, these access values are part of the GET request. The system, however, is designed to more often be driven by user and group management at the CMS level via a lightweight plug-in or module layer.

Because a CMS may employ its own access system and protocols, the potential combinations can become quite large. Let’s take for an example a VO node in the BibKN scenario which layers Drupal (via the conStruct modules) over the structWSF framework. By including the additional third-party contributed Drupal module of Organic Groups, we also now add an entire dimension of group access to the standard roles access in the base Drupal [5]. So, in this scenario, we theoretically have these potential access and rights combinations:

By dataset
By Web service (tool) and whether that tool can potentially support create, read (access), update or delete [CRUD] operations
By user role (for example, administrator, owner, curator, contributor, unregistered)
By group (for example, SuperWhizBangs, SortOfOKs, Clueless, RockStars).

Since the group and user role categories can be quite extensive, the combinatorial result of these options can also be quite large.

Nonetheless, as a general proposition, these access and rights dimensions can capture most any reasonable use case.
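To get a feel for how quickly this space grows, here is a small Python sketch that enumerates every dataset, tool, role and group combination for a deliberately tiny node. The dataset names and the list of tools are made up for illustration; the roles and groups are the examples given above.

```python
from itertools import product

# Hypothetical inventory for a small VO node.
datasets = ["dataset-a", "dataset-b", "dataset-c"]
services = ["create", "read", "update", "delete",
            "browse", "search", "import", "export"]
roles = ["administrator", "owner", "curator", "contributor", "unregistered"]
groups = ["SuperWhizBangs", "SortOfOKs", "Clueless", "RockStars"]

# Each cell is one yes/no rights decision someone would have to make.
cells = list(product(datasets, services, roles, groups))
print(len(cells))  # 3 * 8 * 5 * 4 = 480
```

Even three datasets and four groups yield 480 individual decisions, which is why some way of bundling them is needed.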

Patterned Profiles Aid Management

One way to ease the management of these choices at the UI level is to create a series of access patterns or templates, called profiles, to which a newly registered dataset can be assigned. While the Drupal site owner could go in and change or tweak any of the individual assignments, the use of such profiles simplifies the steps needed for the majority of newly registered datasets (Pareto assumption).

For instance, consider these possible profile patterns:

Profile: Public (standard) — this profile is for a dataset intended for broad public access
Profile: Registered — this profile is for datasets that are limited to registered users of a VO node (possibly as a way to prevent spam or to encourage membership or participation)
Profile: Curated — this profile is where a specific group or groups (which themselves can be flexibly determined and assigned) has curation rights for the dataset, or
Profile: Internal — this profile is for internal (private) datasets where only a specific group or groups may access or modify. In some instances, an internal dataset might be the profile type while the dataset is under development, with the profile shifting to a broader access category once completed.

We can now expand this concept for a given dataset by adding the dimension of user type or category. Four categories of users can illustrate this user dimension:

O = Owner (the original registrar of the dataset; often possibly the “owner” of the VO node, but not necessarily so)
G = Group member (a registered user who is a member of a specific group)
R = Registered user (an authorized VO node user with a Drupal login and password)
P = Public (anonymous user)

(Of course, with a multitude of groups, there are potentially many more than four categories of users.)

To illustrate how we can collapse this combinatorial space into something more manageable, let’s look at how one of the profile cases noted above, the Public profile, can be expressed as a pattern or template. In this example, the Public profile means that owners and some groups may curate the data, but everyone can see and access the data. Also note that export is a special case, which could warrant a sub-profile.

We also need to relate this Public profile to a specific dataset. For this dataset, we can characterize our “possible” assignments as described above as to whether a specific user category (O, G, R and P as noted above) has available a given function (open dot), gets permission rights to that function by virtue of the assigned profile (solid dot), or whether that function may also be limited to a specific group or groups (half-filled dot) or not.

Thus, we can now see this example profile matrix for the Public profile for an example dataset with respect to the available structWSF Web services:

Note, of course, that these options and categories and assignments are purely arbitrary for our illustrative discussion. Your own needs and circumstances may vary wildly from this example.
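In code, a profile is little more than a lookup table of defaults. The Python sketch below is one hypothetical way to express it: the specific rights given to each user category are our own illustrative guesses (only the Public rule that everyone can read while owners and groups curate comes from the text), and `register_dataset` and `can` are invented helper names, not structWSF or conStruct functions.

```python
# Hypothetical profile templates: user category -> default CRUD rights.
# O = Owner, G = Group member, R = Registered user, P = Public (anonymous).
PROFILES = {
    "public":     {"O": "CRUD", "G": "CRU",  "R": "R", "P": "R"},
    "registered": {"O": "CRUD", "G": "CRU",  "R": "R", "P": ""},
    "internal":   {"O": "CRUD", "G": "CRUD", "R": "",  "P": ""},
}


def register_dataset(name, profile, overrides=None):
    """Return the rights table for a newly registered dataset.

    The profile supplies the defaults; `overrides` tweaks individual
    categories, mirroring the per-assignment edits a site owner could make.
    """
    rights = dict(PROFILES[profile])  # copy so the template stays untouched
    rights.update(overrides or {})
    return {"dataset": name, "profile": profile, "rights": rights}


def can(entry, category, action):
    """True if the user category holds the given CRUD action on the dataset."""
    return action in entry["rights"].get(category, "")


# A dataset starts out internal, then moves to the public profile on release,
# with one tweak: group members may also delete.
draft = register_dataset("working-papers", "internal")
released = register_dataset("working-papers", "public", overrides={"G": "CRUD"})

print(can(draft, "P", "R"))
print(can(released, "P", "R"))
print(can(released, "P", "U"))
print(can(released, "G", "D"))
```

Registering a dataset then takes one choice (the profile) rather than dozens, while the override argument keeps the individual assignments adjustable.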

Matrices such as this seem complex, but that is exactly why profiles are useful: they collapse the potential assignments into a manageable number of discrete options. The relevant task, and it is a quick one, is to assemble profiles responsive to your own specific circumstances.

And, of course, if your pre-packaged profiles need to be tweaked or adjusted for a particular circumstance, the CMS enables all assignments to be accessed in individual detail.

A Powerful Vision

Via this design, knowledge and collaboration networks can be deployed that support an unlimited number of configurations and options, all in a scalable, Web-accessible manner. The data that is accessed is automatically expressed as linked data. This same framework can be layered over in situ existing data assets to provide data federation and interoperable functionality, all responsive to standard enterprise concerns regarding data access, rights and permissions.

This is not science fiction, and this is not complex. When combined with its data mixing and conversion potentials [3], we can now see emerging a general framework that enables access and interoperability to virtually any data source and for virtually any purpose, with permissions and rights built in, anywhere and everywhere across the Web.

These are exciting prospects that were not possible until Web-oriented architectures with structured RDF data came to the fore. There are no longer any barriers to the powerful vision of complete data access and interoperability without disrupting existing assets.

And the mere thought of that is disruptive, indeed.

Note: The alpha version of structWSF and its related conStruct modules are somewhat raw or incomplete in some ways. A few of the functions expressed in this posting have not yet been released in these code bases.

[1] BibKN is a project to develop a suite of tools and services to encourage formation of virtual organizations in scientific communities of various types. The project started in September 2008 with funding by the NSF Cyber-enabled Discovery and Innovation (CDI) Program. The major participating organizations are the American Institute of Mathematics (AIM), Harvard University, Stanford University and the University of California, Berkeley. Research support to BibKN has come in part from NSF Award 0835851.

[2] structWSF is actually combined with the conStruct structured content system and Drupal for the delivery of the VO nodes.

[3] See the earlier posting, structWSF: A Framework for Data Mixing, for discussion about structWSF data formats.

[4] BibJSON is the standard, human-readable and editable data exchange format used within the BKN project. It has a standard attribute vocabulary geared to bibliographic material and is based on the JSON (JavaScript Object Notation) data notation.

[5] Though the specifics may differ, including the modules and add-ins, other leading CMS systems provide similar functionality.

Posted:June 30, 2009

structWSF: A Framework for Data Mixing

Interoperable Naïve Data Structs, Datasets and Canonical RDF

As I noted in my review of SemTech 2009, one of the key themes of the conference was data federation. Unfortunately, data federation has been a term a bit out of vogue for a while. (Though I still think it best captures the space.)

The current vernacular has been pushing forward an alternative: data mixing. One of the larger product pushes at the conference was by Zepheira for its new Freemix service and product. Freemix is a hosted service largely built around the Exhibit data display application, aided by some tools to make creating an exhibit easier. Exhibit is an attractive presentation system; for nearly three years AI3’s own Sweet Tools dataset listing of semantic Web and related tools has been presented via Exhibit.

Freemix looks promising and is now being offered in beta. But one thing caught my ear when listening to the company’s announcement: they are not yet able and ready to show the “data mixing” part of the system. Its release is apparently being delayed until later this year because of the difficulties encountered.

This post coincides with the release of the alpha version of the structWSF code on the OpenStructs Web site. It is available for download under the Apache 2 license.

We’ll be blogging a few more times in the coming days regarding other possible uses and applications for this platform-independent Web services framework.

What is Data Mixing and Why is it So Hard?

As a new term there is no “official” definition of data mixing. However, I think we can consider it as generally equivalent to the older data federation concept.

Data federation is the bringing together of data from heterogeneous and often physically distributed data sources into a single, coherent view. Sometimes this is the result of searching across multiple sources, in which case it is called federated search. But it is not limited to search. Data federation is a key concept in business intelligence and data warehousing and a driver behind master data management (MDM).

As I first wrote about data federation about five years ago [1]:

Data federation first became a research emphasis within the biology and computer science communities in the 1980s. At that time, extreme diversity in physical hardware, operating systems, databases, software and immature networking protocols hampered the sharing of data.

Yet it is easy to overlook the massive strides in overcoming these obstacles in the past two decades.

The Internet and its TCP/IP and Web HTTP protocols and XML standards in particular, have been major contributors to overcoming respective physical and syntactical and data exchange heterogeneities. The current challenge is to resolve differences in meaning, or semantics, between disparate data sources. Your “glad” may be someone else’s “happy” and you may organize the world into countries while others organize by regions or cultures.

Resolving semantic heterogeneities is also called semantic mediation or data mediation. Though it displays as a small portion of the pyramid above, resolving semantics is a complicated task and may involve structural conflicts (such as naming, generalization, aggregation), domain conflicts (such as schemas or units), or data conflicts (such as synonyms or missing values). Researchers have identified nearly 40 distinct types of possible semantic heterogeneities [2].

Ontologies provide a means to define and describe these different worldviews. Referentially integral languages such as RDF (Resource Description Framework) and its schema implementation (RDF-S) or the Web ontological description language OWL are leading standards among other emerging ones for machine-readable means to communicate the semantics of data.

Fortunately, we have climbed most of this data federation pyramid. The stumbling block now is the semantics. This is made all the harder when we place too much burden on the data transmission or “packet” itself. In other words, does exchange also carry with it the burden of meaning? The rest of this post tries to explain what I mean by this and how it relates to our new structWSF Web services framework.

Is it Apples or Oranges?

Not to pick on any one thing or any individuals, but three recent threads on semantic Web-related mailing lists help illustrate in various ways some interesting mindsets. While there is much on each of these threads of other value, I’m only focusing on a narrow topic from each based on my thesis at hand.

And, what is that thesis? It is simply that we too often mix instance record and attribute assertions with schema representations and world views. And, when we do, we sometimes make mountains out of molehills (or mix apples and oranges to completely mix metaphors).

Example 1: Squeezing RDF into JSON

JSON (JavaScript Object Notation) is a data notation or syntax, easily created and widely used for current Web apps. It has a rather simple syntax for representing attribute-value pairs. Many useful tools and parsers for the serialization exist.

In keeping with his general and broad criticisms of how the semantic Web standards and approaches have been promulgated by the W3C to date, John Sowa most recently expressed his ideas in a posting to the ontolog-forum mailing list under the heading of ‘Semantic Systems’ [3]. In this thread, John proposes:

1. The recommended exchange form for RDF will become JSON. Any JSON documents that are limited to triples can use the old XML-based RDF form, but they can also use the more compact and more general full JSON.

Then, in a subsequent posting to that thread he notes:

5. The W3C made a major blunder with a one-size-fits-all approach that tried to use a document tagging language as a knowledge representation language. The result was the *worst* notation for logic ever invented.

Finally, he goes on to note in a further post:

JSON could be used as an alternative to XML for the syntax, but the lack of a standard semantics for JSON means that it could *not* be used as a replacement for RDF *unless* an official standard were adopted for mapping RDF to and from a particular subset of JSON whose semantics was defined in Common Logic.

All of this John proposes in the spirit of:

The goal of my proposal is nothing less than a total *integration* of the Semantic Web methodologies with the methodologies that have been used in the traditional software development community [3].

I find common ground with a couple of the ideas in this proposal. First, accepted formats like JSON should have a prominent place in data exchange. Second, leveraging methodologies used in the traditional community is definitely a good thing.

But John, while suggesting reuse of existing traditions, is also paradoxically recommending a wholesale replacement for RDF. He is also positing a single exchange standard (JSON). And, he stops tantalizingly short of recognizing an important truth that I’m sure he knows: simple instance record assertions and representations — the essence of data exchange — can and should be viewed separately from schema representations.

As I have noted in my earlier naïve data ‘structs’ series, there are in fact scores of existing data transfer formats that have been adopted by their communities — and are likely to remain popular within those communities for some time — that can play a similar role to JSON. So long as the role of data exchange is kept to the assertions (“metadata”) about instances, many formats can play in the sandbox.

The role of RDF may or may not reside with data exchange. To conflate and equate RDF and JSON is to reduce the power of keeping instance record representations separate from schema and world view representations. John’s basic sensibilities, I think, could be more effectively promoted by not posing ‘either-or’ strawmen and recognizing that data exchange formats will ALWAYS be diverse and heterogeneous.

Observation: Existing and emerging data ‘structs’ useful to data exchange will remain manifest in format and diversity; data exchange imperatives are a different matter from schema and knowledge representation.

Example 2: RDFa is Not ‘Expressive’ Enough

Somewhat in contrast to this thread was a different one by Martin Hepp, editor of the excellent Good Relations ontology, on the LOD (linked open data) mailing list [4]. This thread, which sensibly questions how difficult it is for mere mortals to configure an Apache server to support publishing RDF, reached further into the realm of RDFa as a document annotation language.

As Hepp states,

The reason is that, as beautiful the idea is of using RDFa to make a) the human-readable presentation and b) the machine-readable meta-data link to the same literals, the problematic is it in reality once the structure of a) and b) are very different. For very simple property-value pairs, embedding RDFa markup is no problem. But if you have a bit more complexity at the conceptual level and in particular if there are significant differences to the structure of the presentation (e.g. in terms of granularity, ordering of elements, etc.), it gets very, very messy and hard to maintain.

Further discussion in this thread elaborates the interest in having the documents in which the RDFa is embedded carry much more schema-level information.

Like the Sowa case, this raises the question of where to draw the line. Should embedded metadata in documents carry complex schema information as well? So, we now shift the focus from data exchange to schema representation.

I think this is really unnecessary since it is quite easy in RDFa to refer to a separately specified schema. In this case, conflating metadata transfer and exchange with schema raises the bar unnecessarily high.

If we need to capture schema and world views, fine, let us do so directly and succinctly. Then, let our document metadata (in this case using RDFa) make attribute assertions about that “payload” simply and cleanly. The Web certainly does not need individual documents carrying with them entire schema representational views of the world.

Observation: Data exchange, even based on RDF (via RDFa), is best kept to the assertions of facts and attributes.

Example 3: Mixing Vocabularies

In a microformats context, Thomas Loertsch posed some questions on mixing vocabularies [5] and how they should be interpreted. This caused an involved discussion of intent and possible implications and best practices, with discussants including Brian Suda, Peter Mika, Ben Ward and others. It also led to the start of a useful wiki page on how objects should be represented in Web pages when multiple microformats can be invoked.

For quite some time microformats, I think, have gotten the “mix” just about right. They have created well-reasoned attributes for distinct instance types and seek to keep their embedding of that information simple in existing documents. Some advocate while others question the rigor of the microformat structure; that is not the topic here.

What is interesting about this thread is that it evolved to discuss the implications and best practices when an author posts a document with more than one microformat. How do these vocabularies relate? How should we, as “consumers” of the document, parse the vocabularies?

Yahoo!’s SearchMonkey service has recognized microformats for some time, and its questions regarding interpretation and best practices in the thread were natural. But the interesting point that seemed to come out of this thread is that users will post microformats as they wish. While care and standards in the design of the microformats can help reduce confusion and conflict, they cannot guarantee their absence. The final responsibility for proper ingest and processing likely resides with the aggregators and publishers that consume such data.

So, here, too, we have another case of asserting metadata and embedding for data exchange in a slightly different native format than RDF. Huzzah!

Observation: Standards setters and consuming agents (often aggregators, publishers or search engines) should take lead responsibility for best practices and processing attribute data, realizing that original authors and developers may not fully comply.

Revisiting the ABox and TBox Split

These examples are a bit of a long way around the barn to reinforce what we have been arguing for some time: the need for a proper split between the ABox (assertions related to instances) and the TBox (concept relationships, schema and world views) [6]. This has been a pretty constant theme in our writing, ranging from first introductions, to its relation to description logics, relationships to existing data ‘structs’, and explicit discussion of ABox and TBox roles in a four-part series.

One of the key points throughout this writing is that an ABox-TBox mindset provides a context and rigor for looking at questions such as our three examples above. In all three cases, I argue, the seeming conundrums result from lacking this mindset. Once this mindset is applied, the respective roles of various data formats, RDF, schema and the like naturally fall into place.
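A toy Python example may make the split concrete. Everything here is hypothetical (the concept names, the records and the helper functions); the point is only that the instance records carry assertions and nothing else, while the concept relationships live in a separate structure that can be maintained, shared or replaced independently.

```python
# TBox: concepts and their relationships (schema / world view). Hypothetical.
TBOX_SUBCLASS_OF = {
    "JournalArticle": "Publication",
    "Book": "Publication",
    "Publication": "Thing",
}

# ABox: assertions about instances only; no schema travels with the records.
ABOX = [
    {"id": "rec1", "type": "JournalArticle", "title": "On Widgets", "year": 2008},
    {"id": "rec2", "type": "Book", "title": "Widget Theory", "year": 2005},
    {"id": "rec3", "type": "Person", "name": "A. Author"},
]


def ancestors(concept):
    """Walk the TBox upward from a concept, yielding each superclass."""
    while concept in TBOX_SUBCLASS_OF:
        concept = TBOX_SUBCLASS_OF[concept]
        yield concept


def instances_of(concept):
    """Instances whose asserted type is the concept or one of its subclasses."""
    return [r["id"] for r in ABOX
            if r["type"] == concept or concept in ancestors(r["type"])]


print(instances_of("Publication"))  # ['rec1', 'rec2']
```

Neither record says it is a Publication; that inference comes entirely from the TBox. The records themselves could just as well have arrived as JSON, CSV or microformats.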

Of course, the Web is also a dirty and chaotic place where niceties of design and best practices are routinely ignored or unknown or purposefully rejected. So be it. This is reality. This reality needs to be accommodated. But good design can help overcome it and work to establish resilient, flexible architectures.

Of course, even though this might be good design, there is no ability to enforce such distinctions across the Web. However, insofar as key implementors are concerned (standards writers, major publishers, tools developers, industry experts, and the like) we can put in place better approaches. This mantra is at the heart of all that Structured Dynamics does — including the structWSF Web services framework, just released as open source code.

A General Data Mixing Model

So, now we can finally turn our attention to the structWSF Web services framework, more broadly described here.

There are a number of perspectives and contexts from which to view this structWSF framework. In this posting, we take the boundary conditions of data formats and data exchange [7]. The key question for this perspective is: given the realities noted above, what is an adaptive framework for data mixing on the Web? Our schematic answer to this question is below:

The basic design has two key data considerations. First, all structWSF tools and Web services and schema work from the canonical RDF data model. It is the hub and common denominator for all structWSF installations. We are able to design and optimize generic tools and services (including converters) around this canonical framework.

Second, we assume most everything in the outside world to be non-compliant with this canonical model, with the data representations often naïve and incomplete. Converters (also known as translators or RDFizers) are an essential bridge to this external world, and need to be designed for re-use and extensibility.
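As a rough sketch of what such a converter does (and not the converters shipped with structWSF), the Python below maps a naïve JSON record onto subject-predicate-object triples. The Dublin Core term URIs are real vocabulary; which fields map to which terms, the `convert` function and the example.org base URI are assumptions made for illustration.

```python
import json

# Hypothetical mapping from a naive format's field names to canonical
# property URIs. The pairing of fields to terms is illustrative only.
FIELD_MAP = {
    "title": "http://purl.org/dc/terms/title",
    "author": "http://purl.org/dc/terms/creator",
    "year": "http://purl.org/dc/terms/date",
}


def convert(json_text, base_uri):
    """Turn a JSON list of flat records into (subject, predicate, object) triples.

    Fields with no mapping are skipped and reported, rather than guessed at.
    """
    triples, unmapped = [], set()
    for record in json.loads(json_text):
        subject = base_uri + str(record["id"])
        for field, value in record.items():
            if field == "id":
                continue
            predicate = FIELD_MAP.get(field)
            if predicate is None:
                unmapped.add(field)
                continue
            # A multi-valued field becomes one triple per value.
            values = value if isinstance(value, list) else [value]
            triples.extend((subject, predicate, str(v)) for v in values)
    return triples, sorted(unmapped)


naive = ('[{"id": "b1", "title": "On Widgets", '
         '"author": ["A. Author", "B. Writer"], "shelf": "Q3"}]')
triples, unmapped = convert(naive, "http://example.org/record/")
for t in triples:
    print(t)
print("unmapped fields:", unmapped)
```

Keeping the field mapping as data rather than code is what makes a converter reusable: supporting a new source often means writing a new mapping table, not a new program.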

Where parties in the outside world are compliant, they either conform to the structWSF APIs or are themselves structWSF installations. In these cases, direct data exchange and access with permission rights occurs at a dataset level (not shown).

The Naïve Part of the Spectrum

Converters are themselves bona fide Web services at the structWSF level. (Only a few are presently included in the alpha release.) While some may be one-off converters (sometimes off-the-shelf RDFizers), and often devoted to large volume external data sources, it is also helpful to emphasize one or more “standard” naïve external formats. A “standard” external format allows for a more sophisticated converter and enables specific tools to be more easily justified around the standard naïve format.

As noted above, this “standard” is often JSON or a derivative of JSON. But, just as readily, the common ‘naïve’ format could be SQL from relational databases or another format common to the community at hand. In many ways, because the emphasis of data exchange is on the ABox and instance records and assertions (and attribute extensions), the actual format and serialization is pretty much immaterial.

Emphasizing one or a few naïve external formats allows more tools and services to be cost-effectively developed for those formats. And, even though the format(s) chosen for this external standard may lack the expressiveness of RDF (and, ultimately, OWL), because the burden is principally related to data exchange, this layer can be readily optimized for the deployment at hand.

Besides import converters it is also important to have export services for the more broadly used naïve external formats. In fact, some structWSF services can be devoted to data cleanup or attribute (property) or object reconciliation (including disambiguation as a possibility). In this manner, structWSF installations could also improve the authority and trustworthiness of standard data in the wild.

Another common service for this naïve data is to give it unique URI identifiers and to make it Web-accessible, thus turning it into linked data.
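A minimal Python sketch of that idea follows, assuming flat records with one field that can serve as a local key. The function names, the example.org URIs and the sample row are hypothetical, and every value is written as a plain string literal for simplicity. The output is N-Triples, one statement per line; making the minted URIs actually resolve would still require serving them over HTTP.

```python
from urllib.parse import quote


def mint_uri(base, local_id):
    """Build an identifier from a dataset base URI and a local key."""
    return base.rstrip("/") + "/" + quote(str(local_id), safe="")


def literal(value):
    """Escape a value as an N-Triples plain string literal."""
    escaped = (str(value).replace("\\", "\\\\").replace('"', '\\"')
               .replace("\n", "\\n").replace("\r", "\\r"))
    return '"' + escaped + '"'


def to_ntriples(base, records, key, property_base):
    """Serialize flat records as N-Triples lines, one per attribute value."""
    lines = []
    for record in records:
        subject = mint_uri(base, record[key])
        for field, value in record.items():
            if field != key:
                predicate = property_base + quote(field, safe="")
                lines.append(f"<{subject}> <{predicate}> {literal(value)} .")
    return lines


rows = [{"isbn": "0-00-000000-0", "title": 'The "Widget" Book'}]
for line in to_ntriples("http://example.org/books/", rows, "isbn",
                        "http://example.org/vocab/"):
    print(line)
```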

The RDF Canonical Data Model

Such generic services are possible because the “highest common denominator” for the system is the canonical RDF model. Because it is the consistent basis for tools and services, once a converter is available and the external information schema is mapped to the internal structure, all existing tools and services are available for re-use. Moreover, this system and its datasets are now ready for sharing with other structWSF instances, within the enterprise or beyond.
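The payoff of a common internal form can be shown with a small Python sketch. Here a list of dictionaries stands in for the canonical model (structWSF itself uses RDF), and the reader registry, function names and sample data are all hypothetical. Two different naïve formats are read in, and one generic "tool" works on both without knowing where the records came from.

```python
import csv
import io
import json


def from_json(text):
    """Reader for a naive JSON list of records."""
    return json.loads(text)


def from_csv(text):
    """Reader for naive CSV with a header row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


# Hypothetical registry: one reader per naive external format.
READERS = {"json": from_json, "csv": from_csv}


def load(fmt, text):
    """Parse any registered format into the same list-of-records shape."""
    return READERS[fmt](text)


def titles(records):
    """A generic 'tool' that only ever sees the common record shape."""
    return sorted(r["title"] for r in records)


json_src = '[{"id": "1", "title": "On Widgets"}]'
csv_src = "id,title\n2,Widget Theory\n"

combined = load("json", json_src) + load("csv", csv_src)
print(titles(combined))  # ['On Widgets', 'Widget Theory']
```

Adding a third format means adding one entry to the registry; `titles` and every other tool written against the common shape keep working unchanged.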

Thus, we begin to see a network of canonical “hubs” in a sea of heterogeneity, the interoperation of which is facilitated by a structWSF framework at every network node. This design is discussed more in the next part of this series.

Some, such as Sowa noted above, would prefer a grounding in common logic (CL) as opposed to RDF. Our choice to use RDF is based on the simplicity and understandability of the data model, plus the richness of languages and standards from the W3C that surround the framework.

Even here, however, the RDF basis of structWSF need not be the final word. Because of a keen intent to keep all designs and ontologies used by structWSF firmly grounded in description logics, it is possible for the structWSF basis to be converted to other languages and frameworks such as CL that can be expressed in DL.

Bringing it Back to Data Federation

Data mixing (or, preferably, data federation) has at its heart the premise of heterogeneous and distributed data sources. It implicitly acknowledges differences in syntax, semantics and serializations.

The design and architecture of structWSF is similarly premised. While each of us may prefer one model or one format over others, we must interoperate in the real world. And that world, for many understandable and immutable reasons, will retain its diversity. Accepting this reality is a first step to adaptive design.

So, we control what we can control, and we adapt to what else exists. We have chosen RDF as the canonical data model that we can control and have embedded it in a Web services framework that is Web-based and scalable; in other words, a fully compliant Web-oriented architecture. These are the conceptual foundations to structWSF.

To be sure, structWSF in its current alpha release is quite raw in many areas and incomplete in others. But we will continue to work on it — and invite your participation to do the same — such that it can fulfill its destiny as a data federation framework for the Web.

[1] I first wrote about this while at BrightPlanet; a page is still up on that Web site with the text above. I have recast this material in various ways since.

[2] I have previously written on the “40 sources” of data heterogeneity. See here, for example.

[3] See http://ontolog.cim3.net/forum/ontolog-forum/2009-06/msg00210.html and continue to follow the noted thread.

[4] See the thread, ‘ .htaccess a major bottleneck to Semantic Web adoption,’ at http://lists.w3.org/Archives/Public/public-lod/2009Jun/0341.html and continue to follow this thread.

[5] See http://microformats.org/discuss/mail/microformats-discuss/2009-June/012985.html and continue to follow the ‘mixing vocabularies’ thread.

[6] This is our working definition of the ABox and TBox in specific reference to description logics:

[7] For functionality, download, documentation or other direct materials on structWSF, please see OpenStructs.org and its related resources. There is also a Drupal instantiation of the system called conStruct, also available for download.

Posted:June 23, 2009

Data-driven Applications with conStruct

The slides for our conStruct announcement at SemTech 2009 have been posted on Slideshare.

The slideshow, Data-driven Apps with conStruct, has much on the architecture and benefits of conStruct, from the context of the Bibliographic Knowledge Network (BKN) project. The slides came from my talk on “BKN: Building Knowledge through Communities, and Communities through Knowledge.”

Enjoy!
