There has been a bit of a manic-depressive character on the Web waves of late with respect to linked data. On the one hand, we have seen huzzahs and celebrations from the likes of ReadWriteWeb and Semantic Web.com and, just concluded, the Linked Data on the Web (LDOW) workshop at WWW2010. This treatment has tended to tout the coming of the linked data era and to seek ideas about possible, cool linked data apps . This rise in visibility has been accomplished by much manic and excited discussion on various mailing lists.
On the other hand, we have seen much wringing of hands and gnashing of teeth for why linked data is not being used more and why the broader issue of the semantic Web is not seeing more uptake. This depressive “call to arms” has sometimes felt like ravings with blame being given to the poor state of apps and user interfaces to badly linked data to the difficulty of publishing same. Actually using linked data for anything productive (other than single sources like DBpedia) still appears to be an issue.
Meanwhile, among others, Kingsley Idehen, ubiquitous voice on the Twitter #linkeddata channel, has been promoting the separation of identity of linked data from the notion of the semantic Web. He is also trying to change the narrative away from the association of linked data with RDF, instead advocating “Data 3.0″ and the entity-attribute-value (EAV) model understanding of structured data.
As someone less engaged in these topics since my own statements about linked data over the past couple of years , I have my own distanced-yet-still-biased view of what all of this crisis of confidence is about. I think I have a diagnosis for what may be causing this bipolar disorder of linked data .
A fairly universal response from enterprise prospects when raising the topic of the semantic Web is, “That was a big deal of about a decade ago, wasn’t it? It didn’t seem to go anywhere.” And, actually, I think both proponents and keen observers agree with this general sentiment. We have seen the original advocate, Tim Berners-Lee, float the Giant Global Graph balloon, and now Linked Data. Others have touted Web 3.0 or Web of Data or, frankly, dozens of alternatives. Linked data, which began as a set of techniques for publishing RDF, has emerged as a potential marketing hook and saviour for the tainted original semantic Web term.
And therein, I think, lies the rub and the answer to the bipolar disorder.
If one looks at the original principles for putting linked data on the Web or subsequent interpretations, it is clear that linked data (lower case) is merely a set of techniques. Useful techniques, for sure; but really a simple approach to exposing data using the Web with URLs as the naming convention for objects and their relationships. These techniques provide (1) methods to access data on the Web and (2) specifying the relationships to link the data (resources). The first part is mechanistic and not really of further concern here. And, while any predicate can be used to specify a data (resource) relationship, that relationship should also be discoverable with a URL (dereferencable) to qualify as linked data. Then, to actually be semantically useful, that relationship (predicate) should also have a precise definition and be part of a coherent schema. (Note, this last sentence is actually not part of the “standard” principles for linked data, which itself is a problem.)
When used right, these techniques can be powerful and useful. But, poor choices or execution in how relationships are specified often leads to saying little or nothing about semantics. Most linked data uses a woefully small vocabulary of data relationships, with even a smaller set ever used for setting linkages across existing linked data sets . Linked data techniques are a part of the foundation to overall best practices, but not the total foundation. As I have argued for some time, linked data alone does not speak to issues of context nor coherence.
To speak semantically, linked data is not a synonym for the semantic Web nor is it the sameAs the semantic Web. But, many proponents have tried to characterize it as such. The general tenor is to blow the horns hard anytime some large data set is “exposed” as linked data. (No matter whether the data is incoherent, lacks a schema, or is even poorly described and defined.) Heralding such events, followed by no apparent usefulness to the data, causes confusion to reign supreme and disappointment to naturally occur.
The semantic Web (or semantic enterprise or semantic government or similar expressions) is a vision and an ideal. It is also a fairly complete one that potentially embraces machines and agents working in the background to serve us and make us more productive. There is an entire stack of languages and techniques and methods that enable schema to be described and non-conforming data to be interoperated. Now, of course this ideal is still a work in progress. Does that make it a failure?
Well, maybe so, if one sees the semantic Web as marketing or branding. But, who said we had to present it or understand it as such?
The issue is not one of marketing and branding, but the lack of benefits. Now, maybe I have it all wrong, but it seems to me that the argument needs to start with what “linked data” and the “semantic Web” can do for me. What I actually call it is secondary. Rejecting the branding of the semantic Web for linked data or Web 3.0 or any other somesuch is still dressing the emperor in new clothes.
For a couple of years now I have tried in various posts to present linked data in a broader framework of structured and semantic Web data. I first tried to capture this continuum in a diagram from July 2007:
|Document Web||Structured Web||Semantic Web|
Now, three years later, I think the transitional phase of linked data is reaching an end. OK, we have figured out one useful way to publish large datasets staged for possible interoperability. Sure, we have billions of triples and assertions floating out there. But what are we to do with them? And, is any of it any good?
I think Kingsley is right in one sense to point to EAV and structured data. We, too, have not met a structured data format we did not like. There are hundreds of attribute-value pair models of even more generic nature that also belong to the conversation.
One of my most popular posts on this blog has been, ‘Structs’: Naïve Data Formats and the ABox, from January 2009. Today, we have a multitude of popular structured data formats from XML to JSON and even spreadsheets (CSV). Each form has its advocates, place and reasons for existence and popularity (or not). This inherent diversity is a fact and fixture of any discussion of data. It is a major reason why we developed the irON (instance record and object notation) non-RDF vocabulary to provide a bridge from such forms to RDF, which is accessible on the Web via URIs. irON clearly shows that entities can be usefully described and consumed in either RDF or non-RDF serialized forms.
Though RDF and linked data is a great form for expressing this structured information, other forms can convey the same meaning as well. Of the billions of linked data triples exposed to date, surely more than 99% are of this instance-level, “ABox” type of data . And, more telling, of all of the structured data that is publicly obtainable on the Web, my wild guess is that less than 0.0000000001% of that is even linked RDF data .
Neither linked data nor RDF alone will — today or in the near future — play a pivotal or essential role for instance data. The real contribution from RDF and the semantic Web will come from connecting things together, from interoperation and federation and conjoining. This is the provenance of the TBox and is a role barely touched by linked data. Publishing data as linked data helps tremendously in simplifying ingest and guiding the eventual connections, but the making of those connections, testing for their quality and reliability, are steps beyond the linked data ken or purpose.
It seems, then, that we see two different forces and perspectives at work, each contributing in its own way to today’s bipolar nature of linked data.
On the manic side, we see the celebration for the release of each large, linked data set. This perspective seems to care most about volumes and numbers, with less interest in how and whether the data is of quality or useful. This perspective seems to believe “post the data, and the public will come.” This same perspective is also quite parochial with respect to the unsuitability of non-linked data, be it microdata, microformats or any of the older junk.
On the depressed side, linked data has been seen as a more palatable packaging for the disappointments and perceived failures or slow adoption of the earlier semantic Web phrasing. When this perspective sees the lack of structure, defensible connections and other quality problems with linked data as it presently exists, despair and frustration ensue.
But both of these perspectives very much miss the mark. Linked data will never become the universal technique for publishing structured data, and should not be expected to be such. Numbers are never a substitute for quality. And linked data lacks the standards, scope and investment made in the semantic Web to date. Be patient; don’t despair; structured data and the growth of semantics and useful metadata is proceeding just fine.
Unrealistic expectations or wrong roles and metrics simply confuse the public. We are fortunate that most potential buyers do not frequent the community’s various mailing lists. Reduced expectations and an understanding of linked data’s natural role is perhaps the best way to bring back balance.
We have consciously moved our communications focus from speaking internally to the community to reaching out to the broader enterprise public. There is much of education, clarification and dialog that is now needed with the buying public. The time has moved past software demos and toys to workable, pragmatic platforms, and the methodologies and documentation necessary to support them. This particular missive speaking to the founding community is (perhaps many will Hurray!) likely to become even more rare as we continue to focus outward.
As Structured Dynamics has stated many times, we are committed to linked data, presenting our information as such, and providing better tools for producing and consuming it. We have made it one of the seven foundations to our technology stack and methodology.
But, linked data on its own is inadequate as an interoperability standard. Many practitioners don’t publish it right, characterize it right, or link to it right. That does not negate its benefits, but it does make it a poor candidate to install on the semantic Web throne.
Linked data based on RDF is perhaps the first citizen amongst all structured data citizens. It is an expressive and readily consumed means for publishing and relating structured instance data and one that can be easily interoperated. It is a natural citizen of the Web.
If we can accept and communicate linked data for these strengths, for what it naturally is — a useful set of techniques and best practices for enabling data that can be easily consumed — we can rest easy at night and not go crazy. Otherwise, bring on the Prozac.
OK, you’ve been reading the literature and perhaps have attended a conference or two. You have heard a lot about semantic technologies, but have some real questions and concerns:
Such questions — and more — are not infrequent when organizations first contemplate making the transition to become a semantic enterprise.
The diagram below shows a general workflow for migrating existing instance data into the semantic enterprise. The diagram is broken down into three parts. The first part is to characterize and stage existing data and information into the underlying structured data framework. This is what SD (that is, my firm, Structured Dynamics) does as data architects using our particular approach to adaptive ontologies. I’ll touch on this again in a moment.
Jumping to the right-hand side of the diagram is the access and display part. It is here that developers or users can make selections from dropdown lists and so forth to define the “slices” of diced results sets they wish to display. The results of those interactions are structured data results sets that are pre-staged to “drive” various applications and displays [1,2]. These same capabilities can also be embedded into standard Web end user applications, such as content management systems.
The third and middle part of the diagram is the critical part, the pivot point. It is the interface layer between the structured data on the left and the display and presentation of that data on the right. As provided by SD, this abstraction layer is the structWSF Web services framework that “bridges” between the black box of what happens with RDF and semantic Web structured data characterizations on the left in order to feed, or “drive”, useful services and functions on the right.
We call this general design and architecture “ontology-driven applications”. The bulk of this posting explains each of these three parts in a bit more detail, organized from left-to-right by these Parts 1 to 3.
Our approach relies on what we call “adaptive ontologies”. These ontologies set the structural basis for all subsequent data display, analysis, inferencing, entailments, and the like. We call them “adaptive” because we embrace a set of unique best practices. These practices enable the ontologies to do the double-duty of first structuring data and then driving generic applications by properly informing user interfaces, dropdown lists, menus and the like.
This structuring results in faceting key important dimensions and attributes of available content. Structured data gets organized. Unstructured data (text) gets tagged via this structure and integrated with it.
As Structured Dynamics’ general product schema makes clear (see the diagram at ), our approach leverages existing assets as much as possible. Often, this means leaving most existing data structures in place. These existing assets are staged and converted in two complementary manners that largely correspond to the conceptual ABox (instance) and TBox (concepts and schema) split central to description logics and pivotal to SD’s methodology .
Whether transitioning small chunks or big chunks, this staging of existing data in Part 1 results in an RDF-accessible characterization of the starting content. Instances and their attributes are represented via a common notation, generally based on irON (instance record and Object Notation) , that is an extensible notation and vocabulary for capturing the data characterizations, attributes and metadata of the candidate instance data (“records” in RDBMS parlance). These instances may either be internal or proprietary records, or instance data on the Web or in the public domain. By properly matching same or similar instances to one another, any source of instance characterization can be meaningfully combined.
This instance notation is extremely lightweight, and really is merely an RDF representation of data characterizations. In the characterizations to this point there is not yet any “world view” involved: we are simply describing instances and their attributes in a manner akin to key-value pairs. The process to this point is entirely descriptive.
However, these instance characteristics do contain within them the semantics as to how to describe these attributes (your “glad” is my “happy”), as well as potentially a schematic or conceptual view of how these instances relate to one another and to the broader world. Instance characterizations provide the building blocks, that are then related and made semantically whole via a second “terminological” level.
These terminological, or conceptual, relationships (the TBox ), reside at a different level from simply decribing things. Rather, these schema — what in this context are best known as ontologies — provide a precise language and means for describing conceptual relationships. If these structural relationships are done well, they are coherent: the hip bone is connected to the thigh bone and not to the ear. Coherence is a matter of a consistent world view that “hangs together” when analyzed via powerful logical techniques available via description logics and other broader mechanisms of the semantic enterprise.
Thus, as we transition from the existing, the operational workflow splits the input data stream into two pathways:
A sequential flow of these steps and splits is provided by this diagram below that shows: 1) the conceptual structure of the concept ontology; as 2) matched with the instances and their descriptive attributes that populate that schema.
A key point is that — while a proper starting ontology is essential to our process and proofs-of-concept — it can be grown and scaled incrementally. We leverage as much existing starting structure as possible and can readily bound the scope to meet budget and delivery imperatives.
The concepts and entities that occur within these structures help inform our fairly simple tagging system, scones . (There are also benefits from “triangulating” between entity or instance identification and concept identification that helps inform disambiguation nearly for free; see further ). It is also possible to integrate these initial proof-of-concept approaches with third-party tools (e.g., Calais, Expert System (Cogito), etc.) to improve unstructured content characterization.
These approaches are pretty straightforward for any organization wanting to test the idea of becoming a semantic enterprise. Real benefits — such as concept retrievals overcoming the limitations of standard keyword search — can be demonstrated from even small starting ontologies and structures. Given the inherent connectedness of the data, it is possible to expand the scope and usefulness of the information incrementally within fixed and manageable budgets.
A pivotal part of SD’s infrastructure software is structWSF , our platform-independent Web services middleware. structWSF is an abstraction layer that provides the APIs, search endpoints, and specific Web services for accessing, querying or getting results sets from the underlying structured data and ontologies.
structWSF has a standard set of access and retrieval services including browse, full-text search, CRUD, direct record retrievals, and the like. It is embedded within an access and permissions service that acts at the level of registered datasets. Then, based on the requested protocol, structWSF returns the filtered results set. These results sets can be delivered as XML, JSON, or any of the other formats already available . They can readily and dynamically populate HTML pages and forms in any deployment framework. For specific purposes, these results sets can also be returned as pre-staged, properly formatted results streams for driving specific applications.
As an API, the structWSF Web services can be interacted with and driven via standard HTTP requests. Alternatively, these requests can come from simple to complicated Web apps that create the API queries based on user interface choices such as selections from dropdown lists or clicking on various listed options. An interactive demo of this approach is shown by SD’s conStruct application , though even simpler Web pages or forms may drive the query interface.
Queries and requests to structWSF may also include a parameter for results sets to be returned in particular formats. SD’s irON protocol  supports requests or results in CSV, XML or JSON, in addition to other flavors including multiple serializations of RDF.
In this manner, only a simple converter need be added to the structWSF Web services stack in order to “drive” a new application with a particularly formatted results set stream.
structWSF thus acts as a single, uniform Web interface to all of the “black box” nuances of the structured data system organized by the adaptive ontologies. Further, virtually any data structure may be ingested and converted from external sources via an import service and made part of the underlying canonical structure, making the framework perfect for data federation . Lastly, the dataset nature of the framework, and its neutrality to underlying data stores or content management systems, also makes structWSF an excellent framework for one or many nodes to share information and collaborate across the Web .
The following diagram shows how a diverse, Web-based network, involving a diversity of Web portals and data gateways and hubs, can work via the structWSF framework to establish a complete collaboration network. Via datasets and differential access rights and permissions, virtually any combination of potential interactions can be supported:
These potentials are really fundamentally new, and we ourselves are still trying to find the language and analogies to best explain them. structWSF was initially designed as a platform-independent layer between the structured data representation of existing assets and the ontology-driven applications that interact with them. We are now finding that deployment in a broader Web-based context provides additional exciting prospects for integrating various regional offices or to enable direct collaboration with customers, partners or suppliers.
The basic design of structWSF is to provide a middleware layer that fulfills one or more of these broad user interaction modes:
SD has developed generic applications in these areas (with many more possible), the operations of which are guided by the instructions and nature of the underlying data that feeds them. We have proven it is possible to adopt data characterization practices within those ontologies so as to stage or “drive” such generic applications.
In the case of a standard structured data display (say, a simple table like a Wikipedia infobox, for example), such generic design includes templates tailored to various instance types (say, locational information presenting on a map versus people information warranting a image and vital statistics). Alternatively, in the generic design for some specialized application (say, Adobe Flash), the information output of the results set may need to contain certain formats and attributes.
SD’s “ontology-driven apps”, then, are really informed structured results sets that are outputted in a form suitable to various intended applications. This output form can include a variety of serializations, formats or metadata. This flexibility of output that is tailored to and responsive to particular generic applications is what makes our ontologies “adaptive”.
Expressed in this manner, “ontology-driven apps” seem neither remarkably profound nor clever. They are simply attentive to their intended uses.
Using this structure, then, it is possible to either “drive” queries and results sets selections via direct HTTP request or via simple dropdown selections on HTML forms (that is, from right to left as shown on the first diagram). Similarly, it is possible with a single parameter change to drive either a visualization app or a structured table template from the equivalent query request (that is, from left to right on the first diagram).
“Ontology-driven apps” through SD’s architecture design thus provide two profound benefits. First, the entire system can be driven via simple selections or interactions without the need for any programming or technical expertise. And, second, simple additions of new and minor output converters can work to power entirely new applications available to the system. If, say, Adobe graphics applications need to change tomorrow for Microsoft Silverlight, that switch is easy and can be made transparent to the designer.
The ability to develop these systems incrementally and the ability to integrate with external, public data is fundamentally dependent on the open world assumption. The open world assumption is a different logic premise than what many enterprises are used to; relational database systems, for example, embrace the alternate closed world premise.
Open world does not necessarily mean open data and it does not mean open source. Open world is merely a way to think about the information we have and how we act on it. An open world assumption accepts that we never have all necessary information and lacking that information does not itself lead to any conclusions.
Some enterprise circumstances – say a complete enumeration of customers or products or even controlled engineering or design environments — may warrant a closed world approach. In those circumstances, the domain of inquiry is well bounded and we can get relatively complete information about it. Engineering an oil drilling platform or launching the Space Shuttle in fact demands that.
But, in most real world circumstances, there is much we don’t know and we interact in complex and external environments. Open world is the proper logic premise for these circumstances. These circumstances also happen to be the very basis in which most most knowledge workers and analysts reside.
Open world frameworks provide some incredibly important benefits if the circumstances of their use apply:
One might argue, as we believe, that the biggest impediment to the semantic enterprise is the mind shift necessary to start thinking about and accepting the open world premise. Again, this perspective is not applicable to all problems and domains. But, where it is, much can be left in place and leveraged with semantic technologies, so long as the enterprise begins to look at these existing assets through a different open-world lens.
So, let’s return to the rhetorical questions that began this posting.
It should now be clear that it is possible to start small in testing the transition to a semantic enterprise. These efforts can be done incrementally and with focus on early, high-value applications and domains.
Further, we need not abandon past practices. There is much that can be done to leverage existing assets. Indeed, those prior investments are often the requisite starting basis to inform semantic initiatives. However, in leveraging those assets, it is important that the enterprise begin to embrace and understand the open world assumption.
We also see that RDF and OWL, while important behind the scenes as a canonical data model and languages for organizing this information, need not be exposed as such to most users. Most instance data can be expressed as is with the data languages of choice such as XML, JSON or whatever.
We also see these technologies are neutral to the question of open or public sources. The techniques can equivalently be applied to internal, closed, proprietary data and structures. Moreover, the technologies can themselves be used as a basis for bringing external information into the enterprise.
Without a doubt, some of the early years in describing semantic technologies were burdened with some unfortunate bad information and lack of sophistication. Today’s semantic Web is nimble, agile, and ready to be deployed immediately at low cost and risk. So, jump on in! We think you’ll find the water to be just fine.
In a former life, I had the nickname of ‘Spreadsheet King’ (perhaps among others that I did not care to hear). I had gotten the nick because of my aggressive use of spreadsheets for financial models, competitors tracking, time series analyses, and the like. However, in all honesty, I have encountered many others in my career much more knowledgeable and capable with spreadsheets than I’ll ever be. So, maybe I was really more like a minor duke or a court jester than true nobility.
Yet, pro or amateur, there are perhaps 1 billion spreadsheet users worldwide , making spreadsheets undoubtedly the most prevalent data authoring environment in existence. And, despite moans and wails about how spreadsheets can lead to chaos, spaghetti code, or violations of internal standards, they are here to stay.
Spreadsheets often begin as simple notetaking environments. With the addition of new findings and more analysis, some of these worksheets may evolve to become full-blown datasets. Alternatively, some spreadsheets start from Day One as intended datasets or modeling environments. Whatever the case, clearly there is much accumulated information and data value “locked up” in existing spreadsheets.
How to “unlock” this value for sharing and collaboration was a major stimulus to development of the commON serialization of irON (instance record and Object Notation) . I recently published a case study  that describes the reasons and benefits of dataset authoring in a spreadsheet, and provides working examples and code based on Sweet Tools  to aid users in understanding and using the commON notation. I summarize portions of that study herein.
The dataset that is the focus of this use case, Sweet Tools, began as an informal tracking spreadsheet about four years ago. I began it as a way to learn about available tools in the semantic Web and -related spaces. I began publishing it and others found it of value so I continued to develop it.
As it grew over time, however, it gained in structure and size. Eventually, it became a reference dataset, with which many other people desired to use and interact. The current version has well over 800 tools listed, characterized by many structured data attributes such as type, programming language, description and so forth. As it has grown, a formal controlled vocabulary has also evolved to bring consistency to the characterization of many of these attributes.
It was natural for me to maintain this listing as a spreadsheet, which was also reinforced when I was one of the first to adopt an Exhibit presentation of the data based on a Google spreadsheet about three years back. Here is a partial view of this spreadsheet as I maintain it locally:
When we began to develop irON in earnest as a simple (“naïve”) dataset authoring framework, it was clear that a comma-separated value, or CSV , option should join the other two serializations under consideration, XML and JSON. CSV, though less expressive and capable as a data format than the other serializations, still has an attribute-value pair (also known as key-value pairs and many other variants ) orientation. And, via spreadsheets, datasets can be easily authored and inspected, while also providing a rich functional environment including sorting, formatting, data validation, calculations, macros, etc.
As a dataset very familiar to us as irON‘s editors, and directly relevant to the semantic Web, Sweet Tools provided a perfect prototype case study for helping to guide the development of irON, and specifically what came to be known as the commON serialization for irON. The Sweet Tools dataset is relatively large for a speciality source, has many different types and attributes, and is characterized by text, images, URLs and similar.
The premise was that if Sweet Tools could be specified and represented in commON sufficiently to be parsed and converted to interoperable RDF, then many similar instance-oriented datasets could likely be so as well. Thus, as we tried and refined notation and vocabulary, we tested applicability against the CSV representation of Sweet Tools in addition to other CSV, JSON and XML datasets.
A large portion of the case study describes the many advantages of authoring small datasets within spreadsheets. The useful thing about the CSV format is that these full functional capabilities of the spreadsheet are available during authoring or later updates and modifications, but, when exported, the CSV provides a relatively clean format for processing and parsing.
So, some of the reasons for small dataset authoring in a spreadsheet include:
The next major section of the case study deals with the minor conventions that must be followed in order to stage spreadsheets for commON. Not much of the specific commON vocabulary or notation is discussed below; for details, see .
Because you can create multiple worksheets within a spreadsheet, it is not necessary to modifiy existing worksheets or tabs. Rather, if you are reluctant or can not change existing information, merely create parallel duplicate sheets of the source information. These duplicate sheets have as their sole purpose export to commON CSV. You can maintain your spreadsheet as is while staging for commON.
To do so, use the simple = formula to create cross-references between the existing source spreadsheet tab and the target commON CSV export tab. (You can also do this for complete, highlighted blocks from source to target sheet.) Then, by adding the few minor conventions of commON, you have now created a staged export tab without modifying your source information in the slightest.
In standard form and for Excel and Open Office, single quotes, double quotes and commas when entered into a spreadsheet cell are automatically ‘escaped‘ when issued as CSV. commON allows you to specify your own delimiter for lists (the standard is the pipe ‘|’ character) and what the parser recognizes as the ‘escape’ character (‘\’ is the standard). However, you probably should not change for most conditions.
The standard commON parsers and converters are UTF-8 compatible. If your source content has unusual encodings, try to target UTF-8 as your canonical spreadsheet output.
In the irON specification there are a small number of defined modules or processing sections. In commON, these modules are denoted by the double-ampersand character sequence (‘&&‘), and apply to lists of instance records (&&recordList), dataset specifications and associated metadata describing the dataset (&&dataset), and mappings of attributes and types to existing schema (&&linkage). Similarly, attributes and types are denoted by a single ampersand prefix (&attributeName).
In commON, any or all of the modules can occur within a single CSV file or in multiple files. In any case, the start of one of these processing modules is signaled by the module keyword and &&keyword convention.
The first spreadsheet figure above shows a Sweet Tools example for the &&recordList module. The module begins with that keyword, indicating one of more instance records will follow. Note that the first line after the &&recordList keyword is devoted to the listing of attributes and types for the instance records (designated by the &attributeName convention in the columns for the first row after the &&recordList keyword is encountered).
The &&recordList format can also include the stacked style (see similar Dataset example below) in addition to the single row style shown above.
At any rate, once a worksheet is ready with its instance records following the straightforward irON and commON conventions, it can then be saved as a CSV file and appropriately named. Here is an example of what this “vanilla” CSV file now looks like when shown again in a spreadsheet:
Alternatively, you could open this same file in a text editor. Here is how this exact same instance record view looks in an editor:
Note that the CSV format separates each column by the comma separator, with escapes shown for the &description attribute when it includes a comma-separated clause. Without word wrap, each record in this format occupies a single row (though, again, for the stacked style, multiple entries are allowed on individual rows so long as a new instance record &id is not encountered in the first column).
The &&dataset module defines the dataset parameters and provides very flexible metadata attributes to describe the dataset . Note the dataset specification is exactly equivalent in form to the instance record (&&recordList) format, and also allows the single row or stacked styles (see these instance record examples), with this one being the stacked style:
The &&linkage module is used to map the structure of the instance records to some structural schema, which can also include external ontologies. The module has a simple, but specific structure.
Either attributes (presented as the &attributeList) or types (presented as the &typeList) are listed sequentially by row until the listing is exhausted . By convention, the second column in the listing is the targeted &mapTo value. Absent a prior &prefixList value, the &mapTo value needs to be a full URL to the corresponding attribute or type in some external schema:
Notice in the case of Sweet Tools that most values are from the actual COSMO mini-ontology underlying the listing. These need to be listed as well, since absent the specifications in commON the system has NO knowledge of linkages and mappings.
In its current state of development, commON does not support a spreadsheet-based means for specifying the schema structure (lightweight ontology) governing the datasets . Another irON serialization, irJSON, does. Either via this irJSON specification or via an offline ontology, a link reference is presently used by commON (and, therefore, Sweet Tools for this case study) to establish the governing structure of the input instance record datasets.
A spreadsheet-based schema structure for commON has been designed and tested in prototype form. commON should be enhanced with this capability in the near future .
If the modules are spread across more than one worksheet, then each worksheet must be saved as its own CSV file. In the case of Sweet Tools, as exhibited by its reference current spreadsheet, sweet_tools_20091110.xls, three individual CSV files get saved. These files can be named whatever you would like. However, it is essential that the names be remembered for later referencing.
My own naming convention is to use a format of appname_date_modulename.csv because it sorts well in a file manager accommodating multiple versions (dates) and keeps related files clustered. The appname in the case of Sweet Tools is generally swt. The modulename is generally the dataset, records, or linkage convention. I tend to use the date specification in the YYYYMMDD format. Thus, in the case of the records listings for Sweet Tools, its filename could be something like: swt_20091110_records.csv.
Once saved, these files are now ready to be imported into a structWSF  instance, which is where the CSV parsing and conversion to interoperable RDF occurs . In this case study, we used the Drupal-based conStruct SCS system . conStruct exposes the structWSF Web services via a user interface and a user permission and access system. The actual case study write-up offers more details about the import process.
We are now ready to interact with the Sweet Tools structured dataset using conStruct (assuming you have a Drupal installation with the conStruct modules) .
The screen capture below shows a couple of aspects of the system:
One of the absolutely cool things about this framework is that all tools, inferencing, user interfaces and data structure are a direct result of the ontology(ies) underlying the system (plus the irON instance ontology, as well). This means that switching datasets or adding datasets causes the entire system structure to now reflect those changes — without lifting a finger!!
Here are a few sample things you can do with these generic tools driven by the Sweet Tools dataset:
Note, if you access this conStruct instance you will do so as a demo user. Unfortunately, as such, you may not be able to see all of the write and update tools, which in this case are reserved for curators or admins. Recall that structWSF has a comprehensive user access and permissions layer.
Of course, one of the real advantages of the irON and structWSF designs is to enable different formats to be interchanged and to interoperate. Upon submission, the commON format and its datasets can then be exported in these alternate formats and serializations :
As should be obvious, one of the real benefits of the irON notation — in addition to easy dataset authoring — is the ability to more-or-less treat RDF, CSV, XML and JSON as interoperable data formats.
The formal Sweet Tools case study based on commON, with sample download files and PDF, is available from Annex: A commON Case Study using Sweet Tools, Supplementary Documentation .
Attribute-values can also be presented as pairs in a form of an associative array, where the first item listed is the attribute, often followed by a separator such as the colon, and then the value. JSON and many simple data struct notations follow this format. This format may also be called attribute-value pairs, key-value pairs, name-value pairs, alists or others. In these cases the “object” is implied, or is introduced as the name of the array..
Much has been happening on the Structured Dynamics front of late. Besides welcoming Steve Ardire as a senior advisor to the company, we also have been issuing a steady stream of new products from our semantic Web pipeline.
This new slide show attempts to capture these products and relate them to the various layers in Structured Dynamics’ enterprise product stack:
The show indicates the role of scones, irON, structWSF, UMBEL, conStruct and others and how they leverage existing information assets to enable the semantic enterprise. And, oh, by the way, all of this is done via Web-accessible linked data and our practical technologies.
On behalf of Structured Dynamics, I am pleased to announce our release into the open source community of irON — the instance record and Object Notation — and its family of frameworks and tools . With irON, you can now author and conduct business solely in the formats and tools most familiar and comfortable to you, all the while enabling your data to interact with the semantic Web.
irON is an abstract notation and associated vocabulary for specifying RDF triples and schema in non-RDF forms. Its purpose is to allow users and tools in non-RDF formats to stage interoperable datasets using RDF. The notation supports writing RDF and schema in JSON (irJSON), XML (irXML) and comma-delimited (CSV) formats (commON).
The surprising thing about irON is that — by following its simple conventions and vocabulary — you will be authoring and creating interoperable RDF datasets without doing much different than your normal practice.
This first specification for the irON notation includes guidance for creating instance records (including in bulk), linkages to existing ontologies and schema, and schema definitions. In this newly published irON specificatiion, profiles and examples are also provided for each of the irXML, irJSON and commON serializations. The irON release also includes a number of parsers and converters of the specification into RDF . Data ingested in the irON frameworks can also be exported as RDF and staged as linked data.
The objective of irON is to make it easy for data owners to author, read and publish data. This means the starting format should be a human readable, easily writable means for authoring and conveying instance records (that is, instances and their attributes and assigned values) and the datasets that contain them. Among other things, this means that irON‘s notation does not use RDF “triples“, but rather the native notations of the host serializations.
irON is premised on these considerations and observations:
The irON notation and vocabulary is designed to allow the conceptual structure (“schema”) of datasets to be described, to facilitate easy description of the instance records that populate those datasets, and to link different structures for different schema to one another. In these manners, more-or-less complete RDF data structures and instances can be described in alternate formats and be made interoperable. irON provides a simple and naïve information exchange notation expressive enough to describe most any data entity.
The notation also provides a framework for extending existing schema. This means that irON and its three serializations can represent many existing, common data formats and standards, while also providing a vehicle for extending them. Another intent of the specification is to be sparse in terms of requirements. For instance, this reserved vocabulary is fairly minimal and optional in most all cases. The irON specification supports skeletal submissions.
The aim of irON is to describe instance records. An instance record is simply a means to represent and convey the information (”attributes”) describing a given instance. An instance is the thing at hand, and need not represent an individual; it could, for example, represent the entire holdings or collection of books in a given library. Such instance records are also known as the ABox . The simple design of irON is in keeping with the limited roles and work associated with this ABox role.
Attributes provide descriptive characteristics for each instance. Every attribute is matched with a value, which can range from descriptive text strings to lists or numeric values. This design is in keeping with simple attribute-value pairs where, in using the terminology of RDF triples, the subject is the instance itself, the predicate is the attribute, and the object is the value. irON has a vocabulary of about 40 reserved attribute terms, though only two are ever required, with a few others strongly recommended for interoperability and interface rendering purposes.
A dataset is an aggregation of instance records used to keep a reference between the instance records and their source (provenance). It is also the container for transmitting those records and providing any metadata descriptions desired. A dataset can be split into multiple dataset slices. Each slice is written to a file serialized in some way. Each slice of a dataset shares the same <id> of the dataset.
Instances can also be assigned to types, which provide the set or classificatory structure for how to relate certain kinds of things (instances) to other kinds of things. The organizational relationships of these types and attributes is described in a schema. irON also has conventions and notations for describing the linkage of attributes and types in a given dataset to existing schema. These linkages are often mapped to established ontologies.
Each of these irON concepts of records, attributes, types, datasets, schema and linkages share similar notations with keywords signaling to the irON parsers and converters how to interpret incoming files and data. There are also provisions for metadata, name spaces, and local and global references.
In these manners, irON and its three serializations can capture virtually the entire scope and power of RDF as a data model, but with simpler and familiar terminology and constructs expected for each serialization.
For different reasons and for different audiences, the formats of XML, JSON and CSV (spreadsheets) were chosen as the representative formats across which to formulate the abstract irON notation.
XML, or eXtensible Markup Language, has become the leading data exchange format and syntax for modern applications. It is frequently adopted by industry groups for standards and standard exchange formats. There is a rich diversity of tools that support the language, importantly including capable parsers and query languages. There is also a serialization of RDF in XML. As implemented in the irON notation, we call this serialization irXML.
CSV, or comma-separated values, is a format that has been in existence for decades. It was made famous by Microsoft as a spreadsheet exchange format, which makes CSV very useful since spreadsheets are the most prevalent data authoring environment in existence. CSV is less expressive and capable as a data format than the other irON serializations, yet still has a attribute-value pair orientation. And, via spreadsheets, datasets can be easily authored and inspected, while also providing a rich functional environment including sorting, formatting, data validation, calculations, macros, etc. As implemented in the irON notation, we call this serialization commON.
The following diagram shows how these three formats relate to irON and then the canonical RDF target data model:
We have used the unique differences amongst XML, JSON and CSV to guide the embracing abstract notations within irON. Note the round-tripping implications of the framework.
One exciting prospect for the design is how, merely by following the simple conventions within irON, each of these three data formats — and RDF !! — can be used more-or-less interchangeably, and can be used to extend existing schema within their domains.
This first release of irON is in version 0.8. Updates and revisions are likely with use. Here are some key links for irON:
In addition, within the next week we will be publishing a case study of converting the Sweet Tools semantic Web and -related tools dataset to commON.
The irON specification and notation by Structured Dynamics LLC is licensed under a Creative Commons Attribution-Share Alike 3.0. irON‘s parsers or converters are available under the Apache License, Version 2.0.
irON is an important piece in the semantic enterprise puzzle that we are building at Structured Dynamics. It reflects our belief that knowledge workers should be able to author and create interoperable datasets without having to learn the arcana of RDF. At the same time we also believe that RDF is the appropriate data model for interoperability. irOn is an expression of our belief that many data formats have appropriate places and uses; there is no need to insist on a single format.
We would like to thank Dr. Jim Pitman for his advocacy of the importance of human-readable and easily authored datasets and formats. Via his leadership of the Bibliographic Knowledge Network (BKN) project and our contractual relationship with it , we have learned much regarding the BKN’s own format, BibJSON. Experience with this format has been a catalytic influence in our own work on irON.
— Mike Bergman and Fred Giasson, editors
Attribute-values can also be presented as pairs in the form of an associative array, where the first item listed is the attribute, often followed by a separator such as the colon, and then the value. JSON and many simple data struct notations follow this format. This format may also be called attribute-value pairs, key-value pairs, name-value pairs, alists or others. In these cases the “object” is implied, or is introduced as the name of the array.