OK, you’ve been reading the literature and perhaps have attended a conference or two. You have heard a lot about semantic technologies, but have some real questions and concerns:
Such questions — and more — are not infrequent when organizations first contemplate making the transition to become a semantic enterprise.
The diagram below shows a general workflow for migrating existing instance data into the semantic enterprise. The diagram is broken down into three parts. The first part is to characterize and stage existing data and information into the underlying structured data framework. This is what SD (that is, my firm, Structured Dynamics) does as data architects using our particular approach to adaptive ontologies. I’ll touch on this again in a moment.
Jumping to the right-hand side of the diagram is the access and display part. It is here that developers or users can make selections from dropdown lists and so forth to define the “slices” of diced results sets they wish to display. The results of those interactions are structured data results sets that are pre-staged to “drive” various applications and displays [1,2]. These same capabilities can also be embedded into standard Web end user applications, such as content management systems.
The third and middle part of the diagram is the critical part, the pivot point. It is the interface layer between the structured data on the left and the display and presentation of that data on the right. As provided by SD, this abstraction layer is the structWSF Web services framework that “bridges” between the black box of what happens with RDF and semantic Web structured data characterizations on the left in order to feed, or “drive”, useful services and functions on the right.
We call this general design and architecture “ontology-driven applications”. The bulk of this posting explains each of these three parts in a bit more detail, organized from left-to-right by these Parts 1 to 3.
Our approach relies on what we call “adaptive ontologies”. These ontologies set the structural basis for all subsequent data display, analysis, inferencing, entailments, and the like. We call them “adaptive” because we embrace a set of unique best practices. These practices enable the ontologies to do the double-duty of first structuring data and then driving generic applications by properly informing user interfaces, dropdown lists, menus and the like.
This structuring results in faceting key important dimensions and attributes of available content. Structured data gets organized. Unstructured data (text) gets tagged via this structure and integrated with it.
As Structured Dynamics’ general product schema makes clear (see the diagram at ), our approach leverages existing assets as much as possible. Often, this means leaving most existing data structures in place. These existing assets are staged and converted in two complementary manners that largely correspond to the conceptual ABox (instance) and TBox (concepts and schema) split central to description logics and pivotal to SD’s methodology .
Whether transitioning small chunks or big chunks, this staging of existing data in Part 1 results in an RDF-accessible characterization of the starting content. Instances and their attributes are represented via a common notation, generally based on irON (instance record and Object Notation) , that is an extensible notation and vocabulary for capturing the data characterizations, attributes and metadata of the candidate instance data (“records” in RDBMS parlance). These instances may either be internal or proprietary records, or instance data on the Web or in the public domain. By properly matching same or similar instances to one another, any source of instance characterization can be meaningfully combined.
This instance notation is extremely lightweight, and really is merely an RDF representation of data characterizations. In the characterizations to this point there is not yet any “world view” involved: we are simply describing instances and their attributes in a manner akin to key-value pairs. The process to this point is entirely descriptive.
However, these instance characteristics do contain within them the semantics as to how to describe these attributes (your “glad” is my “happy”), as well as potentially a schematic or conceptual view of how these instances relate to one another and to the broader world. Instance characterizations provide the building blocks, that are then related and made semantically whole via a second “terminological” level.
These terminological, or conceptual, relationships (the TBox ), reside at a different level from simply decribing things. Rather, these schema — what in this context are best known as ontologies — provide a precise language and means for describing conceptual relationships. If these structural relationships are done well, they are coherent: the hip bone is connected to the thigh bone and not to the ear. Coherence is a matter of a consistent world view that “hangs together” when analyzed via powerful logical techniques available via description logics and other broader mechanisms of the semantic enterprise.
Thus, as we transition from the existing, the operational workflow splits the input data stream into two pathways:
A sequential flow of these steps and splits is provided by this diagram below that shows: 1) the conceptual structure of the concept ontology; as 2) matched with the instances and their descriptive attributes that populate that schema.
A key point is that — while a proper starting ontology is essential to our process and proofs-of-concept — it can be grown and scaled incrementally. We leverage as much existing starting structure as possible and can readily bound the scope to meet budget and delivery imperatives.
The concepts and entities that occur within these structures help inform our fairly simple tagging system, scones . (There are also benefits from “triangulating” between entity or instance identification and concept identification that helps inform disambiguation nearly for free; see further ). It is also possible to integrate these initial proof-of-concept approaches with third-party tools (e.g., Calais, Expert System (Cogito), etc.) to improve unstructured content characterization.
These approaches are pretty straightforward for any organization wanting to test the idea of becoming a semantic enterprise. Real benefits — such as concept retrievals overcoming the limitations of standard keyword search — can be demonstrated from even small starting ontologies and structures. Given the inherent connectedness of the data, it is possible to expand the scope and usefulness of the information incrementally within fixed and manageable budgets.
A pivotal part of SD’s infrastructure software is structWSF , our platform-independent Web services middleware. structWSF is an abstraction layer that provides the APIs, search endpoints, and specific Web services for accessing, querying or getting results sets from the underlying structured data and ontologies.
structWSF has a standard set of access and retrieval services including browse, full-text search, CRUD, direct record retrievals, and the like. It is embedded within an access and permissions service that acts at the level of registered datasets. Then, based on the requested protocol, structWSF returns the filtered results set. These results sets can be delivered as XML, JSON, or any of the other formats already available . They can readily and dynamically populate HTML pages and forms in any deployment framework. For specific purposes, these results sets can also be returned as pre-staged, properly formatted results streams for driving specific applications.
As an API, the structWSF Web services can be interacted with and driven via standard HTTP requests. Alternatively, these requests can come from simple to complicated Web apps that create the API queries based on user interface choices such as selections from dropdown lists or clicking on various listed options. An interactive demo of this approach is shown by SD’s conStruct application , though even simpler Web pages or forms may drive the query interface.
Queries and requests to structWSF may also include a parameter for results sets to be returned in particular formats. SD’s irON protocol  supports requests or results in CSV, XML or JSON, in addition to other flavors including multiple serializations of RDF.
In this manner, only a simple converter need be added to the structWSF Web services stack in order to “drive” a new application with a particularly formatted results set stream.
structWSF thus acts as a single, uniform Web interface to all of the “black box” nuances of the structured data system organized by the adaptive ontologies. Further, virtually any data structure may be ingested and converted from external sources via an import service and made part of the underlying canonical structure, making the framework perfect for data federation . Lastly, the dataset nature of the framework, and its neutrality to underlying data stores or content management systems, also makes structWSF an excellent framework for one or many nodes to share information and collaborate across the Web .
The following diagram shows how a diverse, Web-based network, involving a diversity of Web portals and data gateways and hubs, can work via the structWSF framework to establish a complete collaboration network. Via datasets and differential access rights and permissions, virtually any combination of potential interactions can be supported:
These potentials are really fundamentally new, and we ourselves are still trying to find the language and analogies to best explain them. structWSF was initially designed as a platform-independent layer between the structured data representation of existing assets and the ontology-driven applications that interact with them. We are now finding that deployment in a broader Web-based context provides additional exciting prospects for integrating various regional offices or to enable direct collaboration with customers, partners or suppliers.
The basic design of structWSF is to provide a middleware layer that fulfills one or more of these broad user interaction modes:
SD has developed generic applications in these areas (with many more possible), the operations of which are guided by the instructions and nature of the underlying data that feeds them. We have proven it is possible to adopt data characterization practices within those ontologies so as to stage or “drive” such generic applications.
In the case of a standard structured data display (say, a simple table like a Wikipedia infobox, for example), such generic design includes templates tailored to various instance types (say, locational information presenting on a map versus people information warranting a image and vital statistics). Alternatively, in the generic design for some specialized application (say, Adobe Flash), the information output of the results set may need to contain certain formats and attributes.
SD’s “ontology-driven apps”, then, are really informed structured results sets that are outputted in a form suitable to various intended applications. This output form can include a variety of serializations, formats or metadata. This flexibility of output that is tailored to and responsive to particular generic applications is what makes our ontologies “adaptive”.
Expressed in this manner, “ontology-driven apps” seem neither remarkably profound nor clever. They are simply attentive to their intended uses.
Using this structure, then, it is possible to either “drive” queries and results sets selections via direct HTTP request or via simple dropdown selections on HTML forms (that is, from right to left as shown on the first diagram). Similarly, it is possible with a single parameter change to drive either a visualization app or a structured table template from the equivalent query request (that is, from left to right on the first diagram).
“Ontology-driven apps” through SD’s architecture design thus provide two profound benefits. First, the entire system can be driven via simple selections or interactions without the need for any programming or technical expertise. And, second, simple additions of new and minor output converters can work to power entirely new applications available to the system. If, say, Adobe graphics applications need to change tomorrow for Microsoft Silverlight, that switch is easy and can be made transparent to the designer.
The ability to develop these systems incrementally and the ability to integrate with external, public data is fundamentally dependent on the open world assumption. The open world assumption is a different logic premise than what many enterprises are used to; relational database systems, for example, embrace the alternate closed world premise.
Open world does not necessarily mean open data and it does not mean open source. Open world is merely a way to think about the information we have and how we act on it. An open world assumption accepts that we never have all necessary information and lacking that information does not itself lead to any conclusions.
Some enterprise circumstances – say a complete enumeration of customers or products or even controlled engineering or design environments — may warrant a closed world approach. In those circumstances, the domain of inquiry is well bounded and we can get relatively complete information about it. Engineering an oil drilling platform or launching the Space Shuttle in fact demands that.
But, in most real world circumstances, there is much we don’t know and we interact in complex and external environments. Open world is the proper logic premise for these circumstances. These circumstances also happen to be the very basis in which most most knowledge workers and analysts reside.
Open world frameworks provide some incredibly important benefits if the circumstances of their use apply:
One might argue, as we believe, that the biggest impediment to the semantic enterprise is the mind shift necessary to start thinking about and accepting the open world premise. Again, this perspective is not applicable to all problems and domains. But, where it is, much can be left in place and leveraged with semantic technologies, so long as the enterprise begins to look at these existing assets through a different open-world lens.
So, let’s return to the rhetorical questions that began this posting.
It should now be clear that it is possible to start small in testing the transition to a semantic enterprise. These efforts can be done incrementally and with focus on early, high-value applications and domains.
Further, we need not abandon past practices. There is much that can be done to leverage existing assets. Indeed, those prior investments are often the requisite starting basis to inform semantic initiatives. However, in leveraging those assets, it is important that the enterprise begin to embrace and understand the open world assumption.
We also see that RDF and OWL, while important behind the scenes as a canonical data model and languages for organizing this information, need not be exposed as such to most users. Most instance data can be expressed as is with the data languages of choice such as XML, JSON or whatever.
We also see these technologies are neutral to the question of open or public sources. The techniques can equivalently be applied to internal, closed, proprietary data and structures. Moreover, the technologies can themselves be used as a basis for bringing external information into the enterprise.
Without a doubt, some of the early years in describing semantic technologies were burdened with some unfortunate bad information and lack of sophistication. Today’s semantic Web is nimble, agile, and ready to be deployed immediately at low cost and risk. So, jump on in! We think you’ll find the water to be just fine.
When I say, “shot”, what do you think of? A flu shot? A shot of whisky? A moon shot? A gun shot? What if I add the term “bank”? Do you now think of someone being shot in an armed robbery of a local bank or similar?
And, now, what if I add a reference to say, The Hustler, or Minnesota Fats, or “Fast Eddie” Felson? Do you now see the connection to a pressure-packed banked pool shot in some smoky bar room?
As humans we need context to make connections and remove ambiguity. For machines, with their limited reasoning and inference engines, context and accurate connections are even more important.
Over the past few weeks we have seen announcements of two large and high-visibility linked data projects: One, a first release of references for articles concerning about 5,000 people from the New York Times at data.nytimes.com; and Two, a massive exposure of 5 billion triples from data.gov datasets provided by the Tetherless World Constellation (TWC) at Rennselaer Polytechnic Institute (RPI).
On various grounds from licensing to data characterization and to creating linked data for its own sake, some prominent commentators have weighed in on what is good and what is not so good with these datasets. One of us, Mike, commented about a week ago that “we have now moved beyond ‘proof of concept’ to the need for actual useful data of trustworthy provenance and proper mapping and characterization. Recent efforts are a disappointment that no enterprise would or could rely upon.”
It is useful, then, to return to first principles, namely the original four “rules” posed by Tim Berners-Lee in his design note on linked data :
The first two rules are definitional to the idea of linked data. They cement the basis of linked data in the Web, and are not at issue with either of the two linked data projects that are the subject of this posting.
However, it is the lack of specifics and guidance in the last two rules where the breakdowns occur. Both the NYT and the RPI datasets suffer from a lack of “providing useful information” (Rule #3). And, the nature of the links in Rule #4 is a real problem for the NYT dataset.
The Wikipedia entry on linked data expands on “useful information” by augmenting the original rule with the parenthetical clause, ” (i.e., a structured description — metadata).” But even that expansion is insufficient.
Fundamentally, what are we talking about with linked data? Well, we are talking about instances that are characterized by one or more attributes. Those instances exist within contexts of various natures. And, those contexts may relate to other existing contexts.
We can break this problem description down into three parts:
At minimum, ANY dataset exposed as linked data needs to be described by a vocabulary. Both the NYT and RPI datasets fail on this score, as we elaborate below. Better practice is to also provide a schema of relationships in which to embed each instance record. And, best practice is to also map those structures to external schema.
Lacking this “useful information”, especially a defining vocabulary, we cannot begin to understand whether our instances deal with drinks, bank robberies or pool shots. This lack, in essence, makes the information worthless, even though available via URL.
With the support of NSF and various grant funding, RPI has set up the Data-Gov Wiki , which is in the process of converting the datasets on data.gov to RDF, placing them into a semantic wiki to enable comment and annotation, and providing that data as RSS feeds. Other demos are also being placed on the site.
As of the date of this posting, the site had a catalog of 116 datasets from the 800 or so available on data.gov, leading to these statistics:
We’ll take one of these datasets, #319, and look a bit closer at it:
|Wiki||Title||Agency||Name||data.gov Link||No Properties||No Triples||RDF File|
|Dataset 319||Consumer Expenditure Survey||Department of Labor||LABOR-STAT||http://www.data.gov/details/319||22||1,583,236||http://data-gov.tw.rpi.edu/raw/319/index.rdf|
This report was picked solely because it had a small number of attributes (properties), and is thus easier to screen capture. The summary report on the wiki is shown by this page:
So, we see that this specific dataset contains about 22 of the nearly 8,000 attributes across all datasets.
When we click on one of these attribute names, we are then taken to a specific wiki page that only reiterates its label. There is no definition or explanation.
When we inspect this page further we see that, other than the broad characterization of the dataset itself (the bulk of the page), we see at the bottom 22 undefined attributes with labels such as item code, periodicity code, seasonal, and the like. These attributes are the real structural basis for the data in this dataset.
But, what does all of this mean???
To gain a clue, now let’s go to the source data.gov site for this dataset (#319). Here is how that report looks:
Contained within this report we see a listing for additional metadata. This link tells us about the various data fields contained in this dataset; we see many of these attributes are “codes” to various data categories.
Probing further into the dataset’s technical documentation, we see that there is indeed a rich structure underneath this report, again provided via various code lookups. There are codes for geography, seasonality (adjusted or not), consumer demographic profiles and a variety of consumption categories. (See, for example, the link to this glossary page.) These are the keys to understanding the actual values within this dataset.
For example, one major dimension of the data is captured by the attribute item_code. The survey breaks down consumption expenditures within the broad categories of Food, Housing, Apparel and Services, Transportation, Health Care, Entertainment, and Other. Within a category, there is also a rich structural breakdown. For example, expenditures for Bakery Products within Food is given a code of FHC2.
But, nowhere are these codes defined or unlocked in the RDF datasets. This absence is true for virtually all of the datasets exposed on this wiki.
So, for literally billions of triples, and 8,000 attributes, we have ABSOLUTELY NO INFORMATION ABOUT WHAT THE DATA CONTAINS OTHER THAN A PROPERTY LABEL. There is much, much rich value here in data.gov, but all of it remains locked up and hidden.
The sad truth about this data release is that it provides absolutely no value in its current form. We lack the keys to unlock the value.
To be sure, early essential spade work has been done here to begin putting in place the conversion infrastructure for moving text files, spreadsheets and the like to an RDF form. This is yeoman work important to ultimate access. But, until a vocabulary is published that defines the attributes and their codes so we can unlock this value, it will remain hidden. And only when its further value (by connecting attributes and relations across datasets) through a schema of some nature is also published, the real value from connecting the dots will also remain hidden.
These datasets may meet the partial conditions of providing clickable URLs, but the crucial “useful information” as to what any of this data means is absent.
Every single dataset on data.gov has supporting references to text files, PDFs, Web pages or the like that describe the nature of the data within each dataset. Until that information is exposed and made usable, we have no linked data.
Until ontologies get created from these technical documents, the value of these data instances remain locked up, and no value can be created from having these datasets expressed in RDF.
The devil lies in the details. The essential hard work has not yet begun.
Though at a much smaller scale with many fewer attributes, the NYT dataset suffers from the same failing: it too lacks a vocabulary.
So, let’s take the case of one of the lead actors in The Hustler, Paul Newman, who played the role of “Fast Eddie” Felson. Here is the NYT record for the “person” Paul Newman (which they also refer to as http://data.nytimes.com/newman_paul_per). Note the header title of Newman, Paul:
Click on any of the internal labels used by the NYT for its own attributes (such as nyt:first_use), and you will be given this message:
“An RDFS description and English language documentation for the NYT namespace will be provided soon. Thanks for your patience.”
We again have no idea what is meant by all of this data except for the labels used for its attributes. In this case for nyt:first_use we have a value of “2001-03-18″.
Hello? What? What is a “first use” for a “Paul Newman” of “2001-03-18″???
The NYT put the cart before the horse: even if minimal, they should have released their ontology first — or at least at the same time — as they released their data instances. (See further this discussion about how an ontology creation workflow can be incremental by starting simple and then upgrading as needed.)
Since there really are no links to other things on the Data-Gov Wiki, our focus in this section continues with the NYT dataset using our same example.
We now are in the territory of the fourth “rule” of linked data: 4. Include links to other URIs so that they can discover more things.
This will seem a bit basic at first, but before we can talk about linking to other things, we first need to understand and define the starting “thing” to which we are linking.
Of course, without its own vocabulary, we are left to deduce what this thing “Newman, Paul“ is that is shown in the previous screen shot. Our first clue comes from the statement that it is of rdf:type SKOS concept. By looking to the SKOS vocabulary, we see that concept is a class and is defined as:
A SKOS concept can be viewed as an idea or notion; a unit of thought. However, what constitutes a unit of thought is subjective, and this definition is meant to be suggestive, rather than restrictive. The notion of a SKOS concept is useful when describing the conceptual or intellectual structure of a knowledge organization system, and when referring to specific ideas or meanings established within a KOS.
We also see that this instance is given a foaf:primaryTopic of Paul Newman.
So, we can deduce so far that this instance is about the concept or idea of Paul Newman. Now, looking to the attributes of this instance — that is the defining properties provided by the NYT — we see the properties of nyt:associated_article_count, nyt:first_use, nyt:last_use and nyt:topicPage. Completing our deductions, and in the absence of its own vocabulary, we can now define this concept instance somewhat as follows:
New York Times articles in the period 2001 to 2009 having as their primary topic the actor Paul Newman
(BTW, across all records in this dataset, we could see what the earliest first use was to better deduce the time period over which these articles have been assembled, but that has not been done.)
We also would re-title this instance more akin to “2001-2009 NYT Articles with a Primary Topic of Paul Newman” or some such and use URIs more akin to this usage.
Thus, in order to make links or connections with other data, it is essential to understand what the nature is of the subject “thing” at hand. There is much confusion about actual “things” and the references to “things” and what is the nature of a “thing” within the literature and on mailing lists.
Our belief and usage in matters of the semantic Web is that all “things” we deal with are a reference to whatever the “true”, actual thing is. The question then becomes: What is the nature (or scope) of this referent?
There are actually quite easy ways to determine this nature. First, look to one or more instance examples of the “thing” being referred to. In our case above, we have the “Newman, Paul” instance record. Then, look to the properties (or attributes) the publisher of that record has used to describe that thing. Again, in the case above, we have nyt:associated_article_count, nyt:first_use, nyt:last_use and nyt:topicPage.
Clearly, this instance record — that is, its nature — deals with articles or groups of articles. The relation to Paul Newman occurs as a basis of the primary topic of these articles, and not a person basis for which to describe the instance. If the nature of the instance was indeed the person Paul Newman, then the attributes of the record would more properly be related to “person” properties such as age, sex, birthdate, death date, marital status, etc.
This confusion by NYT as to the nature of the “things” they are describing then leads to some very serious errors. By confusing the topic (Paul Newman) of a record with the nature of that record (articles about topics), NYT next misuses one of the most powerful semantic Web predicates available, owl:sameAs.
By asserting in the “Newman, Paul” record that the instance has a sameAs relationship with external records in Freebase and DBpedia, the NYT both entails that properties from any of the associated records are shared and infers a chain of other types to describe the record. More precisely, the NYT is asserting that the “thing” referred to by these instances are identical resources.
Thus, by the sameAs statements in the “Newman, Paul” record, the NYT is also asserting that that record is an instance of all these things :
Furthermore, because of its strong, reciprocal entailments, the owl:sameAs assertion would also now entail that the person Paul Newman has the nyt:first_use and nyt:last_use attributes, clearly illogical for a “person” thing.
This connection is clearly wrong in both directions. Articles are not persons and don’t have marital status; and persons do not have first_uses. By misapplying this sameAs linkage relationship, we have screwed things up in every which way. And the error began with misunderstanding what kinds of “things” our data is about.
However, there are solutions. First, the sameAs assertions, at least involving these external resources, should be dropped.
Second, if linkages are still desired, a vocabulary such as UMBEL  could be used to make an assertion between such a concept, and these other related resources. So, even though these resources are not the same, they are closely related. The UMBEL ontology helps us to define this kind of relation between related, but non-identical, resources.
Instead of using the owl:sameAs property, we would suggest the usage of the umbel:linksEntity, which links a skos:Concept to related named entities resources. Additionally, Freebase, which also currently asserts a sameAs relationship to the NYT resource, could use the umbel:isAbout relationship to assert that their resource “is about” a certain concept, which is the one defined by the NYT.
Alternatively, still other external vocabularies that more precisely capture the intent of the NYT publishers could be found, or the NYT editors could define their own properties specifically addressing their unique linkage interests.
As a couple of additional, minor suggestions for the NYT dataset, we would suggest:
There are very valuable benefits from entailment, inference and logic to be gained from linking resources. However, if the nature of the “things” being linked — or the properties that define these linkages — are incorrect, then very wrong logical implications result. Great care and understanding should be applied to linkage assertions.
Our critical comments are not meant to be disrespectful and are not being picky. The NYT and TWC are prominent institutions for which we should expect leadership on these issues. Our criticisms (and we believe those of others) are also not an expression of a “trough of disillusionment” as some have been pointing out.
This posting is about poor practices, pure and simple. The time to correct them is now. If asked, we would be pleased to help either institution establish exemplar practices. This is not automatic, and it is not always easy. The data.gov datasets, in particular, will require much time and effort to get right. There is much documentation that needs to be transitioned and expressed in semantic Web formats.
In a broader sense, we also seem to lack a definition of best practices related to vocabularies, schema and mappings. The Berners-Lee rules are imprecise and insufficient as is. Prior best guidance documents tend to be more how to publish and make URIs linkable, than to properly characterize, describe and connect the data.
Perhaps, in part, this is a bit of a semantics issue. The challenge is not the mechanics of linking data, but the meaning and basis for connecting that data. Connections require logic and rationality sufficient to reliably inform inference and rule-based engines. It also needs to pass the sniff test as we “follow our nose” by clicking the links exposed by the data.
It is exciting to see high-quality content such as from national governments and major publishers like the New York Times begin to be exposed as linked data. When this content finally gets embedded into usable contexts, we should see manifest uses and benefits emerge. We hope both institutions take our criticisms in that spirit.
In a former life, I had the nickname of ‘Spreadsheet King’ (perhaps among others that I did not care to hear). I had gotten the nick because of my aggressive use of spreadsheets for financial models, competitors tracking, time series analyses, and the like. However, in all honesty, I have encountered many others in my career much more knowledgeable and capable with spreadsheets than I’ll ever be. So, maybe I was really more like a minor duke or a court jester than true nobility.
Yet, pro or amateur, there are perhaps 1 billion spreadsheet users worldwide , making spreadsheets undoubtedly the most prevalent data authoring environment in existence. And, despite moans and wails about how spreadsheets can lead to chaos, spaghetti code, or violations of internal standards, they are here to stay.
Spreadsheets often begin as simple notetaking environments. With the addition of new findings and more analysis, some of these worksheets may evolve to become full-blown datasets. Alternatively, some spreadsheets start from Day One as intended datasets or modeling environments. Whatever the case, clearly there is much accumulated information and data value “locked up” in existing spreadsheets.
How to “unlock” this value for sharing and collaboration was a major stimulus to development of the commON serialization of irON (instance record and Object Notation) . I recently published a case study  that describes the reasons and benefits of dataset authoring in a spreadsheet, and provides working examples and code based on Sweet Tools  to aid users in understanding and using the commON notation. I summarize portions of that study herein.
The dataset that is the focus of this use case, Sweet Tools, began as an informal tracking spreadsheet about four years ago. I began it as a way to learn about available tools in the semantic Web and -related spaces. I began publishing it and others found it of value so I continued to develop it.
As it grew over time, however, it gained in structure and size. Eventually, it became a reference dataset, with which many other people desired to use and interact. The current version has well over 800 tools listed, characterized by many structured data attributes such as type, programming language, description and so forth. As it has grown, a formal controlled vocabulary has also evolved to bring consistency to the characterization of many of these attributes.
It was natural for me to maintain this listing as a spreadsheet, which was also reinforced when I was one of the first to adopt an Exhibit presentation of the data based on a Google spreadsheet about three years back. Here is a partial view of this spreadsheet as I maintain it locally:
When we began to develop irON in earnest as a simple (“naïve”) dataset authoring framework, it was clear that a comma-separated value, or CSV , option should join the other two serializations under consideration, XML and JSON. CSV, though less expressive and capable as a data format than the other serializations, still has an attribute-value pair (also known as key-value pairs and many other variants ) orientation. And, via spreadsheets, datasets can be easily authored and inspected, while also providing a rich functional environment including sorting, formatting, data validation, calculations, macros, etc.
As a dataset very familiar to us as irON‘s editors, and directly relevant to the semantic Web, Sweet Tools provided a perfect prototype case study for helping to guide the development of irON, and specifically what came to be known as the commON serialization for irON. The Sweet Tools dataset is relatively large for a speciality source, has many different types and attributes, and is characterized by text, images, URLs and similar.
The premise was that if Sweet Tools could be specified and represented in commON sufficiently to be parsed and converted to interoperable RDF, then many similar instance-oriented datasets could likely be so as well. Thus, as we tried and refined notation and vocabulary, we tested applicability against the CSV representation of Sweet Tools in addition to other CSV, JSON and XML datasets.
A large portion of the case study describes the many advantages of authoring small datasets within spreadsheets. The useful thing about the CSV format is that these full functional capabilities of the spreadsheet are available during authoring or later updates and modifications, but, when exported, the CSV provides a relatively clean format for processing and parsing.
So, some of the reasons for small dataset authoring in a spreadsheet include:
The next major section of the case study deals with the minor conventions that must be followed in order to stage spreadsheets for commON. Not much of the specific commON vocabulary or notation is discussed below; for details, see .
Because you can create multiple worksheets within a spreadsheet, it is not necessary to modifiy existing worksheets or tabs. Rather, if you are reluctant or can not change existing information, merely create parallel duplicate sheets of the source information. These duplicate sheets have as their sole purpose export to commON CSV. You can maintain your spreadsheet as is while staging for commON.
To do so, use the simple = formula to create cross-references between the existing source spreadsheet tab and the target commON CSV export tab. (You can also do this for complete, highlighted blocks from source to target sheet.) Then, by adding the few minor conventions of commON, you have now created a staged export tab without modifying your source information in the slightest.
In standard form and for Excel and Open Office, single quotes, double quotes and commas when entered into a spreadsheet cell are automatically ‘escaped‘ when issued as CSV. commON allows you to specify your own delimiter for lists (the standard is the pipe ‘|’ character) and what the parser recognizes as the ‘escape’ character (‘\’ is the standard). However, you probably should not change for most conditions.
The standard commON parsers and converters are UTF-8 compatible. If your source content has unusual encodings, try to target UTF-8 as your canonical spreadsheet output.
In the irON specification there are a small number of defined modules or processing sections. In commON, these modules are denoted by the double-ampersand character sequence (‘&&‘), and apply to lists of instance records (&&recordList), dataset specifications and associated metadata describing the dataset (&&dataset), and mappings of attributes and types to existing schema (&&linkage). Similarly, attributes and types are denoted by a single ampersand prefix (&attributeName).
In commON, any or all of the modules can occur within a single CSV file or in multiple files. In any case, the start of one of these processing modules is signaled by the module keyword and &&keyword convention.
The first spreadsheet figure above shows a Sweet Tools example for the &&recordList module. The module begins with that keyword, indicating one of more instance records will follow. Note that the first line after the &&recordList keyword is devoted to the listing of attributes and types for the instance records (designated by the &attributeName convention in the columns for the first row after the &&recordList keyword is encountered).
The &&recordList format can also include the stacked style (see similar Dataset example below) in addition to the single row style shown above.
At any rate, once a worksheet is ready with its instance records following the straightforward irON and commON conventions, it can then be saved as a CSV file and appropriately named. Here is an example of what this “vanilla” CSV file now looks like when shown again in a spreadsheet:
Alternatively, you could open this same file in a text editor. Here is how this exact same instance record view looks in an editor:
Note that the CSV format separates each column by the comma separator, with escapes shown for the &description attribute when it includes a comma-separated clause. Without word wrap, each record in this format occupies a single row (though, again, for the stacked style, multiple entries are allowed on individual rows so long as a new instance record &id is not encountered in the first column).
The &&dataset module defines the dataset parameters and provides very flexible metadata attributes to describe the dataset . Note the dataset specification is exactly equivalent in form to the instance record (&&recordList) format, and also allows the single row or stacked styles (see these instance record examples), with this one being the stacked style:
The &&linkage module is used to map the structure of the instance records to some structural schema, which can also include external ontologies. The module has a simple, but specific structure.
Either attributes (presented as the &attributeList) or types (presented as the &typeList) are listed sequentially by row until the listing is exhausted . By convention, the second column in the listing is the targeted &mapTo value. Absent a prior &prefixList value, the &mapTo value needs to be a full URL to the corresponding attribute or type in some external schema:
Notice in the case of Sweet Tools that most values are from the actual COSMO mini-ontology underlying the listing. These need to be listed as well, since absent the specifications in commON the system has NO knowledge of linkages and mappings.
In its current state of development, commON does not support a spreadsheet-based means for specifying the schema structure (lightweight ontology) governing the datasets . Another irON serialization, irJSON, does. Either via this irJSON specification or via an offline ontology, a link reference is presently used by commON (and, therefore, Sweet Tools for this case study) to establish the governing structure of the input instance record datasets.
A spreadsheet-based schema structure for commON has been designed and tested in prototype form. commON should be enhanced with this capability in the near future .
If the modules are spread across more than one worksheet, then each worksheet must be saved as its own CSV file. In the case of Sweet Tools, as exhibited by its reference current spreadsheet, sweet_tools_20091110.xls, three individual CSV files get saved. These files can be named whatever you would like. However, it is essential that the names be remembered for later referencing.
My own naming convention is to use a format of appname_date_modulename.csv because it sorts well in a file manager accommodating multiple versions (dates) and keeps related files clustered. The appname in the case of Sweet Tools is generally swt. The modulename is generally the dataset, records, or linkage convention. I tend to use the date specification in the YYYYMMDD format. Thus, in the case of the records listings for Sweet Tools, its filename could be something like: swt_20091110_records.csv.
Once saved, these files are now ready to be imported into a structWSF  instance, which is where the CSV parsing and conversion to interoperable RDF occurs . In this case study, we used the Drupal-based conStruct SCS system . conStruct exposes the structWSF Web services via a user interface and a user permission and access system. The actual case study write-up offers more details about the import process.
We are now ready to interact with the Sweet Tools structured dataset using conStruct (assuming you have a Drupal installation with the conStruct modules) .
The screen capture below shows a couple of aspects of the system:
One of the absolutely cool things about this framework is that all tools, inferencing, user interfaces and data structure are a direct result of the ontology(ies) underlying the system (plus the irON instance ontology, as well). This means that switching datasets or adding datasets causes the entire system structure to now reflect those changes — without lifting a finger!!
Here are a few sample things you can do with these generic tools driven by the Sweet Tools dataset:
Note, if you access this conStruct instance you will do so as a demo user. Unfortunately, as such, you may not be able to see all of the write and update tools, which in this case are reserved for curators or admins. Recall that structWSF has a comprehensive user access and permissions layer.
Of course, one of the real advantages of the irON and structWSF designs is to enable different formats to be interchanged and to interoperate. Upon submission, the commON format and its datasets can then be exported in these alternate formats and serializations :
As should be obvious, one of the real benefits of the irON notation — in addition to easy dataset authoring — is the ability to more-or-less treat RDF, CSV, XML and JSON as interoperable data formats.
The formal Sweet Tools case study based on commON, with sample download files and PDF, is available from Annex: A commON Case Study using Sweet Tools, Supplementary Documentation .
Attribute-values can also be presented as pairs in a form of an associative array, where the first item listed is the attribute, often followed by a separator such as the colon, and then the value. JSON and many simple data struct notations follow this format. This format may also be called attribute-value pairs, key-value pairs, name-value pairs, alists or others. In these cases the “object” is implied, or is introduced as the name of the array..
It has been eight months since the last major update to Sweet Tools, AI3‘s listing of semantic Web and -related tools. With today’s release, there are now a total of 810 tools listed, crashing through the sound barrier of 761 tools. With the retirement of 19 prior tools, this new listing represents an increase of 93 tools, or 13%, from the previous version that listed 736.
But simply adding to the tools listing is not the cause of this longer than normal period between updates.
This little Sweet Tools dataset is now showing the way to a couple of exciting innovations: new generic ontology-driven applications for structured data; and, tools for authoring structured data via spreadsheets.
We deal with the former in this post. I will deal with the spreadsheet business in a subsequent post.
So, here is the summary of major changes in this new listing:
To see the major Sweet Tools page for this updated listing in its existing format, filter on ‘New’ under New or Existing? to see the recent additions. Alternatively, you can also see this same filtering using the conStruct structured data view by searching on the Status attribute using the value ‘New’; see example here.
Though still formative, the most exciting change with the Sweet Tools listing is this new presentation via conStruct. It is a structured data Web services framework with a UI, all offered as a set of modules to Drupal. To kick the tires with this new system, you may want to look at:
BTW, there are some helpful documentation pages that show how all of these various tools work and more, such as, for example, Browse. (Also, BTW, as a demo user, you also are not seeing all of the write and update tools, either; again, see the documentation.)
The essential underlying basis to conStruct is the structWSF Web services framework. There are still some aspects to this system that we feel are incomplete and we are working on. Some of these things include dropdown selections (controlled vocabulary selects); easier template creation; and intuitive template re-use. Nonetheless, these additions will come quickly, and what is here is already a great demonstration of how structured data can drive generic tools and interfaces.
As I said: More on this in a later post.
The updated Sweet Tools listing now includes nearly 50 different tools categories. The most prevalent categories are browser tools (RDF, OWL), information extraction, parsers or converters, composite application frameworks and general ontology tools. Each accounts for more than 8% — or more than 50 tools — of the total. This breakdown is as follows (click to expand):
As for the languages these applications are written in, that has stayed pretty steady, too. Java is still the leading language at about 46%, which has been very slightly trending downward over the past three years or so. PHP has increased a bit as well. The current splits are (click to expand):
Background on prior listings and earlier statistics may be found on these previous posts:
With interim updates periodically over that period.
Note: Because of comments expirations on prior posts, this entry is now the new location for adding a suggested new tool. Simply provide your information in the comments section, and your tool will be included in the next update.
Stefano Mazzocchi has been a clear thinker for years and an innovative contributor to the community since his early leadership of the Apache Cocoon project. One of his best qualities is he speaks his mind. Now at Freebase, but previously with MIT’s Simile program, he is one of my dedicated reads via his Stefano’s Linotype blog.
His aforementioned post, Data Smoke and Mirrors, stands on its own, and I highly recommend it. He particularly focuses on the conversion of data.gov datasets to “linked data” (my quotes are purposeful). Combined with the recent poor conversion of New York Times datasets to linked data, I think he is the canary sending out a warning about a disturbing trend.
Posting linked data for its own sake — whatever the reasons — risks undercutting the premise.
We have now moved beyond “proof of concept” to the need for actual useful data of trustworthy provenance and proper mapping and characterization. Recent efforts are a disappointment that no enterprise would or could rely upon.
Listen up, folks.