Posted:November 11, 2009

irON - instance record and Object Notation

A Case Study of Turning Spreadsheets into Structured Data Powerhouses

In a former life, I had the nickname of ‘Spreadsheet King’ (perhaps among others that I did not care to hear). I had gotten the nick because of my aggressive use of spreadsheets for financial models, competitors tracking, time series analyses, and the like. However, in all honesty, I have encountered many others in my career much more knowledgeable and capable with spreadsheets than I’ll ever be. So, maybe I was really more like a minor duke or a court jester than true nobility.

Yet, pro or amateur, there are perhaps 1 billion spreadsheet users worldwide [1], making spreadsheets undoubtedly the most prevalent data authoring environment in existence. And, despite moans and wails about how spreadsheets can lead to chaos, spaghetti code, or violations of internal standards, they are here to stay.

Spreadsheets often begin as simple notetaking environments. With the addition of new findings and more analysis, some of these worksheets may evolve to become full-blown datasets. Alternatively, some spreadsheets start from Day One as intended datasets or modeling environments. Whatever the case, clearly there is much accumulated information and data value “locked up” in existing spreadsheets.

How to “unlock” this value for sharing and collaboration was a major stimulus to development of the commON serialization of irON (instance record and Object Notation) [2]. I recently published a case study [3] that describes the reasons and benefits of dataset authoring in a spreadsheet, and provides working examples and code based on Sweet Tools [4] to aid users in understanding and using the commON notation. I summarize portions of that study herein.

This is the second article of a two-part series related to the recent Sweet Tools update.

Background on Sweet Tools and irON

The dataset that is the focus of this use case, Sweet Tools, began as an informal tracking spreadsheet about four years ago. I began it as a way to learn about available tools in the semantic Web and -related spaces. I began publishing it and others found it of value so I continued to develop it.

As it grew over time, however, it gained in structure and size. Eventually, it became a reference dataset, with which many other people desired to use and interact. The current version has well over 800 tools listed, characterized by many structured data attributes such as type, programming language, description and so forth. As it has grown, a formal controlled vocabulary has also evolved to bring consistency to the characterization of many of these attributes.

It was natural for me to maintain this listing as a spreadsheet, which was also reinforced when I was one of the first to adopt an Exhibit presentation of the data based on a Google spreadsheet about three years back. Here is a partial view of this spreadsheet as I maintain it locally:

Sweet Tools Main Spreadsheet Screen
(click to expand)

When we began to develop irON in earnest as a simple (“naïve”) dataset authoring framework, it was clear that a comma-separated value, or CSV [5], option should join the other two serializations under consideration, XML and JSON. CSV, though less expressive and capable as a data format than the other serializations, still has an attribute-value pair (also known as key-value pairs and many other variants [6]) orientation. And, via spreadsheets, datasets can be easily authored and inspected, while also providing a rich functional environment including sorting, formatting, data validation, calculations, macros, etc.

As a dataset very familiar to us as irON‘s editors, and directly relevant to the semantic Web, Sweet Tools provided a perfect prototype case study for helping to guide the development of irON, and specifically what came to be known as the commON serialization for irON. The Sweet Tools dataset is relatively large for a speciality source, has many different types and attributes, and is characterized by text, images, URLs and similar.

The premise was that if Sweet Tools could be specified and represented in commON sufficiently to be parsed and converted to interoperable RDF, then many similar instance-oriented datasets could likely be so as well. Thus, as we tried and refined notation and vocabulary, we tested applicability against the CSV representation of Sweet Tools in addition to other CSV, JSON and XML datasets.

Dataset Authoring in a Spreadsheet

A large portion of the case study describes the many advantages of authoring small datasets within spreadsheets. The useful thing about the CSV format is that these full functional capabilities of the spreadsheet are available during authoring or later updates and modifications, but, when exported, the CSV provides a relatively clean format for processing and parsing.

So, some of the reasons for small dataset authoring in a spreadsheet include:

  • Formatting and on-sheet management -  the first usefulness of a spreadsheet comes from being able to format and organize the records. Records can be given background colors to highlight distinctions (new entries, for example); live URL links can be embedded; contents can be wrapped and styled within cells; and the column and row heads can be “frozen”, useful when scrolling large workspaces
  • Named blocks and sorting – named blocks are a powerful feature of modern spreadsheets, useful for data manipulation, printing and internal referencing by formulas and the like.  Sorting with named blocks is especially important as an aid to check consistency of terminology, records completeness, duplicates checks, missing value checks, and the like. Named blocks can also be used as references in calculations. All of these features are real time savers, especially when datasets grow large and consistency of treatment and terminology is important
  • Multiple sheets and consolidated accesscommON modules can be specified on a single worksheet or multiple worksheets and saved as individual CSV files; because of its size and relative complexity, the Sweet Tools dataset is maintained on multiple sheets. Multi-worksheet environments help keep related data and notes consolidated and more easily managed on local hard drives
  • Completeness and counts - the spreadsheet counta function is useful to sum counts for cell entries by both column and row, a useful aid to indicate if an attribute or type value is missing or if a record is incomplete.  Of course, similar helps and uses can be found for many of the hundreds of embedded functions within a spreadsheet
  • Controlled vocabularies and data entry validation – quality datasets often hinge on consistency and uniform values and terminology; the data validation utilities within spreadsheets can be applied to Boolean, ranges and mins and maxes, and to controlled vocabulary lists. Here is an example for Sweet Tools, enforcing proper tool category assignments from a 50-item pick list:
Controlled Vocabularies and Data Entry Validation
  • Specialized functions and macrosall functionality of spreadsheets may be employed in the development of commON datasets. Then, once employed, only the values embedded within the sheets are then exported as CSV.

Staging Sweet Tools for commON

The next major section of the case study deals with the minor conventions that must be followed in order to stage spreadsheets for commON. Not much of the specific commON vocabulary or notation is discussed below; for details, see [7].

Because you can create multiple worksheets within a spreadsheet, it is not necessary to modifiy existing worksheets or tabs. Rather, if you are reluctant or can not change existing information, merely create parallel duplicate sheets of the source information. These duplicate sheets have as their sole purpose export to commON CSV. You can maintain your spreadsheet as is while staging for commON.

To do so, use the simple = formula to create cross-references between the existing source spreadsheet tab and the target commON CSV export tab. (You can also do this for complete, highlighted blocks from source to target sheet.) Then, by adding the few minor conventions of commON, you have now created a staged export tab without modifying your source information in the slightest.

In standard form and for Excel and Open Office, single quotes, double quotes and commas when entered into a spreadsheet cell are automatically ‘escaped‘ when issued as CSV. commON allows you to specify your own delimiter for lists (the standard is the pipe ‘|’ character) and what the parser recognizes as the ‘escape’ character (‘\’ is the standard). However, you probably should not change for most conditions.

The standard commON parsers and converters are UTF-8 compatible. If your source content has unusual encodings, try to target UTF-8 as your canonical spreadsheet output.

In the irON specification there are a small number of defined modules or processing sections. In commON, these modules are denoted by the double-ampersand character sequence (‘&&‘), and apply to lists of instance records (&&recordList), dataset specifications and associated metadata describing the dataset (&&dataset), and mappings of attributes and types to existing schema (&&linkage). Similarly, attributes and types are denoted by a single ampersand prefix (&attributeName).

In commON, any or all of the modules can occur within a single CSV file or in multiple files. In any case, the start of one of these processing modules is signaled by the module keyword and &&keyword convention.

The RecordList Module

The first spreadsheet figure above shows a Sweet Tools example for the &&recordList module. The module begins with that keyword, indicating one of more instance records will follow. Note that the first line after the &&recordList keyword is devoted to the listing of attributes and types for the instance records (designated by the &attributeName convention in the columns for the first row after the &&recordList keyword is encountered).

The &&recordList format can also include the stacked style (see similar Dataset example below) in addition to the single row style shown above.

At any rate, once a worksheet is ready with its instance records following the straightforward irON and commON conventions, it can then be saved as a CSV file and appropriately named. Here is an example of what this “vanilla” CSV file now looks like when shown again in a spreadsheet:

Spreadsheet View of the CSV File
(click to expand)

Alternatively, you could open this same file in a text editor. Here is how this exact same instance record view looks in an editor:

Editor View of the CSV Record File
(click to expand)

Note that the CSV format separates each column by the comma separator, with escapes shown for the &description attribute when it includes a comma-separated clause. Without word wrap, each record in this format occupies a single row (though, again, for the stacked style, multiple entries are allowed on individual rows so long as a new instance record &id is not encountered in the first column).

The Dataset Module

The &&dataset module defines the dataset parameters and provides very flexible metadata attributes to describe the dataset [8]. Note the dataset specification is exactly equivalent in form to the instance record (&&recordList) format, and also allows the single row or stacked styles (see these instance record examples), with this one being the stacked style:

The Dataset Module
(click to expand)

The Linkage Module

The &&linkage module is used to map the structure of the instance records to some structural schema, which can also include external ontologies. The module has a simple, but specific structure.

Either attributes (presented as the &attributeList) or types (presented as the &typeList) are listed sequentially by row until the listing is exhausted [8]. By convention, the second column in the listing is the targeted &mapTo value. Absent a prior &prefixList value, the &mapTo value needs to be a full URL to the corresponding attribute or type in some external schema:

The Linkage Module

Notice in the case of Sweet Tools that most values are from the actual COSMO mini-ontology underlying the listing. These need to be listed as well, since absent the specifications in commON the system has NO knowledge of linkages and mappings.

The Schema (structure) Module

In its current state of development, commON does not support a spreadsheet-based means for specifying the schema structure (lightweight ontology) governing the datasets [2]. Another irON serialization, irJSON, does. Either via this irJSON specification or via an offline ontology, a link reference is presently used by commON (and, therefore, Sweet Tools for this case study) to establish the governing structure of the input instance record datasets.

A spreadsheet-based schema structure for commON has been designed and tested in prototype form. commON should be enhanced with this capability in the near future [8].

Saving and Importing

If the modules are spread across more than one worksheet, then each worksheet must be saved as its own CSV file. In the case of Sweet Tools, as exhibited by its reference current spreadsheet, sweet_tools_20091110.xls, three individual CSV files get saved. These files can be named whatever you would like. However, it is essential that the names be remembered for later referencing.

My own naming convention is to use a format of appname_date_modulename.csv because it sorts well in a file manager accommodating multiple versions (dates) and keeps related files clustered. The appname in the case of Sweet Tools is generally swt. The modulename is generally the dataset, records, or linkage convention. I tend to use the date specification in the YYYYMMDD format. Thus, in the case of the records listings for Sweet Tools, its filename could be something like:  swt_20091110_records.csv.

Once saved, these files are now ready to be imported into a structWSF [9] instance, which is where the CSV parsing and conversion to interoperable RDF occurs [8]. In this case study, we used the Drupal-based conStruct SCS system [10]. conStruct exposes the structWSF Web services via a user interface and a user permission and access system. The actual case study write-up offers more details about the import process.

Using the Dataset

We are now ready to interact with the Sweet Tools structured dataset using conStruct (assuming you have a Drupal installation with the conStruct modules) [10].

Introduction to the App

The screen capture below shows a couple of aspects of the system:

  • First, the left hand panel (according to how this specific Drupal install was themed) shows the various tools available to conStruct.  These include (with links to their documentation) Search, Browse, View Record, Import, Export, Datasets, Create Record, Update Record, Delete Record and Settings [11];
  • The Browse tree in the main part of the screen shows the full mini-ontology that classifies Sweet Tools. Via simple inferencing, clicking on any parent link displays all children projects for that category as well (click to expand):
conStruct (Drupal) Browse Screen for Sweet Tools(click to expand)

One of the absolutely cool things about this framework is that all tools, inferencing, user interfaces and data structure are a direct result of the ontology(ies) underlying the system (plus the irON instance ontology, as well). This means that switching datasets or adding datasets causes the entire system structure to now reflect those changes — without lifting a finger!!

Some Sample Uses

Here are a few sample things you can do with these generic tools driven by the Sweet Tools dataset:

Note, if you access this conStruct instance you will do so as a demo user. Unfortunately, as such, you may not be able to see all of the write and update tools, which in this case are reserved for curators or admins. Recall that structWSF has a comprehensive user access and permissions layer.

Exporting in Alternative Formats

Of course, one of the real advantages of the irON and structWSF designs is to enable different formats to be interchanged and to interoperate. Upon submission, the commON format and its datasets can then be exported in these alternate formats and serializations [8]:

  • commON
  • irJSON
  • irXML
  • N-Triples/CSV
  • N-Triples/TSV
  • RDF+N3

As should be obvious, one of the real benefits of the irON notation — in addition to easy dataset authoring — is the ability to more-or-less treat RDF, CSV, XML and JSON as interoperable data formats.

The Formal Case Study

The formal Sweet Tools case study based on commON, with sample download files and PDF, is available from Annex: A commON Case Study using Sweet Tools, Supplementary Documentation [3].

[1] In 2003, Microsoft estimated its worldwide users of the Excel spreadsheet, which then had about a 90% market share globally, at 400 million. Others at that time estimated unauthorized use to perhaps double that amount. There has been significant growth since then, and online spreadsheets such as Google Docs and Zoho have also grown wildly. This surely puts spreadsheet users globally into the 1 billion range.
[2] See Frédérick Giasson and Michael Bergman, eds., Instance Record and Object Notation (irON) Specification, Specification Document, version 0.82, 20 October 2009.  See Also see the irON Web site, Google discussion group, and code distribution site.
[3] Michael Bergman, 2009. Annex: A commON Case Study using Sweet Tools, Supplementary Documentation, prepared by Structured Dynamics LLC, November 10, 2009. See It may also be downloaded in PDF .
[4] See Michael K. Bergman’s AI3:::Adaptive Information blog, Sweet Tools (Sem Web). In addition, the commON version of Sweet Tools is available at the conStruct site.
[5] The CSV mime type is defined in Common Format and MIME Type for Comma-Separated Values (CSV) Files [RFC 4180]. A useful overview of the CSV format is provided by The Comma Separated Value (CSV) File Format. Also, see that author’s related CTX reference for a discussion of how schema and structure can be added to the basic CSV framework; see, especially the section on the comma-delimited version (
[6] An attribute-value system is a basic knowledge representation framework comprising a table with columns designating “attributes” (also known as properties, predicates, features, parameters, dimensions, characteristics or independent variables) and rows designating “objects” (also known as entities, instances, exemplars, elements or dependent variables). Each table cell therefore designates the value (also known as state) of a particular attribute of a particular object. This is the basic table presentation of a spreadsheet or relational data table.

Attribute-values can also be presented as pairs in a form of an associative array, where the first item listed is the attribute, often followed by a separator such as the colon, and then the value. JSON and many simple data struct notations follow this format. This format may also be called attribute-value pairs, key-value pairs, name-value pairs, alists or others. In these cases the “object” is implied, or is introduced as the name of the array..

[7] See especially SUB-PART 3: commON PROFILE in, Frédérick Giasson and Michael Bergman, eds., Instance Record and Object Notation (irON) Specification, Specification Document, version 0.82, 20 October 2009.
[8] As of the date of this case study, some of the processing steps in the commON pipeline are manual. For example, the parser creates an intermediate N3 file that is actually submitted to the structWSF. Within a week or two of publication, these capabilities should be available as a direct import to a structWSF instance. However, there is one exception to this:  the specification for the schema structure. That module has been prototyped, but will not be released with the first commON upgrade. That enhancement is likely a few weeks off from the date of this posting. Please check the irON or structWSF discussion groups for announcements.
[9] structWSF is a platform-independent Web services framework for accessing and exposing structured RDF data, with generic tools driven by underlying data structures. Its central perspective is that of the dataset. Access and user rights are granted around these datasets, making the framework enterprise-ready and designed for collaboration. Since a structWSF layer may be placed over virtually any existing datastore with Web access — including large instance record stores in existing relational databases — it is also a framework for Web-wide deployments and interoperability.
[10] conStruct SCS is a structured content system built on the Drupal content management framework. conStruct enables structured data and its controlling vocabularies (ontologies) to drive applications and user interfaces. It is based on RDF and SD’s structWSF platform-independent Web services framework [6]. In addition to user access control and management and a general user interface, conStruct provides Drupal-level CRUD, data display templating, faceted browsing, full-text search, and import and export over structured data stores based on RDF.
[11] More Web services are being added to structWSF on a fairly constant basis, and the existng ones have been through a number of upgrades.
Posted:November 2, 2009

Structured Dynamics LLC

A New Slide Show Consolidates, Explains Recent Developments

Much has been happening on the Structured Dynamics front of late. Besides welcoming Steve Ardire as a senior advisor to the company, we also have been issuing a steady stream of new products from our semantic Web pipeline.

This new slide show attempts to capture these products and relate them to the various layers in Structured Dynamics’ enterprise product stack:

The show indicates the role of scones, irON, structWSF, UMBEL, conStruct and others and how they leverage existing information assets to enable the semantic enterprise. And, oh, by the way, all of this is done via Web-accessible linked data and our practical technologies.


Posted:October 11, 2009

The Marshal Has Come to TownA Marshal to Bring Order to the Town of Data Gulch

Though not the first, I have been touting the Linked Data Law for a couple of years now [1]. But in a conversation last week, I found that my colleague did not find the premise very clear. I suspect that is due both to cryptic language on my part and the fact no one has really tackled the topic with focus. So, in this post, I try to redress that and also comment on the related role of linked data in the semantic enterprise.

Adding connections to existing information via linked data is a powerful force multiplier, similar to Metcalfe’s law for how the value of a network increases with more users (nodes). I have come to call this the Linked Data Law: the value of a linked data network is proportional to the square of the number of links between data objects.

“In the network economy, the connections are as important as the nodes.” [2]

An early direct mention of the semantic Web and its possible ability to generate network effects comes from a 2003 Mitre report for the government [3]. In it, the authors state, “At present a very small proportion of the data exposed on the web is marked up using Semantic Web vocabularies like RDF and OWL. As more data gets mapped to ontologies, the potential exists to achieve a ‘network effect’.” Prescient, for sure.

In July 2006, both Henry Story and Dion Hinchliffe discussed Metcalfe’s law, with Henry specifically looking to relate it to the semantic Web [4]. He noted that his initial intuition was that “the value of your information grows exponentially with your ability to combine it with new information.” He noted he was trying to find ways to adapt Metcalfe’s law for applicability to the semantic Web.

I picked up on those observations and commented to Henry at that time and in my own post, “The Exponential Driver of Combining Information.” I have been enamoured of the idea ever since, and have begun to weave the idea into my writings.

More recently, in late 2008, James Hendler and Jennifer Golbeck devoted an entire paper to Metcalfe’s law and the semantic Web [5]. In it, they note:

“This linking between ontologies, and between instances in documents that refer to terms in another ontology, is where much of the latent value of the Semantic Web lies. The vocabularies, and particularly linked vocabularies using URIs, of the Semantic Web create a graph space with the ability to link any term to any other. As this link space grows with the use of RDF and OWL, Metcalfe’s law will once again be exploited – the more terms to link to, and the more links created, the more value in creating more terms and linking them in.”

A Refresher on Metcalfe’s Law

Metcalfe’s law states that the value of a telecommunications network is proportional to the square of the number of users of the system (n²) (note: it is not exponential, as some of the points above imply). Robert Metcalfe formulated it about 1980 in relation to Ethernet and fax machines; the “law” was then named for Metcalfe and popularized by George Gilder in 1993.

These attempts to estimate the value of physical networks were in keeping with earlier efforts to estimate the value of a broadcast network. That value is almost universally agreed to be proportional to the number of users, as accepted as Sarnoff’s law (see further below).

The actual algorithm proposed by Metcalfe calculates the number of unique connections in a network with n nodes to be n(n − 1)/2, which is proportional to n2. This makes Metcalfe’s law a quadratic growth equation.

As nodes get added, then, we see the following increase in connections:

Metcalfe Law Network Effect

‘Network Effect’ for Physical Networks

This diagram, modified from Wikipedia to be a horizontal image, shows how two telephones can make only one connection, five can make 10 connections, and twelve can make 66 connections, etc.

By definition, a physical network is a connected network. Thus, every time a new node is added to the network, connections are added, too. This general formula has also been embraced as a way to discuss social connections on the Internet [6].

Analogies to Linked Data

Like physical networks, the interconnectedness of the semantic Web or semantic enterprise is a graph.

The idea behind linked data is to make connections between data. Unlike physical telecommunication networks, however, the nodes in the form of datasets and data are (largely) already there. What is missing are the connections. The build-out and growth that produces the network effects in a linked data context do not result from adding more nodes, but from the linking or connecting of existing nodes.

The fact that adding a node to a physical network carries with it an associated connection has tended to conjoin these two complementary requirements of node and connection. But, to grok the real dynamics and to gain network effects, we need to realize: Both nodes and connections are necessary.

One circumstance of the enterprise is that data nodes are everywhere. The fact that the overwhelming majority are unconnected is why we have adopted the popular colloquialism of data “silos”. There are also massive amounts of unconnected data on the Web in the form of dynamic databases only accessible via search form, and isolated data tables and listings virtually everywhere.

Thus, the essence of the semantic enterprise and the semantic Web is no more complicated than connecting — meaningfully — data nodes that already exist.

As the following diagram shows, unconnected data nodes or silos look like random particles caught in the chaos of Brownian motion:

Linked Data Law Network Effect

‘Network Effect’ for Coherent Linked Data

As initial connections get made, bits of structure begin to emerge. But, as connections are proliferated — exactly equivalant to the network effects of connected networks — coherence and value emerge.

Look at the last part in the series diagram above. We not only see that the same nodes are now all connected, with the inferences and relationships that result from those connections, but we can also see entirely new structures emerge by virtue of those connections. All of this structure and meaning was totally absent prior to making the linked data connections.

Quantifying the Network Effect

So, what is the benefit of this linked data? It depends on the product of the value of the connections and the multiplier of the network effect:

linked data benefit = connections value X network effect multiplier

Just as it is hard to have a conversation via phone with yourself, or to collaborate with yourself, the ability to gain perspective and context from data comes from connections. But like some phone calls or some collaborations, the value depends on the participants. In the case of linked data, that depends on the quality of the data and its coherence [7]. The value “constant” for connected linked data depends in some manner on these factors, as well as the purposes and circumstances to which that linked data might be applied.

Even in physical networks or social collaboration contexts, the “value” of the network has been hard to quantify. And, while academics and researchers will appropriately and naturally call for more research on these questions, we do not need to be so timid. Whatever the alpha constant is for quantifying the value of a linked data network, our intuition should be clear that making connections, finding relationships, making inferences, and making discoveries can not occur when data is in isolation.

Because I am an advocate, I believe this alpha constant of value to be quite large. I believe this constant is also higher for circumstances of business intelligence, knowledge management and discovery.

The second part of the benefit equation is the multiplier for network effects. We’ve mentioned before the linear growth advantage due to broadcast networks (Sarnoff law) and the standard quadratic growth assumption of physical and social networks (Metcalfe law). Naturally, there have been other estimates and advocacies.

David Reed [8], for example, also adds group effects and has asserted an exponential multiplier to the network effect (like Henry Story’s initial intuition noted above). As he states,

“[E]ven Metcalfe’s Law understates the value created by a group-forming network [GFN] as it grows. Let’s say you have a GFN with n members. If you add up all the potential two-person groups, three-person groups, and so on that those members could form, the number of possible groups equals 2n. So the value of a GFN increases exponentially, in proportion to 2n. I call that Reed’s Law. And its implications are profound.”

Yet not all agree with the assertion of an exponential multiplier, let alone the quadratic one of Metcalfe. Odlyzko and Tilly [9] note that Metcalfe’s law would hold if the value that an individual gets personally from a network is directly proportional to the number of people in that network. But, then they argue that does not hold because of local preferences or different qualities of interaction. In a linked data context, such arguments have merit, though you may also want to see Metcalfe’s own counter-arguments [6].

Hinchliffe’s earlier commentary [4] provided a nice graphic that shows the implications of these various multiplers on the network effect, as a function of nodes in a network:

Potency of the Network Effect from Dion Hinchliffe

Various Estimates for the ‘Network Effect’

I believe we can dismiss the lower linear bound of this question and likely the higher exponential one as well (that is, Reed’s law, because quality and relevance questions make some linked data connections less valuable than others). Per the above, that would suggest that the multiplier of the linked data network is perhaps closer to the Metcalfe estimate or similar.

In any event, it is also essential to point out that connecting data indiscriminantly for linked data’s sake will likely deliver few, if any, benefits. Connections must still be coherent and logical for the value benefits to be realized.

The Role and Contribution of Linked Data

I elsewhere discuss the role of linked data in the enterprise and will continue to do so. But, there are some implications in the above that warrant some further observations.

It should be clear that the graph and network basis of linked data, not to mention some of the uncertainties as to quantifying benefits, suggests the practice should be considered apart from mission-critical or transactional uses in the enterprise. That may change with time and experience.

There are also open questions about data quality in terms of inputs to linked data and possible erroneous semantics and ontologies to guide the linked connections. Operational uses should be kept off the table for now. Like physical networks, not all links perform well and not all have usefulness. Similarly to how poor connections may be encountered in physical networks, they should be either taken off-ledger or relegated to a back-up basis. Linked data should be understood and treated no differently than networks of variable quality.

Such realism is important — for both internal and external linked data advocates — to allow linked data to be applied in the right venues at acceptable risk and with likely demonstrable benefits. Elsewhere I have advocated an approach that builds on existing assets; here I advocate a clear and smart understanding of where linked data can best deliver network effects in the near term.

And, so, in the nearest term, enterprise applications that best fit linked data promises and uncertainties include:

  • Establishing frameworks for data federation
  • Business intelligence
  • Discovery
  • Knowledge management and knowledge resources
  • Reasoning and inference
  • Development of internal common language
  • Learning and adopting data-driven apps [10], and
  • Staging and analysis for data cleaning.

A New Deputy Has Come to Town

As in the Wild West, the new deputy marshal and his tin badge did not guarantee prosperity. But a good marshal would deliver law and order. And those are the preconditions for the town folk to take charge of building their own prosperity.

Linked data is a practice for starting to bring order and connections to your existing data. Once some order has been imposed, the framework then becomes a basis for defining meanings and then gaining value from those connections.

Once order has been gained, it is up to the good citizens of Data Gulch to then deliver the prosperity. Broad participation and the network effect are one way to promote that aim. But success and prosperity still depends on intelligence and good policies and practice.

[1] I first put forward this linked data aspect in What is Linked Data?, dated June 23, 2008. I then formalized it in Structure the World, dated August 3, 2009.
[2] Paul Tearnen, 2006. “Integration in the Network Economy,” Information Management Special Reports, October 2006. See
[3] Salim K. Semy, Mark Linderman and Mary K. Pulvermacher, 2003. “Information Management Meets the Semantic Web,” DOD Report by MITRE Corporation, November 2003, 10 pp. See
[4] On July 15, 2006, Dion Hinchcliffe wrote, Web 2.0′s Real Secret Sauce: Network Effects. He produced a couple of useful graphics and expanded upon some earlier comments to the Wall Street Journal. Shortly thereafter, on July 29, Story wrote his own post, RDF and Metcalfe’s law, as noted. I commented on July 30.
[5] James Hendler and Jennifer Golbeck, 2008. “Metcalfe’s Law, Web 2.0, and the Semantic Web,” in Journal of Web Semantics 6(1):14-20, 2008. See
[6] Robert Metcalfe, 2006. Metcalfe’s Law Recurses Down the Long Tail of Social Networking, see
[7] See my When is Content Coherent? posting of July 25, 2008. ‘Coherence’ is a frequent theme of my blog posts; see my chronological listing for additional candidates.
[8] From David P. Reed, 2001. “The Law of the Pack,” Harvard Business Review, February 2001, pp 23-4. For more on Reed’s position, see Wikipedia’s entry on Reed’s law.
[9] Andrew Odlyzko and Benjamin Tilly, 2005. A Refutation of Metcalfe’s Law and a Better Estimate for the Value of Networks and Network Interconnections, personal publication; see
[10] Data-driven applications are the term we have adopted for modular, generic tools that operate and present results to users based on the underlying data structures that feed them. See further the discussion of Structured Dynamics’s products.

Posted by AI3's author, Mike Bergman Posted on October 11, 2009 at 8:16 pm in Linked Data, Semantic Web | Comments (4)
The URI link reference to this post is:
The URI to trackback this post is:
Posted:September 28, 2009

The Tower of Babel by Pieter Brueghel the Elder (1563)The Benefits are Greater — and Risks and Costs Lower — Than Many Realize

I have been meaning to write on the semantic enterprise for some time. I have been collecting notes on this topic since the publication by PricewaterhouseCoopers (PWC) of an insightful 58-pp report earlier this year [1]. The PWC folks put their finger squarely on the importance of ontologies and the delivery of semantic information via linked data in that publication.

The recent publication of a special issue of the Cutter IT Journal devoted to the semantic enterprise [2] has prompted me to finally put my notes in order. This Cutter volume has a couple of good articles including its editorial intro [3], but is overall spotty in quality and surprisingly unexciting. I think it gets some topics like the importance of semantics to data integration and business intelligence right, but in other areas is either flat wrong or misses the boat.

The biggest mistake are statements such as “. . . a revolutionary mindset will be needed in the way we’ve traditionally approached enterprise architecture” or that the “. . . semantic enterprise means rethinking everything.”

This is just plain hooey. From the outset, let’s make one thing clear:  No one needs to replace anything in their existing architecture to begin with semantic technologies. Such overheated rhetoric is typical consultant hype and fundamentally mischaracterizes the role and use of semantics in the enterprise. (It also tends to scare CIOs and to close wallets.)

As an advocate for semantics in the enterprise, I can appreciate the attraction of framing the issue as one of revolution, paradigm shifts, and The Next Big Thing. Yes, there are manifest benefits and advantages for the semantic enterprise. And, sure, there will be changes and differences. But these changes can occur incrementally and at low risk while experience is gained.

The real key to the semantic enterprise is to build upon and leverage the assets that already exist. Semantic technologies enable us to do just that.

Think about semantic technologies as a new, adaptive layer in an emerging interoperable stack, and not as a wholesale replacement or substitution for all of the good stuff that has come before. Semantics are helping us to bridge and talk across multiple existing systems and schema. They are asking us to become multi-lingual while still allowing us to retain our native tongues. And, hey! we need not be instantly fluent in these new semantic languages in order to begin to gain immediate benefits.

As I noted in my popular article on the Advantages and Myths of RDF from earlier this year:

We can truly call RDF a disruptive data model or framework. But, it does so without disrupting what exists in the slightest. And that is a most remarkable achievement.

That is still a key takeaway message from this piece. But, let’s look and list with a fresh perspective the advantages of moving toward the semantic enterprise [4].

Perspective #1: Incremental, Learn-as-you-Go is Best Strategy

For the interconnected reasons noted below, RDF and semantic technologies are inherently incremental, additive and adaptive. The RDF data model and the vocabularies built upon it allow us to progress in the sophistication of our expressions from pidgin English (simple Dick sees Jane triples or assertions) to elegant and expressive King’s English. Premised on the open world assumption (see below), we also have the freedom to only describe partial domains or problem areas.

From a risk standpoint, this is extremely important. To get started with semantic technologies we neither need to: 1) comprehensively describe or tackle the entire enterprise information space; nor 2) do so initially with precision and full expressiveness. We can be partial and somewhat crude or simplistic in our beginning efforts.

Also extremely important is that we can add expressivity and scope as we go. There is no penalty for starting small or simple and then growing in scope or sophistication. Just like progressing from a kindergarten reader to reading Tolstoy or Dickens, we can write and read schema of whatever complexity our current knowledge and understanding allow.

Perspective #2: Augment and Layer on to Existing Assets, Don’t Replace Them!

Semantic technology does not change or alter the fact that most activities of the enterprise are transactional, communicative or documentary in nature. Structured, relational data systems for transactions or records are proven, performant and understood. Writing and publishing information, sometimes as documents and sometimes as spreadsheets or Web pages, is (and will remain) the major vehicle for communicating within the enterprise and to external constituents.

On its very face, it should be clear that the meaning of these activities — their semantics, if you will — is by nature an augmentation or added layer to how to conduct the activities themselves. Moreover, as we also know, these activities are undertaken for many different purposes and within many different contexts. The inherent meaning of these activities is also therefore contextual and varied.

This simple truth affirms that semantic technologies are not a starting basis, then, for these activities, but a way of expressing and interoperating their outcomes. Sure, some semantic understanding and common vocabularies at the front end can help bring consistency and a common language to an enterprise’s activities. This is good practice, and the more that can be done within reason while not stifling innovation, all the better. But we all know that the budget department and function has its own way of doing things separate from sales or R&D. And that is perfectly OK and natural.

These observations — in combination with semantic technologies — can thus lead to a conceptual architecture for the enterprise that recognizes there are “silo” activities that can still be bridged with the semantic layer:

Under this conceptual architecture, “RDFizers” (similar to the ETL function) or information extractors working upon unstructured or semi-structured documents expose their underlying information assets in RDF-ready form. This RDF is characterized by one or more ontologies (multiples are actually natural and preferred [5]), which then can be queried using the semantic querying language, SPARQL.

We have written at length about proper separation of instance records and data and schema, what is called the ABox and TBox, respectively, in description logics [6], a key logic premise to the semantic Web. Thus, through appropriate architecting of existing information assets, it is possible to leave those systems in place while still gaining the interoperability advantages of the semantic enterprise.

Another aspect of this information re-use is also a commitment to leverage existing schema structures, be they industry standards, XML, MDM, relational schema or corporate taxonomies. The mappings of these structures in the resulting ontologies thus become the means to codify the enterprise’s circumstances into an actionable set of relationships bridging across multiple, existing information assets.

Perspective #3: The First Major Benefit is from Data Federation

Clearly, then, the first obvious benefit to the semantic enterprise is to federate across existing data silos, as featured prominently in the figure above. Data federation has been the Holy Grail of IT systems and enterprises for more than three decades. Expensive and involved efforts from ETL and MDM and then to enterprise information integration (EII), enterprise application integration (EAI) and business intelligence (BI) have been a major focus.

Frankly, it is surprising that no known vendors in these spaces (aside from our own Structured Dynamics, hehe) premise their offerings on RDF and semantic technologies. (Though some claim so.) This is a major opportunity area. (And we don’t mind giving our competitors useful tips.)

Perspective #4: Wave Goodbye to Rigid, Inflexible Schema

Instance-level records and the ABox work well with relational databases. Their schema are simple and relatively fixed. This is fortunate, because such instance records are the basis of transactional systems where performance and throughput are necessary and valued.

But at the level of the enterprise itself — what its business is, its business environment, what is constantly changing around it — trying to model its world with relational schema has proven frustrating, brittle and inflexible. Though relational and RDF schema share much logically, the physical basis of the relational schema does not lend itself to changes and it lacks the flexibility and malleability of the graph-based RDF conceptual structure.

Knowledge management and business intelligence are by no means new concepts for the enterprise. What is new and exciting, however, is how the emergence of RDF and the semantic enterprise will open new doors and perspectives. Once freed of schema constraints, we should see the emergence of “agile KM” similar to the benefits of agile software development.

Because semantic technologies can operate in a layer apart from the standard data basis for the enterprise, there is also a smaller footprint and risk to experimenting at the KM or conceptual level. More options and more testing and much lower costs and risks will surely translate to more innovation.

Just as semantic technologies are poorly suited for transactional or throughput purposes, we should see the complementary and natural migration of KM to the semantic side of the shop. There are no impediments for this migration to begin today. In the process, as yet unforeseen and manifest benefits in agility, experimentation, inferencing and reasoning, and therefore new insights, will emerge.

Perspective #5: Data-driven Apps Shift the Software Paradigm

The same ontologies that guide the data federation and interoperability layer can also do double-duty as the specifications for data-driven applications. The premise is really quite simple: Once it is realized that the inherent information structure contained within ontologies can guide hierarchies, facets, structured retrievals and inferencing, the logical software design is then to “drive” the application solely based on that structure. And, once that insight is realized, then it becomes important, as a best practice, to add further specifications in order to also carry along the information useful for “driving” user interfaces [7].

Thus, while ontologies are often thought solely to be for the purpose of machine interpretation and communication, this double-duty purpose now tells us that useful labels and such for human use and consumption is also an important goal.

When these best practices of structure and useful human labels are made real, it then becomes possible to develop generic software applications, the operations of which vary solely by the nature of the structure and ontologies fed to them. In other words, ontologies now become the application, not custom-written software.

Of course, this does not remove the requirement to develop and write software. But the nature and focus of that development shifts dramatically.

From the outset, data-driven software applications are designed to be responsive to the structure fed them. Granted, specific applications in such areas as search, report writing, analysis, data visualization, import and export, format conversions, and the like, still must be written. But, when done, they require little or no further modification to respond to whatever compliant ontologies are fed to them — irrespective of domain or scope.

It thus becomes possible to see a relatively small number of these generic apps that can respond to any compliant structure.

The shift this represents can be illustrated by two areas that have been traditional choke points for IT within the enterprise: queries to local data stores (in order to get needed information for analysis and decisions) and report writers (necessary to communicate with management and constituents).

It is not unusual to hear of weeks or months delays in IT groups responding to such requests. It is not that the IT departments are lazy or unresponsive, but that the schema and tools used to fulfill their user demands are not flexible.

It is hard to know just how large the huge upside is for data-driven apps and generic tools. But, this may prove to be of even greater import than overcoming the data federation challenge.

In any event, while potentially disruptive, this prospect of data-driven applications can start small and exist in parallel with all existing ways of doing business. Yes, the upside is huge, but it need not be gained by abandoning what already works.

Perspective #6: Adaptive Ontologies Flatten, Democratize the KM Process

So, assume, then, a knowledge management (KM) environment supported by these data-driven apps. What perspective arises from this prospect?

One obvious perspective is where the KM effort shifts to become the actual description, nature and relationships of the information environment. In other words, ontologies themselves become the focus of effort and development. The KM problem no longer needs to be abstracted to the IT department or third-party software. The actual concepts, terminology and relations that comprise coherent ontologies now become the foundation of KM activities.

An earlier perspective emphasized how most any existing structure can become a starting basis for ontologies and their vocabularies, from spreadsheets to naïve data structures and lists and taxonomies. So, while producing an operating ontology that meets the best practice thresholds noted herein has certain requirements, kicking off or contributing to this process poses few technical or technology demands.

The skills needed to create these adaptive ontologies are logic, coherent thinking and domain knowledge. That is, any subject matter expert or knowledge worker worth keeping on the payroll has, by definition, the necessary skills to contribute to useful ontology development and refinement.

With adaptive ontologies powering data-driven apps we thus see a shift in roles and responsibilities away from IT to knowledge workers themselves. This shift acts to democratize the knowledge management function and flatten the organization.

Perspective #7: The Semantic Enterprise is ‘Open’ to the World

Enterprise information systems, particularly relational ones, embody a closed world assumption that holds that any statement that is not known to be true is false. This premise works well where there is complete coverage of the entities within a knowledge base, such as the enumeration of all customers or all products of an enterprise.

Yet, in the real (”open”) world there is no guarantee or likelihood of complete coverage. Thus, under an open world assumption the lack of a given assertion or fact being available neither implies whether that possible assertion is true or false: it simply is not known. An open world assumption is one of the key factors for enabing adaptive ontologies to grow incrementally. It is also the basis for enabling linkage to external (and surely incomplete) datasets.

Fortunately, there is no requirement for enterprises to make some philosophical commitment to either closed- or open-world systems or reasoning. It is perfectly acceptable to combine traditional closed-world relational systems with open-world reasoning at the ontology level. It is also not necessary to make any choices or trade-offs about using public v. private data or combinations thereof. All combinations are acceptable and easily accommodated.

As noted, one advantage of open-world reasoning at the ontological level is the ability to readily change and grow the conceptual understanding and coverage of the world, including incorporation of external ontologies and data. Since this can easily co-exist with underlying closed-world data, the semantic enterprise can readily bridge both worlds.

Perspective #8: The Semantic Enterprise is a Disruptive Innovation, without Being Disruptive

Unfortunately, as a relatively new area there are advantages for some pundits or consultants to present the semantic Web as more complicated and commitment-laden than it need be. Either the proponents of that viewpoint don’t know what they are saying, or are being cynical to the market. The major point underlying the fresh perspectives herein is to iterate that it is quite possible to start small, and do so with low cost and risk.

While it is true that semantic technologies within the enterprise promise some startling upside potentials and disruptions to the old ways of doing business, the total beauty of RDF and its capabilities and this layered model is that those promises can be realized incrementally and without hard choices. No, it is not for free: a commitment to begin the process and to learn is necessary. But, yes, it can be done so with exciting enterprise-wide benefits at a pace and risk level that is comfortable.

The good news about the dedicated issue of the Cutter IT Journal and the earlier PWC publication is that the importance of semantic technologies to the enterprise is now beginning to receive its just due. But as we ramp up this visibility, let’s be sure that we frame these costs and benefits with the right perspectives.

The semantic enterprise offers some important new benefits not obtainable from prior approaches and technologies. And, the best news is that these advantages can be obtained incrementally and at low risk and cost while leveraging prior investments and information assets.

[1] Paul Horowittz, ed., 2009. Technology Forecast: A Quarterly Journal, PricewaterhouseCoopers, Spring 2009, 58 pp. See (after filling out contact form). I reviewed this publication in an earlier post.
[2] Mitchell Ummell, ed., 2009. “The Rise of the Semantic Enterprise,” special dedicated edition of the Cutter IT Journal, Vol. 22(9), 40pp., September 2009. See (after filling out contact form).
[3] It is really not my purpose to review the Cutter IT Journal issue nor to point out specific articles that are weaker than others. It is excellent we are getting this degree of attention, and for that I recommend signing up and reading the issue yourself. IMO, the two useful articles are: John Kuriakose, “Understanding and Adopting Semantic Web Technology,” pp. 10-18; and Shamod Lacoul, “Leveraging the Semantic Web for Data Integration,” pp. 19-23.
[4] As a working definition, a semantic enterprise is one that adopts the languages and standards of the semantic Web, including RDF, RDFS, OWL and SPARQL and others, and applies them to the issues of information interoperability, preferably using the best practices of linked data.
[5] One prevalent misconception is that is it desirable to have a single, large, comprehensive ontology. In fact, multiple ontologies, developing and growing on multiple tracks in various contexts, are much preferable. This decentralized approach brings ontology development closer to ultimate users, allows departmental efforts to proceed at different paces, and lowers risk.

[6] Here is our standard working definition for description logics:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[7] I first introduced this topic in Ontologies as the ‘Engine’ for Data-Driven Applications. Some of the user interface considerations that can be driven by adaptive ontologies include: attribute labels and tooltips; navigation and browsing structures and trees; menu structures; auto-completion of entered data; contextual dropdown list choices; spell checkers; online help systems; etc.
Posted:September 20, 2009

The Unbearable Lightness of Being, by Milan KunderaA Technique is Neither a ‘Meme’ nor a Philosophy

I have been a participant in an interesting series of discussions recently: Whither goes ‘linked data’?

As I described to someone, I was clearly not a father to the idea of ‘linked data‘, but I was handing out cigars pretty close on to the birth. Chris Bizer and Richard Cyganiak were the innovators that first proposed the original project to the W3C [1]. (Thanks guys!)

From that point forward, now a bit over 2-1/2 years ago, we have seen a massive increase in attention and visibility to the idea of ‘linked data.’ I take a small amount of reflected pride that I helped promote the idea in some way with my early writings.

That visibility was well-deserved. After all, here was the concept:

  • Expose your data in an accessible way on the Web
  • Use Web identifiers (URIs) as the means to uniquely identify that data
  • Use RDF “triples” to describe the relationships between the data.

Much other puffery got layered on to those ideas, but I think those premises are the key basis.

Early Cracks in the Vision

My first personal concern with where linked data was going dealt with an absence of context or conceptual structure for how these new datasets related to one another. I will not repeat those arguments here; simply see many of my blog postings from the past two years or so. Exposing millions of “things” was wonderful, but what did all of that mean? How does one “thing” relate to another “thing”? Are some “things” the same as or similar to other things? If nothing else, these concerns stimulated the genesis of the UMBEL subject concept ontology, an outcome for which I need to thank the community.

It would be petty of me to question the basis that attracted millions of data items to get exposed from linked data techniques. In fact, the richness we have today in exposed Web data objects comes solely from this linked data initiative. But, nonetheless, my guess is that even the most ardent linked data advocate would have a hard time finding a logical way to present the current linked data reality in context. We see the big bubble diagram of available datasets, but, frankly, the position and relationships amongst datasets appears somewhat arbitrary. We have lots of bubbles, but little meaning.

The Constant is Transition

The semantic Web was in serious crisis prior to linked data. It had bad perception, little delivery, and unmet hype. Linked data at least began to show how exposed and properly characterized data can begin to become interconnected.

For a couple of years now I have tried in various posts to present linked data in a broader framework of structured and semantic Web data.  I first tried to capture this continuum in a diagram from July 2007:

Transition in Web Structure
Document Web Structured Web Semantic Web
Linked Data
  • Document-centric
  • Document resources
  • Unstructured data and semi-structured data
  • HTML
  • URL-centric
  • circa 1993
  • Data-centric
  • Structured data
  • Semi-structured data and structured data
  • XML, JSON, RDF, etc
  • URI-centric
  • circa 2003
  • Data-centric
  • Linked data
  • Semi-structured data and structured data
  • RDF, RDF-S
  • URI-centric
  • circa 2006
  • Data-centric
  • Linked data
  • Semi-structured data and structured data
  • URI-centric
  • circa ???

The point is not whether those earlier characterizations were “correct”, but that linked data be properly seen as merely a natural step in an ongoing transition. IMO, we are progressing nicely along this spectrum.

A Caricature of Itself

Linked data is a set of techniques — nothing more — and certainly not a philosophy or meme (whatever the hell that means). We have way too many breathy pontifications about “linked data this” and “linked data that” that frankly are undercutting the usefulness of the practice and making it a caricature of itself.

In the enterprise world we see similar attempts at marketing that need to give everything a three-letter acronym. In this case, we have a bunch of academics and researchers trying to act like market and business gurus. All it is doing is confusing the marketplace and hurting the practice.

The elevation of techniques or best practices into roles clearly beyond their pay grade produces completely the opposite effect:  the idea comes under question and ridicule. The logic and rationale for why we should be following these best practices gets lost in the hyperbole. I spend most of my time hitting the delete button on the mailing lists. I fear what others new to these practices — that is, my company’s customers and prospects — perceive when they look into this topic.

Linked data is useful and needed. But come on, folks, these are not tribal or religious matters.

Declaring Victory, and Moving On

Through the initial project vehicle of DBpedia and then how it nucleated other “linked” data sets, the linked data practice certainly became viral. Today, we have many millions of data items available in linked data form. This is unalloyed goodness.

I will continue to use the phrase ‘linked data’ to refer to those useful techniques noted in the opening. Actually, I think it is best to think of linked data as a set of best practices, but by no means an end unto itself.

Beyond linked data we need context, we need our data to be embedded and related to interoperable ontologies, we need much better user interfaces and attainability, and we need quality in our assertions and use. These are issues that extend well beyond the techniques of linked data and form the next set of challenges in gaining broader acceptance for the semantic Web and the semantic enterprise.

Like most everything else in this world, there are real problems and real needs out there. Thankfully, we have heard mostly the end of the silliness about Web 3.0.  Perhaps we can now also broaden our horizons beyond the useful techniques of linked data to tackle the next set of semantic challenges.

So, let me be the first to congratulate the community on a victory well achieved! As for myself and my company, we will now focus our attentions on the next tier of challenges. It is time to deprecate the rhetoric. Huzzah!

[1] For the record, in addition to Bizer and Cyganiak, the first publication on the project, “Interlinking Open Data on the Web”, in the Proceedings Poster Track, ESWC2007, Innsbruck, Austria, June 2007, by Bizer, Tom Heath, Danny Ayers and Yves Raimond, also noted the early contributions of Sören Auer, Orri Erling, Frederick Giasson, Kingsley Idehen, Georgi Kobilarov, Stefano Mazzocchi, Josh Tauberer, Bernard Vatant and Marc Wick.

Posted by AI3's author, Mike Bergman Posted on September 20, 2009 at 8:09 pm in Linked Data, Semantic Web, Structured Web | Comments (5)
The URI link reference to this post is:
The URI to trackback this post is: