On behalf of Structured Dynamics, I am pleased to announce our release into the open source community of irON — the instance record and Object Notation — and its family of frameworks and tools. With irON, you can now author and conduct business solely in the formats and tools most familiar and comfortable to you, all the while enabling your data to interact with the semantic Web.
irON is an abstract notation and associated vocabulary for specifying RDF triples and schema in non-RDF forms. Its purpose is to allow users and tools in non-RDF formats to stage interoperable datasets using RDF. The notation supports writing RDF and schema in JSON (irJSON), XML (irXML) and comma-delimited CSV (commON) formats.
The surprising thing about irON is that — by following its simple conventions and vocabulary — you will be authoring and creating interoperable RDF datasets without doing much different than your normal practice.
This first specification for the irON notation includes guidance for creating instance records (including in bulk), linkages to existing ontologies and schema, and schema definitions. In this newly published irON specification, profiles and examples are also provided for each of the irXML, irJSON and commON serializations. The irON release also includes a number of parsers and converters of the specification into RDF. Data ingested in the irON frameworks can also be exported as RDF and staged as linked data.
The objective of irON is to make it easy for data owners to author, read and publish data. This means the starting format should be a human readable, easily writable means for authoring and conveying instance records (that is, instances and their attributes and assigned values) and the datasets that contain them. Among other things, this means that irON’s notation does not use RDF “triples”, but rather the native notations of the host serializations.
irON is premised on a number of practical considerations and observations.
The irON notation and vocabulary is designed to allow the conceptual structure (“schema”) of datasets to be described, to facilitate easy description of the instance records that populate those datasets, and to link different structures for different schema to one another. In these manners, more-or-less complete RDF data structures and instances can be described in alternate formats and be made interoperable. irON provides a simple and naïve information exchange notation expressive enough to describe most any data entity.
The notation also provides a framework for extending existing schema. This means that irON and its three serializations can represent many existing, common data formats and standards, while also providing a vehicle for extending them. Another intent of the specification is to be sparse in terms of requirements. For instance, the reserved vocabulary is fairly minimal and optional in nearly all cases. The irON specification supports skeletal submissions.
The aim of irON is to describe instance records. An instance record is simply a means to represent and convey the information (“attributes”) describing a given instance. An instance is the thing at hand, and need not represent an individual; it could, for example, represent the entire holdings or collection of books in a given library. Such instance records are also known as the ABox. The simple design of irON is in keeping with the limited roles and work associated with this ABox role.
Attributes provide descriptive characteristics for each instance. Every attribute is matched with a value, which can range from descriptive text strings to lists or numeric values. This design is in keeping with simple attribute-value pairs where, in using the terminology of RDF triples, the subject is the instance itself, the predicate is the attribute, and the object is the value. irON has a vocabulary of about 40 reserved attribute terms, though only two are ever required, with a few others strongly recommended for interoperability and interface rendering purposes.
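To make the attribute-value design concrete, here is a minimal, purely illustrative sketch of what a small dataset and one instance record might look like in the irJSON serialization. The structure and attribute names shown (dataset, recordList, prefLabel and the domain attributes) are assumptions for this example only; the actual reserved vocabulary, required attributes and structural keywords are defined in the irON specification itself.

```json
{
    "dataset": {
        "id": "http://example.com/datasets/books/",
        "prefLabel": "Example Book Records"
    },
    "recordList": [
        {
            "id": "book-1",
            "type": "Book",
            "prefLabel": "Moby Dick",
            "author": "Herman Melville",
            "publicationYear": "1851"
        }
    ]
}
```

Read as RDF, each attribute-value pair inside the record becomes a triple: the record (book-1) is the subject, the attribute (author) is the predicate, and the value (“Herman Melville”) is the object.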
A dataset is an aggregation of instance records that maintains the link between the instance records and their source (provenance). It is also the container for transmitting those records and providing any metadata descriptions desired. A dataset can be split into multiple dataset slices, each written to its own serialized file. Every slice of a dataset shares the same <id> as the parent dataset.
Instances can also be assigned to types, which provide the set or classificatory structure for how to relate certain kinds of things (instances) to other kinds of things. The organizational relationships of these types and attributes are described in a schema. irON also has conventions and notations for describing the linkage of attributes and types in a given dataset to existing schema. These linkages are often mapped to established ontologies.
Each of these irON concepts (records, attributes, types, datasets, schema and linkages) shares similar notations, with keywords signaling to the irON parsers and converters how to interpret incoming files and data. There are also provisions for metadata, namespaces, and local and global references.
In these manners, irON and its three serializations can capture virtually the entire scope and power of RDF as a data model, but with simpler and familiar terminology and constructs expected for each serialization.
For different reasons and for different audiences, the formats of XML, JSON and CSV (spreadsheets) were chosen as the representative formats across which to formulate the abstract irON notation.
XML, or eXtensible Markup Language, has become the leading data exchange format and syntax for modern applications. It is frequently adopted by industry groups for standards and standard exchange formats. There is a rich diversity of tools that support the language, importantly including capable parsers and query languages. There is also a serialization of RDF in XML. As implemented in the irON notation, we call this serialization irXML.
CSV, or comma-separated values, is a format that has been in existence for decades. It was made famous by Microsoft as a spreadsheet exchange format, which makes CSV very useful since spreadsheets are the most prevalent data authoring environment in existence. CSV is a less expressive and capable data format than the other irON serializations, yet still has an attribute-value pair orientation. And, via spreadsheets, datasets can be easily authored and inspected, while also providing a rich functional environment including sorting, formatting, data validation, calculations, macros, etc. As implemented in the irON notation, we call this serialization commON.
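To give a flavor of the spreadsheet orientation, a comparable sketch of the same illustrative book records in commON might look something like the rows below, with one record per row and one attribute per column. Again, this is only indicative: commON defines its own reserved keywords and processing conventions in the specification, which are not shown here.

```csv
id,type,prefLabel,author,publicationYear
book-1,Book,Moby Dick,Herman Melville,1851
book-2,Book,Walden,Henry David Thoreau,1854
```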
The following diagram shows how these three formats relate to irON and then the canonical RDF target data model:
We have used the unique differences amongst XML, JSON and CSV to guide the overarching abstract notation within irON. Note the round-tripping implications of the framework.
One exciting prospect for the design is how, merely by following the simple conventions within irON, each of these three data formats — and RDF! — can be used more-or-less interchangeably, and can be used to extend existing schema within their domains.
This first release of irON is version 0.8. Updates and revisions are likely with use. Here are some key links for irON:
In addition, within the next week we will be publishing a case study of converting the Sweet Tools dataset of semantic Web and related tools to commON.
The irON specification and notation by Structured Dynamics LLC is licensed under a Creative Commons Attribution-Share Alike 3.0 license. irON’s parsers and converters are available under the Apache License, Version 2.0.
irON is an important piece in the semantic enterprise puzzle that we are building at Structured Dynamics. It reflects our belief that knowledge workers should be able to author and create interoperable datasets without having to learn the arcana of RDF. At the same time, we also believe that RDF is the appropriate data model for interoperability. irON is an expression of our belief that many data formats have appropriate places and uses; there is no need to insist on a single format.
We would like to thank Dr. Jim Pitman for his advocacy of the importance of human-readable and easily authored datasets and formats. Via his leadership of the Bibliographic Knowledge Network (BKN) project and our contractual relationship with it, we have learned much regarding the BKN’s own format, BibJSON. Experience with this format has been a catalytic influence in our own work on irON.
— Mike Bergman and Fred Giasson, editors
Attribute-values can also be presented as pairs in the form of an associative array, where the first item listed is the attribute, often followed by a separator such as the colon, and then the value. JSON and many simple data struct notations follow this format. This format may also be called attribute-value pairs, key-value pairs, name-value pairs, alists or others. In these cases the “object” is implied, or is introduced as the name of the array.
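For example, a simple associative array for a single, hypothetical book instance might look like the following, with each attribute on the left and its value on the right:

```json
{
    "title": "Moby Dick",
    "author": "Herman Melville",
    "publicationYear": 1851
}
```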
Though not the first, I have been touting the Linked Data Law for a couple of years now. But in a conversation last week, I found that my colleague did not find the premise very clear. I suspect that is due both to cryptic language on my part and the fact no one has really tackled the topic with focus. So, in this post, I try to redress that and also comment on the related role of linked data in the semantic enterprise.
Adding connections to existing information via linked data is a powerful force multiplier, similar to Metcalfe’s law for how the value of a network increases with more users (nodes). I have come to call this the Linked Data Law: the value of a linked data network is proportional to the square of the number of links between data objects.
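In symbols, with L the number of links between data objects, the claim is simply that value grows quadratically in the links:

$$V \propto L^{2}$$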
An early direct mention of the semantic Web and its possible ability to generate network effects comes from a 2003 Mitre report for the government. In it, the authors state, “At present a very small proportion of the data exposed on the web is marked up using Semantic Web vocabularies like RDF and OWL. As more data gets mapped to ontologies, the potential exists to achieve a ‘network effect’.” Prescient, for sure.
In July 2006, both Henry Story and Dion Hinchliffe discussed Metcalfe’s law, with Henry specifically looking to relate it to the semantic Web. He noted that his initial intuition was that “the value of your information grows exponentially with your ability to combine it with new information.” He added that he was trying to find ways to adapt Metcalfe’s law for applicability to the semantic Web.
I picked up on those observations and commented to Henry at that time and in my own post, “The Exponential Driver of Combining Information.” I have been enamoured of the idea ever since, and have begun to weave the idea into my writings.
More recently, in late 2008, James Hendler and Jennifer Golbeck devoted an entire paper to Metcalfe’s law and the semantic Web. In it, they note:
“This linking between ontologies, and between instances in documents that refer to terms in another ontology, is where much of the latent value of the Semantic Web lies. The vocabularies, and particularly linked vocabularies using URIs, of the Semantic Web create a graph space with the ability to link any term to any other. As this link space grows with the use of RDF and OWL, Metcalfe’s law will once again be exploited – the more terms to link to, and the more links created, the more value in creating more terms and linking them in.”
Metcalfe’s law states that the value of a telecommunications network is proportional to the square of the number of users of the system (n²) (note: it is not exponential, as some of the points above imply). Robert Metcalfe formulated it about 1980 in relation to Ethernet and fax machines; the “law” was then named for Metcalfe and popularized by George Gilder in 1993.
These attempts to estimate the value of physical networks were in keeping with earlier efforts to estimate the value of a broadcast network. That value is almost universally agreed to be proportional to the number of users, a relationship known as Sarnoff’s law (see further below).
The actual algorithm proposed by Metcalfe calculates the number of unique connections in a network with n nodes to be n(n − 1)/2, which is proportional to n². This makes Metcalfe’s law a quadratic growth equation.
As nodes get added, then, we see the following increase in connections:
This diagram, modified from Wikipedia to be a horizontal image, shows how two telephones can make only one connection, five can make 10 connections, and twelve can make 66 connections, etc.
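Written out, the count of unique pairwise connections is just the number of ways to choose two nodes, which reproduces the figures in the diagram:

$$C(n) = \frac{n(n-1)}{2}, \qquad C(2) = 1, \quad C(5) = 10, \quad C(12) = 66$$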
By definition, a physical network is a connected network. Thus, every time a new node is added to the network, connections are added, too. This general formula has also been embraced as a way to discuss social connections on the Internet.
Like a physical network, the semantic Web or semantic enterprise is, at bottom, a graph of interconnected nodes.
The idea behind linked data is to make connections between data. Unlike physical telecommunication networks, however, the nodes in the form of datasets and data are (largely) already there. What is missing are the connections. The build-out and growth that produces the network effects in a linked data context do not result from adding more nodes, but from the linking or connecting of existing nodes.
The fact that adding a node to a physical network carries with it an associated connection has tended to conjoin these two complementary requirements of node and connection. But, to grok the real dynamics and to gain network effects, we need to realize: Both nodes and connections are necessary.
One circumstance of the enterprise is that data nodes are everywhere. The fact that the overwhelming majority are unconnected is why we have adopted the popular colloquialism of data “silos”. There are also massive amounts of unconnected data on the Web in the form of dynamic databases only accessible via search forms, and isolated data tables and listings virtually everywhere.
Thus, the essence of the semantic enterprise and the semantic Web is no more complicated than connecting — meaningfully — data nodes that already exist.
As the following diagram shows, unconnected data nodes or silos look like random particles caught in the chaos of Brownian motion:
As initial connections get made, bits of structure begin to emerge. But, as connections are proliferated — exactly equivalent to the network effects of connected networks — coherence and value emerge.
Look at the last part in the series diagram above. We not only see that the same nodes are now all connected, with the inferences and relationships that result from those connections, but we can also see entirely new structures emerge by virtue of those connections. All of this structure and meaning was totally absent prior to making the linked data connections.
So, what is the benefit of this linked data? It depends on the product of the value of the connections and the multiplier of the network effect:
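In rough symbols (a sketch of the idea only, not a formal model), with the candidate multipliers drawn from the laws discussed in this post:

$$\text{Benefit} \;\approx\; \alpha \times f(n)$$

Here α is the value “constant” reflecting the quality, coherence and intended uses of the data, and f(n) is whichever network-effect multiplier one believes applies: n under Sarnoff’s law (broadcast), n(n − 1)/2 under Metcalfe’s law (pairwise connections), or 2ⁿ under Reed’s law (group-forming networks).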
Just as it is hard to have a conversation via phone with yourself, or to collaborate with yourself, the ability to gain perspective and context from data comes from connections. But like some phone calls or some collaborations, the value depends on the participants. In the case of linked data, that depends on the quality of the data and its coherence. The value “constant” for connected linked data depends in some manner on these factors, as well as the purposes and circumstances to which that linked data might be applied.
Even in physical networks or social collaboration contexts, the “value” of the network has been hard to quantify. And, while academics and researchers will appropriately and naturally call for more research on these questions, we do not need to be so timid. Whatever the alpha constant is for quantifying the value of a linked data network, our intuition should be clear that making connections, finding relationships, making inferences, and making discoveries cannot occur when data sits in isolation.
Because I am an advocate, I believe this alpha constant of value to be quite large. I believe this constant is also higher for circumstances of business intelligence, knowledge management and discovery.
The second part of the benefit equation is the multiplier for network effects. We’ve mentioned before the linear growth advantage of broadcast networks (Sarnoff’s law) and the standard quadratic growth assumption of physical and social networks (Metcalfe’s law). Naturally, there have been other estimates and advocacies.
David Reed, for example, also adds group effects and has asserted an exponential multiplier to the network effect (like Henry Story’s initial intuition noted above). As he states,
“[E]ven Metcalfe’s Law understates the value created by a group-forming network [GFN] as it grows. Let’s say you have a GFN with n members. If you add up all the potential two-person groups, three-person groups, and so on that those members could form, the number of possible groups equals 2ⁿ. So the value of a GFN increases exponentially, in proportion to 2ⁿ. I call that Reed’s Law. And its implications are profound.”
Yet not all agree with the assertion of an exponential multiplier, let alone the quadratic one of Metcalfe. Odlyzko and Tilly note that Metcalfe’s law would hold if the value an individual gets personally from a network were directly proportional to the number of people in that network. But they then argue that this does not hold because of local preferences or different qualities of interaction. In a linked data context, such arguments have merit, though you may also want to see Metcalfe’s own counter-arguments.
Hinchliffe’s earlier commentary provided a nice graphic that shows the implications of these various multipliers on the network effect, as a function of nodes in a network:
I believe we can dismiss the lower linear bound of this question, and likely the higher exponential one as well (that is, Reed’s law), because quality and relevance questions make some linked data connections less valuable than others. Per the above, that would suggest that the multiplier of the linked data network is perhaps closer to the Metcalfe estimate or similar.
In any event, it is also essential to point out that connecting data indiscriminately for linked data’s sake will likely deliver few, if any, benefits. Connections must still be coherent and logical for the value benefits to be realized.
I elsewhere discuss the role of linked data in the enterprise and will continue to do so. But, there are some implications in the above that warrant some further observations.
It should be clear that the graph and network basis of linked data, not to mention some of the uncertainties as to quantifying benefits, suggests the practice should be considered apart from mission-critical or transactional uses in the enterprise. That may change with time and experience.
There are also open questions about data quality in terms of inputs to linked data, and about possibly erroneous semantics and ontologies guiding the linked connections. Operational uses should be kept off the table for now. Like physical networks, not all links perform well and not all have usefulness. Just as poor connections in physical networks are dropped or relegated to back-up status, poor-quality links should be treated the same way. Linked data should be understood and treated no differently than networks of variable quality.
Such realism is important — for both internal and external linked data advocates — to allow linked data to be applied in the right venues at acceptable risk and with likely demonstrable benefits. Elsewhere I have advocated an approach that builds on existing assets; here I advocate a clear and smart understanding of where linked data can best deliver network effects in the near term.
And, so, in the nearest term, enterprise applications that best fit linked data promises and uncertainties include:
As in the Wild West, the new deputy marshal and his tin badge did not guarantee prosperity. But a good marshal would deliver law and order. And those are the preconditions for the town folk to take charge of building their own prosperity.
Linked data is a practice for starting to bring order and connections to your existing data. Once some order has been imposed, the framework then becomes a basis for defining meanings and then gaining value from those connections.
Once order has been gained, it is up to the good citizens of Data Gulch to then deliver the prosperity. Broad participation and the network effect are one way to promote that aim. But success and prosperity still depend on intelligence and good policies and practice.
The idea of the ‘long tail’ was brilliant, and Chris Anderson’s meme has become part of our current lingo in record time. The long tail is the colloquial name for a common feature in some statistical distributions where an initial high-amplitude peak within a population distribution is followed by a rapid decline and then a relatively stable, declining low-amplitude population that “tails off.” (An asymptotic curve.) This sounds fancy; it really is not. It simply means that a very few things are very popular or topical, while most everything else is not.
The following graph is a typical depiction of such a statistical distribution with the long tail shown in yellow. Such distributions often go by the names of power laws, Zipf distributions, Pareto distributions or general Lévy distributions. (Generally, such curves plot as straight lines on a log-log scale, with the slope of the line being an expression of the distribution’s “power”.)
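The straight line is simple algebra: a power-law relationship becomes linear once both axes are logarithmic,

$$y = c\,x^{-k} \quad\Longrightarrow\quad \log y = \log c - k \log x,$$

so the slope of the plotted line is just −k, the “power” of the distribution.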
It is a common observation that virtually everything measurable on the Internet — site popularity, site traffic, ad revenues, tag frequencies on del.icio.us, open source downloads by title, Web sites chosen to be digg’ed, Google search terms — follows such power laws or curves.
However, the real argument that Anderson made first in Wired magazine and then in his 2006 book, The Long Tail: Why the Future of Business is Selling Less of More, is that the Internet with either electronic or distributed fulfillment means that the cumulative provision of items in the long tail is now enabling the economics of some companies to move from “mass” commodities to “specialized” desires. Or, more simply put: There is money to be made in catering to individualized tastes.
I, too, agree with this argument, and it is a brilliant recognition of the fact that the Internet changes everything.
Yet what is amazing about this observation of long tails on the Internet has been the total lack of discussion of its natural reciprocal: namely, long tails have teeny heads, the red portion of the diagram. For, after all, what also is the curve above telling us? While Anderson’s point is that Amazon can carry millions of book titles and still make a profit by selling only a few copies of each, what is going on at the other end of the curve — the head end?
Well, if we’re thinking about book sales, we can make the natural and expected observation that the head end of the curve represents sales of the best seller books; that is, all of those things in the old 80-20 world that is now being blown away with the long tail economics of the Internet. Given today’s understandings, this observation is pretty prosaic since it forms the basis of Anderson’s new long tail argument. Pre-Internet limits (it’s almost like saying before the Industrial Revolution) kept diversity low and choices few.
Okaaaay! Now that seems to make sense. But aren’t we still missing something? Indeed we are.
So, when we look at many of those aspects that make up what is known as Web 2.0 or even the emerging semantic Web, we see that collaboration and user-submitted content stand at the fore. And our general power law curves then also affirm that it is a very few who supply most of that user-generated content — namely, those at the head end, the teeny heads. If those relative few individuals are not motivated, the engine that drives the social content stalls and stutters. Successful social collaboration sites are the ones that are able to marshal “large numbers of the small percentage.”
The natural question thus arises: What makes those “teeny heads” want to contribute? And what makes it so they want to contribute big — that is, frequently and with dedication? So, suddenly now, here’s a new success factor: to be successful as a collaboration site, you must appeal to the top 1% of users. They will drive your content generation. They are the ‘teeny heads’ at the top of your power curve.
Well, things just got really more difficult. We need tools, mindshare and other intangibles to attract the “1%” that will actually generate our site’s content. But we also need easy frameworks and interfaces for the general Internet population to live comfortably within the long tail.
So, heads or tails? Naahh, that’s the wrong question. Keep flipping until you get both!
Venture capitalists, when the straw gets short or the proverbial hits the fan, are famous for calling for new managerial blood. After all, we did our due diligence on this company, it is not profitable — perhaps even bleeding excessively — so what went wrong?
Actually, to be fair, perhaps the founding entrepreneurs are having the same thoughts. We wrote the business plan, we beat the odds to even get angel and (“Isn’t that special,” says the Church Lady) VC financing, thus we have had affirmation about our markets, technology, team and other aspects from the “smart” money, so why is it not working? Why aren’t we profitable? What went wrong?
Getting external financing from professional VCs is non-trivial and itself is putting a company in the “less-than-0.1% club.” And, of course, getting any financing is hard to do, be it an angel, your own checking account, your spouse or your friends and family. Forsaking Janie’s college education for a chance on a start-up requires tremendous belief and suspension of disbelief for any early investor.
But, the initial financing hurdle has been met. Some time has passed. Neither profits nor the plan are fulfilling themselves. What do we — obviously the smart ones since we put up the money or had the ideas — do about our belief while the expected return is not materializing?
In nearly two decades of mentoring various ventures, I’ve observed that one possible reaction is to look for Superman. If only the company had the right missing individual in a CEO or senior manager position, then many of the current problems would go away. But as my Mom used to say, nothing is easy. Easy answers can lead to uneasy situations. And, I think, the myth of Superman more often than not fits into such a facile error.
When things go wrong (or, at least, are not going as desired), things are tough for all of those with a stake in success. Is the source of discomfort that money was put up and is now at risk of loss? Is it that individuals were supported but are not yet achieving success? Is it ego that due diligence was made but success is looking tenuous? And, if things are going wrong or progress is disappointing, what is the root cause? Is the market needful or ready? Is the technology or product responsive or ready? Is the business model correct? Are other pieces such as partners, advisors, infrastructure, collateral, or whatever in place?
New people do not need to be hired to pose these questions nor to spend purposeful and thoughtful time addressing them. And, even if new people and skills are deemed critical to supplement the skills presently available, expectations that are too high or too superhuman are likely to go unfulfilled, take too long to achieve if they are achievable at all, and cost too much in focus and precious resources.
In fact, pursuing the myth of Superman can actually worsen a current situation for the following reasons:
Raising the Superman option only occurs when a company is in trouble and needs help. The key individuals associated with a startup — Board and management alike — are better advised to concentrate on business model, strategy, execution and maintaining focus than searching for the impossible or (at least) statistically highly unlikely.
When problems arise, look to problem identification and problem-solving approaches before copping out with easy Superman answers.
Efforts should be focused; business models should be clear; execution should be emphasized; resources should be zealously protected and stewarded; questions should be constantly asked; and team efforts and building should be fostered. Patience is not a four-letter word, especially if progress is steady and being accomplished in a cost-effective manner.
Nurturing and training the initial founders and staff is important. Financing would not have been initially achieved without some belief in these individuals. Not performing exactly to plan is, in fact, an expected outcome, not one warranting excoriation.
These positive mindsets are hard to keep when the venture’s performance or sales are not meeting plan. And, of course, some of these instances will warrant abandonment of the venture rather than throwing more good money after bad. There are no guarantees. And mistakes get made.
But make the choice. Commit to the venture and improving its prospects through hard work and engagement, or walk away. Superman is a false middle ground.
Please, don’t get me wrong. Without a doubt some people are better managers, some are better salespeople, some are better intellects, some are better strategists, some are better marketers and some are better networkers than others. Anyone who is superior, committed and a believer in the cause of your venture will likely bring some value. And there are indeed rare individuals and rare circumstances when hiring the right new executive could and should make all of the difference toward success.
The more important point, however, is that startups are more often than not constrained in their team and resources. Be smart about where to spend limited time and focus. Hiring good and even great people is a good focus. Searching for Superman is not. Rather than the impossible combination in a single person, look to a collective team that embodies the needed and valuable traits deemed important for your venture’s success.
My wife and I are not gamblers, and were somewhat surprised to find ourselves at our local destination casino last weekend to see a concert by Boz Scaggs and to spend the night in a high-roller suite with a glassed-in shower and electrically controlled window shades. The only missing piece was a mirror on the ceiling. Of course there was an occasion involved, and from top to bottom we had an absolute, total great time.
The highlight of the whole affair was Boz Scaggs himself and his band. Boz Scaggs goes back to our courtship; and we celebrated our 30th wedding anniversary this year! But, this was not a geriatric trip down memory lane: this was top-drawer, great music and entertainment. Not to get too excessive, but this show was close to one of the best I have ever seen!
I normally would not comment on such matters on this blog. After all, I’m generally trucking down an esoteric trail with an audience that at most fills living rooms, not concert halls (let alone stadiums). Such is the semantic Web today.
But then one of those somethings happened this week: I was asked to go down memory lane and resurrect some of my older posts. I read quite a few from years back, and liked some of what I read. It was actually kinda fun. And, I had forgotten many of these older hobby horses or even that I had written them.
Now, in the original Stone Age days of this blog, namely 4-5 years ago, I had like 20 – 30 readers per day. Today, I’m closer to 2500 per day, and seemingly growing pretty steadily. I also now have a backlog of about 400 prior posts. Most have not been read, or at least not by any notable readership.
So, like Boz Scaggs, I decided I would on occasion bring back one of those older contributions that maybe did not get too much airplay in the older days. And, since these are re-treads, I should also re-introduce them on Friday when the news cycle is slow and no one is really very attentive anyway. I mean, after all, they are only electrons!
So, with this convolution, I’m pleased to introduce this occasional Friday re-release of selected earlier posts. I may make some minor changes to these older posts to make them current or correct typos and such. If I do, I will so note.
I do not have enough historical backlog of posts to warrant a re-tread every Friday. But, on occasion, including this Friday, I will post again. Look for the brown bag symbol on these reprised posts.