Evolution
AI³
Adaptive Information
Adaptive Innovation
Adaptive Infrastructure
a·dap·tive adj. Showing or having a capacity to make fit for new or special situations; flexible; a successful adjustment.

Blogasbörd (cloud version):
Send Email   Get SIOC Profile   Get FOAF Profile   Syndicate full contents for this site using RSS 20
Main Links
Categories
Calendar
March 2010
S M T W T F S
« Feb    
 123456
78910111213
14151617181920
21222324252627
28293031  
Archives
More . . .  
Search
 
Sponsored Links
-->
Affiliations
structWSF
Credits
Blog software courtesy of WordPress Obtain Technorati profile Subscribe with Bloglines
View Mike's profile on LinkedIn
Date:   January 12, 2010

Seven Pillars of the Open Semantic Enterprise

Guideposts for How to Make the Transition

The beginning of a new year and a new decade is a perfect opportunity to take stock of how the world is changing and how we can change with it. Over the past year I have been writing on many foundational topics relevant to the use of semantic technologies in enterprises.

In this post I bring those threads together to present a unified view of these foundations — some seven pillars — to the open semantic enterprise.

By open semantic enterprise we mean an organization that uses the languages and standards of the semantic Web, including RDF, RDFS, OWL, SPARQL and others to integrate existing information assets, using the best practices of linked data and the open world assumption, and targeting knowledge management applications. It does so using some or all of the seven foundational pieces (”pillars”) noted herein.

The foundational approaches to the open semantic enterprise do not necessarily mean open data nor open source (though they are suitable for these purposes with many open source tools available [3]). The techniques can equivalently be applied to internal, closed, proprietary data and structures. The techniques can themselves be used as a basis for bringing external information into the enterprise. ‘Open’ is in reference to the critical use of the open world assumption.

These practices do not require replacing current systems and assets; they can be applied equally to public or proprietary information; and they can be tested and deployed incrementally at low risk and cost. The very foundations of the practice encourage a learn-as-you-go approach and active and agile adaptation. While embracing the open semantic enterprise can lead to quite disruptive benefits and changes, it can be accomplished as such with minimal disruption in itself. This is its most compelling aspect.

Like any change in practice or learning, embracing the open semantic enterprise is fundamentally a people process. This is the pivotal piece to the puzzle, but also the one that does not lend itself to ready formula about pillars or best practices. Leadership and vision is necessary to begin the process. People are the fuel for impelling it. So, we’ll take this fuel as a given below, and concentrate instead on the mechanics and techniques by which this vision can be achieved. In this sense, then, there are really eight pillars to the open semantic enterprise, with people residing at the apex.

This article is synthetic, with links to (largely) my preparatory blog postings and topics that preceded it. Assuming you are interested in becoming one of those leaders who wants to bring the benefits of an open semantic enterprise to your organization, I encourage you to follow the reference links for more background and detail.

Benefits A Review of the Benefits

OK, so what’s the big deal about an open semantic enterprise and why should my organization care?

We should first be clear that the natural scope of the open semantic enterprise is in knowledge management and representation [1]. Suitable applications include data federation, data warehousing, search, enterprise information integration, business intelligence, competitive intelligence, knowledge representation, and so forth [2]. In the knowledge domain, the benefits for embracing the open semantic enterprise can be summarized as greater insight with lower risk, lower cost, faster deployment, and more agile responsiveness.

The intersection of knowledge domain, semantic technologies and the approaches herein means it is possible to start small in testing the transition to a semantic enterprise. These efforts can be done incrementally and with a focus on early, high-value applications and domains.

There is absolutely no need to abandon past practices. There is much that can be done to leverage existing assets. Indeed, those prior investments are often the requisite starting basis to inform semantic initiatives.

Embracing the pillars of the open semantic enterprise brings these knowledge management benefits:

  • Domains can be analyzed and inspected incrementally
  • Schema can be incomplete and developed and refined incrementally
  • The data and the structures within these frameworks can be used and expressed in a piecemeal or incomplete manner
  • Data with partial characterizations can be combined with other data having complete characterizations
  • Systems built with these frameworks are flexible and robust; as new information or structure is gained, it can be incorporated without negating the information already resident, and
  • Both open and closed world subsystems can be bridged.

Moreover, by building on successful Web architectures, we can also put in place loosely coupled, distributed systems that can grow and interoperate in a decentralized manner. These also happen to be perfect architectures for flexible collaboration systems and networks.

These benefits arise both from individual pillars in the open semantic enterprise foundation, as well as in the interactions between them. Let’s now re-introduce these seven pillars.

Pillar #1Pillar #1: The RDF Data Model

As I stated on the occasion of the 10th birthday of the Resource Description Framework data model, I belief RDF is the single most important foundation to the open semantic enterprise [4]. RDF can be applied equally to all structured, semi-structured and unstructured content. By defining new types and predicates, it is possible to create more expressive vocabularies within RDF. This expressiveness enables RDF to define controlled vocabularies with exact semantics. These features make RDF a powerful data model and language for data federation and interoperability across disparate datasets.

Via various processors or extractors, RDF can capture and convey the metadata or information in unstructured (say, text), semi-structured (say, HTML documents) or structured sources (say, standard databases). This makes RDF almost a “universal solvent” for representing data structure.

Because of this universality, there are now more than 150 off-the-shelf ‘RDFizers’ for converting various non-RDF notations (data formats and serializations) to RDF [5]. Because of its diversity of serializations and simple data model, it is also easy to create new converters. Once in a common RDF representation, it is easy to incorporate new datasets or new attributes. It is also easy to aggregate disparate data sources as if they came from a single source. This enables meaningful compositions of data from different applications regardless of format or serialization.

What this practically means is that the integration layer can be based on RDF, but that all source data and schema can still reside in their native forms [6]. If it is easier or more convenient to author, transfer or represent data in non-RDF forms, great [7]. RDF is only necessary at the point of federation, and not all knowledge workers need be versed in the framework.

Pillar #2 Pillar #2: Linked Data Techniques

Linked data is a set of best practices for publishing and deploying instance and class data using the RDF data model. Two of the best practices are to name the data objects using uniform resource identifiers (URIs), and to expose the data for access via the HTTP protocol. Both of these practices enable the Web to become a distributed database, which also means that Web architectures can also be readily employed (see Pillar #5 below).

Linked data is applicable to public or enterprise data, open or proprietary. It is really straightforward to employ. Structured Dynamics has published a useful FAQ on linked data.

Additional linked data best practices relate to how to characterize and classify data, especially in the use of predicates with the proper semantics for establishing the degree of relatedness for linked data items from disparate sources.

Linked data has been a frequent topic of this blog, including how adding linkages creates value for existing data, with a four-part series about a year ago on linked data best practices [8]. As advocated by Structured Dynamics, our linked data best practices are geared to data interconnections, interrelationships and context that is equally useful to both humans and machine agents.

Pillar #3 Pillar #3: Adaptive Ontologies

Ontologies are the guiding structures for how information is interrelated and made coherent using RDF and its related schema and ontology vocabularies, RDFS and OWL [10]. Thousands of off-the-shelf ontologies exist — a minority of which are suitable for re-use — and new ones appropriate to any domain or scope at hand can be readily constructed.

In standard form, semantic Web ontologies may range from the small and simple to the large and complex, and may perform the roles of defining relationships among concepts, integrating instance data, orienting to other knowledge and domains, or mapping to other schema [11]. These are explicit uses in the way that we construct ontologies; we also believe it is important to keep concept definitions and relationships expressed separately from instance data and their attributes [9].

But, in addition to these standard roles, we also look to ontologies to stand on their own as guiding structures for ontology-driven applications (see next pillar). With a relatively few minor and new best practices, ontologies can take on the double role of informing user interfaces in addition to standard information integration.

In this vein we term our structures adaptive ontologies [11,12,13]. Some of the user interface considerations that can be driven by adaptive ontologies include: attribute labels and tooltips; navigation and browsing structures and trees; menu structures; auto-completion of entered data; contextual dropdown list choices; spell checkers; online help systems; etc. Put another way, what makes an ontology adaptive is to supplement the standard machine-readable purpose of ontologies to add human-readable labels, synonyms, definitions and the like.

A neat trick occurs with this slight expansion of roles. The knowledge management effort can now shift to the actual description, nature and relationships of the information environment. In other words, ontologies themselves become the focus of effort and development. The KM problem no longer needs to be abstracted to the IT department or third-party software. The actual concepts, terminology and relations that comprise coherent ontologies now become the explicit focus of KM activities.

Any existing structure (or multiples thereof) can become a starting basis for these ontologies and their vocabularies, from spreadsheets to naïve data structures and lists and taxonomies. So, while producing an operating ontology that meets the best practice thresholds noted herein has certain requirements, kicking off or contributing to this process poses few technical or technology demands.

The skills needed to create these adaptive ontologies are logic, coherent thinking and domain knowledge. That is, any subject matter expert or knowledge worker likely has the necessary skills to contribute to useful ontology development and refinement. With adaptive ontologies powering ontology-driven apps (see next), we thus see a shift in roles and responsibilities away from IT to the knowledge workers themselves. This shift acts to democratize the knowledge management function and flatten the organization.

Pillar #4 Pillar #4: Ontology-driven Applications

The complement to adaptive ontologies are ontology-driven applications. By definition, ontology-driven apps are modular, generic software applications designed to operate in accordance with the specifications contained in an adaptive ontology. The relationships and structure of the information driving these applications are based on the standard functions and roles of ontologies, as supplemented by the human and user interface roles noted above [11,12,13].

Ontology-driven apps fulfill specific generic tasks. Examples of current ontology-driven apps include imports and exports in various formats, dataset creation and management, data record creation and management, reporting, browsing, searching, data visualization, user access rights and permissions, and similar. These applications provide their specific functionality in response to the specifications in the ontologies fed to them.

The applications are designed more similarly to widgets or API-based frameworks than to the dedicated software of the past, though the dedicated functionality (e.g., graphing, reporting, etc.) is obviously quite similar. The major change in these ontology-driven apps is to accommodate a relatively common abstraction layer that responds to the structure and conventions of the guiding ontologies. The major advantage is that single generic applications can supply shared functionality based on any properly constructed adaptive ontology.

This design thus limits software brittleness and maximizes software re-use. Moreover, as noted above, it shifts the locus of effort from software development and maintenance to the creation and modification of knowledge structures. The KM emphasis can shift from programming and software to logic and terminology [12].

Pillar #5 Pillar #5: A Web-oriented Architecture

A Web-oriented architecture (WOA) is a subset of the service-oriented architectural (SOA) style, wherein discrete functions are packaged into modular and shareable elements (”services”) that are made available in a distributed and loosely coupled manner. WOA uses the representational state transfer (REST) style. REST provides principles for how resources are defined and used and addressed with simple interfaces without additional messaging layers such as SOAP or RPC. The principles are couched within the framework of a generalized architectural style and are not limited to the Web, though they are a foundation to it [14].

REST and WOA stand in contrast to earlier Web service styles that are often known by the WS-* acronym (such as WSDL, etc.). WOA has proven itself to be highly scalable and robust for decentralized users since all messages and interactions are self-contained.

Enterprises have much to learn from the Web’s success. WOA has a simple design with REST and idempotent operations, simple messaging, distributed and modular services, and simple interfaces. It has a natural synergy with linked data via the use of URI identifiers and the HTTP transport protocol. As we see with the explosion of searchable dynamic databases exposed via the Web, so too can we envision the same architecture and design providing a distributed framework for data federation. Our daily experience with browser access of the Web shows how incredibly diverse and distributed systems can meaningfully interoperate [15].

This same architecture has worked beautifully in linking documents; it is now pointing the way to linking data; and we are seeing but the first phases of linking people and groups together via meaningful collaboration. While generally based on only the most rudimentary basis of connections, today’s social networking platforms are changing the nature of contacts and interaction.

The foundations herein provide a basis for marrying data and documents in a design geared from the ground up for collaboration. These capabilities are proven and deployable today. The only unclear aspects will be the scale and nature of the benefits [16].

Pillar #6 Pillar #6: An Incremental, Layered Approach

To this point, you’ll note that we have been speaking in what are essentially “layers”. We began with existing assets, both internal and external, in many diverse formats. These are then converted or transformed into RDF-capable forms. These various sources are then exposed via a WOA Web services layer for distributed and loosely-coupled access. Then, we integrate and federate this information via adaptive ontologies, which then can be searched, inspected and managed via ontology-driven apps. We have presented this layered architecture before [13], and have also expressed this design in relation to current Structured Dynamics’ products [17].

A slight update of this layered view is presented below, made even more general for the purposes of this foundational discussion:

Open Enterprise Architecture
(click to expand)

Semantic technology does not change or alter the fact that most activities of the enterprise are transactional, communicative or documentary in nature. Structured, relational data systems for transactions or records are proven, performant and understood. On its very face, it should be clear that the meaning of these activities — their semantics, if you will — is by nature an augmentation or added layer to how to conduct the activities themselves.

This simple truth affirms that semantic technologies are not a starting basis, then, for these activities, but a way of expressing and interoperating their outcomes. Sure, some semantic understanding and common vocabularies at the front end can help bring consistency and a common language to an enterprise’s activities. This is good practice, and the more that can be done within reason while not stifling innovation, all the better. But we all know that the budget department and function has its own way of doing things separate from sales or R&D. And that is perfectly OK and natural.

Clearly, then, an obvious benefit to the semantic enterprise is to federate across existing data silos. This should be an objective of the first semantic “layer”, and to do so in a way that leverages existing information already in hand. This approach is inherently incremental; if done right, it is also low cost and low risk.

Pillar #7 Pillar #7: The Open World Mindset

As these pillars took shape in our thinking and arguments over the past year, an illusive piece seemed always to be missing. It was like having one of those meaningful dreams, and then waking up in the morning wracking your memory trying to recall that essential, missing insight.

As I most recently wrote [1], that missing piece for this story is the open world assumption (OWA). I argue that this somewhat obscure concept holds within it the key as to why there have been decades of too-frequent failures in the enterprise in business intelligence, data warehousing, data integration and federation, and knowledge management.

Enterprises have been captive to the mindset of traditional relational data management and its (most often unstated) closed world assumption (CWA). Given the success of relational systems for transaction and operational systems — applications for which they are still clearly superior — it is understandable and not surprising that this same mindset has seemed logical for knowledge management problems as well.  But knowledge and KM are by their nature incomplete, changing and uncertain. A closed-world mindset carries with it certainty and logic implications not supportable by real circumstances.

This is not an esoteric point, but a fundamental one. How one thinks about the world and evaluates it is pivotal to what can be learned and how and with what information. Transactions require completeness and performance; insight requires drawing connections in the face of incompleteness or unknowns.

The absolute applicability of the semantic Web stack to an open-world circumstance is the elephant in the room [1]. By itself, the open world mindset provides no assurance of gaining insight or wisdom. But, absent it, we place thresholds on information and understanding that may neither be affordable nor achievable with traditional, closed-world approaches.

And, by either serendipity or some cosmic beauty, the open world mindset also enables incremental development, testing and refinement. Even if my basic argument of the open world advantage for knowledge management purposes is wrong, we can test that premise at low cost and risk. So, within available budget, pick a doable proof-of-concept, and decide for yourself.

Seven Pillars The Foundations for the Open Semantic Enterprise

The seven pillars above are not magic bullets and each is likely not absolutely essential. But, based on today’s understandings and with still-emerging use cases being developed, we can see our open semantic enterprise as resulting from the interplay of these seven factors:

Open Semantic Enterprise

Thirty years of disappointing knowledge management projects and much wasted money and effort compel that better ways must be found. On the other hand, until recently, too much of the semantic Web discussion has been either revolutionary (“change everything!!”) or argued from pie-in-the-sky bases. Something needs to give.

Our work over the past few years — but especially as focused in the last 12 months — tells us that meaningful semantic Web initiatives can be mounted in the enterprise with potentially huge benefits, all at manageable risks and costs. These seven pillars point to way to how this might happen. What is now required is that eighth pillar — you.


[1] See, M.K. Bergman, 2009. “The Open World Assumption: Elephant in the Room“, AI3:::Adaptive Information blog, December 21, 2009.
[2] In most instances, semantic technologies are poorly suited to transactional or operational applications. Also, there are instances in modeling specific closed-world domains where ontologies can be quite useful, such as in aerospace, petrochemicals, engineering, etc., where the scope of the domain can be precisely bounded and defined. Such efforts tend to be high cost with lengthy lead times. There are vendors who support efforts in these areas, though my company, Structured Dynamics, does not. Our focus and the more generally suitable case for semantic technologies we believe is in knowledge representation and management.
[3] The standard Sweet Tools listing on my AI3:::Adaptive Information blog contains more than 800 semantic Web and -related tools, most of which are open source, which can be inspected via filtered and faceted search.
[4] See, M.K. Bergman, 2009. “Advantages and Myths of RDF”, AI3:::Adaptive Information blog, April 8, 2009.
[5] For example, see this listing of more than 150 specific format options available as open source. These converters can also work directly with major application APIs.
[6] For an expansion on RDF as a canonical data model, see further M.K. Bergman, 2009. “Structure the World”, AI3:::Adaptive Information blog, August 3, 2009.
[7] For example, for dataset authoring, Structured Dynamics has developed irON, an instance record and object notation that can be serialized as JSON (called irJSON), XML (called irXML) or comma-separated values (or CSV comma-delimited files, called commON). The purpose of these notations is to provide easier authoring environments and scripting support to RDF-ready datasets. The advantage is to shield users from the nuances of RDF. The design of commON is especially geared to using spreadsheets as authoring environments for instance record tables or simple outline structures.  See further the irON specification.
[8] For a general listing of linked data articles, please see that category on this AI3:::Adaptive Information blog. Specific articles of interest include the four-part series on “Making Linked Data Reasonable Using Description Logics” [9] (February 11, February 15, February 18 and February 23, 2009) and the “The Law of Linked Data” (October 11, 2009).
[9] Our best practices approach makes explicit splits between the “ABox” (for instance data) and “TBox” (for ontology schema) in accordance with our working definition for description logics, a fundamental underpinning for how we use RDF:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[10] Those unfamiliar with the term ontology might be interested in my first introduction to the subject: M.K. Bergman, 2007. An Intrepid Guide to Ontologies, AI3:::Adaptive Information blog, May 16, 2007.
[11] See M.K. Bergman, 2009. Ontologies as the ‘Engine’ for Data-Driven Applications, AI3:::Adaptive Information blog, June 10, 2009. This is the most detailed explanation, but the specific term adaptive ontology was not yet used. The first dedicated focus on adaptive ontologies was in “Confronting Misconceptions with Adaptive Ontologies” (August 17, 2009). See also [12] and [13].
[12] See, M.K. Bergman, 2009. “Ontology-driven Applications Using Adaptive Ontologies”, AI3:::Adaptive Information blog, November 23, 2009.
[13] See, M.K. Bergman, 2009. “Fresh Perspectives on the Semantic Enterprise”, AI3:::Adaptive Information blog, September 28, 2009.
[14] See, M.K. Bergman, 2009. “A General Web-oriented Architecture (WOA) for Structured Data”, AI3:::Adaptive Information blog, May 3, 2009. Also, see the related WOA category for other articles in this area.
[15] See, M.K. Bergman, 2008. “WOA: A New Enterprise Partner for Linked Data”, AI3:::Adaptive Information blog, October 12, 2008.
[16] See, M.K. Bergman, 2009. “structWSF: A Framework for Collaboration Networks”, AI3:::Adaptive Information blog, July 2, 2009.
[17] See http://structureddynamics.com/products.html for a general descriptive illustration of Structured Dynamics’ product stack. There is also a longer slideshow, with particular reference to slide #37.

Posted on January 12, 2010 at 3:26 pm in Description Logics, Linked Data, Ontologies, Ontology Best Practices, Semantic Web, Structured Dynamics, Web-oriented Architecture | Comments (11)
The URI link reference to this post is: http://www.mkbergman.com/859/seven-pillars-of-the-open-semantic-enterprise/
The URI to trackback this post is: http://www.mkbergman.com/859/seven-pillars-of-the-open-semantic-enterprise/trackback/
Date:   November 16, 2009

Image Source: www.adhd-mindbydesign.com

High Visibility Problems with NYT, data.gov Show Need for Better Practices

When I say, “shot”, what do you think of? A flu shot? A shot of whisky? A moon shot? A gun shot? What if I add the term “bank”? Do you now think of someone being shot in an armed robbery of a local bank or similar?

And, now, what if I add a reference to say, The Hustler, or Minnesota Fats, or “Fast Eddie” Felson? Do you now see the connection to a pressure-packed banked pool shot in some smoky bar room?

As humans we need context to make connections and remove ambiguity. For machines, with their limited reasoning and inference engines, context and accurate connections are even more important.

Over the past few weeks we have seen announcements of two large and high-visibility linked data projects:  One, a first release of references for articles concerning about 5,000 people from the New York Times at data.nytimes.com; and Two, a massive exposure of 5 billion triples from data.gov datasets provided by the Tetherless World Constellation (TWC) at Rennselaer Polytechnic Institute (RPI).

On various grounds from licensing to data characterization and to creating linked data for its own sake, some prominent commentators have weighed in on what is good and what is not so good with these datasets. One of us, Mike, commented about a week ago that “we have now moved beyond ‘proof of concept’ to the need for actual useful data of trustworthy provenance and proper mapping and characterization. Recent efforts are a disappointment that no enterprise would or could rely upon.”

Reactions to that posting and continued discussion on various mailing lists warrant a more precise dissection of what is wrong and still needs to be done with these datasets [1].

Berners-Lee’s Four Linked Data “Rules”

It is useful, then, to return to first principles, namely the original four “rules” posed by Tim Berners-Lee in his design note on linked data [2]:

  1. Use URIs as names for things
  2. Use HTTP URIs so that people can look up those names
  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
  4. Include links to other URIs so that they can discover more things.

The first two rules are definitional to the idea of linked data. They cement the basis of linked data in the Web, and are not at issue with either of the two linked data projects that are the subject of this posting.

However, it is the lack of specifics and guidance in the last two rules where the breakdowns occur. Both the NYT and the RPI datasets suffer from a lack of “providing useful information” (Rule #3). And, the nature of the links in Rule #4 is a real problem for the NYT dataset.

What Constitutes “Useful Information”?

The Wikipedia entry on linked data expands on “useful information” by augmenting the original rule with the parenthetical clause, ” (i.e., a structured description — metadata).” But even that expansion is insufficient.

Fundamentally, what are we talking about with linked data? Well, we are talking about instances that are characterized by one or more attributes. Those instances exist within contexts of various natures. And, those contexts may relate to other existing contexts.

We can break this problem description down into three parts:

  • A vocabulary that defines the nature of the instances and their descriptive attributes
  • A schema of some nature that describes the structural relationships amongst instances and their characteristics, and, optimally,
  • A mapping to existing external schema or constructs that help place the data into context.

At minimum, ANY dataset exposed as linked data needs to be described by a vocabulary. Both the NYT and RPI datasets fail on this score, as we elaborate below. Better practice is to also provide a schema of relationships in which to embed each instance record. And, best practice is to also map those structures to external schema.

Lacking this “useful information”, especially a defining vocabulary, we cannot begin to understand whether our instances deal with drinks, bank robberies or pool shots. This lack, in essence, makes the information worthless, even though available via URL.

The data.gov (RPI) Case

With the support of NSF and various grant funding, RPI has set up the Data-Gov Wiki [3], which is in the process of converting the datasets on data.gov to RDF, placing them into a semantic wiki to enable comment and annotation, and providing that data as RSS feeds. Other demos are also being placed on the site.

As of the date of this posting, the site had a catalog of 116 datasets from the 800 or so available on data.gov, leading to these statistics:

  • 459,412,419 table entries
  • 5,074,932,510 triples, and
  • 7,564 properties (or attributes).

We’ll take one of these datasets, #319, and look a bit closer at it:

Wiki Title Agency Name data.gov Link No Properties No Triples RDF File
Dataset 319 Consumer Expenditure Survey Department of Labor LABOR-STAT http://www.data.gov/details/319 22 1,583,236 http://data-gov.tw.rpi.edu/raw/319/index.rdf

This report was picked solely because it had a small number of attributes (properties), and is thus easier to screen capture. The summary report on the wiki is shown by this page:

Data-gov-Wiki Dataset #319

(click to expand)

So, we see that this specific dataset contains about 22 of the nearly 8,000 attributes across all datasets.

When we click on one of these attribute names, we are then taken to a specific wiki page that only reiterates its label. There is no definition or explanation.

When we inspect this page further we see that, other than the broad characterization of the dataset itself (the bulk of the page), we see at the bottom 22 undefined attributes with labels such as item code, periodicity code, seasonal, and the like. These attributes are the real structural basis for the data in this dataset.

But, what does all of this mean???

To gain a clue, now let’s go to the source data.gov site for this dataset (#319). Here is how that report looks:

Data.gov Dataset #319

(click to expand)

Contained within this report we see a listing for additional metadata. This link tells us about the various data fields contained in this dataset; we see many of these attributes are “codes” to various data categories.

Probing further into the dataset’s technical documentation, we see that there is indeed a rich structure underneath this report, again provided via various code lookups. There are codes for geography, seasonality (adjusted or not), consumer demographic profiles and a variety of consumption categories. (See, for example, the link to this glossary page.) These are the keys to understanding the actual values within this dataset.

For example, one major dimension of the data is captured by the attribute item_code. The survey breaks down consumption expenditures within the broad categories of  Food, Housing, Apparel and Services, Transportation, Health Care, Entertainment, and Other. Within a category, there is also a rich structural breakdown. For example, expenditures for Bakery Products within Food is given a code of FHC2.

But, nowhere are these codes defined or unlocked in the RDF datasets. This absence is true for virtually all of the datasets exposed on this wiki.

So, for literally billions of triples, and 8,000 attributes, we have ABSOLUTELY NO INFORMATION ABOUT WHAT THE DATA CONTAINS OTHER THAN A PROPERTY LABEL. There is much, much rich value here in data.gov, but all of it remains locked up and hidden.

The sad truth about this data release is that it provides absolutely no value in its current form. We lack the keys to unlock the value.

To be sure, early essential spade work has been done here to begin putting in place the conversion infrastructure for moving text files, spreadsheets and the like to an RDF form. This is yeoman work important to ultimate access. But, until a vocabulary is published that defines the attributes and their codes so we can unlock this value, it will remain hidden. And only when its further value (by connecting attributes and relations across datasets) through a schema of some nature is also published, the real value from connecting the dots will also remain hidden.The Hustler

These datasets may meet the partial conditions of providing clickable URLs, but the crucial “useful information” as to what any of this data means is absent.

Every single dataset on data.gov has supporting references to text files, PDFs, Web pages or the like that describe the nature of the data within each dataset. Until that information is exposed and made usable, we have no linked data.

Until ontologies get created from these technical documents, the value of these data instances remain locked up, and no value can be created from having these datasets expressed in RDF.

The devil lies in the details. The essential hard work has not yet begun.

The NYT Case

Though at a much smaller scale with many fewer attributes, the NYT dataset suffers from the same failing: it too lacks a vocabulary.

So, let’s take the case of one of the lead actors in The Hustler, Paul Newman, who played the role of “Fast Eddie” Felson. Here is the NYT record for the “person” Paul Newman (which they also refer to as http://data.nytimes.com/newman_paul_per). Note the header title of Newman, Paul:

NYT 'Paul Newman Articles' Record

(click to expand)

Click on any of the internal labels used by the NYT for its own attributes (such as nyt:first_use), and you will be given this message:

“An RDFS description and English language documentation for the NYT namespace will be provided soon. Thanks for your patience.”

We again have no idea what is meant by all of this data except for the labels used for its attributes. In this case for nyt:first_use we have a value of “2001-03-18″.

Hello? What? What is a “first use” for a “Paul Newman” of “2001-03-18″???

The NYT put the cart before the horse: even if minimal, they should have released their ontology first — or at least at the same time — as they released their data instances. (See further this discussion about how an ontology creation workflow can be incremental by starting simple and then upgrading as needed.)

Links to Other Things

Since there really are no links to other things on the Data-Gov Wiki, our focus in this section continues with the NYT dataset using our same example.

We now are in the territory of the fourth “rule” of linked data: 4. Include links to other URIs so that they can discover more things.

This will seem a bit basic at first, but before we can talk about linking to other things, we first need to understand and define the starting “thing” to which we are linking.

What is a “Newman, Paul” Thing?

Of course, without its own vocabulary, we are left to deduce what this thing “Newman, Paul“ is that is shown in the previous screen shot. Our first clue comes from the statement that it is of rdf:type SKOS concept. By looking to the SKOS vocabulary, we see that concept is a class and is defined as:

A SKOS concept can be viewed as an idea or notion; a unit of thought. However, what constitutes a unit of thought is subjective, and this definition is meant to be suggestive, rather than restrictive. The notion of a SKOS concept is useful when describing the conceptual or intellectual structure of a knowledge organization system, and when referring to specific ideas or meanings established within a KOS.

We also see that this instance is given a foaf:primaryTopic of Paul Newman.

So, we can deduce so far that this instance is about the concept or idea of Paul Newman. Now, looking to the attributes of this instance — that is the defining properties provided by the NYT — we see the properties of nyt:associated_article_count, nyt:first_use, nyt:last_use and nyt:topicPage. Completing our deductions, and in the absence of its own vocabulary, we can now define this concept instance somewhat as follows:

New York Times articles in the period 2001 to 2009 having as their primary topic the actor Paul Newman

(BTW, across all records in this dataset, we could see what the earliest first use was to better deduce the time period over which these articles have been assembled, but that has not been done.)

We also would re-title this instance more akin to “2001-2009 NYT Articles with a Primary Topic of Paul Newman” or some such and use URIs more akin to this usage.

sameAs Woes

Thus, in order to make links or connections with other data, it is essential to understand what the nature is of the subject “thing” at hand. There is much confusion about actual “things” and the references to “things” and what is the nature of a “thing” within the literature and on mailing lists.

Our belief and usage in matters of the semantic Web is that all “things” we deal with are a reference to whatever the “true”, actual thing is. The question then becomes:  What is the nature (or scope) of this referent?

There are actually quite easy ways to determine this nature. First, look to one or more instance examples of the “thing” being referred to. In our case above, we have the “Newman, Paul” instance record. Then, look to the properties (or attributes) the publisher of that record has used to describe that thing. Again, in the case above, we have nyt:associated_article_count, nyt:first_use, nyt:last_use and nyt:topicPage.

Clearly, this instance record — that is, its nature — deals with articles or groups of articles. The relation to Paul Newman occurs as a basis of the primary topic of these articles, and not a person basis for which to describe the instance. If the nature of the instance was indeed the person Paul Newman, then the attributes of the record would more properly be related to “person” properties such as age, sex, birthdate, death date, marital status, etc.

This confusion by NYT as to the nature of the “things” they are describing then leads to some very serious errors. By confusing the topic (Paul Newman) of a record with the nature of that record (articles about topics), NYT next misuses one of the most powerful semantic Web predicates available, owl:sameAs.

By asserting in the “Newman, Paul” record that the instance has a sameAs relationship with external records in Freebase and DBpedia, the NYT both entails that properties from any of the associated records are shared and infers a chain of other types to describe the record. More precisely, the NYT is asserting that the “thing” referred to by these instances are identical resources.

Thus, by the sameAs statements in the “Newman, Paul” record, the NYT is also asserting that that record is an instance of all these things [5]:

Furthermore, because of its strong, reciprocal entailments, the owl:sameAs assertion would also now entail that the person Paul Newman has the nyt:first_use and nyt:last_use attributes, clearly illogical for a “person” thing.

This connection is clearly wrong in both directions. Articles are not persons and don’t have marital status; and persons do not have first_uses. By misapplying this sameAs linkage relationship, we have screwed things up in every which way. And the error began with misunderstanding what kinds of “things” our data is about.

Some Options

However, there are solutions. First, the sameAs assertions, at least involving these external resources, should be dropped.

Second, if linkages are still desired, a vocabulary such as UMBEL [4] could be used to make an assertion between such a concept, and these other related resources. So, even though these resources are not the same, they are closely related. The UMBEL ontology helps us to define this kind of relation between related, but non-identical, resources.

Instead of using the owl:sameAs property, we would suggest the usage of the umbel:linksEntity, which links a skos:Concept to related named entities resources. Additionally, Freebase, which also currently asserts a sameAs relationship to the NYT resource, could use the umbel:isAbout relationship to assert that their resource “is about” a certain concept, which is the one defined by the NYT.

Alternatively, still other external vocabularies that more precisely capture the intent of the NYT publishers could be found, or the NYT editors could define their own properties specifically addressing their unique linkage interests.

Other Minor Issues

As a couple of additional, minor suggestions for the NYT dataset, we would suggest:

  • Create a foaf:Organization description of the NYT organization, then use it with dc:creator and dcterms:rightsHolder rather than using a literal, and
  • The dual URIs such as “http://data.nytimes.com/N31738445835662083893” and “http://data.nytimes.com/newman_paul_per” are not wrong in themselves, but the purpose is hard to understand. Why does a single organization need to create multiple resources for the identical resource, when it comes from the same system and has the same purpose?

Re-visiting the Linkage “Rule”

There are very valuable benefits from entailment, inference and logic to be gained from linking resources. However, if the nature of the “things” being linked — or the properties that define these linkages — are incorrect, then very wrong logical implications result. Great care and understanding should be applied to linkage assertions.

In the End, the Challenge is Not Linked Data, but Connected Data

Our critical comments are not meant to be disrespectful and are not being picky. The NYT and TWC are prominent institutions for which we should expect leadership on these issues. Our criticisms (and we believe those of others) are also not an expression of a “trough of disillusionment” as some have been pointing out.

This posting has been jointly authored by Mike Bergman and Fred Giasson and simultaneously published on both of their blogs, hoping to draw more attention to the need for better practices in publishing linked data.

This posting is about poor practices, pure and simple. The time to correct them is now. If asked, we would be pleased to help either institution establish exemplar practices. This is not automatic, and it is not always easy. The data.gov datasets, in particular, will require much time and effort to get right. There is much documentation that needs to be transitioned and expressed in semantic Web formats.

In a broader sense, we also seem to lack a definition of best practices related to vocabularies, schema and mappings. The Berners-Lee rules are imprecise and insufficient as is. Prior best guidance documents tend to be more how to publish and make URIs linkable, than to properly characterize, describe and connect the data.

Perhaps, in part, this is a bit of a semantics issue. The challenge is not the mechanics of linking data, but the meaning and basis for connecting that data. Connections require logic and rationality sufficient to reliably inform inference and rule-based engines. It also needs to pass the sniff test as we “follow our nose” by clicking the links exposed by the data.

It is exciting to see high-quality content such as from national governments and major publishers like the New York Times begin to be exposed as linked data. When this content finally gets embedded into usable contexts, we should see manifest uses and benefits emerge. We hope both institutions take our criticisms in that spirit.


[1] The NYT has been updated with improvements and they fixed multiple issues from the first release. The problems listed herein, however, still pertain after these improvements.
[2] Tim Berners-Lee, 2006. Linked Data (Design Issues), first posted on 2006-07-27; last updated on 2009-06-18. See http://www.w3.org/DesignIssues/LinkedData.html. Berners-Lee refers to the steps above as “rules,” but he elaborates they are expectations of behavior. Most later citations refer to these as “principles.”
[3] Li Ding, Dominic DiFranzo, Sarah Magidson, Deborah L. McGuinness and Jim Hendler, 2009. Data-GovWiki: Towards Linked Government Data. See http://www.cs.vu.nl/~pmika/swc/documents/Data-gov%20Wiki-data-gov-wiki-v1.pdf.
[4] UMBEL (Upper Mapping and Binding Exchange Layer) is a lightweight ontology structure in development for relating Web content and data to a standard set of subject concepts. It purpose has resulted in its creation of an associated vocabulary geared to both class-instance and reciprocal relationships, as well as partial or likelihood relationships. See http://umbel.org/technical_documentation.html#vocabulary.
[5] We’d like to thank Denny Vrandecic (see comments) for pointing out an imprecision in our original wording. This phrase was originally stated as, “Thus, by the sameAs statements in the ‘Newman, Paul’ record, the NYT is also asserting that that record is the same as these other things.”

Posted on November 16, 2009 at 12:04 pm in Linked Data, Ontology Best Practices, Semantic Web | Comments (13)
The URI link reference to this post is: http://www.mkbergman.com/846/when-linked-data-rules-fail/
The URI to trackback this post is: http://www.mkbergman.com/846/when-linked-data-rules-fail/trackback/
Date:   November 8, 2009

Mazzocchi Sounds a Warning to Linked Data Advocates

Stefano Mazzocchi has been a clear thinker for years and an innovative contributor to the community since his early leadership of the Apache Cocoon project. One of his best qualities is he speaks his mind. Now at Freebase, but previously with MIT’s Simile program, he is one of my dedicated reads via his Stefano’s Linotype blog.

His aforementioned post, Data Smoke and Mirrors, stands on its own, and I highly recommend it. He particularly focuses on the conversion of data.gov datasets to “linked data” (my quotes are purposeful). Combined with the recent poor conversion of New York Times datasets to linked data, I think he is the canary sending out a warning about a disturbing trend.

Posting linked data for its own sake — whatever the reasons — risks undercutting the premise.

We have now moved beyond “proof of concept” to the need for actual useful data of trustworthy provenance and proper mapping and characterization. Recent efforts are a disappointment that no enterprise would or could rely upon.

Listen up, folks.

Posted on November 8, 2009 at 9:32 pm in Linked Data | Comments (4)
The URI link reference to this post is: http://www.mkbergman.com/843/must-read-data-smoke-and-mirrors/
The URI to trackback this post is: http://www.mkbergman.com/843/must-read-data-smoke-and-mirrors/trackback/
Date:   November 2, 2009

Structured Dynamics LLC

A New Slide Show Consolidates, Explains Recent Developments

Much has been happening on the Structured Dynamics front of late. Besides welcoming Steve Ardire as a senior advisor to the company, we also have been issuing a steady stream of new products from our semantic Web pipeline.

This new slide show attempts to capture these products and relate them to the various layers in Structured Dynamics’ enterprise product stack:

The show indicates the role of scones, irON, structWSF, UMBEL, conStruct and others and how they leverage existing information assets to enable the semantic enterprise. And, oh, by the way, all of this is done via Web-accessible linked data and our practical technologies.

Enjoy!

Posted on November 2, 2009 at 5:54 pm in Information Automation, Linked Data, Ontologies, Open Source, Semantic Web, Semantic Web Tools, Structured Dynamics, UMBEL, Web-oriented Architecture, irON | Comments (1)
The URI link reference to this post is: http://www.mkbergman.com/842/structured-dynamics-product-stack/
The URI to trackback this post is: http://www.mkbergman.com/842/structured-dynamics-product-stack/trackback/
Date:   October 11, 2009

The Marshal Has Come to Town

A Marshal to Bring Order to the Town of Data Gulch

Though not the first, I have been touting the Linked Data Law for a couple of years now [1]. But in a conversation last week, I found that my colleague did not find the premise very clear. I suspect that is due both to cryptic language on my part and the fact no one has really tackled the topic with focus. So, in this post, I try to redress that and also comment on the related role of linked data in the semantic enterprise.

Adding connections to existing information via linked data is a powerful force multiplier, similar to Metcalfe’s law for how the value of a network increases with more users (nodes). I have come to call this the Linked Data Law: the value of a linked data network is proportional to the square of the number of links between data objects.

“In the network economy, the connections are as important as the nodes.” [2]

An early direct mention of the semantic Web and its possible ability to generate network effects comes from a 2003 Mitre report for the government [3]. In it, the authors state, “At present a very small proportion of the data exposed on the web is marked up using Semantic Web vocabularies like RDF and OWL. As more data gets mapped to ontologies, the potential exists to achieve a ‘network effect’.” Prescient, for sure.

In July 2006, both Henry Story and Dion Hinchliffe discussed Metcalfe’s law, with Henry specifically looking to relate it to the semantic Web [4]. He noted that his initial intuition was that “the value of your information grows exponentially with your ability to combine it with new information.” He noted he was trying to find ways to adapt Metcalfe’s law for applicability to the semantic Web.

I picked up on those observations and commented to Henry at that time and in my own post, “The Exponential Driver of Combining Information.” I have been enamoured of the idea ever since, and have begun to weave the idea into my writings.

More recently, in late 2008, James Hendler and Jennifer Golbeck devoted an entire paper to Metcalfe’s law and the semantic Web [5]. In it, they note:

“This linking between ontologies, and between instances in documents that refer to terms in another ontology, is where much of the latent value of the Semantic Web lies. The vocabularies, and particularly linked vocabularies using URIs, of the Semantic Web create a graph space with the ability to link any term to any other. As this link space grows with the use of RDF and OWL, Metcalfe’s law will once again be exploited – the more terms to link to, and the more links created, the more value in creating more terms and linking them in.”

A Refresher on Metcalfe’s Law

Metcalfe’s law states that the value of a telecommunications network is proportional to the square of the number of users of the system (n²) (note: it is not exponential, as some of the points above imply). Robert Metcalfe formulated it about 1980 in relation to Ethernet and fax machines; the “law” was then named for Metcalfe and popularized by George Gilder in 1993.

These attempts to estimate the value of physical networks were in keeping with earlier efforts to estimate the value of a broadcast network. That value is almost universally agreed to be proportional to the number of users, as accepted as Sarnoff’s law (see further below).

The actual algorithm proposed by Metcalfe calculates the number of unique connections in a network with n nodes to be n(n − 1)/2, which is proportional to n2. This makes Metcalfe’s law a quadratic growth equation.

As nodes get added, then, we see the following increase in connections:

Metcalfe Law Network Effect

‘Network Effect’ for Physical Networks

This diagram, modified from Wikipedia to be a horizontal image, shows how two telephones can make only one connection, five can make 10 connections, and twelve can make 66 connections, etc.

By definition, a physical network is a connected network. Thus, every time a new node is added to the network, connections are added, too. This general formula has also been embraced as a way to discuss social connections on the Internet [6].

Analogies to Linked Data

Like physical networks, the interconnectedness of the semantic Web or semantic enterprise is a graph.

The idea behind linked data is to make connections between data. Unlike physical telecommunication networks, however, the nodes in the form of datasets and data are (largely) already there. What is missing are the connections. The build-out and growth that produces the network effects in a linked data context do not result from adding more nodes, but from the linking or connecting of existing nodes.

The fact that adding a node to a physical network carries with it an associated connection has tended to conjoin these two complementary requirements of node and connection. But, to grok the real dynamics and to gain network effects, we need to realize: Both nodes and connections are necessary.

One circumstance of the enterprise is that data nodes are everywhere. The fact that the overwhelming majority are unconnected is why we have adopted the popular colloquialism of data “silos”. There are also massive amounts of unconnected data on the Web in the form of dynamic databases only accessible via search form, and isolated data tables and listings virtually everywhere.

Thus, the essence of the semantic enterprise and the semantic Web is no more complicated than connecting — meaningfully — data nodes that already exist.

As the following diagram shows, unconnected data nodes or silos look like random particles caught in the chaos of Brownian motion:

Linked Data Law Network Effect

‘Network Effect’ for Coherent Linked Data

As initial connections get made, bits of structure begin to emerge. But, as connections are proliferated — exactly equivalant to the network effects of connected networks — coherence and value emerge.

Look at the last part in the series diagram above. We not only see that the same nodes are now all connected, with the inferences and relationships that result from those connections, but we can also see entirely new structures emerge by virtue of those connections. All of this structure and meaning was totally absent prior to making the linked data connections.

Quantifying the Network Effect

So, what is the benefit of this linked data? It depends on the product of the value of the connections and the multiplier of the network effect:

linked data benefit = connections value X network effect multiplier

Just as it is hard to have a conversation via phone with yourself, or to collaborate with yourself, the ability to gain perspective and context from data comes from connections. But like some phone calls or some collaborations, the value depends on the participants. In the case of linked data, that depends on the quality of the data and its coherence [7]. The value “constant” for connected linked data depends in some manner on these factors, as well as the purposes and circumstances to which that linked data might be applied.

Even in physical networks or social collaboration contexts, the “value” of the network has been hard to quantify. And, while academics and researchers will appropriately and naturally call for more research on these questions, we do not need to be so timid. Whatever the alpha constant is for quantifying the value of a linked data network, our intuition should be clear that making connections, finding relationships, making inferences, and making discoveries can not occur when data is in isolation.

Because I am an advocate, I believe this alpha constant of value to be quite large. I believe this constant is also higher for circumstances of business intelligence, knowledge management and discovery.

The second part of the benefit equation is the multiplier for network effects. We’ve mentioned before the linear growth advantage due to broadcast networks (Sarnoff law) and the standard quadratic growth assumption of physical and social networks (Metcalfe law). Naturally, there have been other estimates and advocacies.

David Reed [8], for example, also adds group effects and has asserted an exponential multiplier to the network effect (like Henry Story’s initial intuition noted above). As he states,

“[E]ven Metcalfe’s Law understates the value created by a group-forming network [GFN] as it grows. Let’s say you have a GFN with n members. If you add up all the potential two-person groups, three-person groups, and so on that those members could form, the number of possible groups equals 2n. So the value of a GFN increases exponentially, in proportion to 2n. I call that Reed’s Law. And its implications are profound.”

Yet not all agree with the assertion of an exponential multiplier, let alone the quadratic one of Metcalfe. Odlyzko and Tilly [9] note that Metcalfe’s law would hold if the value that an individual gets personally from a network is directly proportional to the number of people in that network. But, then they argue that does not hold because of local preferences or different qualities of interaction. In a linked data context, such arguments have merit, though you may also want to see Metcalfe’s own counter-arguments [6].

Hinchliffe’s earlier commentary [4] provided a nice graphic that shows the implications of these various multiplers on the network effect, as a function of nodes in a network:

Potency of the Network Effect from Dion Hinchliffe

Various Estimates for the ‘Network Effect’

I believe we can dismiss the lower linear bound of this question and likely the higher exponential one as well (that is, Reed’s law, because quality and relevance questions make some linked data connections less valuable than others). Per the above, that would suggest that the multiplier of the linked data network is perhaps closer to the Metcalfe estimate or similar.

In any event, it is also essential to point out that connecting data indiscriminantly for linked data’s sake will likely deliver few, if any, benefits. Connections must still be coherent and logical for the value benefits to be realized.

The Role and Contribution of Linked Data

I elsewhere discuss the role of linked data in the enterprise and will continue to do so. But, there are some implications in the above that warrant some further observations.

It should be clear that the graph and network basis of linked data, not to mention some of the uncertainties as to quantifying benefits, suggests the practice should be considered apart from mission-critical or transactional uses in the enterprise. That may change with time and experience.

There are also open questions about data quality in terms of inputs to linked data and possible erroneous semantics and ontologies to guide the linked connections. Operational uses should be kept off the table for now. Like physical networks, not all links perform well and not all have usefulness. Similarly to how poor connections may be encountered in physical networks, they should be either taken off-ledger or relegated to a back-up basis. Linked data should be understood and treated no differently than networks of variable quality.

Such realism is important — for both internal and external linked data advocates — to allow linked data to be applied in the right venues at acceptable risk and with likely demonstrable benefits. Elsewhere I have advocated an approach that builds on existing assets; here I advocate a clear and smart understanding of where linked data can best deliver network effects in the near term.

And, so, in the nearest term, enterprise applications that best fit linked data promises and uncertainties include:

  • Establishing frameworks for data federation
  • Business intelligence
  • Discovery
  • Knowledge management and knowledge resources
  • Reasoning and inference
  • Development of internal common language
  • Learning and adopting data-driven apps [10], and
  • Staging and analysis for data cleaning.

A New Deputy Has Come to Town

As in the Wild West, the new deputy marshal and his tin badge did not guarantee prosperity. But a good marshal would deliver law and order. And those are the preconditions for the town folk to take charge of building their own prosperity.

Linked data is a practice for starting to bring order and connections to your existing data. Once some order has been imposed, the framework then becomes a basis for defining meanings and then gaining value from those connections.

Once order has been gained, it is up to the good citizens of Data Gulch to then deliver the prosperity. Broad participation and the network effect are one way to promote that aim. But success and prosperity still depends on intelligence and good policies and practice.


[1] I first put forward this linked data aspect in What is Linked Data?, dated June 23, 2008. I then formalized it in Structure the World, dated August 3, 2009.
[2] Paul Tearnen, 2006. “Integration in the Network Economy,” Information Management Special Reports, October 2006. See http://www.information-management.com/specialreports/20061010/1064941-1.html.
[3] Salim K. Semy, Mark Linderman and Mary K. Pulvermacher, 2003. “Information Management Meets the Semantic Web,” DOD Report by MITRE Corporation, November 2003, 10 pp. See http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA460265&Location=U2&doc=GetTRDoc.pdf.
[4] On July 15, 2006, Dion Hinchcliffe wrote, Web 2.0’s Real Secret Sauce: Network Effects. He produced a couple of useful graphics and expanded upon some earlier comments to the Wall Street Journal. Shortly thereafter, on July 29, Story wrote his own post, RDF and Metcalfe’s law, as noted. I commented on July 30.
[5] James Hendler and Jennifer Golbeck, 2008. “Metcalfe’s Law, Web 2.0, and the Semantic Web,” in Journal of Web Semantics 6(1):14-20, 2008. See http://www.cs.umd.edu/~golbeck/downloads/Web20-SW-JWS-webVersion.pdf.
[6] Robert Metcalfe, 2006. Metcalfe’s Law Recurses Down the Long Tail of Social Networking, see http://vcmike.wordpress.com/2006/08/18/metcalfe-social-networks/.
[7] See my When is Content Coherent? posting of July 25, 2008. ‘Coherence’ is a frequent theme of my blog posts; see my chronological listing for additional candidates.
[8] From David P. Reed, 2001. “The Law of the Pack,” Harvard Business Review, February 2001, pp 23-4. For more on Reed’s position, see Wikipedia’s entry on Reed’s law.
[9] Andrew Odlyzko and Benjamin Tilly, 2005. A Refutation of Metcalfe’s Law and a Better Estimate for the Value of Networks and Network Interconnections, personal publication; see http://www.dtc.umn.edu/~odlyzko/doc/metcalfe.pdf.
[10] Data-driven applications are the term we have adopted for modular, generic tools that operate and present results to users based on the underlying data structures that feed them. See further the discussion of Structured Dynamics’s products.

Posted on October 11, 2009 at 8:16 pm in Linked Data, Semantic Web | Comments (3)
The URI link reference to this post is: http://www.mkbergman.com/837/the-law-of-linked-data/
The URI to trackback this post is: http://www.mkbergman.com/837/the-law-of-linked-data/trackback/
Date:   September 20, 2009

The Unbearable Lightness of Being, by Milan Kundera

A Technique is Neither a ‘Meme’ nor a Philosophy

I have been a participant in an interesting series of discussions recently: Whither goes ‘linked data’?

As I described to someone, I was clearly not a father to the idea of ‘linked data‘, but I was handing out cigars pretty close on to the birth. Chris Bizer and Richard Cyganiak were the innovators that first proposed the original project to the W3C [1]. (Thanks guys!)

From that point forward, now a bit over 2-1/2 years ago, we have seen a massive increase in attention and visibility to the idea of ‘linked data.’ I take a small amount of reflected pride that I helped promote the idea in some way with my early writings.

That visibility was well-deserved. After all, here was the concept:

  • Expose your data in an accessible way on the Web
  • Use Web identifiers (URIs) as the means to uniquely identify that data
  • Use RDF “triples” to describe the relationships between the data.

Much other puffery got layered on to those ideas, but I think those premises are the key basis.

Early Cracks in the Vision

My first personal concern with where linked data was going dealt with an absence of context or conceptual structure for how these new datasets related to one another. I will not repeat those arguments here; simply see many of my blog postings from the past two years or so. Exposing millions of “things” was wonderful, but what did all of that mean? How does one “thing” relate to another “thing”? Are some “things” the same as or similar to other things? If nothing else, these concerns stimulated the genesis of the UMBEL subject concept ontology, an outcome for which I need to thank the community.

It would be petty of me to question the basis that attracted millions of data items to get exposed from linked data techniques. In fact, the richness we have today in exposed Web data objects comes solely from this linked data initiative. But, nonetheless, my guess is that even the most ardent linked data advocate would have a hard time finding a logical way to present the current linked data reality in context. We see the big bubble diagram of available datasets, but, frankly, the position and relationships amongst datasets appears somewhat arbitrary. We have lots of bubbles, but little meaning.

The Constant is Transition

The semantic Web was in serious crisis prior to linked data. It had bad perception, little delivery, and unmet hype. Linked data at least began to show how exposed and properly characterized data can begin to become interconnected.

For a couple of years now I have tried in various posts to present linked data in a broader framework of structured and semantic Web data.  I first tried to capture this continuum in a diagram from July 2007:

Transition in Web Structure
Document Web Structured Web Semantic Web
Linked Data
  • Document-centric
  • Document resources
  • Unstructured data and semi-structured data
  • HTML
  • URL-centric
  • circa 1993
  • Data-centric
  • Structured data
  • Semi-structured data and structured data
  • XML, JSON, RDF, etc
  • URI-centric
  • circa 2003
  • Data-centric
  • Linked data
  • Semi-structured data and structured data
  • RDF, RDF-S
  • URI-centric
  • circa 2006
  • Data-centric
  • Linked data
  • Semi-structured data and structured data
  • RDF, RDF-S, OWL
  • URI-centric
  • circa ???

The point is not whether those earlier characterizations were “correct”, but that linked data be properly seen as merely a natural step in an ongoing transition. IMO, we are progressing nicely along this spectrum.

A Caricature of Itself

Linked data is a set of techniques — nothing more — and certainly not a philosophy or meme (whatever the hell that means). We have way too many breathy pontifications about “linked data this” and “linked data that” that frankly are undercutting the usefulness of the practice and making it a caricature of itself.

In the enterprise world we see similar attempts at marketing that need to give everything a three-letter acronym. In this case, we have a bunch of academics and researchers trying to act like market and business gurus. All it is doing is confusing the marketplace and hurting the practice.

The elevation of techniques or best practices into roles clearly beyond their pay grade produces completely the opposite effect:  the idea comes under question and ridicule. The logic and rationale for why we should be following these best practices gets lost in the hyperbole. I spend most of my time hitting the delete button on the mailing lists. I fear what others new to these practices — that is, my company’s customers and prospects — perceive when they look into this topic.

Linked data is useful and needed. But come on, folks, these are not tribal or religious matters.

Declaring Victory, and Moving On

Through the initial project vehicle of DBpedia and then how it nucleated other “linked” data sets, the linked data practice certainly became viral. Today, we have many millions of data items available in linked data form. This is unalloyed goodness.

I will continue to use the phrase ‘linked data’ to refer to those useful techniques noted in the opening. Actually, I think it is best to think of linked data as a set of best practices, but by no means an end unto itself.

Beyond linked data we need context, we need our data to be embedded and related to interoperable ontologies, we need much better user interfaces and attainability, and we need quality in our assertions and use. These are issues that extend well beyond the techniques of linked data and form the next set of challenges in gaining broader acceptance for the semantic Web and the semantic enterprise.

Like most everything else in this world, there are real problems and real needs out there. Thankfully, we have heard mostly the end of the silliness about Web 3.0.  Perhaps we can now also broaden our horizons beyond the useful techniques of linked data to tackle the next set of semantic challenges.

So, let me be the first to congratulate the community on a victory well achieved! As for myself and my company, we will now focus our attentions on the next tier of challenges. It is time to deprecate the rhetoric. Huzzah!


[1] For the record, in addition to Bizer and Cyganiak, the first publication on the project, “Interlinking Open Data on the Web”, in the Proceedings Poster Track, ESWC2007, Innsbruck, Austria, June 2007, by Bizer, Tom Heath, Danny Ayers and Yves Raimond, also noted the early contributions of Sören Auer, Orri Erling, Frederick Giasson, Kingsley Idehen, Georgi Kobilarov, Stefano Mazzocchi, Josh Tauberer, Bernard Vatant and Marc Wick.

Posted on September 20, 2009 at 8:09 pm in Linked Data, Semantic Web, Structured Web | Comments (5)
The URI link reference to this post is: http://www.mkbergman.com/802/moving-beyond-linked-data/
The URI to trackback this post is: http://www.mkbergman.com/802/moving-beyond-linked-data/trackback/
Date:   August 3, 2009

The "Blue Marble": The Earth seen from Apollo 17.jpg from Wikipedia.org

Multiple Techniques and Data Structs can Make the Vision a Reality

Linked data and subject and domain ontologies provide the organizing framework. Techniques for converting, tagging and authoring structure provide the content. In combination, we now have in hand the necessary pieces to enable all of us to “structure the World.”

In this vision, the nature of the links or connections between data need not be complicated to gain tremendous benefit. Similar to Metcalfe’s Law for the increasing value of networks as more nodes (users) get added, adding connections to existing data is a powerful force multiplier.

We can call this the Linked Data Law: the value of a linked data network is proportional to the square of the number of links between data objects [1]. Further, if we are purposeful to include connective links where appropriate as we add more data (that is, nodes), this multiplier effect becomes even stronger.

Structured Dynamics is dedicated to help make this prospect real. Meaningful progress in doing so requires only a relatively few moving parts or techniques. Yet, because we sometimes bounce from talking or focusing on one part versus the others, we can lose context or sight of the overarching vision. The purpose of this article is to re-set and calibrate that overall vision.

The Vision: Data Federation of Any Desired Content

The vision is to get all data and information to interoperate, regardless of legacy or form. Much of this data is already structured, either from databases or simpler forms of data structs. Some of this information is unstructured or semi-structured, requiring extraction and tagging techniques. And new information is being constantly generated, which warrants better means to author and stage for interchange and interoperability.

No matter the provenance, all information has context and scope. As a chunk from here, and a piece from there, gets added to our linked data mix, having means to characterize what that data is about and how it can be meaningfully inter-related becomes crucial. Sometimes these contexts are informed by existing schema; sometimes they are not. But, in any case, it is the role of ontologies to both position these datasets into an “aboutness” framework and to help guide how the data can be described and related to other data. This part of the vision invokes semantics and coherent structures (schema or ontologies) for positioning and mapping datasets to one another.

As both the means for representing any extant data format and as the means for describing these conceptual relationships or schema, RDF provides the canonical data model. A single target representation and common data model also means we can develop and design a smaller universe of tools to operate and provide functionality over all of this data. Indeed, because our RDF data model and its ontologies are so richly structured, we can design our tools with generic functionality, the specific operation and expression of which is based on the inherent structure within the data and its relationships. This vision of data-driven apps leads to extreme leverage, incredible flexibility, and inherent “meshup” capabilities for tools.

Further, because we use Web identifiers (URIs) for our data and concepts and because we expose and access this linked data via the Web, we use the proven and scalable architectures of the Web itself for how we design our systems. This Web-oriented architecture (WOA) provides a completely decentralized and loosely coupled deployment model that can work ranging from public and open to private and proprietary, applicable to data and participants alike.

From the outset, it is essential to recognize that thousands of contributors are enabling this vision. So, while Structured Dynamics naturally uses its own tools and techniques to flesh out the various parts of this vision below, realize there are many players and many tools from which to choose [2]. For that is another aspect of this vision that is quite powerful: providing choice and avoiding lock-in.

RDF: The Canonical Data Model

The core construct — or fulcrum, if you will — of the vision is the RDF (Resource Description Framework) data model [3]. I have written elsewhere on the Advantages and Myths of RDF, which explains more precisely the advantages of that model. RDF provides a common data model to which any external format or schema can be converted and represented. It also provides a logic model and basis for building vocabularies that can inform and drive generic tools.

In the context of data interoperability, a critical premise is that a single, canonical data model is highly desirable. Why?

Simply because of 2N v N2. That is, a single reference (”canon”) structure means that fewer tool variants and converters need be developed to talk to the myriad of data formats in the wild. With a canonical data model, talking to external sources and formats (N) only requires converters to and from the canonical form (2N). Without a canonical model, the combinatorial explosion of required format converters becomes N2 [4].

Note, in general, such a canonical data model merely represents the agreed-upon internal representation. It need not affect data transfer formats. Indeed, in many cases, data systems employ quite different internal data models from what is used for data exchange. Many, in fact, have two or three favored flavors of data exchange such as XML, JSON or the like. More on this is discussed in a section below.

As this diagram shows, then, we have a single internal representation that is the target for all data and format converters and upon which all tools operate. These tools are themselves expressed as Web services so that they may be distributed and conform to general WOA guidelines. In addition, there may be multiple external “hubs” that represent alternative data models or formats or schema conversions (say, for relational databases). So long as we have converters between these alternate “hubs” and our canonical RDF form we can allow a thousand flowers to bloom:

structWSF Data Model Relationships

Other canonical forms could be advocated. Yet RDF has the logical basis to represent any data form and any schema or conceptual structure. It is based on a robust set of open standards and languages and tools. It may be serialized in many formats. It can be grounded in description logics and, in appropriate forms, reasoned over and expressed in vocabularies and schema suitable for the most complex of conceptual structures and semantics. RDF is the data model explicitly designed for the Web, the clear global information basis for the foreseeable future.

For more than 30 years — since the widespread adoption of electronic information systems by enterprises — the Holy Grail has been complete, integrated access to all data. With the canonical RDF data model, that promise is now at hand.

Conversion: So Many Structs, So Little Time

Diversity is a truism of human communications as captured by the biblical Tower of Babel and the many thousands of current human languages. Diversity in data formats, serializations, notations and languages is a similar truism. We term the expression of each of these varied forms of data a struct.

While an internal canonical representation of data makes sense for the reasons noted above, pragmatic information systems must recognize the inherent diversity and chaos of data in the real world. The history of trying to find single representations or to impose standards via fiat have singularly failed. That will continue to be so due in part to inertia and legacy, sunk investments, existing infrastructure, and the purposes for the data.

In pursuing a vision of data interoperability, then, conversion is an essential glue for cementing understanding with what exists and will exist.

RDB-to-RDF

Arguably the largest source of structured data are enterprise and government information systems, with the predominant data representation being the relational data model managed by relational schema. Much of this data is also cleaner and mission critical compared to other sources in the wild. Fortunately, there are many logical and conceptual affinities between the relational model and the one for RDF [5].

Just as there are many RDFizers for simpler forms of data structs (see next), there are also nice ways to convert relational schema to RDF automatically. Given these overall conceptual and logical affinities the W3C is also in the process of graduating an incubator group to an official work group, RDB2RDF, focused on methods and specifications for mapping relational schema to RDF.

Amongst all techniques covered in this paper, Structured Dynamics views the layering of RDF ontologies over existing relational data stores as one of the most promising and important. Given the advantages of RDF for interoperability, this area should be a major emphasis of current and new vendors and service providers.

RDFizers

Much data, however, resides in much smaller datasets and often for less formal purposes than what is found in enterprise databases. Some of this data is geared for exchange or standardization; much is emerging from Web and Internet applications and uses; and much might be local or personal in nature, such as simple lists or spreadsheets.

RDF is well suited to convert (”RDFize”) these simpler and more naïve data formats. In my original census about 18 months ago, as reported in ‘Structs’: Naïve Data Formats and the ABox, I listed about 90 converters. My most recent update now lists nearly double that number, with about 150 converters [6]:

URN handlers (in addition to IRI and URI):

  • DOI
  • LSID
  • OAI

RDF

  • Serialization formats:
    • N3
    • RDF/XML
    • Turtle
  • Languages and ontologies:
    • AB Meta
    • Annotea
    • APML
    • AtomOWL
    • Bibliographic Ontology
    • Creative Commons
    • EXIF
    • FOAF
    • Java
    • Javadoc
    • MARC/MODS
    • Meta Standards
    • Music Ontology
    • Natural Language
    • Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
    • Open Geospatial
    • OWL
    • SIOC
    • SIOCT
    • SKOS
    • UMBEL
    • vCard
    • XML
    • Others
  • (X)HTML pages
  • Embedded Microformats and GRDDL [7]:
    • DC
    • eRDF
    • geoURL
    • Google Base
    • hAudio
    • hCalendar
  • Embedded Microformats and GRDDL (con’t):
    • hCard
    • hListing
    • hResume
    • hReview
    • HR-XML
    • Ning
    • RDFa
    • relLicense
    • SVG
    • XBRL
    • XFN
    • xFolk
    • XR-XML
    • XSLT
  • Syndication Formats:
    • Atom
    • OPML
    • OCS
    • RSS 1.1
    • RSS 2.0
    • XBEL (for bookmarks)
  • REST-style Web service APIs:
    • Amazon
    • Apple
    • Calais
    • CrunchBase
    • Del.icio.us
    • Digg
    • Discogs
    • Disqus
    • eBay
    • Facebook
    • Flickr
    • Freebase (MQL)
    • FriendFeed
    • Garmin
    • Get Satisfaction
    • Google
    • Hoover’s
    • HTTP (raw)
    • ISBN DB
    • Last.fm
    • Library Thing
    • Magnolia
  • REST-style Web service APIs (con’t):
    • Meetup
    • MusicBrainz
    • New York Times
    • New York Times Campaign Finance (NYTCF)
    • New York Times tags
    • Open Library
    • Open Social
    • Open Street
    • OpenLink (facets)
    • O’Reilly
    • Picasa
    • Radio Pop (BBC)
    • Rhapsody
    • Salesforce
    • Slideshare
    • Slidy
    • Technorati
    • They Work For You
    • Twine
    • Twitter
    • Weather
    • Wikipedia
    • World Bank
    • Yahoo! Finance
    • Yahoo! Maps
    • Yahoo! Weather
    • YouTube
    • Zemanta
  • Files (multitude of file formats and MIME types, including):

Many of the sources above come from new and emerging Web-based APIs, which are also huge sources of content growth. Also note that alternative formats to RDF (e.g., microformats) or leading serializations and encodings (e.g, XML, JSON) also have many converter options.

For many typical naïve data structs, the data is represented as attribute-value pairs, which easily lend themselves to conversion to RDF as instance records [8]. See further the Authoring section below.

Tagging: The 80% Solution

An apocryphal statistic is that 80% to 85% of all information resides in unstructured text [9]. Besides lacking recent validation, this claim from a decade ago often attributed to Merrill Lynch also precedes much of the Internet and the emergence of metadata and tagging. Nevertheless, what is true is that written text content is ubiquitous and the majority of it remains untagged or uncharacterized by any form of metadata.

While such information can be searched, it only matches when exact terms match. This means that related information, particularly in the form of conceptual relationships and inferencing, can not be applied to untagged text content.

While information extraction — the basis by which tags for entities and concepts can be obtained — has been an active topic of research for two decades, it is only recently that we have begun to see Web-scale extractors appear. Examples include Yahoo’s term extractor, Thomson Reuter’s Calais, or Google’s Squared, to name but a few.

scones - Subject Concepts or Named Entities In Structured Dynamics’ case we have been working on the scones (Subject Concepts Or Named EntitieS) extractor for quite a while. scones uses rather simple natural language processing (NLP) methods as informed by concept ontologies and named entity (instance record) dictionaries to help guide the extraction process. The co-occurrence of matches between concepts and entities also aids the disambiguation task (though additional modules may be invoked with alternative disambiguation methods). In prototype forms, the resulting tags can be managed separately or fed to user interfaces or re-injected back into the original content as RDFa.

There are literally dozens of such extractors and services presently available on the Web and many that are available as open source or commercial products. Some are mostly algorithm based using machine-learning techiques or statistics, while others are gazeteer- or dictionary-driven.

These systems will lead to rapid tagging of existing content and the removal of some of the early “chicken-and-egg” challenges associated with the semantic Web. These systems will also be combined with the many existing bookmarking and tagging services.

So, just as we will see federation and interoperability of conventional data, we will also see linkages to relevant and supporting text content accompanying it. This combination, in turn, will also lead to richer browsing and discovery experiences.

Authoring: The Neglected Third Leg of the Stool

In addition to conversion and tagging, authoring is the third leg of the stool to expose structured data. It is a neglected leg to the structured content stool, and one important to make it easier for datasets to be easily exposed as RDF linked data.

One of the reasons for the proliferation of data structs has been the interest in finding notations and conventions for easier reading and authoring of small datasets. There have literally been hundreds of various formats proposed over decades for conveying lightweight data structures. Most have been proprietary or limited to specific domains or users. Some, such as fielded text, structured text, simple declarative language (SDL), or more recently YAML or its simpler cousin JSON, have become more widely adopted and supported by formal specifications, tools or APIs. JSON, especially, is a preferred form for Web 2.0 applications.

What has been less clear or intuitive in these forms, again mostly based on an attribute-value pair orientation, is how to adequately relate them to a more capable data model, such as RDF. In JSON or YAML, for example, the notations include the concepts of objects, arrays and datatypes (among other conventions). Other structures lack even these constructs.

To take the case of JSON as might be related to RDF, there are a couple of efforts to define representation conventions from Talis and GBV for serializing RDF. There was a floated idea for an RDF version of JSON called RDFON that has now evolved into the TURF approach. JDIL (JSON data integration layer) instructs how to add namespaces to JSON to enable encoding RDF. Jim Ley, Kanzaki Masahide and Dave Beckett (likely among others) have written simple and straightforward RDF and Turtle parsers and converters for JSON. And, still further examples are Beckett’s Triplr and Sören Auer’s ASKW Triplify lightweight conversion services involving many different formats.

Because JSON is easily readable, can drive many Web 2.0 applications and widgets, and lends itself to fast conversions and tools in various scripting languages, Structured Dynamics was commissioned by the Bibliographic Knowledge Network (BKN) to formalize a BibJSON specification suitable for BibTeX-like data records and citations with an extensible schema to be converted to RDF.

The emerging result of that BibJSON effort will be published shortly. The specification includes conventions and vocabularies for creating bibliographic and citation instance records, for specifying structural schema, and for creating linkage files between the attributes in the record files with existing and new schema. BibJSON is itself grounded in IRON, which is an instance record and object notation developed by Structured Dyamics that can be serialized as JSON (called irJSON), XML (called irXML) or comma-separated values (or CSV comma-delimited files, called commON).

The purpose of these notations and serializations is to provide easier authoring environments and scripting support to RDF-ready datasets. This approach has the advantage of shielding most users from the nuances or lengthiness of RDF (though the N3 serialization also works well).

The design and development of commON was especially geared to using spreadsheets as authoring environments that would enable easy creation of instance record tables or simple hierarchical or outline structures. For example, here is a sample portion of Sweet Tools specified in a spreadsheet using the commON notation:

Sweet Tools Sample Spreadsheet

Once the philosophy and role of naïve data structs is embraced — with an appreciation of the many converters now available or easily written for translating to RDF — it becomes easier to determine data forms appropriate to the tools and natural work flow of the users and tasks at hand. Under this mindset, the role of RDF is to be the eventual conversion target, but not necessarily what is used for intermediate work tasks, and in particular not for authoring.

Getting it All Organized

OK, so now all of this stuff is converted, tagged or authored. How does it relate? What is the relation of one dataset to another dataset? Is there a context or framework for laying out these conceptual roadmaps?

UMBEL (Upper Mapping and Binding Exchange Layer) Two years ago as we looked at the state of RDF and the incipient semantic Web as promised via linked data, we saw that such a specific framework was lacking. (Though there were existing higher-level ontologies, either their complexity or design were not well-suited to these purposes.) It was at that time that Frédérick Giasson and I began to formulate the UMBEL (Upper Mapping and Binding Exchange Layer) ontology, which eventually led to our more formal business partnership and Structured Dynamics.

What we sought to achieve with UMBEL was a coherent reference framework of about 20,000 subject concepts, connected and acting like constellations in the information sky for orienting content and new datasets. At the same time, we wanted to create a general vocabulary and approach that would lend themselves to creation of domain-specific ontologies, which would also naturally tie in and inter-relate to the more general UMBEL structure.

This objective was achieved, though UMBEL deserves an upgrade to OWL 2 and some other pending improvements. A number of domain ontologies have been created and now relate to UMBEL. So, rather than being an end to itself, UMBEL was one of the necessary infrastructure pieces to help make the vision herein a reality.

Similar approaches may be taken by others with new domain ontologies based on the UMBEL vocabulary with tie-in as appropriate to existing subject concepts, or by mapping to the existing UMBEL structure.

Of course, UMBEL is not an absolute condition to the vision herein. However, insofar as users desire to see multiple datasets inter-related, including the use of existing public Web data, something akin to UMBEL and related domain ontologies will be necessary to provide a similar roadmap.

Making it All Available

The parts and techniques discussed so far pertain almost exclusively to data and content. But, these structures so created now can inform data-driven applications which also now must be deployed. To do so, Structured Dynamics is committed to what is known as a Web-oriented architecture (WOA):

WOA = SOA + WWW + REST

WOA is a subset of the service-oriented architectural style, wherein discrete functions are packaged into modular and shareable elements (”services”) that are made available in a distributed and loosely coupled manner. WOA generally uses the representational state transfer (REST) architectural style defined by Roy Fielding in his 2000 doctoral thesis; Fielding is also one of the principal authors of the Hypertext Transfer Protocol (HTTP) specification.

REST provides principles for how resources are defined and used and addressed with simple interfaces without additional messaging layers such as SOAP or RPC. The principles are couched within the framework of a generalized architectural style and are not limited to the Web, though they are a foundation to it.

structWSF Web Services FrameworkWithin this design we need a suite of generic functions and tools that are driven by the structure of the available datasets. The deployment vehicle and design we have implemented to provide this WOA design is structWSF [10].

structWSF is a platform-independent Web services framework for accessing and exposing structured RDF data. Its central organizing perspective is that of the dataset. These datasets contain instance records, with the structural relationships amongst the data and their attributes and concepts defined via ontologies (schema with accompanying vocabularies). The master or controlling Web service in the framework is the module for granting access and use rights to datasets based on permissions.

The structWSF middleware framework is generally RESTful in design and is based on HTTP and Web protocols and open standards. The initial structWSF framework comes packaged with a baseline set of about a dozen Web services in CRUD, browse, search and export and import. More services can readily be added to the system.

All Web services are exposed via APIs and SPARQL endpoints. Each request to an individual Web service returns an HTTP status and a document of resultsets (if the query result is not null). Each results document can be serialized in many ways, and may be expressed as either RDF or pure XML.

In initial release, structWSF has direct interfaces to the Virtuoso RDF triple store (via ODBC, and later HTTP) and the Solr faceted, full-text search engine (via HTTP). However, structWSF has been designed to be fully platform-independent. The framework is open source (Apache 2 license) and designed for extensibility.

No End in Sight

Like all visions, there are many aspects and many improvements possible. This vision is definitely a work-in-progress with no end in sight.

But, meaningful movement embracing the full scope of this vision is doable today. Structured Dynamics welcomes inquiries regarding any of these aspects, improvements to them, or application to your specific needs and problems.

We also welcome you to come back and visit our blogs (Fred’s is found here). We try to speak on various aspects of this vision in all of our posts and are pleased to share our experience and insights as gained.


[1] Metcalfe’s law states that the value of a telecommunications network is proportional to the square of the number of users of the system (n2), where the linkages between users (nodes) exist by definition. For information bases, the data objects are the nodes. Linked data works to add the connections between the nodes. We can thus modify the original sense to become the Linked Data Law: the value of a linked data network is proportional to the square of the number of links between the data objects. I first presented this formulation about a year ago in What is Linked Data?
[2] This piece introduces for the first time a couple of efforts-in-progress by Structured Dynamics. For a general tools listing, see my own Sweet Tools listing of about 800 semantic Web and -related tools.
[3] As quoted in The Lever, “”Archimedes, however, in writing to King Hiero, whose friend and near relation he was, had stated that given the force, any given weight might be moved, and even boasted, we are told, relying on the strength of demonstration, that if there were another earth, by going into it he could remove this.” from Plutarch (c. 45-120 AD) in the Life of Marcellus, as translated by John Dryden (1631-1700).
[4] The canonical data model is especially prevalent in enterprise application integration. An interesting animated visualization of the canonical data model may be found at: http://soa-eda.blogspot.com/2008/03/canonical-data-model-visualized.html.
[5] An excellent piece on those relations was written by Andrew Newman a bit over a year ago; see Andrew Newman, 2007. “A Relational View of the Semantic Web,” published on XML.com, March 14, 2007; http://www.xml.com/pub/a/2007/03/14/a-relational-view-of-the-semantic-web.html. RDF can be modeled relationally as a single table with three columns corresponding to the subject-predicate-object triple. Conversely, a relational table can be modeled in RDF with the subject IRI derived from the primary key or a blank node; the predicate from the column identifier; and the object from the cell value. Because of these affinities, it is also possible to store RDF data models in existing relational databases. (In fact, most RDF “triple stores” are RDBM systems with a tweak, sometimes as “quad stores” where the fourth tuple is the graph.) Moreover, these affinities also mean that RDF stored in this manner can also take advantage of the historical learnings around RDBMS and SQL query optimizations.
[6] The largest source for RDFizers, which it calls Sponger cartridges, is from OpenLink Software in relation to its Virtuoso universal server. Most of its converters use XSLT stylesheets to translate to RDF, but the system has other conversion capabilities as well. Two additional OpenLink resources are a clickable diagram of converters and relationships with links and an online storehouse of available XSLT converters. In addition, two other sources — the W3C’s Semantic Web wiki with converter listings and MIT’s Simile program and listing of RDFizers — have a rich set of listings. Note that many of the categories shown on the table also have multiple sources of converters, so that the absolute number of converters has also grown faster than the unique formats supported.
[7] GRDDL (Gleaning Resource Descriptions from Dialects of Languages) is a W3C markup format for getting RDF data out of XML and XHTML documents using explicitly associated transformation algorithms, typically represented in XSLT GRDDL accomodates a wide variety of dialects (see one listing) and can be combined with arbitrary transformation mechanisms (though currently mostly based on XSLTs).
[8] We characterize instance records as representing the “ABox”, in accordance with our working definition for description logics:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[9] One of the more recent discussions of this percentage is by Seth Grimes, Unstructured Data and the 80 Percent Rule, 2009.
[10] structWSF is also designed to integrate with third-party apps and content management systems (CMSs) to provide the user interfaces to these functions. The first implementation of this design is conStruct SCS, a structured content system that extends the basic Drupal content management framework. conStruct enables structured data and its controlling vocabularies (ontologies) to drive applications and user interfaces.

Posted on August 3, 2009 at 10:23 pm in Bibliographic Knowledge Network, Linked Data, Ontologies, Semantic Web, Structured Dynamics, Structured Web, UMBEL, Web-oriented Architecture Comments Off
The URI link reference to this post is: http://www.mkbergman.com/533/structure-the-world/
The URI to trackback this post is: http://www.mkbergman.com/533/structure-the-world/trackback/
Date:   July 9, 2009
structWFSDue to Fred Giasson’s great work and the support of the BKN (Bibliographic Knowledge Network) project, five new Web services and associated documentation have been added to structWSF. The five Web services are:

Thanks, Fred, for more amazing, clean work!

The code for these will be posted soon to the structWSF SVN on Google code. UPDATE: The code has now been posted.

Posted on July 9, 2009 at 9:55 pm in Adaptive Innovation, Bibliographic Knowledge Network, Linked Data, Open Source, Structured Web, Web-oriented Architecture Comments Off
The URI link reference to this post is: http://www.mkbergman.com/498/five-new-web-services-added-to-structwsf/
The URI to trackback this post is: http://www.mkbergman.com/498/five-new-web-services-added-to-structwsf/trackback/
Page 1 of 41234»
Copyright © 2004–2010 Michael K. Bergman.   This work is licensed under a Creative Commons License