Posted: December 21, 2009

Open World
OWA Enables Incremental, Low-risk Wins for the Semantic Enterprise

In discussions of the semantic Web, the open world assumption (OWA) is frequently mentioned. What this post argues is that this somewhat obscure concept may hold the key to why there have been decades of too-frequent failures in the enterprise in business intelligence, data warehousing, data integration and federation, and knowledge management.

This is a fairly bold assertion. To support it, we first need to look at the logic and mindset assumptions associated with traditional relational data management and the semantic Web. We then need to look at the nature of knowledge itself and its relation to data federation. It is in this intersection that the root of decades of faulty premises may reside.

The main argument is that the closed world assumption (CWA) and its prevalent mindset in traditional database systems have hindered the ability of enterprises and the vendors that support them to adopt incremental, low-risk approaches to knowledge systems and management. CWA, in turn, has produced over-engineered schema, overly complicated architectures and massive specification efforts, which lead in turn to high deployment costs, blown schedules and brittleness.

The good news is that abandoning these failed practices and embracing the open world approach can be done immediately based on existing assets. Simply shifting from the closed world to open world premise can, I argue, improve the odds for enterprise IT success in these areas.

It is time to meet the elephant in the room.

Scope and Some Root Causes of Enterprise IT Failures

It is, of course, a bit of editorial hyperbole to label most enterprise initiatives in business intelligence and knowledge management as being failures over the past few decades. And, insofar as failures have occurred, I also do not believe they are the result of vendor greed or cynicism, or IT management mistakes or incompetence. Rather, I believe the fault resides in the attempt to pound a square peg (relational model) into a round hole (knowledge representation).

The scope of these failures is not known. We have seen anecdotal claims of trillions of dollars in annual losses due to IT project failures worldwide; failure rates for major IT projects in the 65% to 80% range; and analyses of waste and failures in individual firms that are fairly eye-popping [1]. The real point of this post is not to try to quantify these problems. However, in my many years within IT it has been a common perception and concern that many — if not most — large-scale information technology deployments have disappointed in one way or another.

These disappointments range from cost overruns, to late delivery, to unmet objectives, or to low user acceptance. Many initiatives are simply cancelled before any such metrics can be documented. Whatever the absolute quantification, I think most experienced IT managers and executives would agree that these failures and disappointments have been all too commonplace.

“Business Intelligence projects are famous for low success rates, high costs and time overruns. The economics of BI are visibly broken, and have been for years. Yet BI remains the #1 technology priority according to Gartner.”[2]

Why might this be?

I truly believe the reasons for these disappointments do not reside in bad faith or incompetence. The potential importance of IT knowledge projects to improve competitive position, lower costs, or aid innovation for new markets is understood by all. Dilbert aside, I find it simply incomprehensible that disappointments or failures are rooted in these causes.

Rather, I suspect the root cause resides in the success of the relational model in the enterprise.

For transaction systems and for modeling narrowly bounded, structured domains (such as products, inventory or customer lists), the relational model and its proven and optimized RDBMSs and SQL query language have been resounding successes. It is natural to take a successful approach and try to extend it to other areas.

However, beginning with data warehouses in the 1980s and business intelligence (BI) systems in the 1990s, and given the longstanding fact that most enterprise information is bound up in documents, the application of the relational model to these areas has been disappointing.

The reasons for this do not reside in areas such as storage or hardware; these areas have seen remarkable improvements over the decades. Rather, the problem resides in the nature of the relational model itself, and its lack of suitability to knowledge-based problems.

Technical Aspects of OWA, Broadly Defined

I have noted the importance of the open world assumption to the semantic enterprise in many of my more recent posts [3,4]. But I, like many others, often refer to the open world assumption with facile summaries such as: a lack of information does not imply that the missing information is false. Yet to fully understand the implications of OWA and many of its associated assumptions, it is necessary to delve deeper.

I am using here a shorthand that poses the closed world assumption (CWA) vs. the open world assumption (OWA). Actually, the data models behind these approaches tend to be coupled with a few other assumptions (Datalog and non-monotonic logic in the case of CWA; monotonic logic in the case of OWA [5]; OWA is also firmly grounded in description logics [4]). I use the shorthand of relational approach vs. (open) semantic Web approach to contrast these two models.

There are instances where the relational model can embrace the open world assumption (for example, the null in SQL) and there are instances where semantic Web approaches can be closed world (as with frame logic or Prolog or other special considerations; see conclusion). But, as generally applied and as generally understood, this contrast between typical relational practice and the semantic Web (based on RDF and OWL) tends to hold.

From a theoretical standpoint, I have found the treatment of Patel-Schneider and Horrocks [6] to be most useful in comparing these approaches. However, the Description Logic Handbook and some other varied sources are also helpful [7,5]. Many of the technical aspects summarized in the table below come from these sources; I refer you to them for more informed technical discussions:

Relational Approach vs. (Open) Semantic Web Approach

Closed World Assumption (CWA)

That which is not known to be true is presumed to be false; to be treated as true, it must be explicitly stated. Negation as failure (NAF) is a related assumption, since it treats as false every predicate that cannot be proven true.

Everything is prohibited until it is permitted.

Open World Assumption (OWA)

The lack of a given assertion or fact being available does not imply whether that possible assertion is true or false: it simply is not known. In other words, lack of knowledge does not imply falsity.

Everything is permitted until it is prohibited.

Unique Name Assumption (UNA)

The unique name assumption (UNA) holds that different names always refer to different entities in the world.

Duplicate Labels Allowed

OWL allows different synonym labels to be used for the same object, and the same label may be applied to different objects; without the unique name assumption, distinct names may also refer to the same thing. Identity assertions must be explicitly stated.

Complete Information

The data system at hand is assumed to be complete. (Missing information is often handled via the NULL value in SQL, which has been controversial and contentious in its own right.) This is also known as the domain-closure assumption.

Incomplete Information

A central tenet of OWA is that information is incomplete. A corollary is that the attributes of specific objects or instances may also be incomplete or partially known.

Single Schema (one world)

A single schema is necessary to define the scope and interpretation of the world (domain at hand).

Many World Interpretations

Schema and data instance assertions are kept separate. Multiple interpretations (worlds) for the same data are possible.

Integrity Constraints

Integrity constraints prevent “incorrect” values from being asserted in the relational model. They are useful for validation, parsing and data input, and relate to the single model that contains only the asserted facts. Strict cardinality checks are used for validation.

Logical Axioms (restrictions)

Logical axioms provide restrictions through property domains and ranges. Everything can be true unless proven otherwise, and multiple possible models can satisfy the axioms. This provides more powerful inferencing, though it can also be unintuitive at times. Cardinality and range restrictions exhibit different behavior for objects (inferred) versus datatypes.

Non-monotonic Logic

The set of conclusions warranted on the basis of a given knowledge base does not necessarily grow as the knowledge base grows; adding new facts may retract conclusions drawn earlier [5].

Monotonic Logic

The hypotheses of any derived fact may be freely extended with additional assumptions. Additional assertions narrow the set of possible interpretations (models), but they cannot retract prior entailments: a new piece of knowledge cannot reduce what is known [5]. New knowledge can arise through inference.

Fixed and Brittle

Changing the schema requires re-architecting the database; not inherently extensible.

Reusable and Extensible

Designed from the ground up to reuse existing ontologies (axioms) and to be extensible. Database design and management can be more agile, with schema evolving incrementally.

Flat Structure; Strong Typing

Information organized into flat tables; linkages and connections between tables based on foreign keys or joins. Strong data typing orientation.

Graph Structure; Open Typing

Inherent graph structure, supportive of linkage and connectivity analysis. Datatypes are inherently loose, though axioms can add strong typing. Datatypes are treated in the same way as classes, and datatype values are treated in the same way as individual identifiers (i.e., a data value is treated as referring to an object).

Querying and Tooling

SQL and query optimizers are well developed, as is the tooling. Disjunction is not supported; negation must be accommodated through approaches such as NAF. Sums and counts are easier due to the unique name premise. Answer closure (one answer passable to a next calculation) is easier than under OWA. Most tools are not suitable for any arbitrary schema.

Querying and Tooling

SPARQL and emerging rule languages used for querying; performance at scale and with broad distribution a concern. Queries require contextual information for proper set selection. Negation and disjunction are allowed and are powerful constructs. Tools generally less developed. Exciting opportunities for ontology-driven applications working against a small set of generic tools.

In well-characterized or self-contained domains (seats on a plane, books in a library, customers of a company, products sold via distribution channels), the traditional relational model works well. A closed world assumption performs well for transaction operations and makes data validation easier. The number of negative facts about a given domain is typically much greater than the number of positive ones; in many bounded applications, the number of negative facts is so large that their explicit representation becomes practically impossible [7]. In such cases, it is simpler and shorter to state the known “true” statements than to enumerate all “false” conditions.

However, the relational model is a paradigm where the information must be complete and it must be described by a single schema. Traditional databases require an agreement on a schema, which must be made before data can be stored and queried. The relational model assumes that the only objects and relationships that exist in the domain are those that are explicitly represented in the database, and that names uniquely identify objects in this domain. The result of these assumptions is that there is a single (canonical) model for relational systems where objects and relationships are in a one-to-one correspondence with the data in the database [6].

This makes CWA and its related assumptions a very poor choice when attempting to combine information from multiple sources, to deal with uncertainty or incompleteness in the world, or to try to integrate internal, proprietary information with external data.
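To make the contrast concrete, here is a minimal sketch in Python using the rdflib library (the library choice, namespace and facts are my illustrative assumptions, not part of the formal treatments cited above). The same empty query result reads as "false" under a closed world and merely as "unknown" under an open world:

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")  # hypothetical namespace, for illustration only

g = Graph()
g.bind("ex", EX)

# The only fact asserted: Jane works for Acme.
g.add((EX.jane, EX.worksFor, EX.acme))

# Question: where is Jane based? Nothing in the graph says.
answer = g.value(EX.jane, EX.basedIn)

# Closed world reading (typical relational practice): no row found, so the
# statement is treated as false ("Jane is not based anywhere we track").
# Open world reading (RDF/OWL): no triple found, so the answer is simply
# unknown; the graph is never assumed to be a complete description.
print("basedIn assertion found:", answer is not None)
```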

The process of describing an open, semantic Web “world” can proceed incrementally, sequentially asserting new statements or conditions. The schema in the open semantic Web — the ontology — consists of sets of statements (called axioms) that describe characteristics that must be satisfied by the ontology designer’s idea of “reasonable” states of the world. Formally, such statements correspond to logical sentences, and an ontology corresponds to a logical theory [6].

Irregularity and incompleteness are toxic to relational model design. In the open semantic Web, data that is structured differently can still be stored together via RDF triple statements (subject-predicate-object). For example, OWA allows suppliers without cities and names to be stored alongside suppliers with that information. Information can be combined about similar objects or individuals even though they have different or non-overlapping attributes. Duplicate checking now occurs based on the logic of the system and not unique name evaluations. Data validation in OWA systems can become either more complicated (via testing against restriction statements) or partially easier (via inference).
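The supplier example can be sketched directly. In the snippet below (again rdflib, with a made-up vocabulary), a fully described supplier and a sparsely described one sit in the same graph; the sparse record required no schema change and no placeholder NULLs:

```python
from rdflib import Graph

ttl = """
@prefix ex: <http://example.org/suppliers/> .   # hypothetical vocabulary

ex:supplier1 ex:name  "Acme Widgets" ;
             ex:city  "Chicago" ;
             ex:phone "312-555-0100" .

# A second supplier known only by its product line: no name, no city.
ex:supplier2 ex:productLine "Fasteners" .
"""

g = Graph()
g.parse(data=ttl, format="turtle")

# Both records coexist; queries simply return whatever happens to be known.
for subject, predicate, obj in g:
    print(subject, predicate, obj)
```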

It is interesting to note that the theoretical underpinnings of CWA by Reiter [8] began to be understood about the same time (1978) that data federation and knowledge representation (KR) activities also began to come to the fore. CWA and later work on (for example) default reasoning [5] appear to have informed early work in description logics and its alternative OWA approach, which heavily influenced the development of the semantic Web languages RDF and OWL. However, the early path toward KM work based on the relational model also appears to have been set in this timeframe.

We are still reaping the whirlwind from this unfortunate early choice of the relational model for KR, KM and BI purposes. Moreover, though there is quite a bit of theoretical and logical discussion of the alternative OWA and CWA data models, there are surprisingly few discussions of what the implications of these models are for the enterprise. (That is, the elephant in the room.) The next two sections tackle this gap.

The Knowledge Management Argument for OWA

The above should make clear that the relational model and CWA are appropriate for defined and bounded systems. However, many of the new knowledge economy challenges are anything but defined and bounded. These applications all reside in the broad category of knowledge management (KM), and include such applications as data federation, data warehousing, enterprise information integration, business intelligence, competitive intelligence, knowledge representation, and so forth.

Let’s look at the characteristics of such knowledge systems and why they are more appropriately modeled through the open world assumption (OWA) rather than the relational model and CWA:

  • Knowledge is never complete — gaining and using knowledge is a process, and is never complete. A completeness assumption around knowledge is by definition inappropriate
  • Knowledge is found in structured, semi-structured and unstructured forms — structured databases represent only a portion of the structured information in the enterprise (spreadsheets and other non-relational datastores provide the remainder). Further, general estimates are that 80% of the information available to enterprises resides in documents, with a growing importance to metadata, Web pages, markup documents and other semi-structured sources. A proper data model for knowledge representation should be equally applicable to these various information forms; the open semantic language of RDF is specifically designed for this purpose
  • Knowledge can be found anywhere — the open world assumption does not imply that only open (public) information matters. It is just as true, however, that relevant information about customers, products, competitors, the environment or virtually any knowledge-based topic cannot be gained from internal information alone. The emergence of the Internet and the universal availability and access to mountains of public and shared information demands its thoughtful incorporation into KM systems. This requirement, in turn, demands OWA data models
  • Knowledge structure evolves with the incorporation of more information — our ability to describe and understand the world or our problems at hand requires inspection, description and definition. Birdwatchers, botanists and experts in all domains know well how inspection and study of specific domains leads to more discerning understanding and “seeing” of that domain. Before learning, everything is just a shade of green or a herb, shrub or tree to the incipient botanist; eventually, she learns how to discern entire families and individual plant species, all accompanied by a rich domain language. This truth of how increased knowledge leads to more structure and more vocabulary needs to be explicitly reflected in our KM systems
  • Knowledge is contextual — the importance or meaning of given information changes by perspective and context. Further, exactly the same information may be used differently or given different importance depending on circumstance. Still further, what is important to describe (the “attributes”) about certain information also varies by context and perspective. Large knowledge management initiatives that attempt to use the relational model and single perspectives or schema to capture this information are doomed in one of two ways: either they fail to capture the relevant perspectives of some users, or they take forever and massive dollars and effort to embrace all relevant stakeholders’ contexts
  • Knowledge should be coherent — coherence is the state of having internal logical consistency. A library of books organized by the Dewey Decimal Classification v. the Library of Congress Classification v. the Colon classification system (or others) is not inherently correct or wrong, but it is important that whatever system is used be applied consistently. Because of the power of OWA logics in inferencing and entailments, whatever “world” is chosen for a given knowledge representation should be coherent. Fantasies such as Avatar and the Lord of the Rings trilogy, even though not real, can be made believable and compelling by virtue of their coherence
  • Knowledge is about connections — the epistemological nature of knowledge can be argued endlessly, but I submit much of what distinguishes knowledge from information is that knowledge makes the connections between disparate pieces of relevant information. As these relationships accrete, the knowledge base grows. Again, RDF and the open world approach are essentially connective in nature. New connections and relationships tend to break brittle relational models, and
  • Knowledge is about its users defining its structure and use — since knowledge is a state of understanding by practitioners and experts in a given domain, it is also important that those very same users be active in its gathering, organization (structure) and use. Data models that allow more direct involvement and authoring and modification by users — as is inherently the case with RDF and OWA approaches — bring the knowledge process closer to hand. Besides this ability to manipulate the model directly, there are also the immediacy advantages of incremental changes, tests and tweaks of the OWA model. The schema consensus and delays from single-world views inherent to CWA remove this immediacy, and often result in delays of months or years before knowledge structures can actually be used and tested [9].

To be sure, there are many circumstances where large stores of instance data and their analysis are necessary for knowledge purposes. In these cases, hybrid CWA-OWA systems (see conclusion) may make sense.

But, as these points emphasize, the general assembly and organization of knowledge is open world in nature. Trying to fit KM and related applications into the straitjacket of the relational model is folly. The relational model and CWA for KM is the elephant in the room. Three decades of failures and disappointments affirm this fact.

The Business Argument for OWA

Besides the native match of knowledge systems with OWA, there are sound business arguments for embracing the (open) semantic enterprise as well. These arguments can be summarized as lower risk, lower cost, faster deployment, and more agile responsiveness. What is there not to love?

It should now be clear that it is possible to start small in testing the transition to a semantic enterprise. These efforts can be done incrementally and with a focus on early, high-value applications and domains.

Open world does not necessarily mean open data and it does not mean open source. Open world is simply a way to think about the information we have and how we act on it. OWA technologies are neutral to the question of open or public sources. The techniques can equivalently be applied to internal, closed, proprietary data and structures. Moreover, the technologies can themselves be used as a basis for bringing external information into the enterprise. An open world assumption merely asserts that we never have all necessary information and lacking that information does not itself lead to any conclusions.

Further, we need not abandon past practices. There is much that can be done to leverage existing assets. Indeed, those prior investments are often the requisite starting basis to inform semantic initiatives. However, in leveraging those assets, it is important that the enterprise begin to embrace and understand the open world assumption.

We also see that RDF and OWL, while important behind the scenes as a canonical data model and languages for organizing this information, need not be exposed as such to most users. Most instance data can be expressed as is with the data languages of choice such as XML, JSON or whatever. We are merely using the techniques of the (open) semantic Web as the data model to organize our information assets at hand. These assets need not themselves be represented in the native RDF or OWL languages.
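As a minimal sketch of that point (the record, namespace and mapping rule below are illustrative assumptions, not a prescribed method), a JSON instance record can be lifted into the RDF canonical model only at the integration layer, while authors keep working in JSON:

```python
import json
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/vocab/")  # hypothetical enterprise vocabulary

record = json.loads('{"id": "tool-42", "name": "Acme Crawler", "language": "Java"}')

g = Graph()
g.bind("ex", EX)
subject = URIRef("http://example.org/id/" + record["id"])

# Each remaining JSON field becomes one triple; the JSON itself is untouched.
for key, value in record.items():
    if key != "id":
        g.add((subject, EX[key], Literal(value)))

print(g.serialize(format="turtle"))
```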

Thus, open world frameworks provide some incredibly important benefits for knowledge management applications in the enterprise:

  • Domains can be analyzed and inspected incrementally
  • Schema can be incomplete and developed and refined incrementally
  • The data and the structures within these open world frameworks can be used and expressed in a piecemeal or incomplete manner
  • We can readily combine data having partial characterizations with other data having complete characterizations
  • Systems built with open world frameworks are flexible and robust; as new information or structure is gained, it can be incorporated without negating the information already resident, and
  • Open world systems can readily bridge or embrace closed world subsystems.

One might argue, as we believe, that the biggest impediment to the semantic enterprise is the mind shift necessary to start thinking about and accepting the open world premise. Again, this perspective is not applicable to all problems and domains. But, where it is, much can be left in place and leveraged with semantic technologies, so long as the enterprise begins to look at these existing assets through a different open-world lens.

In most real world circumstances, there is much we don’t know and we interact in complex and external environments. Knowledge management inherently occupies this space. Ultimately, data interoperability implies a global context. Open world is the proper logic premise for these circumstances. Via the OWA framework, we can readily change and grow our conceptual understanding and coverage of the world, including incorporation of external ontologies and data. Since this can easily co-exist with underlying closed-world data, the semantic enterprise can readily bridge both worlds.

So, we can now define the open semantic enterprise as one that embraces OWA for its knowledge management applications and engages in rapid and low-risk testing of incremental learning. The open world assumption is the proper framework to reverse decades of failure and disappointment for knowledge projects in the enterprise.

Some Open Questions about OWA

In our own discussions about ABox – TBox splits [10], we have, in essence, supported a hybrid OWA-CWA argument for the enterprise. It is beyond the scope of this current piece to describe these approaches in detail, but some of the options include local CWA, the addition of rule languages and constraints to basic OWA, use of the new OWL 2, TopQuadrant’s SPIN notation, and others [11]. I will address some of these in a later post.

There are also questions about performance and scalability with open semantic technologies. Here, too, progress is rapid, with billion triple thresholds rapidly falling with daily reports of better performance [12]. Fortunately, the incremental approach that we advocate herein dovetails well with these rapid developments. There should be no arguing the benefits of a successful incremental project in a smaller domain, perhaps repeated across multiple domains, in comparison to large, costly initiatives that never produce (even though their underlying technologies are performant).

There are also architecture issues inherent in these OWA designs. In one of our next posts, we return to the topic of Web-oriented architecture and its role in support of these OWA knowledge management initiatives.

In the end, there is no substitute for doing and learning. KM based on OWA for the open semantic enterprise can be started today, in a focused manner with tangible benefits and outcomes, at low cost and risk. Let’s push the elephant out of the room and let the learning and doing begin.


[1] For example, see Roger Sessions, 2009. Cost of IT Failure, September 28, 2009. This analysis suggests failure rates of 65% with a total estimated worldwide cost of $6.2 trillion in 2009. Commenters have raised questions as to what constitutes failure and have questioned some of the analysis assumptions; nonetheless, even with over-estimates, the scale of the numbers is alarming. See also Jorge Dominguez, 2009. The CHAOS Report 2009 on IT Project Failure, June 16, 2009, which indicates combined failure and challenge rates for IT projects have ranged from 65% to 84% over the period 1994 to 2009; and Dan Galorath, 2008. Software Project Failure Costs Billions; Better Estimation & Planning Can Help, June 7, 2008. In this report, Galorath compares and combines many of the available IT failure studies and summarizes that 3 of 5 IT projects do not do what they were supposed to for the expected costs, with 49% showing budget overruns, 47% showing higher than expected maintenance costs, and 41% failing to deliver expected business value. The anecdotal failure rate for IT projects has been claimed at 80% for years, with business intelligence and data warehousing particularly failure-prone areas. In 2001, a study by Mark N. Frolick and Keith Lindsey, Critical Factors for Data Warehouse Failures, for the Data Warehousing Institute noted that conventional wisdom puts the failure rate of data warehousing projects at 70 to 80 percent, with a then-recent study in the insurance industry finding a 90-percent failure rate. This report is useful for combining many historical studies.
[2] According to this article by Antone Gonsalves, Poor Use Of Data Integration Tools Can Waste $500,000 Annually: Gartner (April 27, 2009), which reports on a recent Gartner report, large Global 2000 companies using several data integration tools with overlapping features can reduce costs by more than $500,000 annually by eliminating redundant software and leveraging a shared services model. In a further report by Roman Stanek, Business Intelligence Projects are Famous for Low Success Rates, High Costs and Time Overruns (April 25, 2009), Gartner points to a dirty little secret in the world of data integration: the data integration technology in place is based on generations of technology layered into the enterprise over the years. Thus, technology that was purchased to solve data integration problems, and to reduce costs, is actually making the data integration problem more complex and no longer cost efficient.
[3] Here are some of my earlier postings dealing in some degree with OWA: Ontology-driven Applications Using Adaptive Ontologies, November 23, 2009; Fresh Perspectives on the Semantic Enterprise, September 28, 2009; Confronting Misconceptions with Adaptive Ontologies, August 17, 2009; Advantages and Myths of RDF, April 8, 2009; Making Linked Data Reasonable using Description Logics, Part 2, February 15, 2009, which specifically relates OWA to the ABox and TBox [4]; and, The Role of UMBEL: Stuck in the Middle with You . . ., May 11, 2008.

[4] We use the reference to “ABox” and “TBox” in accordance with our working definition for description logics:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[5] A model theory is a formal semantic theory which relates expressions to interpretations. A “model” refers to a given logical “interpretation” or “world”. (See, for example, the discussion of interpretation in Patrick Hayes, ed., 2004. RDF Semantics – W3C Recommendation, 10 February 2004.) The logic or inference system of classical model theory is monotonic. That is, it has the behavior that if S entails E then (S + T) entails E. In other words, adding information to some prior conditions or assertions cannot invalidate a valid entailment. The basic intuition of model-theoretic semantics is that asserting a statement makes a claim about the world: it is another way of saying that the world is, in fact, so arranged as to be an interpretation which makes the statement true. An assertion amounts to stating a constraint on the possible ways the world might be. In comparison, a non-monotonic logic system may include default reasoning, where one assumes a ‘normal’ general truth unless it is contradicted by more particular information (birds normally fly, but penguins don’t fly); negation-by-failure, commonly assumed in logic programming systems, where one concludes, from a failure to prove a proposition, that the proposition is false; and implicit closed-world assumptions, often assumed in database applications, where one concludes from a lack of information about an entity in some corpus that the information is false (e.g., that if someone is not listed in an employee database, that he or she is not an employee.) See further, Non-monotonic Logic from the Stanford Encyclopedia of Philosophy.
[6] Peter F. Patel-Schneider and Ian Horrocks, 2006. “Position Paper: A Comparison of Two Modelling Paradigms in the Semantic Web,” in WWW2006, May 22-26, 2006, Edinburgh, UK. See http://www.comlab.ox.ac.uk/people/ian.horrocks/Publications/download/2006/PaHo06a.pdf.
[7] Other resources include: Franz Baader, Diego Calvanese, Deborah McGuinness, Daniele Nardi, and Peter Patel-Schneider, eds., 2003. The Description Logic Handbook: Theory, Implementation and Applications, Cambridge University Press, 2003. Online access to much of the book is available at http://www.inf.unibz.it/~franconi/dl/course/; esp. Chapters 1, 2, 4 and 16 relate to this topic; Jos de Bruijn, Axel Polleres, Ruben Lara and Dieter Fensel, 2005. OWL DL vs. OWL Flight: Conceptual Modeling and Reasoning for the Semantic Web, in Proceedings of the Ninth World Wide Web Conference, Japan, May 2005. This paper argues against the use of description logics for the semantic Web; Andrew Newman, 2007. A Relational View of the Semantic Web, March 14, 2007; Hai Wang, 2006. Frames and OWL Side by Side, presented at the 9th International Protégé Conference, July 23-26, 2006, Stanford, CA; Nick Drummond and Rob Shearer, 2006. The Open World Assumption, PowerPoint presentation at The Chris Date Seminar: The Closed World of Databases Meets the Open World of the Semantic Web, e-Science Institute, Edinburgh, Scotland, 12 October 2006; Yulia Levin, 2008. Closed World Reasoning, presentation at Non-classical Logics and Applications Seminar – Winter 2008, Tel Aviv University; and Pat Hayes, 2001. “Why must the web be monotonic?”, email thread at http://lists.w3.org/Archives/Public/www-rdf-logic/2001Jul/0067.html.
[8] Raymond Reiter, 1978. “On Closed World Data Bases”, in Logic and Data Bases, H. Gallaire and J. Minker, eds., New York: Plenum Press, 55-76; see also, Raymond Reiter, 1980. “A Logic for Default Reasoning,” Artificial Intelligence, 13:81-132.
[9] See this Google search on ontology-driven applications.
[10] See this Google search on ABox-TBox articles.
[11] See, as examples: J. Heflin and H. Munoz-Avila, 2002. LCW-Based Agent Planning for the Semantic Web, in AAAI ’02 Workshop on Ontologies and the Semantic Web, AAAI Press, pp. 63-70. See http://www.cse.lehigh.edu/~heflin/pubs/lcw-aaai02.pdf (one of the first local CWA suggestions in specific regard to the semantic Web); K. Golden, O. Etzioni and D. Weld, 1994. Omnipresence Without Omniscience: Efficient Sensor Management for Planning, in Proceedings of AAAI-94 (one of the first to propose LCWA in general); Evren Sirin, Michael Smith and Evan Wallace, 2008. Opening, Closing Worlds — On Integrity Constraints, presented at OWL: Experiences and Directions (OWLED 2008), Fifth International Workshop, Karlsruhe, Germany, October 26-27, 2008; Timothy L. Hinrichs, Jui-Yi Kao and Michael R. Genesereth, 2009. Inconsistency-tolerant Reasoning with Classical Logic and Large Databases, in Proceedings of the Eighth Symposium on Abstraction, Reformulation, and Approximation (SARA2009), July 2009; S. Gómez, C. Chesñevar and G. Simari, 2008. An Argumentative Approach to Reasoning with Inconsistent Ontologies, in Proceedings of the KR Workshop on Knowledge Representation and Ontologies (KROW 2008), Conferences in Research and Practice in Information Technology, Vol. 90, pp. 11-20, T. Meyer and M. Orgun, eds., Australian Computer Society, Sydney, Australia, July 2008; and Holger Knublauch, The Object-Oriented Semantic Web with SPIN, January 18, 2009, which discusses the SPIN (SPARQL Inferencing Notation) Modeling Vocabulary, a light-weight collection of RDF properties and classes to support the use of SPARQL to specify rules and logical constraints.
[12] For example, BigOWLIM can perform reasoning against 12 billion explicit statements and load about 12,000 statements per second on a standard server; see http://www.ontotext.com/owlim/benchmarking/lubm.html; also, see Orri Erling’s blog regarding performance of the Virtuoso RDF triple store (http://www.openlinksw.com/weblog/oerling/). In any case, these performance benchmarks continue to rise steadily and indicate the suitability of RDF as an ontology integration layer.

Posted: November 16, 2009


High Visibility Problems with NYT, data.gov Show Need for Better Practices

When I say, “shot”, what do you think of? A flu shot? A shot of whisky? A moon shot? A gun shot? What if I add the term “bank”? Do you now think of someone being shot in an armed robbery of a local bank or similar?

And, now, what if I add a reference to say, The Hustler, or Minnesota Fats, or “Fast Eddie” Felson? Do you now see the connection to a pressure-packed banked pool shot in some smoky bar room?

As humans we need context to make connections and remove ambiguity. For machines, with their limited reasoning and inference engines, context and accurate connections are even more important.

Over the past few weeks we have seen announcements of two large and high-visibility linked data projects: one, a first release of references for articles concerning about 5,000 people from the New York Times at data.nytimes.com; and two, a massive exposure of 5 billion triples from data.gov datasets provided by the Tetherless World Constellation (TWC) at Rensselaer Polytechnic Institute (RPI).

On various grounds from licensing to data characterization and to creating linked data for its own sake, some prominent commentators have weighed in on what is good and what is not so good with these datasets. One of us, Mike, commented about a week ago that “we have now moved beyond ‘proof of concept’ to the need for actual useful data of trustworthy provenance and proper mapping and characterization. Recent efforts are a disappointment that no enterprise would or could rely upon.”

Reactions to that posting and continued discussion on various mailing lists warrant a more precise dissection of what is wrong and still needs to be done with these datasets [1].

Berners-Lee’s Four Linked Data “Rules”

It is useful, then, to return to first principles, namely the original four “rules” posed by Tim Berners-Lee in his design note on linked data [2]:

  1. Use URIs as names for things
  2. Use HTTP URIs so that people can look up those names
  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
  4. Include links to other URIs so that they can discover more things.

The first two rules are definitional to the idea of linked data. They cement the basis of linked data in the Web, and are not at issue with either of the two linked data projects that are the subject of this posting.

However, it is the lack of specifics and guidance in the last two rules where the breakdowns occur. Both the NYT and the RPI datasets suffer from a lack of “providing useful information” (Rule #3). And, the nature of the links in Rule #4 is a real problem for the NYT dataset.

What Constitutes “Useful Information”?

The Wikipedia entry on linked data expands on “useful information” by augmenting the original rule with the parenthetical clause “(i.e., a structured description — metadata).” But even that expansion is insufficient.

Fundamentally, what are we talking about with linked data? Well, we are talking about instances that are characterized by one or more attributes. Those instances exist within contexts of various natures. And, those contexts may relate to other existing contexts.

We can break this problem description down into three parts:

  • A vocabulary that defines the nature of the instances and their descriptive attributes
  • A schema of some nature that describes the structural relationships amongst instances and their characteristics, and, optimally,
  • A mapping to existing external schema or constructs that help place the data into context.

At minimum, ANY dataset exposed as linked data needs to be described by a vocabulary. Both the NYT and RPI datasets fail on this score, as we elaborate below. Better practice is to also provide a schema of relationships in which to embed each instance record. And, best practice is to also map those structures to external schema.

Lacking this “useful information”, especially a defining vocabulary, we cannot begin to understand whether our instances deal with drinks, bank robberies or pool shots. This lack, in essence, makes the information worthless, even though available via URL.

The data.gov (RPI) Case

With the support of NSF and various grant funding, RPI has set up the Data-Gov Wiki [3], which is in the process of converting the datasets on data.gov to RDF, placing them into a semantic wiki to enable comment and annotation, and providing that data as RSS feeds. Other demos are also being placed on the site.

As of the date of this posting, the site had a catalog of 116 datasets from the 800 or so available on data.gov, leading to these statistics:

  • 459,412,419 table entries
  • 5,074,932,510 triples, and
  • 7,564 properties (or attributes).

We’ll take one of these datasets, #319, and look a bit closer at it:

Wiki Title: Dataset 319
Name: Consumer Expenditure Survey
Agency: Department of Labor (LABOR-STAT)
data.gov Link: http://www.data.gov/details/319
No. Properties: 22
No. Triples: 1,583,236
RDF File: http://data-gov.tw.rpi.edu/raw/319/index.rdf

This report was picked solely because it had a small number of attributes (properties), and is thus easier to screen capture. The summary report on the wiki is shown by this page:

Data-gov-Wiki Dataset #319


So, we see that this specific dataset contains about 22 of the nearly 8,000 attributes across all datasets.

When we click on one of these attribute names, we are then taken to a specific wiki page that only reiterates its label. There is no definition or explanation.

When we inspect this page further, apart from the broad characterization of the dataset itself (the bulk of the page), we see at the bottom 22 undefined attributes with labels such as item code, periodicity code, seasonal, and the like. These attributes are the real structural basis for the data in this dataset.

But, what does all of this mean???

To gain a clue, now let’s go to the source data.gov site for this dataset (#319). Here is how that report looks:

Data.gov Dataset #319


Contained within this report we see a listing for additional metadata. This link tells us about the various data fields contained in this dataset; we see many of these attributes are “codes” to various data categories.

Probing further into the dataset’s technical documentation, we see that there is indeed a rich structure underneath this report, again provided via various code lookups. There are codes for geography, seasonality (adjusted or not), consumer demographic profiles and a variety of consumption categories. (See, for example, the link to this glossary page.) These are the keys to understanding the actual values within this dataset.

For example, one major dimension of the data is captured by the attribute item_code. The survey breaks down consumption expenditures within the broad categories of  Food, Housing, Apparel and Services, Transportation, Health Care, Entertainment, and Other. Within a category, there is also a rich structural breakdown. For example, expenditures for Bakery Products within Food is given a code of FHC2.

But, nowhere are these codes defined or unlocked in the RDF datasets. This absence is true for virtually all of the datasets exposed on this wiki.

So, for literally billions of triples, and 8,000 attributes, we have ABSOLUTELY NO INFORMATION ABOUT WHAT THE DATA CONTAINS OTHER THAN A PROPERTY LABEL. There is much, much rich value here in data.gov, but all of it remains locked up and hidden.

The sad truth about this data release is that it provides absolutely no value in its current form. We lack the keys to unlock the value.

To be sure, early essential spade work has been done here to begin putting in place the conversion infrastructure for moving text files, spreadsheets and the like to an RDF form. This is yeoman work important to ultimate access. But, until a vocabulary is published that defines the attributes and their codes, this value will remain locked up and hidden. And until a schema of some nature is also published that connects attributes and relations across datasets, the further value of connecting the dots will remain hidden as well.
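As a sketch of what publishing such a vocabulary could look like (the ces: namespace is hypothetical, and the code meanings are drawn from the technical documentation described above, so treat this as illustrative rather than official), even a few RDFS and SKOS statements begin to unlock the codes:

```python
from rdflib import Graph

ttl = """
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ces:  <http://example.org/labor/ces#> .   # hypothetical namespace for dataset 319

ces:item_code a rdf:Property ;
    rdfs:label "item code" ;
    rdfs:comment "Consumption expenditure category for a survey observation." .

ces:Food a skos:Concept ;
    skos:prefLabel "Food" .

ces:FHC2 a skos:Concept ;
    skos:prefLabel "Bakery Products" ;
    skos:broader ces:Food .
"""

g = Graph()
g.parse(data=ttl, format="turtle")
print(len(g), "vocabulary triples published")
```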

These datasets may meet the partial conditions of providing clickable URLs, but the crucial “useful information” as to what any of this data means is absent.

Every single dataset on data.gov has supporting references to text files, PDFs, Web pages or the like that describe the nature of the data within each dataset. Until that information is exposed and made usable, we have no linked data.

Until ontologies get created from these technical documents, the value of these data instances remain locked up, and no value can be created from having these datasets expressed in RDF.

The devil lies in the details. The essential hard work has not yet begun.

The NYT Case

Though at a much smaller scale with many fewer attributes, the NYT dataset suffers from the same failing: it too lacks a vocabulary.

So, let’s take the case of one of the lead actors in The Hustler, Paul Newman, who played the role of “Fast Eddie” Felson. Here is the NYT record for the “person” Paul Newman (which they also refer to as http://data.nytimes.com/newman_paul_per). Note the header title of Newman, Paul:

NYT 'Paul Newman Articles' Record


Click on any of the internal labels used by the NYT for its own attributes (such as nyt:first_use), and you will be given this message:

“An RDFS description and English language documentation for the NYT namespace will be provided soon. Thanks for your patience.”

We again have no idea what is meant by all of this data except for the labels used for its attributes. In this case, for nyt:first_use, we have a value of “2001-03-18”.

Hello? What? What is a “first use” for a “Paul Newman” of “2001-03-18”???

The NYT put the cart before the horse: even if minimal, they should have released their ontology first — or at least at the same time — as they released their data instances. (See further this discussion about how an ontology creation workflow can be incremental by starting simple and then upgrading as needed.)
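For illustration only, here is the kind of minimal RDFS description that would answer the question. The namespace and the comment text are guesses at the intended meaning (the date the topic was first used to index an article), not the NYT's own definitions:

```python
from rdflib import Graph

ttl = """
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix nyt:  <http://data.nytimes.com/elements/> .   # property namespace assumed for illustration

nyt:first_use a rdf:Property ;
    rdfs:label "first use" ;
    rdfs:comment "Date on which this topic was first used to index a New York Times article (guessed meaning)." ;
    rdfs:range xsd:date .
"""

g = Graph()
g.parse(data=ttl, format="turtle")
print(g.serialize(format="turtle"))
```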

Links to Other Things

Since there really are no links to other things on the Data-Gov Wiki, our focus in this section continues with the NYT dataset using our same example.

We now are in the territory of the fourth “rule” of linked data: 4. Include links to other URIs so that they can discover more things.

This will seem a bit basic at first, but before we can talk about linking to other things, we first need to understand and define the starting “thing” to which we are linking.

What is a “Newman, Paul” Thing?

Of course, without its own vocabulary, we are left to deduce what this thing “Newman, Paul”, shown in the previous screen shot, actually is. Our first clue comes from the statement that it is of rdf:type skos:Concept. By looking to the SKOS vocabulary, we see that Concept is a class and is defined as:

A SKOS concept can be viewed as an idea or notion; a unit of thought. However, what constitutes a unit of thought is subjective, and this definition is meant to be suggestive, rather than restrictive. The notion of a SKOS concept is useful when describing the conceptual or intellectual structure of a knowledge organization system, and when referring to specific ideas or meanings established within a KOS.

We also see that this instance is given a foaf:primaryTopic of Paul Newman.

So, we can deduce so far that this instance is about the concept or idea of Paul Newman. Now, looking to the attributes of this instance — that is the defining properties provided by the NYT — we see the properties of nyt:associated_article_count, nyt:first_use, nyt:last_use and nyt:topicPage. Completing our deductions, and in the absence of its own vocabulary, we can now define this concept instance somewhat as follows:

New York Times articles in the period 2001 to 2009 having as their primary topic the actor Paul Newman

(BTW, across all records in this dataset, we could see what the earliest first use was to better deduce the time period over which these articles have been assembled, but that has not been done.)

We would also re-title this instance as something like “2001-2009 NYT Articles with a Primary Topic of Paul Newman” and use URIs more in line with this usage.

sameAs Woes

Thus, in order to make links or connections with other data, it is essential to understand what the nature is of the subject “thing” at hand. There is much confusion about actual “things” and the references to “things” and what is the nature of a “thing” within the literature and on mailing lists.

Our belief and usage in matters of the semantic Web is that all “things” we deal with are a reference to whatever the “true”, actual thing is. The question then becomes:  What is the nature (or scope) of this referent?

There are actually quite easy ways to determine this nature. First, look to one or more instance examples of the “thing” being referred to. In our case above, we have the “Newman, Paul” instance record. Then, look to the properties (or attributes) the publisher of that record has used to describe that thing. Again, in the case above, we have nyt:associated_article_count, nyt:first_use, nyt:last_use and nyt:topicPage.

Clearly, this instance record — that is, its nature — deals with articles or groups of articles. The relation to Paul Newman arises because he is the primary topic of those articles, not because the instance describes him as a person. If the nature of the instance were indeed the person Paul Newman, then the attributes of the record would more properly be “person” properties such as age, sex, birth date, death date, marital status, etc.

This confusion by NYT as to the nature of the “things” they are describing then leads to some very serious errors. By confusing the topic (Paul Newman) of a record with the nature of that record (articles about topics), NYT next misuses one of the most powerful semantic Web predicates available, owl:sameAs.

By asserting in the “Newman, Paul” record that the instance has a sameAs relationship with external records in Freebase and DBpedia, the NYT both entails that properties from any of the associated records are shared and infers a chain of other types to describe the record. More precisely, the NYT is asserting that the “thing” referred to by these instances are identical resources.

Thus, by the sameAs statements in the “Newman, Paul” record, the NYT is also asserting that that record is an instance of all of the classes asserted for those external Freebase and DBpedia resources [5].

Furthermore, because of its strong, reciprocal entailments, the owl:sameAs assertion would also now entail that the person Paul Newman has the nyt:first_use and nyt:last_use attributes, clearly illogical for a “person” thing.

This connection is clearly wrong in both directions. Articles are not persons and don’t have marital status; and persons do not have first_uses. By misapplying this sameAs linkage relationship, we have screwed things up in every which way. And the error began with misunderstanding what kinds of “things” our data is about.
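The property leakage can be demonstrated in a few lines. The sketch below uses rdflib with the owlrl OWL-RL reasoner (my tooling choice), and all URIs other than owl:sameAs are simplified stand-ins for the NYT and DBpedia resources. After the closure is materialized, the "person" resource has acquired the article-tracking attribute:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL
import owlrl  # OWL-RL reasoner, assumed installed (pip install owlrl)

NYT = Namespace("http://example.org/nyt/")       # simplified stand-in URIs
DBP = Namespace("http://example.org/dbpedia/")
EX  = Namespace("http://example.org/vocab/")

g = Graph()
g.add((NYT.newman_paul_per, EX.first_use, Literal("2001-03-18")))
g.add((NYT.newman_paul_per, OWL.sameAs, DBP.Paul_Newman))

# Materialize OWL-RL entailments; owl:sameAs propagates every property both ways.
owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)

# The "person" resource now carries the article-tracking attribute.
print(g.value(DBP.Paul_Newman, EX.first_use))    # -> 2001-03-18
```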

Some Options

However, there are solutions. First, the sameAs assertions, at least involving these external resources, should be dropped.

Second, if linkages are still desired, a vocabulary such as UMBEL [4] could be used to make an assertion between such a concept and these other related resources. So, even though these resources are not the same, they are closely related. The UMBEL ontology helps us define this kind of relation between related, but non-identical, resources.

Instead of using the owl:sameAs property, we would suggest the usage of umbel:linksEntity, which links a skos:Concept to related named entity resources. Additionally, Freebase, which also currently asserts a sameAs relationship to the NYT resource, could use the umbel:isAbout relationship to assert that its resource “is about” a certain concept, which is the one defined by the NYT.

Alternatively, still other external vocabularies that more precisely capture the intent of the NYT publishers could be found, or the NYT editors could define their own properties specifically addressing their unique linkage interests.
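A sketch of this suggestion follows. The UMBEL namespace URI and the exact property names should be checked against the UMBEL documentation [4], and the Freebase URI is a simplified stand-in, so treat this as illustrative rather than normative:

```python
from rdflib import Graph

ttl = """
@prefix umbel: <http://umbel.org/umbel#> .          # namespace assumed; verify against [4]
@prefix nyt:   <http://data.nytimes.com/> .
@prefix dbr:   <http://dbpedia.org/resource/> .
@prefix fb:    <http://example.org/freebase/> .     # simplified stand-in for the Freebase URI

# The NYT concept record links to, but is not identical with, the external entities.
nyt:newman_paul_per umbel:linksEntity dbr:Paul_Newman , fb:paul_newman .

# Conversely, Freebase could assert that its entity "is about" the NYT concept.
fb:paul_newman umbel:isAbout nyt:newman_paul_per .
"""

g = Graph()
g.parse(data=ttl, format="turtle")
print(len(g), "linkage triples")
```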

Other Minor Issues

As a couple of additional, minor suggestions for the NYT dataset, we would suggest:

  • Create a foaf:Organization description of the NYT organization, then use it with dc:creator and dcterms:rightsHolder rather than using a literal (see the sketch after this list), and
  • The dual URIs such as “http://data.nytimes.com/N31738445835662083893” and “http://data.nytimes.com/newman_paul_per” are not wrong in themselves, but their purpose is hard to understand. Why does a single organization need to mint multiple URIs for the identical resource, when it comes from the same system and serves the same purpose?
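A minimal sketch of the first suggestion (the ex: URI for the organization resource is chosen purely for illustration):

```python
from rdflib import Graph

ttl = """
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix dc:      <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix nyt:     <http://data.nytimes.com/> .
@prefix ex:      <http://example.org/> .            # stand-in URI for the publisher resource

ex:nyt_org a foaf:Organization ;
    foaf:name "The New York Times Company" .

# The dataset record then points at the organization resource, not a literal string.
nyt:newman_paul_per dc:creator ex:nyt_org ;
    dcterms:rightsHolder ex:nyt_org .
"""

g = Graph()
g.parse(data=ttl, format="turtle")
print(len(g), "triples describing the publisher")
```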

Re-visiting the Linkage “Rule”

There are very valuable benefits from entailment, inference and logic to be gained from linking resources. However, if the nature of the “things” being linked — or the properties that define these linkages — are incorrect, then very wrong logical implications result. Great care and understanding should be applied to linkage assertions.

In the End, the Challenge is Not Linked Data, but Connected Data

Our critical comments are not meant to be disrespectful, nor are we being picky. The NYT and TWC are prominent institutions from which we should expect leadership on these issues. Our criticisms (and we believe those of others) are also not an expression of a “trough of disillusionment”, as some have suggested.

This posting has been jointly authored by Mike Bergman and Fred Giasson and simultaneously published on both of their blogs, hoping to draw more attention to the need for better practices in publishing linked data.

This posting is about poor practices, pure and simple. The time to correct them is now. If asked, we would be pleased to help either institution establish exemplar practices. This is not automatic, and it is not always easy. The data.gov datasets, in particular, will require much time and effort to get right. There is much documentation that needs to be transitioned and expressed in semantic Web formats.

In a broader sense, we also seem to lack a definition of best practices related to vocabularies, schema and mappings. The Berners-Lee rules are imprecise and insufficient as is. Prior guidance documents tend to focus more on how to publish and make URIs linkable than on how to properly characterize, describe and connect the data.

Perhaps, in part, this is a bit of a semantics issue. The challenge is not the mechanics of linking data, but the meaning and basis for connecting that data. Connections require logic and rationality sufficient to reliably inform inference and rule-based engines. It also needs to pass the sniff test as we “follow our nose” by clicking the links exposed by the data.

It is exciting to see high-quality content such as from national governments and major publishers like the New York Times begin to be exposed as linked data. When this content finally gets embedded into usable contexts, we should see manifest uses and benefits emerge. We hope both institutions take our criticisms in that spirit.


[1] The NYT dataset has been updated with improvements, and multiple issues from the first release have been fixed. The problems listed herein, however, still pertain after these improvements.
[2] Tim Berners-Lee, 2006. Linked Data (Design Issues), first posted on 2006-07-27; last updated on 2009-06-18. See http://www.w3.org/DesignIssues/LinkedData.html. Berners-Lee refers to the steps above as “rules,” but he elaborates they are expectations of behavior. Most later citations refer to these as “principles.”
[3] Li Ding, Dominic DiFranzo, Sarah Magidson, Deborah L. McGuinness and Jim Hendler, 2009. Data-GovWiki: Towards Linked Government Data. See http://www.cs.vu.nl/~pmika/swc/documents/Data-gov%20Wiki-data-gov-wiki-v1.pdf.
[4] UMBEL (Upper Mapping and Binding Exchange Layer) is a lightweight ontology structure in development for relating Web content and data to a standard set of subject concepts. Its purpose has led to the creation of an associated vocabulary geared to both class-instance and reciprocal relationships, as well as partial or likelihood relationships. See http://umbel.org/technical_documentation.html#vocabulary.
[5] We’d like to thank Denny Vrandecic (see comments) for pointing out an imprecision in our original wording. This phrase was originally stated as, “Thus, by the sameAs statements in the ‘Newman, Paul’ record, the NYT is also asserting that that record is the same as these other things.”
Posted:November 11, 2009

irON - instance record and Object Notation

A Case Study of Turning Spreadsheets into Structured Data Powerhouses

In a former life, I had the nickname of ‘Spreadsheet King’ (perhaps among others that I did not care to hear). I had gotten the nick because of my aggressive use of spreadsheets for financial models, competitors tracking, time series analyses, and the like. However, in all honesty, I have encountered many others in my career much more knowledgeable and capable with spreadsheets than I’ll ever be. So, maybe I was really more like a minor duke or a court jester than true nobility.

Yet, pro or amateur, there are perhaps 1 billion spreadsheet users worldwide [1], making spreadsheets undoubtedly the most prevalent data authoring environment in existence. And, despite moans and wails about how spreadsheets can lead to chaos, spaghetti code, or violations of internal standards, they are here to stay.

Spreadsheets often begin as simple notetaking environments. With the addition of new findings and more analysis, some of these worksheets may evolve to become full-blown datasets. Alternatively, some spreadsheets start from Day One as intended datasets or modeling environments. Whatever the case, clearly there is much accumulated information and data value “locked up” in existing spreadsheets.

How to “unlock” this value for sharing and collaboration was a major stimulus to development of the commON serialization of irON (instance record and Object Notation) [2]. I recently published a case study [3] that describes the reasons and benefits of dataset authoring in a spreadsheet, and provides working examples and code based on Sweet Tools [4] to aid users in understanding and using the commON notation. I summarize portions of that study herein.

This is the second article of a two-part series related to the recent Sweet Tools update.

Background on Sweet Tools and irON

The dataset that is the focus of this use case, Sweet Tools, began as an informal tracking spreadsheet about four years ago. I began it as a way to learn about available tools in the semantic Web and related spaces. I began publishing it, and since others found it of value, I continued to develop it.

As it grew over time, however, it gained in structure and size. Eventually, it became a reference dataset that many other people wanted to use and interact with. The current version has well over 800 tools listed, characterized by many structured data attributes such as type, programming language, description and so forth. As it has grown, a formal controlled vocabulary has also evolved to bring consistency to the characterization of many of these attributes.

It was natural for me to maintain this listing as a spreadsheet, a choice that was reinforced when, about three years back, I was one of the first to adopt an Exhibit presentation of the data based on a Google spreadsheet. Here is a partial view of this spreadsheet as I maintain it locally:

Sweet Tools Main Spreadsheet Screen
(click to expand)

When we began to develop irON in earnest as a simple (“naïve”) dataset authoring framework, it was clear that a comma-separated value, or CSV [5], option should join the other two serializations under consideration, XML and JSON. CSV, though a less expressive and capable data format than the other serializations, still has an attribute-value pair (also known as key-value pair, among many other variants [6]) orientation. And, via spreadsheets, datasets can be easily authored and inspected, while also providing a rich functional environment including sorting, formatting, data validation, calculations, macros, etc.

As a dataset very familiar to us as irON's editors, and directly relevant to the semantic Web, Sweet Tools provided a perfect prototype case study for helping to guide the development of irON, and specifically what came to be known as the commON serialization. The Sweet Tools dataset is relatively large for a specialty source, has many different types and attributes, and is characterized by text, images, URLs and similar content.

The premise was that if Sweet Tools could be specified and represented in commON sufficiently to be parsed and converted to interoperable RDF, then many similar instance-oriented datasets could likely be so as well. Thus, as we tried and refined notation and vocabulary, we tested applicability against the CSV representation of Sweet Tools in addition to other CSV, JSON and XML datasets.

Dataset Authoring in a Spreadsheet

A large portion of the case study describes the many advantages of authoring small datasets within spreadsheets. The useful thing about the CSV format is that the full functional capabilities of the spreadsheet are available during authoring and later updates and modifications, but, when exported, the CSV provides a relatively clean format for processing and parsing.

So, some of the reasons for small dataset authoring in a spreadsheet include:

  • Formatting and on-sheet management – the first usefulness of a spreadsheet comes from being able to format and organize the records. Records can be given background colors to highlight distinctions (new entries, for example); live URL links can be embedded; contents can be wrapped and styled within cells; and the column and row heads can be “frozen”, which is useful when scrolling large workspaces
  • Named blocks and sorting – named blocks are a powerful feature of modern spreadsheets, useful for data manipulation, printing and internal referencing by formulas and the like. Sorting within named blocks is especially important as an aid to checking consistency of terminology, record completeness, duplicates, missing values, and the like. Named blocks can also be used as references in calculations. All of these features are real time savers, especially when datasets grow large and consistency of treatment and terminology is important
  • Multiple sheets and consolidated access – commON modules can be specified on a single worksheet or multiple worksheets and saved as individual CSV files; because of its size and relative complexity, the Sweet Tools dataset is maintained on multiple sheets. Multi-worksheet environments help keep related data and notes consolidated and more easily managed on local hard drives
  • Completeness and counts – the spreadsheet COUNTA function is useful for tallying cell entries by both column and row, a helpful indicator of whether an attribute or type value is missing or a record is incomplete. Of course, similar uses can be found for many of the hundreds of functions embedded within a spreadsheet
  • Controlled vocabularies and data entry validation – quality datasets often hinge on consistent, uniform values and terminology; the data validation utilities within spreadsheets can be applied to Booleans, ranges (mins and maxes), and controlled vocabulary lists (a programmatic variant of this kind of check is sketched after this list). Here is an example for Sweet Tools, enforcing proper tool category assignments from a 50-item pick list:
Controlled Vocabularies and Data Entry Validation
  • Specialized functions and macros – all functionality of spreadsheets may be employed in the development of commON datasets; once employed, only the values embedded within the sheets are exported as CSV.
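As a programmatic complement to the data entry validation point above, here is a minimal Python sketch of the same kind of controlled-vocabulary check run against an exported CSV file. The column name and the three pick-list values are hypothetical placeholders, not the actual 50-item Sweet Tools vocabulary; the in-sheet data validation described above remains the primary mechanism.

import csv

# Hypothetical stand-ins for the controlled pick list; the real Sweet Tools
# vocabulary has roughly 50 tool categories.
ALLOWED_CATEGORIES = {"Ontology Editor", "RDF Triple Store", "Wiki-related"}

def check_vocabulary(csv_path, column="category"):
    """Flag rows whose category value falls outside the controlled list,
    mirroring outside the spreadsheet what the in-sheet pick list enforces."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f), start=2):  # row 1 is the header
            value = (row.get(column) or "").strip()
            if value and value not in ALLOWED_CATEGORIES:
                print(f"Row {i}: unexpected {column} value: {value!r}")

# check_vocabulary("swt_20091110_records.csv")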

Staging Sweet Tools for commON

The next major section of the case study deals with the minor conventions that must be followed in order to stage spreadsheets for commON. Not much of the specific commON vocabulary or notation is discussed below; for details, see [7].

Because you can create multiple worksheets within a spreadsheet, it is not necessary to modify existing worksheets or tabs. Rather, if you are reluctant to, or cannot, change existing information, merely create parallel duplicate sheets of the source information. These duplicate sheets have as their sole purpose export to commON CSV. You can maintain your spreadsheet as is while staging for commON.

To do so, use the simple = formula to create cross-references between the existing source spreadsheet tab and the target commON CSV export tab. (You can also do this for complete, highlighted blocks from source to target sheet.) Then, by adding the few minor conventions of commON, you have now created a staged export tab without modifying your source information in the slightest.

In standard form, and for Excel and OpenOffice, single quotes, double quotes and commas entered into a spreadsheet cell are automatically ‘escaped’ when issued as CSV. commON allows you to specify your own delimiter for lists (the standard is the pipe ‘|’ character) and what the parser recognizes as the ‘escape’ character (‘\’ is the standard). However, you probably should not change these defaults for most conditions.
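To make those delimiter and escape conventions concrete, here is a minimal Python sketch, an illustration only and not the official commON parser, that splits a multi-valued cell on unescaped pipe characters:

import re

def split_common_list(cell, delim="|", escape="\\"):
    """Split a commON list cell on the default pipe delimiter, honoring the
    default backslash escape character; details may differ from the official
    parsers described in the irON specification [7]."""
    pattern = re.compile(r"(?<!" + re.escape(escape) + r")" + re.escape(delim))
    parts = pattern.split(cell)
    # Restore any literal delimiters that were escaped in the source cell
    return [p.replace(escape + delim, delim).strip() for p in parts]

# Hypothetical multi-valued cell, e.g. a &programmingLanguage attribute
print(split_common_list("Java|PHP|C\\|C++"))   # -> ['Java', 'PHP', 'C|C++']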

The standard commON parsers and converters are UTF-8 compatible. If your source content has unusual encodings, try to target UTF-8 as your canonical spreadsheet output.

In the irON specification there are a small number of defined modules or processing sections. In commON, these modules are denoted by the double-ampersand character sequence (‘&&‘), and apply to lists of instance records (&&recordList), dataset specifications and associated metadata describing the dataset (&&dataset), and mappings of attributes and types to existing schema (&&linkage). Similarly, attributes and types are denoted by a single ampersand prefix (&attributeName).

In commON, any or all of the modules can occur within a single CSV file or in multiple files. In any case, the start of one of these processing modules is signaled by the module keyword and &&keyword convention.

The RecordList Module

The first spreadsheet figure above shows a Sweet Tools example for the &&recordList module. The module begins with that keyword, indicating one or more instance records will follow. Note that the first line after the &&recordList keyword is devoted to listing the attributes and types for the instance records (designated by the &attributeName convention in the columns of that first row).

The &&recordList format can also include the stacked style (see similar Dataset example below) in addition to the single row style shown above.

At any rate, once a worksheet is ready with its instance records following the straightforward irON and commON conventions, it can then be saved as a CSV file and appropriately named. Here is an example of what this “vanilla” CSV file now looks like when shown again in a spreadsheet:

Spreadsheet View of the CSV File
(click to expand)

Alternatively, you could open this same file in a text editor. Here is how this exact same instance record view looks in an editor:

Editor View of the CSV Record File
(click to expand)

Note that the CSV format separates each column by the comma separator, with escapes shown for the &description attribute when it includes a comma-separated clause. Without word wrap, each record in this format occupies a single row (though, again, for the stacked style, multiple entries are allowed on individual rows so long as a new instance record &id is not encountered in the first column).
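For readers who prefer code to screenshots, the following Python sketch shows one way such a single-row-style &&recordList CSV could be read into simple records. It is an illustration based on the conventions described above, not the actual structWSF/commON parser, and it ignores the stacked style and escape handling:

import csv

def parse_record_list(csv_path):
    """Read a commON &&recordList module in the single-row style: the row
    after the &&recordList keyword names the attributes (&id, &name, ...);
    each subsequent row is one instance record."""
    records, attributes, in_module = [], None, False
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if not any(cell.strip() for cell in row):
                continue                              # skip blank rows
            first = row[0].strip()
            if first.startswith("&&"):                # a new module begins
                in_module = (first == "&&recordList")
                attributes = None
                continue
            if not in_module:
                continue
            if attributes is None:                    # header row of &attributes
                attributes = [c.strip().lstrip("&") for c in row if c.strip()]
                continue
            values = [c.strip() for c in row[:len(attributes)]]
            records.append(dict(zip(attributes, values)))
    return records

# records = parse_record_list("swt_20091110_records.csv")
# print(len(records), records[0].get("name"))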

The Dataset Module

The &&dataset module defines the dataset parameters and provides very flexible metadata attributes to describe the dataset [8]. Note the dataset specification is exactly equivalent in form to the instance record (&&recordList) format, and also allows the single row or stacked styles (see these instance record examples), with this one being the stacked style:

The Dataset Module
(click to expand)

The Linkage Module

The &&linkage module is used to map the structure of the instance records to some structural schema, which can also include external ontologies. The module has a simple, but specific structure.

Either attributes (presented as the &attributeList) or types (presented as the &typeList) are listed sequentially by row until the listing is exhausted [8]. By convention, the second column in the listing is the targeted &mapTo value. Absent a prior &prefixList value, the &mapTo value needs to be a full URL to the corresponding attribute or type in some external schema:

The Linkage Module

Notice in the case of Sweet Tools that most values are from the actual COSMO mini-ontology underlying the listing. These need to be listed as well, since absent the specifications in commON the system has NO knowledge of linkages and mappings.
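As a rough sketch of how such a module might be consumed, the Python fragment below collects the attribute and type mappings into dictionaries keyed by local name. It reflects my reading of the conventions described here (keyword rows, then name and &mapTo pairs by row); the official commON parser may differ in detail, and any example URIs would be placeholders:

import csv

def parse_linkage(csv_path):
    """Collect &attributeList and &typeList mappings from a commON &&linkage
    module: each listing row pairs a local attribute or type (first column)
    with its &mapTo target (second column), normally a full URI."""
    mappings = {"attributes": {}, "types": {}}
    current = None
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if not row or not row[0].strip():
                continue
            first = row[0].strip()
            if first == "&&linkage":
                continue
            if first == "&attributeList":
                current = "attributes"
                continue
            if first == "&typeList":
                current = "types"
                continue
            if current and len(row) > 1 and row[1].strip():
                mappings[current][first.lstrip("&")] = row[1].strip()
    return mappings

# linkage = parse_linkage("swt_20091110_linkage.csv")
# print(linkage["attributes"])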

The Schema (structure) Module

In its current state of development, commON does not support a spreadsheet-based means for specifying the schema structure (lightweight ontology) governing the datasets [2]. Another irON serialization, irJSON, does. Either via this irJSON specification or via an offline ontology, a link reference is presently used by commON (and, therefore, Sweet Tools for this case study) to establish the governing structure of the input instance record datasets.

A spreadsheet-based schema structure for commON has been designed and tested in prototype form. commON should be enhanced with this capability in the near future [8].

Saving and Importing

If the modules are spread across more than one worksheet, then each worksheet must be saved as its own CSV file. In the case of Sweet Tools, as exhibited by its current reference spreadsheet, sweet_tools_20091110.xls, three individual CSV files get saved. These files can be named whatever you would like. However, it is essential that the names be remembered for later referencing.

My own naming convention is to use a format of appname_date_modulename.csv because it sorts well in a file manager accommodating multiple versions (dates) and keeps related files clustered. The appname in the case of Sweet Tools is generally swt. The modulename is generally the dataset, records, or linkage convention. I tend to use the date specification in the YYYYMMDD format. Thus, in the case of the records listing for Sweet Tools, the filename could be something like swt_20091110_records.csv.
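A trivial Python helper can compose such names consistently; the swt, date and module pieces below simply follow the convention just described:

from datetime import date

def common_filename(appname, module, when):
    """Compose a CSV filename in the appname_date_modulename.csv convention,
    with the date in YYYYMMDD form."""
    return f"{appname}_{when.strftime('%Y%m%d')}_{module}.csv"

for module in ("dataset", "records", "linkage"):
    print(common_filename("swt", module, date(2009, 11, 10)))
# swt_20091110_dataset.csv  swt_20091110_records.csv  swt_20091110_linkage.csv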

Once saved, these files are now ready to be imported into a structWSF [9] instance, which is where the CSV parsing and conversion to interoperable RDF occurs [8]. In this case study, we used the Drupal-based conStruct SCS system [10]. conStruct exposes the structWSF Web services via a user interface and a user permission and access system. The actual case study write-up offers more details about the import process.

Using the Dataset

We are now ready to interact with the Sweet Tools structured dataset using conStruct (assuming you have a Drupal installation with the conStruct modules) [10].

Introduction to the App

The screen capture below shows a couple of aspects of the system:

  • First, the left hand panel (according to how this specific Drupal install was themed) shows the various tools available to conStruct.  These include (with links to their documentation) Search, Browse, View Record, Import, Export, Datasets, Create Record, Update Record, Delete Record and Settings [11];
  • The Browse tree in the main part of the screen shows the full mini-ontology that classifies Sweet Tools. Via simple inferencing, clicking on any parent link displays all children projects for that category as well (click to expand):
conStruct (Drupal) Browse Screen for Sweet Tools (click to expand)

One of the absolutely cool things about this framework is that all tools, inferencing, user interfaces and data structures are a direct result of the ontology(ies) underlying the system (plus the irON instance ontology). This means that switching or adding datasets causes the entire system structure to reflect those changes — without lifting a finger!

Some Sample Uses

Here are a few sample things you can do with these generic tools driven by the Sweet Tools dataset:

Note: if you access this conStruct instance you will do so as a demo user. Unfortunately, as such, you may not be able to see all of the write and update tools, which in this case are reserved for curators or admins. Recall that structWSF has a comprehensive user access and permissions layer.

Exporting in Alternative Formats

Of course, one of the real advantages of the irON and structWSF designs is to enable different formats to be interchanged and to interoperate. Upon submission, the commON format and its datasets can then be exported in these alternate formats and serializations [8]:

  • commON
  • irJSON
  • irXML
  • N-Triples/CSV
  • N-Triples/TSV
  • RDF+N3
  • RDF+XML

As should be obvious, one of the real benefits of the irON notation — in addition to easy dataset authoring — is the ability to more-or-less treat RDF, CSV, XML and JSON as interoperable data formats.

The Formal Case Study

The formal Sweet Tools case study based on commON, with sample download files and PDF, is available from Annex: A commON Case Study using Sweet Tools, Supplementary Documentation [3].


[1] In 2003, Microsoft estimated its worldwide users of the Excel spreadsheet, which then had about a 90% market share globally, at 400 million. Others at that time estimated unauthorized use to perhaps double that amount. There has been significant growth since then, and online spreadsheets such as Google Docs and Zoho have also grown wildly. This surely puts spreadsheet users globally into the 1 billion range.
[2] See Frédérick Giasson and Michael Bergman, eds., Instance Record and Object Notation (irON) Specification, Specification Document, version 0.82, 20 October 2009.  See http://openstructs.org/iron/iron-specification. Also see the irON Web site, Google discussion group, and code distribution site.
[3] Michael Bergman, 2009. Annex: A commON Case Study using Sweet Tools, Supplementary Documentation, prepared by Structured Dynamics LLC, November 10, 2009. See http://openstructs.org/iron/common-swt-annex. It may also be downloaded as a PDF.
[4] See Michael K. Bergman’s AI3:::Adaptive Information blog, Sweet Tools (Sem Web). In addition, the commON version of Sweet Tools is available at the conStruct site.
[5] The CSV mime type is defined in Common Format and MIME Type for Comma-Separated Values (CSV) Files [RFC 4180]. A useful overview of the CSV format is provided by The Comma Separated Value (CSV) File Format. Also, see that author’s related CTX reference for a discussion of how schema and structure can be added to the basic CSV framework; see http://www.creativyst.com/Doc/Std/ctx/ctx.htm, especially the section on the comma-delimited version (http://www.creativyst.com/Doc/Std/ctx/ctx.htm#CTC).
[6] An attribute-value system is a basic knowledge representation framework comprising a table with columns designating “attributes” (also known as properties, predicates, features, parameters, dimensions, characteristics or independent variables) and rows designating “objects” (also known as entities, instances, exemplars, elements or dependent variables). Each table cell therefore designates the value (also known as state) of a particular attribute of a particular object. This is the basic table presentation of a spreadsheet or relational data table.

Attribute-values can also be presented as pairs in the form of an associative array, where the first item listed is the attribute, often followed by a separator such as the colon, and then the value. JSON and many simple data struct notations follow this format. This format may also be called attribute-value pairs, key-value pairs, name-value pairs, alists or others. In these cases the “object” is implied, or is introduced as the name of the array.

[7] See especially SUB-PART 3: commON PROFILE in, Frédérick Giasson and Michael Bergman, eds., Instance Record and Object Notation (irON) Specification, Specification Document, version 0.82, 20 October 2009.
[8] As of the date of this case study, some of the processing steps in the commON pipeline are manual. For example, the parser creates an intermediate N3 file that is actually submitted to the structWSF. Within a week or two of publication, these capabilities should be available as a direct import to a structWSF instance. However, there is one exception to this:  the specification for the schema structure. That module has been prototyped, but will not be released with the first commON upgrade. That enhancement is likely a few weeks off from the date of this posting. Please check the irON or structWSF discussion groups for announcements.
[9] structWSF is a platform-independent Web services framework for accessing and exposing structured RDF data, with generic tools driven by underlying data structures. Its central perspective is that of the dataset. Access and user rights are granted around these datasets, making the framework enterprise-ready and designed for collaboration. Since a structWSF layer may be placed over virtually any existing datastore with Web access — including large instance record stores in existing relational databases — it is also a framework for Web-wide deployments and interoperability.
[10] conStruct SCS is a structured content system built on the Drupal content management framework. conStruct enables structured data and its controlling vocabularies (ontologies) to drive applications and user interfaces. It is based on RDF and SD’s structWSF platform-independent Web services framework [6]. In addition to user access control and management and a general user interface, conStruct provides Drupal-level CRUD, data display templating, faceted browsing, full-text search, and import and export over structured data stores based on RDF.
[11] More Web services are being added to structWSF on a fairly constant basis, and the existing ones have been through a number of upgrades.
Posted:November 2, 2009

Structured Dynamics LLC

A New Slide Show Consolidates, Explains Recent Developments

Much has been happening on the Structured Dynamics front of late. Besides welcoming Steve Ardire as a senior advisor to the company, we also have been issuing a steady stream of new products from our semantic Web pipeline.

This new slide show attempts to capture these products and relate them to the various layers in Structured Dynamics’ enterprise product stack:

The show indicates the role of scones, irON, structWSF, UMBEL, conStruct and others and how they leverage existing information assets to enable the semantic enterprise. And, oh, by the way, all of this is done via Web-accessible linked data and our practical technologies.

Enjoy!

Posted:October 11, 2009

The Marshal Has Come to Town

A Marshal to Bring Order to the Town of Data Gulch

Though not the first to do so, I have been touting the Linked Data Law for a couple of years now [1]. But in a conversation last week, I found that my colleague did not find the premise very clear. I suspect that is due both to cryptic language on my part and to the fact that no one has really tackled the topic with focus. So, in this post, I try to redress that and also comment on the related role of linked data in the semantic enterprise.

Adding connections to existing information via linked data is a powerful force multiplier, similar to Metcalfe’s law for how the value of a network increases with more users (nodes). I have come to call this the Linked Data Law: the value of a linked data network is proportional to the square of the number of links between data objects.

“In the network economy, the connections are as important as the nodes.” [2]

An early direct mention of the semantic Web and its possible ability to generate network effects comes from a 2003 Mitre report for the government [3]. In it, the authors state, “At present a very small proportion of the data exposed on the web is marked up using Semantic Web vocabularies like RDF and OWL. As more data gets mapped to ontologies, the potential exists to achieve a ‘network effect’.” Prescient, for sure.

In July 2006, both Henry Story and Dion Hinchcliffe discussed Metcalfe’s law, with Henry specifically looking to relate it to the semantic Web [4]. He noted that his initial intuition was that “the value of your information grows exponentially with your ability to combine it with new information.” He noted he was trying to find ways to adapt Metcalfe’s law for applicability to the semantic Web.

I picked up on those observations and commented to Henry at that time and in my own post, “The Exponential Driver of Combining Information.” I have been enamored of the idea ever since, and have begun to weave it into my writings.

More recently, in late 2008, James Hendler and Jennifer Golbeck devoted an entire paper to Metcalfe’s law and the semantic Web [5]. In it, they note:

“This linking between ontologies, and between instances in documents that refer to terms in another ontology, is where much of the latent value of the Semantic Web lies. The vocabularies, and particularly linked vocabularies using URIs, of the Semantic Web create a graph space with the ability to link any term to any other. As this link space grows with the use of RDF and OWL, Metcalfe’s law will once again be exploited – the more terms to link to, and the more links created, the more value in creating more terms and linking them in.”

A Refresher on Metcalfe’s Law

Metcalfe’s law states that the value of a telecommunications network is proportional to the square of the number of users of the system (n²) (note: it is not exponential, as some of the points above imply). Robert Metcalfe formulated it about 1980 in relation to Ethernet and fax machines; the “law” was then named for Metcalfe and popularized by George Gilder in 1993.

These attempts to estimate the value of physical networks were in keeping with earlier efforts to estimate the value of a broadcast network. That value is almost universally agreed to be proportional to the number of users, a relationship known as Sarnoff’s law (see further below).

The actual formula proposed by Metcalfe calculates the number of unique connections in a network with n nodes as n(n − 1)/2, which is proportional to n². This makes Metcalfe’s law a quadratic growth equation.
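A short Python check reproduces the counts shown in the diagram that follows:

def metcalfe_connections(n):
    """Unique pairwise connections in a fully connected network of n nodes."""
    return n * (n - 1) // 2

for n in (2, 5, 12):
    print(n, "nodes ->", metcalfe_connections(n), "connections")
# 2 nodes -> 1, 5 nodes -> 10, 12 nodes -> 66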

As nodes get added, then, we see the following increase in connections:

Metcalfe Law Network Effect

‘Network Effect’ for Physical Networks

This diagram, modified from Wikipedia to be a horizontal image, shows how two telephones can make only one connection, five can make 10 connections, and twelve can make 66 connections, etc.

By definition, a physical network is a connected network. Thus, every time a new node is added to the network, connections are added, too. This general formula has also been embraced as a way to discuss social connections on the Internet [6].

Analogies to Linked Data

Like physical networks, the interconnectedness of the semantic Web or semantic enterprise is a graph.

The idea behind linked data is to make connections between data. Unlike physical telecommunication networks, however, the nodes in the form of datasets and data are (largely) already there. What is missing are the connections. The build-out and growth that produces the network effects in a linked data context do not result from adding more nodes, but from the linking or connecting of existing nodes.

The fact that adding a node to a physical network carries with it an associated connection has tended to conjoin these two complementary requirements of node and connection. But, to grok the real dynamics and to gain network effects, we need to realize: Both nodes and connections are necessary.

One circumstance of the enterprise is that data nodes are everywhere. The fact that the overwhelming majority are unconnected is why we have adopted the popular colloquialism of data “silos”. There are also massive amounts of unconnected data on the Web in the form of dynamic databases only accessible via search form, and isolated data tables and listings virtually everywhere.

Thus, the essence of the semantic enterprise and the semantic Web is no more complicated than connecting — meaningfully — data nodes that already exist.

As the following diagram shows, unconnected data nodes or silos look like random particles caught in the chaos of Brownian motion:

Linked Data Law Network Effect

‘Network Effect’ for Coherent Linked Data

As initial connections get made, bits of structure begin to emerge. But, as connections proliferate — exactly equivalent to the network effects of connected networks — coherence and value emerge.

Look at the last part in the series diagram above. We not only see that the same nodes are now all connected, with the inferences and relationships that result from those connections, but we can also see entirely new structures emerge by virtue of those connections. All of this structure and meaning was totally absent prior to making the linked data connections.

Quantifying the Network Effect

So, what is the benefit of this linked data? It depends on the product of the value of the connections and the multiplier of the network effect:

linked data benefit = connections value × network effect multiplier

Just as it is hard to have a conversation via phone with yourself, or to collaborate with yourself, the ability to gain perspective and context from data comes from connections. But like some phone calls or some collaborations, the value depends on the participants. In the case of linked data, that depends on the quality of the data and its coherence [7]. The value “constant” for connected linked data depends in some manner on these factors, as well as the purposes and circumstances to which that linked data might be applied.

Even in physical networks or social collaboration contexts, the “value” of the network has been hard to quantify. And, while academics and researchers will appropriately and naturally call for more research on these questions, we do not need to be so timid. Whatever the alpha constant is for quantifying the value of a linked data network, our intuition should be clear that making connections, finding relationships, making inferences, and making discoveries cannot occur when data is in isolation.

Because I am an advocate, I believe this alpha constant of value to be quite large. I believe this constant is also higher for circumstances of business intelligence, knowledge management and discovery.

The second part of the benefit equation is the multiplier for network effects. We’ve mentioned the linear growth assumption for broadcast networks (Sarnoff’s law) and the standard quadratic growth assumption for physical and social networks (Metcalfe’s law). Naturally, there have been other estimates and advocacies.

David Reed [8], for example, also adds group effects and has asserted an exponential multiplier to the network effect (like Henry Story’s initial intuition noted above). As he states,

“[E]ven Metcalfe’s Law understates the value created by a group-forming network [GFN] as it grows. Let’s say you have a GFN with n members. If you add up all the potential two-person groups, three-person groups, and so on that those members could form, the number of possible groups equals 2ⁿ. So the value of a GFN increases exponentially, in proportion to 2ⁿ. I call that Reed’s Law. And its implications are profound.”

Yet not all agree with the assertion of an exponential multiplier, let alone the quadratic one of Metcalfe. Odlyzko and Tilly [9] note that Metcalfe’s law would hold if the value that an individual gets personally from a network is directly proportional to the number of people in that network. But, then they argue that does not hold because of local preferences or different qualities of interaction. In a linked data context, such arguments have merit, though you may also want to see Metcalfe’s own counter-arguments [6].
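As a rough numerical illustration of how steeply these candidate multipliers diverge, the Python sketch below uses the simple form of each law with constants omitted:

def sarnoff(n):
    """Broadcast value: proportional to the number of users (linear)."""
    return n

def metcalfe(n):
    """Physical/social network value: unique connections, n(n - 1)/2."""
    return n * (n - 1) // 2

def reed(n):
    """Group-forming network value: possible groups, roughly 2**n."""
    return 2 ** n

for n in (5, 10, 20, 40):
    print(f"n={n:>2}  Sarnoff={sarnoff(n):>3}  Metcalfe={metcalfe(n):>4}  Reed={reed(n):,}")

Even at modest node counts the exponential estimate runs away, which is one practical reason to be skeptical of it for linked data, as argued below.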

Hinchcliffe’s earlier commentary [4] provided a nice graphic that shows the implications of these various multipliers on the network effect, as a function of the nodes in a network:

Potency of the Network Effect from Dion Hinchcliffe

Various Estimates for the ‘Network Effect’

I believe we can dismiss the lower linear bound and likely the higher exponential one as well (that is, Reed’s law), because quality and relevance considerations make some linked data connections less valuable than others. Per the above, that suggests the multiplier for a linked data network is perhaps closer to the Metcalfe estimate.

In any event, it is also essential to point out that connecting data indiscriminately for linked data’s sake will likely deliver few, if any, benefits. Connections must still be coherent and logical for the value benefits to be realized.

The Role and Contribution of Linked Data

I elsewhere discuss the role of linked data in the enterprise and will continue to do so. But, there are some implications in the above that warrant some further observations.

It should be clear that the graph and network basis of linked data, not to mention some of the uncertainties as to quantifying benefits, suggests the practice should be considered apart from mission-critical or transactional uses in the enterprise. That may change with time and experience.

There are also open questions about the quality of the data inputs to linked data, and about possibly erroneous semantics and ontologies guiding the connections; operational uses should be kept off the table for now. As with physical networks, not all links perform well and not all are useful; poor connections should either be taken off-ledger or relegated to a back-up basis. Linked data should be understood and treated no differently than any other network of variable quality.

Such realism is important — for both internal and external linked data advocates — to allow linked data to be applied in the right venues at acceptable risk and with likely demonstrable benefits. Elsewhere I have advocated an approach that builds on existing assets; here I advocate a clear and smart understanding of where linked data can best deliver network effects in the near term.

And, so, in the nearest term, enterprise applications that best fit linked data promises and uncertainties include:

  • Establishing frameworks for data federation
  • Business intelligence
  • Discovery
  • Knowledge management and knowledge resources
  • Reasoning and inference
  • Development of internal common language
  • Learning and adopting data-driven apps [10], and
  • Staging and analysis for data cleaning.

A New Deputy Has Come to Town

As in the Wild West, the new deputy marshal and his tin badge did not guarantee prosperity. But a good marshal would deliver law and order. And those are the preconditions for the town folk to take charge of building their own prosperity.

Linked data is a practice for starting to bring order and connections to your existing data. Once some order has been imposed, the framework then becomes a basis for defining meanings and then gaining value from those connections.

Once order has been gained, it is up to the good citizens of Data Gulch to then deliver the prosperity. Broad participation and the network effect are one way to promote that aim. But success and prosperity still depend on intelligence and good policies and practice.


[1] I first put forward this linked data aspect in What is Linked Data?, dated June 23, 2008. I then formalized it in Structure the World, dated August 3, 2009.
[2] Paul Tearnen, 2006. “Integration in the Network Economy,” Information Management Special Reports, October 2006. See http://www.information-management.com/specialreports/20061010/1064941-1.html.
[3] Salim K. Semy, Mark Linderman and Mary K. Pulvermacher, 2003. “Information Management Meets the Semantic Web,” DOD Report by MITRE Corporation, November 2003, 10 pp. See http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA460265&Location=U2&doc=GetTRDoc.pdf.
[4] On July 15, 2006, Dion Hinchcliffe wrote, Web 2.0′s Real Secret Sauce: Network Effects. He produced a couple of useful graphics and expanded upon some earlier comments to the Wall Street Journal. Shortly thereafter, on July 29, Story wrote his own post, RDF and Metcalfe’s law, as noted. I commented on July 30.
[5] James Hendler and Jennifer Golbeck, 2008. “Metcalfe’s Law, Web 2.0, and the Semantic Web,” in Journal of Web Semantics 6(1):14-20, 2008. See http://www.cs.umd.edu/~golbeck/downloads/Web20-SW-JWS-webVersion.pdf.
[6] Robert Metcalfe, 2006. Metcalfe’s Law Recurses Down the Long Tail of Social Networking, see http://vcmike.wordpress.com/2006/08/18/metcalfe-social-networks/.
[7] See my When is Content Coherent? posting of July 25, 2008. ‘Coherence’ is a frequent theme of my blog posts; see my chronological listing for additional candidates.
[8] From David P. Reed, 2001. “The Law of the Pack,” Harvard Business Review, February 2001, pp 23-4. For more on Reed’s position, see Wikipedia’s entry on Reed’s law.
[9] Andrew Odlyzko and Benjamin Tilly, 2005. A Refutation of Metcalfe’s Law and a Better Estimate for the Value of Networks and Network Interconnections, personal publication; see http://www.dtc.umn.edu/~odlyzko/doc/metcalfe.pdf.
[10] Data-driven applications are the term we have adopted for modular, generic tools that operate and present results to users based on the underlying data structures that feed them. See further the discussion of Structured Dynamics’s products.
