Evolution
AI³
Adaptive Information
Adaptive Innovation
Adaptive Infrastructure
a·dap·tive adj. Showing or having a capacity to make fit for new or special situations; flexible; a successful adjustment.

Date:   January 12, 2010

Seven Pillars of the Open Semantic Enterprise

Guideposts for How to Make the Transition

The beginning of a new year and a new decade is a perfect opportunity to take stock of how the world is changing and how we can change with it. Over the past year I have been writing on many foundational topics relevant to the use of semantic technologies in enterprises.

In this post I bring those threads together to present a unified view of these foundations — some seven pillars — to the open semantic enterprise.

By open semantic enterprise we mean an organization that uses the languages and standards of the semantic Web, including RDF, RDFS, OWL, SPARQL and others, to integrate existing information assets, using the best practices of linked data and the open world assumption, and targeting knowledge management applications. It does so using some or all of the seven foundational pieces (“pillars”) noted herein.

The foundational approaches to the open semantic enterprise do not necessarily mean open data nor open source (though they are suitable for these purposes with many open source tools available [3]). The techniques can equivalently be applied to internal, closed, proprietary data and structures. The techniques can themselves be used as a basis for bringing external information into the enterprise. ‘Open’ is in reference to the critical use of the open world assumption.

These practices do not require replacing current systems and assets; they can be applied equally to public or proprietary information; and they can be tested and deployed incrementally at low risk and cost. The very foundations of the practice encourage a learn-as-you-go approach and active and agile adaptation. While embracing the open semantic enterprise can lead to quite disruptive benefits and changes, the transition can itself be accomplished with minimal disruption. This is its most compelling aspect.

Like any change in practice or learning, embracing the open semantic enterprise is fundamentally a people process. This is the pivotal piece to the puzzle, but also the one that does not lend itself to a ready formula about pillars or best practices. Leadership and vision are necessary to begin the process. People are the fuel for impelling it. So, we’ll take this fuel as a given below, and concentrate instead on the mechanics and techniques by which this vision can be achieved. In this sense, then, there are really eight pillars to the open semantic enterprise, with people residing at the apex.

This article is synthetic, with links to (largely) my preparatory blog postings and topics that preceded it. Assuming you are interested in becoming one of those leaders who wants to bring the benefits of an open semantic enterprise to your organization, I encourage you to follow the reference links for more background and detail.

A Review of the Benefits

OK, so what’s the big deal about an open semantic enterprise and why should my organization care?

We should first be clear that the natural scope of the open semantic enterprise is in knowledge management and representation [1]. Suitable applications include data federation, data warehousing, search, enterprise information integration, business intelligence, competitive intelligence, knowledge representation, and so forth [2]. In the knowledge domain, the benefits for embracing the open semantic enterprise can be summarized as greater insight with lower risk, lower cost, faster deployment, and more agile responsiveness.

The intersection of knowledge domain, semantic technologies and the approaches herein means it is possible to start small in testing the transition to a semantic enterprise. These efforts can be done incrementally and with a focus on early, high-value applications and domains.

There is absolutely no need to abandon past practices. There is much that can be done to leverage existing assets. Indeed, those prior investments are often the requisite starting basis to inform semantic initiatives.

Embracing the pillars of the open semantic enterprise brings these knowledge management benefits:

  • Domains can be analyzed and inspected incrementally
  • Schema can be incomplete and developed and refined incrementally
  • The data and the structures within these frameworks can be used and expressed in a piecemeal or incomplete manner
  • Data with partial characterizations can be combined with other data having complete characterizations
  • Systems built with these frameworks are flexible and robust; as new information or structure is gained, it can be incorporated without negating the information already resident, and
  • Both open and closed world subsystems can be bridged.

Moreover, by building on successful Web architectures, we can also put in place loosely coupled, distributed systems that can grow and interoperate in a decentralized manner. These also happen to be perfect architectures for flexible collaboration systems and networks.

These benefits arise both from individual pillars in the open semantic enterprise foundation, as well as in the interactions between them. Let’s now re-introduce these seven pillars.

Pillar #1: The RDF Data Model

As I stated on the occasion of the 10th birthday of the Resource Description Framework data model, I believe RDF is the single most important foundation to the open semantic enterprise [4]. RDF can be applied equally to all structured, semi-structured and unstructured content. By defining new types and predicates, it is possible to create more expressive vocabularies within RDF. This expressiveness enables RDF to define controlled vocabularies with exact semantics. These features make RDF a powerful data model and language for data federation and interoperability across disparate datasets.

Via various processors or extractors, RDF can capture and convey the metadata or information in unstructured (say, text), semi-structured (say, HTML documents) or structured sources (say, standard databases). This makes RDF almost a “universal solvent” for representing data structure.

Because of this universality, there are now more than 150 off-the-shelf ‘RDFizers’ for converting various non-RDF notations (data formats and serializations) to RDF [5]. Because of its diversity of serializations and simple data model, it is also easy to create new converters. Once in a common RDF representation, it is easy to incorporate new datasets or new attributes. It is also easy to aggregate disparate data sources as if they came from a single source. This enables meaningful compositions of data from different applications regardless of format or serialization.
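To make the aggregation point concrete, here is a minimal sketch in plain Python, not tied to any RDF library. Triples are modeled as (subject, predicate, object) tuples. The example.org URIs, the two source datasets and the helper names merge_graphs and describe are all hypothetical and chosen only for illustration.

```python
# Hypothetical namespace, used only for illustration.
EX = "http://example.org/"

# Source A: a partial description, say from an internal customer list.
crm = {
    (EX + "acme", EX + "name", "Acme Corp"),
    (EX + "acme", EX + "city", "Iowa City"),
}

# Source B: another partial description, say from a public Web source.
web = {
    (EX + "acme", EX + "name", "Acme Corp"),
    (EX + "acme", EX + "homepage", "http://acme.example.org/"),
}

def merge_graphs(*graphs):
    """Union of triple sets: duplicates collapse, nothing is overwritten."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged

def describe(graph, subject):
    """Collect predicate -> set of objects for one subject."""
    description = {}
    for s, p, o in graph:
        if s == subject:
            description.setdefault(p, set()).add(o)
    return description

merged = merge_graphs(crm, web)
acme = describe(merged, EX + "acme")
```

Because both sources name the same subject with the same URI, the merged description simply carries more attributes. A new attribute arriving later is just one more triple; no existing structure has to be altered.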

What this practically means is that the integration layer can be based on RDF, but that all source data and schema can still reside in their native forms [6]. If it is easier or more convenient to author, transfer or represent data in non-RDF forms, great [7]. RDF is only necessary at the point of federation, and not all knowledge workers need be versed in the framework.
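As an illustration of converting only at the point of federation, the following sketch turns comma-separated text into triples using the Python standard library. It is a toy converter under stated assumptions, not one of the RDFizers mentioned above: the base URI, the record/ and prop/ path conventions and the function name rdfize_csv are hypothetical.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical base URI

def rdfize_csv(text, id_column, base=BASE):
    """Yield (subject, predicate, object) triples from CSV text.

    Each row becomes a subject; each non-empty cell becomes one triple.
    """
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        subject = base + "record/" + row[id_column]
        for column, value in row.items():
            if column == id_column or value is None or value == "":
                continue  # a missing cell simply produces no triple
            yield (subject, base + "prop/" + column, value)

# The source data stays in its native, spreadsheet-friendly form.
source = "id,name,city\n1,Acme Corp,Iowa City\n2,Globex,\n"
triples = list(rdfize_csv(source, "id"))
```

The author of the spreadsheet never sees RDF; the conversion happens only when the data is brought into the integration layer.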

Pillar #2: Linked Data Techniques

Linked data is a set of best practices for publishing and deploying instance and class data using the RDF data model. Two of the best practices are to name the data objects using uniform resource identifiers (URIs), and to expose the data for access via the HTTP protocol. Both of these practices enable the Web to become a distributed database, which means that Web architectures can also be readily employed (see Pillar #5 below).
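The two practices named above can be sketched with the Python standard library alone. The host data.example.org, the URI pattern and the helper names are hypothetical. The request is only prepared, not sent, so the sketch runs without a network. The text/turtle media type is the registered type for the Turtle serialization of RDF.

```python
from urllib.parse import quote
from urllib.request import Request

BASE = "http://data.example.org/"  # hypothetical host

def mint_uri(kind, local_id, base=BASE):
    """Build an HTTP URI that names one data object."""
    return f"{base}{quote(kind)}/{quote(str(local_id))}"

def rdf_request(uri, media_type="text/turtle"):
    """Prepare (but do not send) an HTTP GET asking for an RDF serialization."""
    return Request(uri, headers={"Accept": media_type}, method="GET")

uri = mint_uri("customer", "A 42")
req = rdf_request(uri)
```

Any client that can issue an HTTP GET can then retrieve a description of the object by its name, which is what lets the Web act as a distributed database.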

Linked data is applicable to public or enterprise data, open or proprietary. It is really straightforward to employ. Structured Dynamics has published a useful FAQ on linked data.

Additional linked data best practices relate to how to characterize and classify data, especially in the use of predicates with the proper semantics for establishing the degree of relatedness for linked data items from disparate sources.

Linked data has been a frequent topic of this blog, including how adding linkages creates value for existing data, with a four-part series about a year ago on linked data best practices [8]. As advocated by Structured Dynamics, our linked data best practices are geared to data interconnections, interrelationships and context that is equally useful to both humans and machine agents.

Pillar #3: Adaptive Ontologies

Ontologies are the guiding structures for how information is interrelated and made coherent using RDF and its related schema and ontology vocabularies, RDFS and OWL [10]. Thousands of off-the-shelf ontologies exist — a minority of which are suitable for re-use — and new ones appropriate to any domain or scope at hand can be readily constructed.

In standard form, semantic Web ontologies may range from the small and simple to the large and complex, and may perform the roles of defining relationships among concepts, integrating instance data, orienting to other knowledge and domains, or mapping to other schema [11]. These are explicit uses in the way that we construct ontologies; we also believe it is important to keep concept definitions and relationships expressed separately from instance data and their attributes [9].
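The separation of concept definitions from instance data can be sketched as a simple partition of triples. This is a deliberately simplified rule of thumb, assumed for illustration only: statements that use a small set of RDFS schema predicates, or that declare an OWL class or property, are treated as schema, and everything else as instance data. The ex: names and the function name split_boxes are hypothetical.

```python
RDF_TYPE = "rdf:type"
SCHEMA_PREDICATES = {"rdfs:subClassOf", "rdfs:domain", "rdfs:range"}
SCHEMA_TYPES = {"owl:Class", "owl:ObjectProperty", "owl:DatatypeProperty"}

def split_boxes(triples):
    """Partition triples into (schema, instance) lists.

    Simplified assumption: schema statements are those using an RDFS
    schema predicate or declaring an OWL class or property.
    """
    schema, instances = [], []
    for s, p, o in triples:
        is_schema = p in SCHEMA_PREDICATES or (p == RDF_TYPE and o in SCHEMA_TYPES)
        (schema if is_schema else instances).append((s, p, o))
    return schema, instances

triples = [
    ("ex:Automobile", "rdf:type", "owl:Class"),           # concept declaration
    ("ex:Automobile", "rdfs:subClassOf", "ex:Vehicle"),   # concept relationship
    ("ex:car42", "rdf:type", "ex:Automobile"),            # instance membership
    ("ex:car42", "ex:color", "red"),                      # instance attribute
]
schema, instances = split_boxes(triples)
```

Keeping the two lists apart means the conceptual structure can be refined without touching instance records, and new records can be loaded without editing the ontology.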

But, in addition to these standard roles, we also look to ontologies to stand on their own as guiding structures for ontology-driven applications (see next pillar). With relatively few minor new best practices, ontologies can take on the double role of informing user interfaces in addition to standard information integration.

In this vein we term our structures adaptive ontologies [11,12,13]. Some of the user interface considerations that can be driven by adaptive ontologies include: attribute labels and tooltips; navigation and browsing structures and trees; menu structures; auto-completion of entered data; contextual dropdown list choices; spell checkers; online help systems; etc. Put another way, what makes an ontology adaptive is to supplement the standard machine-readable purpose of ontologies to add human-readable labels, synonyms, definitions and the like.
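A minimal sketch of how human-readable annotations might drive two of the interface features listed above, auto-completion and tooltips. The dictionary layout, the concept names and the functions autocomplete and tooltip are hypothetical; the idea of a preferred label plus alternative labels follows common vocabulary practice (for example, SKOS prefLabel and altLabel).

```python
# Hypothetical annotations attached to ontology concepts.
ontology = {
    "ex:Automobile": {
        "prefLabel": "Automobile",
        "altLabels": ["Car", "Motor vehicle"],
        "definition": "A self-propelled road vehicle for passengers.",
    },
    "ex:Bicycle": {
        "prefLabel": "Bicycle",
        "altLabels": ["Bike", "Cycle"],
        "definition": "A two-wheeled, pedal-driven vehicle.",
    },
}

def autocomplete(prefix, ontology):
    """Return (preferred label, concept) pairs whose preferred or
    alternative labels start with the typed prefix, ignoring case."""
    prefix = prefix.casefold()
    hits = []
    for concept, notes in ontology.items():
        labels = [notes["prefLabel"], *notes.get("altLabels", [])]
        if any(label.casefold().startswith(prefix) for label in labels):
            hits.append((notes["prefLabel"], concept))
    return sorted(hits)

def tooltip(concept, ontology):
    """Return the definition text to show when hovering over a concept."""
    return ontology[concept].get("definition", "")

suggestions = autocomplete("car", ontology)
```

A user who types a synonym is still led to the preferred concept, and adding a new synonym is an edit to the ontology rather than to the application.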

A neat trick occurs with this slight expansion of roles. The knowledge management effort can now shift to the actual description, nature and relationships of the information environment. In other words, ontologies themselves become the focus of effort and development. The KM problem no longer needs to be abstracted to the IT department or third-party software. The actual concepts, terminology and relations that comprise coherent ontologies now become the explicit focus of KM activities.

Any existing structure (or multiples thereof) can become a starting basis for these ontologies and their vocabularies, from spreadsheets to naïve data structures and lists and taxonomies. So, while producing an operating ontology that meets the best practice thresholds noted herein has certain requirements, kicking off or contributing to this process poses few technical or technology demands.

The skills needed to create these adaptive ontologies are logic, coherent thinking and domain knowledge. That is, any subject matter expert or knowledge worker likely has the necessary skills to contribute to useful ontology development and refinement. With adaptive ontologies powering ontology-driven apps (see next), we thus see a shift in roles and responsibilities away from IT to the knowledge workers themselves. This shift acts to democratize the knowledge management function and flatten the organization.

Pillar #4: Ontology-driven Applications

Ontology-driven applications are the complement to adaptive ontologies. By definition, ontology-driven apps are modular, generic software applications designed to operate in accordance with the specifications contained in an adaptive ontology. The relationships and structure of the information driving these applications are based on the standard functions and roles of ontologies, as supplemented by the human and user interface roles noted above [11,12,13].

Ontology-driven apps fulfill specific generic tasks. Examples of current ontology-driven apps include imports and exports in various formats, dataset creation and management, data record creation and management, reporting, browsing, searching, data visualization, user access rights and permissions, and similar. These applications provide their specific functionality in response to the specifications in the ontologies fed to them.

The applications are designed more similarly to widgets or API-based frameworks than to the dedicated software of the past, though the dedicated functionality (e.g., graphing, reporting, etc.) is obviously quite similar. The major change in these ontology-driven apps is to accommodate a relatively common abstraction layer that responds to the structure and conventions of the guiding ontologies. The major advantage is that single generic applications can supply shared functionality based on any properly constructed adaptive ontology.
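The idea of one generic application serving any properly constructed ontology can be sketched with a small browsing component. Here the same function renders a navigation menu for two unrelated ontologies. The data layout (a list of subclass pairs plus a label table), the ex: names and the function names are hypothetical and stand in for whatever abstraction layer a real implementation would use.

```python
def build_children(subclass_pairs):
    """Index (child, parent) subclass pairs as parent -> [children]."""
    children = {}
    for child, parent in subclass_pairs:
        children.setdefault(parent, []).append(child)
    return children

def render_menu(root, children, labels, depth=0):
    """Return indented menu lines for root and all of its descendants."""
    label = labels.get(root, root)  # fall back to the identifier
    lines = ["  " * depth + label]
    ordered = sorted(children.get(root, []), key=lambda c: labels.get(c, c))
    for child in ordered:
        lines.extend(render_menu(child, children, labels, depth + 1))
    return lines

def menu_for(ontology, root):
    """Generic entry point: the ontology alone determines the output."""
    return render_menu(root, build_children(ontology["subclasses"]), ontology["labels"])

vehicles = {
    "subclasses": [("ex:Bicycle", "ex:Vehicle"), ("ex:Automobile", "ex:Vehicle")],
    "labels": {"ex:Vehicle": "Vehicle", "ex:Bicycle": "Bicycle",
               "ex:Automobile": "Automobile"},
}
documents = {
    "subclasses": [("ex:Report", "ex:Document"), ("ex:Memo", "ex:Document")],
    "labels": {"ex:Document": "Document", "ex:Report": "Report", "ex:Memo": "Memo"},
}

vehicle_menu = menu_for(vehicles, "ex:Vehicle")
document_menu = menu_for(documents, "ex:Document")
```

Nothing in the code knows about vehicles or documents. Changing the menu is a matter of editing the knowledge structure, which is the shift in effort this pillar describes.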

This design thus limits software brittleness and maximizes software re-use. Moreover, as noted above, it shifts the locus of effort from software development and maintenance to the creation and modification of knowledge structures. The KM emphasis can shift from programming and software to logic and terminology [12].

Pillar #5: A Web-oriented Architecture

A Web-oriented architecture (WOA) is a subset of the service-oriented architectural (SOA) style, wherein discrete functions are packaged into modular and shareable elements (“services”) that are made available in a distributed and loosely coupled manner. WOA uses the representational state transfer (REST) style. REST provides principles for how resources are defined and used and addressed with simple interfaces without additional messaging layers such as SOAP or RPC. The principles are couched within the framework of a generalized architectural style and are not limited to the Web, though they are a foundation to it [14].

REST and WOA stand in contrast to earlier Web service styles that are often known by the WS-* acronym (such as WSDL, etc.). WOA has proven itself to be highly scalable and robust for decentralized users since all messages and interactions are self-contained.

Enterprises have much to learn from the Web’s success. WOA has a simple design with REST and idempotent operations, simple messaging, distributed and modular services, and simple interfaces. It has a natural synergy with linked data via the use of URI identifiers and the HTTP transport protocol. As we see with the explosion of searchable dynamic databases exposed via the Web, so too can we envision the same architecture and design providing a distributed framework for data federation. Our daily experience with browser access of the Web shows how incredibly diverse and distributed systems can meaningfully interoperate [15].
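The role of idempotent operations can be illustrated with a toy, in-memory stand-in for a REST endpoint. This is a sketch of the principle only, not a Web server: the class name ResourceStore and its methods are hypothetical, and the status codes follow common HTTP conventions (201 when a resource is created, 200 on success, 404 when nothing is found).

```python
class ResourceStore:
    """Toy stand-in for a REST endpoint: resources are addressed by URI."""

    def __init__(self):
        self._resources = {}

    def put(self, uri, representation):
        """Store the full representation at uri. Repeating the same call
        leaves the store in the same state, which is what idempotent means."""
        created = uri not in self._resources
        self._resources[uri] = dict(representation)
        return 201 if created else 200

    def get(self, uri):
        """Return (status, representation); reading never changes state."""
        if uri not in self._resources:
            return 404, None
        return 200, dict(self._resources[uri])

store = ResourceStore()
uri = "http://data.example.org/customer/42"  # hypothetical resource name
first = store.put(uri, {"name": "Acme Corp"})
second = store.put(uri, {"name": "Acme Corp"})  # a retry after a timeout, say
status, body = store.get(uri)
```

Because each message carries the whole representation and repeating it is harmless, a client in a loosely coupled, distributed setting can safely retry without coordinating with the server.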

This same architecture has worked beautifully in linking documents; it is now pointing the way to linking data; and we are seeing but the first phases of linking people and groups together via meaningful collaboration. While generally based on only the most rudimentary basis of connections, today’s social networking platforms are changing the nature of contacts and interaction.

The foundations herein provide a basis for marrying data and documents in a design geared from the ground up for collaboration. These capabilities are proven and deployable today. The only unclear aspects will be the scale and nature of the benefits [16].

Pillar #6: An Incremental, Layered Approach

To this point, you’ll note that we have been speaking in what are essentially “layers”. We began with existing assets, both internal and external, in many diverse formats. These are then converted or transformed into RDF-capable forms. These various sources are then exposed via a WOA Web services layer for distributed and loosely-coupled access. Then, we integrate and federate this information via adaptive ontologies, which then can be searched, inspected and managed via ontology-driven apps. We have presented this layered architecture before [13], and have also expressed this design in relation to current Structured Dynamics’ products [17].

A slight update of this layered view is presented below, made even more general for the purposes of this foundational discussion:

[Figure: Open Enterprise Architecture]

Semantic technology does not change or alter the fact that most activities of the enterprise are transactional, communicative or documentary in nature. Structured, relational data systems for transactions or records are proven, performant and understood. On its very face, it should be clear that the meaning of these activities — their semantics, if you will — is by nature an augmentation or added layer to how to conduct the activities themselves.

This simple truth affirms that semantic technologies are not a starting basis, then, for these activities, but a way of expressing and interoperating their outcomes. Sure, some semantic understanding and common vocabularies at the front end can help bring consistency and a common language to an enterprise’s activities. This is good practice, and the more that can be done within reason while not stifling innovation, all the better. But we all know that the budget department and function has its own way of doing things separate from sales or R&D. And that is perfectly OK and natural.

Clearly, then, an obvious benefit to the semantic enterprise is to federate across existing data silos. This should be an objective of the first semantic “layer”, and to do so in a way that leverages existing information already in hand. This approach is inherently incremental; if done right, it is also low cost and low risk.

Pillar #7: The Open World Mindset

As these pillars took shape in our thinking and arguments over the past year, an elusive piece seemed always to be missing. It was like having one of those meaningful dreams, and then waking up in the morning racking your memory trying to recall that essential, missing insight.

As I most recently wrote [1], that missing piece for this story is the open world assumption (OWA). I argue that this somewhat obscure concept holds within it the key as to why there have been decades of too-frequent failures in the enterprise in business intelligence, data warehousing, data integration and federation, and knowledge management.

Enterprises have been captive to the mindset of traditional relational data management and its (most often unstated) closed world assumption (CWA). Given the success of relational systems for transaction and operational systems — applications for which they are still clearly superior — it is understandable and not surprising that this same mindset has seemed logical for knowledge management problems as well.  But knowledge and KM are by their nature incomplete, changing and uncertain. A closed-world mindset carries with it certainty and logic implications not supportable by real circumstances.

This is not an esoteric point, but a fundamental one. How one thinks about the world and evaluates it is pivotal to what can be learned and how and with what information. Transactions require completeness and performance; insight requires drawing connections in the face of incompleteness or unknowns.
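The difference between the two mindsets shows up as soon as a question is asked about something that was never recorded. In the sketch below, the closed-world query answers False for anything not stated, while the open-world query distinguishes false from unknown. The ex: names and the functions ask_closed and ask_open are hypothetical, and the explicit set of negated statements is a simplification of how falsity would be established in a real reasoner.

```python
# What has been asserted as true.
facts = {("ex:alice", "ex:worksFor", "ex:acme")}

# What has been explicitly asserted as false (a simplification).
negated = {("ex:alice", "ex:worksFor", "ex:globex")}

def ask_closed(statement, facts):
    """Closed world: anything not known to be true is presumed false."""
    return statement in facts

def ask_open(statement, facts, negated):
    """Open world: True, False, or None when the answer is simply not known."""
    if statement in facts:
        return True
    if statement in negated:
        return False
    return None

question = ("ex:bob", "ex:worksFor", "ex:acme")  # nothing recorded about bob
closed_answer = ask_closed(question, facts)
open_answer = ask_open(question, facts, negated)
```

For a payroll transaction the closed answer is the right one. For a knowledge question, treating the missing record as a definite "no" asserts a certainty the data does not support.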

The absolute applicability of the semantic Web stack to an open-world circumstance is the elephant in the room [1]. By itself, the open world mindset provides no assurance of gaining insight or wisdom. But, absent it, we place thresholds on information and understanding that may neither be affordable nor achievable with traditional, closed-world approaches.

And, by either serendipity or some cosmic beauty, the open world mindset also enables incremental development, testing and refinement. Even if my basic argument of the open world advantage for knowledge management purposes is wrong, we can test that premise at low cost and risk. So, within available budget, pick a doable proof-of-concept, and decide for yourself.

The Foundations for the Open Semantic Enterprise

The seven pillars above are not magic bullets and each is likely not absolutely essential. But, based on today’s understandings and with still-emerging use cases being developed, we can see our open semantic enterprise as resulting from the interplay of these seven factors:

[Figure: Open Semantic Enterprise]

Thirty years of disappointing knowledge management projects and much wasted money and effort demand that better ways be found. On the other hand, until recently, too much of the semantic Web discussion has been either revolutionary (“change everything!!”) or argued from pie-in-the-sky bases. Something needs to give.

Our work over the past few years, especially as focused in the last 12 months, tells us that meaningful semantic Web initiatives can be mounted in the enterprise with potentially huge benefits, all at manageable risks and costs. These seven pillars point the way to how this might happen. What is now required is that eighth pillar: you.


[1] See, M.K. Bergman, 2009. “The Open World Assumption: Elephant in the Room”, AI3:::Adaptive Information blog, December 21, 2009.
[2] In most instances, semantic technologies are poorly suited to transactional or operational applications. Also, there are instances in modeling specific closed-world domains where ontologies can be quite useful, such as in aerospace, petrochemicals, engineering, etc., where the scope of the domain can be precisely bounded and defined. Such efforts tend to be high cost with lengthy lead times. There are vendors who support efforts in these areas, though my company, Structured Dynamics, does not. Our focus and the more generally suitable case for semantic technologies we believe is in knowledge representation and management.
[3] The standard Sweet Tools listing on my AI3:::Adaptive Information blog contains more than 800 semantic Web and -related tools, most of which are open source, which can be inspected via filtered and faceted search.
[4] See, M.K. Bergman, 2009. “Advantages and Myths of RDF”, AI3:::Adaptive Information blog, April 8, 2009.
[5] For example, see this listing of more than 150 specific format options available as open source. These converters can also work directly with major application APIs.
[6] For an expansion on RDF as a canonical data model, see further M.K. Bergman, 2009. “Structure the World”, AI3:::Adaptive Information blog, August 3, 2009.
[7] For example, for dataset authoring, Structured Dynamics has developed irON, an instance record and object notation that can be serialized as JSON (called irJSON), XML (called irXML) or comma-separated values (or CSV comma-delimited files, called commON). The purpose of these notations is to provide easier authoring environments and scripting support to RDF-ready datasets. The advantage is to shield users from the nuances of RDF. The design of commON is especially geared to using spreadsheets as authoring environments for instance record tables or simple outline structures.  See further the irON specification.
[8] For a general listing of linked data articles, please see that category on this AI3:::Adaptive Information blog. Specific articles of interest include the four-part series on “Making Linked Data Reasonable Using Description Logics” [9] (February 11, February 15, February 18 and February 23, 2009) and “The Law of Linked Data” (October 11, 2009).
[9] Our best practices approach makes explicit splits between the “ABox” (for instance data) and “TBox” (for ontology schema) in accordance with our working definition for description logics, a fundamental underpinning for how we use RDF:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[10] Those unfamiliar with the term ontology might be interested in my first introduction to the subject: M.K. Bergman, 2007. An Intrepid Guide to Ontologies, AI3:::Adaptive Information blog, May 16, 2007.
[11] See M.K. Bergman, 2009. Ontologies as the ‘Engine’ for Data-Driven Applications, AI3:::Adaptive Information blog, June 10, 2009. This is the most detailed explanation, but the specific term adaptive ontology was not yet used. The first dedicated focus on adaptive ontologies was in “Confronting Misconceptions with Adaptive Ontologies” (August 17, 2009). See also [12] and [13].
[12] See, M.K. Bergman, 2009. “Ontology-driven Applications Using Adaptive Ontologies”, AI3:::Adaptive Information blog, November 23, 2009.
[13] See, M.K. Bergman, 2009. “Fresh Perspectives on the Semantic Enterprise”, AI3:::Adaptive Information blog, September 28, 2009.
[14] See, M.K. Bergman, 2009. “A General Web-oriented Architecture (WOA) for Structured Data”, AI3:::Adaptive Information blog, May 3, 2009. Also, see the related WOA category for other articles in this area.
[15] See, M.K. Bergman, 2008. “WOA: A New Enterprise Partner for Linked Data”, AI3:::Adaptive Information blog, October 12, 2008.
[16] See, M.K. Bergman, 2009. “structWSF: A Framework for Collaboration Networks”, AI3:::Adaptive Information blog, July 2, 2009.
[17] See http://structureddynamics.com/products.html for a general descriptive illustration of Structured Dynamics’ product stack. There is also a longer slideshow, with particular reference to slide #37.

Posted by AI3's author, Mike Bergman

Posted on January 12, 2010 at 3:26 pm in Description Logics, Linked Data, Ontologies, Ontology Best Practices, Semantic Web, Structured Dynamics, Web-oriented Architecture | Comments (11)
The URI link reference to this post is: http://www.mkbergman.com/859/seven-pillars-of-the-open-semantic-enterprise/
Date:   December 21, 2009

Open World

OWA Enables Incremental, Low-risk Wins for the Semantic Enterprise

In speaking of the semantic Web, it is not infrequent that the open world assumption (OWA) gets mentioned. What this post argues is that this somewhat obscure concept may hold within it the key as to why there have been decades of too-frequent failures in the enterprise in business intelligence, data warehousing, data integration and federation, and knowledge management.

This is a fairly bold assertion. In order to support it, we first need to look to the logic and mindset assumptions associated with traditional relational data management and the semantic Web. We then need to look to the nature of knowledge itself and its relation to data federation. It is in this intersection that the key to decades of faulty premises may reside.

The main argument is that the closed world assumption (CWA) and its prevalent mindset in traditional database systems have hindered the ability of enterprises and the vendors that support them to adopt incremental, low-risk means to knowledge systems and management. CWA, in turn, has led to over-engineered schema, too-complicated architectures and massive specification efforts that have led to high deployment costs, blown schedules and brittleness.

The good news is that abandoning these failed practices and embracing the open world approach can be done immediately based on existing assets. Simply shifting from the closed world to open world premise can, I argue, improve the odds for enterprise IT success in these areas.

It is time to meet the elephant in the room.

Scope and Some Root Causes of Enterprise IT Failures

It is, of course, a bit of editorial hyperbole to label most enterprise initiatives in business intelligence and knowledge management as being failures over the past few decades. And, insofar as failures have occurred, I also do not believe they are the result of vendor greed or cynicism, or IT management mistakes or incompetence. Rather, I believe the fault resides in the attempt to pound a square peg (relational model) into a round hole (knowledge representation).

The scope of these failures is not known. We have seen anecdotal claims of trillions of dollars in annual losses due to IT project failures worldwide; failure rates for major IT projects in the 65% to 80% range; and analyses of waste and failures in individual firms that are fairly eye-popping [1]. The real point of this post is not to try to quantify these problems. However, in my many years within IT it has been a common perception and concern that many, if not most, large-scale information technology deployments have disappointed in one way or another.

These disappointments range from cost overruns, to late delivery, to unmet objectives, or to low user acceptance. Many initiatives are simply cancelled before any such metrics can be documented. Whatever the absolute quantification, I think most experienced IT managers and executives would agree that these failures and disappointments have been all too commonplace.

“Business Intelligence projects are famous for low success rates, high costs and time overruns. The economics of BI are visibly broken, and have been for years. Yet BI remains the #1 technology priority according to Gartner.”[2]

Why might this be?

I truly believe the reasons for these disappointments do not reside in bad faith or incompetence. The potential importance of IT knowledge projects to improve competitive position, lower costs, or aid innovation for new markets is understood by all. Dilbert aside, I find it simply incomprehensible that disappointments or failures are rooted in these causes.

Rather, I suspect the root cause resides in the success of the relational model in the enterprise.

For transaction systems and for modeling narrowly bounded and structured domains (such as products, inventory or customer lists), the relational model, with its proven and optimized RDBMSs and SQL query language, has been a resounding success. It is natural to take a successful approach and try to extend it to other areas.

However, beginning with data warehouses in the 1980s, business intelligence (BI) systems in the 1990s, and the general issue of most enterprise information being bound up in documents for decades, the application of the relational model to these areas has been disappointing.

The reasons for this do not reside in areas such as storage or hardware; these areas have seen remarkable improvements over the decades. Rather, the problem resides in the nature of the relational model itself, and its lack of suitability to knowledge-based problems.

Technical Aspects of OWA, Broadly Defined

I have noted the importance of the open world assumption to the semantic enterprise in many of my more recent posts [3,4]. But I, like many others, often refer to the open world assumption with facile summaries, such as that a lack of information does not imply the missing information to be false. Yet to fully understand the implications of OWA and many of its associated assumptions, it is necessary to delve deeper.

I am using here a shorthand that poses the closed world assumption (CWA) vs. the open world assumption (OWA). Actually, the data models behind these approaches (Datalog or non-monotonic logic in the case of CWA; monotonic in the case of OWA [5]; OWA is also firmly grounded in description logics [4]) tend to be coupled with a few other assumptions. I use the shorthand of relational approach vs. (open) semantic Web approach to contrast these two models.

There are instances where the relational model can embrace the open world assumption (for example, the null in SQL) and there are instances where semantic Web approaches can be closed world (as with frame logic or Prolog or other special considerations; see conclusion). But, as generally applied and as generally understood, this contrast between typical relational practice and the semantic Web (based on RDF and OWL) tends to hold.

From a theoretical standpoint, I have found the treatment of Patel-Schneider and Horrocks [6] to be most useful in comparing these approaches. However, the Description Logics Handbook and some other varied sources are also helpful [7,5]. Many of the technical aspects summarized in the table below are from these sources; I refer you to them for more informed technical discussions:

Relational Approach vs. (Open) Semantic Web Approach

Closed World Assumption (CWA)

That which is not known to be true is presumed to be false; it needs to be explicitly stated as true. Negation as failure (NAF) is a related assumption, since it assumes as false every predicate that cannot be proven to be true. Under CWA, any statement not known to be true is false.

Everything is prohibited until it is permitted.

Open World Assumption (OWA)

The lack of a given assertion or fact being available does not imply whether that possible assertion is true or false: it simply is not known. In other words, lack of knowledge does not imply falsity.

Everything is permitted until it is prohibited.

Unique Name Assumption (UNA)

The unique name assumption (UNA) is premised that different names always refer to different entities in the world.

Duplicate Labels Allowed

OWL allows different synonym labels to be used for the same object; same names may refer to different objects. Identity assertions must be explicitly stated.

Complete Information

The data system at hand is assumed to be complete. (Missing information is often handled via the null statement in SQL, but that has been controversial and contentious in its own right.) This is also known as the domain-closure assumption.

Incomplete Information

A central tenet of OWA is that information is incomplete. A corollary is that the attributes of specific objects or instances may also be incomplete or partially known.

Single Schema (one world)

A single schema is necessary to define the scope and interpretation of the world (domain at hand).

Many World Interpretations

Schema and data instance assertions are kept separate. Multiple interpretations (worlds) for the same data are possible.

Integrity Constraints

Integrity constraints prevent “incorrect” values from being asserted in the relational model. It is useful for validation/parsing/data input and is related to the single model that contains only the facts asserted. Strict cardinality is used for checking validation.

Logical Axioms (restrictions)

Logical axioms provide restrictions through property domains and ranges. Everything can be true unless proven otherwise, and multiple possible models can satisfy the axioms. This provides more powerful inferencing, though can also be unintuitive at times. Cardinality and range restrictions exhibit different behavior for objects (inferred) or datatypes.

Non-monotonic Logic

The set of conclusions warranted on the basis of a given knowledge base does not increase (in fact, it likely shrinks) with the size of the knowledge base [5].

Monotonic Logic

The hypotheses of any derived fact may be freely extended with additional assumptions. Additional assertions narrow the set of possible interpretations (models) but cannot invalidate entailments already drawn. A new piece of knowledge cannot reduce what is known [5]. New knowledge can arise through inference.

Fixed and Brittle

Changing the schema requires re-architecting the database; not inherently extensible.

Reusable and Extensible

Designed from the ground up to reuse existing ontologies (axioms) and to be extensible. Database design and management can be more agile, with schema evolving incrementally.

Flat Structure; Strong Typing

Information organized into flat tables; linkages and connections between tables based on foreign keys or joins. Strong data typing orientation.

Graph Structure; Open Typing

Inherent graph structure, supportive of linkage and connectivity analysis. Datatypes are inherently loose, though axioms can add strong types. Datatypes are treated in the same way as classes, and datatype values are treated in the same way as individual identifiers (i.e., a data value is treated as referring to an object).

Querying and Tooling

SQL and query optimizers well developed. Tooling well developed. Disjunction not supported; negation must be accommodated through approaches such as NAF. Sums and counts are easier due to the unique name premise. Answer closure (one answer passable to a next calculation) is easier than under OWA. Most tools are not suitable for any arbitrary schema.

Querying and Tooling

SPARQL and emerging rule languages used for querying; performance at scale and with broad distribution a concern. Queries require contextual information for proper set selection. Negation and disjunction are allowed and are powerful constructs. Tools generally less developed. Exciting opportunities for ontology-driven applications working against a small set of generic tools.
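The first row of the table can be made concrete with a small sketch. What follows is only an illustration under stated assumptions: the facts, the entity names (acme, globex) and the functions `ask_cwa` and `ask_owa` are hypothetical and do not come from any particular database or reasoner. It shows how the same missing statement is answered "false" under CWA and "unknown" under OWA.

```python
from enum import Enum


class Answer(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


def ask_cwa(facts, statement):
    """Closed world: anything not asserted is presumed false."""
    return Answer.TRUE if statement in facts else Answer.FALSE


def ask_owa(facts, negated_facts, statement):
    """Open world: a statement is false only if its negation is asserted."""
    if statement in facts:
        return Answer.TRUE
    if statement in negated_facts:
        return Answer.FALSE
    return Answer.UNKNOWN


# Hypothetical example data: one positive and one explicitly negated fact.
facts = {("acme", "locatedIn", "Boston")}
negated_facts = {("acme", "locatedIn", "Paris")}

# Nothing at all has been asserted about globex.
query = ("globex", "locatedIn", "Boston")

cwa_result = ask_cwa(facts, query)
owa_result = ask_owa(facts, negated_facts, query)
print(cwa_result.value, owa_result.value)
```

Real OWL reasoners work over models rather than simple set lookups, so this is only the intuition, not how an actual reasoner is implemented.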

In well-characterized or self-contained domains (seats on a plane, books in a library, customers of a company, products sold via distribution channels), the traditional relational model works well. A closed-world assumption is performant for transaction operations with easier data validation. The number of negative facts about a given domain is typically much greater than the number of the positive ones. So, in many bounded applications, the number of negative facts is so large that their explicit representation can become practically impossible [7]. In such cases, it is simpler and shorter to state known “true” statements than to enumerate all “false” conditions.

However, the relational model is a paradigm where the information must be complete and it must be described by a single schema. Traditional databases require an agreement on a schema, which must be made before data can be stored and queried. The relational model assumes that the only objects and relationships that exist in the domain are those that are explicitly represented in the database, and that names uniquely identify objects in this domain. The result of these assumptions is that there is a single (canonical) model for relational systems where objects and relationships are in a one-to-one correspondence with the data in the database [6].

This makes CWA and its related assumptions a very poor choice when attempting to combine information from multiple sources, to deal with uncertainty or incompleteness in the world, or to try to integrate internal, proprietary information with external data.

The process of describing an open, semantic Web “world” can proceed incrementally, sequentially asserting new statements or conditions. The schema in the open semantic Web — the ontology — consists of sets of statements (called axioms) that describe characteristics that must be satisfied by the ontology designer’s idea of “reasonable” states of the world. Formally, such statements correspond to logical sentences, and an ontology corresponds to a logical theory [6].
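This incremental, monotonic character can be sketched in a few lines. The example below is a toy under stated assumptions: it forward-chains just two RDFS-style subclass rules over plain tuples, and the class and instance names are hypothetical. The point it illustrates is that adding a new axiom only adds entailments; nothing previously inferred is withdrawn.

```python
def entail(statements):
    """Forward-chain two subclass rules until no new triples appear.

    Rule 1: (x type C) and (C subClassOf D)       => (x type D)
    Rule 2: (C subClassOf D) and (D subClassOf E) => (C subClassOf E)
    """
    inferred = set(statements)
    changed = True
    while changed:
        changed = False
        new = set()
        for (s, p, o) in inferred:
            for (s2, p2, o2) in inferred:
                if p2 == "subClassOf" and s2 == o:
                    if p == "type":
                        new.add((s, "type", o2))
                    elif p == "subClassOf":
                        new.add((s, "subClassOf", o2))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred


# Hypothetical starting "world".
base = {
    ("Supplier", "subClassOf", "Organization"),
    ("acme", "type", "Supplier"),
}
# Later, one more axiom is asserted incrementally.
extended = base | {("Organization", "subClassOf", "Agent")}

before = entail(base)
after = entail(extended)

# Monotonic: everything entailed before is still entailed afterwards.
print(before <= after)
print(sorted(after - before))
```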

Irregularity and incompleteness are toxic to relational model design. In the open semantic Web, data that is structured differently can still be stored together via RDF triple statements (subject-predicate-object). For example, OWA allows suppliers without cities and names to be stored alongside suppliers with that information. Information can be combined about similar objects or individuals even though they have different or non-overlapping attributes. Duplicate checking now occurs based on the logic of the system and not unique name evaluations. Data validation in OWA systems can become both more complicated (via testing against restriction statements) and partially easier (via inference).
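As a rough sketch of the supplier example, the code below keeps irregular records together as plain (subject, predicate, object) tuples and merges duplicates only where an identity assertion says so. The identifiers, property names and the `sameAs` predicate are assumptions made for illustration (loosely modeled on owl:sameAs); no RDF library is used.

```python
from collections import defaultdict

# Hypothetical supplier data; note that supplier:2 has no name or city.
triples = {
    ("supplier:1", "name", "Acme Corp"),
    ("supplier:1", "city", "Boston"),
    ("supplier:2", "phone", "555-0100"),
    ("supplier:3", "name", "Acme Corporation"),
    ("supplier:3", "sameAs", "supplier:1"),  # explicit identity assertion
}


def canonical_ids(triples):
    """Map each subject to one representative by following sameAs links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in triples:
        find(s)
        if p == "sameAs":
            root_s, root_o = find(s), find(o)
            if root_s != root_o:
                parent[max(root_s, root_o)] = min(root_s, root_o)
    return {x: find(x) for x in parent}


def describe(triples):
    """Group attribute values by canonical individual."""
    ids = canonical_ids(triples)
    merged = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        if p != "sameAs":
            merged[ids[s]][p].add(o)
    return merged


suppliers = describe(triples)
for supplier_id in sorted(suppliers):
    print(supplier_id, {p: sorted(v) for p, v in suppliers[supplier_id].items()})
```

Without the `sameAs` statement the two "Acme" records would stay separate individuals, which is the practical meaning of not making the unique name assumption in either direction: different names neither prove nor rule out identity until it is asserted or inferred.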

It is interesting to note that the theoretical underpinnings of CWA by Reiter [8] began to be understood about the same time (1978) that data federation and knowledge representation (KR) activities also began to come to the fore. CWA and later work on (for example) default reasoning [5] appear to have informed early work in description logics and its alternative OWA approach. This heavily influenced the development of the semantic Web languages RDF and OWL. However, the early path toward KM work based on the relational model also appears to have been set in this timeframe.

We are still reaping the whirlwind from this unfortunate early choice of the relational model for KR, KM and BI purposes. Moreover, though there is quite a bit of theoretical and logical discussion of the alternative OWA and CWA data models, there are surprisingly few discussions of what the implications of these models are to the enterprise. (That is, the elephant in the room.) The next two sections tackle this gap.

The Knowledge Management Argument for OWA

The above should make clear that the relational model and CWA are appropriate for defined and bounded systems. However, many of the new knowledge economy challenges are anything but defined and bounded. These applications all reside in the broad category of knowledge management (KM), and include such applications as data federation, data warehousing, enterprise information integration, business intelligence, competitive intelligence, knowledge representation, and so forth.

Let’s look at the characteristics of such knowledge systems and why they are more appropriately modeled through the open world assumption (OWA) rather than the relational model and CWA:

  • Knowledge is never complete — gaining and using knowledge is a process, and is never complete. A completeness assumption around knowledge is by definition inappropriate
  • Knowledge is found in structured, semi-structured and unstructured forms — structured databases represent only a portion of structured information in the enterprise (spreadsheets and other non-relational datastores provide the remainder). Further, general estimates are that 80% of information available to enterprises resides in documents, with a growing importance of metadata, Web pages, markup documents and other semi-structured sources. A proper data model for knowledge representation should be equally applicable to these various information forms; the open semantic language of RDF is specifically designed for this purpose
  • Knowledge can be found anywhere — the open world assumption does not imply open information only. However, it is just as true that relevant information about customers, products, competitors, the environment or virtually any knowledge-based topic cannot be gained via internal information alone. The emergence of the Internet and the universal availability of, and access to, mountains of public and shared information demand its thoughtful incorporation into KM systems. This requirement, in turn, demands OWA data models
  • Knowledge structure evolves with the incorporation of more information — our ability to describe and understand the world or our problems at hand requires inspection, description and definition. Birdwatchers, botanists and experts in all domains know well how inspection and study of specific domains leads to more discerning understanding and “seeing” of that domain. Before learning, everything is just a shade of green or a herb, shrub or tree to the incipient botanist; eventually, she learns how to discern entire families and individual plant species, all accompanied by a rich domain language. This truth of how increased knowledge leads to more structure and more vocabulary needs to be explicitly reflected in our KM systems
  • Knowledge is contextual — the importance or meaning of given information changes by perspective and context. Further, exactly the same information may be used differently or given different importance depending on circumstance. Still further, what is important to describe (the “attributes”) about certain information also varies by context and perspective. Large knowledge management initiatives that attempt to use the relational model and single perspectives or schema to capture this information are doomed in one of two ways:  either they fail to capture the relevant perspectives of some users; or they take forever and massive dollars and effort to embrace all relevant stakeholders’ contexts
  • Knowledge should be coherent — coherence is the state of having internal logical consistency. A library of books organized by the Dewey Decimal Classification v. the Library of Congress Classification v. the Colon classification system (or others) is not inherently correct or wrong, but it is important that whatever system is used be applied consistently. Because of the power of OWA logics in inferencing and entailments, whatever “world” is chosen for a given knowledge representation should be coherent. Fantasies such as Avatar and the Lord of the Rings trilogy, even though not real, can be made believable and compelling by virtue of their coherence
  • Knowledge is about connections — the epistemological nature of knowledge can be argued endlessly, but I submit much of what distinguishes knowledge from information is that knowledge makes the connections between disparate pieces of relevant information. As these relationships accrete, the knowledge base grows. Again, RDF and the open world approach are essentially connective in nature. New connections and relationships tend to break brittle relational models, and
  • Knowledge is about its users defining its structure and use — since knowledge is a state of understanding by practitioners and experts in a given domain, it is also important that those very same users be active in its gathering, organization (structure) and use. Data models that allow more direct involvement and authoring and modification by users — as is inherently the case with RDF and OWA approaches — bring the knowledge process closer to hand. Besides this ability to manipulate the model directly, there are also the immediacy advantages of incremental changes, tests and tweaks of the OWA model. The schema consensus and delays from single-world views inherent to CWA remove this immediacy, and often result in delays of months or years before knowledge structures can actually be used and tested [9].

To be sure, there are many circumstances where large stores of instance data and their analysis are necessary for knowledge purposes. In these cases, hybrid CWA-OWA systems (see conclusion) may make sense.

But, as these points emphasize, the general assembly and organization of knowledge is open world in nature. Trying to fit KM and related applications into the straitjacket of the relational model is folly. The relational model and CWA for KM is the elephant in the room. Three decades of failures and disappointments affirm this fact.

The Business Argument for OWA

Besides the native match of knowledge systems with OWA, there are sound business arguments for embracing the (open) semantic enterprise as well. These arguments can be summarized as lower risk, lower cost, faster deployment, and more agile responsiveness. What is there not to love?

It should now be clear that it is possible to start small in testing the transition to a semantic enterprise. These efforts can be done incrementally and with a focus on early, high-value applications and domains.

Open world does not necessarily mean open data and it does not mean open source. Open world is simply a way to think about the information we have and how we act on it. OWA technologies are neutral to the question of open or public sources. The techniques can equivalently be applied to internal, closed, proprietary data and structures. Moreover, the technologies can themselves be used as a basis for bringing external information into the enterprise. An open world assumption merely asserts that we never have all necessary information and lacking that information does not itself lead to any conclusions.

Further, we need not abandon past practices. There is much that can be done to leverage existing assets. Indeed, those prior investments are often the requisite starting basis to inform semantic initiatives. However, in leveraging those assets, it is important that the enterprise begin to embrace and understand the open world assumption.

We also see that RDF and OWL, while important behind the scenes as a canonical data model and languages for organizing this information, need not be exposed as such to most users. Most instance data can be expressed as is with the data languages of choice such as XML, JSON or whatever. We are merely using the techniques of the (open) semantic Web as the data model to organize our information assets at hand. These assets need not themselves be represented in the native RDF or OWL languages.
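To illustrate the idea that instance data can stay in its native format while RDF serves as the organizing data model, here is a minimal sketch that reads JSON records and views them as triples. The function name `json_to_triples`, the `http://example.org/` base URI and the sample records are all assumptions for illustration; a real deployment would map keys to properties defined in an ontology rather than minting them from field names.

```python
import json


def json_to_triples(record, id_key="id", base="http://example.org/"):
    """View one flat JSON object as (subject, predicate, object) triples.

    Missing or null attributes are simply skipped: under an open world
    reading, their absence asserts nothing.
    """
    subject = base + str(record[id_key])
    triples = []
    for key, value in record.items():
        if key == id_key or value is None:
            continue
        values = value if isinstance(value, list) else [value]
        for v in values:
            triples.append((subject, base + key, v))
    return triples


# Hypothetical instance data, kept in its original JSON form.
raw = """
[
  {"id": "cust-1", "name": "Acme", "tags": ["retail", "east"]},
  {"id": "cust-2", "city": "Denver", "name": null}
]
"""

records = json.loads(raw)
all_triples = [t for r in records for t in json_to_triples(r)]
for triple in all_triples:
    print(triple)
```

The two records have different attributes, yet both land in the same uniform structure without any prior schema agreement.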

Thus, open world frameworks provide some incredibly important benefits for knowledge management applications in the enterprise:

  • Domains can be analyzed and inspected incrementally
  • Schema can be incomplete and developed and refined incrementally
  • The data and the structures within these open world frameworks can be used and expressed in a piecemeal or incomplete manner
  • We can readily combine data with partial characterizations with other data having complete characterizations
  • Systems built with open world frameworks are flexible and robust; as new information or structure is gained, it can be incorporated without negating the information already resident, and
  • Open world systems can readily bridge or embrace closed world subsystems.

One might argue, as we believe, that the biggest impediment to the semantic enterprise is the mind shift necessary to start thinking about and accepting the open world premise. Again, this perspective is not applicable to all problems and domains. But, where it is, much can be left in place and leveraged with semantic technologies, so long as the enterprise begins to look at these existing assets through a different open-world lens.

In most real world circumstances, there is much we don’t know and we interact in complex and external environments. Knowledge management inherently occupies this space. Ultimately, data interoperability implies a global context. Open world is the proper logic premise for these circumstances. Via the OWA framework, we can readily change and grow our conceptual understanding and coverage of the world, including incorporation of external ontologies and data. Since this can easily co-exist with underlying closed-world data, the semantic enterprise can readily bridge both worlds.

So, we can now define the open semantic enterprise as one that embraces OWA for its knowledge management applications and engages in rapid and low-risk testing of incremental learning. The open world assumption is the proper framework to reverse decades of failure and disappointment for knowledge projects in the enterprise.

Some Open Questions about OWA

In our own discussions about ABox – TBox splits [10], we have, in essence, supported a hybrid OWA-CWA argument for the enterprise. It is beyond the scope of this current piece to describe these approaches in detail, but some of the options include local CWA, the addition of rule languages and constraints to basic OWA, use of the new OWL 2, TopQuadrant’s SPIN notation, and others [11]. I will address some of these in a later post.
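One of these options, local CWA, can be hinted at with a small sketch. This is an assumption-laden toy, not the formalism of the cited papers: the `ask` function, the predicate names and the choice of which predicate is "closed" are hypothetical. Predicates declared complete are answered closed-world; everything else stays open.

```python
def ask(facts, closed_predicates, statement):
    """Answer 'true', 'false' or 'unknown' under a local closed world.

    A missing statement is 'false' only when its predicate has been
    declared complete; otherwise the open world answer 'unknown' applies.
    """
    if statement in facts:
        return "true"
    _, predicate, _ = statement
    if predicate in closed_predicates:
        return "false"
    return "unknown"


# Hypothetical knowledge base mixing internal and external information.
facts = {
    ("alice", "employedBy", "acme"),
    ("acme", "competesWith", "globex"),
}
# Assumption: the HR system is authoritative, so employment data is complete.
closed_predicates = {"employedBy"}

print(ask(facts, closed_predicates, ("bob", "employedBy", "acme")))
print(ask(facts, closed_predicates, ("acme", "competesWith", "initech")))
```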

There are also questions about performance and scalability with open semantic technologies. Here, too, progress is rapid, with billion triple thresholds rapidly falling with daily reports of better performance [12]. Fortunately, the incremental approach that we advocate herein dovetails well with these rapid developments. There should be no arguing the benefits of a successful incremental project in a smaller domain, perhaps repeated across multiple domains, in comparison to large, costly initiatives that never produce (even though their underlying technologies are performant).

There are also architecture issues inherent in these OWA designs. In one of our next posts, we return to the topic of Web-oriented architecture and its role in support of these OWA knowledge management initiatives.

In the end, there is no substitute for doing and learning. KM based on OWA for the open semantic enterprise can be started today, in a focused manner with tangible benefits and outcomes, at low cost and risk. Let’s push the elephant out of the room and let the learning and doing begin.


[1] For example, see Roger Sessions, 2009. Cost of IT Failure, September 28, 2009. This analysis suggests failure rates of 65% with a total estimated worldwide cost of $6.2 trillion in 2009. Commenters have raised questions as to what constitutes failure and have questioned some of the analysis assumptions. Nonetheless, even with over-estimates, the scale of the numbers is alarming; see Jorge Dominguez, 2009. The CHAOS Report 2009 on IT Project Failure, June 16, 2009, which indicates combined failure and challenge rates for IT projects have ranged from 65% to 84% over the period 1994 to 2009; see Dan Galorath, 2008. Software Project Failure Costs Billions; Better Estimation & Planning Can Help, June 7, 2008. In this report, Galorath compares and combines many of the available IT failure studies and summarizes that 3 of 5 IT projects do not do what they were supposed to for the expected costs, with 49% showing budget overruns, 47% showing higher than expected maintenance costs, and 41% failing to deliver expected business value; the anecdotal failure rate for years for IT projects has been claimed as 80%, with business intelligence and data warehousing particularly failure-prone areas; in 2001, a study by Mark N. Frolick and Keith Lindsey, Critical Factors for Data Warehouse Failures, for the Data Warehousing Institute noted conventional wisdom says the failure rate of data warehousing projects is 70 to 80 percent, with a then-recent study in the insurance industry finding a 90-percent failure rate. This report is useful for combining many historical studies.
[2] According to this article, by Antone Gonsalves, Poor Use Of Data Integration Tools Can Waste $500,000 Annually: Gartner (April 27, 2009), which reports on a recent Gartner Report, large global 2000 companies, using several data integration tools with overlapping features, can reduce costs by more than $500,000 annually by eliminating redundant software and leveraging a shared services model. In a further report by Roman Stanek, Business Intelligence Projects are Famous for Low Success Rates, High Costs and Time Overruns (April 25, 2009), Gartner is talking about a dirty little secret in the world of data integration, the fact that the data integration technology in place is based on generations of data integration technology being layered in the enterprise over the years. Thus, technology that was purchased to solve data integration problems, and reduce costs, is actually making the data integration problem more complex and no longer cost efficient.
[3] Here are some of my earlier postings dealing in some degree with OWA: Ontology-driven Applications Using Adaptive Ontologies, November 23, 2009; Fresh Perspectives on the Semantic Enterprise, September 28, 2009; Confronting Misconceptions with Adaptive Ontologies, August 17, 2009; Advantages and Myths of RDF, April 8, 2009; Making Linked Data Reasonable using Description Logics, Part 2, February 15, 2009, which specifically relates OWA to the ABox and TBox [4]; and, The Role of UMBEL: Stuck in the Middle with You . . ., May 11, 2008.
[4] We use the reference to “ABox” and “TBox” in accordance with our working definition for description logics:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[5] A model theory is a formal semantic theory which relates expressions to interpretations. A “model” refers to a given logical “interpretation” or “world”. (See, for example, the discussion of interpretation in Patrick Hayes, ed., 2004. RDF Semantics – W3C Recommendation, 10 February 2004.) The logic or inference system of classical model theory is monotonic. That is, it has the behavior that if S entails E then (S + T) entails E. In other words, adding information to some prior conditions or assertions cannot invalidate a valid entailment. The basic intuition of model-theoretic semantics is that asserting a statement makes a claim about the world: it is another way of saying that the world is, in fact, so arranged as to be an interpretation which makes the statement true. An assertion amounts to stating a constraint on the possible ways the world might be. In comparison, a non-monotonic logic system may include default reasoning, where one assumes a ‘normal’ general truth unless it is contradicted by more particular information (birds normally fly, but penguins don’t fly); negation-by-failure, commonly assumed in logic programming systems, where one concludes, from a failure to prove a proposition, that the proposition is false; and implicit closed-world assumptions, often assumed in database applications, where one concludes from a lack of information about an entity in some corpus that the information is false (e.g., that if someone is not listed in an employee database, that he or she is not an employee.) See further, Non-monotonic Logic from the Stanford Encyclopedia of Philosophy.
[6] Peter F. Patel-Schneider and Ian Horrocks, 2006. Position Paper: A Comparison of Two Modelling Paradigms in the Semantic Web, in WWW2006, May 22–26, 2006, Edinburgh, UK. See http://www.comlab.ox.ac.uk/people/ian.horrocks/Publications/download/2006/PaHo06a.pdf.
[7] Other resources include: Franz Baader, Diego Calvanese, Deborah McGuinness, Daniele Nardi, and Peter Patel-Schneider, eds., 2003. The Description Logic Handbook: Theory, Implementation and Applications, Cambridge University Press, 2003. Online access to much of the book is available at http://www.inf.unibz.it/~franconi/dl/course/; Chapters 1, 2, 4 and 16 especially relate to this topic; Jos de Bruijn, Axel Polleres, Ruben Lara and Dieter Fensel, 2005. OWL DL vs. OWL Flight: Conceptual Modeling and Reasoning for the Semantic Web, in Proceedings of the Ninth World Wide Web Conference, Japan, May 2005. This paper argues against the use of description logics for the semantic Web; Andrew Newman, 2007. A Relational View of the Semantic Web, March 14, 2007; Hai Wang, 2006. Frames and OWL Side by Side, presented at the 9th International Protégé Conference, July 23-26, 2006, Stanford, CA; Nick Drummond and Rob Shearer, 2006. The Open World Assumption, Powerpoint presentation at The Chris Date Seminar: The Closed World of Databases Meets the Open World of the Semantic Web, e-Science Institute, Edinburgh, Scotland, 12 October 2006; Yulia Levin, 2008. Closed World Reasoning, presentation at Non-classical Logics and Applications Seminar – Winter 2008, Tel Aviv University; and Pat Hayes, 2001. “Why must the web be monotonic?”, email thread at http://lists.w3.org/Archives/Public/www-rdf-logic/2001Jul/0067.html.
[8] Raymond Reiter, 1978. “On Closed World Data Bases”, in Logic and Data Bases, H. Gallaire and J. Minker, eds., New York: Plenum Press, 55-76; see also, Raymond Reiter, 1980. “A Logic for Default Reasoning,” Artificial Intelligence, 13:81-132.
[9] See this Google search on ontology-driven applications.
[10] See this Google search on ABox-TBox articles.
[11] See, as examples: J. Heflin and H. Munoz-Avila, 2002. LCW-Based Agent Planning for the Semantic Web, in AAAI ‘02 Workshop on Ontologies and the Semantic Web, AAAI Press, pp. 63–70. See http://www.cse.lehigh.edu/~heflin/pubs/lcw-aaai02.pdf (one of the first local CWA suggestions in specific regard to the semantic Web); K. Golden, O. Etzioni and D. Weld, 1994. Omnipotence Without Omniscience: Efficient Sensor Management for Planning, in Proceedings of AAAI-94 (one of the first to propose LCWA in general); Evren Sirin, Michael Smith and Evan Wallace, 2008. Integrity constraints: Opening, Closing Worlds — On Integrity Constraints, presented at OWL: Experiences and Directions (OWLED 2008), Fifth International Workshop, Karlsruhe, Germany, October 26-27, 2008; Timothy L. Hinrichs, Jui-Yi Kao and Michael R. Genesereth, 2009. Inconsistency-tolerant Reasoning with Classical Logic and Large Databases, in Proceedings of the Eighth Symposium on Abstraction, Reformulation, and Approximation (SARA2009), July 2009; S. Gómez, C. Chesñevar and G. Simari, 2008. An Argumentative Approach to Reasoning with Inconsistent Ontologies, in Proceedings of the KR Workshop on Knowledge Representation and Ontologies (KROW 2008), Conferences in Research and Practice in Information Technology, Vol. 90, pp. 11-20. Eds. T. Meyer, M. Orgun. Australian Computer Society, Sydney, Australia, July 2008; and Holger Knublauch, The Object-Oriented Semantic Web with SPIN, Sunday, January 18, 2009, which discusses the SPIN (SPARQL Inferencing Notation) Modeling Vocabulary, a light-weight collection of RDF properties and classes to support the use of SPARQL to specify rules and logical constraints.
[12] For example, the BigOWLIM can perform reasoning against 12 billion explicit statements and loads about 12,000 statements per second on a standard server; see http://www.ontotext.com/owlim/benchmarking/lubm.html; also, see Orri Erling’s blog regarding performance of the Virtuoso RDF triple store (http://www.openlinksw.com/weblog/oerling/). In any case, these performance benchmarks continue to rise steadily and indicate the performance of RDF as an ontology integration layer.

Posted by AI3's author, Mike Bergman

Posted on December 21, 2009 at 11:20 pm in Description Logics, Ontologies, Semantic Web | Comments (8)
The URI link reference to this post is: http://www.mkbergman.com/852/the-open-world-assumption-elephant-in-the-room/
Date:   November 23, 2009

Open World - from worldatlas.com

A Low-risk Path to the Open World, Semantic Enterprise

OK, you’ve been reading the literature and perhaps have attended a conference or two. You have heard a lot about semantic technologies, but have some real questions and concerns:

  • How do we get started, especially with smaller proofs-of-concept?
  • Do we need to abandon our past practices and systems in order to gain semantic advantages?
  • To gain the advantages of interoperability, do we have to convert everything into RDF or OWL?
  • Are semantic technologies limited to open or public data; how do we accommodate our proprietary information?

Such questions — and more — are not infrequent when organizations first contemplate making the transition to become a semantic enterprise.

Overview

The diagram below shows a general workflow for migrating existing instance data into the semantic enterprise. The diagram is broken down into three parts. The first part is to characterize and stage existing data and information into the underlying structured data framework. This is what SD (that is, my firm, Structured Dynamics) does as data architects using our particular approach to adaptive ontologies. I’ll touch on this again in a moment.

Jumping to the right-hand side of the diagram is the access and display part. It is here that developers or users can make selections from dropdown lists and so forth to define the “slices” of diced results sets they wish to display. The results of those interactions are structured data results sets that are pre-staged to “drive” various applications and displays [1,2]. These same capabilities can also be embedded into standard Web end user applications, such as content management systems.

The third and middle part of the diagram is the critical part, the pivot point. It is the interface layer between the structured data on the left and the display and presentation of that data on the right. As provided by SD, this abstraction layer is the structWSF Web services framework. It “bridges” between the black box of RDF and semantic Web structured data characterizations on the left and the useful services and functions on the right that it feeds, or “drives”.

We call this general design and architecture “ontology-driven applications”. The bulk of this posting explains each of these three parts in a bit more detail, organized from left-to-right by these Parts 1 to 3.

Adaptive Ontology Workflow

Part 1: Structured Data Instances and Ontologies

Our approach relies on what we call “adaptive ontologies”. These ontologies set the structural basis for all subsequent data display, analysis, inferencing, entailments, and the like. We call them “adaptive” because we embrace a set of unique best practices. These practices enable the ontologies to do the double-duty of first structuring data and then driving generic applications by properly informing user interfaces, dropdown lists, menus and the like.

This structuring results in faceting the key dimensions and attributes of available content. Structured data gets organized. Unstructured data (text) gets tagged via this structure and integrated with it.

As Structured Dynamics’ general product schema makes clear (see the diagram at [3]), our approach leverages existing assets as much as possible. Often, this means leaving most existing data structures in place. These existing assets are staged and converted in two complementary manners that largely correspond to the conceptual ABox (instance) and TBox (concepts and schema) split central to description logics and pivotal to SD’s methodology [4].

Whether transitioning small chunks or big chunks, this staging of existing data in Part 1 results in an RDF-accessible characterization of the starting content. Instances and their attributes are represented via a common notation, generally based on irON (instance record and Object Notation) [5], which is an extensible notation and vocabulary for capturing the data characterizations, attributes and metadata of the candidate instance data (“records” in RDBMS parlance). These instances may either be internal or proprietary records, or instance data on the Web or in the public domain. By properly matching same or similar instances to one another, any source of instance characterization can be meaningfully combined.

This instance notation is extremely lightweight, and really is merely an RDF representation of data characterizations. In the characterizations to this point there is not yet any “world view” involved:  we are simply describing instances and their attributes in a manner akin to key-value pairs. The process to this point is entirely descriptive.
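To make the “key-value pairs” point concrete, here is a minimal sketch in Python. It is not irON itself: the record, the `example.org` namespace and the `attr/` path are assumptions made up for illustration. It only shows that a descriptive record maps mechanically onto subject-predicate-object statements.

```python
# Hypothetical instance record, written as plain key-value pairs.
record = {
    "id": "person/jane",
    "type": "Person",
    "name": "Jane Doe",
    "mood": "happy",
}

BASE = "http://example.org/"  # assumed namespace, for illustration only


def record_to_triples(rec, base=BASE):
    """Turn one key-value record into (subject, predicate, object) triples."""
    subject = base + rec["id"]
    triples = []
    for key, value in rec.items():
        if key == "id":
            continue  # the id becomes the subject, not an attribute
        triples.append((subject, base + "attr/" + key, value))
    return triples


triples = record_to_triples(record)
for s, p, o in triples:
    print(s, p, repr(o))
```

Nothing here says what a Person is or how “mood” relates to anything else. That is the purely descriptive level discussed above.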

However, these instance characteristics do contain within them the semantics as to how to describe these attributes (your “glad” is my “happy”), as well as potentially a schematic or conceptual view of how these instances relate to one another and to the broader world. Instance characterizations provide the building blocks, which are then related and made semantically whole via a second “terminological” level.

These terminological, or conceptual, relationships (the TBox [4]) reside at a different level from simply describing things. Rather, these schema (what in this context are best known as ontologies) provide a precise language and means for describing conceptual relationships. If these structural relationships are done well, they are coherent: the hip bone is connected to the thigh bone and not to the ear. Coherence is a matter of a consistent world view that “hangs together” when analyzed via powerful logical techniques available via description logics and other broader mechanisms of the semantic enterprise.

Thus, as we transition from existing assets, the operational workflow splits the input data stream into two pathways:

  • Instances, and their descriptive characteristics, and
  • Conceptual relationships, or ontologies.
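A rough sketch of that split, in Python, may help. The concept names and records are hypothetical, and the “TBox” here holds nothing but subclass links, which is far less than a real ontology would carry. The point is only that the two pathways are kept apart and then matched up.

```python
# TBox: conceptual relationships (here, only subclass links). Hypothetical names.
tbox = {
    "Employee": "Person",  # Employee is a kind of Person
    "Person": "Agent",
}

# ABox: instances and their descriptive attributes.
abox = [
    {"id": "jane", "type": "Employee", "office": "Boston"},
    {"id": "acme", "type": "Organization"},
]


def ancestors(concept, tbox):
    """Walk subclass links upward; stop if a link would repeat (cycle guard)."""
    seen = []
    while concept in tbox and tbox[concept] not in seen:
        concept = tbox[concept]
        seen.append(concept)
    return seen


def types_of(instance, tbox):
    """Asserted type plus every type implied by the TBox."""
    return [instance["type"]] + ancestors(instance["type"], tbox)


for instance in abox:
    print(instance["id"], types_of(instance, tbox))
```

Note that the instance records never mention Agent. That relationship lives only at the conceptual level and can be refined without touching the records.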

A sequential flow of these steps and splits is provided by this diagram below that shows: 1) the conceptual structure of the concept ontology; as 2) matched with the instances and their descriptive attributes that populate that schema.

Ontology and Instance Build Methodology

A key point is that — while a proper starting ontology is essential to our process and proofs-of-concept — it can be grown and scaled incrementally. We leverage as much existing starting structure as possible and can readily bound the scope to meet budget and delivery imperatives.

The concepts and entities that occur within these structures help inform our fairly simple tagging system, scones [3]. (There are also benefits from “triangulating” between entity or instance identification and concept identification that helps inform disambiguation nearly for free; see further [6]). It is also possible to integrate these initial proof-of-concept approaches with third-party tools (e.g., Calais, Expert System (Cogito), etc.) to improve unstructured content characterization.

These approaches are pretty straightforward for any organization wanting to test the idea of becoming a semantic enterprise. Real benefits — such as concept retrievals overcoming the limitations of standard keyword search — can be demonstrated from even small starting ontologies and structures. Given the inherent connectedness of the data, it is possible to expand the scope and usefulness of the information incrementally within fixed and manageable budgets.

Part 2: structWSF: A Web-oriented Services API and Framework

A pivotal part of SD’s infrastructure software is structWSF [7], our platform-independent Web services middleware. structWSF is an abstraction layer that provides the APIs, search endpoints, and specific Web services for accessing, querying or getting results sets from the underlying structured data and ontologies.

structWSF has a standard set of access and retrieval services including browse, full-text search, CRUD, direct record retrievals, and the like. It is embedded within an access and permissions service that acts at the level of registered datasets. Then, based on the requested protocol, structWSF returns the filtered results set. These results sets can be delivered as XML, JSON, or any of the other formats already available [7]. They can readily and dynamically populate HTML pages and forms in any deployment framework. For specific purposes, these results sets can also be returned as pre-staged, properly formatted results streams for driving specific applications.

As an API, the structWSF Web services can be interacted with and driven via standard HTTP requests. Alternatively, these requests can come from simple to complicated Web apps that create the API queries based on user interface choices such as selections from dropdown lists or clicking on various listed options. An interactive demo of this approach is shown by SD’s conStruct application [8], though even simpler Web pages or forms may drive the query interface.
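As a sketch of how a Web form might turn user selections into such a request, consider the Python below. The endpoint URL and the parameter names (`query`, `datasets`, `items`) are placeholders assumed for illustration; the actual structWSF endpoints and parameters are described in its documentation [7].

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, not the documented structWSF API.
ENDPOINT = "http://example.org/ws/search/"


def build_query(ui_choices):
    """Translate form selections into a GET request URL."""
    params = {
        "query": ui_choices.get("keywords", ""),
        "datasets": ";".join(ui_choices.get("datasets", [])),
        "items": ui_choices.get("page_size", 10),
    }
    return ENDPOINT + "?" + urlencode(params)


url = build_query({"keywords": "ontology editor", "datasets": ["sweet-tools"]})
print(url)
```

The user only sees a dropdown and a search box. The query string is assembled behind the scenes.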

Queries and requests to structWSF may also include a parameter for results sets to be returned in particular formats. SD’s irON protocol [5] supports requests or results in CSV, XML or JSON, in addition to other flavors including multiple serializations of RDF.

In this manner, only a simple converter need be added to the structWSF Web services stack in order to “drive” a new application with a particularly formatted results set stream.
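One way to picture the “simple converter” idea is a small registry keyed by format name. This is a sketch under our own assumptions, not the structWSF implementation: the `converter` decorator, the `render` function and the sample results set are all invented for the example.

```python
import csv
import io
import json

CONVERTERS = {}


def converter(fmt):
    """Register a function that serializes a results set into `fmt`."""
    def register(func):
        CONVERTERS[fmt] = func
        return func
    return register


@converter("json")
def to_json(results):
    return json.dumps(results, sort_keys=True)


@converter("csv")
def to_csv(results):
    # Records may be partial, so the header is the union of all keys.
    fields = sorted({key for row in results for key in row})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(results)
    return out.getvalue()


def render(results, fmt):
    """Same results set, different output stream, chosen by one parameter."""
    return CONVERTERS[fmt](results)


results = [{"name": "Jane", "city": "Oslo"}, {"name": "Dick"}]
print(render(results, "json"))
print(render(results, "csv"))
```

Supporting a new target application would then mean writing one more registered function, with no change to the query side.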

structWSF thus acts as a single, uniform Web interface to all of the “black box” nuances of the structured data system organized by the adaptive ontologies. Further, virtually any data structure may be ingested and converted from external sources via an import service and made part of the underlying canonical structure, making the framework perfect for data federation [9]. Lastly, the dataset nature of the framework, and its neutrality to underlying data stores or content management systems, also makes structWSF an excellent framework for one or many nodes to share information and collaborate across the Web [10].

The following diagram shows how a Web-based network of diverse Web portals, data gateways and hubs can work via the structWSF framework to establish a complete collaboration network. Via datasets and differential access rights and permissions, virtually any combination of potential interactions can be supported:

Example Collaboration Network

These potentials are really fundamentally new, and we ourselves are still trying to find the language and analogies to best explain them. structWSF was initially designed as a platform-independent layer between the structured data representation of existing assets and the ontology-driven applications that interact with them. We are now finding that deployment in a broader Web-based context provides additional exciting prospects for integrating various regional offices or to enable direct collaboration with customers, partners or suppliers.

Part 3: Ontology-driven Applications

The basic design of structWSF is to provide a middleware layer that fulfills one or more of these broad user interaction modes:

  • To create, update, delete or otherwise manage data records
  • To browse or view existing records or record sets, based on simple to possibly complex selection or filtering criteria, or
  • To take one of these results sets and progress it through a workflow of some nature, involving specialized analysis, applications, or visualization.

SD has developed generic applications in these areas (with many more possible), the operations of which are guided by the instructions and nature of the underlying data that feeds them. We have proven it is possible to adopt data characterization practices within those ontologies so as to stage or “drive” such generic applications.

In the case of a standard structured data display (say, a simple table like a Wikipedia infobox, for example), such generic design includes templates tailored to various instance types (say, locational information presenting on a map versus people information warranting an image and vital statistics). Alternatively, in the generic design for some specialized application (say, Adobe Flash), the information output of the results set may need to contain certain formats and attributes.
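A minimal Python sketch of type-driven templates follows. The template strings, type names and sample instances are hypothetical; a real display layer would render HTML or a map widget rather than a line of text.

```python
# Hypothetical templates keyed by instance type.
TEMPLATES = {
    "Place": "map: {name} at ({lat}, {lon})",
    "Person": "card: {name}, born {born}",
}
DEFAULT_TEMPLATE = "table: {name}"


def display(instance):
    """Pick the template from the instance's type, not from custom code."""
    template = TEMPLATES.get(instance.get("type"), DEFAULT_TEMPLATE)
    return template.format(**instance)


print(display({"type": "Place", "name": "Iowa City", "lat": 41.66, "lon": -91.53}))
print(display({"type": "Person", "name": "Jane", "born": 1970}))
print(display({"type": "Widget", "name": "X"}))
```

The generic `display` function never changes. What it shows is decided by how the data is typed.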

SD’s “ontology-driven apps”, then, are really informed structured results sets that are outputted in a form suitable to various intended applications. This output form can include a variety of serializations, formats or metadata. This flexibility of output that is tailored to and responsive to particular generic applications is what makes our ontologies “adaptive”.

Expressed in this manner, “ontology-driven apps” seem neither remarkably profound nor clever. They are simply attentive to their intended uses.

Using this structure, then, it is possible to “drive” queries and results sets selections either via direct HTTP request or via simple dropdown selections on HTML forms (that is, from right to left as shown on the first diagram). Similarly, it is possible with a single parameter change to drive either a visualization app or a structured table template from the equivalent query request (that is, from left to right on the first diagram).

“Ontology-driven apps” through SD’s architecture design thus provide two profound benefits. First, the entire system can be driven via simple selections or interactions without the need for any programming or technical expertise. And, second, simple additions of new and minor output converters can work to power entirely new applications available to the system. If, say, Adobe graphics applications need to be swapped tomorrow for Microsoft Silverlight, that switch is easy and can be made transparent to the designer.

The Complete Picture: Embrace the Open World

The ability to develop these systems incrementally and the ability to integrate with external, public data is fundamentally dependent on the open world assumption. The open world assumption is a different logic premise than what many enterprises are used to; relational database systems, for example, embrace the alternate closed world premise.

Open world does not necessarily mean open data and it does not mean open source. Open world is merely a way to think about the information we have and how we act on it. An open world assumption accepts that we never have all necessary information and lacking that information does not itself lead to any conclusions.
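The difference between the two premises can be sketched in a few lines of Python. The facts and names are hypothetical, and real reasoners are far richer than a set lookup. The sketch only shows that a closed world answers “no” where an open world answers “unknown”.

```python
facts = {("jane", "worksFor", "acme")}
known_false = {("jane", "worksFor", "globex")}  # explicitly asserted negatives


def ask_closed(statement):
    """Closed world: anything not stated is false."""
    return statement in facts


def ask_open(statement):
    """Open world: True, False, or None when we simply do not know."""
    if statement in facts:
        return True
    if statement in known_false:
        return False
    return None


question = ("dick", "worksFor", "acme")
print(ask_closed(question))  # the closed world concludes he does not
print(ask_open(question))    # the open world draws no conclusion
```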

Some enterprise circumstances (say, a complete enumeration of customers or products, or even controlled engineering or design environments) may warrant a closed world approach. In those circumstances, the domain of inquiry is well bounded and we can get relatively complete information about it. Engineering an oil drilling platform or launching the Space Shuttle in fact demands that.

But, in most real world circumstances, there is much we don’t know and we interact in complex and external environments. Open world is the proper logic premise for these circumstances. These circumstances also happen to be the very setting in which most knowledge workers and analysts reside.

Open world frameworks provide some incredibly important benefits if the circumstances of their use apply:

  • Domains can be analyzed and inspected incrementally
  • Schema can be incomplete and developed and refined incrementally
  • The data and the structures within these open world frameworks can be used and expressed in a piecemeal or incomplete manner
  • We can readily combine data with partial characterizations with other data having complete characterizations
  • Systems built with open world frameworks are flexible and robust; as new information or structure is gained, it can be incorporated without negating the information already resident, and
  • Open world systems can readily bridge or embrace closed world subsystems.
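The incremental, additive nature of these benefits can be sketched as follows. This Python example is an illustration under assumed names (`merge`, `describe`, and the sample statements), not a description of any particular triple store.

```python
def merge(graph, new_triples):
    """Add statements; nothing already present is removed or rewritten."""
    return graph | set(new_triples)


def describe(graph, subject):
    """Collect whatever is currently known about one subject."""
    return {p: o for s, p, o in sorted(graph) if s == subject}


graph = {("acme", "type", "Customer")}            # a partial description
graph = merge(graph, [("acme", "city", "Oslo")])  # more data arrives later
graph = merge(graph, [("Customer", "subClassOf", "Organization")])  # schema refined

print(describe(graph, "acme"))
```

Each step leaves earlier statements intact, so a partially described record and a fully described one can sit side by side in the same structure.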

One might argue, as we believe, that the biggest impediment to the semantic enterprise is the mind shift necessary to start thinking about and accepting the open world premise. Again, this perspective is not applicable to all problems and domains.  But, where it is, much can be left in place and leveraged with semantic technologies, so long as the enterprise begins to look at these existing assets through a different open-world lens.

Summary

So, let’s return to the rhetorical questions that began this posting.

It should now be clear that it is possible to start small in testing the transition to a semantic enterprise. These efforts can be done incrementally and with focus on early, high-value applications and domains.

Further, we need not abandon past practices. There is much that can be done to leverage existing assets. Indeed, those prior investments are often the requisite starting basis to inform semantic initiatives. However, in leveraging those assets, it is important that the enterprise begin to embrace and understand the open world assumption.

We also see that RDF and OWL, while important behind the scenes as a canonical data model and languages for organizing this information, need not be exposed as such to most users. Most instance data can be expressed as is with the data languages of choice such as XML, JSON or whatever.

We also see these technologies are neutral to the question of open or public sources. The techniques can equivalently be applied to internal, closed, proprietary data and structures. Moreover, the technologies can themselves be used as a basis for bringing external information into the enterprise.

Without a doubt, some of the early years in describing semantic technologies were burdened with some unfortunate bad information and lack of sophistication. Today’s semantic Web is nimble, agile, and ready to be deployed immediately at low cost and risk. So, jump on in! We think you’ll find the water to be just fine.

This post is Part V of an occasional AI3 series on ontology best practices.

[1] These selections and requests need not occur only via user interfaces or HTML forms, but also programmatically via API or direct Web services calls.
[2] There are two main classes of visualizations possible with our systems:  1) navigations or explorers of the concept space, which is a particularly open challenge for large, graph-based knowledge bases (see, for example, our Subject Concept Explorer using the UMBEL Financial Account concept, and click on the bubbles); or 2) conventional data visualizations or graphics or mappings of instance data. Both are shown as workflow boxes on the first diagram above.
[3] See http://structureddynamics.com/products.html for a general descriptive illustration of Structured Dynamics’ product stack. There is also a longer slideshow, from which this diagram is drawn as slide #37.
[4] We use the reference to ABox and TBox in accordance with our working definition for description logics:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[5] For the specification and a use case of irON using the CSV (commON) serialization, see http://openstructs.org/iron.
[6] Via this approach we now can assess concept matches in addition to entity matches. This means we can triangulate between the two assessments to aid disambiguation. Because of these logical segmentations, we also have multiple “clusters” (that is, either the concept, type, superType or dimension) upon which to do our disambiguation evaluations, either between concepts and entities or within the various concept clusters. We can do so via either multiple semantic vectors (for statistical-based methods) or multiple features (for machine learning methods). In other words, because of logical segmentation, we have increased the informational power of our concept graph. See further http://www.mkbergman.com/759/supertypes-and-logical-segmentation-of-instances/.
[7] See http://openstructs.org/structwsf/architecture; also, the available export formats are shown at http://constructscs.com/documentation/instructions/export.
[8] There is an online demo of conStruct using the Sweet Tools database of semantic Web and -related tools at http://constructscs.com/conStruct/browse/; for background on this use case, see http://www.mkbergman.com/845/a-most-un-common-way-to-author-datasets/.
[9] See, for example, http://www.mkbergman.com/496/structwsf-a-framework-for-data-mixing/.
[10] See, for example, http://www.mkbergman.com/497/structwsf-a-framework-for-collaboration-networks/.

Posted by AI3's author, Mike Bergman

Posted on November 23, 2009 at 11:24 am in Description Logics, Ontology Best Practices, Structured Dynamics, irON | Comments (7)
The URI link reference to this post is: http://www.mkbergman.com/847/ontology-driven-applications-using-adaptive-ontologies/
Date:   September 28, 2009

The Tower of Babel by Pieter Brueghel the Elder (1563)

The Benefits are Greater — and Risks and Costs Lower — Than Many Realize

I have been meaning to write on the semantic enterprise for some time. I have been collecting notes on this topic since the publication by PricewaterhouseCoopers (PWC) of an insightful 58-pp report earlier this year [1]. The PWC folks put their finger squarely on the importance of ontologies and the delivery of semantic information via linked data in that publication.

The recent publication of a special issue of the Cutter IT Journal devoted to the semantic enterprise [2] has prompted me to finally put my notes in order. This Cutter volume has a couple of good articles including its editorial intro [3], but is overall spotty in quality and surprisingly unexciting. I think it gets some topics like the importance of semantics to data integration and business intelligence right, but in other areas is either flat wrong or misses the boat.

The biggest mistakes are statements such as “. . . a revolutionary mindset will be needed in the way we’ve traditionally approached enterprise architecture” or that the “. . . semantic enterprise means rethinking everything.”

This is just plain hooey. From the outset, let’s make one thing clear:  No one needs to replace anything in their existing architecture to begin with semantic technologies. Such overheated rhetoric is typical consultant hype and fundamentally mischaracterizes the role and use of semantics in the enterprise. (It also tends to scare CIOs and to close wallets.)

As an advocate for semantics in the enterprise, I can appreciate the attraction of framing the issue as one of revolution, paradigm shifts, and The Next Big Thing. Yes, there are manifest benefits and advantages for the semantic enterprise. And, sure, there will be changes and differences. But these changes can occur incrementally and at low risk while experience is gained.

The real key to the semantic enterprise is to build upon and leverage the assets that already exist. Semantic technologies enable us to do just that.

Think about semantic technologies as a new, adaptive layer in an emerging interoperable stack, and not as a wholesale replacement or substitution for all of the good stuff that has come before. Semantics are helping us to bridge and talk across multiple existing systems and schema. They are asking us to become multi-lingual while still allowing us to retain our native tongues. And, hey! we need not be instantly fluent in these new semantic languages in order to begin to gain immediate benefits.

As I noted in my popular article on the Advantages and Myths of RDF from earlier this year:

We can truly call RDF a disruptive data model or framework. But, it does so without disrupting what exists in the slightest. And that is a most remarkable achievement.

That is still a key takeaway message from this piece. But, let’s take a fresh look at, and list, the advantages of moving toward the semantic enterprise [4].

Perspective #1: Incremental, Learn-as-you-Go is Best Strategy

For the interconnected reasons noted below, RDF and semantic technologies are inherently incremental, additive and adaptive. The RDF data model and the vocabularies built upon it allow us to progress in the sophistication of our expressions from pidgin English (simple Dick sees Jane triples or assertions) to elegant and expressive King’s English. Premised on the open world assumption (see below), we also have the freedom to only describe partial domains or problem areas.

From a risk standpoint, this is extremely important. To get started with semantic technologies we need neither: 1) comprehensively describe or tackle the entire enterprise information space; nor 2) do so initially with precision and full expressiveness. We can be partial and somewhat crude or simplistic in our beginning efforts.

Also extremely important is that we can add expressivity and scope as we go. There is no penalty for starting small or simple and then growing in scope or sophistication. Just like progressing from a kindergarten reader to reading Tolstoy or Dickens, we can write and read schema of whatever complexity our current knowledge and understanding allow.

Perspective #2: Augment and Layer on to Existing Assets, Don’t Replace Them!

Semantic technology does not change or alter the fact that most activities of the enterprise are transactional, communicative or documentary in nature. Structured, relational data systems for transactions or records are proven, performant and understood. Writing and publishing information, sometimes as documents and sometimes as spreadsheets or Web pages, is (and will remain) the major vehicle for communicating within the enterprise and to external constituents.

On its very face, it should be clear that the meaning of these activities — their semantics, if you will — is by nature an augmentation or added layer to how to conduct the activities themselves. Moreover, as we also know, these activities are undertaken for many different purposes and within many different contexts. The inherent meaning of these activities is also therefore contextual and varied.

This simple truth affirms that semantic technologies are not a starting basis, then, for these activities, but a way of expressing and interoperating their outcomes. Sure, some semantic understanding and common vocabularies at the front end can help bring consistency and a common language to an enterprise’s activities. This is good practice, and the more that can be done within reason while not stifling innovation, all the better. But we all know that the budget department and function has its own way of doing things separate from sales or R&D. And that is perfectly OK and natural.

These observations — in combination with semantic technologies — can thus lead to a conceptual architecture for the enterprise that recognizes there are “silo” activities that can still be bridged with the semantic layer:

Conceptual Semantic Enterprise Architecture

Under this conceptual architecture, “RDFizers” (similar to the ETL function) or information extractors working upon unstructured or semi-structured documents expose their underlying information assets in RDF-ready form. This RDF is characterized by one or more ontologies (multiples are actually natural and preferred [5]), which then can be queried using the semantic querying language, SPARQL.
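Here is a rough Python sketch of that flow, with loud caveats. The two “silos”, their field names and the shared vocabulary are invented for the example, and the `match` function is only a stand-in for a single SPARQL triple pattern, not SPARQL itself.

```python
# Two hypothetical silos with different local field names.
crm_rows = [{"cust_id": "c1", "cust_name": "Acme"}]
billing_rows = [{"customer": "c1", "amount_due": 1200}]

# Per-source mappings from local fields to shared predicates (assumed vocabulary).
CRM_MAP = {"cust_name": "name"}
BILLING_MAP = {"amount_due": "balance"}


def rdfize(rows, id_field, mapping):
    """Expose rows as triples, leaving the source system untouched."""
    return {
        (row[id_field], mapping[field], value)
        for row in rows
        for field, value in row.items()
        if field in mapping
    }


def match(graph, pattern):
    """Match one triple pattern; None acts as a wildcard, like a query variable."""
    return sorted(
        (t for t in graph
         if all(want is None or want == got for want, got in zip(pattern, t))),
        key=repr,
    )


graph = rdfize(crm_rows, "cust_id", CRM_MAP) | rdfize(billing_rows, "customer", BILLING_MAP)
print(match(graph, ("c1", None, None)))
```

One query now returns facts that originated in two separate systems, neither of which had to change.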

We have written at length about the proper separation of instance records and data from schema, called the ABox and TBox, respectively, in description logics [6], a key logic premise to the semantic Web. Thus, through appropriate architecting of existing information assets, it is possible to leave those systems in place while still gaining the interoperability advantages of the semantic enterprise.

Another aspect of this information re-use is also a commitment to leverage existing schema structures, be they industry standards, XML, MDM, relational schema or corporate taxonomies. The mappings of these structures in the resulting ontologies thus become the means to codify the enterprise’s circumstances into an actionable set of relationships bridging across multiple, existing information assets.

Perspective #3: The First Major Benefit is from Data Federation

Clearly, then, the first obvious benefit to the semantic enterprise is to federate across existing data silos, as featured prominently in the figure above. Data federation has been the Holy Grail of IT systems and enterprises for more than three decades. Expensive and involved efforts from ETL and MDM and then to enterprise information integration (EII), enterprise application integration (EAI) and business intelligence (BI) have been a major focus.

Frankly, it is surprising that no known vendors in these spaces (aside from our own Structured Dynamics, hehe) premise their offerings on RDF and semantic technologies. (Though some claim so.) This is a major opportunity area. (And we don’t mind giving our competitors useful tips.)

Perspective #4: Wave Goodbye to Rigid, Inflexible Schema

Instance-level records and the ABox work well with relational databases. Their schema are simple and relatively fixed. This is fortunate, because such instance records are the basis of transactional systems where performance and throughput are necessary and valued.

But at the level of the enterprise itself — what its business is, its business environment, what is constantly changing around it — trying to model its world with relational schema has proven frustrating, brittle and inflexible. Though relational and RDF schema share much logically, the physical basis of the relational schema does not lend itself to changes and it lacks the flexibility and malleability of the graph-based RDF conceptual structure.
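A small Python sketch using the standard `sqlite3` module illustrates the contrast. The table, the `competitor` attribute and the sample data are hypothetical. An `ALTER TABLE` would of course resolve the relational case, but that is precisely the kind of schema change that tends to require planning and migration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE company (id TEXT PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO company VALUES ('c1', 'Acme')")

# A new kind of fact arrives. The fixed schema rejects it until the table is altered.
try:
    db.execute("UPDATE company SET competitor = 'c2' WHERE id = 'c1'")
    relational_accepted = True
except sqlite3.OperationalError:
    relational_accepted = False  # SQLite reports an unknown column

# The graph form has no fixed columns; the new predicate is just another triple.
graph = {("c1", "name", "Acme")}
graph.add(("c1", "competitor", "c2"))

print(relational_accepted, len(graph))
```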

Knowledge management and business intelligence are by no means new concepts for the enterprise. What is new and exciting, however, is how the emergence of RDF and the semantic enterprise will open new doors and perspectives. Once freed of schema constraints, we should see the emergence of “agile KM” similar to the benefits of agile software development.

Because semantic technologies can operate in a layer apart from the standard data basis for the enterprise, there is also a smaller footprint and risk to experimenting at the KM or conceptual level. More options and more testing and much lower costs and risks will surely translate to more innovation.

Just as semantic technologies are poorly suited for transactional or throughput purposes, we should see the complementary and natural migration of KM to the semantic side of the shop. There are no impediments for this migration to begin today. In the process, as yet unforeseen and manifest benefits in agility, experimentation, inferencing and reasoning, and therefore new insights, will emerge.

Perspective #5: Data-driven Apps Shift the Software Paradigm

The same ontologies that guide the data federation and interoperability layer can also do double-duty as the specifications for data-driven applications. The premise is really quite simple: Once it is realized that the inherent information structure contained within ontologies can guide hierarchies, facets, structured retrievals and inferencing, the logical software design is then to “drive” the application solely based on that structure. And, once that insight is realized, then it becomes important, as a best practice, to add further specifications in order to also carry along the information useful for “driving” user interfaces [7].

Thus, while ontologies are often thought solely to be for the purpose of machine interpretation and communication, this double-duty purpose now tells us that useful labels and such for human use and consumption are also an important goal.

When these best practices of structure and useful human labels are made real, it then becomes possible to develop generic software applications, the operations of which vary solely by the nature of the structure and ontologies fed to them. In other words, ontologies now become the application, not custom-written software.
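As a sketch of what “varying solely by the structure fed to them” can mean, the Python below builds a navigation menu from an ontology fragment. The fragment, its labels and the `menu` function are assumptions for illustration, and the sketch presumes the hierarchy has no cycles.

```python
# Hypothetical ontology fragment: structure plus human-readable labels.
ontology = {
    "Tool": {"label": "Tools", "parent": None},
    "Editor": {"label": "Ontology editors", "parent": "Tool"},
    "Reasoner": {"label": "Reasoners", "parent": "Tool"},
}


def menu(ontology, parent=None, depth=0):
    """Build indented menu lines from whatever hierarchy is supplied."""
    lines = []
    for concept, info in sorted(ontology.items()):
        if info["parent"] == parent:
            lines.append("  " * depth + info["label"])
            lines.extend(menu(ontology, concept, depth + 1))
    return lines


print("\n".join(menu(ontology)))
```

Feed the same function a different ontology and a different menu appears, with no change to the code. The human-readable labels are what make the result usable in an interface.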

Of course, this does not remove the requirement to develop and write software. But the nature and focus of that development shifts dramatically.

From the outset, data-driven software applications are designed to be responsive to the structure fed them. Granted, specific applications in such areas as search, report writing, analysis, data visualization, import and export, format conversions, and the like, still must be written. But, when done, they require little or no further modification to respond to whatever compliant ontologies are fed to them — irrespective of domain or scope.

It thus becomes possible to see a relatively small number of these generic apps that can respond to any compliant structure.
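To make the idea concrete, here is a minimal sketch in Python. It assumes a toy ontology held as a plain dictionary; the `pref_label` and `tooltip` keys, the two sample ontologies and the `build_form` function are all hypothetical illustrations, not part of any real ontology language or tool. The only point is that one generic function serves any domain once a different structure is fed to it.

```python
# Hypothetical toy "ontologies": classes, their attributes, and human-readable labels.
# A real system would read these from RDFS/OWL; dicts keep the sketch self-contained.
PRODUCT_ONTOLOGY = {
    "Product": {
        "pref_label": "Product",
        "attributes": {
            "sku": {"pref_label": "Stock number", "tooltip": "Internal catalog ID"},
            "price": {"pref_label": "List price", "tooltip": "Price before discounts"},
        },
    },
}

STAFF_ONTOLOGY = {
    "Employee": {
        "pref_label": "Employee",
        "attributes": {
            "hired": {"pref_label": "Hire date", "tooltip": "First day of work"},
        },
    },
}


def build_form(ontology, class_name):
    """Generic 'app': derive form fields purely from the structure it is fed."""
    cls = ontology[class_name]
    fields = [
        {"name": name, "label": spec["pref_label"], "help": spec["tooltip"]}
        for name, spec in cls["attributes"].items()
    ]
    return {"title": cls["pref_label"], "fields": fields}


# The same function, unchanged, handles two unrelated domains.
product_form = build_form(PRODUCT_ONTOLOGY, "Product")
staff_form = build_form(STAFF_ONTOLOGY, "Employee")
```

Note that the human-facing labels and tooltips travel with the structure, which is why carrying them in the ontology matters for this style of application.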

The shift this represents can be illustrated by two areas that have been traditional choke points for IT within the enterprise: queries to local data stores (in order to get needed information for analysis and decisions) and report writers (necessary to communicate with management and constituents).

It is not unusual to hear of delays of weeks or months in IT groups responding to such requests. It is not that IT departments are lazy or unresponsive, but that the schemas and tools they use to meet user demands are inflexible.

It is hard to know just how large the upside is for data-driven apps and generic tools, but it may prove to be of even greater import than overcoming the data federation challenge.

In any event, while potentially disruptive, this prospect of data-driven applications can start small and exist in parallel with all existing ways of doing business. Yes, the upside is huge, but it need not be gained by abandoning what already works.

Perspective #6: Adaptive Ontologies Flatten, Democratize the KM Process

So, assume, then, a knowledge management (KM) environment supported by these data-driven apps. What perspective arises from this prospect?

One obvious perspective is that the KM effort shifts toward describing the information environment itself: its nature, its concepts and their relationships. In other words, ontologies themselves become the focus of effort and development. The KM problem no longer needs to be handed off to the IT department or third-party software. The actual concepts, terminology and relations that comprise coherent ontologies now become the foundation of KM activities.

An earlier perspective emphasized how most any existing structure can become a starting basis for ontologies and their vocabularies, from spreadsheets to naïve data structures and lists and taxonomies. So, while producing an operating ontology that meets the best practice thresholds noted herein has certain requirements, kicking off or contributing to this process poses few technical or technology demands.

The skills needed to create these adaptive ontologies are logic, coherent thinking and domain knowledge. That is, any subject matter expert or knowledge worker worth keeping on the payroll has, by definition, the necessary skills to contribute to useful ontology development and refinement.

With adaptive ontologies powering data-driven apps we thus see a shift in roles and responsibilities away from IT to knowledge workers themselves. This shift acts to democratize the knowledge management function and flatten the organization.

Perspective #7: The Semantic Enterprise is ‘Open’ to the World

Enterprise information systems, particularly relational ones, embody a closed world assumption that holds that any statement that is not known to be true is false. This premise works well where there is complete coverage of the entities within a knowledge base, such as the enumeration of all customers or all products of an enterprise.

Yet, in the real (“open”) world there is no guarantee or likelihood of complete coverage. Thus, under an open world assumption the absence of a given assertion implies neither that it is true nor that it is false: it simply is not known. An open world assumption is one of the key factors in enabling adaptive ontologies to grow incrementally. It is also the basis for enabling linkage to external (and surely incomplete) datasets.
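The difference between the two assumptions can be sketched in a few lines of Python. The facts, the predicate name and both `ask_*` functions are hypothetical; the sketch only illustrates that a closed-world query answers "false" for a missing fact, while an open-world query answers "unknown".

```python
# Known facts as (subject, predicate, object) triples; the names are illustrative.
FACTS = {
    ("acme", "is_customer_of", "us"),
    ("globex", "is_customer_of", "us"),
}


def ask_closed_world(facts, statement):
    """Closed world: anything not recorded is taken to be false."""
    return statement in facts


def ask_open_world(facts, statement, known_false=frozenset()):
    """Open world: absence of a statement means 'unknown', not 'false'.

    Returns True, False, or None (unknown). A statement is False only when
    it has been explicitly recorded as false.
    """
    if statement in facts:
        return True
    if statement in known_false:
        return False
    return None


query = ("initech", "is_customer_of", "us")
closed_answer = ask_closed_world(FACTS, query)  # treated as false
open_answer = ask_open_world(FACTS, query)      # treated as unknown
```

Because the open-world answer leaves room for facts not yet seen, new assertions or external datasets can be added later without contradicting earlier answers.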

Fortunately, there is no requirement for enterprises to make some philosophical commitment to either closed- or open-world systems or reasoning. It is perfectly acceptable to combine traditional closed-world relational systems with open-world reasoning at the ontology level. It is also not necessary to make any choices or trade-offs about using public v. private data or combinations thereof. All combinations are acceptable and easily accommodated.

As noted, one advantage of open-world reasoning at the ontological level is the ability to readily change and grow the conceptual understanding and coverage of the world, including incorporation of external ontologies and data. Since this can easily co-exist with underlying closed-world data, the semantic enterprise can readily bridge both worlds.

Perspective #8: The Semantic Enterprise is a Disruptive Innovation, without Being Disruptive

Unfortunately, because this is a relatively new area, some pundits and consultants find it advantageous to present the semantic Web as more complicated and commitment-laden than it need be. Either the proponents of that viewpoint don’t know what they are saying, or they are being cynical toward the market. The major point underlying the fresh perspectives herein is to reiterate that it is quite possible to start small, and to do so with low cost and risk.

While it is true that semantic technologies within the enterprise promise some startling upside potentials and disruptions to the old ways of doing business, the total beauty of RDF and its capabilities and this layered model is that those promises can be realized incrementally and without hard choices. No, it is not free: a commitment to begin the process and to learn is necessary. But, yes, it can be done with exciting enterprise-wide benefits at a pace and risk level that is comfortable.

The good news about the dedicated issue of the Cutter IT Journal and the earlier PWC publication is that the importance of semantic technologies to the enterprise is now beginning to receive its just due. But as we ramp up this visibility, let’s be sure that we frame these costs and benefits with the right perspectives.

The semantic enterprise offers some important new benefits not obtainable from prior approaches and technologies. And, the best news is that these advantages can be obtained incrementally and at low risk and cost while leveraging prior investments and information assets.


[1] Paul Horowitz, ed., 2009. Technology Forecast: A Quarterly Journal, PricewaterhouseCoopers, Spring 2009, 58 pp. See http://www.pwc.com/us/en/technology-forecast/spring2009/index.jhtml (after filling out contact form). I reviewed this publication in an earlier post.
[2] Mitchell Ummel, ed., 2009. “The Rise of the Semantic Enterprise,” special dedicated edition of the Cutter IT Journal, Vol. 22(9), 40 pp., September 2009. See http://www.cutter.com/offers/semanticenterprise.html (after filling out contact form).
[3] It is really not my purpose to review the Cutter IT Journal issue nor to point out specific articles that are weaker than others. It is excellent we are getting this degree of attention, and for that I recommend signing up and reading the issue yourself. IMO, the two useful articles are: John Kuriakose, “Understanding and Adopting Semantic Web Technology,” pp. 10-18; and Shamod Lacoul, “Leveraging the Semantic Web for Data Integration,” pp. 19-23.
[4] As a working definition, a semantic enterprise is one that adopts the languages and standards of the semantic Web, including RDF, RDFS, OWL and SPARQL and others, and applies them to the issues of information interoperability, preferably using the best practices of linked data.
[5] One prevalent misconception is that it is desirable to have a single, large, comprehensive ontology. In fact, multiple ontologies, developing and growing on multiple tracks in various contexts, are much preferable. This decentralized approach brings ontology development closer to ultimate users, allows departmental efforts to proceed at different paces, and lowers risk.
[6] Here is our standard working definition for description logics:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[7] I first introduced this topic in Ontologies as the ‘Engine’ for Data-Driven Applications. Some of the user interface considerations that can be driven by adaptive ontologies include: attribute labels and tooltips; navigation and browsing structures and trees; menu structures; auto-completion of entered data; contextual dropdown list choices; spell checkers; online help systems; etc.

Posted by AI3's author, Mike Bergman

Posted on September 28, 2009 at 10:37 am in Description Logics, Ontologies, Semantic Web, Structured Dynamics, Structured Web | Comments (5)
The URI link reference to this post is: http://www.mkbergman.com/825/fresh-perspectives-on-the-semantic-enterprise/
Date:   June 30, 2009

Random Colour Swirl photo courtesy of PD Nathan at Photobucket

Interoperable Naïve Data Structs, Datasets and Canonical RDF

As I noted in my review of SemTech 2009, one of the key themes of the conference was data federation. Unfortunately, data federation has been a term a bit out of vogue for a while. (Though I still think it best captures the space.)

The current vernacular has been pushing forward an alternative: data mixing. One of the larger product pushes at the conference was by Zepheira for its new Freemix service and product. Freemix is a hosted service largely built around the Exhibit data display application, aided by some tools to make creating an exhibit easier. Exhibit is an attractive presentation system; for nearly three years AI3’s own Sweet Tools dataset listing of semantic Web and -related tools has been presented via Exhibit.

Freemix looks promising and is now being offered in beta. But one thing caught my ear when listening to the company’s announcement: they are not yet able and ready to show the “data mixing” part of the system. Its release is apparently being delayed until later this year because of the difficulties encountered.

This post coincides with the release of the alpha version of the structWSF code on the OpenStructs Web site. It is available for download under Apache 2 license.
We’ll be blogging a few more times in the coming days regarding other possible uses and applications for this platform-independent Web services framework.

What is Data Mixing and Why is it So Hard?

As a new term there is no “official” definition of data mixing. However, I think we can consider it as generally equivalent to the older data federation concept.

Data federation is the bringing together of data from heterogeneous and often physically distributed data sources into a single, coherent view. Sometimes this is the result of searching across multiple sources, in which case it is called federated search. But it is not limited to search. Data federation is a key concept in business intelligence and data warehousing and a driver behind master data management (MDM).

As I first wrote about data federation about five years ago [1]:

Data federation first became a research emphasis within the biology and computer science communities in the 1980s. At that time, extreme diversity in physical hardware, operating systems, databases, software and immature networking protocols hampered the sharing of data.
Yet it is easy to overlook the massive strides in overcoming these obstacles in the past two decades.
Climbing the Data Federation Pyramid

The Internet and its TCP/IP and Web HTTP protocols and XML standards in particular, have been major contributors to overcoming respective physical and syntactical and data exchange heterogeneities. The current challenge is to resolve differences in meaning, or semantics, between disparate data sources. Your “glad” may be someone else’s “happy” and you may organize the world into countries while others organize by regions or cultures.

Resolving semantic heterogeneities is also called semantic mediation or data mediation. Though it displays as a small portion of the pyramid above, resolving semantics is a complicated task and may involve structural conflicts (such as naming, generalization, aggregation), domain conflicts (such as schemas or units), or data conflicts (such as synonyms or missing values). Researchers have identified nearly 40 distinct types of possible semantic heterogeneities [2].

Ontologies provide a means to define and describe these different worldviews. Referentially integral languages such as RDF (Resource Description Framework) and its schema implementation (RDF-S) or the Web ontological description language OWL are leading standards among other emerging ones for machine-readable means to communicate the semantics of data.

Fortunately, we have climbed most of this data federation pyramid. The stumbling block now is semantics. This is made all the harder when we place too much burden on the data transmission or “packet” itself. In other words, does exchange also carry with it the burden of meaning? The rest of this post tries to explain what I mean by this and how it relates to our new structWSF Web services framework.

Is it Apples or Oranges?

Not to pick on any one thing or any individuals, but three recent threads on semantic Web-related mailing lists help illustrate in various ways some interesting mindsets. While there is much on each of these threads of other value, I’m only focusing on a narrow topic from each based on my thesis at hand.

And, what is that thesis? It is simply that we too often mix instance record and attribute assertions with schema representations and world views. And, when we do, we sometimes make mountains out of molehills (or mix apples and oranges to completely mix metaphors).

Example 1: Squeezing RDF into JSON

JSON (JavaScript Object Notation) is a data notation or syntax, easily created and widely used for current Web apps. It has a rather simple syntax for representing attribute-value pairs. Many useful tools and parsers for the serialization exist.

In keeping with his general and broad criticisms of how the semantic Web standards and approaches have been promulgated by the W3C to date, John Sowa most recently expressed his ideas in a posting to the ontolog-forum mailing list under the heading of ‘Semantic Systems’ [3]. In this thread, John proposes:

1. The recommended exchange form for RDF will become JSON. Any JSON documents that are limited to triples can use the old XML-based RDF form, but they can also use the more compact and more general full JSON.

Then, in a subsequent posting to that thread he notes:

5. The W3C made a major blunder with a one-size-fits-all approach that tried to use a document tagging language as a knowledge representation language. The result was the *worst* notation for logic ever invented.

Finally, he goes on to note in a further post:

JSON could be used as an alternative to XML for the syntax, but the lack of a standard semantics for JSON means that it could *not* be used as a replacement for RDF *unless* an official standard were adopted for mapping RDF to and from a particular subset of JSON whose semantics was defined in Common Logic.

All of this John proposes in the spirit of:

The goal of my proposal is nothing less than a total *integration* of the Semantic Web methodologies with the methodologies that have been used in the traditional software development community [3].

I find common ground with a couple of the ideas in this proposal. First, accepted formats like JSON should have a prominent place in data exchange. Second, leveraging methodologies used in the traditional community is definitely a good thing.

But John, while suggesting reuse of existing traditions, is also paradoxically recommending a wholesale replacement for RDF. He is also positing a single exchange standard (JSON). And, he stops tantalizingly short of recognizing an important truth that I’m sure he knows: simple instance record assertions and representations — the essence of data exchange — can and should be viewed separately from schema representations.

As I have noted in my earlier naïve data ‘structs’ series, there are in fact scores of existing data transfer formats that have been adopted by their communities (and are likely to remain popular within those communities for some time) that can play a similar role to JSON. So long as the role of data exchange is kept to the assertions (“metadata”) about instances, many formats can play in the sandbox.

The role of RDF may or may not reside with data exchange. To conflate and equate RDF and JSON is to reduce the power of keeping instance record representations separate from schema and world view representations. John’s basic sensibilities, I think, could be more effectively promoted by not posing ‘either-or’ strawmen and recognizing that data exchange formats will ALWAYS be diverse and heterogeneous.

Observation: Existing and emerging data ‘structs’ useful to data exchange will remain manifest in format and diversity; data exchange imperatives are a different matter from schema and knowledge representation.
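As a rough illustration of keeping the two concerns apart, the Python sketch below treats a JSON payload as nothing more than attribute-value assertions about one instance, while the link to schema-level predicates lives in a separate mapping. The record, the `ATTRIBUTE_MAP` table, the `example.org` URIs and the `record_to_triples` function are all hypothetical.

```python
import json

# Hypothetical instance record as it might arrive in a JSON exchange.
payload = '{"id": "emp-17", "name": "Jane Doe", "dept": "Research"}'

# Kept separately from the exchange payload: how local attribute names map to
# predicates defined in a schema. The example.org URIs are placeholders.
ATTRIBUTE_MAP = {
    "name": "http://example.org/schema#fullName",
    "dept": "http://example.org/schema#department",
}
BASE = "http://example.org/id/"


def record_to_triples(record, attribute_map, base):
    """Turn one attribute-value record into (subject, predicate, object) triples."""
    subject = base + record["id"]
    return [
        (subject, attribute_map[key], value)
        for key, value in record.items()
        if key in attribute_map
    ]


triples = record_to_triples(json.loads(payload), ATTRIBUTE_MAP, BASE)
```

The payload itself never has to carry the schema; swapping JSON for another simple format would change only the parsing step.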

Example 2: RDFa is Not ‘Expressive’ Enough

Somewhat in contrast to this thread was a different one by Martin Hepp, editor of the excellent GoodRelations ontology, on the LOD (linked open data) mailing list [4]. This thread, which sensibly questions how difficult it is for mere mortals to configure an Apache server to support publishing RDF, reached further into the realm of RDFa as a document annotation language.

As Hepp states,

The reason is that, as beautiful the idea is of using RDFa to make a) the human-readable presentation and b) the machine-readable meta-data link to the same literals, the problematic is it in reality once the structure of a) and b) are very different. For very simple property-value pairs, embedding RDFa markup is no problem. But if you have a bit more complexity at the conceptual level and in particular if there are significant differences to the structure of the presentation (e.g. in terms of granularity, ordering of elements, etc.), it gets very, very messy and hard to maintain.

Further discussion in this thread elaborates the interest in having the documents in which the RDFa is embedded carry much more schema-level information.

Like the Sowa case, this raises the question of where to draw the line. Should embedded metadata in documents carry complex schema information as well? So, we now shift the focus from data exchange to schema representation.

I think this is really unnecessary since it is quite easy in RDFa to refer to a separately specified schema. By, in this case, conflating metadata transfer and exchange with schema, the bar has been raised unnecessarily high.

If we need to capture schema and world views, fine, let us do so directly and succinctly. Then, let our document metadata (in this case using RDFa) make attribute assertions about that “payload” simply and cleanly. The Web certainly does not need individual documents carrying with them entire schema representational views of the world.

Observation: Data exchange, even based on RDF (via RDFa), is best kept to the assertions of facts and attributes.

Example 3: Mixing Vocabularies

In a microformats context, Thomas Lörtsch posed some questions on mixing vocabularies [5] and how they should be interpreted. This caused an involved discussion of intent and possible implications and best practices, with discussants including Brian Suda, Peter Mika, Ben Ward and others. It also led to the start of a useful wiki page on how objects should be represented in Web pages when multiple microformats can be invoked.

For quite some time microformats, I think, have gotten the “mix” just about right. They have created well-reasoned attributes for distinct instance types and seek to keep their embedding of that information simple in existing documents. Some advocate while others question the rigor of the microformat structure; that is not the topic here.

What is interesting about this thread is that it evolved to discuss the implications and best practices when an author posts a document with more than one microformat. How do these vocabularies relate? How should we, as “consumers” of the document, parse the vocabularies?

Yahoo!’s SearchMonkey service has recognized microformats for some time, and its questions regarding interpretation and best practices in the thread were natural. But the interesting point that seemed to come out of this thread is that users will post microformats as they wish. While care and standards in the design of microformats can help reduce confusion and conflict, they cannot guarantee their absence. The final responsibility for proper ingest and processing likely resides with the aggregators and publishers that consume such data.

So, here, too, we have another case of asserting metadata and embedding for data exchange in a slightly different native format than RDF. Huzzah!

Observation: Standards setters and consuming agents (often aggregators, publishers or search engines) should take lead responsibility for best practices and processing attribute data, realizing that original authors and developers may not fully comply.

Revisiting the ABox and TBox Split

These examples are a bit of a long way around the barn to reinforce what we have been arguing for some time: the need for a proper split between the ABox (assertions related to instances) and the TBox (concept relationships, schema and world views) [6]. This has been a pretty constant theme in our writing, ranging from first introductions, to its relation to description logics, relationships to existing data ‘structs’, and explicit discussion of ABox and TBox roles in a four-part series.

One of the key points throughout this writing is that an ABox-TBox mindset provides a context and rigor for looking at questions such as our three examples above. In all three cases, I argue, the seeming conundrums result from lacking this mindset. Once this mindset is applied, the respective roles of various data formats, RDF, schema and the like naturally fall into place.

Of course, the Web is also a dirty and chaotic place where niceties of design and best practices are routinely ignored or unknown or purposefully rejected. So be it. This is reality. This reality needs to be accommodated. But good design can help overcome it and work to establish resilient, flexible architectures.

Of course, even though this might be good design, there is no ability to enforce such distinctions across the Web. However, insofar as key implementors are concerned (standards writers, major publishers, tools developers, industry experts, and the like) we can put in place better approaches. This mantra is at the heart of all that Structured Dynamics does — including the structWSF Web services framework, just released as open source code.

A General Data Mixing Model

So, now we can finally turn our attention to the structWSF Web services framework, more broadly described here.

There are a number of perspectives and contexts to view this structWSF framework. In this posting, we take the boundary conditions of data formats and data exchange [7]. The key question for this perspective is: given the realities noted above, what is an adaptive framework for data mixing on the Web? Our schematic answer to this question is below:

structWSF Data Model Relationships

The basic design has two key data considerations. First, all structWSF tools and Web services and schema work from the canonical RDF data model. It is the hub and common denominator for all structWSF installations. We are able to design and optimize generic tools and services (including converters) around this canonical framework.

Second, we assume most everything in the outside world to be non-compliant with this canonical model, with the data representations often naïve and incomplete. Converters (also known as translators or RDFizers) are an essential bridge to this external world, and need to be designed for re-use and extensibility.

Where external sources are compliant, they either conform to the structWSF APIs or are themselves structWSF installations. In these cases, direct data exchange and access, subject to permission rights, occur at the dataset level (not shown).
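The hub-and-converter design can be sketched as follows. This is not structWSF code: the registry, the `converter` decorator, the two converters and the `predicate_counts` tool are hypothetical names chosen for illustration, and plain tuples stand in for the canonical RDF model. The sketch assumes each external record carries an `id` field.

```python
import csv
import io
import json
from collections import Counter

CONVERTERS = {}  # format name -> function returning canonical triples


def converter(fmt):
    """Register a converter for one external ('naive') format."""
    def register(func):
        CONVERTERS[fmt] = func
        return func
    return register


@converter("json")
def from_json(text):
    """Convert a JSON array of records into triples."""
    triples = []
    for rec in json.loads(text):
        subject = rec["id"]
        triples += [(subject, k, v) for k, v in rec.items() if k != "id"]
    return triples


@converter("csv")
def from_csv(text):
    """Convert CSV rows (with a header line) into triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = row["id"]
        triples += [(subject, k, v) for k, v in row.items() if k != "id"]
    return triples


def to_canonical(fmt, text):
    """Single entry point: every source ends up in the same triple form."""
    return CONVERTERS[fmt](text)


def predicate_counts(triples):
    """A 'generic tool': written once, against the canonical form only."""
    return Counter(p for _, p, _ in triples)


hub = to_canonical("json", '[{"id": "a1", "name": "Ann", "city": "Oslo"}]')
hub += to_canonical("csv", "id,name\nb2,Bob\n")
counts = predicate_counts(hub)
```

Supporting a new external format means writing one more converter; the generic tool is untouched.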

The Naïve Part of the Spectrum

Converters are themselves bona fide Web services at the structWSF level. (Only a few are presently included in the alpha release.) While some may be one-off converters (sometimes off-the-shelf RDFizers), and often devoted to large volume external data sources, it is also helpful to emphasize one or more “standard” naïve external formats. A “standard” external format allows for a more sophisticated converter and enables specific tools to be more easily justified around the standard naïve format.

As noted above, this “standard” is often JSON or a derivative of JSON. But, just as readily, the common ‘naïve’ format could be SQL from relational databases or another format common to the community at hand. In many ways, because the emphasis of data exchange is on the ABox and instance records and assertions (and attribute extensions), the actual format and serialization is pretty much immaterial.

Emphasizing one or a few naïve external formats allows more tools and services to be cost-effectively developed for those formats. And, even though the format(s) chosen for this external standard may lack the expressiveness of RDF (and, ultimately, OWL), because the burden is principally related to data exchange, this layer can be readily optimized for the deployment at hand.

Besides import converters it is also important to have export services for the more broadly used naïve external formats. In fact, some structWSF services can be devoted to data cleanup or attribute (property) or object reconciliation (including disambiguation as a possibility). In this manner, structWSF installations could also improve the authority and trustworthiness of standard data in the wild.

Another common service for this naïve data is to give it unique URI identifiers and to make it Web-accessible, thus turning it into linked data.
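A minimal sketch of the identifier half of that service, in Python, is shown below. The base URI, the `mint_uri` function and the sample record are hypothetical, and actually serving the resulting URIs over HTTP is outside the scope of the sketch.

```python
import re
from urllib.parse import quote

BASE_URI = "http://example.org/datasets/"  # placeholder base, not a real service


def mint_uri(dataset, local_id, base=BASE_URI):
    """Build a stable, Web-safe URI from a dataset name and a local record key."""
    # Normalize case and whitespace so the same key always yields the same URI.
    slug = re.sub(r"\s+", "-", str(local_id).strip().lower())
    return f"{base}{quote(dataset)}/{quote(slug)}"


records = [{"key": "Sweet Tools", "kind": "listing"}]
# Attach a minted URI to each naive record, leaving its other fields intact.
linked = [{"uri": mint_uri("tools", r["key"]), **r} for r in records]
```

Deriving the URI deterministically from the local key means repeated imports of the same source assign the same identifier.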

The RDF Canonical Data Model

Such generic services are possible because the “highest common denominator” for the system is the canonical RDF model. Because it is the consistent basis for tools and services, once a converter is available and the external information schema is mapped to the internal structure, all existing tools and services are available for re-use. Moreover, this system and its datasets are now ready for sharing with other structWSF instances, within the enterprise or beyond.

Thus, we begin to see a network of canonical “hubs” in a sea of heterogeneity, the interoperation of which is facilitated by a structWSF framework at every network node. This design is discussed more in the next part of this series.

Some, such as Sowa noted above, would prefer a grounding in common logic (CL) as opposed to RDF. Our choice to use RDF is based on the simplicity and understandability of the data model, plus the richness of languages and standards from the W3C that surround the framework.

Even here, however, the RDF basis of structWSF need not be the final word. Because of a keen intent to keep all designs and ontologies used by structWSF firmly grounded in description logics, it is possible for the structWSF basis to be converted to other languages and frameworks such as CL that can be expressed in DL.

Bringing it Back to Data Federation

Data mixing (or, preferably, data federation) has at its heart the premise of heterogeneous and distributed data sources. It implicitly acknowledges differences in syntax, semantics and serializations.

The design and architecture of structWSF is similarly premised. While each of us may prefer one model or one format over others, we must interoperate in the real world. And that world, for many understandable and immutable reasons, will retain its diversity. Accepting this reality is a first step to adaptive design.

So, we control what we can control, and we adapt to what else exists. We have chosen RDF as the canonical data model that we can control and have embedded it in a Web services framework that is Web-based and scalable; in other words, a fully compliant Web-oriented architecture. These are the conceptual foundations of structWSF.

To be sure, structWSF in its current alpha release is quite raw in many areas and incomplete in others. But we will continue to work on it — and invite your participation to do the same — such that it can fulfill its destiny as a data federation framework for the Web.


[1] I first wrote about this while at BrightPlanet; a page is still up on that Web site with the text above. I have recast this material in various ways since.
[2] I have previously written on the “40 sources” of data heterogeneity. See here, for example.
[3] See http://ontolog.cim3.net/forum/ontolog-forum/2009-06/msg00210.html and continue to follow the noted thread.
[4] See the thread, ‘.htaccess a major bottleneck to Semantic Web adoption,’ at http://lists.w3.org/Archives/Public/public-lod/2009Jun/0341.html and continue to follow this thread.
[5] See http://microformats.org/discuss/mail/microformats-discuss/2009-June/012985.html and continue to follow the ‘mixing vocabularies’ thread.
[6] This is our working definition of the ABox and TBox in specific reference to description logics:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[7] For functionality, download, documentation or other direct materials on structWSF, please see OpenStructs.org and its related resources. There is also a Drupal instantiation of the system called conStruct, also available for download.

Posted by AI3's author, Mike Bergman

Posted on June 30, 2009 at 1:22 pm in Adaptive Innovation, Description Logics, Linked Data, Ontologies, Open Source, Semantic Web Tools, Structured Dynamics, Structured Web, Web-oriented Architecture | Comments (2)
The URI link reference to this post is: http://www.mkbergman.com/496/structwsf-a-framework-for-data-mixing/
Date:   April 8, 2009

A 10th Birthday Salute to RDF’s Role in Powering Data Interoperability

There has been much welcomed visibility for the semantic Web and linked data of late. Many wonder why it has not happened earlier; and some observe progress has still been too slow. But what is often overlooked is the foundational role of RDF — the Resource Description Framework.

From my own perspective focused on the issues of data interoperability and data federation, RDF is the single most important factor in today’s advances. Sure, there have been other models and other formulations, but I think we now see the Goldilocks “just right” combination of expressiveness and simplicity to power the foreseeable future of data interoperability.

So, on this 10th anniversary of the birth of RDF [1], I’d like to re-visit and update some much dated discussions regarding the advantages of RDF, and more directly address some of the mis-perceptions and myths that have grown up around this most useful framework.

By request, this article is now available as a PDF download.

A Simple Intro to RDF

RDF is a data model that is expressed as simple subject-predicate-object “triples”. That sounds fancy, but just substitute verb for predicate and noun for subject and object. In other words: Dick sees Jane; or, the ball is round. It may sound like a kindergarten reader, but it is how data can be easily represented and built up into more complex structures and stories.

A triple is also known as a “statement” and is the basic “fact” or asserted unit of knowledge in RDF. Multiple statements get combined together by matching the subjects or objects as “nodes” to one another (the predicates act as connectors or “edges”). As these node-edge-node triple statements get aggregated, a network structure emerges, known as the RDF graph.
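As a minimal sketch of the node-edge-node idea, the following Python holds statements as plain tuples and indexes them into a graph. The names `triples` and `build_graph` and the sample data are illustrative assumptions; the short strings stand in for real IRIs.

```python
from collections import defaultdict

# Each statement is a (subject, predicate, object) tuple.
# The short strings are made-up stand-ins for IRIs.
triples = [
    ("Dick", "sees", "Jane"),
    ("Jane", "holds", "ball"),
    ("ball", "hasShape", "round"),
]

def build_graph(statements):
    """Index statements by subject: node -> list of (edge, node)."""
    graph = defaultdict(list)
    for s, p, o in statements:
        graph[s].append((p, o))
    return graph

graph = build_graph(triples)
# "Jane" is the object of one statement and the subject of another,
# so the two statements join at that node.
print(graph["Dick"])  # [('sees', 'Jane')]
print(graph["Jane"])  # [('holds', 'ball')]
```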

The referenced “resources” in RDF triples have unique identifiers, IRIs, that are Web-compatible and Web-scalable. These identifiers can point to precise definitions of predicates or refer to specific concepts or objects, leading to less ambiguity and clearer meaning or semantics.

In my own company’s approach to RDF, basic instance data is simply represented as attribute-value pairs where the subject is the instance itself, the predicate is the attribute, and the object is the value. Such instance records are also known as the ABox. The structural relationships within RDF are defined in ontologies, also known as the TBox, which are basically equivalent to a schema in the relational data realm.
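The attribute-value view of instance data can be sketched in a few lines of Python. The function `record_to_triples`, the `ex:` prefix and the sample record are hypothetical, chosen only to show how one record fans out into triples that share a subject.

```python
def record_to_triples(subject, record):
    """Turn an attribute-value record into (subject, predicate, object) triples."""
    out = []
    for attribute, value in record.items():
        # A multi-valued attribute becomes one triple per value.
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            out.append((subject, attribute, v))
    return out

# Hypothetical instance record (ABox data).
record = {
    "name": "Jane Doe",
    "city": "Iowa City",
    "knows": ["ex:dick", "ex:spot"],
}
abox = record_to_triples("ex:jane", record)
for triple in abox:
    print(triple)
```

The instance itself is the subject of every triple, each attribute is a predicate, and each value is an object.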

RDF triples can be applied equally to all structured, semi-structured and unstructured content. By defining new types and predicates, it is possible to create more expressive vocabularies within RDF. This expressiveness enables RDF to define controlled vocabularies with exact semantics. These features make RDF a powerful data model and language for data federation and interoperability across disparate datasets.

There are many excellent introductions or tutorials to RDF; a recommended sampling is shown in the endnotes [2].

Is RDF a Framework, Data Model or Vocabulary?

Well, the answer to the rhetorical question is, all three!

The RDF data model provides an abstract, conceptual framework for defining and using metadata and metadata vocabularies. See: We were able to use all three concepts in a single sentence!

The RDF model draws on well-established principles from various data representation communities. RDF properties may be thought of as attributes of resources and in this sense correspond to traditional attribute-value pairs. RDF properties also represent relationships between resources and an RDF model can therefore resemble an entity-relationship diagram. . . . In object-oriented design terminology, resources correspond to objects and properties correspond to instance variables. [1]

But, actually, because RDF is simultaneously a framework, data model and basis for building more complex vocabularies, it is both simple and complex at the same time.

It is first perhaps best to understand basic RDF as a data model of triples with very few (or unconstrained) semantics [3]. In its base form, it has no range or domain constraints; has no existence or cardinality constraints; and lacks transitive, inverse or symmetrical properties (or predicates) [4]. As such, basic RDF has limited reasoning support. It is, however, quite useful in describing static things or basic facts.

In this regard, RDF in its base state is nearly adequate for describing the simple instances and data records of the world, what is called the ABox in description logics.

RDFS (RDF Schema) is the next layer in the RDF stack designed to overcome some of these baseline limitations. RDFS introduces new predicates and classes that bound these semantics. Importantly, RDFS establishes the basic constructs necessary to create new vocabularies, principally through adding the class and subClass declarations and adding domain and range to properties (the RDF term for predicates). Many useful vocabularies have been created with RDFS and it is possible to apply limited reasoning and inference support against them.
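To make the "limited reasoning and inference" concrete, here is a small sketch that applies four of the RDFS entailment patterns (subclass transitivity, type inheritance, domain and range) until nothing new is derived. It is a toy over tuples, not a conformant RDFS reasoner; `rdfs_closure` and the `ex:` terms are assumed names.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"

def rdfs_closure(triples):
    """Repeatedly apply a few RDFS rules until a fixed point is reached."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            if p == SUBCLASS:
                for s2, p2, o2 in facts:
                    if p2 == SUBCLASS and s2 == o:
                        new.add((s, SUBCLASS, o2))   # subclass transitivity
                    if p2 == TYPE and o2 == s:
                        new.add((s2, TYPE, o))       # members inherit the superclass
            elif p == DOMAIN:
                # Any subject using property s is an instance of class o.
                new.update((s2, TYPE, o) for s2, p2, _ in facts if p2 == s)
            elif p == RANGE:
                # Any object of property s is an instance of class o.
                new.update((o2, TYPE, o) for _, p2, o2 in facts if p2 == s)
        if new <= facts:
            return facts
        facts |= new

data = [
    ("ex:Dog", SUBCLASS, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS, "ex:Animal"),
    ("ex:owns", DOMAIN, "ex:Person"),
    ("ex:owns", RANGE, "ex:Animal"),
    ("ex:spot", TYPE, "ex:Dog"),
    ("ex:jane", "ex:owns", "ex:spot"),
]
inferred = rdfs_closure(data)
print(("ex:spot", TYPE, "ex:Animal") in inferred)  # True
print(("ex:jane", TYPE, "ex:Person") in inferred)  # True
```

Nothing in the input says that Jane is a person or that Spot is an animal; both follow from the schema statements alone.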

The next layer in the RDF stack is OWL, the Web Ontology Language. It, too, is based on RDF. The first versions of OWL were themselves layered from OWL Lite to OWL DL to OWL Full. OWL Lite and OWL DL are both decidable through the first-order logic basis of description logics (the basis for the acronym in OWL DL). OWL Full is not decidable, but provides an OWL counterpart to fragmented RDF and RDFS statements that are desirable in the aggregate, with reasoning applied where possible.

OWL provides sufficient expressive richness to be able to describe the relationships and structure of entire world views, or the so-called terminological (TBox) construct in description logics. Thus, we see that the complete structural spectrum of description logics can be satisfied with RDF and its schematic progeny, with a bit of an escape hatch for combining poorly defined or structural pieces by using OWL Full [5].

However, RDF is NOT a particular serialization. Though XML was the original specified serialization and still is the defined RDF MIME type (application/rdf+xml; other serializations take the form text/turtle or text/n3 or similar), it is not necessary to either write or transmit RDF in the XML syntax.
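The separation of model from serialization can be illustrated by writing the same abstract triples out in two different surface forms. The sketch below emits N-Triples-style lines and an ad hoc JSON dump; the `http://example.org/` namespace, the tagged-tuple representation and the JSON layout are assumptions for illustration (the JSON here is not a standard RDF serialization).

```python
import json

EX = "http://example.org/"  # placeholder namespace, assumed for illustration

# Abstract triples: each term is tagged as an IRI or a plain literal.
triples = [
    (("iri", EX + "jane"), ("iri", EX + "name"), ("lit", 'Jane "JD" Doe')),
    (("iri", EX + "jane"), ("iri", EX + "knows"), ("iri", EX + "dick")),
]

def nt_term(term):
    """Format one term N-Triples style: IRIs in <>, literals quoted and escaped."""
    kind, value = term
    if kind == "iri":
        return "<" + value + ">"
    escaped = (value.replace("\\", "\\\\").replace('"', '\\"')
                    .replace("\n", "\\n").replace("\r", "\\r"))
    return '"' + escaped + '"'

def to_ntriples(statements):
    """One 'subject predicate object .' line per statement."""
    return "\n".join(
        " ".join(nt_term(t) for t in st) + " ." for st in statements
    )

as_ntriples = to_ntriples(triples)
as_json = json.dumps(triples)  # ad hoc layout, NOT a standard RDF format

print(as_ntriples)
print(as_json)
```

Both outputs carry the same two statements; only the notation differs.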

In any event, depending on its role and application, we can see that RDF is a foundation, in careful expressions based in description logics, that lends itself to a clean expression and separation of concerns. With RDF and RDFS, we have a data model and a basis for vocabularies well suited to instance data (ABox). With RDFS and OWL, we have an extended schema structure and ontologies suitable for describing and modeling the relationships in the world (TBox). Thus, RDF is a framework for modeling all forms of data, for describing that data through vocabularies, and for interoperating that data through shared conceptualizations (ontologies) and schema.

Rationale for a Canonical Data Model

In the context of data interoperability, a critical premise is that a single, canonical data model is highly desirable. Why? Simply because of 2N v N². That is, a single reference (“canon”) structure means that fewer tool variants and converters need be developed to talk to the myriad of data formats in the wild. With a canonical data model, talking to N external sources and formats requires only converters to and from the canonical form (2N). Without a canonical model, every format needs a converter to every other, and the combinatorial explosion of required format converters approaches N² [6].
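The arithmetic behind this premise is easy to check. The sketch below counts one importer and one exporter per format for the canonical approach, and one direct converter per ordered pair of distinct formats otherwise, which is N(N-1) and therefore grows on the order of N². The function names are illustrative.

```python
def converters_with_canon(n):
    """One importer plus one exporter per format: 2N."""
    return 2 * n

def converters_pairwise(n):
    """A direct converter for every ordered pair of distinct formats: N(N-1)."""
    return n * (n - 1)

for n in (3, 10, 50):
    print(n, converters_with_canon(n), converters_pairwise(n))
# 3 6 6
# 10 20 90
# 50 100 2450
```

The two approaches break even at three formats; beyond that the pairwise count pulls away quickly.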

Note, in general, such a canonical data model merely represents the agreed-upon internal representation. It need not affect data transfer formats. Indeed, in many cases, data systems employ quite different internal data models from what is used for data exchange. Many, in fact, have two or three favored flavors of data exchange such as XML, JSON or the like.

In most enterprises and organizations, the relational data model with its supporting RDBMS is the canonical one. In some notable Web enterprises (Google, for example), the exact details of the internal canonical data model are hidden from view, with APIs and data exchange standards such as GData being the only portions visible to outside consumers.

Generally speaking, a canonical, internal data standard should meet a few criteria:

  • Be expressive enough to capture the structure and semantics of any contributing dataset
  • Have a schema itself which is extensible
  • Be performant
  • Have a model to which it is relatively easy to develop converters for different formats and models
  • Have published and proven standards, and
  • Have sufficient acceptance so as to have many existing tools and documentation.

Other desired characteristics might be for the model and many of its tools to be free and open source, suitable to much analytic work, efficient in storage, and other factors.

Though the relational data model is numerically the most prevalent one in use, it has fallen out of favor for data federation purposes. This loss of favor is due, in part, to the fragile nature of relational schema, which increases maintenance costs for the data and their applications, and incompatibilities in standards and implementation.

Though still comparatively young with a smaller-than-desirable suite of tools and applications support [7], RDF is perhaps the ideal candidate for the canonical data model. To understand why, let’s now switch our discussion to the advantages of RDF.

Advantages of RDF

It is surprisingly difficult to find a consolidated listing of RDF’s advantages. The W3C, the developer of the specification, first published on this topic in the late 1990s, but it has not been updated for some time [8]. Graham Klyne has a better and more comprehensive presentation, but still one that has not been updated since 2004 [4].

I believe data interoperability to be RDF’s premier advantage, but there are many, many others.

Another advantage that is less understood is that RDF and its progeny can completely switch the development paradigm: data can now drive the application, and not the other way around. Frankly, we are just at the beginning realizations of this phase with such developments as linked data and even whole applications or application languages being written in RDF [9], but I think time will prove this advantage to be game-changing.

But, there are many perspectives that can help tease out RDF’s advantages. Some of these are discussed below, with the accompanying table attempting to list these ‘Top Sixty’ advantages in a single location.

Standard, Open and Expressive

In its ten-year history, RDF has spawned many related languages and standards. The W3C has been the shepherd for this process, and there are many entry locations on the World Wide Web Consortium’s Web site to begin exploring these options [10]. These standards extend from the RDF, RDFS and OWL vocabularies and languages noted above that give RDF its range of expressiveness, to query languages (e.g., SPARQL), transformation languages (e.g., GRDDL), rule languages (e.g., RIF), and many additional constructs and standards.

The richness of this base of standards is only now being tapped. The combination of these standards and the tools they are spawning is just beginning. And, because it is so easily serialized as XML, a further suite of tools and capabilities such as XPath or XSLT or XForms may be layered onto this base.

Moreover, one is not limited in any way to XML as a serialization. RDF itself has been serialized in a number of formats including RDF/XML, N3, RDFa, Turtle, and N-triples. Also, RDF’s simple subject-predicate-object data model can readily convert human-readable and easily authored instance records (subject) written in the style of attribute-value pairs (predicate-object). As such, RDF is an excellent conversion target for all forms of naïve data structs [11].

Data Interoperability

Indeed, it is in data exchange and interoperability that RDF really shines. Via various processors or extractors, RDF can capture and convey the metadata or information in unstructured (say, text), semi-structured (say, HTML documents) or structured sources (say, standard databases). This makes RDF almost a “universal solvent” for representing data structure.

“The semantic Web’s real selling point is URI-based data integration.”
Harry Halpin [12]

Because of this universality, there are now more than 100 off-the-shelf ‘RDFizers’ for converting various non-RDF notations (data formats and serializations) to RDF [13]. Because of its diversity of serializations and simple data model, it is also easy to create new converters. Generalized conversion languages such as GRDDL provide framework-specific conversions, such as for microformats.

Once in a common RDF representation, it is easy to incorporate new datasets or new attributes. It is also easy to aggregate disparate data sources as if they came from a single source. This enables meaningful composition of data from different applications regardless of format or serialization.

Simple RDF structures and predicates enable synonyms or aliases to also be easily mapped to the same types or concepts. This kind of semantic matching is a key capability of the semantic Web. It becomes quite easy to say that your glad is my happy, and they indeed talk about the same thing.
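A rough sketch of aggregation plus alias mapping follows. Two small datasets are merged with a set union, and a hypothetical `ALIASES` table maps "glad" onto "happy". In RDF proper such mappings would normally be asserted as triples themselves (for example with `owl:sameAs` or `owl:equivalentClass`); the lookup table here is a simplification.

```python
# Hypothetical alias table: your "glad" is my "happy".
ALIASES = {"glad": "happy"}

def canonical(term):
    """Map an aliased term to its preferred form; leave others untouched."""
    return ALIASES.get(term, term)

def merge(*datasets):
    """Union any number of triple sets, normalizing aliased terms."""
    merged = set()
    for dataset in datasets:
        for s, p, o in dataset:
            merged.add((canonical(s), canonical(p), canonical(o)))
    return merged

yours = {("ex:jane", "mood", "glad")}
mine = {("ex:dick", "mood", "happy"), ("ex:jane", "age", 30)}

merged = merge(yours, mine)
happy_people = sorted(s for s, p, o in merged if p == "mood" and o == "happy")
print(happy_people)  # ['ex:dick', 'ex:jane']
```

After the merge, a single query finds both people even though the sources used different words.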

What this mapping flexibility points to is the immense strengths of RDF in representing diverse schema, the next major advantage.

Schema Unbound

The single failure of data integration since the inception of information technologies (for more than 30 years now) has been schema rigidity or schema fragility. That is, once data relationships are set, they remain so and cannot easily be changed in conventional data management systems or in the applications that use them.

Relational database management (RDBM) systems have not helped with this challenge at all. While tremendously useful for transactions and for adding more data records (instances, or rows in a relational table schema), they are neither adaptive nor flexible.

Why is this so?

In part, it has to do with the structural “view” of the world. If everything is represented as a flat table of rows and columns, with keys to other flat structures, as soon as that representation changes, the tentacled connections can break. Such has been the fragility of the RDBMS model, and the hard-earned resistance of RDBMS administrators to schema growth or change.

Yet, change is inevitable. And thus, this is the source of frustration with virtually all extant data systems.

RDF has no such limitations. And, for those from a conventional data management perspective, this RDF flexibility can be one of the more unbelievable aspects of this data model.

As we have noted earlier, RDF is well suited and can provide a common framework to represent both instance data and the structures or schema that describe them, from basic data records to entire domains or world views. In fact, whatever schema or structure that characterizes the input data — from simple instance record layouts and attributes to complete vocabularies or ontologies — also embodies domain knowledge. This structure can be used at time of ingest as validity or consistency checks.

As a framework for data interoperability, RDF and its progeny can ingest all relations and terminology, with connections made via flexible predicates that assert the degree and nature of relatedness. There is no need for ingested records or data to be complete, nor to meet any prior agreement as to structure or schema.

Increment, Evolve, Extend, Adapt . . .

Indeed, the very fluidity of RDF and structures based on it is another key strength. Since a basic RDF model can be processed even in the absence of more detailed information, input data and basic inferences can proceed early and logically as a simple fact basis. This strength means that either data or schema may be ingested and then extended in an incremental or partial manner. Partial representations can be incorporated as readily as complete ones, and schema can extend and evolve as new structure is discovered or encountered.

This is revolutionary. RDF provides a data and schema representation framework that can evolve and adapt to what data exists and what structure is known. As new data with new attributes are discovered, or as new relationships are found or realized, these can be added to the existing model without any change whatsoever to the prior existing schema.

This very adaptability is what enables RDF to be viewed as data-driven design. We can deal with a partial and incomplete world; we can learn as we go; we can start small and simple and evolve to more understanding and structure; and we can preserve all structure and investments we have previously made.

And applications based on RDF work the same way: they do not need to process or account for information they don’t know or understand. We can easily query RDF models without being affected whatsoever by unreferenced or untyped data in the basic model.
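The claim that queries are unaffected by data they do not reference can be sketched with a simple pattern matcher. `match` is a hypothetical helper in which `None` acts as a wildcard; new attributes are appended to the store with no schema change, and the earlier query returns the same answer.

```python
def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

store = [
    ("ex:jane", "ex:name", "Jane"),
    ("ex:dick", "ex:name", "Dick"),
]
before = match(store, p="ex:name")

# New data arrives with attributes the application has never seen.
# Nothing is migrated, altered or padded with NULLs.
store.append(("ex:jane", "ex:shoeSize", 7))
store.append(("ex:spot", "ex:barksAt", "ex:dick"))

after = match(store, p="ex:name")
print(before == after)  # True
```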

By replacing the rigid relational data model with one based on RDF, we gain robustness, flexibility, universality and structural persistence over fragility.

Existing technologies such as SQL and the relational model were devised without the specific requirements of disparate, uncontrolled, large-scale integration. Though the relational model enabled us to build efficient data silos and transaction systems, RDF now enables us to finally federate them.

‘Top Sixty’ Benefits of RDF
  • A foundation based in description logics that lends itself to clean expression and separation of concerns regarding instance data (ABox) and schema structure (TBox)
  • RDF’s unique identifiers, IRIs, are Web-compatible and are Web-scalable
  • Potential use of inferencing to contextually broaden search, retrieval and analysis
  • Potential use of its structure to automatically drive applications and tools, including populating context-relevant dropdown lists and auto-completion
  • Based on open source, languages and standards
  • A comparatively complete suite of specifications including languages, schema and tools (e.g., RDF, RDF Schema, OWL, RIF, SPARQL, GRDDL, etc.)
  • A choice of a variety of serializations and notations, including RDF/XML, N3, RDFa, Turtle, N-triples, as well as possible expression in many non-RDF notations
  • Instance records in human-readable, easily authored attribute-value formats can be readily converted to the s-p-o RDF “triple” data model
  • Can capture metadata and structure from unstructured, semi-structured and structured data
  • More than 100 off-the-shelf ‘RDFizers’ exist for converting various non-RDF notations (data formats and serializations)
  • Easy and cost-effective incorporation of new datasets wherein only new attributes require a structure update; all others simply get mapped
  • Aggregate processing of disparate sources as if they came from a single source
  • A ready structure for synonym and alias matching when merging or matching datasets
  • In converting non-RDF data, the ability to bring a more formal class structure to the description of things
  • Common framework and vocabulary for representing instance data
  • Common framework and vocabulary for representing data structures and schema
  • Can describe simple data structs to complete vocabularies/ontologies to processing and inferencing rules
  • Schema can be calculated from the ingested triples; thus, can either generate schema from scratch or be used to cross-check prior schema
  • Can accept and store data with different structure in a general RDF container (e.g., all animals v a specific bird)
  • Eliminates the trade-off between good design and performance for related structure (e.g., full names v first and last names)
  • Untyped relations can still have single operations performed against them
  • More formal RDF structures (e.g., ontologies) embody domain expertise within their subject structure
  • Readily extensible with schema that are also machine readable, bringing about a high degree of automation
  • Allows data that is structured slightly differently to be stored together in the lowest common denominator of an RDF statement
  • No need for upfront schema agreement; can evolve, extend and adapt
  • Allows the schema to change independently of the data without requiring any existing data to be thrown away or padded with NULLs
  • The basic RDF model can be processed in the absence of more detailed information as a simple fact basis
  • Schema based on RDF can be extended and grown incrementally without impacting the existing datastore
  • As a corollary, development based on RDF can also be incremental, reducing the need to “design it at once” or “design it right” up front
  • RDF models and apps lend themselves to experimentation and agile development
  • Information can be gathered incrementally from multiple sources
  • Data and schema can be ingested, represented and conveyed in “partial” form
  • Structure and schema can evolve incrementally in concert with new understandings and new data
  • All prior investments in structure and schema can be maintained
  • Because of conceptual closeness to the relational data model, it is possible to represent RDF in a relational database and vice versa
  • RDF thus has the ability to take advantage of historical RDBMS and SQL query optimizations
  • Ability to create RDF “views” or wrappers over relational schema that can be queried via SPARQL
  • A common storage format based on the triple or quad; suitable for datastore hosting by relational database management systems
  • The use of untyped relations reduces the total number of relations to be handled, with operations over them only needed once
  • Relational systems can serve instance data in situ (ABox) while interoperability is provided by an RDF structural and schema layer (TBox)
  • Ability to do specialized work, such as inferencing
  • Use of a set-based semantics and queries
  • Via its SPARQL query language, easy mechanisms to drive faceted search and other browsing and viewing tools
  • Because of how RDF works it is possible to query a dataset without knowing anything about the data in advance
  • Ability to generalize selection, viewing and publishing tools driven solely from the RDF structure; as the structure changes, tools automatically reflect those changes (e.g., plug-and-play)
  • Can easily create and apply inferencing tables over RDF datastores [14]
  • The RDF graph brings all the advantages and generality of structuring information using graphs
  • A graph is, itself, a unique form of data type with unique algorithms and analytic features
  • Graphs are modular and can be readily combined or broken apart
  • Graphs can be used for scalable, parallelized information processing
  • Unique types of search and discovery can occur with RDF graphs
  • Graphs provide the ability to visualize and navigate large network structures
  • Queries are unaffected by unreferenced values in the source data
  • Emerging lingua franca of the semantic Web and ‘Web of data’
  • Strong compatibility with “linked data” based on Web access (HTTP) and IRI identifiers
  • RDF is readily adaptable to the open-world assumption (OWA)
  • Relation to the semantic Web means much global information and data can be admixed with local content
  • Across all global sources the potential for powerful data “mesh-ups” conjoining related information
  • Network effects such as shared vocabularies, shared background knowledge, collective authoring, annotating and curating, and
  • RDF is an emergent data model.

Yet, Still Kissing Cousins with the Relational Model

Despite these differences in fragility and robustness, there are in fact many logical and conceptual affinities between the relational model and the one for RDF. An excellent piece on those relations was written by Andrew Newman a bit over a year ago [15].

RDF can be modeled relationally as a single table with three columns corresponding to the subject-predicate-object triple. Conversely, a relational table can be modeled in RDF with the subject IRI derived from the primary key or a blank node; the predicate from the column identifier; and the object from the cell value. Because of these affinities, it is also possible to store RDF data models in existing relational databases. (In fact, most RDF “triple stores” are RDBM systems with a tweak, sometimes as “quad stores” where the fourth element of each tuple identifies the graph.) Moreover, these affinities also mean that RDF stored in this manner can take advantage of the historical learnings around RDBMS and SQL query optimizations.
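Both directions of this mapping can be sketched with Python's built-in `sqlite3` module. The table names, the `ex:` identifiers and the IRI patterns for rows and columns are assumptions made for illustration; real mappings are considerably richer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Direction 1: RDF held in a single three-column table.
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:jane", "ex:knows", "ex:dick"),
        ("ex:jane", "ex:name", "Jane"),
        ("ex:dick", "ex:name", "Dick"),
    ],
)
# Following an edge in the graph becomes a self-join in SQL.
rows = conn.execute(
    """
    SELECT n.object
    FROM triples AS k
    JOIN triples AS n ON n.subject = k.object
    WHERE k.subject = 'ex:jane'
      AND k.predicate = 'ex:knows'
      AND n.predicate = 'ex:name'
    """
).fetchall()
print(rows)  # [('Dick',)]

# Direction 2: a conventional table turned into triples.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Acme', 'Coralville')")
cursor = conn.execute("SELECT id, name, city FROM customer")
columns = [d[0] for d in cursor.description]

table_triples = []
for row in cursor:
    subject = f"ex:customer/{row[0]}"          # subject from the primary key
    for column, value in zip(columns[1:], row[1:]):
        if value is not None:                  # NULLs simply produce no triple
            table_triples.append((subject, f"ex:customer#{column}", value))
print(table_triples)
```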

Just as there are many RDFizers as noted above, there are also nice ways to convert relational schema to RDF automatically. OpenLink Software, for example, has its RDF “Views” system that does just that [16]. Given these overall conceptual and logical affinities the W3C is also in the process of graduating an incubator group to an official work group, RDB2RDF [17], focused on methods and specifications for mapping relational schema to RDF.

What is emerging is one vision whereby existing RDBM systems retain and serve the instance records (ABox), while RDF and its progeny provide the flexible schema scaffolding and structure over them (TBox). Architectures such as this retain prior investments, but also provide a robust migration path for interoperating across disparate data silos in a performant way.

Data-driven Applications

As developers, one of our favorite advantages of RDF is its ability to support data-driven applications. This makes even further sense when combined with a Web-oriented architecture that exposes all tools and data as RESTful Web services [18].

Two tool foundations are the RDF query language, SPARQL [19], and inferencing. SPARQL provides a generalized basis for driving reports and templated data displays, as well as standard querying. Utilizing RDF’s simple triple structure, SPARQL can also be used to query a dataset without knowing anything in advance about the data. This provides a very useful discovery mode.
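In SPARQL, the discovery mode amounts to queries such as `SELECT DISTINCT ?p WHERE { ?s ?p ?o }`, which ask what predicates exist without naming any of them. The Python below is only an analogue of that idea over tuples, not a SPARQL engine; `describe` and the sample data are assumed names.

```python
from collections import Counter

def describe(triples):
    """Summarize an unfamiliar dataset: predicates in use and classes asserted."""
    predicate_counts = Counter(p for _, p, _ in triples)
    classes = sorted({o for _, p, o in triples if p == "rdf:type"})
    return predicate_counts, classes

store = [
    ("ex:jane", "rdf:type", "ex:Person"),
    ("ex:jane", "ex:name", "Jane"),
    ("ex:dick", "rdf:type", "ex:Person"),
    ("ex:dick", "ex:name", "Dick"),
    ("ex:spot", "rdf:type", "ex:Dog"),
]
predicate_counts, classes = describe(store)
print(dict(predicate_counts))  # {'rdf:type': 3, 'ex:name': 2}
print(classes)                 # ['ex:Dog', 'ex:Person']
```

Because every statement has the same three-part shape, the summary needs no prior knowledge of the vocabulary.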

Simple inferencing can be applied to broaden and contextualize search, retrieval and analysis. Inference tables can also be created in advance and layered over existing RDF datastores [14] for speedier use and the automatic invoking of inferencing. More complicated inferencing means that RDF models can also perform as complete conceptual views of the world, or knowledge bases. Quite complicated systems are emerging in such areas as common sense (with OpenCyc) and biological systems [20], as two examples.

RDF ontologies and controlled vocabularies also have some hidden power, not yet often seen in standard applications: by virtue of their structure and label properties, we can populate context-relevant dropdown lists and auto-complete entries in user interfaces solely from the input data and structure. This ability is completely generalizable solely on the basis of the input ontology(ies).
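One way to picture this is a helper that builds dropdown options from nothing but subclass and label statements. The functions `dropdown_options` and `autocomplete` and the tiny pet ontology are hypothetical; a real interface would add ordering, language tags and deeper hierarchy handling.

```python
def dropdown_options(triples, parent_class):
    """Labels of the direct subclasses of parent_class, sorted for display."""
    labels = {s: o for s, p, o in triples if p == "rdfs:label"}
    children = [
        s for s, p, o in triples
        if p == "rdfs:subClassOf" and o == parent_class
    ]
    # Fall back to the identifier when a class has no label.
    return sorted(labels.get(child, child) for child in children)

def autocomplete(options, prefix):
    """Case-insensitive prefix match over the available options."""
    return [opt for opt in options if opt.lower().startswith(prefix.lower())]

ontology = [
    ("ex:Dog", "rdfs:subClassOf", "ex:Pet"),
    ("ex:Cat", "rdfs:subClassOf", "ex:Pet"),
    ("ex:Canary", "rdfs:subClassOf", "ex:Pet"),
    ("ex:Dog", "rdfs:label", "Dog"),
    ("ex:Cat", "rdfs:label", "Cat"),
    ("ex:Canary", "rdfs:label", "Canary"),
]
options = dropdown_options(ontology, "ex:Pet")
suggestions = autocomplete(options, "ca")
print(options)      # ['Canary', 'Cat', 'Dog']
print(suggestions)  # ['Canary', 'Cat']
```

Adding a new subclass to the ontology changes the list with no change to the interface code.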

A Graph Representation

As the intro noted, when RDF triples get combined, a graph structure emerges. (Actually, it can most formally be described as a directed graph.) A graph structure has many advantages. While we are seeing much starting to emerge in the graph analysis of social networks, we could also fairly argue that we are still at the early stages of plumbing the unique features of graph (“network”) structures.

Graphs are modular and can be both readily combined and broken apart. From a computational standpoint, this can lend itself to parallelized information processing (and, therefore, scalability). With specific reference to RDF it also means that graph extractions are themselves valid RDF models.

Graph algorithms are a significant field of interest within mathematics, computer science and the social sciences. Via approaches such as network theory or scale-free networks, topics such as relatedness, centrality, importance, influence, “hubs” and “domains”, link analysis, spread, diffusion and other dynamics can be analyzed and modeled.

Graphs also have some unique aspects in search and pattern matching. Besides options like finding paths between two nodes, depth-first search, breadth-first search, or finding shortest paths, emerging graph and pattern-matching approaches may offer entirely new paradigms for search.
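As one example of the conventional options mentioned here, a breadth-first search over a set of triples finds a shortest directed path between two nodes. `shortest_path` and the sample data are illustrative assumptions; the returned path alternates nodes and the predicates that connect them.

```python
from collections import deque

def shortest_path(triples, start, goal):
    """Breadth-first search along directed edges; returns None if unreachable."""
    neighbors = {}
    for s, p, o in triples:
        neighbors.setdefault(s, []).append((p, o))

    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for predicate, nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [predicate, nxt]))
    return None

triples = [
    ("ex:jane", "knows", "ex:dick"),
    ("ex:dick", "worksFor", "ex:acme"),
    ("ex:acme", "locatedIn", "ex:iowa"),
    ("ex:jane", "bornIn", "ex:ohio"),
]
path = shortest_path(triples, "ex:jane", "ex:iowa")
print(path)
# ['ex:jane', 'knows', 'ex:dick', 'worksFor', 'ex:acme', 'locatedIn', 'ex:iowa']
```

Because the graph is directed, the reverse query from `ex:iowa` to `ex:jane` finds no path in this data.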

Graphs also provide new approaches for visualization and navigation, useful for both seeing relationships and framing information from the local to global contexts. The interconnectedness of the graph allows data to be explored via contextual facets, which is revolutionizing data understanding in a way similar to how the basic hyperlink between documents on the Web changed the contours of our information spaces [21].

Many would argue (as do I) that graphs are the most “natural” data structure for capturing the relationships of the real world. If so, we should continue to see new algorithms and approaches emerge based on graphs to help us better understand our information. And RDF is a natural data model for such purposes.

Open World Applications and the Semantic Web

Ultimately, data interoperability implies a global context. The design of RDF began from this perspective with the semantic Web.

This perspective is firstly grounded in the open-world assumption: that is, the information at hand is understood to be incomplete and not self-contained. Missing values are to be expected and do not falsify what is there. A corollary assumption is there is always more information that can be added to the system, and the design should not only accommodate, but promote, that fact.
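The difference from a closed-world system can be shown with two tiny lookup functions. Under the closed-world reading an unstated fact is false; under the open-world reading it is merely unknown, represented here by `None`. The function names and data are assumptions, and the sketch ignores entailment and explicit negation.

```python
def ask_closed(facts, s, p, o):
    """Closed world: anything not stated is taken to be false."""
    return (s, p, o) in facts

def ask_open(facts, s, p, o):
    """Open world: a stated fact is True; an unstated one is unknown (None)."""
    return True if (s, p, o) in facts else None

facts = {("ex:jane", "ex:knows", "ex:dick")}

print(ask_closed(facts, "ex:jane", "ex:knows", "ex:spot"))  # False
print(ask_open(facts, "ex:jane", "ex:knows", "ex:spot"))    # None

# More information can always arrive later without contradicting anything.
facts.add(("ex:jane", "ex:knows", "ex:spot"))
print(ask_open(facts, "ex:jane", "ex:knows", "ex:spot"))    # True
```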

As the lingua franca for the semantic Web, using RDF means that many new data, structures and vocabularies now become available to you. So, not only can RDF work to interoperate your own data, but it can link in useful, external data and schema as well.

Indeed, the concept of linked data now becomes prominent whereby RDF data with unique IRIs as their universal identifiers are exposed explicitly to aid discovery and interlinking. Whether internal data is exposed in the linked data manner or not, this external data can now be readily incorporated into local contexts. The Linking Open Data movement that is promoting this pattern has become highly successful, with billions of useful RDF statements now available for use and consumption [10].

The semantic Web and RDF are enabling the data federation scope to extend beyond organizational boundaries to embrace (soon) virtually all public information. That means that, say, local customer records can now be supplemented with external information about specific customers or products. We are really just at the nascent stages of such data “mesh-ups”, with many unforeseen benefits (and challenges, too, such as privacy, identity and ownership) likely to emerge.

At Web scales, we will see network effects also emerge in areas such as shared vocabularies, shared background knowledge, and collective authoring, annotating and curating. To be sure, the traditional work of trade associations and standards bodies will continue, but likely now in much more operable ways.

Myths of RDF

Throughout the years, a number of myths have grown up around RDF. Some, unfortunately, were based on the legacy of how RDF was first introduced and described. Other myths arise from incomplete understanding of RDF’s multiple roles as a framework, data model, and basis for vocabularies and conceptual descriptions of the world.

The accompanying table lists the “Top Ten” myths I have found to date. I welcome other pet submissions. Perhaps soon we can get to the point of a clearer understanding of RDF.

‘Top Ten’ Myths of RDF
  1. RDF is equivalent to XML — perhaps the biggest PR error in RDF’s first introduction was to tie RDF so closely with XML. RDF is a data model as described herein that has no dependence on XML and exists in abstract form separate from it
  2. RDF is written or expressed in XML — in a related way, RDF can be serialized (expressed) in many forms other than XML
  3. RDF and OWL are independent — OWL is a language grounded in RDF and a natural extension of the RDF “stack”; OWL is at the full expressiveness end of the spectrum [22, 23]
  4. RDF is a serialization — no; XML is a serialization, RDF is a data model, framework and basis for constructing vocabularies
  5. Basic RDF has no semantics — though limited and purposefully free, basic RDF in fact has extremely well considered semantics; an essential document for any practitioner is [3]
  6. RDF is too complex — it depends, right? At the level of the basic triple, RDF is extremely simple and is the best place to start learning about RDF
  7. RDF is too simple — it depends, right? At the level of OWL ontologies, RDF can capture virtually any relationship and aspect of the world; see [5] for a great start
  8. RDF is useful for “large” datasets only — the real purpose of RDF is data interoperability, which is needed any time two or more datasets are combined, regardless of size
  9. (Conversely and paradoxically), RDF is not scalable — this premise is still being tested, but we now have very large-scale experience with the government and in the Billion Triples Challenge
  10. RDF is not performant — daily we keep learning more about optimizations, query and re-write strategies, and the like. Orri Erling [24] does some of the best work around in this area and writes lucid explanations on his blog. Moreover, RDF systems are easily embedded in WOA architectures, which prove themselves daily at global Web scales.

Conclusion

Emergence is the way complex systems arise out of a multiplicity of relatively simple interactions, exhibiting new and unforeseen properties in the process. RDF is an emergent model. It begins as simple “fact” statements of triples that may then be combined and expanded into ever-more complex structures and stories.

As an internal, canonical data model, RDF has advantages over any other approach. We can represent, describe, combine, extend and adapt data and their organizational schema flexibly and at will. We can explore and analyze in ways not easily available with other models.

And, importantly, we can do all of this without the need to change what already exists. We can augment our existing relational data stores, and transfer and represent our current information as we always have.

We can truly call RDF a disruptive data model or framework. But it is disruptive without disturbing what already exists in the slightest. And that is a most remarkable achievement.


[1] Actually, it is just a few weeks past. The first RDF specification was published as: Ora Lassila and Ralph R. Swick, eds., 1999. “Resource Description Framework (RDF) Model and Syntax Specification,” W3C Recommendation, 22 February 1999; see http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/. Of course, RDF had been in development under various names for some time. To my knowledge, the first public explanation specific to the RDF name was by Tim Bray, “RDF and Metadata,” on XML.com, June 09, 1998; see http://www.xml.com/pub/a/98/06/rdf.html. I’m measuring RDF’s birthday in relation to it being published as an official standard (recommendation) per the first reference.
[2] I first recommend an older introduction by Ian Davis, http://research.talis.com/2005/rdf-intro/. There is a more recent, shorter version by Davis and Tom Heath, The 30 Minute Guide to RDF and Linked Data, at http://www.slideshare.net/iandavis/30-minute-guide-to-rdf-and-linked-data. Also, Joshua Tauberer’s write-up at http://www.rdfabout.com/intro/? is quite excellent.
[3] Patrick Hayes, 2004. “RDF Semantics,” a W3C Recommendation, February 2004. See http://www.w3.org/TR/rdf-mt/.
[4] Graham Klyne, 2004. “Semantic Web and RDF,” on the Nine by Nine Web site (http://www.ninebynine.net/), 4 May 2004; see http://www.ninebynine.org/Presentations/20040505-KelvinInsitute.pdf.
[5] The soon-to-be-released recommendation of OWL 2 is best introduced through the recent: OWL 2 Working Group, eds., 2009. “OWL 2 Web Ontology Language: Document Overview,” W3C Working Draft, 27 March 2009; see http://www.w3.org/TR/owl2-overview/.
[6] The canonical data model is especially prevalent in enterprise application integration. An interesting animated visualization of the canonical data model may be found at: http://soa-eda.blogspot.com/2008/03/canonical-data-model-visualized.html.
[7] Still, my own Sweet Tools listing of RDF and -related tools now contains nearly 800 items.
[8] The RDF Advantages Page; see http://www.w3.org/RDF/advantages.html.
[9] See, for example, Neno, the Semantic Web Programming Environment, at: http://neno.lanl.gov/Home.html; and Ripple, at http://code.google.com/p/ripple/. The developers of these systems are now combining efforts.
[10] Here are some useful starting points for RDF at the World Wide Web Consortium (W3C): Begin at the W3C’s ESW wiki. The Linking Open Data community maintains its own people and projects listings as well. Current topics are discussed on the W3C’s semantic Web mailing lists. The W3C maintains a good general listing of semantic Web tools, with specific listings of RDF triplestores.
[11] Michael Bergman, 2009. “‘Structs’: Naïve Data Formats and the ABox,” on the AI3 blog, January 22, 2009; see http://www.mkbergman.com/?p=471. And, Ibid, 2009. “Making Linked Data Reasonable using Description Logics, Part 4,” on the AI3 blog, February 23, 2009; see http://www.mkbergman.com/?p=478.
[12] Harry Halpin, video interview with Marcos Caceres, “GRDDL, Bridging the Interwebs?,” August 4, 2008, on StandardsSuck.org. See http://standardssuck.org/grddl-bridging-the-interwebs.
[13] See, for example, these Virtuoso RDF cartridges (http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtSponger) or listing of RDFizers (http://simile.mit.edu/wiki/RDFizers).
[14] OpenLink Software, 2009. “17.6. Inference Rules & Reasoning,” part of the online Virtuoso User Manual; see: http://docs.openlinksw.com/virtuoso/rdfsparqlrule.html.
[15] Andrew Newman, 2007. “A Relational View of the Semantic Web,” published on XML.com, March 14, 2007; see http://www.xml.com/pub/a/2007/03/14/a-relational-view-of-the-semantic-web.html.
[16] OpenLink Software, 2009. “17.4.3. RDF Views over RDBMS Data Source,” part of the online Virtuoso User Manual; see: http://docs.openlinksw.com/virtuoso/rdfsparqlintegrationmiddleware.html#rdfviews. Also see OpenLink Software, 2007. Virtuoso RDF Views — Getting Started Guide, v1.1, June 2007; see http://virtuoso.openlinksw.com/Whitepapers/pdf/Virtuoso_SQL_to_RDF_Mapping.pdf.
[17] W3C, 2009. RDB2RDF Working Group Charter, revised February 24, 2009; see http://www.w3.org/2005/Incubator/rdb2rdf/WG-draft-charter/.
[18] See further my various blog posts on Web-oriented architecture (WOA).
[19] Especially recommended as an introductory tutorial is: Lee Feigenbaum, 2008. “SPARQL By Example: A Tutorial,” Sept. 17, 2008; see http://www.cambridgesemantics.com/2008/09/sparql-by-example.
[20] Many disciplines are embracing RDF. But, in biology, some exemplar projects are the Bio2RDF genomics project; the Linking Open Drug Data (LODD) initiative, which is a sub-project of the W3C’s broader Health Care and Life Sciences Interest Group (HCLSIG); the Neurocommons project; and the RDF branches of the Open Biomedical Ontologies (OBO) project and foundry.
[21] A very nice visualization of graph-driven structures in relation to information discovery and navigation is provided by Rama Hoetzlein, 2007. Quanta: The Organization of Human Knowledge: Systems for Interdisciplinary Research, a Master’s Thesis, University of California, Santa Barbara, June 2007; see http://www.rchoetzlein.com/quanta/index.htm.
[22] The original phrasing of this Myth used the term “distinct”, which Ted Thibodeau Jr rightly questioned. This myth goes to the heart of what I think is a false separation of the RDF and OWL “camps”. As the intro noted, I see a natural progression from RDF –> RDFS –> OWL, with the transition representing more precise semantics and expressiveness. Describing simple things simply, especially for linked data as mostly practiced, works well in RDF and RDFS. Once world views and conceptual schema are desired for inter-relating these things, RDFS and OWL become the better option. OWL Full (including OWL 2, see [23]) is fully grounded in RDF semantics. However, since OWL Full is not decidable, a subset of that, OWL DL, is still expressible as RDF but now consistent with description logics. This approach can provide more inferencing and reasoning power, at the slight cost of greater care in the semantics used and relationships asserted. In the end, the “distinction” between RDF and OWL is really a difference in use cases and intentions, imo.
[23] Michael Schneider, ed., 2009. OWL 2 Web Ontology Language RDF-Based Semantics, W3C Working Draft 21 April 2009; see http://www.w3.org/TR/owl2-rdf-based-semantics/.
[24] See Orri Erling’s Weblog at: http://www.openlinksw.com/weblogs/oerling/.

Posted by AI3's author, Mike Bergman

Posted on April 8, 2009 at 10:02 am in Adaptive Information, Description Logics, Linked Data, Semantic Web, Structured Web, Web-oriented Architecture | Comments (7)
The URI link reference to this post is: http://www.mkbergman.com/483/advantages-and-myths-of-rdf/
Date:   March 27, 2009

The Recent ‘The Unreasonable Effectiveness of Data’ Provides Important Hints

To even the most casual Web searcher, it must now be evident that Google is constantly introducing new structure into its search results. This past week three world-class computer scientists, all now research directors or scientists at Google, Alon Halevy, Peter Norvig and Fernando Pereira, published an opinion piece in the March/April 2009 issue of IEEE Intelligent Systems titled, ‘The Unreasonable Effectiveness of Data.’ It provides important framing and hints for what next may emerge in semantics from the Google search engine.

I had earlier covered Halevy and Google’s work on the deep Web. In this new piece, the authors describe the use of simple models working on very large amounts of data as a means to trump fancier and more complicated algorithms.

“Unfortunately, the fact that the word ‘semantic’ appears in both ‘Semantic Web’ and ‘semantic interpretation’ means that the two problems have often been conflated, causing needless and endless consternation and confusion. The ‘semantics’ in Semantic Web services is embodied in the code that implements those services in accordance with the specifications expressed by the relevant ontologies and attached informal documentation.”

Some of the research they cite is related to WebTables [1] and similar efforts to extract structure from Web-scale data. The authors describe the use of such systems to create ‘schemata’ of attributes related to various types of instance records; in essence, figuring out the structure of ABoxes [2] for leading instance types such as companies or automobiles [3].

They generalize these observations, which they call the semantic interpretation problem and contrast with the Semantic Web, as being amenable to a kind of simple, brute-force, Web-scale analysis: “Relying on overt statistics of words and word co-occurrences has the further advantage that we can estimate models in an amount of time proportional to available data and can often parallelize them easily. So, learning from the Web becomes naturally scalable.”

Google had earlier posted their 1 terabyte database of n-grams, and I tend to agree that such large-scale incidence mining can lead to tremendous insights and advantages. The authors also helpfully point out that certain scale thresholds occur for doing such analysis, such that researchers need not have access to indexes at the scale of Google’s to do meaningful work or to make meaningful advances. (Good news for the rest of us!)

As the authors challenge:

  1. Choose a representation that can use unsupervised learning on unlabeled data
  2. Represent the data with a non-parametric model, and
  3. Trust that the important concepts will emerge from this analysis because human language has already evolved words for them.

My very strong suspicion is that we will see — and quickly — much more structured data for instance types (the ‘ABox’) rapidly emerge from Google in the coming weeks. They have the insights and approaches down, and clearly they have the data to drive the analysis! I also suspect many of these structured additions will just simply show up on the results listings to little fanfare.

The structured Web is growing all around us like stalagmites in a cave!


[1] Michael J. Cafarella, Alon Y. Halevy, Daisy Zhe Wang, Eugene Wu and Yang Zhang, 2008. “WebTables: Exploring the Power of Tables on the Web,” in the 34th International Conference on Very Large Databases (VLDB), Auckland, New Zealand, 2008. See http://web.mit.edu/y_z/www/papers/webtables-vldb08.pdf.
[2] As per our standard use:

"Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts."
[3] I very much like the authors’ use of ‘schemata’ as the way to describe the attribute structure of various instance record types for the ABox, in contrast to the more appropriate ‘ontology’ applied to the TBox.

Posted by AI3's author, Mike Bergman

Posted on March 27, 2009 at 11:08 am in Adaptive Information, Deep Web, Description Logics, Searching, Structured Web | Comments (1)
The URI link reference to this post is: http://www.mkbergman.com/481/massive-muscle-on-the-abox-at-google/
Date:   February 23, 2009

Concluding with a Simplified Instance Record Vocabulary for Linked Data ABoxes

In Part 1 of this series, I advocated the placement of linked data in an ABox construct from description logics [1] based on a separation of concerns argument. In Part 2, I reinforced that argument from the perspective of the work to be done within a knowledge base. In Part 3 we surveyed some of the key literature, finding justification for the split of the TBox from the ABox and the use of specialty RDFS and OWL dialects for work-oriented reasoning in the context of an integral logics.

We now conclude this series and try to bring these threads full circle to address what might be a vocabulary for an ABox instance record design. We’d very much like to thank Dr. Jim Pitman of the Bibliographic Knowledge Network project for having stimulated much of the thinking about the benefits and design of simple, human-authored and -readable instance records.

A Re-cap

Up until about six to eight months ago Fred Giasson and I were spending much of our thinking and design time on UMBEL, ontologies and what we now more precisely define as the TBox. Our intent all along was to get our process and thinking down pat there, and then turn ourselves to the representation of the actual entity data.

We have wanted to keep data records separate from logic and structure all along. Some clients have their own specific data records but may still want to interact with Web stuff or apply similar logic. Moreover, some client data is proprietary, some public. By organizing the data into “named entity dictionaries” we could modularize the architecture to allow swapping in and out of data appropriate to the customer or circumstance at hand.

Our initial design of this and what we share publicly has UMBEL and various standard public ontologies (FOAF, DC, SIOC, BIBO, etc) for the TBox, with Wikipedia entities and stuff from the BBC at the entity level (the ABox).

However, earlier work with another client showed us that our initial named entity structure was not sufficiently general or robust. That company’s records have complex relationships, such as affiliations for entities embedded in the same data record.

In order to improve the design, we went back to the drawing board to see if we could find guidance from the literature and other researchers as to how to “best” architect instance data in relation to the logic in the TBox (though we were not yet thinking and framing our questions vis-à-vis description logics, or DL).

This series of postings itself, and some of its predecessor articles, were motivated by probing the description logics space and the guidance it might provide to help determine performant architectures and designs.

Folks, We’re Making Linked Data Just Too Tough

For linked data to become truly successful, we need to find easier ways for data publishers to write, expose and share structured data on the Web.

As anyone who reads my blog knows, I frequently rail against poor semantics or other aspects of the linked data space that I feel are counterproductive. At the same time, I’d like to think that I am also a vocal advocate and proponent for linked data. I am indeed a fan.

To me, the fundamental precepts of RDF as a data model able to capture virtually any data structure or relationship, and the use of Web URIs as linkable identifiers for a global ‘Web of Data’, are simply foundational and game changing. Stuff like this quickens my pulse.

But look at what it takes someone today to publish linked data:

  • He must understand the terminology and standards and best practices — and actually, even amongst current practitioners, few do
  • She must assign Web identifiers (URIs) to her data objects, which means finding them and making them (gawd, I hate this word) “dereferencable”
  • He must understand the semantics of the relationships and linkages his data asserts (which, unfortunately, many don’t)
  • She must present her data in serialized subject-predicate-object “triples”, which are arcane and difficult for most to understand, and
  • They both often confuse data and instances with structure and world views.

Now, come on. This is not the recipe for success.

Simple and unbreakable and forgiving is the recipe for success.

As I noted in an earlier posting, there are many different data structures (‘structs’) for describing and conveying (transmitting) data records. Most of these are easy to understand and easy to read. We know that microformats have tried to capture a part of this space, but so, in other ways, have data serializations such as JSON. What can we learn from such formats?

Well, one thing I have learned is that many on the Web positively want to expose their data. Another thing I have learned is that there is much structured data that will not get exposed unless the hurdles to doing so are small.

Revenge of the ABox

The phrase ‘revenge of the ABox’ comes from Heiko Stoermer’s thesis [2]; it conveys well, I think, the fact that everyone wants to capture and structure “world views” via ontologies and the big picture, but many do not want to grub around at the level of individual instances and data records. As he states, “. . . the most valuable knowledge is typically the one about individuals, but research on ontology integration has traditionally concentrated on concepts and relations.”

(The perverse outcome of this is that even though linked data as practiced to date is almost 100% about instance data, the discussion rarely looks at ABox-level work or instance data integrity.)

As this series and its predecessor posts have argued, description logics (DL) is an excellent guiding framework for how to make architectural and design decisions about linked data. DL and its ABox-TBox split have meshed beautifully with our earlier intuition to split ontologies and a structural and organizational view of the world (TBox) from the instance records (ABox, or what we had been calling internally our ‘named entity dictionaries’).

As this four-part series and its predecessor pieces indicate, not only can we gain better conceptual understanding and realization of some of this semantic Web stuff by using DL, but also, perhaps, many of today’s silly or inefficient design practices may be remedied by better grounding our architectures in these logics.

One thing, for example, that has helped us much is getting away from the confusing terminology of ‘individuals’ vs. ‘instances’. Once we come to see an instance record as just that (which is why collections can play on an equal footing with individual things, for example), we need only worry about asserting the attributes of the instance. We can defer all of the logic and reasoning about individuals and members and sets and collections and classes, etc., to the TBox and just get on with capturing and conveying our instance record, as an ABox.

For this reason alone (but there are others), Structured Dynamics has now abandoned the terminology of a ‘named entity dictionary’ in favor of ‘instance dictionaries’ or ABox (either of which is understood to contain one or more instance records).

The ‘Instance Record’

An instance record is simply a means to either represent or convey the information (“attributes”) of a given instance. An instance is the thing at hand, and need not represent an individual; it could, for example, represent the entire holdings or collection of books in a given library.

An instance record may convey information about multiple instances, but each block of information for each instance is about that instance alone. Thus, for example, if the instance is a paper citation, the instance is the paper. If as attributes it asserts multiple authors, each with different institutional affiliations, those affiliations get asserted in a separate instance for each author. They are attributes of the authors, not of the paper.

In this manner it is easy to see attributes as only pertaining to a given instance. If the overall information to be conveyed discusses attributes for multiple instances, then the instance record presents in series each instance that is characterized.

The Simplicity of Key-Value Pairs

The objective is to make it easy for data owners to write, read and publish data. This means the starting format should be a human readable, easily writable means for authoring and conveying these instance records (that is, instances and their attributes and assigned values).

The simplest, naïve format (independent of syntax or serialization) is the key-value (name-value) pair. In the key-value pair, the subject is always implied. So, for me, MikeBergman, as the subject:

first_name:Mike
sex:male
citizenship:USA
town:Iowa City

Because an instance record only describes attributes for a single instance at a time, all assertions can easily be transformed into the subject-predicate-object (s-p-o) “triples” of RDF. So,

<subject:MikeBergman> <hasFirstName> <Mike>
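As a minimal sketch of that transformation, here is a short Python example that attaches the implied subject to each key-value pair. The function names and the rule for turning a key such as first_name into a predicate such as hasFirstName are assumptions made for illustration; they are not part of the design described in this post.

```python
def parse_key_values(text):
    """Parse 'key:value' lines into a list of (key, value) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        pairs.append((key.strip(), value.strip()))
    return pairs


def to_predicate(key):
    """Assumed naming rule: 'first_name' becomes 'hasFirstName'."""
    return "has" + "".join(part.capitalize() for part in key.split("_"))


def to_triples(subject, text):
    """Attach the implied subject to every key-value pair."""
    return [(subject, to_predicate(k), v) for k, v in parse_key_values(text)]


record = """
first_name:Mike
sex:male
citizenship:USA
town:Iowa City
"""

triples = to_triples("MikeBergman", record)
for s, p, o in triples:
    print(f"<{s}> <{p}> <{o}>")
```

Because the subject is supplied once and never repeated in the record, the author of the record only ever writes keys and values.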

Now, of course, in conventional linked data many of these entries need to be expressed as URIs in order to “define” the item. Our design allows for that, of course, but also allows the user to simply provide literals (that is, not identifiers, but text strings or numeric or actual values) for each item. Thus, a “new” attribute is declared simply by using it, and its value is declared just as simply.

Separate, specialized services (see below) may be (and often will need to be!) employed to look up and de-reference URIs, do datatype or data instance validation checks, evaluate identity relationships, disambiguate terms and so forth. The data supplier may choose to publish more-or-less complete “records” on their own, or they may not.

Through this design, nothing need change with regard to how linked data is being done today (other than the addition of some simple converters to accommodate the new format; see below). But, by shifting testing and validation work to external services, we can make it much easier for more data to get exposed and published. It is now time for linked data intermediaries and services to evolve in the linked data ecosystem.

In its most naïve form, this key-value pair format allows for fast and easy instance record creation with the ability to create instances and new attributes on the fly. Sure, these assertions need to be checked, but so does most data when it is asked to participate in any meaningful work.

This simple design, then, is very much in keeping with the limited roles and work associated with an ABox. Only attributes and metadata for an instance are being asserted. Conceptual relationships and specialized work that might be applied against the ABox to determine data validity or whatever is shifted to be external to the instance record, where it properly and logically belongs.

Relation to RDF

In Part 3 we discussed how fragments of the RDF and OWL languages can be used for specialized purposes within a knowledge base while keeping the overall logics of the system integral and decidable. Clearly, this instance record approach where the sole purpose is to assert attributes and values for an instance does not require any OWL. In fact, most linked data to date only brings OWL into the picture for the owl:sameAs property, the common errors of which we discussed in Part 2.

The instance record only requires a small subset of the RDF language. But it does require use of RDFS (Schema) because of the appropriate use of datatypes within the instance data record.

At the level of the TBox and the “specialized work” areas, there are other fragments of OWL, now called profiles in the soon to be released OWL 2 [3], that similarly can be applied to areas such as instance checking and validation, identity relation testing, etc., that I mentioned above. In other words, we can logically fragment RDF and OWL to do the individual parts of a complete system in order to simplify things and aid performance and computational efficiency.

The Instance Record Vocabulary

We are implementing this design internally through what we call the Instance Record Vocabulary (QName: irv). It is still quite experimental and we are testing some important aspects, some of which we describe below. As we get these nuances worked out better, we will release this vocabulary publicly for any to use and comment.

As we presently see it, the namespace languages required for the IRV vocabulary are RDF, RDFS, DCterms and XSD. The RDFS (Schema) is required because, at minimum, of the incorporation of XML Schema datatypes (XSD), which we think to be a desirable requirement for what is, after all, an instance data specification and transfer protocol. However, the actual RDF and RDFS vocabulary used would be extremely minimal, with no OWL required.

In pseudo-form, with many serializations and simple syntaxes possible, this Instance Record Vocabulary has the following properties. Note as discussed above that the <s> in s-p-o is implied. Thus, in its naïve or handwritten form, it could be expressed in pretty simple key-value pairs:

<InstanceRecord>
<Instance>
<hasLabel> <[literal]> @en
<hasAltLabel> <[literal]> @en
<hasURI> <[URI]>
<hasDescription> <[literal]> @en
<Attribute>
<hasAttribute1> <[literal with optional XSD (@en) or URI]>
<hasAttribute2> <[literal with optional XSD (@en) or URI]>
<hasAttribute3> <[literal with optional XSD (@en) or URI]>
<hasAttributeX> <[literal with optional XSD (@en) or URI]>
</Attribute>
<assertIdentity> <[literal or URI]>
<assertType> <[literal or URI]>
<hasSource> <[literal or URI]>
<hasVetting> <[literal or URI]>
</Instance>
<Instance>
. . . repeat as needed . . . 
</Instance>
</InstanceRecord>

Note that most values allow either literal or URI specifications. Some of the properties are obviously optional, others, such as hasLabel, will be required. hasURI, for example, is one case of an optional property that then may require a separate lookup service to complete it as a linked data record.
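One way to picture the required and optional properties is as a small in-memory structure. The following Python sketch is only an illustration under stated assumptions: the class and field names are hypothetical, the choice to hold assertIdentity and assertType as lists is an assumption, and the only rule taken from the text is that hasLabel is required while hasURI may be left for a later lookup.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Instance:
    # hasLabel is the one property the text says will be required.
    has_label: str
    has_alt_label: Optional[str] = None
    has_uri: Optional[str] = None  # optional; a lookup service may supply it later
    has_description: Optional[str] = None
    attributes: Dict[str, str] = field(default_factory=dict)
    assert_identity: List[str] = field(default_factory=list)  # assumed multi-valued
    assert_type: List[str] = field(default_factory=list)      # assumed multi-valued
    has_source: Optional[str] = None
    has_vetting: List[str] = field(default_factory=list)

    def __post_init__(self):
        if not self.has_label or not self.has_label.strip():
            raise ValueError("hasLabel is required")

    def needs_uri_lookup(self) -> bool:
        """True when a separate service would have to supply the URI."""
        return self.has_uri is None


@dataclass
class InstanceRecord:
    instances: List[Instance] = field(default_factory=list)


paper = Instance(has_label="Example paper", attributes={"hasYear": "2009"})
author = Instance(has_label="Example author", has_uri="http://example.org/author/1")
record = InstanceRecord(instances=[paper, author])

# Labels of the instances that still need a URI lookup.
pending = [i.has_label for i in record.instances if i.needs_uri_lookup()]
print(pending)
```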

Instance records with literal specifications would need to be validated and checked before actually being used for standard linked data or meaningful data purposes. However, this approach is already well proven through, for example, OpenLink’s Virtuoso Sponger cartridges and design. Sure, some work would need to be done at time of ingest, but there are no technical challenges.

The language used to write a literal can be specified for any kind of attribute (metadata or not). The language is specified using the “@lang-tag” at the end of the literal. This method is similar to the N3 serialization of RDF, which is also equivalent to the XML serialization of RDF using the “xml:lang” attribute.
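To make the “@lang-tag” convention concrete, here is a small Python sketch that splits a hand-written literal into its text and its language tag. It assumes the tag is separated from the literal by whitespace, as in the pseudo-form above; the function name and the regular expression are illustrative assumptions, not the behavior of any actual IRV parser.

```python
import re

# Assumed convention: whitespace, then "@", then a language tag, closes the literal.
LANG_TAG = re.compile(
    r"^(?P<text>.*\S)\s+@(?P<lang>[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*)$"
)


def split_lang(literal):
    """Return (text, lang); lang is None when no language tag is present."""
    literal = literal.strip()
    match = LANG_TAG.match(literal)
    if match:
        return match.group("text"), match.group("lang")
    return literal, None


print(split_lang("Iowa City @en"))
print(split_lang("Iowa City"))
```

Requiring whitespace before the “@” keeps values such as e-mail addresses from being mistaken for tagged literals, at the cost of not accepting the more compact N3 form directly.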

Metadata

Most of the first properties are simply metadata describing the instance. The strings could be qualified by language.

Attributes

The bulk of the instance record is devoted to the attributes and their values. Attributes could be optionally declared with XSD datatypes. URI references could be specified or later substituted by vetting services (see below).

Attributes could also optionally be characterized in a list format, similar to the Lists specification for Notation 3 (N3).

Asserted Relations

Identity and class membership (rdf:type) assertions could be made; these could later be checked for correctness or identity relations with external or specialized services. The assertIdentity property, in particular, is the replacement with more appropriate ABox semantics for owl:sameAs.

hasSource

A separate Source record is being developed to cover source or dataset characterizations. A single instance extraction from a Web page, for example, would be accompanied by a simple source characterization. Instances of particular types, such as microformats for example, would be so noted (as they might invoke specialized processors or carry certain authority). Instances from large datasets would have a still longer list of possible characterizations.

In defining this property, we may look closely at what is also being done for the voiD dataset vocabulary.

Certain parameters in a Source record, such as language for example, may also be applied in special ways by the IRV parser at time of ingest with respect to specific literal specifications.

In any event, this is one of the properties still needing much more thought and definition.

hasVetting

This property, too, needs much more thought and definition.

The hasVetting property, for which multiples are allowed, would identify the specific checks and services applied to the instance data. Depending on service, such checks might include URI lookup or de-referencing, identity relations and testing, record completeness and sufficiency checks, data type checking and validation, general instance checking, disambiguation, and so forth (see “Specialized Work” below).

Some services might also re-write the instance record with corrected values or URIs returned in place of literals.

Best practice for external services would suggest identifying them by URI, though literals would also be allowed to identify internal checks or for lookup purposes.

This property is meant to be a key indicator of how third parties may want to rely on the data. Combined with hasSource, these hasVetting entries provide essential authority and provenance information about the data at hand.
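The idea of external checks leaving a trail under hasVetting can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the two checks, the service identifiers (one URI, one literal), and the choice to record only the checks that passed. The property itself is, as noted, still to be defined.

```python
def check_label_present(instance):
    """Record sufficiency check: a non-empty hasLabel must exist."""
    return bool(instance.get("hasLabel", "").strip())


def check_year_is_integer(instance):
    """Datatype check: hasYear, when given, must be all digits."""
    value = instance.get("attributes", {}).get("hasYear")
    return value is None or value.isdigit()


# Hypothetical identifiers: an external service by URI, an internal check by literal.
SERVICES = {
    "http://example.org/vetting/label-sufficiency": check_label_present,
    "internal:year-datatype-check": check_year_is_integer,
}


def vet(instance, services=SERVICES):
    """Run each check; return a copy with passed checks listed under hasVetting."""
    vetted = dict(instance)
    passed = list(instance.get("hasVetting", []))
    failed = []
    for service_id, check in services.items():
        if check(instance):
            if service_id not in passed:
                passed.append(service_id)
        else:
            failed.append(service_id)
    vetted["hasVetting"] = passed
    return vetted, failed


good = {"hasLabel": "Example paper", "attributes": {"hasYear": "2009"}}
bad = {"hasLabel": "Example paper", "attributes": {"hasYear": "two thousand nine"}}

vetted_good, failed_good = vet(good)
vetted_bad, failed_bad = vet(bad)
print(vetted_good["hasVetting"], failed_good)
print(vetted_bad["hasVetting"], failed_bad)
```

A consumer of the record could then decide how far to trust it by looking at which identifiers appear under hasVetting, alongside hasSource.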

Putting it All Together

This diagram attempts to show the relationship of how many of these pieces may interact:

[Diagram: Information flow to the ABox]

Some of these bubbles deserve some additional commentary.

Hand-crafted Input

An important objective in this design is to allow naïve, simple text specifications to be hand-crafted for instance records. There are many relatively simple formats for specifying key-value pairs with relatively few conventions, ranging from BibTeX to YAML and JSON and others. There are literally hundreds of such formats available, as my earlier overview of Naïve Representations and Structs discussed.

Whether yet another format is justified for this Instance Record Vocabulary is still under active discussion.
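To give a feel for how little machinery a naïve key-value text needs, here is a rough sketch of a reader for one possible hand-crafted form. The conventions used (one "key: value" pair per line, a blank line between records, "#" for comments, repeated keys for multiple values) are assumptions for illustration and not a proposed format.

```python
def parse_records(text):
    """Parse blank-line-separated blocks of 'key: value' lines into dicts.

    Every key maps to a list of values so that repeated keys (for example,
    several hasVetting lines) are preserved in order.
    """
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:                      # blank line closes the current record
            if current:
                records.append(current)
                current = {}
            continue
        if line.startswith("#"):          # comment line, ignored
            continue
        # Split on the first colon only, so URI values keep their own colons.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got {raw!r}")
        current.setdefault(key.strip(), []).append(value.strip())
    if current:                           # last record may lack a trailing blank line
        records.append(current)
    return records


sample = """\
# two hand-written records
id: jane-doe
name: Jane Doe
hasSource: http://example.org/datasets/staff-directory
hasVetting: datatype-check
hasVetting: uri-lookup

id: john-roe
name: John Roe
"""

records = parse_records(sample)
```

Keeping every value in a list is a deliberate simplification: the reader never has to guess whether a key is single-valued or multi-valued.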

External Structs

However, whether there is a separate format or not, that same earlier piece overviewed the many simple data structs presently out there. It also noted the nearly 100 existing converters for these forms to RDF. These same converters, with quite slight modifications, could all output the Instance Record Vocabulary in an appropriate serialization as well.
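The shape of such a converter can be sketched in a few lines. The example below writes a key-value record out as N-Triples; the BASE and VOCAB namespaces are hypothetical placeholders, since the actual vocabulary URIs are not defined here, and treating any value that starts with "http" as a URI is a simplifying assumption.

```python
BASE = "http://example.org/id/"      # hypothetical namespace for record identifiers
VOCAB = "http://example.org/irv#"    # hypothetical namespace for property names


def escape_literal(value):
    """Escape the characters that N-Triples requires to be escaped in literals."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def to_ntriples(record):
    """Serialize one record (a dict with an 'id' key) as N-Triples lines."""
    subject = f"<{BASE}{record['id']}>"
    lines = []
    for key, values in record.items():
        if key == "id":
            continue
        for value in (values if isinstance(values, list) else [values]):
            # Assumption: values that look like HTTP URIs become resources,
            # everything else becomes a plain literal.
            if value.startswith(("http://", "https://")):
                obj = f"<{value}>"
            else:
                obj = f'"{escape_literal(value)}"'
            lines.append(f"{subject} <{VOCAB}{key}> {obj} .")
    return "\n".join(lines)


example = {
    "id": "jane-doe",
    "name": 'Jane "JD" Doe',
    "hasSource": ["http://example.org/datasets/staff-directory"],
}
output = to_ntriples(example)
```

An existing converter would need a real mapping from keys to vocabulary properties in place of the simple string concatenation used here.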

Hooks to Functional and Scripting Languages

Another option is to combine this design with a functional language front-end to generate these records. (Though they could be produced in other ways, as well.) For example, lambda calculus or even a domain-specific language (DSL) could be used to create this very simple record generator. This simple system, in turn, could have a straightforward API that would allow existing scripting languages (such as Python or others) to be used as well.
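One way to picture such a record generator, purely as an illustration, is a handful of small composable functions exposed to a scripting language. The names field, make_record and serialize below are invented for this sketch, and the "key: value" output form is likewise an assumption.

```python
from functools import reduce


def field(key, value):
    """Return a function that adds one key-value pair to a record.

    The input record is copied rather than mutated, in keeping with the
    functional style suggested above.
    """
    def apply(record):
        updated = {k: list(v) for k, v in record.items()}
        updated.setdefault(key, []).append(value)
        return updated
    return apply


def make_record(*steps):
    """Compose field-setting steps, left to right, starting from an empty record."""
    return reduce(lambda rec, step: step(rec), steps, {})


def serialize(record):
    """Write the record as simple 'key: value' lines, one per value."""
    return "\n".join(f"{key}: {value}"
                     for key, values in record.items()
                     for value in values)


rec = make_record(
    field("id", "jane-doe"),
    field("hasVetting", "datatype-check"),
    field("hasVetting", "uri-lookup"),
)
text = serialize(rec)
```

Because each step is an ordinary function, a script can build the list of steps from any source (a spreadsheet row, a form, a database query) before composing them.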

Specialized Work

We can now also see the specialized work (see also Part 2) that is not itself part of the ABox, but that can, and often should, be applied to the instance data in the ABox:

  • Record sufficiency checking
  • De-duplication
  • Membership testing
  • Most specific concept identifying
  • Datatype checking
  • Identity relation testing
  • New attribute checking
  • ABox consistency testing
  • Data range checking
  • Disambiguation
  • Source-specific testing
  • Uniqueness testing
  • URI lookup
  • URI de-referencing
  • Satisfiability checking
  • Others . . .

Though, strictly speaking, such specialty work could be seen to occur at the TBox level, it is actually different and separate logic from “standard” inferencing or reasoning. Specialized work can therefore often occur as separate tests or in batch mode with fragments of OWL or other dedicated indexes and algorithms. Some of this specialized work may take advantage of the conceptual relationships in the TBox, but need not do so. In this manner, the inferencing work of the TBox can be kept clean and efficient.
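To illustrate how a few of the checks listed above could run as separate batch tests without invoking a reasoner, here is a minimal sketch covering record sufficiency, datatype checking and de-duplication. The required keys, the datatype table and the rule that two records with the same id are duplicates are all assumptions chosen for the example.

```python
from datetime import date

REQUIRED = ("id", "name")                              # hypothetical sufficiency rule
DATATYPES = {"born": date.fromisoformat, "age": int}   # hypothetical datatype table


def check_sufficiency(record):
    """Report required keys that are absent or empty."""
    return [f"missing required key: {k}" for k in REQUIRED if not record.get(k)]


def check_datatypes(record):
    """Report values that cannot be parsed as their declared datatype."""
    problems = []
    for key, parse in DATATYPES.items():
        if key in record:
            try:
                parse(record[key])
            except (ValueError, TypeError):
                problems.append(f"bad value for {key}: {record[key]!r}")
    return problems


def deduplicate(records):
    """Keep the first record seen for each id; records without an id are kept."""
    seen, unique = set(), []
    for rec in records:
        key = rec.get("id")
        if key is not None and key in seen:
            continue
        seen.add(key)
        unique.append(rec)
    return unique


def run_batch(records, checks=(check_sufficiency, check_datatypes)):
    """Run every check over the de-duplicated records; return problems per record."""
    report = {}
    for index, rec in enumerate(deduplicate(records)):
        problems = [p for check in checks for p in check(rec)]
        if problems:
            report[rec.get("id", f"record #{index}")] = problems
    return report


batch = [
    {"id": "a", "name": "Ann", "age": "42"},
    {"id": "a", "name": "Ann (duplicate)", "age": "x"},
    {"id": "b", "age": "forty"},
]
report = run_batch(batch)
```

Each check is an independent function, so further tests from the list can be added to the batch without touching the others or the TBox.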

Beyond Browsing and Unvalidated Queries

Today, linked data has largely been used for browsing and providing unvalidated responses to queries; focus and attention to its ABox roles are important to move beyond this baseline into meaningful work [2]. In those limited instances where this linked data has been looked at and evaluated as a complete knowledge base, such as the SWSE search engine with the SAOR approach as discussed in Part 3, more than 97% of the RDF triples provided in those cases were removed from consideration, often for logical or mis-assertion reasons [4].

The idea presented here, a simpler linked data specification that can be easily represented in readable text, is not new. RDF in JSON has been looked at in this way by Talis and JDIL, YAML has been looked at similarly, and similar and simpler approaches have been looked at closely for topic maps. There are other examples.

A key thrust of these efforts is to make it easier for the data publisher, thereby encouraging the exposure of more structured data.

These emerging ideas do not change in any way the usefulness of current linked data. Our suggested approach interoperates seamlessly with current practices and easily co-resides with them. But, these ideas do:

  • Provide a simpler path for writing and publishing human-readable instance data
  • Provide an ABox instance record structure that can have much specialized work applied against it in a consistent way, and
  • Contribute to an overall logic and architecture that is performant and scalable for doing meaningful work.

Though still needing further thought and refinement, this broad outline of roles, architecture and structure for the ABox supplies the last missing piece of Structured Dynamics’ overall approach to linked data and RDF. Much time, thought and research have gone into it. Again, we’d very much like to thank Jim Pitman for his ideas that have helped catalyze this design [5].

We think the combination of a generalized Instance Record Vocabulary that can be reasoned over for ABox-level data checking, and that works with a simple, text-based key-value pair input format, might be a winning combination.


[1] This is our working definition for description logics:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[2] Heiko Stoermer, 2008. Okkam: Enabling Entity-centric Information Integration in the Semantic Web, Ph.D. thesis presented to the DIT – University of Trento, January 2008, 185 pp. See http://eprints.biblio.unitn.it/archive/00001394/01/dissertation_camera_ready.pdf.
[3] Boris Motik et al., eds., 2008. “OWL 2 Web Ontology Language: Profiles,” a W3C Working Draft, December 2, 2008. See http://www.w3.org/TR/owl2-profiles/.
[4] Aidan Hogan, Andreas Harth and Axel Polleres, 2008. “Scalable Authoritative OWL Reasoning on a Billion Triples,” in Proceedings of Billion Triple Semantic Web Challenge 2008, at the 7th International Semantic Web Conference (ISWC2008), Karlsruhe, Germany, 2008. See http://sw.deri.org/~aidanh/docs/saor_billiontc08.pdf.
[5] This input has come as a result of research supported in part by NSF Award 0835851, Bibliographic Knowledge Network.

Posted by AI3's author, Mike Bergman

Posted on February 23, 2009 at 2:14 pm in Description Logics, Linked Data, Semantic Web, Structured Dynamics, Structured Web, UMBEL
The URI link reference to this post is: http://www.mkbergman.com/478/making-linked-data-reasonable-using-description-logics-part-4/
Copyright © 2004–2010 Michael K. Bergman.   This work is licensed under a Creative Commons License