Posted: April 8, 2009

A 10th Birthday Salute to RDF’s Role in Powering Data Interoperability

There has been much welcomed visibility for the semantic Web and linked data of late. Many wonder why it has not happened earlier; and some observe progress has still been too slow. But what is often overlooked is the foundational role of RDF — the Resource Description Framework.

From my own perspective focused on the issues of data interoperability and data federation, RDF is the single most important factor in today’s advances. Sure, there have been other models and other formulations, but I think we now see the Goldilocks “just right” combination of expressiveness and simplicity to power the foreseeable future of data interoperability.

So, on this 10th anniversary of the birth of RDF [1], I’d like to re-visit and update some much dated discussions regarding the advantages of RDF, and more directly address some of the mis-perceptions and myths that have grown up around this most useful framework.

By request, this article is now available as a PDF download.

A Simple Intro to RDF

RDF is a data model that is expressed as simple subject-predicate-object “triples”. That sounds fancy, but just substitute verb for predicate and noun for subject and object. In other words: Dick sees Jane; or, the ball is round. It may sound like a kindergarten reader, but it is how data can be easily represented and built up into more complex structures and stories.

A triple is also known as a “statement” and is the basic “fact” or asserted unit of knowledge in RDF. Multiple statements get combined together by matching the subjects or objects as “nodes” to one another (the predicates act as connectors or “edges”). As these node-edge-node triple statements get aggregated, a network structure emerges, known as the RDF graph.
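As a minimal sketch of how statements join into a graph, the triples can be held as plain Python tuples and indexed by subject. The names (Dick, Jane, the `holds` and `hasShape` predicates) and the `build_graph` helper are illustrative assumptions, not part of any RDF library.

```python
from collections import defaultdict

# Each statement is a (subject, predicate, object) tuple.
# All names here are made up for illustration.
triples = {
    ("Dick", "sees", "Jane"),
    ("Jane", "holds", "ball"),
    ("ball", "hasShape", "round"),
}

def build_graph(statements):
    """Index statements by subject: node -> list of (edge, node)."""
    graph = defaultdict(list)
    for s, p, o in statements:
        graph[s].append((p, o))
    return graph

graph = build_graph(triples)

# "Jane" is the object of one statement and the subject of another,
# so the two statements connect at that node.
print(graph["Dick"])
print(graph["Jane"])
```

Because "Jane" appears in two statements, following the edges from "Dick" leads on to "ball" and then to "round", which is the network structure described above.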

The referenced “resources” in RDF triples have unique identifiers, IRIs, that are Web-compatible and Web-scalable. These identifiers can point to precise definitions of predicates or refer to specific concepts or objects, leading to less ambiguity and clearer meaning or semantics.

In my own company’s approach to RDF, basic instance data is simply represented as attribute-value pairs where the subject is the instance itself, the predicate is the attribute, and the object is the value. Such instance records are also known as the ABox. The structural relationships within RDF are defined in ontologies, also known as the TBox, which are basically equivalent to a schema in the relational data realm.

RDF triples can be applied equally to all structured, semi-structured and unstructured content. By defining new types and predicates, it is possible to create more expressive vocabularies within RDF. This expressiveness enables RDF to define controlled vocabularies with exact semantics. These features make RDF a powerful data model and language for data federation and interoperability across disparate datasets.

There are many excellent introductions or tutorials to RDF; a recommended sampling is shown in the endnotes [2].

Is RDF a Framework, Data Model or Vocabulary?

Well, the answer to the rhetorical question is, all three!

The RDF data model provides an abstract, conceptual framework for defining and using metadata and metadata vocabularies. See: We were able to use all three concepts in a single sentence!

The RDF model draws on well-established principles from various data representation communities. RDF properties may be thought of as attributes of resources and in this sense correspond to traditional attribute-value pairs. RDF properties also represent relationships between resources and an RDF model can therefore resemble an entity-relationship diagram. . . . In object-oriented design terminology, resources correspond to objects and properties correspond to instance variables. [1]

But, actually, because RDF is simultaneously a framework, data model and basis for building more complex vocabularies, it is both simple and complex at the same time.

It is first perhaps best to understand basic RDF as a data model of triples with very few (or unconstrained) semantics [3]. In its base form, it has no range or domain constraints; has no existence or cardinality constraints; and lacks transitive, inverse or symmetrical properties (or predicates) [4]. As such, basic RDF has limited reasoning support. It is, however, quite useful in describing static things or basic facts.

In this regard, RDF in its base state is nearly adequate for describing the simple instances and data records of the world, what is called the ABox in description logics.

RDFS (RDF Schema) is the next layer in the RDF stack designed to overcome some of these baseline limitations. RDFS introduces new predicates and classes that bound these semantics. Importantly, RDFS establishes the basic constructs necessary to create new vocabularies, principally through adding the class and subClass declarations and adding domain and range to properties (the RDF term for predicates). Many useful vocabularies have been created with RDFS and it is possible to apply limited reasoning and inference support against them.
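The kind of limited inference that class and subClass declarations make possible can be sketched in a few lines. This is a toy forward-chaining loop over tuples, assuming only two RDFS-style rules (subclass transitivity and type propagation); the `ex:` names and the `infer_types` function are hypothetical, and a real reasoner covers far more.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def infer_types(triples):
    """Apply two RDFS-style rules until nothing new is derived:
    1. subClassOf is transitive;
    2. an instance of a class is also an instance of its superclasses."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                if p2 == SUBCLASS and o == s2:
                    if p == SUBCLASS:
                        new.add((s, SUBCLASS, o2))
                    elif p == RDF_TYPE:
                        new.add((s, RDF_TYPE, o2))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred

asserted = {
    ("ex:Robin", SUBCLASS, "ex:Bird"),
    ("ex:Bird", SUBCLASS, "ex:Animal"),
    ("ex:tweety", RDF_TYPE, "ex:Robin"),
}
closure = infer_types(asserted)

# Nobody stated that tweety is an Animal; it follows from the schema.
print(("ex:tweety", RDF_TYPE, "ex:Animal") in closure)
```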

The next layer in the RDF stack is OWL, the Web Ontology Language. It, too, is based on RDF. The first versions of OWL were themselves layered from OWL Lite to OWL DL to OWL Full. OWL Lite and OWL DL are both decidable through the first-order logic basis of description logics (the basis for the acronym in OWL DL). OWL Full is not decidable, but provides an OWL counterpart to fragmented RDF and RDFS statements that are desirable in the aggregate, with reasoning applied where possible.

OWL provides sufficient expressive richness to be able to describe the relationships and structure of entire world views, or the so-called terminological (TBox) construct in description logics. Thus, we see that the complete structural spectrum of description logics can be satisfied with RDF and its schematic progeny, with a bit of an escape hatch for combining poorly defined or structural pieces via using OWL Full [5].

However, RDF is NOT a particular serialization. Though XML was the original specified serialization and still is the defined RDF MIME type (application/rdf+xml; other serializations take the form text/turtle or text/n3 or similar), it is not necessary to either write or transmit RDF in the XML syntax.

In any event, depending on its role and application, we can see that RDF is a foundation, in careful expressions based in description logics, that lends itself to a clean expression and separation of concerns. With RDF and RDFS, we have a data model and a basis for vocabularies well suited to instance data (ABox). With RDFS and OWL, we have an extended schema structure and ontologies suitable for describing and modeling the relationships in the world (TBox). Thus, RDF is a framework for modeling all forms of data, for describing that data through vocabularies, and for interoperating that data through shared conceptualizations (ontologies) and schema.

Rationale for a Canonical Data Model

In the context of data interoperability, a critical premise is that a single, canonical data model is highly desirable. Why? Simply because of 2N versus N². That is, a single reference (“canon”) structure means that fewer tool variants and converters need be developed to talk to the myriad of data formats in the wild. With a canonical data model, talking to external sources and formats (N) only requires converters to the canonical form (2N). Without a canonical model, the combinatorial explosion of required format converters becomes N² [6].
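The arithmetic is easy to check. The sketch below counts one converter per direction; the exact pairwise count of N(N-1), which grows like N², is my own bookkeeping assumption rather than a figure from the cited source, and the function names are hypothetical.

```python
def converters_pairwise(n):
    # Every format needs a converter to each of the other n - 1 formats.
    return n * (n - 1)

def converters_canonical(n):
    # Every format needs one converter into, and one out of, the canonical model.
    return 2 * n

for n in (3, 10, 50):
    print(n, converters_pairwise(n), converters_canonical(n))
```

With three formats the two approaches tie at six converters; beyond that the canonical approach grows linearly while the pairwise approach grows quadratically.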

Note, in general, such a canonical data model merely represents the agreed-upon internal representation. It need not affect data transfer formats. Indeed, in many cases, data systems employ quite different internal data models from what is used for data exchange. Many, in fact, have two or three favored flavors of data exchange such as XML, JSON or the like.

In most enterprises and organizations, the relational data model with its supporting RDBMS is the canonical one. In some notable Web enterprises (Google, for example), the exact details of the internal canonical data model are hidden from view, with APIs and data exchange standards such as GData being the only visible portions to outside consumers.

Generally speaking, a canonical, internal data standard should meet a few criteria:

  • Be expressive enough to capture the structure and semantics of any contributing dataset
  • Have a schema itself which is extensible
  • Be performant
  • Have a model to which it is relatively easy to develop converters for different formats and models
  • Have published and proven standards, and
  • Have sufficient acceptance so as to have many existing tools and documentation.

Other desired characteristics might be for the model and many of its tools to be free and open source, suitable to much analytic work, efficient in storage, and other factors.

Though the relational data model is numerically the most prevalent one in use, it has fallen out of favor for data federation purposes. This loss of favor is due, in part, to the fragile nature of relational schema, which increases maintenance costs for the data and their applications, and incompatibilities in standards and implementation.

Though still comparatively young with a smaller-than-desirable suite of tools and applications support [7], RDF is perhaps the ideal candidate for the canonical data model. To understand why, let’s now switch our discussion to the advantages of RDF.

Advantages of RDF

It is surprisingly difficult to find a consolidated listing of RDF’s advantages. The W3C, the developer of the specification, first published on this topic in the late 1990s, but it has not been updated for some time [8]. Graham Klyne has a better and more comprehensive presentation, but still one that has not been updated since 2004 [4].

I believe data interoperability to be RDF’s premier advantage, but there are many, many others.

Another advantage that is less understood is that RDF and its progeny can completely switch the development paradigm: data can now drive the application, and not the other way around. Frankly, we are just at the beginning realizations of this phase with such developments as linked data and even whole applications or application languages being written in RDF [9], but I think time will prove this advantage to be game-changing.

But, there are many perspectives that can help tease out RDF’s advantages. Some of these are discussed below, with the accompanying table attempting to list these ‘Top Sixty’ advantages in a single location.

Standard, Open and Expressive

In its ten year history, RDF has spawned many related languages and standards. The W3C has been the shepherd for this process, and there are many entry locations on the World Wide Web Consortium’s Web site to begin exploring these options [10]. These standards extend from the RDF, RDFS and OWL vocabularies and languages noted above that give RDF its range of expressiveness, to query languages (e.g., SPARQL), transformation languages (e.g., GRDDL), rule languages (e.g., RIF), and many additional constructs and standards.

The richness of this base of standards is only now being tapped. The combination of these standards and the tools they are spawning is just beginning. And, because it is so easily serialized as XML, a further suite of tools and capabilities such as XPath or XSLT or XForms may be layered onto this base.

Moreover, one is not limited in any way to XML as a serialization. RDF itself has been serialized in a number of formats including RDF/XML, N3, RDFa, Turtle, and N-triples. Also, RDF’s simple subject-predicate-object data model can readily convert human-readable and easily authored instance records (subject) written in the style of attribute-value pairs (predicate-object). As such, RDF is an excellent conversion target for all forms of naïve data structs [11].

Data Interoperability

Indeed, it is in data exchange and interoperability that RDF really shines. Via various processors or extractors, RDF can capture and convey the metadata or information in unstructured (say, text), semi-structured (say, HTML documents) or structured sources (say, standard databases). This makes RDF almost a “universal solvent” for representing data structure.

“The semantic Web’s real selling point is URI-based data integration.”
Harry Halpin [12]

Because of this universality, there are now more than 100 off-the-shelf ‘RDFizers’ for converting various non-RDF notations (data formats and serializations) to RDF [13]. Because of its diversity of serializations and simple data model, it is also easy to create new converters. Generalized conversion languages such as GRDDL provide framework-specific conversions, such as for microformats.

Once in a common RDF representation, it is easy to incorporate new datasets or new attributes. It is also easy to aggregate disparate data sources as if they came from a single source. This enables meaningful composition of data from different applications regardless of format or serialization.

Simple RDF structures and predicates enable synonyms or aliases to also be easily mapped to the same types or concepts. This kind of semantic matching is a key capability of the semantic Web. It becomes quite easy to say that your glad is my happy, and they indeed talk about the same thing.
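A rough sketch of both ideas, aggregation and alias matching, follows. Merging is just a set union of triples, and the alias table is a hypothetical stand-in for what would normally be stated in RDF itself (for example with an equivalence predicate); the `ex:` names, `ALIASES` and `merge` are assumptions made for illustration.

```python
# Hypothetical alias table: your "glad" is my "happy".
ALIASES = {"ex:glad": "ex:happy"}

def canonical(term):
    """Map an alias to its preferred term; leave other terms alone."""
    return ALIASES.get(term, term)

def merge(*datasets):
    """Union any number of triple sets, normalizing aliases on the way in."""
    merged = set()
    for dataset in datasets:
        for s, p, o in dataset:
            merged.add((canonical(s), canonical(p), canonical(o)))
    return merged

yours = {("ex:alice", "ex:mood", "ex:glad")}
mine = {("ex:bob", "ex:mood", "ex:happy")}

merged = merge(yours, mine)
happy_people = sorted(
    s for s, p, o in merged if p == "ex:mood" and o == "ex:happy"
)
print(happy_people)
```

After the merge, a single query finds both people, as if the two datasets had come from one source.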

What this mapping flexibility points to is the immense strengths of RDF in representing diverse schema, the next major advantage.

Schema Unbound

The single failure of data integration since the inception of information technologies, for more than 30 years now, has been schema rigidity or schema fragility. That is, once data relationships are set, they remain so and cannot easily be changed in conventional data management systems or in the applications that use them.

Relational database management (RDBM) systems have not helped with this challenge at all. While tremendously useful for transactions and enabling the addition of more data records (instances, or rows in a relational table schema), they are neither adaptive nor flexible.

Why is this so?

In part, it has to do with the structural “view” of the world. If everything is represented as a flat table of rows and columns, with keys to other flat structures, as soon as that representation changes, the tentacled connections can break. Such has been the fragility of the RDBMS model, and the hard-earned resistance of RDBMS administrators to schema growth or change.

Yet, change is inevitable. And thus, this is the source of frustration with virtually all extant data systems.

RDF has no such limitations. And, for those from a conventional data management perspective, this RDF flexibility can be one of the more unbelievable aspects of this data model.

As we have noted earlier, RDF is well suited and can provide a common framework to represent both instance data and the structures or schema that describe them, from basic data records to entire domains or world views. In fact, whatever schema or structure that characterizes the input data — from simple instance record layouts and attributes to complete vocabularies or ontologies — also embodies domain knowledge. This structure can be used at time of ingest as validity or consistency checks.

As a framework for data interoperability, RDF and its progeny can ingest all relations and terminology, with connections made via flexible predicates that assert the degree and nature of relatedness. There is no need for ingested records or data to be complete, nor to meet any prior agreement as to structure or schema.

Increment, Evolve, Extend, Adapt . . .

Indeed, the very fluidity of RDF and structures based on it is another key strength. Since a basic RDF model can be processed even in the absence of more detailed information, input data and basic inferences can proceed early and logically as a simple fact basis. This strength means that either data or schema may be ingested and then extended in an incremental or partial manner. Partial representations can be incorporated as readily as complete ones, and schema can extend and evolve as new structure is discovered or encountered.

This is revolutionary. RDF provides a data and schema representation framework that can evolve and adapt to what data exists and what structure is known. As new data with new attributes are discovered, or as new relationships are found or realized, these can be added to the existing model without any change whatsoever to the prior existing schema.

This very adaptability is what enables RDF to be viewed as data-driven design. We can deal with a partial and incomplete world; we can learn as we go; we can start small and simple and evolve to more understanding and structure; and we can preserve all structure and investments we have previously made.

And applications based on RDF work the same way: they do not need to process or account for information they don’t know or understand. We can easily query RDF models without being affected whatsoever by unreferenced or untyped data in the basic model.
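To illustrate, here is a small sketch in which new, never-declared attributes are added to a record and an existing query returns exactly what it did before. The store is a plain Python set of tuples, and the `ex:` identifiers and the `values` helper are hypothetical.

```python
def values(store, subject, predicate):
    """Return all objects for a given subject and predicate."""
    return sorted(o for s, p, o in store if s == subject and p == predicate)

store = {("ex:cust42", "ex:name", "Jane Doe")}
before = values(store, "ex:cust42", "ex:name")

# New attributes arrive later; nothing was declared up front and
# no existing statement has to be altered or padded.
store.add(("ex:cust42", "ex:twitterHandle", "@jane"))
store.add(("ex:cust42", "ex:unknownThing", "???"))

after = values(store, "ex:cust42", "ex:name")
print(before == after)
```

The query that asks for names neither knows nor cares about the statements added afterward.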

By replacing the rigid relational data model with one based on RDF, we gain robustness, flexibility, universality and structural persistence over fragility.

Existing technologies such as SQL and the relational model were devised without the specific requirements of disparate, uncontrolled, large-scale integration. Though the relational model enabled us to build efficient data silos and transaction systems, RDF now enables us to finally federate them.

‘Top Sixty’ Benefits of RDF
  • A foundation based in description logics that lends itself to clean expression and separation of concerns regarding instance data (ABox) and schema structure (TBox)
  • RDF’s unique identifiers, IRIs, are Web-compatible and are Web-scalable
  • Potential use of inferencing to contextually broaden search, retrieval and analysis
  • Potential use of its structure to automatically drive applications and tools, including populating context-relevant dropdown lists and auto-completion
  • Based on open source, languages and standards
  • A comparatively complete suite of specifications including languages, schema and tools (e.g., RDF, RDF Schema, OWL, RIF, SPARQL, GRDDL, etc.)
  • A choice of a variety of serializations and notations, including RDF/XML, N3, RDFa, Turtle, N-triples, as well as possible expression in many non-RDF notations
  • Instance records in human-readable, easily authored attribute-value formats can be readily converted to the subject-predicate-object (s-p-o) RDF “triple” data model
  • Can capture metadata and structure from unstructured, semi-structured and structured data
  • More than 100 off-the-shelf ‘RDFizers’ exist for converting various non-RDF notations (data formats and serializations)
  • Easy and cost-effective incorporation of new datasets wherein only new attributes require a structure update; all others simply get mapped
  • Aggregate processing of disparate sources as if they came from a single source
  • A ready structure for synonym and alias matching when merging or matching datasets
  • In converting non-RDF data, the ability to bring a more formal class structure to the description of things
  • Common framework and vocabulary for representing instance data
  • Common framework and vocabulary for representing data structures and schema
  • Can describe simple data structs to complete vocabularies/ontologies to processing and inferencing rules
  • Schema can be calculated from the ingested triples; thus, can either generate schema from scratch or be used to cross-check prior schema
  • Can accept and store data with different structure in a general RDF container (e.g., all animals v a specific bird)
  • Eliminates the trade-off between good design and performance for related structure (e.g., full names v first and last names)
  • Untyped relations can still have single operations performed against them
  • More formal RDF structures (e.g., ontologies) embody domain expertise within their subject structure
  • Readily extensible with schema that are also machine readable, bringing about a high degree of automation
  • Allows data that is structured slightly differently to be stored together in the lowest common denominator of an RDF statement
  • No need for upfront schema agreement; can evolve, extend and adapt
  • Allows the schema to change independently of the data without requiring any existing data to be thrown away or padded with NULLs
  • The basic RDF model can be processed in the absence of more detailed information as a simple fact basis
  • Schema based on RDF can be extended and grown incrementally without impacting the existing datastore
  • As a corollary, development based on RDF can also be incremental, reducing the need to “design it at once” or “design it right” up front
  • RDF models and apps lend themselves to experimentation and agile development
  • Information can be gathered incrementally from multiple sources
  • Data and schema can be ingested, represented and conveyed in “partial” form
  • Structure and schema can evolve incrementally in concert with new understandings and new data
  • All prior investments in structure and schema can be maintained
  • Because of conceptual closeness to the relational data model, it is possible to represent RDF in a relational database and vice versa
  • RDF thus has the ability to take advantage of historical RDBMS and SQL query optimizations
  • Ability to create RDF “views” or wrappers over relational schema that can be queried via SPARQL
  • A common storage format based on the triple or quad; suitable for datastore hosting by relational database management systems
  • The use of untyped relations reduces the total number of relations to be handled, with operations over them only needed once
  • Relational systems can serve instance data in situ (ABox) while interoperability is provided by an RDF structural and schema layer (TBox)
  • Ability to do specialized work, such as inferencing
  • Use of a set-based semantics and queries
  • Via its SPARQL query language, easy mechanisms to drive faceted search and other browsing and viewing tools
  • Because of how RDF works it is possible to query a dataset without knowing anything about the data in advance
  • Ability to generalize selection, viewing and publishing tools driven solely from the RDF structure; as the structure changes, tools automatically reflect those changes (e.g., plug-and-play)
  • Can easily create and apply inferencing tables over RDF datastores [14]
  • The RDF graph brings all the advantages and generality of structuring information using graphs
  • A graph is, itself, a unique form of data type with unique algorithms and analytic features
  • Graphs are modular and can be readily combined or broken apart
  • Graphs can be used for scalable, parallelized information processing
  • Unique types of search and discovery can occur with RDF graphs
  • Graphs provide the ability to visualize and navigate large network structures
  • Queries are unaffected by unreferenced values in the source data
  • Emerging lingua franca of the semantic Web and ‘Web of data’
  • Strong compatibility with “linked data” based on Web access (HTTP) and IRI identifiers
  • RDF is readily adaptable to the open-world assumption (OWA)
  • Relation to the semantic Web means much global information and data can be admixed with local content
  • Across all global sources the potential for powerful data “mesh-ups” conjoining related information
  • Network effects such as shared vocabularies, shared background knowledge, collective authoring, annotating and curating, and
  • RDF is an emergent data model.

Yet, Still Kissing Cousins with the Relational Model

Despite these differences in fragility and robustness, there are in fact many logical and conceptual affinities between the relational model and the one for RDF. An excellent piece on those relations was written by Andrew Newman a bit over a year ago [15].

RDF can be modeled relationally as a single table with three columns corresponding to the subject-predicate-object triple. Conversely, a relational table can be modeled in RDF with the subject IRI derived from the primary key or a blank node; the predicate from the column identifier; and the object from the cell value. Because of these affinities, it is also possible to store RDF data models in existing relational databases. (In fact, most RDF “triple stores” are RDBM systems with a tweak, sometimes as “quad stores” where the fourth element of the tuple is the graph.) Moreover, these affinities also mean that RDF stored in this manner can also take advantage of the historical learnings around RDBMS and SQL query optimizations.
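Both directions of that mapping can be sketched with the standard library alone. The row-to-triples function derives the subject from the primary key, the predicate from the column name and the object from the cell value; the triples then go into a three-column SQLite table. The `ex:` identifier scheme, the table name and the `row_to_triples` helper are assumptions for illustration, not a standard mapping.

```python
import sqlite3

def row_to_triples(table, key_column, row):
    """Subject from primary key, predicate from column name, object from cell value.
    Empty cells simply produce no triple, so there is nothing to pad with NULLs."""
    subject = f"ex:{table}/{row[key_column]}"
    return [
        (subject, f"ex:{table}#{column}", str(value))
        for column, value in row.items()
        if column != key_column and value is not None
    ]

row = {"id": 7, "name": "Jane", "city": "Iowa City", "fax": None}
triples = row_to_triples("customer", "id", row)

# The reverse direction: triples stored in a single three-column table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)

cities = conn.execute(
    "SELECT o FROM triples WHERE s = ? AND p = ?",
    ("ex:customer/7", "ex:customer#city"),
).fetchall()
print(cities)
```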

Just as there are many RDFizers as noted above, there are also nice ways to convert relational schema to RDF automatically. OpenLink Software, for example, has its RDF “Views” system that does just that [16]. Given these overall conceptual and logical affinities the W3C is also in the process of graduating an incubator group to an official work group, RDB2RDF [17], focused on methods and specifications for mapping relational schema to RDF.

What is emerging is one vision whereby existing RDBM systems retain and serve the instance records (ABox), while RDF and its progeny provide the flexible schema scaffolding and structure over them (TBox). Architectures such as this retain prior investments, but also provide a robust migration path for interoperating across disparate data silos in a performant way.

Data-driven Applications

As developers, one of our favorite advantages of RDF is its ability to support data-driven applications. This makes even further sense when combined with a Web-oriented architecture that exposes all tools and data as RESTful Web services [18].

Two tool foundations are the RDF query language, SPARQL [19], and inferencing. SPARQL provides a generalized basis for driving reports and templated data displays, as well as standard querying. Utilizing RDF’s simple triple structure, SPARQL can also be used to query a dataset without knowing anything in advance about the data. This provides a very useful discovery mode.
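The discovery mode can be imitated without a SPARQL engine. The sketch below uses a hypothetical `match` function in which `None` plays the role of a SPARQL variable; leaving every position open is roughly what the query `SELECT DISTINCT ?p WHERE { ?s ?p ?o }` asks for. The sample data and prefixes are made up.

```python
def match(store, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

store = {
    ("ex:doc1", "dc:title", "RDF Primer"),
    ("ex:doc1", "dc:creator", "ex:alice"),
    ("ex:alice", "foaf:name", "Alice"),
}

# Step 1: with no prior knowledge, ask which predicates exist at all.
predicates = sorted({p for _, p, _ in match(store)})
print(predicates)

# Step 2: use what was discovered to ask a narrower question.
about_alice = match(store, s="ex:alice")
print(about_alice)
```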

Simple inferencing can be applied to broaden and contextualize search, retrieval and analysis. Inference tables can also be created in advance and layered over existing RDF datastores [14] for speedier use and the automatic invoking of inferencing. More complicated inferencing means that RDF models can also perform as complete conceptual views of the world, or knowledge bases. Quite complicated systems are emerging in such areas as common sense (with OpenCyc) and biological systems [20], as two examples.

RDF ontologies and controlled vocabularies also have some hidden power, not yet often seen in standard applications: by virtue of their structure and label properties, we can populate context-relevant dropdown lists and auto-complete entries in user interfaces solely from the input data and structure. This ability is completely generalizable solely on the basis of the input ontology(ies).
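As a simple sketch of the idea, a dropdown can be filled from nothing more than subclass and label statements. The tiny ontology, the `dropdown_options` and `autocomplete` helpers, and the `ex:` names are all hypothetical; a real interface would read these statements from the ontology actually loaded.

```python
def dropdown_options(store, parent_class):
    """List the labels of the direct subclasses of parent_class."""
    subclasses = {
        s for s, p, o in store if p == "rdfs:subClassOf" and o == parent_class
    }
    labels = {s: o for s, p, o in store if p == "rdfs:label"}
    # Fall back to the identifier itself when no label was supplied.
    return sorted(labels.get(c, c) for c in subclasses)

def autocomplete(options, prefix):
    """Case-insensitive prefix filter over the dropdown options."""
    return [x for x in options if x.lower().startswith(prefix.lower())]

ontology = {
    ("ex:Robin", "rdfs:subClassOf", "ex:Bird"),
    ("ex:Raven", "rdfs:subClassOf", "ex:Bird"),
    ("ex:Trout", "rdfs:subClassOf", "ex:Fish"),
    ("ex:Robin", "rdfs:label", "Robin"),
    ("ex:Raven", "rdfs:label", "Raven"),
    ("ex:Trout", "rdfs:label", "Trout"),
}

options = dropdown_options(ontology, "ex:Bird")
print(options)
print(autocomplete(options, "ro"))
```

Adding a new subclass statement to the ontology changes the dropdown with no change to the code.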

A Graph Representation

As the intro noted, when RDF triples get combined, a graph structure emerges. (Actually, it can most formally be described as a directed graph.) A graph structure has many advantages. While we are seeing much starting to emerge in the graph analysis of social networks, we could also fairly argue that we are still at the early stages of plumbing the unique features of graph (“network”) structures.

Graphs are modular and can be both readily combined and broken apart. From a computational standpoint, this can lend itself to parallelized information processing (and, therefore, scalability). With specific reference to RDF it also means that graph extractions are themselves valid RDF models.

Graph algorithms are a significant field of interest within mathematics, computer science and the social sciences. Via approaches such as network theory or scale-free networks, topics such as relatedness, centrality, importance, influence, “hubs” and “domains”, link analysis, spread, diffusion and other dynamics can be analyzed and modeled.

Graphs also have some unique aspects in search and pattern matching. Besides options like finding paths between two nodes, depth-first search, breadth-first search, or finding shortest paths, emerging graph and pattern-matching approaches may offer entirely new paradigms for search.
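One of the options named above, finding a shortest path with breadth-first search, can be sketched directly over a set of triples. The function follows edges in their stated direction (subject to object) and ignores predicate names; the `shortest_path` helper and the sample `ex:` nodes are assumptions for illustration.

```python
from collections import deque

def shortest_path(store, start, goal):
    """Breadth-first search from start to goal, following subject -> object edges."""
    neighbors = {}
    for s, _, o in store:
        neighbors.setdefault(s, []).append(o)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in sorted(neighbors.get(node, [])):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no directed path exists

store = {
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:b", "ex:knows", "ex:c"),
    ("ex:a", "ex:knows", "ex:d"),
    ("ex:d", "ex:knows", "ex:e"),
    ("ex:e", "ex:knows", "ex:c"),
}

print(shortest_path(store, "ex:a", "ex:c"))
print(shortest_path(store, "ex:c", "ex:a"))
```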

Graphs also provide new approaches for visualization and navigation, useful for both seeing relationships and framing information from the local to global contexts. The interconnectedness of the graph allows data to be explored via contextual facets, which is revolutionizing data understanding in a way similar to how the basic hyperlink between documents on the Web changed the contours of our information spaces [21].

Many would argue (as do I), that graphs are the most “natural” data structure for capturing the relationships of the real world. If so, we should continue to see new algorithms and approaches emerge based on graphs to help us better understand our information. And RDF is a natural data model for such purposes.

Open World Applications and the Semantic Web

Ultimately, data interoperability implies a global context. The design of RDF began from this perspective with the semantic Web.

This perspective is firstly grounded in the open-world assumption: that is, the information at hand is understood to be incomplete and not self-contained. Missing values are to be expected and do not falsify what is there. A corollary assumption is there is always more information that can be added to the system, and the design should not only accommodate, but promote, that fact.
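The difference from the closed-world reading familiar from databases can be shown with a deliberately simplified sketch. Here a missing statement yields `None` (unknown) under the open-world reading rather than `False`; the two `ask_*` functions and the `ex:` names are hypothetical and gloss over how real reasoners establish that something is false.

```python
def ask_closed(store, triple):
    """Closed world: whatever is not stated is taken to be false."""
    return triple in store

def ask_open(store, triple):
    """Open world: a stated triple is true; an unstated one is merely unknown."""
    return True if triple in store else None

store = {("ex:jane", "ex:worksFor", "ex:acme")}
question = ("ex:jane", "ex:worksFor", "ex:globex")

print(ask_closed(store, question))  # False
print(ask_open(store, question))    # None, meaning "not known"

# More information can always arrive later without contradicting what is there.
store.add(question)
print(ask_open(store, question))    # True
```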

As the lingua franca for the semantic Web, using RDF means that many new data, structures and vocabularies now become available to you. So, not only can RDF work to interoperate your own data, but it can link in useful, external data and schema as well.

Indeed, the concept of linked data now becomes prominent whereby RDF data with unique IRIs as their universal identifiers are exposed explicitly to aid discovery and interlinking. Whether internal data is exposed in the linked data manner or not, this external data can now be readily incorporated into local contexts. The Linking Open Data movement that is promoting this pattern has become highly successful, with billions of useful RDF statements now available for use and consumption [10].

The semantic Web and RDF are enabling the data federation scope to extend beyond organizational boundaries to embrace (soon) virtually all public information. That means that, say, local customer records can now be supplemented with external information about specific customers or products. We are really just at the nascent stages of such data “mesh-ups” with many unforeseen benefits (and, challenges, too, such as privacy and identity and ownership) likely to emerge.

At Web scales, we will see network effects also emerge in areas such as shared vocabularies, shared background knowledge, and collective authoring, annotating and curating. To be sure the traditional work of trade associations and standards bodies will continue, but likely now in much more operable ways.

Myths of RDF

Throughout the years, a number of myths have grown up around RDF. Some, unfortunately, were based on the legacy of how RDF was first introduced and described. Other myths arise from incomplete understanding of RDF’s multiple roles as a framework, data model, and basis for vocabularies and conceptual descriptions of the world.

The accompanying table lists the “Top Ten” of myths I have found to date. I welcome other pet submissions. Perhaps soon we can get to the point of a clearer understanding of RDF.

‘Top Ten’ Myths of RDF
  1. RDF is equivalent to XML — perhaps the biggest PR error in RDF’s first introduction was to tie RDF so closely with XML. RDF is a data model as described herein that has no dependence on XML and exists in abstract form separate from it
  2. RDF is written or expressed in XML — in a related way, RDF can be serialized (expressed) in many forms other than XML
  3. RDF and OWL are independent — OWL is a language grounded in RDF and a natural extension of the RDF “stack”; OWL is at the full expressiveness end of the spectrum [22, 23]
  4. RDF is a serialization — no; XML is a serialization, RDF is a data model, framework and basis for constructing vocabularies
  5. Basic RDF has no semantics — though limited and purposefully free, basic RDF in fact has extremely well considered semantics; an essential document for any practitioner is [3]
  6. RDF is too complex — it depends, right? At the level of the basic triple, RDF is extremely simple and is the best place to start learning about RDF
  7. RDF is too simple — it depends, right? At the level of OWL ontologies, RDF can capture virtually any relationship and aspect of the world; see [5] for a great start
  8. RDF is useful for “large” datasets only — the real purpose of RDF is data interoperability, which is needed any time two or more datasets are combined, regardless of size
  9. (Conversely and paradoxically), RDF is not scalable — this premise is still being tested, but we now have very large-scale experience with the government and in the Billion Triples Challenge
  10. RDF is not performant — daily we keep learning more about optimizations, query and re-write strategies, and the like. Orri Erling [24] does some of the best work around in this area and writes lucid explanations on his blog. Moreover, RDF systems are easily embedded in WOA architectures, which prove themselves daily at global Web scales.

Conclusion

Emergence is the way complex systems arise out of a multiplicity of relatively simple interactions, exhibiting new and unforeseen properties in the process. RDF is an emergent model. It begins as simple “fact” statements in the form of triples, which may then be combined and expanded into ever-more complex structures and stories.

As an internal, canonical data model, RDF has advantages over any other approach. We can represent, describe, combine, extend and adapt data and their organizational schema flexibly and at will. We can explore and analyze in ways not easily available with other models.

And, importantly, we can do all of this without the need to change what already exists. We can augment our existing relational data stores, and transfer and represent our current information as we always have.

We can truly call RDF a disruptive data model or framework. But it is disruptive without disrupting what exists in the slightest. And that is a most remarkable achievement.


[1] Actually, it is just a few weeks past. The first RDF specification was published as: Ora Lassila and Ralph R. Swick, eds., 1999. “Resource Description Framework (RDF) Model and Syntax Specification,” W3C Recommendation, 22 February 1999; see http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/. Of course, RDF had been in development under various names for some time. To my knowledge, the first public explanation specific to the RDF name was by Tim Bray, “RDF and Metadata,” on XML.com, June 09, 1998; see http://www.xml.com/pub/a/98/06/rdf.html. I’m measuring RDF’s birthday in relation to it being published as an official standard (recommendation) per the first reference.
[2] I first recommend an older introduction by Ian Davis, http://research.talis.com/2005/rdf-intro/. There is a more recent, shorter version by Davis and Tom Heath, The 30 Minute Guide to RDF and Linked Data, at http://www.slideshare.net/iandavis/30-minute-guide-to-rdf-and-linked-data. Also, Joshua Tauberer’s write-up at http://www.rdfabout.com/intro/? is quite excellent.
[3] Patrick Hayes, 2004. “RDF Semantics,” a W3C Recommendation, February 2004. See http://www.w3.org/TR/rdf-mt/.
[4] Graham Klyne, 2004. “Semantic Web and RDF,” on the Nine by Nine Web site (http://www.ninebynine.net/), 4 May 2004; see http://www.ninebynine.org/Presentations/20040505-KelvinInsitute.pdf.
[5] The soon-to-be-released recommendation of OWL 2 is best introduced through the recent: OWL 2 Working Group, eds., 2009. “OWL 2 Web Ontology Language: Document Overview,” W3C Working Draft, 27 March 2009; see http://www.w3.org/TR/owl2-overview/.
[6] The canonical data model is especially prevalent in enterprise application integration. An interesting animated visualization of the canonical data model may be found at: http://soa-eda.blogspot.com/2008/03/canonical-data-model-visualized.html.
[7] Still, my own Sweet Tools listing of RDF and -related tools now contains nearly 800 items.
[8] The RDF Advantages Page; see http://www.w3.org/RDF/advantages.html.
[9] See, for example, Neno, the Semantic Web Programming Environment, at: http://neno.lanl.gov/Home.html; and Ripple, at http://code.google.com/p/ripple/. The developers of these systems are now combining efforts.
[10] Here are some useful starting points for RDF at the World Wide Web Consortium (W3C): Begin at the W3C’s ESW wiki. The Linking Open Data community maintains its own people and projects listings as well. Current topics are discussed on the W3C’s semantic Web mailing lists. The W3C maintains a good general listing of semantic Web tools, with specific listings of RDF Triplestores.
[11] Michael Bergman, 2009. “‘Structs’: Naïve Data Formats and the ABox,” on the AI3 blog, January 22, 2009; see https://www.mkbergman.com/?p=471. And, Ibid, 2009. “Making Linked Data Reasonable using Description Logics, Part 4,” on the AI3 blog, February 23, 2009; see https://www.mkbergman.com/?p=478.
[12] Harry Halpin, video interview with Marcos Caceres, “GRDDL, Bridging the Interwebs?,” August 4, 2008, on StandardsSuck.org. See http://standardssuck.org/grddl-bridging-the-interwebs.
[13] See, for example, these Virtuoso RDF cartridges (http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtSponger) or listing of RDFizers (http://simile.mit.edu/wiki/RDFizers).
[14] OpenLink Software, 2009. “17.6. Inference Rules & Reasoning,” part of the online Virtuoso User Manual; see: http://docs.openlinksw.com/virtuoso/rdfsparqlrule.html.
[15] Andrew Newman, 2007. “A Relational View of the Semantic Web,” published on XML.com, March 14, 2007; see http://www.xml.com/pub/a/2007/03/14/a-relational-view-of-the-semantic-web.html.
[16] OpenLink Software, 2009. “17.4.3. RDF Views over RDBMS Data Source,” part of the online Virtuoso User Manual; see: http://docs.openlinksw.com/virtuoso/rdfsparqlintegrationmiddleware.html#rdfviews. Also see OpenLink Software, 2007. Virtuoso RDF Views — Getting Started Guide, v1.1, June 2007; see http://virtuoso.openlinksw.com/Whitepapers/pdf/Virtuoso_SQL_to_RDF_Mapping.pdf.
[17] W3C, 2009. RDB2RDF Working Group Charter, revised February 24, 2009; see http://www.w3.org/2005/Incubator/rdb2rdf/WG-draft-charter/.
[18] See further my various blog posts on Web-oriented architecture (WOA).
[19] Especially recommended as an introductory tutorial is: Lee Feigenbaum, 2008. “SPARQL By Example: A Tutorial,” Sept. 17, 2008; see http://www.cambridgesemantics.com/2008/09/sparql-by-example.
[20] Many disciplines are embracing RDF. But, in biology, some exemplar projects are the Bio2RDF genomics project; the Linking Open Drug Data (LODD) initiative, which is a sub-project of the W3C’s broader Health Care and Life Sciences Interest Group (HCLSIG); the Neurocommons project; and the RDF branches of the Open Biomedical Ontologies (OBO) project and foundry.
[21] A very nice visualization of graph-driven structures in relation to information discovery and navigation is provided by Rama Hoetzlein, 2007. Quanta: The Organization of Human Knowledge: Systems for Interdisciplinary Research, a Master’s Thesis, University of California, Santa Barbara, June 2007; see http://www.rchoetzlein.com/quanta/index.htm.
[22] The original phrasing of this Myth used the term “distinct”, which Ted Thibodeau Jr rightly questioned. This myth goes to the heart of what I think is a false separation of the RDF and OWL “camps”. As the intro noted, I see a natural progression from RDF –> RDFS –> OWL, with the transition representing more precise semantics and expressiveness. Describing simple things simply, especially for linked data as mostly practiced, works well in RDF and RDFS. Once world views and conceptual schema are desired for inter-relating these things, RDFS and OWL become the better option. OWL Full (including OWL 2, see [23]) is fully grounded in RDF semantics. However, since OWL Full is not decidable, a subset of that, OWL DL, is still expressible as RDF but now consistent with description logics. This approach can provide more inferencing and reasoning power, at the slight cost of greater care in the semantics used and relationships asserted. In the end, the “distinction” between RDF and OWL is really a difference in use cases and intentions, imo.
[23] Michael Schneider, ed., 2009. OWL 2 Web Ontology Language RDF-Based Semantics, W3C Working Draft 21 April 2009; see http://www.w3.org/TR/owl2-rdf-based-semantics/.
[24] See Orri Erling’s Weblog at: http://www.openlinksw.com/weblogs/oerling/.

Posted:April 3, 2009

Whether You Call it ROA or WOA, It Simply Works

Brian Sletten, newly with Riot Games, has just completed his four-part REST for Java developers series on JavaWorld.com. The series, which began last October, was wrapped up this week.

The series is highly readable and a real keeper:

The printable versions are the easiest to read since you don’t have to get stuck in JavaWorld’s annoying document-splitting style.

Brian prefers to use Sam Ruby’s term of resource-oriented architecture (ROA), though I have preferred to use Nick Gall’s Web-oriented architecture (WOA) nomenclature. In any case, the series is highly informative and clearly written and is sufficiently general in Parts 1 and 4 and most of Part 3 to be of use to non-Java developers.

What is quite interesting is that Brian also makes the connection between a REST Web service style and linked data in Part 4, as I suggested a few months back. (In fact, for those familiar with REST, I recommend you start with Part 4.) Again, I think RESTful Web services in combination with RDF and linked data point to the winning and performant architecture of the foreseeable future.

Thanks for a great series, Brian!

Posted by AI3's author, Mike Bergman Posted on April 3, 2009 at 11:58 am in Linked Data, Structured Web, Web-oriented Architecture | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/482/great-series-on-rest-and-resource-oriented-architecture/
Posted:March 27, 2009

The Recent ‘The Unreasonable Effectiveness of Data’ Provides Important Hints

To even the most casual Web searcher, it must now be evident that Google is constantly introducing new structure into its search results. This past week three world-class computer scientists, all now research directors or scientists at Google, Alon Halevy, Peter Norvig and Fernando Pereira, published an opinion piece in the March/April 2009 issue of IEEE Intelligent Systems titled, ‘The Unreasonable Effectiveness of Data.’ It provides important framing and hints for what next may emerge in semantics from the Google search engine.

I had earlier covered Halevy and Google’s work on the deep Web. In this new piece, the authors describe the use of simple models working on very large amounts of data as a means to trump fancier and more complicated algorithms.

“Unfortunately, the fact that the word ‘semantic’ appears in both ‘Semantic Web’ and ‘semantic interpretation’ means that the two problems have often been conflated, causing needless and endless consternation and confusion. The ‘semantics’ in Semantic Web services is embodied in the code that implements those services in accordance with the specifications expressed by the relevant ontologies and attached informal documentation.”

Some of the research they cite is related to WebTables [1] and similar efforts to extract structure from Web-scale data. The authors describe the use of such systems to create ‘schemata’ of attributes related to various types of instance records — in essence, figuring out the structure of ABoxes [2], for leading instance types such as companies or automobiles [3].

The authors call this the semantic interpretation problem and contrast it with the Semantic Web. They generalize it as being amenable to a kind of simple, brute-force, Web-scale analysis: “Relying on overt statistics of words and word co-occurrences has the further advantage that we can estimate models in an amount of time proportional to available data and can often parallelize them easily. So, learning from the Web becomes naturally scalable.”

Google had earlier posted their 1 terabyte database of n-grams, and I tend to agree that such large-scale incidence mining can lead to tremendous insights and advantages. The authors also helpfully point out that certain scale thresholds occur for doing such analysis, such that researchers need not have access to indexes at the scale of Google’s to do meaningful work or to make meaningful advances. (Good news for the rest of us!)
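
As a rough illustration of the “overt statistics of words and word co-occurrences” the authors have in mind, here is a minimal Python sketch that counts n-grams over a toy corpus. The function name and the sample sentences are my own assumptions for illustration; this is not how Google built its n-gram database.

```python
from collections import Counter

def count_ngrams(sentences, n=2):
    """Count word n-grams across an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        # Slide a window of n words across the sentence.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Toy corpus (hypothetical); a real one would be Web-scale.
corpus = [
    "the ball is round",
    "the ball is red",
    "Dick sees the ball",
]
bigrams = count_ngrams(corpus, n=2)
print(bigrams.most_common(2))
```

Because each sentence is counted on its own and the counts are simply summed, the work can be split across machines and merged afterwards, which is the kind of easy parallelization the quote refers to.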

As the authors challenge:

  1. Choose a representation that can use unsupervised learning on unlabeled data
  2. Represent the data with a non-parametric model, and
  3. Trust the important concepts will emerge from this analysis because human language has already evolved words for it.

My very strong suspicion is that we will see — and quickly — much more structured data for instance types (the ‘ABox’) rapidly emerge from Google in the coming weeks. They have the insights and approaches down, and clearly they have the data to drive the analysis! I also suspect many of these structured additions will just simply show up on the results listings to little fanfare.

The structured Web is growing all around us like stalagmites in a cave!


[1] Michael J. Cafarella, Alon Y. Halevy, Daisy Zhe Wang, Eugene Wu and Yang Zhang, 2008. “WebTables: Exploring the Power of Tables on the Web,” in the 34th International Conference on Very Large Databases (VLDB), Auckland, New Zealand, 2008. See http://web.mit.edu/y_z/www/papers/webtables-vldb08.pdf.
[2] As per our standard use:

"Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts."
[3] I very much like the authors’ use of ‘schemata’ as the way to describe the attribute structure of various instance record types for the ABox, in contrast to the more appropriate ‘ontology’ applied to the TBox.
Posted:March 10, 2009

I’ve tried to avoid the general frenzy, but please see:

http://blog.cyc.com/2009/03/wolfram-alpha.html

“It handles a much wider range of queries than Cyc, but much narrower than Google; it understands some of what it is displaying as an answer, but only some of it . . .”
“I would invest in this, literally and figuratively. If it is not gobbled up by one of the existing industry superpowers, his company may well grow to become one of them in a small number of years, with most of us setting our default browser to be Wolfram Alpha.”

Posted by AI3's author, Mike Bergman Posted on March 10, 2009 at 8:30 pm in Adaptive Information, Adaptive Innovation | Comments (0)
The URI link reference to this post is: https://www.mkbergman.com/480/lenatcyc-reaction-to-wolfram-alpha/
Posted:February 23, 2009

Concluding with a Simplified Instance Record Vocabulary for Linked Data ABoxes

In Part 1 of this series, I advocated the placement of linked data in an ABox construct from description logics [1] based on a separation of concerns argument. In Part 2, I reinforced that argument from the perspective of the work to be done within a knowledge base. In Part 3 we surveyed some of the key literature, finding justification for the split of the TBox from the ABox and the use of specialty RDFS and OWL dialects for work-oriented reasoning in the context of an integral logics.

We now conclude this series and try to bring these threads full circle to address what might be a vocabulary for an ABox instance record design. We’d very much like to thank Dr. Jim Pitman of the Bibliographic Knowledge Network project for having stimulated much of the thinking about the benefits and design of simple, human-authored and -readable instance records.

A Re-cap

Up until about six to eight months ago Fred Giasson and I were spending much of our thinking and design time on UMBEL, ontologies and what we now more precisely define as the TBox. Our intent all along was to get our process and thinking down pat there, and then turn ourselves to the representation of the actual entity data.

We have wanted to keep data records separate from logic and structure all along. Some clients have their own specific data records but may still want to interact with Web stuff or apply similar logic. Moreover, some client data is proprietary, some public. By organizing the data into “named entity dictionaries” we could modularize the architecture to allow swapping in and out of data appropriate to the customer or circumstance at hand.

Our initial design of this and what we share publicly has UMBEL and various standard public ontologies (FOAF, DC, SIOC, BIBO, etc) for the TBox, with Wikipedia entities and stuff from the BBC at the entity level (the ABox).

However, earlier work with another client showed us that our initial named entity structure was not sufficiently general or robust. That company’s records have complex relationships, such as affiliations for entities embedded in the same data record.

In order to improve the design, we went back to the drawing board to see if we could find guidance from the literature and other researchers as to how to “best” architect instance data in relation to the logic in the TBox (though we were not yet thinking about and framing our questions in terms of description logics, or DL).

This series of postings itself, and some of its predecessor articles, were motivated by probing the description logics space and the guidance it might provide to help determine performant architectures and designs.

Folks, We’re Making Linked Data Just Too Tough

For linked data to become truly successful, we need to find easier ways for data publishers to write, expose and share structured data on the Web.

As anyone who reads my blog knows, I frequently rail against poor semantics or other aspects of the linked data space that I feel are counterproductive. At the same time, I’d like to think that I am also a vocal advocate and proponent for linked data. I am indeed a fan.

To me, the fundamental precepts of RDF as a data model able to capture virtually any data structure or relationship, and the use of Web URIs as linkable identifiers for a global ‘Web of Data’, are simply foundational and game changing. Stuff like this quickens my pulse.

But look at what it takes someone today to publish linked data:

  • He must understand the terminology and standards and best practices — and actually, even amongst current practitioners, few do
  • She must assign Web identifiers (URIs) to her data objects, which means finding them and making them (gawd, I hate this word) “dereferenceable”
  • He must understand the semantics of the relationships and linkages his data asserts (which, unfortunately, many don’t)
  • She must present her data in serialized subject-predicate-object “triples”, which are arcane and difficult for most to understand, and
  • They both often confuse data and instances with structure and world views.

Now, come on. This is not the recipe for success.

Simple and unbreakable and forgiving is the recipe for success.

As I noted in an earlier posting, there are many different data structures (‘structs’) for describing and conveying (transmitting) data records. Most of these are easy to understand and easy to read. We know that microformats have tried to capture a part of this space, but so, in other ways, have data serializations such as JSON. What can we learn from such formats?

Well, one thing I have learned is that many on the Web positively want to expose their data. Another thing I have learned is that much structured data will not get exposed unless the hurdles to publishing it are small.

Revenge of the ABox

The phrase ‘revenge of the ABox’ comes from Heiko Stoermer’s thesis [2]; it conveys well, I think, the fact that everyone wants to capture and structure “world views” via ontologies and the big picture, but many do not want to grub around at the level of individual instances and data records. As he states, “. . . the most valuable knowledge is typically the one about individuals, but research on ontology integration has traditionally concentrated on concepts and relations.”

(The perverse outcome of this is that even though linked data as practiced to date is almost 100% about instance data, the discussion rarely looks at ABox-level work or instance data integrity.)

As this series and its predecessor posts have argued, description logics (DL) is an excellent guiding framework for how to make architectural and design decisions about linked data. DL and the ABox–TBox split have meshed beautifully with our earlier intuition to split ontologies and a structural and organizational view of the world (TBox) from the instance records (ABox, or what we had been calling internally our ‘named entity dictionaries’).

As this four-part series and its predecessor pieces indicate, not only can we gain better conceptual understanding and realization of some of this semantic Web stuff by using DL, but also, perhaps, many of today’s silly or inefficient design practices may be remedied by better grounding our architectures in these logics.

One area, for example, that has helped us much is to get away from the confusing terminology of ‘individuals’ v ‘instances’. Once we come to see an instance record as just that (so, that is why collections can play on an equal footing with individual things, for example), we now only need worry about asserting the attributes of the instance. We can defer all of the logic and reasoning about individuals and members and sets and collections and classes, etc., to the TBox and just get on with capturing and conveying our instance record, as an ABox.

For this reason alone (but there are others), Structured Dynamics has now abandoned the terminology of a ‘named entity dictionary’ in favor of ‘instance dictionaries’ or ABox (either of which is understood to contain one or more instance records).

The ‘Instance Record’

An instance record is simply a means to either represent or convey the information (“attributes”) of a given instance. An instance is the thing at hand, and need not represent an individual; it could, for example, represent the entire holdings or collection of books in a given library.

An instance record may convey information about multiple instances, but each block of information for each instance is about that instance alone. Thus, for example, if the instance is a paper citation, the instance is the paper. If as attributes it asserts multiple authors, each with different institutional affiliations, those affiliations get asserted in a separate instance for each author. They are attributes of the authors, not of the paper.

In this manner it is easy to see attributes as only pertaining to a given instance. If the overall information to be conveyed discusses attributes for multiple instances, then the instance record presents in series each instance that is characterized.

The Simplicity of Key-Value Pairs

The objective is to make it easy for data owners to write, read and publish data. This means the starting format should be a human readable, easily writable means for authoring and conveying these instance records (that is, instances and their attributes and assigned values).

The simplest, naïve format (independent of syntax or serialization) is the key-value (name-value) pair. In the key-value pair, the subject is always implied. So, for me, MikeBergman, as the subject:

first_name:Mike
sex:male
citizenship:USA
town:Iowa City

Because an instance record only describes attributes for a single instance at a time, all assertions can easily be transformed into the subject-predicate-object (spo) “triples” of RDF. So,

<subject:MikeBergman> <hasFirstName> <Mike>
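
The transformation from key-value pairs to triples can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the rule that turns a key such as first_name into a predicate such as hasFirstName are hypothetical conventions of mine, not part of any published specification.

```python
def record_to_triples(subject, pairs):
    """Expand key-value pairs into (subject, predicate, object) triples.

    The subject is implied by the record, so it is supplied once and
    repeated for every pair.
    """
    triples = []
    for key, value in pairs.items():
        # Hypothetical naming convention: first_name -> hasFirstName
        predicate = "has" + "".join(part.capitalize() for part in key.split("_"))
        triples.append((subject, predicate, value))
    return triples

mike = {
    "first_name": "Mike",
    "sex": "male",
    "citizenship": "USA",
    "town": "Iowa City",
}
for s, p, o in record_to_triples("MikeBergman", mike):
    print(f"<{s}> <{p}> <{o}>")
```

All values here stay as plain literals; swapping any of them for URIs is left to later services.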

Now, of course, in conventional linked data many of these entries need to be expressed as URIs in order to “define” the item. Our design allows for that, of course, but also allows the user to simply provide literals (that is, not identifiers, but text strings or numeric or actual values) for each item. Thus, a “new” attribute is declared simply by using it, with its value declared just as simply.

Separate, specialized services (see below) may be (and often will need to be!) employed to look up and de-reference URIs, do datatype or data instance validation checks, evaluate identity relationships, disambiguate terms and so forth. The data supplier may choose to publish more-or-less complete “records” on their own, or they may not.

Through this design, nothing need change with regard to how linked data is being done today (other than the addition of some simple converters to accommodate the new format; see below). But, by shifting testing and validation work to external services, we can make it much easier for more data to get exposed and published. It is now time for linked data intermediaries and services to evolve in the linked data ecosystem.

In its most naïve form, this key-value pair format allows for fast and easy instance record creation with the ability to create instances and new attributes on the fly. Sure, these assertions need to be checked, but so does most data when it is asked to participate in any meaningful work.

This simple design, then, is very much in keeping with the limited roles and work associated with an ABox. Only attributes and metadata for an instance are being asserted. Conceptual relationships, and any specialized work that might be applied against the ABox to determine data validity and the like, are shifted outside the instance record, where they properly and logically belong.

Relation to RDF

In Part 3 we discussed how fragments of the RDF and OWL languages can be used for specialized purposes within a knowledge base while keeping the overall logics of the system integral and decidable. Clearly, this instance record approach where the sole purpose is to assert attributes and values for an instance does not require any OWL. In fact, most linked data to date only brings OWL into the picture for the owl:sameAs property, the common errors of which we discussed in Part 2.

The instance record only requires a small subset of the RDF language. But it does require use of RDFS (Schema) because of the appropriate use of datatypes within the instance data record.

At the level of the TBox and the “specialized work” areas, there are other fragments of OWL, now called profiles in the soon to be released OWL 2 [3], that similarly can be applied to areas such as instance checking and validation, identity relation testing, etc., that I mentioned above. In other words, we can logically fragment RDF and OWL to do the individual parts of a complete system in order to simplify things and aid performance and computational efficiency.

The Instance Record Vocabulary

We are implementing this design internally through what we call the Instance Record Vocabulary (QName: irv). It is still quite experimental and we are testing some important aspects, some of which we describe below. As we get these nuances worked out better, we will release this vocabulary publicly for anyone to use and comment on.

As we presently see it, the namespaces required for IRV are RDF, RDFS, DCterms and XSD. RDFS (Schema) is required, at minimum, because of the incorporation of XML Schema datatypes (XSD), which we think is a desirable requirement for what is, after all, an instance data specification and transfer protocol. However, the actual RDF and RDFS vocabulary used would be extremely minimal, with no OWL required.

In pseudo-form, with many serializations and simple syntaxes possible, this Instance Record Vocabulary has the following properties. Note as discussed above that the <s> in spo is implied. Thus, in its naïve or handwritten form, it could be expressed in pretty simple key-value pairs:

<InstanceRecord>
  <Instance>
    <hasLabel> <[literal]> @en
    <hasAltLabel> <[literal]> @en
    <hasURI> <[URI]>
    <hasDescription> <[literal]> @en
    <Attribute>
      <hasAttribute1> <[literal with optional XSD (@en) or URI]>
      <hasAttribute2> <[literal with optional XSD (@en) or URI]>
      <hasAttribute3> <[literal with optional XSD (@en) or URI]>
      <hasAttributeX> <[literal with optional XSD (@en) or URI]>
    </Attribute>
    <assertIdentity> <[literal or URI]>
    <assertType> <[literal or URI]>
    <hasSource> <[literal or URI]>
    <hasVetting> <[literal or URI]>
  </Instance>
  <Instance>
    . . . repeat as needed . . .
  </Instance>
</InstanceRecord>

Note that most values allow either literal or URI specifications. Some of the properties are obviously optional, others, such as hasLabel, will be required. hasURI, for example, is one case of an optional property that then may require a separate lookup service to complete it as a linked data record.
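
To make the required-versus-optional distinction concrete, here is a minimal Python sketch of a check over a single instance held as a dictionary. Everything in it is an assumption for illustration: the text above only names hasLabel as required, so treating all other properties as optional, the "attributes" key and the function name are my own choices, and the vocabulary itself is still experimental.

```python
# Assumption: only hasLabel is required; the source names no other required property.
REQUIRED = {"hasLabel"}
OPTIONAL = {
    "hasAltLabel", "hasURI", "hasDescription", "attributes",
    "assertIdentity", "assertType", "hasSource", "hasVetting",
}

def check_instance(instance):
    """Return (errors, needs_uri_lookup) for one instance given as a dict."""
    present = set(instance)
    errors = [f"missing required property: {name}"
              for name in sorted(REQUIRED - present)]
    errors += [f"unknown property: {name}"
               for name in sorted(present - REQUIRED - OPTIONAL)]
    # A missing hasURI is not an error; it only flags the record for a
    # separate lookup service before it can serve as a linked data record.
    needs_uri_lookup = "hasURI" not in instance
    return errors, needs_uri_lookup

record = {"hasLabel": "Mike Bergman", "attributes": {"town": "Iowa City"}}
print(check_instance(record))
print(check_instance({"hasURI": "http://example.org/id/mike"}))
```

The first record passes but is flagged for URI lookup; the second has a URI but fails for lack of a label.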

Instance records with literal specifications would need to be validated and checked before actually being used for standard linked data or meaningful data purposes. However, this approach is already well proven through, for example, OpenLink’s Virtuoso Sponger cartridges and design. Sure, some work would need to be done at time of ingest, but there are no technical challenges.

The language used to write a literal can be specified for any kind of attribute (metadata or not). The language is specified using the “@lang-tag” at the end of the literal. This method is similar to the N3 serialization of RDF, which is also equivalent to the XML serialization of RDF using the “xml:lang” attribute.
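
A parser for the “@lang-tag” convention might look something like the following Python sketch. The function name and the pattern used to recognize a tag are assumptions of mine, and the check is only a heuristic: a literal that merely ends in something tag-like (for example user@host) would be misread, so a real parser would need a stricter quoting rule.

```python
import re

# Rough shape of a language tag such as 'en' or 'fr-CA' (heuristic only).
_LANG_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")

def split_lang(literal, default=None):
    """Split 'text@lang' into (text, lang); lang is `default` if no tag."""
    text, sep, tag = literal.rpartition("@")
    if sep and text.strip() and _LANG_TAG.fullmatch(tag):
        # Language tags are case-insensitive, so normalize to lower case.
        return text.rstrip(), tag.lower()
    return literal, default

print(split_lang("Iowa City @en"))
print(split_lang("Iowa City"))
print(split_lang("mike@example.com"))
```

The last call returns the literal untouched because example.com does not look like a language tag.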

Metadata

Most of the first properties are simply metadata describing the instance. The strings could be qualified by language.

Attributes

The bulk of the instance record is devoted to the attributes and their values. Attributes could be optionally declared with XSD datatypes. URI references could be specified or later substituted by vetting services (see below).

Attributes could also optionally be characterized in a list format, similar to the Lists specification for Notation 3 (N3).

Asserted Relations

Identity and class membership (rdf:type) assertions could be made; these could later be checked for correctness or identity relations with external or specialized services. The assertIdentity property, in particular, is the replacement for owl:sameAs, with more appropriate ABox semantics.

hasSource

A separate Source record is being developed to cover source or dataset characterizations. A single instance extraction from a Web page, for example, would be accompanied by a simple source characterization. Instances of particular types, such as microformats for example, would be so noted (as they might invoke specialized processors or carry certain authority). Instances from large datasets would have a still longer list of possible characterizations.

The design of this property may draw closely on what is also being done for the voiD dataset vocabulary.

Certain parameters in a Source record, such as language for example, may also be applied in special ways by the IRV parser at time of ingest with respect to specific literal specifications.

In any event, this is one of the properties still needing much more thought and definition.

hasVetting

This property, too, needs much more thought and definition.

The hasVetting property, for which multiples are allowed, would identify the specific checks and services applied to the instance data. Depending on service, such checks might include URI lookup or de-referencing, identity relations and testing, record completeness and sufficiency checks, data type checking and validation, general instance checking, disambiguation, and so forth (see “Specialized Work” below).

Some services might also re-write the instance record with corrected values or URIs returned in place of literals.

Best practice for external services would suggest identifying them by URI, though literals would also be allowed to identify internal checks or for lookup purposes.

This property is meant to be a key indicator of how third parties may want to rely on the data. Combined with hasSource, these hasVetting entries provide essential authority and provenance information about the data at hand.
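One way to picture how hasVetting entries might accumulate is the Python sketch below. It assumes a record is a plain dictionary, that each check is a function returning true or false, and that only checks that pass are recorded; all three are our assumptions, since the property is still being defined. The service identifiers, including the example.org URI, are hypothetical.

```python
def apply_vetting(record, checks):
    """Run (service_id, check) pairs and note each passing check in hasVetting.

    The input record is left untouched; a vetted copy is returned.
    """
    vetted = dict(record)
    vetted["hasVetting"] = list(record.get("hasVetting", []))
    for service_id, check in checks:
        if check(vetted):
            vetted["hasVetting"].append(service_id)
    return vetted

# Hypothetical example: one internal check named by a literal,
# one external service named by URI (here only a stand-in syntax test).
record = {"hasLabel": "Jane", "hasURI": "http://example.org/id/jane"}
checks = [
    ("label-present", lambda r: bool(r.get("hasLabel"))),
    ("http://example.org/services/uri-syntax",
     lambda r: r.get("hasURI", "").startswith("http://")),
]
vetted = apply_vetting(record, checks)
```

A third party reading the vetted record can then see which checks the data has been through before deciding how far to rely on it.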

Putting it All Together

This diagram attempts to show the relationship of how many of these pieces may interact:

Information flow to the ABox

Some of these bubbles deserve some additional commentary.

Hand-crafted Input

An important objective in this design is to allow naïve, simple text specifications to be hand-crafted for instance records. There are many relatively simple formats for specifying key-value pairs with relatively few conventions, ranging from BibTeX to YAML and JSON and others. There are literally hundreds of such formats available, as my earlier overview of Naïve Representations and Structs discussed.

There may be justification for still another form in relation to this Instance Record Vocabulary or not; this topic is still under active discussion.
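To give a feel for how little machinery a hand-crafted format needs, here is a sketch in Python of a parser for one possible key-value layout. The layout itself ("key: value" lines, a blank line between records, "#" for comments) is purely hypothetical, as is the sees attribute in the sample; no format has been settled on.

```python
SAMPLE = """\
# two hand-written instance records
hasLabel: Jane
hasURI: http://example.org/id/jane
sees: Dick

hasLabel: Dick
"""

def parse_records(text):
    """Parse 'key: value' lines into a list of records.

    Each record maps a key to the list of its values, so a repeated key
    simply yields more than one value.
    """
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the record in progress.
            if current:
                records.append(current)
                current = {}
            continue
        if line.startswith("#"):
            continue
        # Split on the first colon only, so URI values keep their own colons.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got {line!r}")
        current.setdefault(key.strip(), []).append(value.strip())
    if current:
        records.append(current)
    return records

records = parse_records(SAMPLE)
```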

External Structs

However, whether there is a separate format or not, that same earlier piece overviewed the many simple data structs presently out there. It also noted the nearly 100 existing converters for these forms to RDF. These same converters, with quite slight modifications, could all output the Instance Record Vocabulary in an appropriate serialization as well.

Hooks to Functional and Scripting Languages

Another option is to combine this design with a functional language front-end to generate these records. (Though they could be produced in other ways, as well.) For example, lambda calculus or even a domain-specific language (DSL) could be used to create this very simple record generator. This simple system, in turn, could have a straightforward API that would allow existing scripting languages (such as Python or others) to be used as well.
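As an illustration only, a record generator in a functional style can be very small. In the Python sketch below each step is a function from record to record, and a record is built by folding the steps over an empty dictionary. The names with_attr and build_record are hypothetical and do not come from any existing library.

```python
from functools import reduce

def with_attr(key, value):
    """Return a step that adds one value for `key` without mutating its input."""
    def step(record):
        return {**record, key: record.get(key, []) + [value]}
    return step

def build_record(*steps):
    """Apply the steps in order, starting from an empty record."""
    return reduce(lambda record, step: step(record), steps, {})

jane = build_record(
    with_attr("hasLabel", "Jane"),
    with_attr("hasURI", "http://example.org/id/jane"),
    with_attr("hasLabel", "Janie"),
)
```

Because each step is an ordinary function, a scripting language could expose the same steps through a thin API and reuse them unchanged.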

Specialized Work

So, in fact, we can also now see the specialized work (see also Part 2) that itself is not part of the ABox but can and often should be applied to the instance data in the ABox:

  • Record sufficiency checking
  • De-duplication
  • Membership testing
  • Most specific concept identifying
  • Datatype checking
  • Identity relation testing
  • New attribute checking
  • ABox consistency testing
  • Data range checking
  • Disambiguation
  • Source-specific testing
  • Uniqueness testing
  • URI lookup
  • URI de-referencing
  • Satisfiability checking
  • Others . . .

Though, strictly speaking, such specialty work could be seen to occur at the TBox level, it is actually different and separate logic from “standard” inferencing or reasoning. Specialized work can therefore often occur as separate tests or in batch mode with fragments of OWL or other dedicated indexes and algorithms. Some of this specialized work may take advantage of the conceptual relationships in the TBox, but may not necessarily need to do so. In these ways, the inferencing work of the TBox can be kept clean and efficient.
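Two items from the list above, record sufficiency checking and uniqueness testing, can be sketched in Python as ordinary batch functions over instance records, with no reasoner involved. The sketch assumes each record is a dictionary of single values; the function names and the choice of hasURI as the uniqueness key are our own illustration.

```python
REQUIRED = ("hasLabel",)  # hasLabel is the one property named above as required

def insufficient(records):
    """Record sufficiency check: return records missing a required property."""
    return [r for r in records if any(not r.get(p) for p in REQUIRED)]

def duplicates(records, key="hasURI"):
    """Uniqueness test: group records that share the same value for `key`."""
    seen = {}
    for r in records:
        value = r.get(key)
        if value is not None:
            seen.setdefault(value, []).append(r)
    return {v: group for v, group in seen.items() if len(group) > 1}

batch = [
    {"hasLabel": "Jane", "hasURI": "http://example.org/id/jane"},
    {"hasLabel": "Jane D.", "hasURI": "http://example.org/id/jane"},
    {"hasURI": "http://example.org/id/dick"},
]
missing_label = insufficient(batch)
shared_uri = duplicates(batch)
```

Checks of this kind touch only the instance data, which is what lets them run in batch apart from TBox inferencing.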

Beyond Browsing and Unvalidated Queries

Today, linked data has largely been used for browsing and providing unvalidated responses to queries; focus and attention to its ABox roles are important to move beyond this baseline into meaningful work [2]. In those limited instances where this linked data has been looked at and evaluated as a complete knowledge base, such as the SWSE search engine with the SAOR approach as discussed in Part 3, more than 97% of the RDF triples provided in those cases were removed from consideration, often for logical or mis-assertion reasons [4].

The ideas presented here for a simpler linked data specification that can be easily represented in readable text are not new. RDF in JSON has been looked at in this way by Talis and JDIL, YAML has been looked at similarly, and similar and simpler approaches have been looked at closely for topic maps. There are other examples.

A key thrust of these efforts is to make it easier for the data publisher, thereby encouraging the exposure of more structured data.

These emerging ideas do not change in any way the usefulness of current linked data. Our suggested approach interoperates seamlessly with current practices and easily co-resides with them. But, these ideas do:

  • Provide a simpler path for writing and publishing human-readable instance data
  • Provide an ABox instance record structure that can have much specialized work applied against it in a consistent way, and
  • Contribute to an overall logic and architecture that is performant and scalable for doing meaningful work.

Though still needing further thought and refinement, this broad outline of roles, architecture and structure for the ABox supplies the last missing piece of Structured Dynamics’ overall approach to linked data and RDF. Much time, thought and research have gone into it. Again, we’d very much like to thank Jim Pitman for his ideas that have helped catalyze this design [5].

We think the combination of a generalized Instance Record Vocabulary that can be reasoned over for ABox-level data checking, and that works with a simple, text-based key-value pair input format, might be a winning combination.


[1] This is our working definition for description logics:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[2] Heiko Stoermer, 2008. Okkam: Enabling Entity-centric Information Integration in the Semantic Web, Ph.D. thesis presented to the DIT – University of Trento, January 2008, 185 pp. See http://eprints.biblio.unitn.it/archive/00001394/01/dissertation_camera_ready.pdf.
[3] Boris Motik et al., eds., 2008. “OWL 2 Web Ontology Language: Profiles,” a W3C Working Draft, December 2, 2008. See http://www.w3.org/TR/owl2-profiles/.
[4] Aidan Hogan, Andreas Harth and Axel Polleres, 2008. “Scalable Authoritative OWL Reasoning on a Billion Triples,” in Proceedings of Billion Triple Semantic Web Challenge 2008, at the 7th International Semantic Web Conference (ISWC2008), Karlsruhe, Germany, 2008. See http://sw.deri.org/~aidanh/docs/saor_billiontc08.pdf.
[5] This input has come as a result of research supported in part by NSF Award 0835851, Bibliographic Knowledge Network.