Posted: May 29, 2009


Major Report Signals the Emergence of Linked Data into the Enterprise

PricewaterhouseCoopers (PWC) has just published a major 58-pp report on linked data in the enterprise. The report features insightful interviews with many industry practitioners as well as PWC’s own in-depth and thorough research. I think this report is a most significant event: it represents the first mainstream recognition of the potential importance of linked data and semantic Web technologies to the business of data interoperability within the enterprise.

This entire issue is uniformly excellent and well-timed. PWC has done a superb job of assembling the right topics and players. The report has three feature articles interspersed with four in-depth interviews. The target audience is the enterprise CIO with much useful explanation and background. Applications discussed range from standard business intelligence to energy and medicine.

The emphasis on the linked data aspect is a strong one. PWC puts twin emphases on ontologies and the enterprise perspective (naturally). This is a refreshing new perspective for the linked data community, which at times could be accused of being a bit myopic in three ways: 1) a focus on open data only; 2) a focus on instance records (RDF and no OWL, with little discussion of domain or concept ontologies); and 3) an occasional disdain for the business perspective (as opposed to the academic).

PWC has done a great job of getting beyond some of the community's own prejudices in order to couch this in CIO and enterprise terms. This signals to me the transition from the lab to the marketplace, with all of its consequent challenges and advantages.

In short: Bravo! This is a very good piece and will, I think, put PWC ahead of the curve for some time to come.

I was very pleased to have the opportunity to review earlier drafts of this major report. After he had read a couple of my recent papers, on Shaky Semantics and the Advantages and Myths of RDF (the latter is cited in the piece), I had a fruitful dialog with one of the report’s editors, Alan Morrison, who is a manager in PWC’s Center for Technology and Innovation (CTI). He kindly solicited my comments and incorporated some suggestions.

The report also lists 14 semantic technology vendors and service providers. I’m pleased to note that PWC included our small Structured Dynamics firm as part of its listing. Other vendors listed include Cambridge Semantics, Collibra, Metatomix, Microsoft, OpenLink Software, Oracle, Semantic Discovery Systems, Talis, Thomson Reuters, TopQuadrant and Zepheira, with Radar Networks and AdaptiveBlue as the selected service providers.

This report is easy — but important — reading. I personally enjoyed the insights of Frank Chum of Chevron, a new name for me. I encourage all in the field to read and study the entire report closely. I think this report will be an important milestone for the semantic Web in the enterprise for quite some time to come.

After a brief sign-up, the 58-pp report is available for free download.

Posted by AI3's author, Mike Bergman, on May 29, 2009 at 10:04 am in Linked Data, Ontologies, Semantic Web, Structured Dynamics
The URI link reference to this post is: http://www.mkbergman.com/490/pwc-dedicates-quarterly-technology-forecast-to-linked-data/
Posted: May 17, 2009

Structured Dynamics LLC

Ontology Best Practices for Data-driven Applications: Part 2

It is perhaps not surprising that the first substantive post in this occasional series on ontology best practices for data-driven applications begins with the importance of keeping an ABox and TBox split. Structured Dynamics has been beating the tom-tom for quite a while on this topic. We reiterate and expand on this position in this post.

The Relation to Description Logics

Description logics (DL) are one of the key underpinnings to the semantic Web. DL are a logic semantics for knowledge representation (KR) systems based on first-order predicate logic (FOL). They are a kind of logical metalanguage that can help describe and determine (with various logic tests) the consistency, decidability and inferencing power of a given KR language. The semantic Web ontology languages, OWL Lite and OWL DL (which stands for description logics), are based on DL and were themselves outgrowths of earlier DL languages.

Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. It is this construct for which Structured Dynamics generally reserves the term “ontology”.

The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (or individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts. Both the TBox and ABox are consistent with set-theoretic principles.
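To make the split concrete, here is a minimal sketch in plain Python. It is only an illustration: the concept and property names (Author, Person, wrote and so on) are invented for the example and do not come from any Structured Dynamics vocabulary, and a real system would hold these structures in RDF rather than in dictionaries.

```python
# TBox: concepts and their relationships (the schema). All names are hypothetical.
TBOX = {
    "subclass_of": {"Author": "Person", "Person": "Agent"},
    "properties": {"wrote": {"domain": "Author", "range": "Document"}},
}

# ABox: assertions about individuals, kept apart from the schema.
ABOX = {
    "types": {"mike": "Author", "report1": "Document"},  # membership assertions
    "attributes": {"mike": {"name": "Mike"}, "report1": {"pages": 58}},
    "roles": [("mike", "wrote", "report1")],  # roles between instances
}


def abox_concepts_missing_from_tbox(tbox, abox):
    """Return concepts used in ABox membership assertions that the TBox never mentions."""
    known = set(tbox["subclass_of"]) | set(tbox["subclass_of"].values())
    for spec in tbox["properties"].values():
        known |= {spec["domain"], spec["range"]}
    return sorted(set(abox["types"].values()) - known)


print(abox_concepts_missing_from_tbox(TBOX, ABOX))
```

The ABox refers to TBox concepts only by name, so either side can be edited or swapped without rewriting the other.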

Natural and Logical Work Splits

TBox and ABox logic operations differ and their purposes differ. TBox operations are based more on inferencing and tracing or verifying class memberships in the hierarchy (that is, the structural placement or relation of objects in the structure). ABox operations are more rule-based and govern fact checking, instance checking, consistency checking, and the like. ABox reasoning is generally more complex and at a larger scale than that for the TBox.

Early semantic Web systems tended to be very diligent about maintaining these ‘box’ distinctions of purpose, logic and treatment. One might argue, as Structured Dynamics does, that the usefulness and basis for these splits has been lost somewhat more recently.

Particularly as linked data becomes more prevalent, questions of scale and actual interoperability are posing real pragmatic challenges. To aid this thinking, we have re-assembled, re-articulated and in some cases added to earlier discussions of the purposes of the TBox and ABox:

TBox
  • Definitions of the concepts and properties (relationships) of the controlled vocabulary
  • Declarations of concept axioms or roles
  • Inferencing of relationships, be they transitive, symmetric, functional or inverse to another property
  • Equivalence testing as to whether two classes or properties are equivalent to one another
  • Subsumption, which is checking whether one concept is more general than another
  • Satisfiability, which is the problem of checking whether a concept has been defined (is not an empty concept)
  • Classification, which places a new concept in the proper place in a taxonomic hierarchy of concepts
  • Logical implication, which is whether a generic relationship is a logical consequence of the declarations in the TBox
  • Infer property assertions implicit through the transitive property

TBox <-> ABox
  • Entailments, which are whether other propositions are implied by the stated condition
  • Instance checking, which verifies whether a given individual is an instance of (belongs to) a specified concept
  • Knowledge base consistency, which is to verify whether all concepts admit at least one individual
  • Realization, which is to find the most specific concept for an individual object
  • Retrieval, which is to find the individuals that are instances of a given concept
  • Identity relations, which is to determine the equivalence or relatedness of instances in different datasets
  • Disambiguation, which is resolving references to the proper instance

ABox
  • Membership assertions, either as concepts or as roles
  • Attributes assertions
  • Linkages assertions that capture the above but also assert the external sources for these assignments
  • Consistency checking of instances
  • Satisfiability checks, which are that the conditions of instance membership are met

As the table shows, the TBox is where the reasoning work occurs, the ABox is where assertions and data integrity occur, and knowledge base work in the middle (among other aspects) requires both. We can reflect these work splits via the following diagram:

[Figure: TBox- and ABox-level work]

This figure maps the work activities noted in the table, with particular emphasis on the possible and specialized work activities at the interstices between the TBox and ABox.
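A few of the operations named in the table can be sketched in a handful of lines. This is a toy illustration under simplifying assumptions (a single-parent concept hierarchy, made-up concept and individual names); it is not how a description logic reasoner actually works internally.

```python
# Hypothetical concept hierarchy (TBox) and membership assertions (ABox).
SUBCLASS_OF = {"Author": "Person", "Editor": "Person", "Person": "Agent"}
TYPES = {"mike": {"Person", "Author"}, "alan": {"Editor"}, "pwc": {"Agent"}}


def ancestors(concept):
    """TBox only: every concept more general than `concept`."""
    found = []
    while concept in SUBCLASS_OF:
        concept = SUBCLASS_OF[concept]
        found.append(concept)
    return found


def subsumes(general, specific):
    """TBox only: subsumption check."""
    return general == specific or general in ancestors(specific)


def is_instance(individual, concept):
    """TBox + ABox: instance checking."""
    return any(subsumes(concept, t) for t in TYPES[individual])


def retrieve(concept):
    """TBox + ABox: retrieval of all individuals that belong to `concept`."""
    return sorted(i for i in TYPES if is_instance(i, concept))


def realize(individual):
    """TBox + ABox: realization, keeping only the most specific asserted concepts."""
    asserted = TYPES[individual]
    return sorted(t for t in asserted
                  if not any(t != u and subsumes(t, u) for u in asserted))
```

Note that `subsumes` touches only the hierarchy, while `is_instance`, `retrieve` and `realize` need both the hierarchy and the membership assertions, which is the middle column of the table.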

The Split Should Feel Natural

Whether for a single database or a federation across many, we have data records (structs of instances) and a logical schema (ontology of concepts and relationships) by which we try to relate this information. This is a natural and meaningful split: structure and relationships versus the instances that populate that structure.

Stated this way, particularly for anyone with a relational database background, the split between schema and data is clear and obvious. While the relational data community has not always maintained this split, and the RDF, semantic Web and linked data communities have not often done so either, this split makes eminent sense as a way to maintain a desirable separation of concerns.

The importance of description logics, besides their role as a logical underpinning to the semantic Web enterprise, is their ability to provide a perspective and framework for making these natural splits. Moreover, with some updated thinking, we can also establish a natural framework for guiding architecture and design. It is quite OK to also look to the interaction and triangulation of the ABox and TBox, as well as to specialized work that is not constrained to either.

For example, identity evaluation and disambiguation really come down to the questions of whether we are talking about the same or different things across multiple data sources. By analyzing these questions as separate components, we also gain the advantage of enabling different methodologies or algorithms to be determined or swapped out as better methods become available. A low-fidelity service, for example, could be applied for quick or free uses, with more rigorous methods reserved for paid or batch mode analysis. Similarly, maintaining full-text search as a separate component means the work can be done by optimized search engines with built-in faceting (such as the excellent open-source Solr application).
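As a rough sketch of what "swappable" could mean in practice, identity evaluation can be written against a pluggable matcher. Both matchers below are invented for illustration (the field names `name` and `born` and the 0.85 threshold are assumptions), and neither is meant as a serious disambiguation method.

```python
from difflib import SequenceMatcher


def quick_match(a, b):
    """Low-fidelity check: case-insensitive comparison of names only."""
    return a["name"].strip().lower() == b["name"].strip().lower()


def careful_match(a, b, threshold=0.85):
    """More rigorous check: fuzzy name similarity plus agreement on birth year."""
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold and a.get("born") == b.get("born")


def same_entity(a, b, matcher=quick_match):
    """Identity evaluation with a pluggable method."""
    return matcher(a, b)


# Two fictional records from different datasets.
rec1 = {"name": "Ada Example", "born": 1950}
rec2 = {"name": "ada example ", "born": 1975}

print(same_entity(rec1, rec2))                 # quick method
print(same_entity(rec1, rec2, careful_match))  # stricter method
```

Because callers only see `same_entity`, a better algorithm can replace either matcher later without changes elsewhere.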

These distinctions feel obvious and natural. They arise from a sound grounding in the split of the ABox and the TBox.

The Re-cap of Key Reasons to Maintain the TBox – ABox Split

So, to conclude this part in this occasional series, here are some of the key reasons to maintain a relative split between instances (the ABox) and the conceptual relationships that describe a world view for interpreting them (the TBox):

  • We are able to handle instance data simply. The nature of instance “things” is comparatively constant and can be captured with easily understandable attribute-value pairs
  • We can re-use these instance records in varied and multiple world views (the TBox). World views can be refined or approached from different perspectives without affecting instance data in the slightest
  • We can approach data architectural decisions from the standpoints of the work to be done, leaving open special analysis or tasks like disambiguation or full-text search
  • Ontologies (as defined by SD and focused on the TBox) are kept simpler and easier to understand. Inter-dataset relationships are asserted and testable in largely separate constructs, rather than admixed throughout
  • Relatedly, we are thus able to use ontologies to focus on the issues of mappings and conceptual relationships
  • Instance records can often be kept in situ, especially useful when incorporating the massive amounts of data in existing relational databases
  • Instance evaluations can be done separately from conceptual evaluations, which can help through triangulation in such tasks as disambiguation or entity identification
  • It is easier to convert simple data structs to the instance record structure, aiding interoperability (a subject for a later part in this series)
  • We provide a framework that is amenable to swapping in and out different analysis methods, and
  • It is easier to solicit broader input when the task is adding and refining attributes rather than maintaining internally consistent conceptual relationships.

Here is a final best practice suggestion when these ABox and TBox splits are maintained: Make sure as curators that new attributes added at the instance level are also added with their conceptual relationships at the TBox level. In this way, the knowledge base can be kept integral while we simultaneously foster a framework that eases the broadest scope of contributions.
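One way a curator might check this practice is to list the attributes that instance records use but the TBox does not yet declare. The sketch below assumes a very simple record layout (an `attributes` dictionary per record) that is invented for the example.

```python
def undeclared_attributes(tbox_properties, instance_records):
    """List attributes used by instance records (ABox) that the TBox does not yet declare."""
    used = set()
    for record in instance_records:
        used.update(record["attributes"])
    return sorted(used - set(tbox_properties))


# Hypothetical data: the TBox knows two properties, the records use three.
tbox_properties = {"name", "pages"}
records = [
    {"id": "report1", "attributes": {"name": "Forecast", "pages": 58}},
    {"id": "report2", "attributes": {"name": "Draft", "reviewer": "anon"}},
]

print(undeclared_attributes(tbox_properties, records))  # ['reviewer']
```

Anything the function returns is a candidate for a new property definition at the TBox level.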

This post is part of an occasional AI3 series on ontology best practices.
Posted: May 12, 2009

Structured Dynamics LLC

Ontology Best Practices for Data-driven Applications: Part 1

Structured Dynamics is plowing virgin ground in how linked, structured data — powered by the flexible RDF data model — can establish new approaches useful to the enterprise. These approaches range from how applications are architected, to how data is shared and interoperated, and to how we even design and deploy applications and the data themselves.

At the core of this mindset is the concept of ‘data-driven apps’, with their underlying structure based on ontologies. Over the coming weeks, I will be posting a series of best practices for how these ontologies can be designed, constructed and employed, and how they can shift the paradigm from static and inflexible applications to ones that are driven by the underlying data.

As the introduction to this occasional series, it is useful to define our terms and viewpoints. The two key concepts are:

  • Data-driven applications: this concept means the use of generic tools, applications and services that shape themselves and expose capabilities based on the structure of their underlying data. Generic means reusable. Unlike inflexible report writers or static tools of the past, these applications present functionality and are contextual based on the structure of the underlying data they serve. The data-driven aspect results from proper construction of the ontologies that describe this underlying data
  • Ontologies: ontologies have been something of a teeth-grinding concept for a couple of years, having been appropriated from their historical meaning of the nature of being (“ontos”) in philosophy to describe “shared conceptualizations” in computer science and knowledge engineering [1]. For its purposes, Structured Dynamics more precisely defines ontologies as the relationships of the concepts and domains embodied in the underlying things or instances described by the data. Under this approach, ontologies based on RDF become a structural representation of the data relationships in graph form. In addition, we also define ontologies to mean the proper description of these concepts, so as to supply the context, synonyms and aliases, and labels useful for human use and understanding.
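A toy example may help show how the two concepts fit together. In the sketch below a generic renderer decides what to display from an ontology-like description rather than from record-specific code. The property names, labels and "widget" hints are all invented for illustration.

```python
# Hypothetical ontology fragment: labels and display hints for each property.
ONTOLOGY = {
    "name": {"label": "Name", "widget": "text"},
    "born": {"label": "Year of birth", "widget": "number"},
    "homepage": {"label": "Home page", "widget": "link"},
}


def render(record, ontology):
    """Generic renderer: what is shown is decided by the ontology, not by record-specific code."""
    lines = []
    for prop, value in record.items():
        spec = ontology.get(prop)
        if spec is None:
            continue  # properties the ontology does not describe are skipped in this sketch
        if spec["widget"] == "link":
            value = f"<{value}>"
        lines.append(f"{spec['label']}: {value}")
    return lines


for line in render({"name": "Ada", "homepage": "http://example.org", "x": 1}, ONTOLOGY):
    print(line)
```

Adding a new entry to `ONTOLOGY` changes what the same renderer can show, with no change to the application code.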

We therefore put a fairly high threshold of construction and design on our ontologies. These imperatives provide the rationale for this series.

One complementary aspect to our design is the importance of converting data in any form or serialization to the canonical RDF data model, which the ontologies use to define and describe the data structure. Though crucial, this aspect is not discussed further in this series.
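For readers who have not seen such a conversion, here is a minimal sketch of turning a flat record into subject-predicate-object triples. The `http://example.org/` namespace and the URI layout are placeholders chosen for the example, not a convention from this series.

```python
BASE = "http://example.org/"  # placeholder namespace, not a real vocabulary


def struct_to_triples(record, id_key="id", base=BASE):
    """Turn one flat record into (subject, predicate, object) triples."""
    subject = f"{base}resource/{record[id_key]}"
    triples = []
    for key, value in record.items():
        if key == id_key:
            continue
        # A list value becomes one triple per item.
        values = value if isinstance(value, list) else [value]
        for item in values:
            triples.append((subject, f"{base}property/{key}", item))
    return triples


for triple in struct_to_triples({"id": "r1", "title": "T", "tags": ["a", "b"]}):
    print(triple)
```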

Now, of course, when someone (me) has the chutzpah to posit “best practices” it should also be clear as to what end. Ontologies may be used for many things. Others may have as the aim completeness of domain capture, wealth of predicates, reasoning or inference. In our sense, we define “best practices” within our focus of data interoperability and data-driven apps. Your own mileage may vary.

In no particular order and with likely new topics to emerge, here is the current listing of what some of the other parts in this occasional series will contain:

  • Intro (concepts)
  • ABox – TBox split
  • Architecting (modularizing) ontologies into categories (e.g., UI/display of information; domains/instances; admin/internal apps)
  • Definition of a standard instance record vocabulary (ABox)
  • Role of an instance record vocabulary for universal struct ingest
  • Selection of core external ontologies and re-use
  • A deeper exploration of the data-driven application
  • Initial ontology building and techniques
  • Specific UI items suitable to be driven by ontologies (a listing of 20 or so items)
  • Techniques for mapping to external ontologies
  • Dataset interoperability and the myth that OWL is only useful for real-time reasoning, and
  • OWL mapping predicates, importance of class mappings, and OWL 2.

The idea throughout this series is to document best practices as encountered. We certainly do not claim completeness on these matters, but we also assert that good upfront design can deliver many free backend benefits.

If there is a particular topic missing from above that you would like us to discuss, please fire away! In any event, we will be giving you our best thinking on these topics over the coming weeks and how they might be important to you.


[1] Michael K. Bergman, 2007. An Intrepid Guide to Ontologies, May 16, 2007. See http://www.mkbergman.com/?p=374.
Posted: May 10, 2009

2009 Semantic Technology Conference

First Unveiling of the Bibliographic Knowledge Network, New Product Announcement from SD

Well, it was just about six months ago that Fred Giasson and I announced our new company, Structured Dynamics. After a relatively quiet period and much laboring at the workbench, I will be presenting our new efforts at the 2009 Semantic Technology Conference, June 14th – 18th, at the Fairmont Hotel, in San Jose, California.

I will be speaking on, “BKN: Building Communities through Knowledge, and Knowledge Through Communities,” on Tuesday, June 16, during the 11:45 AM – 12:45 PM last morning session.

BKN — the Bibliographic Knowledge Network — is a major, two-year, NSF-funded project jointly sponsored by the University of California, Berkeley, Harvard University, Stanford University, and the American Institute of Mathematics, with broad private sector and community support [1]. Though initially nucleating around mathematics and statistics, each node in the network is a Web site or dataset distribution hub dedicated to a specific topic or field of knowledge. The project itself is developing a suite of tools and infrastructure based on semantic technologies for professionals, students or researchers to form new communities, and — with a single-click — to share and leverage expertise.

Besides presenting the BKN efforts for the first time, which include an innovative and open Web services framework for collaboration, I will also be using the occasion of my talk to announce our new semantic Web and linked data RDF framework for content management systems. We’re pretty excited about these advances.

This is my first time at this premier semantic Web event, which has been steadily growing and now exceeds 1000 attendees. The agenda is most impressive; it will be difficult to decide which of the many excellent talks to choose from for each session.

If you will be at the meeting and would like to get together, drop me a note and we can schedule a time. I hope to see you there!


[1] Research supported by NSF Award 0835851, Bibliographic Knowledge Network.
Posted: May 3, 2009

Structured Dynamics LLC

An Abstract Web Services Framework Aids Broad Applicability

Structured Dynamics’ product and software architecture is oriented to the Web. It emphasizes maximum flexibility, minimum “lock-in” and complete adaptability. This piece describes this architecture and how these aims are being met.

Design Objectives

Structured Dynamics is committed to what is known as a Web-oriented architecture (WOA) [1], which can be defined as:

WOA = SOA + WWW + REST

Nick Gall describes WOA as based on the architectural foundations of the Web, and as characterized by “globally linked, decentralized, and uniform intermediary processing of application state via self-describing messages.”

WOA is a subset of the service-oriented architectural style, wherein discrete functions are packaged into modular and shareable elements (“services”) that are made available in a distributed and loosely coupled manner. WOA uses the representational state transfer (REST) architectural style defined by Roy Fielding in his 2000 doctoral thesis; Fielding is also one of the principal authors of the Hypertext Transfer Protocol (HTTP) specification.

REST provides principles for how resources are defined, used and addressed with simple interfaces, without additional messaging layers such as SOAP or RPC. The principles are couched within the framework of a generalized architectural style and are not limited to the Web, though they are a foundation to it.

REST and WOA stand in contrast to earlier Web service styles that are often known by the WS-* acronym (such as WSDL, etc.). WOA has proven itself to be highly scalable and robust for decentralized users since all messages and interactions are self-contained (convey “state”).

Structured Dynamics abstracts its WOA services into simple and compound ones (which are combinations of the simple). All Web services (WS) have uniform interfaces and conventions and share the error codes and standard functions of HTTP. We further extend the WOA definition and scope to include linked data, which is also RESTful. Thus, our WOA also sits atop an RDF (Resource Description Framework) database (“triple store”) and full-text search engine.
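To illustrate what a uniform interface that shares HTTP's status codes can look like, here is a small sketch of a dispatcher over an in-memory store. The mapping of HTTP methods onto CRUD actions is a common REST convention that I am assuming for the example; the function and its behavior are illustrative only and are not the framework's actual code.

```python
from http import HTTPStatus

# Assumed convention: each HTTP method corresponds to one CRUD action.
METHOD_TO_CRUD = {"POST": "create", "GET": "read", "PUT": "update", "DELETE": "delete"}


def handle(method, resource_id, store, body=None):
    """Dispatch a request against an in-memory store, returning (status, payload)."""
    action = METHOD_TO_CRUD.get(method)
    if action is None:
        return HTTPStatus.METHOD_NOT_ALLOWED, None
    if action == "create":
        if resource_id in store:
            return HTTPStatus.CONFLICT, None
        store[resource_id] = body
        return HTTPStatus.CREATED, body
    if resource_id not in store:
        return HTTPStatus.NOT_FOUND, None
    if action == "read":
        return HTTPStatus.OK, store[resource_id]
    if action == "update":
        store[resource_id] = body
        return HTTPStatus.OK, body
    # Only "delete" remains at this point.
    del store[resource_id]
    return HTTPStatus.NO_CONTENT, None
```

Every service answering in the same small vocabulary of methods and status codes is what lets clients treat simple and compound services alike.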

These Web services then become the middleware interaction layer for general access and querying (“endpoints”) and for tying in external software (“clients”), portals or content management systems (CMS). This design provides maximum flexibility, extensibility and substitutability of components.

Architecture & Components

Here is the basic overview of the architecture and its components. Each of its major components is described in turn as keyed by number:

[Figure: General Structured Dynamics WOA Architecture]

The Web Services Framework Middleware

The core to the system is the Web services middleware layer, or WS framework (WSF). Structured Dynamics is preparing this framework [see (1)] as a separately available open source package under the Apache 2 license (soon to be released). It provides all of the components shown as items (2) to (5) in the diagram.

WSF is the abstraction layer that provides the Web service endpoints and services for external use. It also provides the direct hooks into the underlying RDF triple stores and full-text search engines that drive these services. At initial release, these pre-configured hooks will be to the Virtuoso RDF triple store (via ODBC, and later HTTP) and the Solr faceted text search engine (via HTTP) [2]. However, the design also allows other systems to be substituted if desired or for other specialized systems to be included (such as an analysis or advanced inference engine).
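The substitution idea can be sketched as coding the services against an interface rather than a product. The `TripleStore` interface and the in-memory stand-in below are hypothetical and invented for this example; they are not the WSF's actual classes, and no real Virtuoso or Solr call is shown.

```python
from abc import ABC, abstractmethod


class TripleStore(ABC):
    """Hypothetical interface the middleware would code against."""

    @abstractmethod
    def add(self, s, p, o): ...

    @abstractmethod
    def match(self, s=None, p=None, o=None): ...


class InMemoryStore(TripleStore):
    """Stand-in backend; a class backed by a real triple store would expose the same methods."""

    def __init__(self):
        self._triples = set()

    def add(self, s, p, o):
        self._triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        # None acts as a wildcard for that position.
        return sorted(t for t in self._triples
                      if all(q is None or q == v for q, v in zip((s, p, o), t)))


def count_statements_about(store: TripleStore, subject):
    """Service code depends only on the interface, never on a concrete store."""
    return len(store.match(s=subject))
```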

Authentication/Registration WS

The controlling Web service in WSF is the Authentication/Registration WS [see (2)]. The initial version uses registered IP addresses as the basis to grant access and privileges to datasets and functional Web services. Later versions may be expanded to include other authentication methods such as OpenID, keys (a la Amazon EC2), FOAF+SSL or OAuth. A secure channel (HTTPS, SSH) could also be included.
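A registration check keyed on IP address might look like the sketch below. The registry contents are made up (the addresses come from the ranges reserved for documentation), and allowing whole networks as well as single addresses is my own assumption rather than a stated feature.

```python
import ipaddress

# Hypothetical registry: registered network or address -> datasets it may reach.
REGISTRY = {
    ipaddress.ip_network("192.0.2.0/24"): {"bibliography"},
    ipaddress.ip_network("198.51.100.7/32"): {"bibliography", "people"},
}


def datasets_for(requester_ip):
    """Return the datasets a requesting IP address is registered for."""
    addr = ipaddress.ip_address(requester_ip)
    allowed = set()
    for network, datasets in REGISTRY.items():
        if addr in network:
            allowed |= datasets
    return allowed


print(sorted(datasets_for("192.0.2.44")))
print(sorted(datasets_for("203.0.113.1")))
```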

Other Core Web Services

The other core Web services provided with WSF are the CRUD functional services (create – read – update – delete), import and export, browse and search, and a basic templating system [see (3)]. These are viewed as core services for any structured dataset.

In initial release, the import and export formats will likely include TSV, RDF/XML, RDF/N3, RDF/Turtle, XML and possibly JSON.

Datasets, Access and Use Rights

A simple but elegant system guides access and use rights. First, every Web service is characterized as to whether it supports one or more of the CRUD actions. Second, each user is characterized as to whether they first have access rights to a dataset and, if they do, which of the CRUD permissions they have [see (4, 5)]. We can thus characterize the access and use protocol simply as A + CRUD.

Thereafter, a simple mapping of dataset access and CRUD rights determines whether a user sees a given dataset and what Web services (“tools”) are presented to them and how they might manipulate that data. When expressed in standard user interfaces this leads to a simple contextual display of datasets and tools.
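The A + CRUD mapping lends itself to a short sketch. The service names, users and permission sets below are invented; the point is only that which tools a user sees per dataset falls out of a set comparison.

```python
# Which CRUD actions each Web service ("tool") needs. Names are hypothetical.
SERVICE_NEEDS = {
    "browse": {"R"},
    "search": {"R"},
    "import": {"C"},
    "edit": {"R", "U"},
    "delete": {"D"},
}

# Per-user dataset permissions; a dataset missing from a user's entry means no access (the "A").
USER_RIGHTS = {
    "alice": {"bibliography": {"C", "R", "U", "D"}, "people": {"R"}},
    "bob": {"bibliography": {"R"}},
}


def visible_tools(user):
    """Map each dataset the user can access to the tools their CRUD rights allow."""
    result = {}
    for dataset, rights in USER_RIGHTS.get(user, {}).items():
        result[dataset] = sorted(tool for tool, needs in SERVICE_NEEDS.items()
                                 if needs <= rights)
    return result


print(visible_tools("bob"))
```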

At the Web service layer, these access values are set parametrically. The system, however, is designed to more often be driven by user and group management at the CMS level via a lightweight plug-in or module layer.

A Structured Data Foundation

Fundamentally, this “data-driven application” works because of its structured data foundation [see (6)]. Structured Dynamics employs an innovative design that exposes all RDF and record aspects to full-text search and is able to present contextual (“faceted”) results at all times in the interface [2]. In addition, the Virtuoso universal server provides RDF triple store and related structured data services.
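As a rough picture of what "faceted" results mean, the sketch below runs a naive text match over a few made-up records and counts the hits by type and subject. It is a stand-in for illustration only: it does not use Solr or Virtuoso, and the record fields are assumptions.

```python
from collections import Counter

# Fictional records standing in for indexed RDF resources.
RECORDS = [
    {"id": "r1", "type": "Article", "subject": "statistics", "title": "Linked data in practice"},
    {"id": "r2", "type": "Article", "subject": "mathematics", "title": "Graphs and data"},
    {"id": "r3", "type": "Book", "subject": "statistics", "title": "Inference"},
]


def search(records, term, facet_fields=("type", "subject")):
    """Naive full-text match over all values, plus facet counts for the hits."""
    term = term.lower()
    hits = [r for r in records if any(term in str(v).lower() for v in r.values())]
    facets = {f: dict(Counter(r[f] for r in hits)) for f in facet_fields}
    return hits, facets


hits, facets = search(RECORDS, "data")
print([r["id"] for r in hits], facets)
```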

The actual “driver” for the structured data system is one or more schema (“ontologies”) setting all of these structured data conditions. These ontologies are also managed by the triple store. The definition of these ontologies is specified in such a way with accompanying documentation to enable new scopes or domains to easily drive the system.

Interactions with CMSs and External Clients

As described by the diagram so far, all interactions with the system have been mediated either by Web service APIs or external endpoints, such as SPARQL.

For external clients or any HTTP-accessible system [see (10)], this is sufficient. Programmatically, external clients (software) may readily interact with the WS and obtain results via parametric requests.

However, the framework is also designed to be embedded within existing content management systems (CMSs). For this purpose, an additional layer is provided.

The architecture of the system can support interactions with standard open source CMSs or app frameworks such as Ruby on Rails, Django, Joomla!, WordPress, Drupal or Liferay, as examples [see (9)].

CMS interaction first occurs via specific modules or plug-ins written for that system [see (7)]. These are very lightweight wrappers that conform to the registry and hooks of the host CMS system. The actual modules or plug-ins provided are also geared to the management style of the governing CMS and what it offers [see (8)]. Each module or plug-in wrapper is a packaging decision of how to bound the WSF Web services in a configuration appropriate to the parent CMS.
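The thin-wrapper idea can be sketched as a plug-in that does nothing but register a hook with the host and forward calls to the framework. The class names, the hook registry and the `search` method are all hypothetical; real CMS plug-in APIs differ from system to system.

```python
class FrameworkClient:
    """Stand-in for calls to the Web services layer; method and results are invented."""

    def search(self, dataset, query):
        return [f"{dataset}:{query}:result"]


class CmsPlugin:
    """Thin wrapper: registers hooks with the host CMS and forwards to the framework."""

    def __init__(self, client):
        self.client = client

    def register(self, cms_hooks):
        # The host CMS is modeled here as a plain dictionary of named hooks.
        cms_hooks["search_block"] = self.render_search

    def render_search(self, dataset, query):
        items = self.client.search(dataset, query)
        return "\n".join(f"* {item}" for item in items)


hooks = {}
CmsPlugin(FrameworkClient()).register(hooks)
print(hooks["search_block"]("bibliography", "rdf"))
```

All real functionality sits behind `FrameworkClient`, so a wrapper for a different CMS only needs its own `register` method.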

This design keeps the actual tie-ins to the CMS as a very thin wrapper layer, which can embrace an open source licensing basis consistent with the host CMS. Because all of the underlying functionality has been abstracted in the WSF framework, licensing integrity across all deployment layers is maintained while allowing broad CMS interoperability. The design also allows networks to be established of multiple portals or nodes with different CMSs, perfect for broad-scale collaboration. User choice and flexibility are retained to the max.

In this design, the CMS retains its prominence and visibility (and, indeed, the standard admin and licensing basis). The WSF, specific Web services, and structured data backend remain largely invisible to the user.

Benefits of the Design

This design has manifest benefits, some of which include:

  • Broad suitability to a diversity of deployment architectures, operating systems and scales
  • Substitutability of underlying triple stores and text engines
  • Substitutability of preferred content management systems
  • Access and use of Web service endpoints independent of CMS (external clients)
  • Performant Web-oriented architecture (WOA) design
  • Common, RESTful interface for all Web services and functions in the framework
  • Easy registration of new Web services, inclusion with authorization system
  • Ability to share and interoperate data independent of client CMSs or portals
  • Use of the common lingua franca of RDF and its general advantages [3].

A Note on Tools and Philosophy

Structured Dynamics has the twin philosophies of using the best tools available yet also to give its customers and clients full choice. For instance, SD believes the Virtuoso system to be the best RDF triple store with superior functionality. Our internal benchmarks affirm its performance. Virtuoso is our standard first recommendation for a performant triple store.

Yet our architecture and design are not dependent on this application, nor indeed any other application. Deployment environments, customer preference, or pre-existing installations sometimes warrant the use of certain tools or applications. Large collaboration networks necessarily spring from diversity. It is thus critical that SD’s designs and architectures be tool-neutral and allow swapping and substitution. This is a major reason for the WOA and other RESTful design aspects of our Web services framework.

SD brings particular strengths in architecture, proper splits and design that separate ABox and TBox functionality [4], and ontology use and development. All of our designs are meant to be as tool neutral as possible, and we are always seeking the best of class in open source tools for any category.

Over the coming weeks and months Structured Dynamics will be rolling out packages and distribution sites for access to this framework and components built around this philosophy [5]. Stay tuned!


[2] Frédérick Giasson, 2009. RDF Aggregates and Full Text Search on Steroids with Solr, April 29, 2009. See http://fgiasson.com/blog/index.php/2009/04/29/rdf-aggregates-and-full-text-search-on-steroids-with-solr/.
[3] Michael K. Bergman, 2009. Advantages and Myths of RDF, white paper from Structured Dynamics LLC, April 22, 2009, 13 pp. See http://www.mkbergman.com/wp-content/themes/ai3v2/files/2009Posts/Advantages_Myths_RDF_090422.pdf.
[4] As per our standard use:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (or individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[5] This Structured Dynamics design has been aided in part as a result of research supported by NSF Award 0835851, Bibliographic Knowledge Network. The general Web services design and architecture is based on the system already developed for the UMBEL Web services.