In my Intrepid Guide to Ontologies from a couple of years back, I noted that “Ontology is one of the more daunting terms for those exposed for the first time to the semantic Web.” And, for sure, if one starts to peruse the current discussions ranging from the Ontolog Forum to major academic symposia (not meaning to single anyone out), it is clear that the idea of developing “ontologies” is often freighted with much weight, hot air, and (by implication) cost.
This is both a shame because, firstly, it is unnecessary and not often true. And, secondly, because the whole pragmatic idea of what an ontology is and what it can do has often gotten lost in the shuffle.
To be sure, there have been massive standards efforts and EU-funded mega-projects devoted to ontologies. There are certainly cases where coordination of specific domains such as petroleum or integration with a complicated supplier base such as in airline manufacture warrant these massive, complicated ontology development efforts.
But, from my vantage, these extremes overshadow the vast majority of more prosaic, pragmatic applications of ontologies. Remember, ontologies are merely a means of describing a conceptual view of the world . If one defines that “world” within focused and appropriate scope, it is surprising (we believe) how much mileage can be extracted from these suckers.
As we see a breakthrough of interest in semantic Web and linked data principles applied to the enterprise, as wonderfully described in the recent seminal PricewaterhouseCoopers quarterly Technology Forecast, also notably with a prominent focus on ontologies, I think it is time to direct all guns on prior bad assumptions and bad anecdotes. To wit:
In fact, it is the last point that no one is discussing today, but it is the most important of the lot: Ontologies, properly crafted, can be the ‘engines’ for data-driven applications.
It is this latter point that is a true paradigm shift and one of the most exciting prospects of ontologies.
Ontologies, for sure, are a formal representation of conceptual relations, a “world view.” But, that world view need not carry with it the freight of trying to describe all human knowledge. It can (and should) be restricted to an understandable scope (domain) and purpose. In that vein, what does such a “world view” need?
Let’s first talk about scope. We don’t need a “global ontology” that is accepted by everybody on Earth. What we need are focused ontology(ies) for describing things within a given problem space (whose data may reside in a single dataset or aggregate of datasets). We need to communicate how this system describes the things within its domain and how it understands the concepts and attributes associated with its problem space and data. This communication is published as the ontology. Rather than a global, comprehensive schema, we simply need these well constructed bricks, one by one.
Then, the ontology itself needs to be understandable and manageable. Ontologies should be readable by machines, but too many see ontologies solely through the lens of machines. I believe that to be a mistake. While importantly needing to be designed for machine ingest, I believe the real purpose of ontologies is for humans. How do we label things? How do we describe and define things? How do we find things? How do we organize things? How can these understandings be brought before us in the software that we use?
These types of questions lead us to the pragmatic and pull us back from the abstract. If we keep foremost the simple idea that ontologies are merely structures for how to organize (schema) and describe (vocabularies) our problem space at hand, then we can actually get on with cutting the bull and getting real stuff done.
Let’s take as an example our structWSF Web services framework that I will be announcing and demoing for the first time at SemTech 2009 next week. We developed a simple and flexible ontology to describe what a “Web services framework” should be. Then, we developed and implemented the software to make it happen. This means that an ontology development task can be seen as a specification task, too.
So, OK, what do these exhortations mean? Without respect to any particular scope or domain, let me then list below some important functional areas to which ontologies — properly and pragmatically designed — can contribute.
The traditional lens for viewing ontologies is as a means to express conceptual relationships. We agree.
However, ontologies need not have large and nuanced predicate (relationship) vocabularies in order to be useful. Relatively simple but powerful structures with hierarchical or part-of relationships can be very effectively employed for inferencing or faceted searching. From a pragmatic standpoint, let’s first agree on what “things” (nouns) there are in our domains, then let’s worry about how they relate (verbs).
The idea here has long been known as successive approximation: Let’s first get ourselves into the right country, then right province, then right city, then right neighborhood, then right house, and then right room. Only then should we worry about the condition of the paint or the age of the floors.
Endless harangues about “true” conceptual relations are a hindrance, not a help, to this perspective. It is much better (and faster, cheaper and more pragmatic) to put forward simple but coherent relationships than to worry about what all of that “really” means. From a business perspective, isn’t being able to utilize the assertion that the hip bone is connected to the thigh bone more important than having to await a full explication of all of the muscles and ligaments and tendons that might comprise them?
Once such simple relationship structures are embraced, then amazing inferencing power comes to the fore. If one searches on thigh bone, inferencing can also bring forward the hip because of its relationships to the leg.
OK, so now at least we have a coherent scaffolding of concepts and their straightforward relationships. That is, is concept A a “bigger” one (class or super set) than concept B? (Other simple relations could be substituted.) If so, we now see a bit of an organizing “world view” begin to emerge.
So, now we begin to bring in external data. But that data and its schema describe themselves differently. In one realm it is “foo”; in the other, “bar”.
While this different terminology for the same “thing” or related things may not be known at the outset, it is discoverable. And, when discovered, it is quite easy to associate the idea or concept of “foo” in one dataset to “bar” in another. In this manner, through learning and accretion, we are able to associate more and more similar things to one another.
We did not need to begin with some global, cosmic view to begin relating this data to one another. We only needed the right framework and structure that allowed this association to evolve as the learning occurred.
And, oh, by the way: this very same process is akin to documenting the organization’s institutional memory.
Being able to relate and “classify” or “organize” some things to other things also means that we are now beginning to create a roadmap for how “stuff” in a broad sense relates to other “stuff.” For example, if I develop a detailed understanding of the hip bone, I can now bring that body of knowledge into the context above to relate this new information to the thigh and to the leg.
Frankly, at this juncture, while perhaps ultimately important, it is helpful merely to know that Domain A (hip) is somehow related to Domain B (thigh). Think of the issue more like trying to get into the right map vicinity on the globe, and not whether individual streets intersect.
Again, the mindset here is one of letting ontologies and their concepts get related knowledge bases into the same ballpark. Whether we are trying to match Little League ball players with Major League ballplayers is beside the point: accept that both are playing baseball, then decide the importance and specifics of the relationship in a later step.
Again, “ballpark” is more helpful than no connection whatsoever. Silly statements about “ontological commitments” really mean nothing. Ontologies, like any other tool, can play different roles at different times. When helping to get like-related things into the same ballpark, ontologies are easy and quite effective.
(As an aside, it is useful to note here that our efforts with the UMBEL upper-level subject ontology are solely premised on this “roadmap” purpose. In and of itself, UMBEL is not a very complicated explication of the world. But it does provide a comprehensive set of 20,000 subject concepts for orienting quite disparate datasets and information to one another. This very same approach could be replicated and then applied to the granularity of individual domains, kind of like zooming in on Google Maps, to provide similar benefits at smaller scale for domain-specific roadmaps. In fact, that is a common approach we apply in our own client ontologies, which we then also make sure we tie into UMBEL for global orientation.)
OK, so with this foundation now built, we can next raise the bar a bit further. Once one begins to express these “world views” formally as an ontology, even with reduced ambitions as presented above, one still ends up with a formal specification of that conceptualization. And, that means, we now also have a basis with standard languages for mapping two disparate or separately developed ontologies to one another.
Moreover, we also have found a means for stitching together datasets with disparate schema to one another. Voilà: We now have met the Holy Grail of data interoperability.
In my opinion, this is the money shot from all of this effort. But, again, if we set the deployment threshold to the unrealistic levels that some ontology pundits suggest, this payoff is unlikely to happen. We are not trying to state absolute, universal truth about anything nor to be unrealistically comprehensive. All we are trying to do is make defensible assertions that one portion of a world view is similar or related to a portion of a different world view.
Now, does that sound that scary? No, of course not. It is merely a reasonable and pragmatic means for relating two structures together.
A key aspect to this mapping ability is to enrich the description of our concepts with what we call “semsets.” Semsets are a listing of related terms and phrases that provide synonyms, aliases, jargon and related context for alternative ways to describe or bound the concept at hand. This terminological “grist” is the basis for relatively straightforward natural language processing techniques to suggest matches between concepts in different ontologies (which might also be combined with other ontology components such as preferred labels, descriptions or structural placements in the schema).
Like many of the points above, these semsets can be built incrementally and over time as new jargon and terminology is discovered.
These techniques of mapping datasets and their ontology structures can be leveraged still further with the proper application of linked data practices. Via linked data, we place our data into Web-accessible (HTTP) networks and give them Web-scalable identifiers (URIs). This means we can now integrate and interoperate with much external public Web data and break down our own internal data silos.
Our instance records can be fleshed out with supplementary sources to provide more comprehensive attibutes and characterizations. Uniformity of treatment and coverage is promoted. Data interoperability is finally at hand.
A key best practice to this, of course, needs to be the recognition that not all data or information is public and not all users have the same roles or should have the same access to different sets of data. Thus, to embrace global mechanisms for data interoperability, there must also be local methods for enforcing access, privacy and confidentiality.
Properly designed ontologies can fulfill this requirement, as well. By organizing information into datasets and setting profiles for access and CRUD (create – read – update – delete) rights, an effective environment for data sharing and federation is established.
To this point, we have taken almost an exclusively data- or schema-centric view of ontologies. But, as structures, pure and simple, their structural nature can be exploited in other ways. It is here, frankly, that less is spoken of the potential for ontologies than in the more “conceptual” areas noted above.
The first of these new areas is in instance-sensitive data display. Each instance record is associated with an instance type in a governing ontology. Detecting this type means that context-sensitive display templates can then be invoked.
Detecting that something refers to a city, for example, can invoke a template providing a map, population figures, area size, city governance method and the like. In contrast, detecting an instance as a camera might invoke an entirely different display template focusing on product features or price or store and purchasing locations. Such instance-type displays are common; they are known as “infoboxes” within Wikipedia articles, as one example.
But this power of data display templates can be generalized further. What if we detect our instance represents a camera but do not have a display template specific to cameras? Well, the ontology and simple inferencing can tell us that cameras are a form of digital or optical products, which more generally are part of a product concept, which more generally is a form of a human-made artifact, or similar.
By tracing this inferencing chain from the specifc to the more general we can “fall back” until a somewhat OK display template is discovered, even in the absence of the better and more specific one. Then, if we find we are trying to display information on cameras frequently, we only need take one of the more general, parent templates and specifically modify it for the desired camera attributes. We also keep presentation separate from data so that the styling and presentation mode of these templates is also freely modifiable.
This parallel set of display structures to the domain ontology provides a highly reusable and leveraged data presentation framework. For 30 years organizations have struggled with report generators and all sorts of complicated systems for responsive reporting and data display. When driven by ontologies, this challenge is greatly simplified.
The careful reader of the above will note that our ontologies now have a number of interesting characteristics, all of which can be leveraged within the user interface. For example, we have:
This very information, when indexed in a supplementary full-text search engine with faceting capabilities (such as the Solr engine we use), can be leveraged in the user interface for these types of desired UI capabilities:
This is absolutely mindblowing power!
We can now design generic tools that do patterned functions. Then, based on the data at hand and the ontologies that describe them, we can now see completely modified and tailored interfaces. And all of this is done without modifying a single line of application code!
Applications in this brave new world now consist of assembling a proper suite of generic tools, and then spending the bulk of our time on describing and characterizing our data via ontologies and refining templates for displaying or reporting the types of specific instances within our current problem space.
All of the points made above are doable and being done today. Properly designed ontologies can readily deliver all of the aspects noted above. Later parts in this ongoing series will address many of those aspects in greater detail.
Ontologies are not magic. Properly done — an important emphasis — ontologies are the pivot point for faster and more adaptable ways of doing business. A simple, pragmatic mindset can help.
Our perspective is that ontologies are really the “flour that gets backed into the cake”. While viewable and definable as their own structures, properly constructed ontologies actually should exist everywhere within applications and contribute everywhere to applications. This is what we mean by “data-driven applications.”
To be sure, we are suggesting a paradigm shift from 30 years of IT frustrations: schema no longer must be fragile; reports no longer must be costly and delayed; and data can finally be made interoperable.
We will continue to give you our best thinking on these topics over the coming weeks and how they might be important to you.
Sound too good to be true? Read the material above again. And, then, we welcome getting your call.
PricewaterhouseCoopers (PWC) has just published a major 58-pp report on linked data in the enterprise. The report features insightful interviews with many industry practitioners as well as PWC’s own in-depth and thorough research. I think this report is a most significant event: it represents the first mainstream recognition of the potential importance of linked data and semantic Web technologies to the business of data interoperability within the enterprise.
This entire issue is uniformly excellent and well-timed. PWC has done a superb job of assembling the right topics and players. The report has three feature articles interspersed with four in-depth interviews. The target audience is the enterprise CIO with much useful explanation and background. Applications discussed range from standard business intelligence to energy and medicine.
The emphasis on the linked data aspect is a strong one. PWC puts twin emphases on ontologies and the enterprise perspective (naturally). This is a refreshing new perspective for the linked data community, which at times could be accused a bit of being myopic with regard to: 1) open data only; 2) instance records (RDF and no OWL, with little discussion of domain or concept ontologies); and 3) sometimes a disdain for the business perspective (as opposed to the academic).
PWC has done a great job of getting beyond some of the community's own prejudices in order to couch this in CIO and enterprise terms. This signals to me the transition from the lab to the marketplace, with all of its consequent challenges and advantages.
In short: Bravo! This is a very good piece and will, I think, put PWC ahead of the curve for some time to come.
I was very pleased to have the opportunity to review earlier drafts of this major report. After reading a couple of my recent papers on Shaky Semantics and the Advantages and Myths of RDF, with the latter cited in the piece, I had a chance to have a fruitful dialog with one of the report’s editors, Alan Morrison, who is a manager in PWC’s Center for Technology and Innovation (CTI). He kindly solicited my comments and incorporated some suggestions.
The report also lists 14 various semantic technology vendors and service providers. I’m pleased to note that PWC included our small Structured Dynamics firm as part of its listing. Other vendors listed include Cambridge Semantics, Collibra, Metatomix, Microsoft, OpenLink Software, Oracle, Semantic Discovery Systems, Talis, Thomson Reuters, TopQuadrant and Zepheira, with the selected service providers of Radar Networks and AdaptiveBlue.
This report is easy — but important — reading. I personally enjoyed the insights of Frank Chum of Chevron, a new name for me. I encourage all in the field to read and study the entire report closely. I think this report will be an important milestone for the semantic Web in the enterprise for quite some time to come.
After a brief sign-up, the 58-pp report is available for free download.
It is perhaps not surprising that the first substantive post in this occasional series on ontology best practices for data-driven applications begins with the importance of keeping an ABox and TBox split. Structured Dynamics has been beating the tom-tom for quite a while on this topic. We reiterate and expand on this position in this post.
Description logics (DL) are one of the key underpinnings to the semantic Web. DL are a logic semantics for knowledge representation (KR) systems based on first-order predicate logic (FOL). They are a kind of logical metalanguage that can help describe and determine (with various logic tests) the consistency, decidability and inferencing power of a given KR language. The semantic Web ontology languages, OWL Lite and OWL DL (which stands for description logics), are based on DL and were themselves outgrowths of earlier DL languages.
Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. It is this construct for which Structure Dynamics generally reserves the term “ontology”.
The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (or individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts. Both the TBox and ABox are consistent with set-theoretic principles.
TBox and ABox logic operations differ and their purposes differ. TBox operations are based more on inferencing and tracing or verifying class memberships in the hierarchy (that is, the structural placement or relation of objects in the structure). ABox operations are more rule-based and govern fact checking, instance checking, consistency checking, and the like. ABox reasoning is generally more complex and at a larger scale than that for the TBox.
Early semantic Web systems tended to be very diligent about maintaining these ‘box’ distinctions of purpose, logic and treatment. One might argue, as Structured Dynamics does, that the usefulness and basis for these splits has been lost somewhat more recently.
Particularly as we now see linked data become more prevalent, these same questions of scale and actual interoperability are posing real pragmatic challenges. To help aid this thinking, we have re-assembled, re-articulated and in some cases added to earlier discussions of the purposes of the TBox and ABox:
|TBox||TBox < — > ABox||ABox|
As the table shows, the TBox is where the reasoning work occurs, the ABox is where assertions and data integrity occurs, and knowledge base work in the middle (among other aspects) requires both. We can reflect these work splits via the following diagram:
This figure maps the work activities noted in the table, with particular emphasis on the possible and specialized work activities at the interstices between the TBox and ABox.
Whether a single database or the federation across many, we have data records (structs of instances) and a logical schema (ontology of concepts and relationships) by which we try to relate this information. This is a natural and meaningful split: structure and relationships v. the instances that populate that structure.
Stated this way, particularly for anyone with a relational database background, the split between schema and data is clear and obvious. While the relational data community has not always maintained this split, and the RDF, semantic Web and linked data communities have not often done so as well, this split makes eminent sense as a way to maintain a desirable separation of concerns.
The importance of description logics — besides its role as a logical underpinning to the semantic Web enterprise — is its ability to provide a perspective and framework for making these natural splits. Moreover, with some updated thinking, we can also establish a natural framework for guiding architecture and design. It is quite OK to also look to the interaction and triangulation of the ABox and TBox, as well as to specialized work that is not constrained to either.
For example, identity evaluation and disambiguation really come down to the questions of whether we are talking about the same or different things across multiple data sources. By analyzing these questions as separate components, we also gain the advantage of enabling different methodologies or algorithms to be determined or swapped out as better methods become available. A low-fidelity service, for example, could be applied for quick or free uses, with more rigorous methods reserved for paid or batch mode analysis. Similarly, maintaining full-text search as a separate component means the work can be done by optimized search engines with built-in faceting (such as the excellent open-source Solr application).
These distinctions feel obvious and natural. They arise from a sound grounding in the split of the ABox and the TBox.
So, to conclude this part in this occasional series, here are some of the key reasons to maintain a relative split between instances (the ABox) and the conceptual relationships that describe a world view for interpreting them (the TBox):
Here is a final best practice suggestion when these ABox and TBox splits are maintained: Make sure as curators that new attributes added at the instance level are also added with their conceptual relationships at the TBox level. In this way, the knowledge base can be kept integral while we simultaneously foster a framework that eases the broadest scope of contributions.
Structured Dynamics is plowing virgin ground in how linked, structured data — powered by the flexible RDF data model — can establish new approaches useful to the enterprise. These approaches range from how applications are architected, to how data is shared and interoperated, and to how we even design and deploy applications and the data themselves.
At the core of this mindset is the concept of ‘data-driven apps‘, with their underlying structure based on ontologies. Over the coming weeks, I will be posting a series of best practices for how these ontologies can be designed, constructed and employed, and how they can shift the paradigm from static and inflexible applications to ones that are driven by the underlying data.
So, as the introduction to this occasional series, it is thus useful to define our terms and viewpoints. Clearly the two key concepts are:
We therefore put a fairly high threshold of construction and design on our ontologies. These imperatives provide the rationale for this series.
One complementary aspect to our design is the importance to get data in any form or serialization converted to the canonical RDF data model upon which the ontologies define and describe the data structure. Though crucial, this aspect is not discussed further in this series.
Now, of course, when someone (me) has the chutzpah to posit “best practices” it should also be clear as to what end. Ontologies may be used for many things. Others may have as the aim completeness of domain capture, wealth of predicates, reasoning or inference. In our sense, we define “best practices” within our focus of data interoperability and data-driven apps. Your own mileage may vary.
In no particular order and with likely new topics to emerge, here is the current listing of what some of the other parts in this occasional series will contain:
The idea throughout this series is to document best practices as encountered. We certainly do not claim completeness on these matters, but we also assert that good upfront design can deliver many free backend benefits.
If there is a particular topic missing from above that you would like us to discuss, please fire away! In any event, we will be giving you our best thinking on these topics over the coming weeks and how they might be important to you.
I am pleased to report that UMBEL is now included as one of the recommended vocabularies for the Yahoo! SearchMonkey service. Using SearchMonkey, developers and site owners can use structured data to enhance the value of standard Yahoo! search results and customize their presentation, including through “infobars“. SearchMonkey is integral to a concerted effort by Yahoo! to embrace structured data, RDF and the semantic Web.
SearchMonkey was first announced in February 2008 with a beta release in April and then public release in May with 28 supported vocabularies. Then, last October, an additional set of common, external vocabularies were recommended for the system including DBpedia, Freebase, GoodRelations and SIOC. At the same time, some further internal Yahoo! vocabularies and standard Web languages (e.g., OWL, XHTML) were also added.
This is the first vocabulary update since then. Besides UMBEL, the AB Meta and Semantic Tags vocabularies have also been added to this latest revision. (There have also been a few deprecations over time.)
A recommended vocabulary means that its namespace prefix is recognized by SearchMonkey. The namespaces for the recommended vocabularies are reserved. Though site owners may customize and add new SearchMonkey structure, they must be explicitly defined in specific DataRSS feeds.
Structured data may be included in Yahoo! search results from these sources:
As a recommended vocabulary, UMBEL namespace references can now be embedded and recognized (and then presented) in Yahoo! search results.
Here are the 34 current vocabularies (plus five deprecated) recognized by the system:
|assert||SearchMonkey Assertions (deprecated)||http://search.yahoo.com/searchmonkey/assert/|
|context||SearchMonkey Context (deprecated)||http://search.yahoo.com/searchmonkey/context/|
|country||SearchMonkey Country Datatypes||http://search.yahoo.com/searchmonkey-datatype/country/|
|currency||SearchMonkey Currency Datatypes||http://search.yahoo.com/searchmonkey-datatype/currency/|
|owl||OWL ontology language||http://www.w3.org/2002/07/owl#|
|page||SearchMonkey Page (deprecated)||http://search.yahoo.com/searchmonkey/page/|
|rel||SearchMonkey Relations (deprecated)||http://search.yahoo.com/searchmonkey-relation/|
|tagspace||SearchMonkey Tagspace (deprecated)||http://search.yahoo.com/searchmonkey/tagspace/|
|use||SearchMonkey Use Datatypes||http://search.yahoo.com/searchmonkey-datatype/use/|
|xsd||XML Schema Datatypes||http://www.w3.org/2001/XMLSchema#|
In addition, there are a number of standard datatypes recognized by SearchMonkey, mostly a superset of XSD (XML Schema datatypes).
What is emerging from this Yahoo! initiative is a very useful set of structured data definitions and vocabularies. These same resources can be great starting points for non-SearchMonkey applications as well.
There is quite a bit of online material now available for SearchMonkey, with new expansions and revisions also accompanying this most recent release. As some starting points, I recommend: