ONTOLOG (a.k.a. ‘Ontolog Forum’) is an open, international, virtual community of practice devoted to advancing the field of ontology, ontological engineering and semantic technology, and advocating their adoption into mainstream applications and international standards. It has a great reputation and about 520 active members from 30 countries.
Our panel session kicked off the Forum’s new “Emerging Ontology Showcase” mini-series. This series is being co-championed by Ken Baclawski (Northeastern University) and Mike Bennett (Hypercube Ltd., UK). The criteria for invitation to the showcase include being new or a new release within the past 6 months or so; an emphasis on an ontology itself, not data or tools; and a focus on schema versus instances or facts or assertions. Efforts intended to produce or create standards are of particular interest.
After Ken’s introduction, the podcast begins with Mike Bennett speaking on, “The EDM Council Semantics Repository: Building Global Consensus for the Financial Services Industry.” This is an important initiative and in keeping with other financial reporting and XBRL-related topics of late. His slides are also online.
My talk, “UMBEL: A Lightweight Subject Reference Structure for the Web,” begins about 35% of the way into the podcast, accompanied by about 30 slides. The audio is a bit spotty for the first two slides until I switched from a speaker to a microphone. My presentation is about 30 min followed by joint Q & A with Mike for another 30 min or so.
Full proceedings — including agenda, abstracts, slides, audio recording and the transcript of the live chat session — may be found on the session page of the Forum wiki; see http://ontolog.cim3.net/cgi-bin/wiki.pl?ConferenceCall_2008_09_25.
The Forum has been doing this for some time and has a nice system worked out for coordinating later viewing of presentations synchronized with the audio.
This presentation was part of a frequent Thursday speaker’s session sponsored by the Forum.
Zotero has long been one of my favorite Firefox plug-ins, being a productive and trusted sidekick for collecting and reporting my voluminous citation and bibliographic data. I think perhaps my review of Zotero from January 2007 was one of my most glowing write-ups.
If you go to the Zotero home page, you will see at the lower left the steady increase of functionality that has come out in this free and open source tool. For example, Zotero now supports more than 1100 bibliographic sources, can capture Web pages and many standard Web sources, and has MS Office and WordPress support. Zotero has been developed and is distributed by the Center for History and New Media at George Mason University.
According to the Courthouse News Service with a copy of this complaint filed September 5, Thomson Reuters is suing George Mason University and, as a state institution, the Commonwealth of Virginia, for $10 million in damages and an injunction on further distribution of a beta version of Zotero. Thomson is seeking a jury trial.
Thomson claims that a July 8 beta release of Zotero (version 1.5) included a new feature to read and convert Thomson’s 3,500 plus proprietary .ens style files within the EndNote software into free, open source Zotero .csl files. Thomson claims this is in direct violation of GMU’s current license for EndNote. The Zotero beta release also introduces a server-side synchronization function; the standard Zotero release, without this feature and the EndNote support, is version 1.07.
EndNote is a proprietary and popular citation software used by many academics and researchers. EndNote has very similar functionality to Zotero. It allows users to search online bibliographic databases, organize them, and store and re-format citations in various publication styles. Single user licenses are $250 with volume and academic discounts available. Thomson claims “millions” of ultimate users.
File format ingest and conversions have long been a mainstay of interoperable software systems. This lawsuit will bear close monitoring.
Hat tip to Rafael Sidi for this link.
Since early in 2008 my colleague, Fred Giasson, has been authoring a series of important blog posts on ‘exploding the domain.’ Exploding the domain means to make class-to-class mappings between a reference ontology and external ontologies, which allows properties, domains and ranges of applicability to be inherited under appropriate circumstances. Exploding the domain expands inferencing power to this newly mapped information.
Fred first used the phrase in an April post that introduced the concept:
“The equation is simple:
a coherent framework + ontologies contextualized by this framework = more coherent ontologies.”
The thrust of this analysis was to show how UMBEL subject concepts can act to create context (his emphasis) for linked classes defined in external ontologies. Where the individuals in a dataset are instances of classes, and some of these classes are linked to UMBEL (or a similar contextual structure), these subject concept classes also give context to those individuals.
Finally, and most recently, Fred demonstrated how the use of UMBEL could explode DBpedia’s domain by linking classes using only three properties: rdfs:subClassOf, owl:equivalentClass and umbel:isAligned. (And, as I noted in an earlier posting this week, those mappings have now been made bi-directional from DBpedia to UMBEL as well.)
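To make the idea concrete, here is a minimal sketch of how class-to-class mappings let an instance pick up the context of a reference structure. It uses plain Python rather than an RDF library, all instance and class names are hypothetical, and only rdfs:subClassOf and owl:equivalentClass are handled (umbel:isAligned is left out because its exact semantics are not spelled out here).

```python
from collections import defaultdict

# Hypothetical triples for illustration only; none of these names come
# from DBpedia or UMBEL themselves.
TRIPLES = [
    ("ex:Saab_900", "rdf:type", "ext:Car"),
    ("ext:Car", "rdfs:subClassOf", "umbel:Automobile"),
    ("umbel:Automobile", "rdfs:subClassOf", "umbel:Vehicle"),
    ("ext:Auto", "owl:equivalentClass", "umbel:Automobile"),
    ("ex:Volvo_240", "rdf:type", "ext:Auto"),
]


def superclasses(cls, triples):
    """Return cls plus every class reachable through the class mappings."""
    up = defaultdict(set)
    for s, p, o in triples:
        if p == "rdfs:subClassOf":
            up[s].add(o)
        elif p == "owl:equivalentClass":
            # Equivalence is symmetric, so follow it in both directions.
            up[s].add(o)
            up[o].add(s)
    seen, stack = {cls}, [cls]
    while stack:
        for parent in up[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_types(instance, triples):
    """Collect the asserted classes of an instance plus everything they map to."""
    types = set()
    for s, p, o in triples:
        if s == instance and p == "rdf:type":
            types |= superclasses(o, triples)
    return types
```

With these made-up triples, an instance typed only with an external class also becomes a member of the reference classes that class is mapped to, which is the sense in which the domain is "exploded."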
As we discuss and apply these concepts we are starting to see some further guidelines emerge. Presenting these is the purpose of this post in this ongoing exploding the domain series.
Since its inception, DBpedia has had a class structure of sorts, first from the native Wikipedia categories from which it was derived and then with the incorporation of the YAGO structure (based on WordNet concept relationships). Yet we have claimed that class structure has truly only recently been brought to DBpedia with the mappings to UMBEL. Why? Does not DBpedia’s initial class structure meet the test?
(BTW, these same questions may be applied to some of the other large data structures beginning to emerge on the semantic Web such as Freebase. But, those are stories for another day.)
There are really two answers to these questions. First, the mere existence of classes is not enough; they must actually be used. And, second, the nature of those classes and their structure and coherence are absolutely fundamental. This subsection addresses the first point and the following one addresses the second, with both aided by the table below.
There has been a class structure within DBpedia from inception, which was then supplemented a few months after release with the YAGO structure. The starting Wikipedia structure showed early issues which began to be addressed through a cleaned Wikipedia category class (CWCC) hierarchy. These relationships were established with the rdf:type predicate that relates an instance to a class. The classes themselves were related to one another through the rdfs:subClassOf predicate. These class relationships allowed the linked classes to be shown in a hierarchical (taxonomy-like) structure.
Initially, in the case of the beginning Wikipedia categories, the internal class relationships were weak. This was somewhat improved with the addition of YAGO and its WordNet-based concept relationships (with better semantics).
However, these class relationships were (to my knowledge) never mapped to any external structures or ontologies. If used at all, they were only implied for ad hoc navigation within the internal instance data.
Really, anyone could have approached DBpedia at this point and begun an effort of mapping its existing class structures to external data. Indeed, we (as editors of UMBEL) considered doing so, but chose a different structure (see next section) for reasons of context and coherence.
As we brought in UMBEL to provide a class structure to DBpedia and linked data, this circumstance began to change, as this table indicates:
| |DBpedia before UMBEL (Wikipedia categories, YAGO)|DBpedia with UMBEL|
| |subClassOf|subClassOf; entity-to-class predicates|
| |No external mappings made|Aggressive use of external mappings (‘exploding the domain’); consistent internal structure|
| |Based on WordNet concept relationships|Based on Cyc common sense structural relationships; Cyc inferencing and reasoning tools for testing coherence; microtheories framework for domain differences; extendable structure|
|Unique Class Count|~ 55,000|~ 20,000|
Though shown for comparison reasons, the number of classes probably has no real importance.
The key argument in this subsection is that classes matter. Indeed, one of the current challenges before the linked data community is to understand and treat differently the issues of instances from classes. But, the question of whether one class structure is better than another is moot if class mappings are neglected altogether.
UMBEL’s reasons for not taking up the Wikipedia structure or the WordNet structure (that is, the initial structures within DBpedia) for its lightweight subject ontology were based on lack of coherence. I have spoken earlier about When is Content Coherent? regarding these arguments. Other analysis supports the conclusions in various ways.
A central (or “upper”) reference framework should be one that is solid and venerable. Over time, many subsidiary ontologies and structures will relate to it. Like a steel superstructure to a skyscraper or a structural framework to a large ocean-going tanker, this beginning structure needs to withstand many stresses and maintain its integrity as various subsidiary structures hang from it.
So long as we are still in “toy” mode with relatively few external mappings and relatively few ontologies, simple class-to-class mappings without respect to the coherence of the underlying ontologies may be OK. But, we will soon (if not already) see that structural flaws, like slight perturbations at the Big Bang, may propagate to create huge discontinuities.
At the pace of development we are now seeing, there will be tens to hundreds of thousands of ontologies within the foreseeable future. Granted, for any given circumstance, only a minor few of those may be applicable. But the potential combinations still can defy imagination in terms of complexity and potential scope at widely varying scales.
At the scale of the Web, of course, there will never be a central authority (nor should there be) for “official” vocabularies or structures. Yet, just as certainly, those ontologies and structures that do share some conceptual and structural coherence and are therefore more likely to easily integrate and interoperate will (I believe) win the Darwinian race. Without some degree of coherence, these disparate structures become like ill-fitting jigsaw pieces from different puzzles.
As we watch structures and relationships accrete like layers in a pearl, we should begin with a solid grain of common sense and coherence. That is why we chose the Cyc structure as the basis of UMBEL — it provides one such solid, coherent framework for moving forward.
I am not sanguine that ad hoc, free-form ontological structures, created in the same manner as topic-specific articles in Wikipedia or as informal tags in bookmarking systems, will bring such coherence. But who knows? Perhaps on the Web where novelty and the joy of creation and exploration can trump usefulness, such could transpire.
But, when we look to linked data and semantic Web constructs finally achieving their potential in the enterprise to overcome decades of data silos, I suspect purposeful coherence will win the day.
Sustainable ontologies, which themselves can host and interoperate with still further ontologies and structures, will require coherent underpinnings to not collapse from the weight of keeping consistency. Just as our current highways and interstates follow the earlier roads before them, as early trailblazers we have a responsibility to follow the natural contours of our applicable information spaces. And that requires coherence and consistency; in other words, logic and design.
In the past few months there has been a remarkable emergence of interest in vocabularies and semantics (as traditionally understood by linguists). Today, for example, marks the kick-off of the first VoCamp get-together in Oxford, England, with interest and discussion active about potentially many others to follow. Peter Mika, Matthias Samwald and Tom Heath have each outlined their desires for this meeting.
I hope the participants in this meeting and others to follow look seriously at the issues of coherence and interoperability and sustainability. My caution is as follows: like tagging and Wikipedia, we have seen amazing contributions from user-generated content that have totally re-shaped our information world. However, we have not yet seen such processes work for structure and coherent conceptual relationships.
I believe participation and UGC have real roles to play in the emergence of coherent structures and vocabularies to enable interoperability. But I also believe they have not done so to date, and useful approaches to those will not emerge in a free-form fashion and without consideration for sustainability and coherence testing.
In these observations there is absolutely no criticism intended or implied of DBpedia or prior linked data practice. A natural and understandable progression has been followed here: first make connections between things, then begin to surface knowledge through the exploration of relationships. We are just now beginning that exploration through the use of classes, vocabularies and ontologies to explicate relationships. The fact that linked data and DBpedia first emphasized linking things and publishing things is a major milestone. It is now time to move on to the new challenges of structure and relationships.
There is much to be learned from other pathbreaking efforts such as the Open Biomedical Ontologies efforts and their attempts at coordination and standardization. As the demands and interests in interoperability increase, interfaces, consistency and coordination will continue, I believe, to come to the fore.
In another 18 months we will likely look back at today’s issues and thoughts as also naïve in light of new understandings. The pace of discovery and learning is such that I believe best practices will remain fluid for quite some time.
Wordnets tend to be star-like in structure, with sparse relations, and dominated by a few hub clusters. c.f., Holger Wunsch, 2008. Exploiting Graph Structure for Accelerating the Calculation of Shortest Paths in Wordnets, in Proceedings of Coling 2008, Manchester, UK, August 2008. See http://www.sfs.uni-tuebingen.de/~wunsch/wn-shortest-paths.pdf.
The strict uncorrected structure of Wikipedia categories can also be inconsistent, inaccurate, populated with administrative categories, demonstrate cycles, and lack uniform coverage. c.f., Jonathan Yu, James A. Thom and Audrey Tam, 2007. Ontology Evaluation Using Wikipedia Categories for Browsing, in Proceedings of 16th Conference on Information and Knowledge Management (CIKM), 2007; see http://goanna.cs.rmit.edu.au/~jyu/publications/YuEtal07.pdf. This paper also presents a comprehensive framework for ontology evaluation.
This reference describes a new way to calculate semantic relatedness (not the same as coherence) in relation to Wikipedia, ConceptNet and WordNet: Sander Wubben, 2008. Using Free Link Structure to Calculate Semantic Relatedness. ILK Research Group Technical Report Series no. 08-01, July 2008, 61 pp. See http://ilk.uvt.nl/downloads/pub/papers/wubben2008-techrep.pdf.
Table 3 in this citation presents an interesting contrast between what the authors call collaborative knowledge bases (CKBs, like Wikipedia) and linguistic knowledge bases (LKBs, like WordNet), again however not really addressing the coherence issue: Torsten Zesch, Christof Müller and Iryna Gurevych, 2008. Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary, in Proceedings of the Sixth International Language Resources and Evaluation (LREC’08), May 28-30, Marrakech, Morocco. See http://elara.tk.informatik.tu-darmstadt.de/publications/2008/lrec08_camera_ready.pdf. Also see Torsten Zesch and Iryna Gurevych, 2007. Analysis of the Wikipedia Category Graph for NLP Applications, in Proceedings of the Workshop TextGraphs-2: Graph-Based Algorithms for Natural Language Processing at HLT-NAACL 2007, 26 April, 2007, Rochester, NY, pp. 1-8. See http://www.aclweb.org/anthology-new/W/W07/W07-02.pdf.
Thanks to Kingsley Idehen and OpenLink Software, DBpedia has been much enriched by its mapping to UMBEL’s 20,000 class-based subject concepts. DBpedia is the structured data version of Wikipedia that I (among many) wrote about in depth in April of last year shortly after its release.
We have also recently gotten an updated estimate of the size of the semantic Web and a new release of the linking open data (LOD) cloud diagram.
Since DBpedia’s release, it has become the central hub of linked open data as shown by this now-famous (and recently updated!) LOD diagram:
Each version of the diagram adds new bubbles (datasets) and new connections. The use of linked data, which is based on the RDF data model and uses Web protocols to name and access data, is proving to be a powerful framework for interconnecting disparate and heterogeneous information. As the diagram above shows, all types of information from a variety of public sources now make up the LOD cloud.
The most recent analysis of this LOD cloud is by Michael Hausenblas and colleagues as presented at I-Semantics08 in September. About 50 major datasets comprising roughly two billion triples and three million interlinks were contained in the cloud at the time of their analysis. They partitioned their analysis into two distinct types: 1) single-point-of-access datasets (akin to conventional databases), such as DBpedia or Geonames, and 2) distributed records characterized by RDF ontologies such as FOAF or SIOC. Their paper should be reviewed for its own conclusions. In general, though, most links appear to be of low value (though a minority are quite useful).
Simple measures such as triples or links have little meaning in themselves. Moreover, and this is most telling, all of the LOD relationships in the diagram above and the general nature of linked data to date have based their connections on instance-level data. Often this takes the form that a specific person, place or thing in one dataset is related to that very same thing in another dataset using the owl:sameAs property; sometimes it is that one person knows another person; or, it may be in other examples that one entry has an associated photo. Entities are related to other entities and their attributes, but little is provided about the conceptual or structural relationships amongst those entities.
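As a rough illustration of what instance-level linking buys, the sketch below merges the attributes of records that are declared to be the same thing. It is plain Python with a small union-find, not how any particular triple store handles owl:sameAs, and the identifiers and attribute values are hypothetical.

```python
# Hypothetical sameAs links and per-dataset facts, for illustration only.
SAME_AS = [("dbpedia:Berlin", "geo:Berlin_record")]
FACTS = {
    "dbpedia:Berlin": {"label": "Berlin"},
    "geo:Berlin_record": {"country": "Germany"},
    "dbpedia:Paris": {"label": "Paris"},
}


def merge_same_as(same_as, facts):
    """Group identifiers linked by sameAs and pool their attributes."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_as:
        parent[find(a)] = find(b)

    merged = {}
    for entity, attrs in facts.items():
        merged.setdefault(find(entity), {}).update(attrs)
    return merged


merged = merge_same_as(SAME_AS, FACTS)
```

Note what the result does and does not contain: the two Berlin records now share one pooled set of attributes, but nothing says what kind of thing Berlin is or how it relates to Paris.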
Instance-level mappings are highly useful for aggregating various attributes or facts about given entities or things. But they only scratch the surface of the structure that can be made available through linked data and the conceptual relationships between and amongst all of those things. For those relationships to be drawn or inferred, a different level of linkage needs to be made: the class, collection or schema view of the data.
UMBEL, or similar conceptual frameworks, can provide this structural backbone.
UMBEL (Upper Mapping and Binding Exchange Layer; see http://www.umbel.org) is a lightweight reference ontology of about 20,000 subject concepts and their logical and semantic relationships. The UMBEL ontology is a direct derivation of the proven Cyc knowledge base from Cycorp, Inc. (see http://www.cyc.com).
UMBEL’s subject concepts provide mapping points for the many (indeed, millions of) named entities that are their notable instances. Examples might include the names of specific physicists, cities in a country, or a listing of financial stock exchanges. UMBEL mappings enable us to link a given named entity to the various subject classes of which it is a member.
And, because of relationships amongst subject concepts in the backbone, we can also relate that entity to other related entities and concepts. The UMBEL backbone traces the major pathways through the content graph of the Web.
The UMBEL backbone provides structure and relationships at large or small scale. For example, in its full extent, UMBEL’s complete structure resembles:
But, we can dive into that structure with respect to automobiles or related concepts . . .
. . . all the way down to seeing the relationships to Saab cars:
It is this ability to provide context through structure and relations that can help organize and navigate large datasets of instances such as DBpedia. Until the application of UMBEL — or any subject or class structure like it — most of the true value within DBpedia has remained hidden.
But no longer.
UMBEL already had mapped most DBpedia instances to its own internal classes. By a simple mapping of files and then inferencing against the UMBEL classes, this structure has now been brought to DBpedia itself. Any SPARQL queries applied against DBpedia can now take advantage of these relationships.
Below are some sample queries Kingsley used to announce these UMBEL capabilities to the LOD mailing list. You can test these queries yourself or try alternative ones by using a standard SPARQL query.
For example, go to one of DBpedia’s query endpoints such as http://dbpedia.org/sparql and cut-and-paste one of these highlighted code snippets into the ‘Query text’ box:
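As an illustration only (this is not one of the queries from the announcement), the sketch below builds a simple class-membership query and the corresponding request URL for the endpoint named above. The class URI is a hypothetical placeholder to be replaced with a real UMBEL subject concept, and the helper names are my own; passing the query text in a `query` parameter follows the standard SPARQL protocol.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"  # endpoint named in the text

# Hypothetical placeholder: substitute the URI of a real subject concept.
CLASS_URI = "http://example.org/umbel/SomeSubjectConcept"


def build_query(class_uri, limit=10):
    """Return a SPARQL query listing instances typed with class_uri."""
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        "SELECT ?s WHERE {\n"
        f"  ?s rdf:type <{class_uri}> .\n"
        "}\n"
        f"LIMIT {limit}"
    )


def build_url(query):
    """Return a GET URL carrying the query text, URL-encoded."""
    return ENDPOINT + "?" + urlencode({"query": query})


query = build_query(CLASS_URI)
url = build_url(query)
```

The text in `query` can be pasted directly into the ‘Query text’ box, while `url` is the form a script would request.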
By going to UMBEL’s technical documentation page at http://umbel.org/documentation.html, you can download the files to create your own mappings (assuming you have a local instance of DBpedia).
The example below also assumes you are using the OpenLink Virtuoso server as your triple store. If you are using a different system, you will need to adjust your commands accordingly.
A new era of interacting with DBpedia is at hand. Within a period of just more than a year, the infrastructure and data are now available to show the advantages of the semantic Web based on a linked Web of data. DBpedia has been a major reason for showing these benefits; it is now positioned to continue to do so.