A standard cliché of management consultants is the exhortation to think “outside the box.” Of course, what is meant by this is to question assumptions, to think differently, to look at problems from new perspectives.
With our recent release of the (linked open data) ‘LOD constellation‘ of linked data classes based around UMBEL, I have been fielding a lot of inquiries on what the relationship is of UMBEL to DBpedia. (See, for example, this current interview by the Semantic Web Company with me and Sören Auer of the DBpedia project.) This also fits into the ongoing distinction we have made in the UMBEL project between our subject concepts (classes) and named entities (instances).
What has actually most been helping my thinking is to get fully inside the box (or, rather, boxes, hehe). Let me explain.
Description logics are one of the key underpinnings to the semantic Web. They grew out of earlier frame-based logic systems from Marvin Minsky and also semantic networks; the term and discipline was first given definition in the 1980s by Ron Brachman, among many others .
Description logics (DL, most often expressed in the plural) are a logic semantics for knowledge representation (KR) systems based on first-order predicate logic (FOL). They are a kind of logical metalanguage that can help describe and determine (with various logic tests) the consistency, decidability and inferencing power of a given KR language. The semantic Web ontology languages, OWL Lite and OWL DL (which stands for description logics), are based on DL and were themselves outgrowths of earlier DL languages.
Description logics and their semantics traditionally split concepts and their relationships from the different treatment of individuals and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships.
The second split of individuals is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of individuals, the roles between individuals, and other assertions about individuals regarding their class membership with the TBox concepts. Both the TBox and ABox are consistent with set-theoretic principles.
TBox and ABox logic operations differ and their purposes differ. TBox operations are based more on inferencing and tracing or verifying class memberships in the hierarchy (that is, the structural placement or relation of objects in the structure). ABox operations are more rule-based and govern fact checking, instance checking, consistency checking, and the like . ABox reasoning is generally more complex and at a larger scale than that for the TBox.
Early semantic Web systems tended to be very diligent about maintaining these “box” distinctions of purpose, logic and treatment. One might argue, as I do herein, that the usefulness and basis for these splits has been lost somewhat in our first implementations and publishing of linked data systems.
Most of the semantic Web work at the beginning of this decade was pretty explicit about references to description logics and related inferencing engines and computational efficiency. Some of the early commercial semantic Web vendors are still very much focused on this space.
However, with the first release and emphasis on linked data about two years ago, the emphasis seemed to shift to the more pragmatic questions of actually posting and getting data out there. Best practices for cool URIs and publishing and linkage modes assumed prominence. The linking open data (LOD) movement began in earnest and gained mindshare. Of course, many in the DL and OWL development communities continued to discuss logic and inferencing, but now seemingly more as a separate camp to which the linked data tribe paid little heed.
The central hub of this linked data effort has been DBpedia and its pivotal place within the ‘LOD cloud.’ What is remarkable about the LOD cloud, however, is that it is almost entirely an ABox representation of the world and its instances. Starting from the core set of individual instances within Wikipedia, this cloud has now grown to many other sources and the central place for finding linked instance data. If one looks carefully at the LOD cloud and its linkages we can see the prevalence of instance-level relationships and attributes.
In fact, the LOD cloud diagram to upper right from the Wikipedia article on linked data has become the key visual metaphor for the movement. But, as noted, this view is almost exclusively one at the ABox instance level.
The UMBEL project began at roughly the same time and as a response to the release of DBpedia. My question in looking at the first data linked to DBpedia was, What is this content about? Sure, I might be able to find multiple records discussing Abraham Lincoln as a US president regarding attributes like birth date and a list of children, but where could I retrieve records about other presidents or, more broadly, other types of leaders such as prime ministers, kings or dictators?
The intuition was that the linked data and the various FOAF and other distributed instance records it was combining lacked a coherent reference structure of subject topics or concepts with which to describe content. The further intuition was that — while tagging systems and folksonomies would allow any and all users to describe this content with their own metadata — a framework for relating these various assignments to one another was still lacking.
In the nearly two years of development leading to the first beta release of UMBEL we have tried many analogies and metaphors to describe the basis of the 20,000 subject concept classes within UMBEL in relation to its role and other linked data initiatives. While many of those metaphors help visualize use and role, the more formal basis offered by description logics actually helps to most precisely cast UMBEL’s role. For example, in today’s interview with the Semantic Web Company, I note:
“. . . we have described UMBEL as a roadmap, or middleware, or a backbone, or a concept ontology, or an infocline, or a meta layer for metadata, and others. Today, what I tend to use, particularly in reference to DBpedia, is the TBox-ABox distinction in computer science and description logics. UMBEL is more of a class or structural and concept relationships schema — a TBox — while DBpedia is more of an an instance and entity layer with attributes — an ABox. I think they are pretty complementary. . . “
The resulting class level structure produced by UMBEL and its mappings to other classes within existing linked data enabled us to create and then publish the ‘LOD constellation‘, a complementary TBox structure to the linked data’s existing ABox one. This diagram to the lower right from the Wikipedia article on linked data now shows this complement.
Description logics have arisen to aid our creating and understanding of knowledge representation systems. From this basis, we can see that the first efforts of the linked data initiative have lacked context, the TBox. At a specific level, the question is not DBpedia v. UMBEL or cloud v. constellation. Both types of structure are required in order to complete the logical framework. By thinking inside the box — by paying attention to our logical underpinnings — we can see that both TBoxes and ABoxes are essential and complementary to creating a useful knowledge representation system.
By more explicitly adopting a description logics framework we can also better address many prior questions of context, coherence and sufficiency. These have been constant themes in my recent writings that I will be revisiting again through the helpful prism of formal description logics.
My interview today with Sören Auer also brought up some important points regarding context. As we have said in other venues, it is important that any TBox be available for context purposes. Whether that should be UMBEL or some other framework depends on the use case. As I noted in the interview, “UMBEL’s specific purpose is to provide a coherent framework for serious knowledge engineers looking to federate data.” Other uses may warrant other frameworks, and certainly not always UMBEL.
But, in any event, I have two cautions to the linked data community: 1) do not take the suggestion to have a reference framework of concepts as being equivalent to adopting a single ontology for the Web; think of any reference structure as an essential missing TBox, and not some call to adopt “one ontology to rule them all,” but 2) in adopting alternative frameworks, take care that whatever is designed or adopted itself be able to meet basic DL logic tests of consistency and coherence.
The many advantages from separate TBox and ABox frameworks are one serendipitous surprise coming from the early development of linked data. To my knowledge, no one has yet elaborated the significant advantages from design, performance, architectural and flexibility perspectives from a distinct and explicit separation of TBox from ABox. We believe these advantages to be substantial.
Realize, as distributed, UMBEL already has both TBox and ABox components. The TBox component is the lightweight UMBEL ontology, with its 20,000 subject concept classes and their hierarchical and other relationships. This component has a vocabulary (or terminology) for aiding the linking to external ontologies. The vocabulary is quite suitable for extension into new domains as well.
The ABox component is the named entities part of instances drawn from Wikipedia and the BBC’s John Peel sessions. Besides being of common, broad interest, these 1.5 million instances (per the current version) are included in the distribution to instantiate the ontology for demonstration and sandbox purposes.
So, UMBEL’s world is quite simple: subject concepts (SCs) and named entities (NEs). Subject concepts are the TBox and classes that define the structure and concept relationships. Named entities are the individual “things” in the world (some lower case such as animals or foods) and are the ABox of instances that populate this structure.
In our early efforts, we concentrated on the SC portion of UMBEL. Most recently, we have been concentrating on the NE component and its NE dictionaries. It was these investigations that drew us into an ABox perspective when looking at design options. The logic and rationale had been sitting there for some years, but it took cracking open the older textbooks to become reacquainted with it.
Once we again began looking inside the box, we began to see and enumerate some significant advantages to an explicit TBox-ABox design, as well as advantages for keeping these components distinct:
We are still learning about these advantages and will document them further in pending work on coherence and named entity dictionary (NED) creation.
The two main points of this article have been to: 1) recognize the important intellectual legacy of description logics and how they can inform the linked data enterprise moving forward; and 2) be explicit about the functional and architectural splits of the TBox from the ABox. Making this split brings many advantages.
There will continue to be many design challenges as linked data proliferates and actually begins to play its role of aiding meaningful knowledge work. The grounding in description logics and the use of DL for testing alternative designs and approaches is a powerful addition to our toolkit.
Sometimes there are indeed many benefits to thinking inside the box.
I first wrote about OpenLink Software‘s stellar suite of structured Web-related software back in April 2007, with a spotlight on Virtuoso, the company’s flagship ‘universal server’ product. As it has for years, OpenLink continues a steady drumbeat of new releases and extensions. The most recent version upgrade, 5.0.9, was announced today.
In the intervening period I have now personally had the chance to experience Virtuoso first hand, both as the standard hosting platform for Zitgist’s linked data products and services, and as the hosting environment for UMBEL‘s various and growing Web services. I can state quite categorically that our ability to get things done fast with few resources depends critically on the unbelievable high-productivity platform that Virtuoso provides. (And, hehe, given our close relationship to OpenLink, we also get great responsiveness and technical support! Though, truthfully, OpenLink continues to amaze with its outreach and embrace of all of the important initiatives within the semantic Web community.)
I normally let these standard Virtuoso release announcements pass without comment. But today’s release v. 5.0.9 has an especially important feature from my parochial perspective: the first support for UMBEL.
Just to refresh memories, OpenLink’s Virtuoso is a cross-platform universal server for SQL, XML, and RDF data, including data management. It includes a powerful virtual database engine, full-text indexing, native hosting of existing applications, Web Services (WS*) deployment platform, Web application server, and bridges to numerous existing programming languages. Now in version 5.0, Virtuoso is also offered in an open source version. The basic technical architecture of Virtuoso and its robust capabilities is:
From an RDF and linked data perspective, Virtuoso is the most scalable and fastest platform on the market. Critically from Zitgist’s perspective is Virtuoso’s more than 100 built-in RDF-izers (or “Sponger cartridges”) that address all major data formats, serializations, relational data and Web 2.0 APIs. But don’t take my word for it: Check out OpenLink’s impressive list of these cartridges and their various linkages throughout the linked data space.
The key aspect of the new UMBEL support in Virtuoso is its incorporation of UMBEL lookups and its use of Named Entity extraction into the RDF-izer cartridges. This is but the first of growing support anticipated for UMBEL.
In addition to UMBEL, this version 5.0.9 includes significant performance optimizations to the SQL Engine, SPARQL+RDF Engine, and the ODBC and JDBC drivers.
Other new features include:
Finally, per usual, there are also minor bug-fixes:
For more details, you can see these Virtuoso release notes: https://sourceforge.net/project/shownotes.php?release_id=626647&group_id=161622
I recently wrote about WOA (Web-oriented architecture), a term coined by Nick Gall, and how it represented a natural marriage between RESTful Web services and RESTful linked data. There was, of course, a method behind that posting to foreshadow some pending announcements from UMBEL and Zitgist.
Well, those announcements are now at hand, and it is time to disclose some of the method behind our madness.
As Fred Giasson notes in his announcement posting, UMBEL has just released some new Web services with fully RESTful endpoints. We have been working on the design and architecture behind this for some time and, all I can say is, it’s UMBELievable!
As Fred notes, there is further background information on the UMBEL project — which is a lightweight reference structure based on about 20,000 subject concepts and their relationships for placing Web content and data in context with other data — and the API philosophy underlying these new Web services. For that background, please check out those references; that is not my main point here.
We discussed much in coming up with the new design for these UMBEL Web services. Most prominent was taking seriously a RESTful design and grounding all of our decisions in the HTTP 1.1 protocol. Given the shared approaches between RESTful services and linked data, this correspondence felt natural.
What was perhaps most surprising, though, was how complete and well suited HTTP was as a design and architectural basis for these services. Sure, we understood the distinctions of GET and POST and persistent URIs and the need to maintain stateless sessions with idempotent design, but what we did not fully appreciate was how content and serialization negotiation and error and status messages also were natural results of paying close attention to HTTP. For example, here is what the UMBEL Web services design now embraces:
There are likely other services out there that embrace this full extent of RESTful design (though we are not aware of them). What we are finding most exciting, though, is the ease with which we can extend our design into new services and to mesh up data with other existing ones. This idea of scalability and distributed interoperability is truly, truly powerful.
It is almost like, sure, we knew the words and the principles behind REST and a Web-oriented architecture, but had really not fully taken them to heart. As our mindset now embraces these ideas, we feel like we have now looked clearly into the crystal ball of data and applications. We very much like what we see. WOA is most cool.
For lack of a better phrase, Zitgist has a component internal plan that it calls its ‘Grand Vision’ for moving forward. Though something of a living document, this reference describes how Zitgist is going about its business and development. It does not describe our markets or products (of course, other internal documents do that), but our internal development approaches and architectural principles.
Just as we have seen a natural marriage between RESTful Web services and RESTful linked data, there are other natural fits and synergies. Some involve component design and architecting for pipeline models. Some involve the natural fit of domain-specific languages (DSLs) to common terminology and design, too. Still others involve use of such constructs in both GUIs and command-line interfaces (CLIs), again all built from common language and terminology that non-programmers and subject matter experts alike can readily embrace. Finally, some is a preference for Python to wrap legacy apps and to provide a productive scripting environment for DSLs.
If one can step back a bit and realize there are some common threads to the principles behind RESTful Web services and linked data, that very same mindset can be applied to many other architectural and design issues. For us, at Zitgist, these realizations have been like turning on a very bright light. We can see clearly now, and it is pretty UMBELievable. These are indeed exciting times.
BTW, I would like to thank Eric Hoffer for the very clever play on words with the UMBELievable tag line. Thanks, Eric, you rock!
I have been a consistently vocal critic of “Web 3.0″ as a moniker for current Web trends, specifically as some sort of branding replacement for the semantic Web. My specific postings on the term Web 3.0, likely with lame attempts at derision, are on record as:
Then, in relation to an article on ReadWriteWeb regarding a keynote on semantics and advertising at last week’s Web 3.0 Conference & Expo in Santa Clara, CA, I added a comment joining others criticizing the use of version numbers to describe the semantic Web. My comment was simply: “Please, Squash that Web 3.0 Cockroach (see http://www.mkbergman.com/?p=406).”
I count Dan as a colleague and a friend and do not want to engage in a flame war across multiple sites. So, herein, I address his rhetorical question, “How shall we call it [Web 3.0] instead, Mike? Please indulge us.”
It just so happens I have been writing on this very topic for quite some time.
(BTW, while others commented on the proper role or not of semantics and advertising, that is not my issue. I have never criticized advertising or marketing and will not do so with respect to semantic Web techniques, either. I fully expect semantic technologies to be applied everywhere appropriate and where they can work, which most certainly includes better targeting of ads and messages.)
So, what is my problem with the use of Web 3.0 for semantic Web-related trends and activities?
Simply answered, it is because Web 3.0 means nothing.
It can mean anything or everything depending on who is pushing the idea. I dare anyone who is pushing this term to find a consensus understanding or authoritative or “official” understanding or grounding in language or usage or any other informing basis for what this term means.
I find it ironic that the semantic Web, which is at heart about meaning and data interoperability, could potentially accept a naming or branding that is itself meaningless. Simply answered, that is my issue and has been since the term Web 3.0 was first floated.
Moreover, rejection of the term Web 3.0 does not carry with it the logical fallacy of therefore not wanting simpler ways to explain semantic Web-related concepts and technologies to the broader public. Nor does rejection of the term Web 3.0 carry with it the logical fallacy of not wanting effective branding, effective terminology or effective business models.
What rejection of the term Web 3.0 does mean is that we can and should do better in conveying what our collective endeavor is really all about. And with meaning.
I have two answers to Dan’s rhetorical question: structured Web and linked data. I further place both of these concepts into context.
Many have given their own bona fide attempts at describing semantic Web aliases and where they stand within a development continuum. Some of this is the natural attempt by some to want to “name” a space and time. Some of it is undoubtedly a reaction to the glazed look many in the public get when “semantic Web” is first put before them. Some of it is likely a reaction to the fact that the semantic Web has been slow to develop or has not yet met its initial promise and (perhaps) hype.
Over the past two years, I have pointed to two concepts as part of an ongoing continuum eventually leading to the semantic Web. These are the broader and bridging structured Web, which is currently seeing one expression as linked data. I first tried to capture this continuum in a diagram from July 2007:
|Document Web||Structured Web||Semantic Web|
To further a common language, I have also put forward my own working definitions of these concepts:
I have repeatedly discussed these themes and ideas since I first criticised attempts at the Web 3.0 branding:
Those entries marked with an asterisk [*] are the most central ones dealing with either structured Web or linked data as terminology.
Over two years, about one in six of my blog posts has been devoted strictly to meaning and terminology. I agree proper branding for our collective endeavor is very important. In the end, it is not important whether my views hold sway, but that we are able to effectively explain and sell our products and services. There is still considerable outreach and communication required with the marketplace.
Amongst the two concepts, I prefer the terminology of structured Web because it seems to be readily understood and appreciated within enterprise clients. But, we have also embraced linked data because it was gaining mindshare and conveys (imo) a concrete and correct image.
Terminology adoption is a function of both providers and consumers. Consumers vote with their attention and their wallets; providers through their branding and positioning, which naturally should be attentive to what is resonant in the marketplace.
Right now, in the emerging markets around the semantic Web and its related technologies expressing the current evolution of this continuum, the naming issues are still largely in the purview of the providers. Consumers are still few. I think most would agree that terminology is still unsettled.
It is legitimate to question whether the provider community is doing itself a disservice using ‘semantic Web’ as a hook. It is healthy to seek better terminology if it can be found. Any attempt to find clear and compelling language is to be applauded.
But, insofar as we are selling improved meaning and data interoperability, let’s find language that conveys those advantages. Web 3.0 directly contradicts this fundamental message by offering no meaning in its vacuity.
Well, come on now, let’s do smile a bit. There are worse things than calling the term Web 3.0 a cockroach. After all, cockroaches have existed for 340 million years well in advance of humans and can be found everywhere. More than 5,000 species have been identified. Some claim cockroaches will be here long after humans are gone.
Still, as for me, I’m hoping the term Web 3.0 is squashed well in advance of that time. Meanwhile, I appreciate the invitation to indulge in meaningful branding and terminology.
An earlier popular entry of this AI3 blog was “99 Wikipedia Sources Aiding the Semantic Web”. Each academic paper or research article in that compilation was based on Wikipedia for semantic Web-related research. Many of you suggested additions to that listing. Thanks!
Wikipedia continues to be an effective and unique source for many information extraction and semantic Web purposes. Recently, I needed to update my own research and found that many valuable new papers have been added to the literature.
I thus decided to make a compilation of such papers a permanent feature — which I’ve named SWEETpedia — and to update it on a periodic basis. You can now find the most recent version under the permanent SWEETpedia page link.
Hint, hint: Check out this link to see the 163 Wikipedia research sources!
NOTE: If you know of a paper that I’ve overlooked, please suggest it as a comment to this posting and I will add it to the next update.
For starters, it summarizes the size and status of the English-version Wikipedia with a more discerning eye than usual:
|Articles and related pages||5,460,000|
|lists and stubs||620,000|
|between category and subcategory||740,000|
|between category and article||7,270,000|
The size, scope and structure of Wikipedia make it an unprecedented resource for researchers engaged in natural language processing (NLP), information extraction (IE) and semantic Web-related tasks. Further, the more than 250 language versions of Wikipedia also make it a great resource for multi-lingual and translation studies.
In the eight months since posting the semantic Web-related research papers using Wikipedia, my new SWEETpedia listing has grown by about 65%. There are now 63 new papers, bringing the total to 163.
Of course, these are not the only academic papers published about or using Wikipedia. The SWEETpedia listing is specifically related to structure, term, or semantic extractions from Wikipedia. Other research about frequency of updates or collaboration or growth or comparisons with standard encyclopedias may also be found under Wikipedia’s own listing of academic studies.
This graph indicates the growth in use of Wikipedia as a source of semantic Web research. It is hard to tell if the effort is plateauing or not; the apparent slight dip in 2008 is too early to yet conclude that.
For example, the current SWEETpedia listing adds another 35% more listings for 2007 to the earlier records. It is likely many 2008 papers will also be discovered later in 2009. Many of the venues at which these papers get presented can be somewhat obscure, and new researchers keep entering the field.
However, we can conclude that Wikipedia is assuming a role in semantic Web and natural language research never before seen for other frameworks.
As noted, the new 82-page technical report by Olena Medelyan et al. from the University of Waikato in New Zealand, Mining Meaning from Wikipedia , is now the must-have reference for all things related to the use of Wikipedia for semantic Web and natural language research.
Olena and her co-authors, Catherine Legg, David Milne and Ian Witten, have each published much in this field and were some of the earliest researchers tapping into the wealth of Wikipedia.
They first note the many uses to which Wikipedia is now being put:
These type of uses then enable the authors to place various research efforts and papers into context. They do so via four major clusters of relevant tasks related to language processing and the semantic Web:
There are many interesting observations throughout this report. There are also useful links to related tools, supporting and annotated datasets, and key researchers in the field.
I highly recommend this report as the essential starting point for anyone first getting into these research topics. Many of the newly added references to the SWEETpedia listing arose from this report. Reading the report is useful grounding to know where to look for specific papers in a given task area.
Though clearly the authors have their own perspectives and research emphases, they do an admirable job of being complete and even-handed in their coverage. Basic review reports such as this play an important role in helping to focus new research and make it productive.
Excellent job, folks! And, thanks!