Posted:October 19, 2008

“How Shall We Call [Web 3.0] Instead, Mike? Please Indulge Us”

The Structured Web and Linked Data are Leading to the Semantic Web

I have been a consistently vocal critic of “Web 3.0” as a moniker for current Web trends, specifically as some sort of branding replacement for the semantic Web. My specific postings on the term Web 3.0, likely with lame attempts at derision, are on record as:

June 27, 2006 — Hey, Web 98.6, using the song to highlight the silliness of version “creep” and one upsmanship
October 5, 2007 — Please, Squash that Web 3.0 Cockroach from a bit over a year ago.

Then, in relation to an article on ReadWriteWeb regarding a keynote on semantics and advertising at last week’s Web 3.0 Conference & Expo in Santa Clara, CA, I added a comment joining others criticizing the use of version numbers to describe the semantic Web. My comment was simply: “Please, Squash that Web 3.0 Cockroach (see https://www.mkbergman.com/?p=406).”

Dan Grigorovici, one of the conference organizers, reacted strongly and negatively, both in a series of comments to the RWW posting and on his own blog.

I count Dan as a colleague and a friend and do not want to engage in a flame war across multiple sites. So, herein, I address his rhetorical question, “How shall we call it [Web 3.0] instead, Mike? Please indulge us.”

It just so happens I have been writing on this very topic for quite some time.

(BTW, while others commented on the proper role or not of semantics and advertising, that is not my issue. I have never criticized advertising or marketing and will not do so with respect to semantic Web techniques, either. I fully expect semantic technologies to be applied everywhere appropriate and where they can work, which most certainly includes better targeting of ads and messages.)

Getting Serious: Meaning is Important

So, what is my problem with the use of Web 3.0 for semantic Web-related trends and activities?

Simply answered, it is because Web 3.0 means nothing.

It can mean anything or everything depending on who is pushing the idea. I dare anyone who is pushing this term to find a consensus understanding or authoritative or “official” understanding or grounding in language or usage or any other informing basis for what this term means.

I find it ironic that the semantic Web, which is at heart about meaning and data interoperability, could potentially accept a naming or branding that is itself meaningless. Simply answered, that is my issue and has been since the term Web 3.0 was first floated.

Moreover, rejection of the term Web 3.0 does not carry with it the logical fallacy of therefore not wanting simpler ways to explain semantic Web-related concepts and technologies to the broader public. Nor does rejection of the term Web 3.0 carry with it the logical fallacy of not wanting effective branding, effective terminology or effective business models.

What rejection of the term Web 3.0 does mean is that we can and should do better in conveying what our collective endeavor is really all about. And with meaning.

So, Please Indulge Us

I have two answers to Dan’s rhetorical question: structured Web and linked data. I further place both of these concepts into context.

Many have given their own bona fide attempts at describing semantic Web aliases and where they stand within a development continuum. Some of this is the natural attempt by some to want to “name” a space and time. Some of it is undoubtedly a reaction to the glazed look many in the public get when “semantic Web” is first put before them. Some of it is likely a reaction to the fact that the semantic Web has been slow to develop or has not yet met its initial promise and (perhaps) hype.

Over the past two years, I have pointed to two concepts as part of an ongoing continuum eventually leading to the semantic Web. These are the broader and bridging structured Web, which is currently seeing one expression as linked data. I first tried to capture this continuum in a diagram from July 2007:


Document Web	Structured Web		Semantic Web
		Linked Data
Document-centric Document resources Unstructured data and semi-structured data HTML URL-centric circa 1993	Data-centric Structured data Semi-structured data and structured data XML, JSON, RDF, etc URI-centric circa 2003	Data-centric Linked data Semi-structured data and structured data RDF, RDF-S URI-centric circa 2006	Data-centric Linked data Semi-structured data and structured data RDF, RDF-S, OWL URI-centric circa ???

To further a common language, I have also put forward my own working definitions of these concepts:

Structured Web is object-level data within Internet documents and databases that can be extracted, converted from available forms, represented in standard ways, shared, re-purposed, combined, viewed, analyzed and qualified without respect to originating form or provenance
Linked Data is a set of best practices for publishing and deploying instance and class data using the RDF data model, naming the data objects using uniform resource identifiers (URIs), and exposing the data for access via the HTTP protocol, while emphasizing data interconnections, interrelationships and context useful to both humans and machine agents.

I have repeatedly discussed these themes and ideas since I first criticised attempts at the Web 3.0 branding:

October 12, 2008 — WOA: A New Enterprise Partner for Linked Data
July 25, 2008 — When is Content Coherent?
June 23, 2008 — [*] What is Linked Data?
May 23, 2008 — Structured Web v. Semantic Web?
May 6, 2008 — The Semantics of Context
March 12, 2008 — The Shaky Semantics of the Semantic Web
February 3, 2007 — Linked Data Comes of Age
July 22, 2007 — [*] More Structure, More Terminology and (hopefully) More Clarity
July 18, 2007 — [*] What is the Structured Web?
May 3, 2007 — Structure Paves the Way to the Semantic Web
April 23, 2007 — Structurizing the Web with RDF
April 5, 2007 — What’s in a Name?
April 2, 2007 — Did You Blink? The Structured Web Just Arrived
December 12, 2006 — The Pragmatic Semantic Web

Those entries marked with an asterisk [*] are the most central ones dealing with either structured Web or linked data as terminology.

Over two years, about one in six of my blog posts has been devoted strictly to meaning and terminology. I agree proper branding for our collective endeavor is very important. In the end, it is not important whether my views hold sway, but that we are able to effectively explain and sell our products and services. There is still considerable outreach and communication required with the marketplace.

The Markets will Decide

Amongst the two concepts, I prefer the terminology of structured Web because it seems to be readily understood and appreciated within enterprise clients. But, we have also embraced linked data because it was gaining mindshare and conveys (imo) a concrete and correct image.

Terminology adoption is a function of both providers and consumers. Consumers vote with their attention and their wallets; providers through their branding and positioning, which naturally should be attentive to what is resonant in the marketplace.

Right now, in the emerging markets around the semantic Web and its related technologies expressing the current evolution of this continuum, the naming issues are still largely in the purview of the providers. Consumers are still few. I think most would agree that terminology is still unsettled.

It is legitimate to question whether the provider community is doing itself a disservice using ‘semantic Web’ as a hook. It is healthy to seek better terminology if it can be found. Any attempt to find clear and compelling language is to be applauded.

But, insofar as we are selling improved meaning and data interoperability, let’s find language that conveys those advantages. Web 3.0 directly contradicts this fundamental message by offering no meaning in its vacuity.

Don’t Crush that Dwarf; Hand Me the Pliers

Well, come on now, let’s do smile a bit. There are worse things than calling the term Web 3.0 a cockroach. After all, cockroaches have existed for 340 million years well in advance of humans and can be found everywhere. More than 5,000 species have been identified. Some claim cockroaches will be here long after humans are gone.

Still, as for me, I’m hoping the term Web 3.0 is squashed well in advance of that time. Meanwhile, I appreciate the invitation to indulge in meaningful branding and terminology.

Posted:October 15, 2008

Research Shows Natural Fit between Wikipedia and Semantic Web

SWEETpedia Listing of 163 Research Articles; NZ Technical Report Affirm Trend

An earlier popular entry of this AI3 blog was “99 Wikipedia Sources Aiding the Semantic Web”. Each academic paper or research article in that compilation was based on Wikipedia for semantic Web-related research. Many of you suggested additions to that listing. Thanks!

Wikipedia continues to be an effective and unique source for many information extraction and semantic Web purposes. Recently, I needed to update my own research and found that many valuable new papers have been added to the literature.

I thus decided to make a compilation of such papers a permanent feature — which I’ve named SWEETpedia — and to update it on a periodic basis. You can now find the most recent version under the permanent SWEETpedia page link.

Hint, hint: Check out this link to see the 163 Wikipedia research sources!

NOTE: If you know of a paper that I’ve overlooked, please suggest it as a comment to this posting and I will add it to the next update.

Status of Wikipedia

Meanwhile, a complementary technical report, Mining Meaning from Wikipedia [1], was just released from the University of Waikato in New Zealand. It is a fantastic resource for anyone in this field.

For starters, it summarizes the size and status of the English-version Wikipedia with a more discerning eye than usual:

Categories	390,000
Articles and related pages	5,460,000
redirects	2,970,000
disambiguation pages	110,000
lists and stubs	620,000
bona-fide articles	1,760,000
Templates	174,000
infoboxes	9,000
other	165,000
Links
between articles	62,000,000
between category and subcategory	740,000
between category and article	7,270,000

The size, scope and structure of Wikipedia make it an unprecedented resource for researchers engaged in natural language processing (NLP), information extraction (IE) and semantic Web-related tasks. Further, the more than 250 language versions of Wikipedia also make it a great resource for multi-lingual and translation studies.

Growth of SWEETpedia

In the eight months since posting the semantic Web-related research papers using Wikipedia, my new SWEETpedia listing has grown by about 65%. There are now 63 new papers, bringing the total to 163.

Of course, these are not the only academic papers published about or using Wikipedia. The SWEETpedia listing is specifically related to structure, term, or semantic extractions from Wikipedia. Other research about frequency of updates or collaboration or growth or comparisons with standard encyclopedias may also be found under Wikipedia’s own listing of academic studies.

This graph indicates the growth in use of Wikipedia as a source of semantic Web research. It is hard to tell if the effort is plateauing or not; the apparent slight dip in 2008 is too early to yet conclude that.

For example, the current SWEETpedia listing adds another 35% more listings for 2007 to the earlier records. It is likely many 2008 papers will also be discovered later in 2009. Many of the venues at which these papers get presented can be somewhat obscure, and new researchers keep entering the field.

However, we can conclude that Wikipedia is assuming a role in semantic Web and natural language research never before seen for other frameworks.

Kinds of Semantic Web-related Research

As noted, the new 82-page technical report by Olena Medelyan et al. from the University of Waikato in New Zealand, Mining Meaning from Wikipedia [1], is now the must-have reference for all things related to the use of Wikipedia for semantic Web and natural language research.

Olena and her co-authors, Catherine Legg, David Milne and Ian Witten, have each published much in this field and were some of the earliest researchers tapping into the wealth of Wikipedia.

They first note the many uses to which Wikipedia is now being put:

Wikipedia as an encyclopedia — the standard use familiar to the general public
Wikipedia as corpus — large text collections for testing and modeling NLP tasks
Wikipedia as a thesaurus — equivalent and hierarchical relationships between terms and related or synoymous terms
Wikipedia as a database — the extraction and codification of structure and structural relationships
Wikipedia as an ontology — the formal expression of relationships in semantic Web and logical constructs, and
Wikipedia as a network structure — relationship analysis and mining through Wikipedia’s representation as a network graph.

These type of uses then enable the authors to place various research efforts and papers into context. They do so via four major clusters of relevant tasks related to language processing and the semantic Web:

Natural Language Processing (NLP) Tasks:
Semantic relatedness
Word sense disambiguation
words and phrases
named entities
thesaurus and ontology terms
Co-reference resolution
Multilingual alignment

Information Retrieval Tasks:
Query expansion
Multilingual retrieval
Question answering
Entity ranking
Text categorization
Topic indexing

Information Extraction (IE) Tasks:
Semantic relations in raw (unstructured) text
Semantic relations in structure
Typing (classifying) named entities

Ontology Building Tasks:
Knowledge organization
Named entities
Thesaurus information
Ontology alignment
Facts extraction and assertion

There are many interesting observations throughout this report. There are also useful links to related tools, supporting and annotated datasets, and key researchers in the field.

I highly recommend this report as the essential starting point for anyone first getting into these research topics. Many of the newly added references to the SWEETpedia listing arose from this report. Reading the report is useful grounding to know where to look for specific papers in a given task area.

Though clearly the authors have their own perspectives and research emphases, they do an admirable job of being complete and even-handed in their coverage. Basic review reports such as this play an important role in helping to focus new research and make it productive.

Excellent job, folks! And, thanks!

[1] Olena Medelyan, Catherine Legg, David Milne and Ian H. Witten, 2008. Mining Meaning from Wikipedia, Working Paper Series ISSN 1177-777X, Department of Computer Science, The University of Waikato (New Zealand), September 2008, 82 pp. See http://arxiv.org/ftp/arxiv/papers/0809/0809.4530.pdf.

Posted:October 12, 2008

WOA: A New Enterprise Partner for Linked Data

Web-Oriented Architecture (and REST) is Gaining Enterprise Mindshare

Nick Gall, a VP of Gartner, first coined the TLA (three-letter acronym) for WOA (Web-oriented architecture) in late 2005. In further describing it, Nick simply defines it as:

WOA = SOA + WWW + REST

In the longer version, Nick describes WOA as based on the architecture of the Web that he further characterizes as “globally linked, decentralized, and [with] uniform intermediary processing of application state via self-describing messages.”

WOA is a subset of the service-oriented architectural style. He describes SOA as comprising discrete functions that are packaged into modular and shareable elements (“services”) that are made available in a distributed and loosely coupled manner.

Representational state transfer (REST) is an architectural style for distributed hypermedia systems such as the World Wide Web. It was named and defined in Roy Fielding‘s 2000 doctoral thesis; Roy is also one of the principal authors of the Hypertext Transfer Protocol (HTTP) specification.

REST provides principles for how resources are defined and used and addressed with simple interfaces without additional messaging layers such as SOAP or RPC. The principles are couched within the framework of a generalized architectural style and are not limited to the Web, though they are a foundation to it.

REST and WOA stand in contrast to earlier Web service styles that are often known by the WS* acronym (such as WSDL, etc.). (Much has been written on RESTful Web services v. “big” WS*-based ones; one of my own postings goes back to an interview with Tim Bray back in November 2006.)

While there are dozens of well-known methods for connecting distributed systems together, protocols based on HTTP will be the ones that stand the test of time. And since HTTP is the fundamental protocol of the Web, those protocols most closely aligned with its essential nature will likely be the most successful.

— Dion Hinchliffe [2]

Shortly after Nick coined the WOA acronym, REST luminaries such as Sam Ruby gave the meme some airplay [1]. From an enterprise and client perspective, Dion Hinchliffe in particular has expanded and written extensively on WOA. Besides his own blog, Dion has also discussed WOA several times on his Enterprise Web 2.0 blog for ZDNet.

Largely due to these efforts (and — some would claim — the difficulties associated with earlier WS* Web services) enterprises are paying much greater heed to WOA. It is increasingly being blogged about and highlighted at enterprise conferences [3].

While exciting, that is not what is most important in my view. What is important is that the natural connection between WOA and linked data is now beginning to be made.

Analogies to Linked Data

Linked data is a set of best practices for publishing and deploying data on the Web using the RDF data model. The data objects are named using Web uniform resource identifiers (URIs), emphasize data interconnections, and adhere to REST principles.

Most recently, Nick began picking up the theme of linked data on his new Gartner blog. Enterprises now appreciate the value of an emerging service aspect based on HTTP and accessible by URIs. The idea is jelling that enterprises can now process linked data architected in the same manner.

I think the similar perspectives between REST Web services and linked data become a very natural and easily digested concept for enterprise IT architects. This is a receptive audience because it is these same individuals who have experienced first-hand the challenges and failures of past hype and complexity from non-RESTful designs.

It helps immensely, of course, that we can now look at the major Web players such as Google and Amazon and others — not to mention the overall success of the Web itself — to validate the architecture and associated protocols for the Web. The Web is now understood as the largest Machine designed by humans and one that has been operational every second of its existence.

Many of the same internal enterprise arguments that are being made in support of WOA as a service architecture can be applied to linked data as a data framework. For example, look at Dion’s 12 Things You Should Know About REST and WOA and see how most of the points can be readily adopted to linked data.

So, enterprise thought leaders are moving closer to what we now see as the reality and scalability of the Web done right. They are getting close, but there is still one piece missing.

False Dichotomies

I admit that I have sometimes tended to think of enterprise systems as distinct from the public Web. And, for sure, there are real and important distinctions. But from an architecture and design perspective, enterprises have much to learn from the Web’s success.

With the Web we see the advantages of a simple design, of universal identifiers, of idempotent operations, of simple messaging, of distributed and modular services, of simple interfaces, and, frankly, of openness and decentralization. The core foundations of HTTP and adherence to REST principles have led to a system of such scale and innovation and (growing) ubiquity as to defy belief.

So, the first observation is that the Web will be the central computing touchstone and framework for all computing systems for the foreseeable future. There simply is no question that interoperating with the Web is now an enterprise imperative. This truth has been evident for some time.

But the reciprocal truth is that these realities are themselves a direct outcome of the Web’s architecture and basic protocol, HTTP. The false dichotomy of enterprise systems as being distinct from the Web arises from seeing the Web solely as a phenomenon and not as one whose basic success should be giving us lessons in architecture and design.

Thus, we first saw the emergence of Web services as an important enteprise thrust — we wanted to be on the Web. But that was not initially undertaken consistent with Web design — which is REST or WOA — but rather as another “layer” in the historical way of doing enterprise IT. We were not of the Web. As the error of that approach became evident, we began to see the trend toward “true” Web services that are now consonant with the architecture and design of the actual Web.

So, why should these same lessons and principles not apply as well to data? And, of course, they do.

If there is one area that enterprises have been abject failures in for more than 30 years it is data interoperability. ETL and enterprise busses and all sorts of complex data warehousing and EAI and EIA mumbo jumbo have kept many vendors fat and happy, but few enterprise customers so. On almost every single dimension, these failed systems have violated the basic principles now in force on the Web based on simplicity, uniform interfaces, etc.

The Starting Foundation: HTTP 1.1

OK, so how many of you have read the HTTP specifications [4]? How many understand them? What do you think the fundamental operational and architectural and design basis of the Web is?

HTTP is often described as a communications protocol, but it really is much more. It represents the operating system of the Web as well as the embodiment of a design philosophy and architecture. Within its specification lies the secret of the Web’s success. REST and WOA quite possibly require nothing more to understand than the HTTP specification.

Of course, the HTTP specification is not the end of the story, just the essential beginning for adaptive design. Other specifications and systems layer upon this foundation. But, the key point is that if you can be cool with HTTP, you are doing it right to be a Web actor. And being a cool Web actor means you will meet many other cool actors and be around for a long, long time to come.

Concept “Routers” for Information

An understanding of HTTP can provide similar insights with respect to data and data interoperability. Indeed, the fancy name of linked data is nothing more than data on the Web done right — that is, according to the HTTP specifications.

Just as packets need their routers to get to their proper location based on resolving the names of a URI to a physical device, data or information on the Web needs similar context. And, one mechanism by which such context can be provided is through some form of logical referencing framework by which information can be routed to its right “neighborhood”.

I am not speaking of routing to physical locations now, but the routing to the logical locations about what information “is about” and what it “means”. On the simple level of language, a dictionary provides such a function by giving us the definition of what a word “means”. Similar coherent and contextual frameworks can be designed for any information requirement and scope.

Of course, enterprises have been doing similar things internally for years by adopting common vocabularies and the like. Relational data schema are one such framework even if they are not always codified or understood by their enterprises as such.

Over the past decade or two we have seen trade and industry associations and standards bodies, among others, extend these ideas of common vocabularies and information structures such as taxonomies and metadata to work across enterprises. This investment is meaningful and can be quite easily leveraged.

As Nick notes, efforts such as what surrounds XBRL are one vocabulary that can help provide this “routing” in the context of financial data and reporting. So, too, can UMBEL as a general reference framework of 20,000 subject concepts. Indeed, our unveiling of the recent LOD constellation points to a growing set of vocabularies and classes available for such contexts. Literally thousands and thousands of such existing structures can be converted to Web-compliant linked data to provide the information routing hubs necessary for global interoperability.

And, so now we come down to that missing piece. Once we add context as the third leg to this framework stool to provide semantic grounding, I think we are now seeing the full formula powerfully emerge for the semantic Web:

SW = WOA + linked data + coherent context

This simple formula becomes a very powerful combination.

Just as older legacy systems can be exposed as Web services, and older Web services can be turned into WOA ones compliant with the Web’s architecture, we can transition our data in similar ways.

The Web has been pointing us to adaptive design for both services and data since its inception. It is time to finally pay attention.

[1] Sam and his co-author Leonard Richardson of RESTful Web Services (O’Reilly Media Inc., 446 pp, May 2007; ISBN 0596529260) have preferred the label ROA, for Resource-oriented Architecture.

[2] D. Hinchcliffe, “A Search for REST Frameworks for Exploring WOA Patterns — And Current Speaking Schedule”, Sept. 10, 2006; see http://hinchcliffe.org/archive/2006/09/10/9275.aspx

[3] The Linked Data community should pay much closer attention to existing and well-attended enterprise conferences in which the topic can be inserted as a natural complement rather than trying to start entire new venues.

[4] The current specification is RFC 2616 (June 1999), which defines HTTP/1.1; see http://tools.ietf.org/html/rfc2616. For those wanting an easier printed copy, a good source in PDF is http://www.faqs.org/ftp/rfc/rfc2616.pdf.

Posted:October 9, 2008

New Currents in the ‘Deep Web’

Timelines, Semantics and Ontologies are Coming to the Fore

The past two weeks have seen an interesting emergence of new perspectives on the ‘deep Web‘. The deep Web, a term Thane Paulsen and I coined for my oft-quoted study from 2000, The Deep Web: Surfacing Hidden Value [1], is the phenomenon of database-backed content served from interactive Web search forms.

Because deep Web content is dynamic and produced only on request, it has been difficult for traditional search engines to index. It is also huge and of high quality (though likely not the 100x to 500x figure larger than the standard ‘surface’ Web that I used in that first study.)

Deep Web Timeline

This is the most recent of the three notable events over the past two weeks, and came out on Tuesday. Maureen Flynn-Burhoe of the oceanflynn @ Digg blog has produced a very informative and comprehensive timeline of deep Web and related developments from 1980 to the present (database-backed content and early Web precursors, of course, precede the Web itself and the term ‘deep Web’).

I have been directly involved in this field since 1994 and have not yet seen such a comprehensive treatment. She cites studies noting “hundreds of thousands” of deep Web sites and the faster growth of dynamic (database-served) as opposed to static (‘surface’) content on the Web.

As someone directly involved in estimating the size of the deep Web, I appreciate the analytic difficulties and take all of the estimates (my own older ones included!) with a grain of salt. Nonetheless, the deep Web is important, its content is huge, often of unique and high quality, and it deserves serious attention by Web scientists.

Great job, Maureen! I always appreciate thorough researchers. (BTW, I suspect you might also like the Timeline of Information History.)

Trends and Role in the Semantic Web

The next notable event was the publishing of Searching the Deep Web by Alex Wright in the Communications of the ACM (October 2008) [2]. Alex had first written about the deep Web for Salon magazine in 2004 and had given nice attention to my company at that time, BrightPlanet [3].

In this current update, Alex does an excellent job of characterizing current status and research in search techniques for the deep Web. I also liked the fact he used our fishing analogy of trawling for standard search crawlers versus direct angling in the deep Web (see our earlier figure at upper left).

As some may recall, Google has stepped up its activities in this area, an event I reported on a few months back. Those perspectives, and others from some other notable figures, are included in Alex’s piece as well.

My own contribution to the piece was to suggest that RDF and semantic Web approaches offered the next evolutionary stage in deep Web searching. Alex was able to take that theme and get some great perspectives on it. I also appreciate the accuracy of my quotes, which gives me confidence in the quality for the rest of the story.

Without a doubt there is high quality in the deep Web and bringing structure and semantic characterization to it through metadata is a task of some consequence.

For myself, I chose to move beyond the deep Web when its focus seemed stuck in a document-level perspective to retrieval and analysis. However, there is much to be learned from the techniques used to select and access deep Web content, which could be readily transferable to linked data.

Thanks, Alex, for making these prospects clearer! Maybe it is time to dust off some of my old stuff!

Getting Deeper into the Semantics

This emerging joining of deep Web and semantics is actually taking place through the efforts of a number of academic researchers. Recently and prominently has been James Geller from the New Jersey Institute of Technology and his colleagues Soon Ae Chun and Yoo Jung [4]. Their recently published paper, Toward the Semantic Deep Web , shows how ontologies and semantic Web constructs can be combined to more effectively extract information from the deep Web. They call this combination the ‘semantic deep Web.’

The authors posit that the structured roots of deep Web content lend themselves to better ontology learning from the Web. They also point to the usefulness of deep Web structure to annotations.

That such confluences are occurring between the semantic and deep “Webs” is a function of focused academic attention and the growing maturity of both perspectives. This year, for example, saw the inauguration of the first Workshop on Advances in Accessing Deep Web (ADW 2008). As part of the International Conference on Business Information Systems (BIS 2008), this meeting saw a lot of elbow rubbing with semantic Web and enterprise topics.

It might seem strange (indeed, sometimes it does to me 😉 ) to envision structured database content being served through a Web form and then converted via ontologies and other means to semantic Web formats. After all, why not go direct to the data?

And, of course, direct conversion is less lossy and more efficient.

But, one interesting point is that semantic Web techniques are increasingly working as a structure-extraction layer wrapping the standard Web. In that regard, starting with inherently structured source data — that is, the deep Web — can lead to higher quality inputs across the distributed, heterogeneous content of the Web.

Given the impossibility of everyone starting with the same premises and speaking the same languages and concepts, semantic Web mediation methods offer a way to overcome the Tower of Babel. And, when the starting content itself is inherently structured and (generally) of higher quality — that is, the deep Web — the logic of the combination becomes more obvious.

For More Information

Interested in learning more about the deep Web? I firstly recommend the resources posted at the bottom of Flynn-Burhoe’s timeline. And, for a very thorough treatment, I also recommend Denis Shestakov’s Ph.D. thesis from earlier this year [5]. It has a bibliography of some 115 references.

[1] Michael K. Bergman, 2001. The Deep Web: Surfacing Hidden Value, Journal of Electronic Publishing. 7:1. Note, this publication was an update of an internal BrightPlanet study first published on July 26, 2000.

[2] Alex Wright, 2008. “Searching the Deep Web,” in Communications of the ACM, pp. 14-15, October 2008. See http://mags.acm.org/communications/200810/?CFID=5461527&CFTOKEN=11076271.

[3] Alex is also the author of GLUT: Mastering Information Through the Ages (Joseph Henry Press, 296 pp., July 2007; ISBN 0309102383). BTW, I had earlier reviewed this book with some criticisms, which should go a long way to prove Alex’s fairness and chops as a journalist.

[4] James Geller, Soon Ae Chun and Yoo Jung, 2008. “Toward the Semantic Deep Web,” in Computer, vol. 41, no. 9, pp. 95-97, Sept., 2008. See

[5] Dennis Shestakov, 2008. Search Interfaces on the Web: Querying and Characterizing, PhD. dissertation from the University of Turku Centre for Computer Science, Finland, 153 pp., May 2008.

Posted:October 5, 2008

A New Constellation in the Linking Open Data (LOD) Sky

Class-level Mappings Now Generalize Semantic Web Connectivity

We are pleased to present a complementary view to the now-famous linking open data (LOD) cloud diagram (shown to the left; click on it for a full-sized view) [1]. This new diagram (shown below) — what we call the LOD constellation to distinguish it from its notable LOD cloud sibling — presents the current class-level structure within LOD datasets.

This new LOD constellation complements the instance-level view in the LOD cloud that has been the dominant perspective to date. The LOD cloud centrally hubs around DBpedia, the linked data structured representation of Wikipedia. The connections shown in the cloud diagram mostly reflect owl:sameAs relations, which means that the same individual things or instances are referenced and then linked between the datasets. Across all datasets, linking open data (LOD) now comprises some two billion RDF triples, which are interlinked by around 3 million RDF links [2]. This instance-level view of the LOD cloud shown to the left was updated a bit over a week ago [1].

The objective of the Linking Open Data community is to extend the Web with a data commons by publishing various open datasets as RDF on the Web and by setting RDF links between data items from different data sources. All of the sources on these LOD diagrams are open data [3].

So, Tell me Again Why Classes Are Important?

In prior postings, Fred Giasson and I have explained the phenomenon of ‘exploding the domain‘. Exploding the domain means to make class-to-class mappings between a reference ontology and external ontologies, which allows properties, domains and ranges of applicability to be inherited under appropriate circumstances [4]. Exploding the domain expands inferencing power to this newly mapped information. Importantly, too, exploding the domain also means that instances or individuals that are members of these mapped classes also inherit or assume the structural relations (schema, if you will) of their mapped sources as well.

Trying to think through the statements above, however, is guaranteed to make your head hurt. When just reading the words, these sound like fairly esoteric or abstract ideas.

So, to draw out the distinctions, let’s discuss linked data that is based on instance (individual) mappings versus those mapped on the basis of classes. Like all things, there are exceptions and edge cases, but let us simply premise our example using basic set theory. Our individual instances are truly discrete individuals, in this case some famous dogs, and our classes are the conceptual sets by which these individuals might be characterized.

To make our example simple, we will use two datasets (A and B) about dogs and their possible relations, each characterized by their structure (classes) or their individuals (instances):

	Dataset A (organisms)	Dataset B (pets)
Classes (structure)	mammal canid wolf dog	pet dog breed (list)
Instances (individuals) and class assignments	Rin Tin Tin (dog)	Rin Tin Tin (German shepherd) Lassie (collie) Clifford (Visla) Old Yeller (mutt)

When datasets are linked based on instance mappings alone, as is generally the case with current practice using sameAs, and there are no class mappings, we can say that Rin Tin Tin is both a dog pet and a mammal. However, we can not say that Lassie, for example, is a mammal, because there is no record for Lassie in Dataset A.

So, we thus see our first lesson: to draw an inference about instances using sameAs in the absence of class mappings requires each record (instance) to exist in an external dataset in order to make that assertion. Instances can inherit the properties and structure of the datasets in which they specifically occur, but only there. Thus, what can be said about a given individual (linked via owl:sameAs) is at most the intersection of what is contained in only the datasets in which that individual appears and is mapped. Assertions are thus totally specific and can not be made without the presence of a matching instance record. We can call this scenario the intersection model: only where there is an intersection of matching instance records can the structure of their source datasets be inferred.

However, when mappings can be made at the class level, then inferences can be drawn about all of the members of those sets. By asserting equivalentClass for dog between Datasets A and B, we can now infer that Lassie, Clifford and Old Yeller are canids and mammals as well as Rin Tin Tin, even though their instance records are not part of Dataset A. To complete the closure we can also now infer that Rin Tin Tin (Dataset A) is a pet and a German shepherd from Dataset B. We can call this scenario the union model. The mappings have become generalized and our inferencing power now extends to all instances of mapped classes whether there are records or not for them in other datasets.

This power of generalizability, plus the inheritance of structure, properties and domain and range attributes, is why class mappings are truly essential for the semantic Web. Exploding the domain is real and powerful. Thus, to truly understand the power of linked data, it is necessary to view its entirety from a class perspective [5].

Thus, to summarize our answer to the rhetorical question, class mappings are important because they can:

Generalize the understanding of individual instances
Expand the description of things in the world by inheriting and reusing other class properties, domains and ranges, and
Place and contextualize things by inheriting class structure and hierarchical relationships.

The LOD Constellation

So, here is the new LOD constellation of class-level linkages. The definition of class-level linkages is based on one of four possible predicates (rdfs:subClassOf, owl:equivalentClass, umbel:superClassOf or umbel:isAligned). Because of the newness of UMBEL as a vocabulary, only a few of the sources linked to UMBEL have the umbel:superClassOf relationship and one (bibo) has isAligned.

Note that some of the sources are combined vocabularies (ontologies) and instance representations (e.g., UMBEL, GeoNames), others are strict ontologies (e.g., event, bibo), and still others are ontologies used to characterize distributed instances (e.g., foaf, sioc, doap). Other distinctions might be applied as well:

[click for full size]

The current 21 LOD datasets and ontologies that contribute to these class-level mappings are (with each introduced by its namespace designation):

bibo — Bibilographic ontology
cc — Creative Commons ontology
damltime — Time Zone ontology
doap — Description of a Project ontology
event — Event ontology
foaf — Friend-of-a-Friend ontology
frbr — Functional Requirements for Bibliographic Records
geo — Geo wgs84 ontology
geonames — GeoNames ontology
mo — Music Ontology
opencyc — OpenCyc knowledge base
owl — Web Ontology Language
pim_contact — PIM (personal information management) Contacts ontology
po — Programmes Ontology (BBC)
rss — Really Simple Syndicate (1.0) ontology
sioc — Socially Interlinked Online Communities ontology
sioc_types — SIOC extension
skos — Simple Knowledge Organization System
umbel — Upper Mapping and Binding Exchange Layer ontology
wordnet — WordNet lexical ontology
yandex_foaf — FOAF (Friend-of-a-Friend) Yandex extension ontology

The diagram was programmatically generated using Cytoscape (see below) [6], with some minor adjustments in bubble position to improve layout separation. The bubble sizes are related to number of linked structures (ontologies) to which the node has class linkages. The arrow thicknesses are related to number of linking predicates between the nodes. Two-way arrows are shown as darker and indicate equivalentClass or matching superClassOf and subClassOf; single arrows represent subClassOf relationships only.

Note we are not presenting any rdf:type relations because those are not structural, and rather deal with the assignment of instances to classes [7]. More background is provided in the discussion of the construction methodology [6].

At this time, we have not calculated how many individuals or instances might be directly included in these class-level mappings. The data and files used in constructing this diagram are available for download without restriction [8].

Finally, we have expended significant effort to discover class-level mappings for which we may not be directly aware (see next). Please bring any missing, erroneous or added linkages to our attention. We will be pleased to incorporate those updates into future releases of the diagram.

How the LOD Constellation Was Constructed

Our diligence has not been exhaustive since not all LOD datasets are indexed locally and others do not have SPARQL endpoints. The general method was to query the datasets to check which ontologies used external classes to instantiate their individuals using the rdf:type predicate. The externally referenced ontology was then checked to determine its own external class mappings.

Here is the basic SPARQL query to discover the the rdf:type predicate for non-Virtuoso resources:

select ?o where
{
?s a ?o.
}

And here is the SPARQL query for Virtuoso-hosted datasets (note Virtuoso supports the distinct non-standard extension to SPARQL which is a more efficient way to get listings):

select distinct ?o where
{
?s a ?o.
}

We then created a simple script to go to all of the ontology namespaces so listed in these external mappinigs. If there was an external class mapping in the source with one of the four possible predicates of rdfs:subClassOf, owl:equivalentClass, umbel:superClassOf or umbel:isAligned, we noted the source and predicate and wrote it to a simple CSV (comma delimited) file. This formed the input file to the Cytoscape program that created the network graph [6].

There are possibly oversights and omissions in this first-release diagram since not all bubbles in the LOD cloud were exhaustively inspected. Please notify us with updates or new class linkages. Alternatively, you can also download and modify the diagram yourself [8].

Conspicuous by Their Absence

We gave particular diligence to a few of the more dominant sources in the LOD instance cloud that showed no class mappings. These include DBpedia and YAGO. While these have numerous and useful rdf:type and owl:sameAs relationships, and all have rich internal class structures, none apparently map at the class level to external sources.

However, because of the unique overlap of instances (named entities) in UMBEL, which does have extensive external class mappings, and which have been mapped to DBpedia instances, it is possible to infer some of these external class linkages.

For example, go to the DBpedia SPARQL endpoint:

http://dbpedia.org/sparql/

And try out some sample queries by pasting the following into the Query text box and running the query:

define input:inference ‘http://dbpedia.org/resource/inference/rules/umbel#’
prefix umbel: <http://umbel.org/umbel/sc/>
select ?s
where
{
?s a umbel:Person
}

This example query applies the external class structure of UMBEL to individual person instances in DBpedia because of the prior specification of some mapping rules used for inferencing [9]. The result set is limited to 1000 results.

Alternatively, since UMBEL has also been mapped to the external FOAF ontology, we can also now invoke the FOAF class structure directly to produce the exact same result set (since umbel:Person owl:equivalentClass foaf:Person). We do this by applying the same inferencing rules in a different way:

define input:inference ‘http://dbpedia.org/resource/inference/rules/umbel#’
select ?s
where
{
?s a <http://xmlns.com/foaf/0.1/Person>.
}

UMBEL can thus effectively act as a class structure bridge to the DBpedia instances.

Since DBpedia is an instance hub, this bridging effect is quite effective between UMBEL and other DBpedia instances in the LOD cloud. However, because there is not the same degree of overlap of instances with, say, GeoNames, this technique would be less effective there.

Explicit class-level mappings between datasets will always be more powerful than instance-level ones with class mediators. And, in all cases, both of those techniques that explicitly invoke classes are more powerful than instance-level links alone.

The Linked Data Infrastructure is Now Complete

Though all of the available linkages have not yet been made in the LOD datasets, we can now see that all essential pieces of the linkage infrastructure are in place and ready to be exploited. Of course, new datasets can take advantage of this infrastructure as well.

UMBEL is one of the essential pieces that provides the bridging “glue” to these two perspectives or “worlds” of the instances in the LOD cloud and the classes in the LOD constellation. This “glue” becomes possible because of UMBEL’s unique combination of three components or roles:

UMBEL provides a rich set of 20,000 subject concept classes and their relations (a reference structure “backbone”) that facilitates class-level mappings with virtually any external ontology with the benefits as described above
UMBEL contains a named entity dictionary from Wikipedia also mapped to these classes, which therefore strongly intersects with DBpedia and YAGO, and therefore helps provide the individual instances <–> classes bridging “glue”, and
UMBEL is also a vocabulary that enhances the lightweight SKOS vocabulary to explicitly facilitate linkages to external ontologies at the subject concept layer.

In fact, it is the latter vocabulary sense, in combination with the reference subject concepts, that enables us to draw the LOD class constellation.

So, we can now see a merger of the LOD cloud and the LOD constellation to produce all of the needed parts to the LOD infrastructure for going forward:

A hub of instances (DBpedia)
A hub of subject-oriented (“is about”) reference classes (Cyc-UMBEL), and
A vocabulary for glueing it all together (SKOS-UMBEL).

This infrastructure is ready today and available to be exploited for those who can grasp its game-changing significance.

And, from UMBEL’s standpoint, we can also point to the direct tie-ins to the Cyc knowledge base and structure for conceptual and relationship coherence testing. This infrastructure is an important enabler to extend these powerful frameworks to new domains and for new purposes. But that is a story for another day. 😉

[1] The current version of the LOD cloud may be found at the World Wide Web Consortium’s (W3C) SWEO wiki page. There is also a clickable version of the diagram that will take you to the home references for the consituent data sources in this diagram; see http://www4.wiwiss.fu-berlin.de/bizer/pub/lod-datasets_2008-09-18.html.

[2] According to [1], this estimate was last updated one year ago in October 2007. The numbers today are surely much larger, since the number of datasets has also nearly doubled in the interim.

[3] Open data has many definitions, but a common one with a badge is often seen. However, the best practices of linked data can also be applied to proprietary or intranet information as well; see this FAQ.

[4] For further information about exploding the domain, see these postings: F. Giasson, Exploding the Domain: UMBEL Web Services by Zitgist (April 20, 2008), UMBEL as a Coherent Framework to Support Ontology Development (August 27, 2008), Exploding DBpedia’s Domain using UMBEL (September 4, 2008); M. Bergman, ‘ Exploding the Domain’ in Context (September 24, 2008).

[5] Of course, instance-level mappings with sameAs have value as well, as the usefulness of linked data to date demonstrates. The point is not that class-level mappings are the “right” way to construct linked data. Both instance mappings and class mappings complement one another, in combination bringing both specificity and generalizability. The polemic here is merely to redress today’s common oversight of the class-level perspective.

[6] See the text for how the listing of class relationships was first assembled. After removal of duplicates, a simple comma delimited file was produced, class_level_lod_constellation.csv, with four columns. The first three columns show the subject-predicate-object linking the datasets by the class-level predicate. The fourth column presents the count of the number of types of class-level predicates used between the dataset pairs; the maximum value is 4.

This CVS file is the import basis to Cytoscape. After import, the spring algorithm was applied to set the initial distribution of nodes (bubbles) in the diagram. Styles were developed to approximate the style in the LOD cloud diagram, and each of the class-linkage predicates was assigned different views and arrows. (The scaling of arrow width allowed the chart to be cleaned up with repeat linkages removed and simplified to a single arrow, with the strongest link type used as the remaining assignment. For example, equivalentClass is favored over subClassOf is favored over superClassOf.)

In addition, each node was scaled according to the number of external dataset connections, with the assignments as shown in the file. Prior to finalizing, the node bubbles were adjusted for clear spacing and presentation. The resulting output is provided as the Cytoscape file, lod_class_constellation.cys. For the published diagram, the diagram was also exported as an SVG file.

This SVG file was lastly imported into Inkscape for final clean up. The method for constructing the final diagram, including discussion about how the shading effect was added, is available upon request.

[7] rdf:type is a predicate for assigning an instance to a class, which is often an external class. It is an important source of linkages between datasets at the instance level and provides a structural understanding of the instance within its dataset. While this adds structural richness for instances, rdf:type is by definition not class-level and provides no generalizability.

[8] The CSV file and the Cytoscape files may be downloaded from http://www.umbel.org/lod_constellation.html.

[9] See the explanation of the external linkage file in M. Bergman, ‘ Exploding the Domain’ in Context (September 24, 2008).

Main Links

Search

Author: Mike Bergman