Linked Data follows recommended practices for identifying, exposing and connecting data. A robust Linking Open Data (LOD) community has developed around the practice in the past year, with compliant data now exceeding several billion RDF triples.
Like any new development, there has been the need for best practices to be articulated and documented. Some of the best guides are How to Publish Linked Data on the Web from Chris Bizer, Richard Cyganiak and Tom Heath; Cool URIs for the Semantic Web from Leo Sauermann and Richard Cyganiak; the Linked Data for the Web chapter from Joshua Tauberer; and Deploying Linked Data from OpenLink Software. Also, to see and experience Linked Data just follow Kingsley Idehen’s blog and prolific mailing list postings, almost always with valuable demos and links.
The techniques these documents most often explain deal with such items as exposing and dereferencing URIs, content negotiation, and naming and distinguishing so-called (unfortunately) information resources and non-information resources. The above references cover these topics with good clarity. The general tenor of these guides is on the techniques of exposing and publishing Linked Data.
UMBEL is really a mechanism for aiding the linkage of data, not exposing it per se, so we will leave the discussion of exposing and publishing Linked Data to these other venues. Those other venues deal well with the Data portion of Linked Data. We want to focus here on the Linked portion.
At first blush, it is surprising how little is actually said or written about this linkage portion. For example, in the best-practices guide How to Publish Linked Data there is only minor discussion of external links in Section 2.2, and the sole dedicated discussion of links is limited to Section 6.
To quote in part:
RDF links enable Linked Data browsers and crawlers to navigate between data sources and to discover additional data. The application domain will determine which RDF properties are used as predicates. . . . It is common practice to use the owl:sameAs property for stating that another data source also provides information about a specific non-information resource. An owl:sameAs link indicates that two URI references actually refer to the same thing. . . . RDF links can be set manually, which is usually the case for FOAF profiles, or they can be generated by automated linking algorithms. This approach is usually taken to interlink large datasets.
Upon reflection, though, perhaps less coverage of the linkage portion of Linked Data is not that surprising. The Linked Data practice is barely one year old and, while growing at a most impressive rate, is still in the very earliest phases. Frankly, until recently, there has not been really that much data to Link.
We can see the status of Linked Data via the now-famous Linked Open Data diagram maintained by Richard Cyganiak (see  for the most recent interactive version; this one is current as of the date of posting):
Many have used this figure before (including me) to make general statements about the state of Linked Data. In this post, however, I want to comment on some different aspects.
While new data sources (or bubbles) are being added constantly, I count 43 “mappings” on this diagram (the arrows, and ignoring bi-directional) and 34 different sources (the bubbles). Nineteen of those mappings involve DBpedia, the exposed data of Wikipedia, and 11 involve FOAF.
owl:sameAs relations between datasets are in essence pairwise mappings, similar to how a group of people might toast one another by clinking glasses. The count of these possible pairwise relationships grows as a triangular number, the additive analog of a factorial and a quadratic function of the number of datasets. As new datasets get added, we see a progression of the form 1, 3, 6, 10, 15, 21, 28, 36, etc., representing the number of possible pairwise mappings (“glass clinks”) between datasets.
The actual equation for this progression is n*(n+1)/2, where n is one less than the number of datasets N; equivalently, N datasets allow N*(N-1)/2 possible pairs. Nominally, then, the 34 dataset bubbles could lead to as many as 561 pairwise mappings (34 × 33 / 2). But, again, only 43 are shown.
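The progression can be sketched directly in a few lines (a simple illustration; the counts of 34 bubbles and 43 mappings come from the diagram above):

```python
def possible_mappings(num_datasets: int) -> int:
    """Number of possible pairwise mappings ("glass clinks") among
    num_datasets datasets: the triangular number N*(N-1)/2."""
    return num_datasets * (num_datasets - 1) // 2

# The progression 1, 3, 6, 10, 15, ... for 2, 3, 4, 5, 6, ... datasets:
print([possible_mappings(n) for n in range(2, 10)])

# The 34 dataset bubbles in the LOD diagram allow up to 561 pairwise
# mappings, of which only 43 are actually shown:
print(possible_mappings(34))
```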
Of course, we are still only talking about potential pairwise mappings between datasets, and not the number of actual instance mappings themselves. DBpedia alone contains 1.5 million or so instances.
This progression grows quadratically, or O(n²) in Big-O notation. Now, with instances numbering into the millions compounded across even a few datasets and their pairwise mappings, we are talking about potentially astronomical numbers of linked triples to express.
Yet the actual number of mapped triples is much lower than these potential maximums. The amount of Linked Data remains tractable. Why?
The first and obvious answer is that not all pairwise mappings make sense and not all instances are equivalent (sameAs). This factor will always be true. But it does not alone account for the efficiency.
The second less obvious answer is that certain of the datasets act as reference nodes or hubs. By having them, everything need not be mapped to everything else. We can express linkages as N to one or N to a few, and not the asymptotically growing N-to-N. Newly added datasets may often and for a notable portion of instances only need to be mapped to the reference nodes in order to link their data into the network.
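A toy calculation (not drawn from the actual diagram) makes the difference concrete: linking every dataset to every other grows quadratically, while linking each dataset to one or a few reference hubs grows only linearly.

```python
def mesh_links(n: int) -> int:
    """Every dataset mapped to every other dataset: quadratic growth."""
    return n * (n - 1) // 2

def hub_links(n: int, hubs: int = 1) -> int:
    """Each non-hub dataset mapped only to the hub(s), plus the links
    among the hubs themselves: linear growth in n."""
    return (n - hubs) * hubs + mesh_links(hubs)

# Compare the two topologies as the number of datasets grows:
for n in (10, 34, 100):
    print(n, mesh_links(n), hub_links(n))
```

With 34 datasets, a full mesh would need up to 561 mappings, while a single-hub arrangement needs only 33; the gap widens rapidly as more datasets join.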
DBpedia, with its scope and richness of notable instances, therefore, plays an essential role in Linked Data as expressed to date. Other comprehensive and authoritative sources can act similarly. In this manner, the development of the Linked Data graph may mirror the hub aspects of the existing World Wide Web document graph.
Rich reference nodes acting as hubs appears to be a key to the scalability of the Linked Data Web.
[A publisher exposes Linked Data by making the URI of an RDF resource ("data") accessible via HTTP. When encountered, an agent (such as a browser or crawler) can then dereference this URI to retrieve a description of the resource.
Any attribute or relationship that describes such Linked Data may be accessed at time of retrieval. A relationship defines an external link when either the subject or object of the triple is an external URI. If that resource's URI has also been properly exposed, we can now trace it to still new relations and resources, akin to data surfing (so long as the trail of resources remains exposed). In a parallel analogy to document hyperlinks, some have termed this 'hyperdata'.
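As a concrete sketch of how an agent asks for data rather than a Web page, here is how such a dereferencing request might be formed using only the Python standard library (the URI is hypothetical; a real agent would also follow the 303 redirect that content negotiation typically returns for a non-information resource):

```python
import urllib.request

# A hypothetical Linked Data URI for a non-information resource.
uri = "http://example.org/resource/Quebec_City"

# The agent signals that it wants RDF, not HTML, via the Accept header;
# the server's content negotiation decides which representation to return.
req = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})

# Inspect the header the agent will send; calling
# urllib.request.urlopen(req) would perform the actual retrieval.
print(req.get_header("Accept"))
```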
But that leads to a funny thing. Without this fetch or retrieval, there is generally no explicit publishing or knowledge of the external links for these resources. In other words, these external links are not “publicly” obvious (so to speak), until the Linked Data resource is discovered or stumbled upon. So, while our current recipes give us best practices for how to expose and publish resources (the Data), we have no similar guidance — and, frankly, no practice — for the Linked portion of Linked Data.
To carry the analogy a bit further, while Linked Data is acting to break the barriers of data silos, the relations and linkages of that data remain in those silos until accessed. This may indeed be the proper thing, but somehow it has a feeling of the early years of the document Web before services like Yahoo! began publishing listings of useful links.
Given that the mappings between data sources represent new and often expensive manual or automated effort, it seems like invested value is not being sufficiently shared. Fortunately, there is nothing preventing us from explicitly publishing these mapping triples alongside standard resource triples, so that they too can be dereferenced. We just need to begin doing so.
But I digress. ]
The careful reader may have noticed a couple of earlier implications. Current Linked Data is useful for linking data for given instances from different data sources (say, for combining political, demographic and mapping information for a geographic place like Quebec City) via owl:sameAs. But that predicate is an instance-level relationship that only works for the very same individuals. Our current ability to make external linkages is largely constrained to the instance level.
Moreover, such instance-level links lack context and a conceptual framework for inferencing or determining relatedness between concepts or in relation to other instances. Today’s state-of-the-art is not really about linking “things” (as quoted before) when we establish Linked Data. It is more about linking atomic instances, the members of “things”. We have no current framework for relating things at the concept level.
Put another way, Linked Data presently lacks practical frameworks or mechanisms for linking at the class level.
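A toy illustration of this distinction (the URIs and triples here are illustrative shorthand, not actual published data): an owl:sameAs link only merges two names for the very same individual, whereas shared class membership lets distinct individuals be related.

```python
# Toy triple store: (subject, predicate, object)
triples = {
    # Instance-level link: two URIs for the very same individual.
    ("dbpedia:Quebec_City", "owl:sameAs", "geonames:QuebecCity"),
    # Class-level (type) assertions relating *different* individuals.
    ("dbpedia:Quebec_City", "rdf:type", "umbel:City"),
    ("dbpedia:Boston", "rdf:type", "umbel:City"),
}

def same_individual(a: str, b: str) -> bool:
    """owl:sameAs equates two URIs, but only for one and the same thing."""
    return (a, "owl:sameAs", b) in triples or (b, "owl:sameAs", a) in triples

def related_by_class(a: str, b: str) -> bool:
    """Two distinct instances become comparable through a shared class."""
    classes = lambda x: {o for (s, p, o) in triples
                         if s == x and p == "rdf:type"}
    return bool(classes(a) & classes(b))

print(same_individual("dbpedia:Quebec_City", "geonames:QuebecCity"))  # True
print(same_individual("dbpedia:Quebec_City", "dbpedia:Boston"))       # False
print(related_by_class("dbpedia:Quebec_City", "dbpedia:Boston"))      # True
```

Only the class-level assertion lets us relate Quebec City to Boston at all; sameAs alone can never do this, which is the gap a class framework must fill.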
On the face of it this sounds contradictory to what we know about RDF and how it is designed. The language, its formalisms and indeed many of the popular RDF schema have a rich set of classes. Classes are easy to design and spin off on the fly.
So, the commentary about lack of a framework is NOT about the lack of logic or vocabulary or even schema or ontologies (though there are certainly some gaps there). Rather, it is based on the lack of reference nodes or structures upon which to base those connections. When there is no fixed or defined point in information space, everything floats; there is no framework for connections. There is no grounding.
So, technically, while class-level data connections are not prevented and can be made with Linked Data, to our knowledge few or none presently exist. This is a little-known secret with far-reaching implications.
Just as DBpedia provided the nucleating hub for linking instance data, UMBEL is designed to provide a similar reference node for concepts. These concepts provide some fixed positions in the information space to which other sources can link and relate. And, like references for instance data, the existence of reference concepts can greatly diminish the number of links necessary for an efficient Linked Data environment.
Though the nature of the reference set is important (and we describe UMBEL’s choice of OpenCyc in Part 4 of this series), a more fundamental point is that a reference structure of almost any nature has this value. We can argue later about what is the best reference structure.
But the first task is to just get one in place and begin bootstrapping. Indeed, over time, it is likely that a few reference structures will emerge and compete and get supplemented by still further structures. This evolution is expected and natural and desirable in that it provides choice and options.
A reference structure of concepts has the further benefit of providing a logical reference structure for instance data as well. While a DBpedia (based on Wikipedia) is perhaps the most comprehensive collection of humanity-wide instances, no single source can or will be complete in scope. Thus, we foresee specialty sources ranging from the companies in Wikicompany to plants and animals in the Encyclopedia of Life or thousands of other rich instance sources also acting as reference hubs.
How do each of these rich instance sources relate to one another? What is the subject concept or topical basis by which they overlap or complement? What is the framework and graph structure of knowledge to give this information context?
These roadmaps and signposts are UMBEL’s formal purpose.
Mapping between classes is a much different — and more complicated — matter than mapping instances. As editors we are still grappling with design choices here and are playing with ideas such as confidence metrics to capture the relative accuracy of set matching methods. Later ontology documentation will discuss these designs further.
The rationale for UMBEL and our observations on the state of current Linked Data is not meant to be critical. The community is early in its understanding of how to do Linked Data and scale it. Personally as editors and then on behalf of our company, we have clearly committed to Linked Data as a practice and objective.
In summary, our review of the current state of Linked Data suggests a few concluding observations.
Reference sets are a real key, both for instances and concepts. Using them by no means implies centrality or a loss of the distributed advantages of the Web. DBpedia (Wikipedia) has not had this effect for instances and UMBEL will not do so for concepts.
Nor does the use of reference sets imply the need to reach some global consensus or to close out any alternatives. Reference hubs and choice and freedom are not in conflict. Placing data in context will show clear advantages over data absent context. The argument will be settled as simply as that.
Now that Linked Data has put forward the recipes and mechanisms for opening up and sharing data on the Web, it is time to take the initiative to the next level by providing the contextual signposts and roadmaps for those linkages.
I am pleased to announce a new collaboration on Sweet Tools due to outreach by Matthias Samwald and Andreas Blumauer of the Semantic Web Company. Their outreach was timely: the listing itself was growing to a point of not being easily maintained by a single individual, and the collaboration reduces possible conflicts of interest arising from my new position with Zitgist.
This listing now has 693 tools, an increase of 43 tools (or 7 percent) since the last update this past November. As always, the listing in Exhibit and other formats may be found at its permanent page; or now from a direct link at the Semantic Web Company. Like prior versions, the listing is also provided as:
Please note this will be the last time I provide a simple table complement; it is proving too difficult to update.
Prior listings and statistics may be found at:
Matthias has brought new ideas and energy to the idea of a comprehensive tools listing for the semantic Web. The SWEO group of the W3C has maintained a valuable listing on its ESW wiki, and perhaps that should be the longer-term home for this listing as well. But the tenure of the SWEO is uncertain and its wiki format as presently configured does not provide the faceted browsing and filtering offered by the Exhibit system used by Sweet Tools.
You should anticipate updates over time from either here or SWC. While we like the Exhibit display and its flexibility, it is also not the easiest format to facilitate contributions, despite the fact this Sweet Tools instance is hosted as a Google spreadsheet.
One of the great things from my standpoint is that the Semantic Web Company has education and outreach as its mandate. I think we will see some cool innovations head our way about how to make this all more seamless. And, from my standpoint at Zitgist, I also commit to make it easier to expose this information as Linked Data.
Personally, my initial objectives to see a comprehensive listing and to learn much by assembling one have been accomplished. As the space grows and tools and needs become more varied and sophisticated, any comprehensive listing either requires more time devoted or more collaborators (and likely both!). Both this blog and the Semantic Web Company will be announcing on a periodic basis new mechanisms and avenues to extend this collaboration.
In sharing the baton, I listed out the basic steps I had been following for Matthias’ use. Let us know if you want to see these steps. But, again, as noted, we hope to make this whole update and contribution procedure much easier for other contributors to follow.
UMBEL (Upper-level Mapping and Binding Exchange Layer) is a lightweight ontology for relating Web content and data to a standard set of subject concepts. It is being designed to apply to all types of data on the Web, from RSS and Atom feeds and tagging to microformats, topic maps, RDF and OWL (among others). The project Web site is at http://www.umbel.org.
UMBEL was first announced in July 2007 and has been a direct subject of these prior posts:
However, much internal development and refinement has occurred, especially in the past few months. Over the next few days, this posting, a re-introduction to UMBEL, will be followed by these additional parts:
These articles are lead-ins to the discussion of the actual UMBEL ontology that will soon follow.
UMBEL has two purposes: 1) to provide a lightweight structure of subject concepts as a reference to what Web content or data “is about” (what SKOS calls a concept scheme); and 2) to define a variety of binding protocols for different Web data formats to map to this “backbone.” The project’s immediate priority is to first create this reference backbone. That is the focus of these current postings.
Think of the backbone as a set of roadsigns to help find related content. UMBEL is like a map of an interstate highway system, a way of getting from one big place to another. Once in the right vicinity, other maps (or ontologies), more akin to detailed street maps, are then necessary to get to specific locations or street addresses.
By definition, these more fine-grained maps are beyond UMBEL’s scope. But UMBEL can help provide the context for placing such detailed maps in relation to one another and in relation to the Big Picture of what related content is about.
These subject concepts also provide the mapping points for the many, many thousands (indeed, millions) of specific named entities that are the notable instances of these subject concepts. Examples might include the names of specific physicists, cities in a country, or a listing of financial stock exchanges. UMBEL mappings enable us to link a given named entity to the various subject classes of which it is a member.
And, because of relationships amongst subject concepts in the backbone, we can also relate that entity to other related entities and concepts. The UMBEL backbone traces the major pathways through the content graph of the Web. For some visualizations of this subject graph, see So, What Might The Web’s Subject Backbone Look Like?
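A minimal sketch of this idea (all concept and entity names here are illustrative, not actual UMBEL terms): once named entities are mapped to subject concepts, relatedness among entities and concepts falls out of the concept structure itself.

```python
# Toy mapping of named entities (instances) to subject concepts (classes).
entity_to_concepts = {
    "Albert Einstein": {"Physicist"},
    "Richard Feynman": {"Physicist"},
    "Quebec City": {"City"},
}

# Relationships among subject concepts in the backbone (child -> broader).
broader = {
    "Physicist": "Scientist",
    "Scientist": "Person",
    "City": "GeographicRegion",
}

def concept_lineage(concept: str) -> list:
    """Walk a concept up through its broader concepts in the backbone."""
    lineage = [concept]
    while lineage[-1] in broader:
        lineage.append(broader[lineage[-1]])
    return lineage

def related_entities(entity: str) -> set:
    """Other named entities that share at least one subject concept."""
    concepts = entity_to_concepts[entity]
    return {e for e, cs in entity_to_concepts.items()
            if e != entity and cs & concepts}

print(concept_lineage("Physicist"))         # ['Physicist', 'Scientist', 'Person']
print(related_entities("Albert Einstein"))  # {'Richard Feynman'}
```

The backbone does the navigational work: two entities never directly linked become related through their shared subject concept, and each concept inherits context from its broader concepts.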
Today, the actual linkages in Linked Data, the first meaningful expression of the semantic Web, occur almost exclusively via direct sameAs relationships between instances. An easy way to think of one of these notable instances is as the topic of a specific article in Wikipedia. People, places, important historical events, and so forth are all examples of such named entities.
Current Linked Data is therefore useful for linking data for given instances from different data sources (say, for combining political, demographic and mapping information for a geographic place like Quebec City). But, these instance-level links lack context and a conceptual framework for inferencing or determining relatedness between concepts or in relation to other instances. For these purposes, Linked Data needs a class structure (Part 2).
As noted, UMBEL’s class structure is based on subject concepts, which are a distinct subset of the more broadly understood concept as used in the SKOS RDF vocabulary, conceptual graphs, formal concept analysis or the very general concepts common to many upper ontologies. We define subject concepts as a special kind of concept: namely, ones that are concrete, subject-related and non-abstract.
UMBEL contrasts its subject concepts with abstract concepts and with named entities. Abstract concepts represent abstract or ephemeral notions such as truth, beauty, evil or justice, or are thought constructs useful to organizing or categorizing things but are not readily seen in the experiential world. Named entities are the real things or instances in the world that are themselves natural and notable instances (members) of subject concepts (classes). More detailed discussion of this terminology is provided in Part 3.
UMBEL thus sets for itself objectives that include an identification of subject concepts and their relationships; a premise of emphasizing representational concepts over unattainable precision or exactitude; and a means for relating any notable thing of the world to that structure. Moreover, meeting these objectives should be based on best systems and practices, informed where possible by social acceptance and consensus.
W-O-W-Y is the shorthand we apply to the semantic framework for meeting these UMBEL objectives. W-O-W-Y is derived from the constituent UMBEL building blocks of WordNet (W), OpenCyc (O), Wikipedia (W) and YAGO (Y). Each resource contributes in a different way.
Via the WOWY framework, OpenCyc provides the basis for the reference subject backbone (Part 4), WordNet (supplemented by others) provides the “synsets” for relating natural language nouns and phrases to these concepts, and Wikipedia as processed by YAGO (among a growing list of other resources) provides the starting dictionary of relevant named entities important to the Web public.
The initial UMBEL ontology contains roughly 21,000 subject concepts distilled from OpenCyc and vetted for their relational structure. A further 1.5 million named entities are also currently mapped to that structure.
Coincident with the pending release of the draft UMBEL ontology of these subject concepts will be a multi-volume technical report that details the exact mapping and vetting procedures used.
I remember in some of my first jobs in the restaurant industry how surprised I was at the depth of feeling employees could hold toward one another. Waiters screaming at other waiters; kitchen staff dismissive of out-front staff; everyone sharpening knives for pompous managers; and the like.
Strangely, this past week had many similar flashbacks for me.
If you have been around a bit (not necessarily all the way back to the Tulip frenzy in Holland), you have seen hype and frenzy screw things up. This whole idea of the “last fool” is pretty disgusting and has a real effect on real people. Speculators pushing up house prices 20% per year in Vegas and Miami are only the latest and most extreme examples.
Tim Berners-Lee does not blog frequently, but, when he does, it always seems to be at a moment of import.
In his post tonight, I think he is with grace trying to say some things to us. He talks about buzz and hype; he tries to put silly notions about “killer apps” into the background, he emphasizes the real challenge of how a democratized knowledge environment needs to find new measures of trust, and again he talks about the importance of data and linkages.
The real stimulus, I sense, is that in the current frenzy about “semantic Web” stuff his real points are being misunderstood and misquoted.
In all this Semantic Web news, though, the proof of the pudding is in the eating. The benefit of the Semantic Web is that data may be re-used in ways unexpected by the original publisher. That is the value added. So when a Semantic Web start-up either feeds data to others who reuse it in interesting ways, or itself uses data produced by others, then we start to see the value of each bit increased through the network effect.
So if you are a VC funder or a journalist and some project is being sold to you as a Semantic Web project, ask how it gets extra re-use of data, by people who would not normally have access to it, or in ways for which it was not originally designed. Does it use standards? Is it available in RDF? Is there a SPARQL server?
For those of us who have been there before, we fear the hype and cynicism that brought us the dot-com era.
If you feel that you are truly part of a historical transition point — as I and those who have been laborers in the garden of the semantic Web do — then we sense this same potential for an important effort to be hijacked.
The smell of money is in the air; the hype machine is in full swing; VCs are breathy and attentive. Podcasts and blogs are full of baloney. Media excesses are now observable.
One perspective might say that all of this is natural. We are now in the midst of some expected phase of Geoffrey Moore’s chasm or some other predicted evolution of technology hype and development. But, let me ask this: how many times must we play the greater fool in order to be a lesser one? We’ve seen this before, and the promise of the semantic Web to do more deserves more.
I wish I had Tim Berners-Lee’s grace; I do not. But, all of us can look around and gain perspective. And, that perspective is: Look for substance and value. Everything else, grab on to your wallet.
I am absolutely thrilled to be joining Zitgist LLC as its new CEO. This courtship has been a while in the making, and some of you knew it was in progress though it did not make sense to talk about it until now. I’ve had the chance to work with exceptional people throughout my career. None — and this is saying a lot — has matched the Zitgist people and this opportunity.
Later posts will discuss future directions and prospects. For now, forgive me for taking a more personal tone to explain what captured my attention with this impressive, young company.
Since my college and grad student days as a plant systematist and population geneticist, I have been fascinated with the structure of information and how it is organized (first in the biological world and then through human culture). I think it is fair to say that my early jobs and the companies I have founded have been dedicated to this passion. I feel especially fortunate to have been an early participant in the growth and development of the Web.
I took a sabbatical a bit over a year ago to research and contemplate the idea of the ‘Semantic Web’, Tim Berners-Lee’s vision of the Web as a medium for data integration and usefulness. While the advent of the early Web had been amazing, its next use as a means for global information interoperability represents a real fulcrum in human history. The semantic Web vision resonates because the power of the Internet and its global reach are clear, and because the accumulation and management of cultural information is the singular basis for the economic wealth and uniqueness of humanity.
If our present reality were only one of access to mind-boggling amounts of information and the ability to manage it, that would be exciting enough. But, with these changes, have also come fundamental changes in the nature of the commercial enterprise, what is its value, and the role of business and social organizations to create future wealth. We are truly in an era of open source, open standards and open data. The abiding aspect of our new era is interoperability and sharing, not closed and proprietary systems.
Many have noted the cultural and social impacts of Gutenberg’s printing press in areas such as the Reformation, Enlightenment, and emergence of secularism and social change. I have no doubt future historians will look to our current era as a similar breakpoint in human development.
My sabbatical was thus not only to learn about this new technological era, but also to think hard about what might be the business models and commercial organizational structures of the future. Over this period I kissed many frogs.
I found Zitgist to be unique from the standpoints of people, organization, culture and technology. There will be time to elaborate more in future posts regarding Zitgist’s prospects and advantages. For now let me simply say the company has a business model responsive to today’s imperatives and the chops to pull it off.
We’re entering a time of few precedents. I wish I could say with 100% certainty that Zitgist has the secret sauce that ensures success. I cannot. But, I can definitely say that Zitgist has a viewpoint, plan, and unique technology and data perspective that looks pretty darn good. (I’m also pleased to announce a major update today to our Zitgist Web site to reflect some of these prospects. Of course, there is more to come.)
Semantic Web ventures have a real challenge in figuring out how to monetize value while remaining true to the core principles of openness and collaboration. (And, oh, by the way, also to keep it simple.) Zitgist, I believe, brings a winning perspective to these challenges.
The best way I know to manage uncertainty is to work with great people. One super thing about the semantic Web is its community of smart, dedicated people. Within that group, Zitgist and its people stand out. Three deserve mention by name.
My first attraction to Zitgist came through its chief technology officer, Frédérick Giasson. I have had the great fortune to have worked with a few natural programmers in my career. Fred certainly is a member of that rarefied group. But more unusual and attractive from my viewpoint is Fred’s clear vision and pragmatism.
Going back to his original forays with Talk Digger and Ping the Semantic Web, Fred has looked to clever ways to combine available constructs into practical tools. He co-founded Zitgist about 18 months ago to bring that same pragmatic view to what we are now calling Linked Data. Fred has also clearly understood that the expansion of the semantic Web market depends on simple and direct user interfaces and hiding the technicalities of RDF and other details in the background.
Independent of Zitgist, Fred and I had already been collaborating as co-editors of the UMBEL lightweight subject concept ontology (Fred is also an editor of three other ontologies). When Fred separately showed me zLinks as a WordPress plug-in, even though only a proof-of-concept, the latent power of turning any existing hyperlink into a portal of Linked Data and relevant content literally blew me away.
Our discussions thus broadened last September and were the fuel that led to my joining Zitgist. I confidently predict Fred will emerge as one of the leading voices in the next generation of semantic Web innovators.
Going back more than 10 years, Kingsley and his team have been building Virtuoso, a universal platform for hosting, managing and converting data. It is uncanny how nearly perfectly this suite of existing capabilities presaged the semantic Web. It is also remarkable — and little appreciated today — that Virtuoso’s Sponger, RDF Views for RDB data, ODS and complete “data space” integration presage the semantic Web of ten years hence.
Kingsley’s singular vision has driven this development, backed by the technical excellence of Orri Erling and the other nearly 50 members of the OpenLink team. Moreover, this team understands scalability, distributed architectures and virtual (“cloud”) computing. These capabilities provide both a solid foundation and a deep reservoir for Zitgist to draw upon as its services scale to meet the demands of the full Web.
Kingsley has been able to translate this internal vision to a shared one in the broader semantic Web community. He has been a key leader of the Linked Data initiative and has tirelessly networked and advocated within the community. After his long labors in the garden, it is exciting to see the bountiful harvest now come to fruition. It is also gratifying to see Kingsley get the growing recognition he so richly deserves.
How can one not choose to work with such a great team?
Ever since I began starting my own companies, I have thought I could never join a venture that was not of my own creation. Vision is a tricky thing and so very hard to get right. But the synergy arising from this interaction points to a new model of virtual ventures drawing upon a global pool of like-minded collaborators.
It is a once-in-a-lifetime opportunity to contribute to Zitgist and the vibrant Linked Data movement of which it is a part. Thanks to all of you for this opportunity. Now . . . it is time to roll up the sleeves and help make a new era happen.