There has been much recent excitement surrounding RDF, its linked data, “RDFizers” and GRDDL to convert existing structured information to RDF. The belief is that 2007 is the breakout year for the semantic Web.
The why of the semantic Web is now clear. The how of the semantic Web is now appearing clear, built around RDF as the canonical data model and the availability and maturation of a growing set of tools. And the what is also becoming clear, with the massive new datastores of DBpedia, Wikipedia3, the HCLS demo, Musicbrainz, Freebase, and the Encyclopedia of Life only being some of the most recent and visible exemplars.
Yet the where aspect seems to be largely missing.
By where I mean: Where do we look for our objects or data? If we have new objects or data, where do we plug into the system? Amongst all possible information and domains, where do we fit?
These questions seem simplistic or elemental in their basic nature. It is almost incomprehensible to wonder where all of this data now emerging in RDF relates to each other — what the overall frame of reference is — but my investigations seem to point to such a gap. In other words, a key piece of the emerging semantic Web infrastructure — where is this stuff — seems to be missing. This gap is across domains and across standard ontologies.
What are the specific components of this missing where, this missing infrastructure? I believe them to be:
I discuss these missing components in a bit more detail below, concluding with some preliminary thoughts on how the problem of this critical infrastructure can be redressed. The good news, I believe, is that these potholes on the road to the semantic Web can be relatively easily and quickly filled.
I think it is fair to say that structure on the current Web is a jumbled mess.
As my recent Intrepid Guide to Ontologies pointed out, there are at least 40 different approaches (or types of ontologies, loosely defined) extant on the Web for organizing information. These approaches embrace every conceivable domain and subject. The individual data sets using these approaches span many, many orders of magnitude in size and range of scope. Diversity and chaos we have aplenty, as the illustrative diagram of this jumbled structural mess shows below.
Mind you, we are not yet even talking about whether one dot is equivalent or can be related to another dot and in what way (namely, connecting the dots via real semantics), but rather at a more fundamental level. Does one entire data set have a relation to any other data set?
Unfortunately, the essential precondition of getting data into the canonical RDF data model — a challenge in its own right — does little to guide us as to where these data sets exist or how they may relate. Even in RDF form, all of this wonderful RDF data exists as isolated and independent data sets, bouncing off of one another in some gross parody of Brownian motion.
What this means, of course, is that useful data that could be of benefit is overlooked or not known. As with problems of data silos everywhere, that blindness leads to unnecessary waste, incomplete analysis, inadequate understanding, and duplicated effort .
These gaps were easy to overlook when the focus of attention was on the why, what and how of the semantic Web. But, now that we are seeing RDF data sets emerge in meaningful numbers, the time is ripe to install the road signs and print up the maps. It is time to figure out where we want to go.
As I discussed in an earlier posting, there’s not yet enough backbone to the structured Web. I believe this structure should firstly be built around a lightweight subject- or topic-oriented reference layer.
Unlike traditional upper-level ontologies (see the Intrepid Guide), this backbone is not meant to be comprised of abstract concepts or a logical completeness of the “nature of knowledge”. Rather, it is meant to be only the thinnest veneer of (mostly) hierarchically organized subjects and topic references (see more below).
This subject or topic vocabulary (at least for the backbone) is meant to be quite small, likely more than a few hundred reference subjects, but likely less than many thousands. (There may be considerably more terms in the overall controlled vocabulary to assist context and disambiguation.)
This “umbrella” subject structure could be thought of as the reference subject “super-structure” to which other specific ontologies could place themselves in a sort of locational or “info-spatial” context.
One way to think of these subject reference nodes is as the major destinations — the key cities, locations or interchanges — on the broader structured Web highway system. A properly constructed subject structure could also help disambiguate many common misplacements by virtue of the context of actual subject mappings.
For example, an ambiguous term such as “driver” becomes unambiguous once it is properly mapped to one of its possible topics such as golf, printers, automobiles, screws, NASCAR, or whatever. In this manner, context is also provided for other terms in that contributing domain. (For example, we would now know how to disambiguate “cart” as a term for that domain.)
A high-level and lightweight subject mapping layer does not warrant difficult (and potentially contentious) specificity. The point is not to comprehensively define the scope of all knowledge, but to provide the fewest choices necessary for what subject or subjects a given domain ontology may appropriately reference. We want a listing of the major destinations, not every town and parish in existence.
(That is not to say that more specific subject references won’t emerge or be appropriate for specific domains. Indeed, the hope is that an “umbrella” reference subject structure might be a tie-in point for such specific maps. The more salient issue addressed here is to create such an “umbrella” backbone in the first place.)
This subject reference “super-structure” would in no way impose any limits on what a specific community might do itself with respect to its own ontology scope, definition, format, schema or approach. Moreover, there would be no limit to a community mapping its ontology to multiple subject references (or “destinations”, if you will).
The reason for this high-level subject structure, then, is simply to provide a reference map for where we might want to go — no more, no less. Such a reference structure would greatly aid finding, viewing and querying actual content ontologies — of whatever scope and approach — wherever that content may exist on the Web.
This is not a new idea. About the year 2000 the topic map community was active with published subject indicators (PSIs)  and other attempts at topic or subject landmarks. For example, that report stated:
From that earlier community, Bernard Vatant has subsequently spoken of the need and use of “hubjects” as organizing and binding points, as has Jack Park and Patrick Durusau using the related concept of “subject maps” . An effort that has some overlap with a subject structure is also the Metadata Registry being maintained by the National Science Digital Library (NSDL).
However, while these efforts support the idea of subjects as partial binding or mapping targets, none of them actually proposed a reference subject structure. Actual subject structures may be a bit of a “third rail” in ontology topics due to the historical artifact of wanting to avoid the pitfalls of older library classification systems such as the Dewey Decimal Classification or the Library of Congress Subject Headings.
Be that as it may. I now think the timing is right for us to close this subject gap.
This mapping layer lends itself to a three-tiered general conceptual model. The first tier is the subject structure, the conceptualization embracing all possible subject content. This referential layer is the lookup point that provides guidance for where to search and find “stuff.”
The second layer is the representation layer, made up of informal to “formal” ontologies. Depending on the formalism, the ontology provides more or less understanding about the subject matter it represents, but at minimum binds to the major subject concepts in the top subject mapping layer.
The third layer are the data sets and “data spaces”  that provide the actual content instantiations of these subjects and their ontology representations. This data space layer is the actual source for getting the target information.
Here is a diagram of this general conceptual model:
The layers in this general conceptual model progress from the more abstract and conceptual at the upper level, useful for directing where traffic needs to go, to concrete information and data at the lower level, the real object of manipulation and analysis.
The data spaces and ontologies of various formalisms in the lower two tiers exist in part today. The upper mapping layer does not.
By its nature the general upper mapping layer, in its role as a universal reference “backbone,” must be somewhat general in the scope of its reference subjects. As this mapping infrastructure begins to be fleshed out, it is also therefore likely that additional intermediate mapping layers will emerge for specific domains, which will have more specific scopes and terminology for more accurate and complete understanding with their contributing specific data spaces.
So, let’s beg the question for a moment of what an actual reference subject structure might be. Let’s assume one exists. What should we do with this structure? How should we bind to it?
How this might look in a representative subject binding to two candidate ontology data sets is shown below:
This diagram shows two contributing data sets and their respective ontologies (no matter how “formal”) binding to a given “subject.” This subject proxy may not be the only one bound by a given data set and its ontology. Also note the “subject” is completely arbitrary and, in fact, is only a proxy for the binding to the potential topic.
Clearly, this approach does not provide a powerful inferential structure.
But, what it does provide is a quite powerful organizational structure. Access and the where for the sake of simplicity and adoption are given preference over inferential elegance. Thus, assuming now, again, a subject structure backbone, matched with these principles and the same jumbled structures as first noted above, we can now see an organizational order emerge from the chaos:
The assumptions to get to this point are not heroic. Simple binding mechanisms matched with a high-level subject “backbone” are all that is required mechanically for such an approach to emerge. All we have done to achieve this road map is to follow the trails above.
So, there is nothing really revolutionary in any of the discussion to this point. Indeed, many have cited the importance of reference structures previously. Why hasn’t such a subject structure yet been developed?
One explanation is that no one accepts any global “external authority” for such subject identifications and organization. The very nature of the Web is participatory and democratic (with a small “d”). Everyone’s voice is equal, and any structure that suggests otherwise will not be accepted.
It is not unusual, therefore, that some of the most accepted venues on the Web are the ones where everyone has an equal chance to participate and contribute: Wikipedia, Flickr, eBay, Amazon, Facebook, YouTube, etc. Indeed, figuring out what generates such self-sustaining “magic” is the focus of many wannabe ventures.
While that is not our purpose here, our purpose is to set the preconditions for what would constitute a referential subject structure that can achieve broad acceptance on the Web. And in the paragraph above, we already have insight into the answer: Build a subject structure from already accepted sources rich in subject content.
A suitable subject structure must be adaptable and self-defining. These criteria reflect expressions of actual social usage and practice, which of course changes over time as knowledge increases and technologies evolve.
One obvious foundation to building a subject structure is thus Wikipedia. That is because the starting basis of Wikipedia information has been built entirely from the bottom up — namely, what is a deserving topic. This has served Wikipedia and the world extremely well, with now nearly 1.8 million articles online in English alone (versions exist for about 100 different languages) . There is also a wealth of internal structure within Wikipedia’s “infobox” templates, structure that has been utilized by DBpedia (among others) to actually transform Wikipedia into an RDF database (as I described in an earlier article). As socially-driven and -evolving, I foresee Wikipedia to continue to be the substantive core at the center of a knowledge organizational framework for some time to come.
But Wikipedia was never designed with an organizing, high-level subject structure in mind. For my arguments herein, creating such an organizing (yes, in part, hierarchical) structure is pivotal.
One innovative approach to provide a more hierarchical structural underpinning to Wikipedia has been YAGO (“yet another great ontology”), an effort from the Max-Planck-Institute SaarbrÃ¼cken . YAGO matches key nouns between Wikipedia and WordNet, and then uses WordNet’s well-defined taxonomy of synsets to superimpose the hierarchical class structure. The match is more than 95% accurate; YAGO is also designed for extensibility with other quality data sets.
I believe YAGO or similar efforts show how the foundational basis of Wikipedia can be supplemented with other accepted lexicons to derive a suitable subject structure with appropriate high-level “binding” attributes. In any case, however constructed, I believe that a high-level reference subject structure must evolve from the global community of practice, as has Wikipedia and WordNet.
I have previously described this formula as W + W + S + ? (for Wikipedia + WordNet + SKOS + other?). There indeed may need to be “other” contributing sources to construct this high-level reference subject structure. Other such potential data sets could be analyzed for subject hierarchies and relationships using fairly well accepted ontology learning methods. Additional techniques will also be necessary for multiple language versions. Those are important details to be discussed and worked out.
The real point, however, is that existing and accepted information systems already exist on the Web that can inform and guide the construction of a high-level subject map. As the contributing sources evolve over time, so could periodic updates and new versions of this subject structure be generated.
Though the choice of the contributing data sets from which this subject structure could be built will never be unanimous, using sources that have already been largely selected through survival of the “fittest” by large portions of the Web-using public will go a long ways to establishing authoritativeness. Moreover, since the subject structure is only intended as a lightweight reference structure — and not a complete closed-world definition — we are also setting realistic thresholds for acceptance.
The specific topic of this piece has been on a subject reference mapping and binding layer that is lightweight, extensible, and reflects current societal practice (broadly defined). In the discussion, there has been recognition of existing schema in such areas as people (FOAF), projects (DOAP), communities (SIOC) and geographical places (Geonames) that might also contribute to the overall binding structure. There very well may need to be some additional expansions in other dimensions such as time and events, organizations, products or whatever. I hope that a consensus view on appropriate high-level dimensions emerges soon.
There are a number of individuals presently working on a draft proposal for an open process to create this subject structure. What we are working quickly to draft and share with the broader community is a proposal related to:
We believe this to be an exciting and worthwhile endeavor. Prior to the unveiling of our public Web site and project, I encourage any of you with interest in helping to further this cause to contact me directly at mike at mkbergman dot com .
 I attempted to quantify this problem in a white paper from about two years ago, Untapped Assets: The $3 Trillion Value of U.S. Enterprise Documents, BrightPlanet Corporation, 42 pp., July 20, 2005. Some reasons for how such waste occurs were documented in a four-part series on this AI3 blog, Why Are $800 Billion in Document Assets Wasted Annually?, beginning in October 2005 through parts two, three and four concluding in November 2005.
 The concept of “data spaces” has been well-articulated by Kingsley Idehen of OpenLink Software and Frédérick Giasson of Zitgist LLC. A “data space” can be personal, collective or topical, and is a virtual “container” for related information irrespective of storage location, schema or structure.
 See Wikipedia’s statistics sections.
 Fabian M. Suchanek, Gjergji Kasneci and Gerhard Weikum, “Yago – A Core of Semantic Knowledge” (also in bib or ppt). Presented at the 16th International World Wide Web Conference (WWW 2007) in Banff, Alberta, on May 8-12, 2007. YAGO contains over 900,000 entities (like persons, organizations, cities, etc.) and 6 million facts about these entities, organized under a hierarchical schema. YAGO is available for download (400Mb) and converters are available for XML, RDFS, MySQL, Oracle and Postgres. The YAGO data set may also be queried directly online.
 I’d especially like to thank Frédérick Giasson and Bernard Vatant of Mondeca for their reviews of a draft of this posting. Fred was also instrumental in suggesting the ideas behind the figure on the general conceptual model.