In the first part of this series we argued for the importance of reference structures to provide the structures and vocabularies to guide interoperability on the semantic Web. The argument was made that these reference structures are akin to human languages, requiring a sufficient richness and terminology to enable nuanced and meaningful communications of data across the Web and within the context of their applicable domains.
While the idea of such reference structures is great — and perhaps even intuitive when likened to human languages — the question is begged as to what is the basis for such structures? Just as in human languages we have dictionaries, thesauri, grammar and style books or encyclopedia, what are the analogous reference sources for the semantic Web?
In this piece, we tackle these questions from the perspective of the entire Web. Similar challenges and approaches occur, of course, for virtually every domain and specific community. But, by focusing on the entirety of the Web, perhaps we can discern the grain of sand at the center of the pearl.
The idea of bootstrapping is common in computers, compilers or programming. Every computer action needs to start from a basic set of instructions from which further instructions or actions are derived. Even starting up a computer (“booting up”) reflects this bootstrapping basis. Bootstrapping is the answer to the classic chicken-or-egg dilemma by embedding a starting set of instructions that provides the premise at start up . The embedded operand for simple addition, for example, is the basis for building up more complete mathematical operations.
So, what is the grain of sand at the core of the semantic Web that enables it to bootstrap meaning? We start with the basic semantics and “instructions” in the core RDF, RDFS and OWL languages. These are very much akin to the basic BIOS instructions for computer boot up or the instruction sets leveraged by compilers. But, where do we go from there? What is the analog to the compiler or the operating system that gives us more than these simple start up instructions? In a semantics sense, what are the vocabularies or languages that enable us to understand more things, connect more things, relate more things?
To date, the semantic Web has given us perhaps a few dozen commonly used vocabularies, most of which are quite limited and simple pidgin languages such as DC, FOAF, SKOS, SIOC, BIBO, etc. We also have an emerging catalog of “things” and concepts from Wikipedia (via DBpedia) and similar. (Recall, in this piece, we are trying to look Web-wide, so the many fine building blocks for domain purposes such as found in biology, medicine, finance, astronomy, etc., are excluded.) The purposes and scope of these vocabularies widely differ and attack quite different slices of the information space. SKOS, for example, deals with describing simple knowledge structures like taxonomies or thesauri; SIOC is for describing social media.
By virtue of adoption, each of these core languages has proved its usefulness and role. But, as skew lines in space, how do these vocabularies relate to one another? And, how can all of the specific domain vocabularies also relate to those and one another where there are points of intersection or overlap? In short, after we get beyond the starting instructions for the semantic Web, what is our language and vocabulary? How do we complete the bootstrap process?
Clearly, like human languages, we need rich enough vocabularies to describe the things in our world and a structure of the relationships amongst those things to give our communications meaning and coherence. That is precisely the role provided by reference structures.
To prevent reference structures from being rubber rulers, some fixity or grounding needs to establish the common understanding for its referents. Such fixed references are often called ‘gold standards‘. In money, of course, this used to be a fixed weight of gold, until that basis was abandoned in the 1970s. In the metric system, there are a variety of fixed weights and measures that are employed. In the English language, the Oxford English Dictionary (OED) is the accepted basis for the lexicon. And so on.
Yet, as these examples show, none of these gold standards is absolute. Money now floats; multiple systems of measurement compete; a variety of dictionaries are used for English; most languages have their own reference sets; etc. The key point in all gold standards, however, is that there is wide acceptance for a defined reference for determining alignments and arbitrating differences.
Gold standards or reference standards play the role of referees or arbiters. What is the meaning of this? What is the definition of that? How can we tell the difference between this and that? What is the common way to refer to some thing?
Let’s provide one example in a semantic Web context. Let’s say we have a dataset and its schema A that we are aligning with another dataset with schema B. If I say two concepts align exactly across these datasets and you say differently, how do we resolve this difference? On one extreme, each of us can say our own interpretation is correct, and to heck with the other. On the other extreme, we can say both interpretations are correct, in which case both assertions are meaningless. Perhaps papering over these extremes is OK when only two competing views are in play, but what happens when real problems with many actors are at stake? Shall we propose majority rule, chaos, or the strongest prevails?
These same types of questions have governed human interaction from time immemorial. One of the reasons to liken the problem of operability on the semantic Web to human languages, as argued in Part I, is to seek lessons and guidance for how our languages have evolved. The importance of finding common ground in our syntax and vocabularies — and, also, critically, in how we accept changes to those — is the basis for communication. Each of these understandings needs to be codified and documented so that they can be referenced, and so that we can have some confidence of what the heck it is we are trying to convey.
For reference structures to play their role in plugging this gap — that is, to be much more than rubber rulers — they need to have such grounding. Naturally, these groundings may themselves change with new information or learning inherent to the process of human understanding, but they still should retain their character as references. Grounded references for these things — ‘gold standards’ — are key to this consensual process of communicating (interoperating).
The need for gold standards for the semantic Web is particularly acute. First, by definition, the scope of the semantic Web is all things and all concepts and all entities. Second, because it embraces human knowledge, it also embraces all human languages with the nuances and varieties thereof. There is an immense gulf in referenceability from the starting languages of the semantic Web in RDF, RDFS and OWL to this full scope. This gulf is chiefly one of vocabulary (or lack thereof). We know how to construct our grammars, but we have few words with understood relationships between them to put in the slots.
The types of gold standards useful to the semantic Web are similar to those useful to our analogy of human languages. We need guidance on structure (syntax and grammar), plus reference vocabularies that encompass the scope of the semantic Web (that is, everything). Like human languages, the vocabulary references should have analogs to dictionaries, thesauri and encyclopedias. We want our references to deal with the specific demands of the semantic Web in capturing the lexical basis of human languages and the connectedness (or not) of things. We also want bases by which all of this information can be related to different human languages.
To capture these criteria, then, I submit we should consider a basic starting set of gold standards:
Each of these potential gold standards is next discussed in turn. The majority of discussion centers on Wikipedia and UMBEL.
Naturally, the first suggested gold standard for the semantic Web are the RDF/RDFS/OWL language components. Other writings have covered their uses and roles . In relation to their use as a gold standard, two documents, one on RDF semantics  and the other an OWL  primer, are two great starting points. Since these languages are now in place and are accepted bases of the semantic Web, we will concentrate on the remaining members of the standard reference set.
The second suggested gold standard for the semantic Web is Wikipedia, principally as a sort of canonical vocabulary base or lexicon, but also for some structural aspects. Wikipedia now contains about 3.5 million English articles, by far larger than any other knowledge base, and has more than 250 language versions. Each Wikipedia article acts as more or less a reference for the thing it represents. In addition, the size, scope and structure of Wikipedia make it an unprecedented resource for researchers engaged in natural language processing (NLP), information extraction (IE) and semantic Web-related tasks.
For some time I have been maintaining a listing called SWEETpedia of academic and research articles focused on the use of Wikipedia for these tasks. The latest version tracks some 250 articles , which I guess to be about one half or more of all such research extant. This research shows a broad variety of potential roles and contributions from Wikipedia as a gold standard for the semantic Web, some of which is detailed in the tables below.
An excellent report by Olena Medelyan et al. from the University of Waikato in New Zealand, Mining Meaning from Wikipedia, organized this research up through 2008 and provided detailed commentary and analysis of the role of Wikipedia . They noted, for example, that Wikipedia has potential use as an encyclopedia (its intended use), a corpus for testing and modeling NLP tasks, as a thesaurus, a database, an ontology or a network structure. The Intelligent Wikipedia project from the University of Washington has also done much innovative work on “automatically learned systems [that] can render much of Wikipedia into high-quality semantic data, which provides a solid base to bootstrap toward the general Web” .
However, as we proceed through the next discussions, we’ll see that the weakest aspect of Wikipedia is its category structure. Thus, while Wikipedia is unparalleled as the gold standard for a reference vocabulary for the Web, and has other structural uses as well, we will need to look elsewhere for how that content is organized.
Many groups have recognized these advantages for Wikipedia, and have built knowledge bases around it. Also, many of these groups have also recognized the category (schema) weaknesses in Wikipedia and have proposed alternatives. Some of these major initiatives, which also collectively represent a large number of the research articles in SWEETpedia, include:
|DBpedia||Wikipedia Infoboxes||excellent source for URI identifiers; structure extraction basis used by many other projects|
|Freebase||User Generated||schema are for domains based on types and properties; at one time had a key dependence on Wikipedia; has since grown much from user-generated data and structure; now owned by Google|
|Intelligent Wikipedia||Wikipedia Infoboxes||a broad program and a general set of extractors for obtaining structure and relationships from Wikipedia; was formerly known as KOG; from Univ of Washington|
|SIGWP||Wikipedia Ontology||the Special Interest Group of Wikipedia (Research or Mining); a general group doing research on Wikipedia structure and mining; schema basis is mostly from a thesaurus; group has not published in two years|
|UMBEL||UMBEL Reference Concepts||RefConcepts based on the Cyc knowledge base; provides a tested, coherent concept schema, but one with gaps regarding Wikipedia content; has 28,000 concepts mapped to Wikipedia|
|WikiNet||Extracted Wikipedia Ontology||part of a long-standing structure extraction effort from Wikipedia leading to an ontology; formerly known as WikiRelate; from the Heidelberg Institute for Theoretical Studies (HITS)|
|Wikipedia Miner||N/A||generalized structure extractor; part of a wider basis of Wikipedia research at the Univ of Waikato in New Zealand|
|Wikitology||Wikipedia Ontology||general RDF and ontology-oriented project utilizing Wikipedia; effort now concluded; from the Ebiquity Group at the Univ of Maryland|
|YAGO||WordNet||maps Wordnet to Wikipedia, with structured extraction of relations for characterizing entities|
It is interesting to note that none of the efforts above uses the Wikipedia category structure “as is” for its schema.
The surface view of Wikipedia is topic articles placed into one or more categories. Some of these pages also include structured data tables (or templates) for the kind of thing the article is; these are called infoboxes. An infobox is a fixed-format table placed at the top right of articles to consistently present a summary of some unifying aspect that the articles share. For example, see the listing for my home town, Iowa City, which has a city infobox.
However, this cursory look at Wikipedia in fact masks much additional and valuable structure. Some early researchers noted this . The recognition of structure has also been a key driver for the interest in Wikipedia as a knowledge base (in addition to its global content scope). The following table is a fairly complete listing of structure possibilities within Wikipedia (see Endnotes for any notes):
|Wikipedia Structure||Potential Applications||Note|
|knowledge base; graph structure; corpus for n-grams, other constructions|||
|category suggestion; semantic relatedness; query expansion; potential parent category|
|semantically-related terms (siblings)|
|hyponymic and meronymic relations between terms|
|semantically-related terms (siblings)|
|synonyms; key-value pairs|
|units of measure; fact extraction|||
|category suggestion; entity suggestion|
|coordinates; places; geolocational; (may also appear in full article text)|
|exclusion candidates; other structural analysis; examples include Stub, Message Boxes, Multiple Issues|||
|complete discussion; related terms; context; translations; NLP analysis basis; relationships; sentiment|
|synonymy; spelling variations, misspellings; abbreviations; query expansion|
|named entities; domain specific terms or senses|
|category suggestion (phrase marked in bold in first paragraph)|
|category suggestion; semantic relatedness|||
|related concepts; query expansion|||
|related concepts; external harvest points|
|related terms; co-occurrences|
|synonyms; spelling variations; related terms; query expansion|
|link graph; related terms|
|category suggestion; functional metadata|
|category suggestion; functional metadata|
|external harvest points||[9,10]|
|thumbnails; image recognition for disambiguation; controversy (edit/upload frequency)|||
|related concepts; related terms; functional metadata|||
Redux for Article Structure
|see Articles for uses|
|topicalness; controversy (diversity of editors, reversions)|
|instances; named entity candidates|
|Alternate Language Versions|
Redux for All Structures
|see all items above; translation; multilingual alignment; entity disambiguation|||
The potential for Wikipedia to provide structural understandings is evident from this table. However, it should be noted that, aside from some stray research initiatives, most effort to date has focused on the major initiatives noted earlier or from analyzing linking and infoboxes. There is much additional research that could be powered by the Wikipedia structure as it presently exists.
From the standpoint of the broader semantic Web, the potential of Wikipedia in the areas of metadata enhancement and mapping to multiple human languages  are particularly strong. We are only now at the very beginning phases of tapping this potential.
The three main weaknesses with Wikipedia are its category structure , inconsistencies and incompleteness. The first weakness means Wikipedia is not a suitable organizational basis for the semantic Web; the next two weaknesses, due to the nature of Wikipedia’s user-generated content, are constantly improving.
Our recent effort to map between UMBEL and Wikipedia, undertaken as part of the recent UMBEL v 1.00 release, spent considerable time analyzing the Wikipedia category structure . Of the roughly half million categories in Wikipedia, only about 85,000 were found to be suitable candidates to participate in an actual schema structure. Further breakdowns are shown by this table resulting from our analysis:
|Wikipedia Category Breakdowns|
|Functional (not schema)||61.8%|
|Fn Listings, related||0.8%|
Fully 1/5 of the categories are administrative or internal in nature. The large majority of categories are, in fact, not structural at all, but what we term functional categories, which means the category contains faceting information (such as subclassifying musicians into British musicians) . Functional categories can be a rich source of supplementary metadata for its assigned articles — though, no one has yet processed Wikipedia in this manner — but are not a useful basis for structural conceptual relationships or inferencing.
This weakness in the Wikipedia category system has been known for some time , but researchers and others still attempt to do mappings on mostly uncleaned categories. Though most researchers recognize and remove internal or administrative categories in their efforts, using the indiscriminate remainder of categories still leads to poor precision in resulting mappings. In fact, in comparison to one of the more rigorous assessments to date , our analysis still showed a 6.8% error rate in hand inspected categories.
Other notable category problems include circular references, skipped intermediate categories, misassigned categories and incomplete assignments.
Nonetheless, Wikipedia categories do have a valuable use in the analysis of local relationships (one degree of relatedness) and for finding missing category candidates. And, as noted, the functional categories are also a rich and untapped source of additional article metadata.
Like any knowledge base, Wikipedia also has inconsistent and incomplete coverage of topics . However, as more communities accept Wikipedia as a central resource deserving completeness, we should see these gaps continue to get filled.
One of the first database versions of Wikipedia built for semantic Web purposes is DBpedia. DBpedia has an incipient ontology useful for some classification purposes. Its major structural organization is built around the Wikipedia infoboxes, which are applied to about a third of Wikipedia articles. DBpedia also has multiple language versions.
DBpedia is a core hub of Linked Open Data (LOD), which now has about 300 linked datasets; has canonical URIs used by many other applications; has extracted versions and tools very useful for further processing; and has recently moved to incorporate live updates from the source Wikipedia . For these reasons, the DBpedia version of Wikipedia is the suggested implementation version.
The third suggested gold standard for the Semantic Web is WordNet, a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. There are over 50 languages covered by wordnet approaches, most mapped to this English WordNet .
Though it has been used in many ontologies , WordNet is most often mapped for its natural language purposes and not used as a structure of conceptual relationships per se. This is because it is designed for words and not concepts. It contains hundreds of basic semantic inconsistencies and also lacks much domain applicability. Entities, of course, are also lacking. In those cases where WordNet has been embraced as a schema basis, much work is generally expended to transform it into an ontology suitable for knowledge representation.
Nonetheless, for word sense disambiguation and other natural language processing tasks, as well as for aiding multi-lingual mappings, WordNet and its various other language variants is a language reference gold standard.
So, with these prior gold standards we gain a basic language and grammar; a base (canonical) vocabulary and some structure guidance; and a reference means for processing and extracting information from input text. Yet two needed standards remain.
One needed standard is a conceptual organizing structure (or schema) by which the canonical vocabulary of concepts and instances can be related. This core structure should be constructed in a coherent  manner and expressly designed to support inferencing and (some) reasoning. This core structure should be sufficiently large to embrace the scope of the semantic Web, but not so detailed as to make it computationally inefficient. Thus, the core structure should be a framework that allows more focused and purposeful vocabularies to be “plugged in”, depending on the domain and task at hand. Unfortunately, the candidate category structures from our other gold standards in Wikipedia and WordNet do not meet these criteria.
A second needed standard is a bit of additional vocabulary “glue” specifically designed for the purposes of the semantic Web and ontology and domain incorporation. We have multiple and disparate world views and contexts, as well as the things described by them . To get them to interoperate — and to acknowledge differences in alignment or context — we need a set of relational predicates (vocabulary) that can capture a range of mappings from the exact to the approximate . Unlike other reference vocabularies that attempt to capture canonical definitions within defined domains, this vocabulary is expressly required by the semantic Web and its goal to federate different data and schema.
UMBEL has been expressly designed to address both of these two main needs . UMBEL is a coherent categorization structure for the semantic Web and a mapping vocabulary designed for dataset and conceptual interoperability. UMBEL’s 28,000 reference concepts (
RefConcepts) are based on the Cyc knowledge base , which itself is expressly designed as a common sense representation of the world with express variations in context supported via its 1000 or so microtheories. Cyc, and UMBEL upon which it is based, are by no means the “correct” or “only” representations of the world, but they are coherent ones and thus internally consistent.
UMBEL’s role to allow datasets to be “plugged in” and related through some fixed referents was expressed by this early diagram :
The idea — which is still central to this kind of reference structure — is that a set of reference concepts can be used by multiple datasets to connect and then inter-relate. These are shown by the nested subjects (concepts) in the umbrella structure.
UMBEL, of course, is not the only coherent structure for such interoperability purposes. Other major vocabularies (such as LCSH; see below) or upper-level ontologies (such as SUMO, DOLCE, BFO or PROTON, etc.) can fulfill portions of these roles, as well. In fact, the ultimate desire is for multiple reference structures to emerge that are mapped to one another, similar to how human languages can inter-relate. Yet, even in that desired vision, there is still a need for a bootstrapped grounding. UMBEL is the first such structure expressly designed for the two needed standards.
UMBEL is already based on the central semantic Web languages of RDF, RDFS, SKOS, and OWL 2. The recent version 1.00 now maps 60% of UMBEL to Wikipedia, with efforts for the remaining in process. UMBEL provides mappings to WordNet, via its Cyc relationships. More of this is in process and will be exposed. And the mappings between UMBEL and GeoNames  for locational purposes is also nearly complete.
Each of these reference structures — RDF/OWL, Wikipedia, WordNet, UMBEL — is itself coherent and recognized or used by multiple parties for potential reference purposes on the semantic Web. The advocacy of them as standards is hardly radical.
However, the gold lies in the combination of these components. It is in this combination that we can see a grounded knowledge base emerge that is sufficient for bootstrapping the semantic Web.
The challenge in creating this reference knowledge base is in the mapping between the components. Fortunately, all of the components are already available in RDF/OWL. WordNet already has significant mappings to Wikipedia and UMBEL. And 60% of UMBEL is already mapped to Wikipedia. The remaining steps for completing these mappings are very near at hand. Other vocabularies, such as GeoNames , would also beneficially contribute to such a reference base.
Yet to truly achieve a role as a gold standard, these mappings should be fully vetted and accurate. Automated techniques that embed errors are unacceptable. Gold standards should not themselves be a source for propagation of errors. Like dictionaries or thesauri, we need reference structures that are quality and deserving of reference. We need canonical structures and canonical vocabularies.
But, once done, these gold standards themselves become reference sources that can aid automatic and semi-automatic mappings of other vocabularies and structures. Thus, the real payoff is not that these gold standards themselves get actually embedded in specific domain uses or whatever, but that they can act as reference referees for helping align and ground other structures.
Like the bootstrap condition, more and more reference structures may be brought into this system. A reference structure does not mean reliance; it need not even have more than minimal use. As new structures and vocabularies are brought into the mix, appropriate to specific domains or purposes, reference to other grounding structures will enable the structures and vocabularies to continue to expand. So, not only are reference concepts necessary for grounding the semantic Web, but we also need to pick good mapping predicates for properly linking these structures together.
In this manner, many alternative vocabularies can be bootstrapped and mapped and then used as the dominant vocabularies for specific purposes. For example, at the level of general knowledge categorization, vocabularies such as LCSH, the Dewey Decimal Classification, UDC, etc., can be preferentially chosen. Other specific vocabularies are at the ready, with many already used for domain purposes. Once grounded, these various vocabularies can also interoperate.
Grounding in gold standards enables the freedom to switch vocabularies at will. Establishing fixed reference points via such gold standards will power a virtuous circle of more vocabularies, more mappings, and, ultimately, functional interoperability no matter the need, domain or world view.
Since the first days of the Web there has been an ideal that its content could extend beyond documents and become a global, interoperating storehouse of data. This ideal has become what is known as the “semantic Web“. And within this ideal there has been a tension between two competing world views of how to achieve this vision. At the risk of being simplistic, we can describe these world views as informal v formal, sometimes expressed as “bottom up” v “top down” [1,2].
The informal view emphasizes freeform and diversity, using more open tagging and a bottoms-up approach to structuring data . This group is not anarchic, but it does support the idea of open data, open standards and open contributions. This group tends to be oriented to RDF and is (paradoxically) often not very open to non-RDF structured data forms (as, for example, microdata or microformats). Social networks and linked data are quite central to this group. RDFa, tagging, user-generated content and folksonomies are also key emphases and contributions.
The formal view tends to support more strongly the idea of shared vocabularies with more formalized semantics and design. This group uses and contributes to open standards, but is also open to proprietary data and structures. Enterprises and industry groups with standard controlled vocabularies and interchange languages (often XML-based) more typically reside in this group. OWL and rules languages are more often typically the basis for this group’s formalisms. The formal view also tends to split further into two camps: one that is more top down and engineering oriented, with typically a more closed world approach to schema and ontology development ; and a second that is more adaptive and incremental and relies on an open world approach .
Again, at the risk of being simplistic, the informal group tends to view many OWL and structured vocabularies, especially those that are large or complex, as over engineered, constraining or limiting freedom. This group often correctly points to the delays and lack of adoption associated with more formal efforts. The informal group rarely speaks of ontologies, preferring to use the term of vocabularies. In contrast, the formal group tends to view bottoms-up efforts as chaotic, poorly structured and too heterogeneous to allow machine reasoning or interoperability. Some in the formal group sometimes advocate certification or prescribed training programs for ontologists.
Readers of this blog and customers of Structured Dynamics know that we more often focus on the formal world view and more specifically from an open world perspective. But, like human tribes or different cultures, there is no one true or correct way. Peaceful coexistence resides in the understanding of the importance and strength of different world views.
Shared communication is the way in which we, as humans, learn to understand and bridge cultural and tribal differences. These very same bases can be used to bridge the differences of world views for the semantic Web. Shared concepts and a way to communicate them (via a common language) — what I call reference structures  — are one potential “sweet spot” for bridging these views of the semantic Web .
According to Merriam Webster and Wikipedia, a reference is the intentional use of one thing, a point of reference or reference state, to indicate something else. When reference is intended, what the reference points to is called the referent. References are indicated by sounds (like onomatopoeia), pictures (like roadsigns), text (like bibliographies), indexes (by number) and objects (a wedding ring), but many other methods can be used intentionally as references. In language and libraries, references may include dictionaries, thesauri and encyclopedias. In computer science, references may include pointers, addresses or linked lists. In semantics, reference is generally construed as the relationships between nouns or pronouns and objects that are named by them.
Structures, or syntax, enable multiple referents to be combined into more complex and meaningful (interpretable) systems. Vocabularies refer to the set of tokens or words available to act as referents in these structures. Controlled vocabularies attempt to limit and precisely define these tokens as a means of reducing ambiguity and error. Larger vocabularies increase richness and nuance of meaning for the tokens. Combined, syntax, grammar and vocabularies are the building blocks for constructing understandable human languages.
Many researchers believe that language is an inherent capability of humans, especially including children. Language acquisition is expressly understood to be the combined acquisition of syntax, vocabulary and phonetics (for spoken language). Language development occurs via use and repetition, in a social setting where errors are corrected and communication is a constant. Via communication and interaction we learn and discover nuance and differences, and acquire more complex understandings of syntax structures and vocabulary. The contact sport of communication is itself a prime source for acquiring the ability to communicate. Without the structure (syntax) and vocabulary acquired through this process, our language utterances are mere babblings.
Pidgin languages emerge when two parties try to communicate, but do not share the same language. Pidgin languages result in much simplified vocabularies and structure, which lead to frequent miscommunication. Small vocabularies and limited structure share many of these same limitations.
Information theory going back to Shannon defined that the “fundamental problem of communication is that of reproducing at one point, either exactly or approximately, a message selected at another point” . This assertion applies to all forms of communication, from the electronic to animal and human language and speech.
Every living language is undergoing constant growth and change. Current events and culture are one driver of new vocabulary and constructs. We all know the apocryphal observation that northern peoples have many more words for snow, for example. Jargon emerges because specific activities, professions, groups, or events (including technical change) often have their own ideas to communicate. Slang is local or cultural usage that provides context and communication, often outside of “formal” or accepted vocabularies. These sources of environmental and other changes cause living languages to be constantly changing in terms of vocabulary and (also, sometimes) structure.
Natural languages become rich in meaning and names for entities to describe and discern things, from plants to people. When richness is embedded in structure, contexts can emerge that greatly aid removing ambiguity (“disambiguating”). Contexts enable us to discern polysemous concepts (such as bank for river, money institution or pool shot) or similarly named entities (such as whether Jimmy Johnson is a race car driver, football coach, or a local plumber). As with vocabulary growth, contexts sometimes change in meaning and interpretation over time. It is likely the Gay ’90s would not be used again to describe a cultural decade (1890s) in American history.
All this affirms what all of us know about human languages: they are dynamic and changing. Adaptable (living) languages require an openness to changing vocabulary and changing structure. The most dynamic languages also tend to be the most open to the coining of new terminology; English, for example, is estimated to have 25,000 new words coined each year .
One could argue that similar constructs must be present within the semantic Web to enable either machine or human understanding. At first blush this may sound a bit surprising: Isn’t one premise of the semantic Web machine-to-machine communications with “artificial intelligence” acting on our behalf in the background? Well, hmmm, OK, let’s probe that thought.
Recall there are different visions about what constitutes the semantic Web. In the most machine-oriented version, the machines are posited to replace some of what we already do and anticipate what we already want. Like Watson on Jeopardy, machines still need to know that Toronto is not an American city . So, even with its most extreme interpretation — and one that is more extreme than my own view of the near-term semantic Web — machine-based communication still has these imperatives:
These points suggest that machine languages, even in the most extreme machine-to-machine sense, still need to have a considerable capability akin to human languages. Of course, computer programming languages and data exchange languages as artificial languages need not read like a novel. In fact, most artificial languages have more constraints and structure limitations than human languages. They need to be read by machines with fixed instruction sets (that is, they tend to have fewer exceptions and heuristics).
But, even with software or data, people write and interact with these languages, and human readability is a key desirable aspect for modern artificial languages . Further, there are some parts of software or data that also get expressed as labels in user interfaces or for other human factors. The admonition to Web page developers to “view source” is a frequent one. Any communication that is text based — as are all HTTP communications on the Web, including the semantic Web — has this readability component.
Though the form (structure) and vocabulary (tokens) of languages geared to machine use and understanding most certainly differ from that used by humans, that does not mean that the imperatives for reference and structure are excused. It seems evident that small vocabularies, differing vocabularies and small and incompatible structures have the same limiting effect on communications within the semantic Web as they do for human languages.
Yet, that being said, correcting today’s relative absence of reference and structure on the nascent semantic Web should not then mean an overreaction to a solution based on a single global structure. This is a false choice and a false dichotomy, belied by the continued diversity of human languages . In fact, the best analog for an effective semantic Web might be human languages with their vocabularies, references and structures. Here is where we may find the clues for how we might improve the communications (interoperability) of the semantic Web.
Freeform tagging and informal approaches are quick and adaptive. But, they lack context, coherence and a basis for interoperability. Highly engineered ontologies capture nuance and sophistication. But, they are difficult and expensive to create, lack adoption and can prove brittle. Neither of these polar opposites is “correct” and each has its uses and importance. Strident advocacy of either extreme alone is shortsighted and unsuited to today’s realities. There is not an ineluctable choice between freedom and formalism.
An inherently open and changing world with massive growth of information volumes demands a third way. Reference structures and vocabularies sufficient to guide (but not constrain) coherent communications are needed. Structure and vocabulary in an open and adaptable language can provide the communication medium. Depending on task, this language can be informal (RDF or data struct forms convertible to RDF) or formal (OWL). The connecting glue is provided by the reference vocabularies and structures that bound that adaptable language. This is the missing “sweet spot” for the semantic Web.
Just like human languages, these reference structures must be adaptable ones that can accommodate new learning, new ideas and new terminology. Yet, they must also have sufficient internal consistency and structure to enable their role as referents. And, they need to have a richness of vocabulary (with defined references) sufficient to capture the domain at hand. Otherwise, we end up with pidgin communications.
We can thus see a pattern emerging where informal approaches are used for tagging and simple datasets; more formal approaches are used for bounded domains and the need for precise semantics; and reference structures are used when we want to get multiple, disparate sources to communicate and interoperate. So long as these reference structures are coherent and designed for vocabulary expansion and accommodation for synonyms and other means for terminology mapping, they can adapt to changing knowledge and demands.
For too long there has been a misunderstanding and mischaracterization of anything that smacks of structure and referenceability as an attempt to limit diversity, impose control, or suggest some form of “One Ring to rule them all” organization of the semantic Web. Maybe that was true of other suggestions in the past, but it is far from the enabling role of reference structures advocated herein. This reaction to structure has something of the feeling of school children adverse to their writing lessons taking over the classroom and then saying No! to more lessons. Rather than Lord of the Rings we get Lord of the Flies.
To try to overcome this misunderstanding — and to embrace the idea of language and communication for the semantic Web — I and others have tried in the past to find various analogies or imagery to describe the roles of these reference structures. (Again, all of those vagaries of human language and communication!). Analogies for these reference structures have included :
What this post has argued is the analogy of reference structures to human language and communication. In this role, reference structures should be seen as facilitating and enabling. This is hardly a vision of constraints and control. The ability to articulate positions and ideas in fact leads to more diversity and freedom, not less.
To be sure, there is extra work in using and applying reference structures. Every child comes to know there is work in learning languages and becoming articulate in them. But, as adults, we also come to learn from experience the frustration that individuals with speech or learning impairments have when trying to communicate. Knowing these things, why do we not see the same imperatives for the semantic Web? We can only get beyond incoherent babblings by making the commitment to learn and master rich languages grounded in appropriate reference structures. We are not compelled to be inchoate; nor are our machines.
Yet, because of this extra work, it is also important that we develop and put in place semi-automatic  ways to tag and provide linkages to such reference structures. We have the tools and information extraction techniques available that will allow us to reference and add structure to our content in quick and easy ways. Now is the time to get on with it, and stop babbling about how structure and reference vocabularies may limit our freedoms.
Structured Dynamics and Ontotext are pleased to announce — after four years of iterative refinement — the release of version 1.00 of UMBEL (Upper Mapping and Binding Exchange Layer). This version is the first production-grade release of the system. UMBEL’s current implementation is the result of much practical experience.
UMBEL is primarily a reference ontology, which contains 28,000 concepts (classes and relationships) derived from the Cyc knowledge base. The reference concepts of UMBEL are mapped to Wikipedia, DBpedia ontology classes, GeoNames and PROTON.
UMBEL is designed to facilitate the organization, linkage and presentation of heterogeneous datasets and information. It is meant to lower the time, effort and complexity of developing, maintaining and using ontologies, and aligning them to other content.
This release 1.00 builds on the prior five major changes in UMBEL v. 0.80 announced last November. It is open source, provided under the Creative Commons Attribution 3.0 license.
In broad terms, here is what is included in the new version 1.00:
UMBEL’s basic vocabulary can also be used for constructing specific domain ontologies that can easily interoperate with other systems. This release sees a number of changes in the UMBEL vocabulary:
correspondsTopredicate has been added for nearly or approximate
sameAsmappings (symmetric, transitive, reflexive)
relatesToXXXpredicates have been added to relate external entities or concepts to UMBEL SuperTypes
The UMBEL Vocabulary defines three classes:
The UMBEL Vocabulary defines these properties:
relatesToXXX(31 variants related to SuperTypes)
The UMBEL vocabulary also has a significant reliance on SKOS, among other external vocabularies.
Here are links to various downloads, specifications, communities and assistance.
All documentation from the prior v 0.80 has been updated, and some new documentation has been added:
Major updates were made to the specifications and Annex G; Annex H is new. Minor changes were also made to Annexes A and B. All remaining Annexes only had minor header changes. All spec documents with minor or major changes were also versioned, with the earlier archives now date stamped.
All UMBEL files are listed on the Downloads and SVN page on the UMBEL Web site. The reference concept and mapping files may also be obtained from http://code.google.com/p/umbel/source/browse/#svn/trunk.
To learn more about UMBEL or to participate, here are some additional links:
These latest improvements to UMBEL and its mappings have been undertaken by Structured Dynamics and Ontotext. Support has also been provided by the European Union research project RENDER, which aims to develop diversity-aware methods in the ways Web information is selected, ranked, aggregated, presented and used.
This release continues the path to establish a gold standard between UMBEL and Wikipedia to guide other ontological, semantic Web and disambiguation needs. For example, the number of UMBEL reference concepts was expanded by some 36% from 20,512 to 27,917 in order to provide a more balanced superstructure for organizing Wikipedia content. And across all mappings, 60% of all UMBEL reference concepts (or 16,884) are now linked directly to Wikipedia via the new
umbel:correspondsTo property. A later post will describe the design and importance of this gold standard in greater detail.
Next releases will expand this linkage and coverage, and bring in other important reference structures such as GeoNames and others. This version of UMBEL will also be incorporated into the next version of FactForge. We will also be re-invigorating the Web vocabulary access and Web services, and adding tagging services based on UMBEL.
We invite other players with an interest in reusable and broadly applicable vocabularies and reference concepts to join with us in these efforts.
In the semantic Web, arguably SKOS is the right vocabulary for representing simple knowledge structures  and OWL 2 is the right language for asserting axioms and ontological relationships. In the early days we chose a reliance on SKOS for the UMBEL reference concept ontology, because of UMBEL’s natural role as a knowledge structure. Most recently we also migrated UMBEL to OWL 2 to gain (among other reasons) the metamodeling advantages of “punning”, which allows us to treat things as either classes or instances depending on modeling needs, context and viewpoint .
But — until today — we could not get SKOS and OWL 2 to play together nicely. This meant we could not take full advantage of each language’s respective strengths.
Happily, that gap has now been closed. By action of a relatively minor change by the SKOS Work Group (WG) and the addition of a simple statement, SKOS more fully interoperates with OWL 2.
SKOS was and is purposefully designed to be simple and to stick to its knowledge system scope. The authors of the language avoided many restrictions and kept axioms regarding the language to a minimum. As a result, since its inception, in relation to OWL, the core SKOS has been expressed as OWL Full (that is, undecidable and more free-form in interpretation).
However, also for some time it has been understood that, with some relatively minor changes, the core SKOS language could be modified to be OWL DL (decidable). In fact, since June 2009 there has been a DL version of SKOS, called by the editors the “DL prune” . A useful discussion of the specific axioms that cause the core SKOS to require interpretation as OWL Full is provided by the Semantic Web and Interoperability Group at the University of Patras .
Most of the conditions in question relate to the various annotation properties in SKOS, such as
skos:hiddenLabel. These are core constructs within the use of “semsets” to describe and refer to reference concepts in UMBEL . Among others, a simple explanation for one issue is that these labels within SKOS are defined as sub-properties of
rdfs:label, which is not allowable in OWL 2. Where such conflicts exist, they are removed (“pruned”) from the SKOS DL version. There are only a minor few of these, and not (in our view) central to the purpose or usefulness of SKOS.
For UMBEL, and we think other vocabularies, the transition to OWL 2 is important for many reasons, including the “punning” of individuals and classes. Other reasons include better handling of annotation properties, a better emerging set of tools, and the use of inference and reasoning engines.
Thus, to square this circle when using SKOS in OWL 2, it is important to refer to the DL version of SKOS and not SKOS core. However, since there was no version acknowledgment to the core in the SKOS DL namespace, it was not possible to make this reference while retaining general SKOS namespace (
My UMBEL co-editor, Fred Giasson, first brought this issue to the SKOS WG’s attention in November . He further suggested a relatively simple means to resolve the problem. By including a version reference in the SKOS DL version to SKOS core, consistent with the current OWL 2 specification , OWL 2-compliant tools will now know how to discern the proper references. This fix looks as follows:
<rdf:Description rdf:about="http://www.w3.org/2004/02/skos/core"> <owl:versionIRI rdf:resource="http://www.w3.org/TR/skos-reference/skos-owl1-dl.rdf"/> </rdf:Description>
Now, in our ontology (UMBEL), we define the skos namespace (N3 format):
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
and, then, import the SKOS DL version:
<http://umbel.org/umbel> rdf:type owl:Ontology ; <-- statements --> ; owl:imports <http://www.w3.org/TR/skos-reference/skos-owl1-dl.rdf> .
Pretty simple and straightforward.
With the expeditious treatment of the SKOS group , this proposal was floated and commented upon and then passed and adopted today. The change was also posted as an erratum to the SKOS specification . As Fred reported , his testing of the fix with available tools (such as Protégé 4.1 and reasoners) also appears to be working properly (on a 58 MB file with 28 K concepts and all of their annotations and relationships!).
This is a good example of how even simple changes may make major differences. It also shows how even simple changes may take some time and effort in a standards-making environment. But, even though simple, we think this will be highly useful to keeping SKOS a central vocabulary within OWL 2-based systems. Within the context of conceptual and domain vocabularies, such as represented by UMBEL or other ontologies based on the UMBEL vocabulary or similar approaches, we see this as a major win.
Finally, as for UMBEL itself, this change has allowed us to remove duplicate predicates in favor of those from SKOS. It has also caused a delay in our release of UMBEL v. 1.00, planned for today, until February 15 to complete testing and documentation revisions. But we’ll gladly take a change like this in trade for a minor delay any day.
I have been writing and speaking of late about next priorities to promote the interoperability of linked data and the semantic Web. In a talk a few weeks back to the Dublin Core (DCMI) annual conference, I summarized these priorities as the need to address two aspects of the semantic “gap”:
Interoperability comes down to the nature of things and how we describe those things or quite similar things from different sources. Given the robust nature of semantic heterogeneities in diverse sources and datasets on the Web (or anywhere else, for that matter!) , how do we bring similar or related things into alignment? And, then, how can we describe the nature or basis of that alignment?
Of course, classifiers since Aristotle and librarians for time immemorial have been putting forward various classification schemes, controlled vocabularies and subject headings. When one wants to find related books, it is convenient to go to a central location where books about the same or related topics are clustered. And, if the book can be categorized in more than one way — as all are — then something like a card catalog is helpful to find additional cross-references. Every domain of human endeavor makes similar attempts to categorize things.
On the Web we have none of the limitations of physical books and physical libraries; locations are virtual and copies can be replicated or split apart endlessly because of the essentially zero cost of another electron. But, we still need to find things and we still want to gather related things together. According to Svenonius, “Organizing information if it means nothing else means bringing all the same information together” . This sentiment and need remains unchanged whether we are talking about books, Web documents, chemical elements or linked data on the Web.
Like words or terms in human language that help us communicate about things, how we organize things on the Web needs to have an understood and definable meaning, hopefully bounded with some degree of precision, that enables us to have some confidence we are really communicating about the same something with one another. However, when applied to the Web and machine communications, the demands for how these definitions and precisions apply need to change. This makes the notion of a Web basis for organization both easier and harder than traditional approaches to classification.
It is easier because everything is virtual: we can apply multiple classification schema and can change those schema at will. We are not locked into historical anomalies like huge subject areas reserved for arcane or now historically less important topics, such as the Boer Wars or phrenology. We need not move physical books around on shelves in order to accommodate new or expanded classification schemes. We can add new branches to our classification of, say, nanotechnology as rapidly as the science advances.
Yet it is harder because we can no longer rely on the understanding of human language as a basis for naming and classifying things. Actually, of course, language has always been ambiguous, but it is manifestly more so when put through the grinder of machine processing and understanding. Machine processing of related information adds the new hurdles of no longer being able to rely on text labels (“names”) alone as the identifier of things and requires we be more explicit about our concept relationships and connections. Fortunately, here, too, much has been done in helping to organize human language through such lexical frameworks as WordNet and similar.
Many groups and individuals have been grappling with these questions of how to organize and describe information to aid interoperability in an Internet context. Among many, let me simply mention two because of the diversity their approaches show.
Bernard Vatant, for one, has with his colleagues been an advocate for some time for the need for what he calls “hubjects.” With an intellectual legacy from the Topic Maps community, the idea of “hubjects” is to have a flat space of reference subjects to which related information can link and refer. Each hubject is the hub of a spoked wheel of representations by which the same subject matter from different contexts may be linked. The idea of the flat space or neutrality in the system is to place the “hubject” identifier (referent) outside of other systems that attempt to organize and provide “meta-frameworks” of knowledge organization. In other words, there are no inherent suggested relationships in the reference “hubject” structure: just a large bin of defined subjects to which external systems may link.
A different and more formalized approach has been put forward by the FRSAD working group , dealing with subject authority data. Subject authority data is the type of classificatory information that deals with the subjects of various works, such as their concepts, objects, events, or places. As the group stated, the scope of this effort pertains to the “aboutness” of various conceptual works. The framework for this effort, as with the broader FRBR effort, are new standards and approaches appropriate to classifying electronic bibliographic records.
Besides one of the better summaries and introductions to the general problems of subject classification in general, the FRSAD approach makes its main contribution in clearly distinguishing the idea of something (which it calls a thema, or entity used as the subject of a work) from the name or label of something (which it calls nomen). For many in the logic community, steeped in the Peirce triad of sign-object-interpretant , this distinction seems rather obvious and straightforward. But, in library science, labels have been used interchangeably as identifiers, and making this distinction clean is a real contribution. The FRSAD effort does not itself really address how the thema are actually found or organized.
The notion of a reference concept used herein combines elements from both of these approaches. A reference concept is the idea of something, or a thema in the FRSAD sense. It is also a reference hub of sorts, similar to the idea of a “hubject”. But it is also much more and more fully defined.
So, let’s first begin by representing a reference concept in relation to its referers and possible linking predicates as follows:
A referer needs to link appropriately to its reference concept, with some illustrative examples shown on the arrows in the diagram. These links are the predicates, ranging from the exact to the approximate, discussed in the first semantic “gap” posting. (Note: see that earlier post for a longer listing of existing, candidate linking predicates. No further comment is made in this present article as to whether those in that earlier posting or the example ones above are “correct” or not; see the first post for that discussion.)
If properly constructed and used, a reference concept thus becomes a fixed point in an information space. As one or more external sources link to these fixed points, it is then possible to gather similar content together and to begin to organize the information space, in the sense of Svenonius. Further, and this is a key difference from the “hubject” approach, if the reference concept is itself part of a coherent structure, then additional value can be derived from these assignments, such as inference, consistence testing, and alignments. (More on this latter point is discussed below.)
If the right factors are present, it should be possible to relate and interoperate multiple datasets and knowledge representations. If present, these factors can result in a series of fixed reference points to which external information can be linked. In turn, these reference nodes can form constellations to guide the traversal to desired information destinations on the Web.
Let’s look at the seven factors as to what constitutes guidelines for best practices.
By definition, a Web-based reference concept should adhere to linked data principles and should have a URI as its address and identifier. Also, by definition as a “reference”, the vocabulary or ontology in which the concept is a member should be given a permanent and persistent address. Steps should be taken to ensure 24×7 access to the reference concept’s URI, since external sources will be depending on it.
As a general rule, the concepts should also be stated as single nouns and use CamelCase notation (that is, class names should start with a capital letter and not contain any spaces, such as MyNewConcept).
Provide a preferred label annotation property that is used for human readable purposes and in user interfaces. For this purpose, a construct such as the SKOS property of skos:prefLabel works well. Note, this label is not the basis for deciding and making linkages, but it is essential for mouseovers, tooltips, interface labels, and other human use factors.
Give all concepts and properties a definition. The matching and alignment of things is done on the basis of concepts (not simply labels), which means each concept must be defined . Providing clear definitions (along with the coherency of its structure) gives an ontology its semantics. Remember not to confuse the label for a concept with its meaning. For this purpose, a property such as skos:definition works well, though others such as rdfs:comment or dc:description are also commonly used.
The definition is the most critical guideline for setting the concept’s meaning. Adequate text and content also aid semantic alignment or matching tasks.
Include explicit consideration for the idea of a “semset” or “tagset”, which means a series of alternate labels and terms to describe the concept. These alternatives include true synonyms, but may also be more expansive and include jargon, slang, acronyms or alternative terms that usage suggests refers to the same concept. The semset construct is similar to the “synsets” in Wordnet, but with a broader use understanding. Included in the semset construct is the single (per language) preferred (human-readable) label for the concept, the prefLabel, an embracing listing of alternative phrase and terms for the concept (including acronyms, synonyms, and matching jargon), the altLabels, and a listing of prominent or common misspellings for the concept or its alternatives, the hiddenLabels.
This tagset is an essential basis for tagging unstructured text documents with reference concepts, and for search not limited to keywords. The tagset, in combination with the definition, is also the basis for feeding many NLP-driven methods for concept or ontology alignment.
The practice of using an identifier separate from label, and language qualified entries for definition, preferred label and tagset (alternative labels) means that multi-lingual versions can be prepared for each concept. Though this is a somewhat complicated best practice in its own right (for example, being attentive to the xml:lang=”en” tag for English), adhering to this practice provides language independence for reference concepts.
Sources such as Wikipedia, with its richness of concepts and multiple language versions, can then be a basis for creation of alternative language versions.
Use of domains and ranges assists testing, helps in disambiguation, and helps in external concept alignments. Domains apply to the subject (the left hand side of a triple); ranges to the object (the right hand side of the triple). Domains and ranges should not be understood as actual constraints, but as axioms to be used by reasoners. In general, domain for a property is the range for its inverse and the range for a property is the domain of its inverse.
When reference concepts, properly constructed as above, are also themselves part of a coherent structure, further benefits may be gained. These benefits include inferencing, consistency testing, discovery and navigation. For example, the sample at right shows that a retrieval for Saab cars can also inform that these are automobiles, a brand of automobile, and a Swedish kind of car.
To gain these advantages, the coherent structure need not be complicated. RDFS and SKOS-based lightweight vocabularies can meet this test. Properly constructed OWL ontologies can also provide these benefits.
When best practices are combined with being part of a coherent structure, we can refer to these structures as reference ontologies or domain ontologies.
In part, these best practices are met to a greater or lesser extent by many current vocabularies. But few provide complete coverage, and across a broad swath of domain needs, major gaps remain. This unfortunate observation applies to upper-level ontologies, reference vocabularies, and domain ontologies alike.
Upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc and BFO (Basic Formal Ontology). While these have a coherency of construction, they are most often incomplete with respect to reference concept construction. With the exception of SUMO and Cyc, domain coverage is also very general.
Our own UMBEL reference ontology  is closest to meeting all criteria. The reference concepts are constructed to standard. But coverage is fairly general, and not directly applicable to most domains (though it can help to orient specific vocabularies).
Wikipedia, as accessed via the DBpedia expression, has good persistent URIs, labels, altLabels and proxy definitions (via the first sentences abstract). As a repository of reference concepts, it is extremely rich. But the organizational structure is weak and provides very few of the benefits for coherent structures noted above.
Going back to 1997, DCMI has been involved in putting forward possible vocabularies that may act as “qualifiers” to dc:subject . Such reference vocabularies can extend from the global or comprehensive, such as the Universal Decimal Classification or Library of Congress Subject Headings, to the domain specific such as MeSH in medicine or Agrovoc in agriculture . One or more concepts in such reference vocabularies can be the object of a dc:subject assertion, for example. While these vocabularies are also a rich source of reference concepts, they are not constructed to standards and at most provide hierarchical structures.
In the area of domain vocabularies, we are seeing some good pockets of practice, especially in the biomedical and life sciences arena . Promising initiatives are also underway in library applications  and perhaps other areas unknown to the author.
In summary, I take the state of the art to be quite promising. We know what to do, and it is being done in some pockets. What is needed now is to more broadly operationalize these practices and to extend them across more domains. If we can bring attention to and publicize exemplar vocabularies, we can start to realize the benefits of actual data interoperability on the Web.